Polish Sign Language Corpus already the largest in the world

The Polish Sign Language (PSL) Corpus, which has been under development at the University of Warsaw for three years, contains more than 300 hours of recordings and 200 thousand described signs. It is currently the largest collection of sign language data in the world.

A language corpus is a collection of textual data available in electronic form, which serves as material for the study of language. Scientists working on the PSL Corpus first record signed conversations and statements by deaf people, and then analyse the recordings and describe the signs and grammatical structures used.

"So far, we have recorded almost 80 deaf PSL users. In total, that is more than 300 hours of recordings. The collection of identified and classified components numbers around 200 thousand. In this respect, we now have the largest sign language corpus in the world" - project leader Dr. Paweł Rutkowski from the Sign Linguistics Laboratory, University of Warsaw, told PAP.

Each recording session involves two deaf people who discuss topics prepared by the researchers. "We want the collected data to reflect real, spontaneous conversation in sign language. We do not want these discussions to be staged or unnatural" - said Dr. Rutkowski.

The participants do not limit themselves to short statements: they complete 20 linguistic tasks, and recording one person takes about 5 hours. "So after each recording session we have 10 hours of recordings" - the researcher told PAP.

The plan of the conversation is worked out to the minute. The participants’ tasks include giving directions to places shown on a map, making an appointment, talking about an excerpt they have viewed, recalling what they were doing when they heard about the 9/11 attacks, etc.

"We cannot say: 'show us the past tense, negation and the imperative in your language'. Few people, hearing people included, are able to describe grammar. So we prepared tasks that require referring to the past, using negation or signing an imperative sentence. Data collected in this way are an invaluable source of knowledge about real PSL. We see how the deaf use space, how they describe temporal relations, how they form sentence structures. Their language is fascinating, as complex as spoken languages, but quite different in its grammatical features" - described the scholar.

Each conversation is recorded by five cameras installed in various places in the studio. In sign language, what matters is not only which sign is made, but also how far from the body the hands are held. For this reason, one of the cameras is suspended from the ceiling and records a bird’s-eye view of the conversation.

"The corpus is a treasure trove of knowledge not only about the grammar of sign language, but also about the culture of the deaf. While collecting the linguistic data, we also collect information about the life of the deaf in early twenty-first century Poland. We often forget that the deaf are one of the largest linguistic minorities in Poland, with their own traditions, culture, poetry, theatre, customs, etc." - the researcher told PAP.

The study involves deaf people from all over Poland. The oldest participant is 82 years old, the youngest is 18. With such diversity, scientists will learn how the signs and structures used in Warsaw differ from those used, for example, in Wrocław, and determine how the oldest and the youngest people sign.

"We already know that the differences between signers are much greater than in the case of spoken Polish. This is due to differences in education: every school for the deaf is a little different, and every major city has its own PSL user community. There is no 'master' version of sign language. PSL develops spontaneously, like other natural languages" - explained Dr. Rutkowski.

Who can use the data collected in the corpus? "We have to be very careful here. This is a delicate matter" - said the scientist. The problem is that statements in sign language cannot be presented without showing the face of the signer, and participants in the corpus recordings do not always want their image to be publicly available. "The corpus will certainly not be fully available on the internet, apart from selected samples. Most likely, access to the data will be granted after the applicant demonstrates that it will be used only for research or educational purposes, and not commercially" - explained the scientist.

Twenty people are working on developing the corpus, half of whom are hearing impaired. "Those of us who can hear would never be able to describe the language as well as its users. The fact that we now have the largest sign language corpus in the world is largely due to their contribution. I am extremely grateful to my deaf friends for their willingness to share their linguistic competence with the hearing" - said Dr. Rutkowski.

The scientists obtained funds for the first part of the project from the Foundation for Polish Science and the National Science Centre. If they manage to secure additional funds, the work will continue. "The German project of this type is planned to run for 20 years. It envisages recording more than 300 people. If we could record, in proportion to the Polish population, about 150 - 200 people, it would already be a huge corpus. However, that is a task for years" - noted the scientist.

