02.12.2021

National Information Processing Institute creates neural language models in Polish

Credit: Adobe Stock

The National Information Processing Institute develops neural language models that are used in spam detection and antiplagiarism systems. Two new models have been launched this year: Polish RoBERTa v2 and GPT-2, the latter intended for text generation tasks.

The popularity of neural language models has grown significantly over the past few years, and their size (number of parameters) is increasing rapidly. They are widely used, yet few people are aware of it. Thanks to them, Internet users can translate texts into different languages, detect spam, study social moods online, use automatic text correction, and talk to chatbots.

Work on the development of neural language models continues in many IT centres and companies around the world. The IT industry has long recognised their potential, and they are increasingly useful in every Internet user's life. Developing new neural models, however, requires high computational power and specialist infrastructure; individuals and small organizations are not capable of training them. Large amounts of data are also necessary: as with other tools based on artificial intelligence (AI), the larger the data set used to train a model, the more precise the model will be.

Most of these models, however, are developed for English. That is why researchers at the National Information Processing Institute develop and share models for the Polish language. This year, they added two more: Polish RoBERTa v2 and GPT-2.
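Models of this kind are typically published so that they can be loaded with standard open-source tooling. Below is a minimal sketch of how a RoBERTa-style Polish model could be queried for masked-word prediction using the Hugging Face transformers library; the repository identifier is an assumption for illustration only, and the exact mask token depends on the tokenizer used by the published checkpoint.

```python
# Minimal sketch, assuming the model is hosted on the Hugging Face hub.
# The identifier below is hypothetical -- check the authors' model page for the real one.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="sdadas/polish-roberta-base-v2")  # hypothetical identifier

# Ask the model to fill in the masked word of a Polish sentence
# ("Warszawa to <mask> Polski." -- "Warsaw is the <mask> of Poland.").
for prediction in fill_mask("Warszawa to <mask> Polski."):
    print(prediction["token_str"], round(prediction["score"], 3))
```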

According to the National Information Processing Institute: “The models can be used, for example, for research on the detection and classification of hate speech or fake news in social media. Models in Polish are essential for analysing the Polish Internet; it is not possible to analyse data on Polish phenomena using foreign-language tools.”

The base part of the models' training data consists of high-quality texts (Wikipedia, documents of the Polish Parliament, social media content, books, articles, and other longer written forms). The web part of the dataset consists of filtered and properly cleaned extracts from websites (the CommonCrawl project).

Sławomir Dadas, deputy head of the Laboratory of Intelligent Information Systems at the National Information Processing Institute, said: “The models made available by the National Information Processing Institute are based on transformer networks. This architecture is relatively new, in use since 2017. Transformer networks do not rely on sequential data processing; instead, they process data simultaneously.”
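The core operation behind that simultaneous processing is self-attention: every token in a sentence attends to every other token in a single matrix computation, rather than being read one step at a time as in older recurrent networks. The sketch below illustrates scaled dot-product self-attention in a few lines of NumPy; it is a toy illustration of the general mechanism, not code from the institute's models.

```python
# Toy sketch of scaled dot-product self-attention: all tokens are processed at once.
import numpy as np

def self_attention(x, wq, wk, wv):
    """x: (seq_len, d_model) token embeddings; wq/wk/wv: learned projection matrices."""
    q, k, v = x @ wq, x @ wk, x @ wv              # queries, keys, values for all tokens at once
    scores = q @ k.T / np.sqrt(k.shape[-1])       # pairwise attention scores, shape (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ v                            # each output is a weighted mix of all token values

# Toy example: 4 tokens with 8-dimensional embeddings and random projections.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
wq, wk, wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(x, wq, wk, wv).shape)        # -> (4, 8)
```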

Training one model takes approximately three to four months. All neural language models developed at the National Information Processing Institute have been tested against the KLEJ benchmark, a comprehensive list of language evaluation tasks developed by Allegro. It makes it possible to evaluate a model on nine tasks, such as sentiment analysis or assessing the semantic similarity of texts. (PAP)
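For a sense of what one such evaluation task involves, here is a hedged sketch of scoring a sentiment classifier on labelled Polish sentences. The checkpoint path and the tiny in-line examples are assumptions for illustration; the actual KLEJ benchmark defines its own datasets, splits, and metrics.

```python
# Illustrative sketch of a sentiment-classification evaluation (one KLEJ-style task).
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="path/to/polish-roberta-finetuned-sentiment",  # hypothetical fine-tuned checkpoint
)

# Toy labelled examples: Polish sentences with expected sentiment labels.
examples = [
    ("Obsługa była świetna, polecam ten hotel.", "positive"),
    ("Produkt przestał działać po dwóch dniach.", "negative"),
]

correct = 0
for text, expected in examples:
    predicted = classifier(text)[0]["label"]
    correct += int(predicted.lower() == expected)

print(f"accuracy: {correct / len(examples):.2f}")  # simple accuracy over the toy set
```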

uka/ zan/ kap/

tr. RL

