Sardana Ivanova defends her PhD thesis on Language technology tools for low-resource languages

On the 23rd of March 2024, M.Sc. Sardana Ivanova defends her PhD thesis on Language technology tools for low-resource languages—five cases for Sakha, Norwegian, and Finnish. The thesis is related to research done in the Department of Computer Science and in the Computational Creativity and Data Mining group.

M.Sc. Sardana Ivanova defends her doctoral thesis "Language technology tools for low-resource languages—five cases for Sakha, Norwegian, and Finnish" on Saturday the 23rd of March 2024 at 12 o'clock in the University of Helsinki Porthania building, Auditorium PIII (Yliopistonkatu 3, 1st floor). Her opponent is Professor Veronika Laippala (University of Turku), and the custos is Professor Hannu Toivonen (University of Helsinki). The defence will be held in English.

The thesis of Sardana Ivanova is a part of research done in the Department of Computer Science and in the Computational Creativity and Data Mining group at the University of Helsinki. Her supervisors have been Professor Hannu Toivonen (University of Helsinki) and Senior AI Scientist Mark Granroth-Wilding (Silo AI).

Language technology tools for low-resource languages—five cases for Sakha, Norwegian, and Finnish

This dissertation develops language technology tools for low-resource languages. It is important to ensure that low-resource languages are not left behind in the rapidly evolving digital landscape, as language technology tools can greatly improve communication and information access for speakers of these languages. The support of low-resource languages through technology development and revitalisation efforts is essential for preserving linguistic diversity and maintaining the richness of cultural heritage. 

The dissertation presents five case studies for three languages, ranging from the truly low-resource Sakha language to the better-resourced Finnish and Norwegian, which still lack many of the resources available for English. Sakha is a Turkic language spoken in the Republic of Sakha in Siberia by 0.5 million people. Finnish is a Uralic language of the Finnic branch, spoken by 5.8 million people in Finland and by ethnic Finns outside of Finland. Norwegian is a North Germanic language, spoken mainly in Norway by 5.32 million people.

The five cases covered in the dissertation range from essential tools for Sakha, such as a morphological analyser, to higher-level tools for Norwegian and Finnish. The contributions of the dissertation are as follows. 

We developed a morphological analyser and generator for Sakha within the framework of two-level morphology. It achieves coverage above 90% and precision of 99%. While developing the analyser, we expanded linguistic knowledge about Sakha and devised strategies for handling complex grammatical patterns.

We implemented a language-learning environment for Sakha in the Revita computer-assisted language-learning platform, using the morphological analyser we developed. 

We created a Turkic Interlingua corpus and trained Russian-Sakha, Sakha-Russian, English-Sakha, and Sakha-English machine translation models, as well as a multi-way neural machine translation model. We performed an extensive analysis using automatic metrics as well as human evaluations. 

We created NorQuAD, the first Norwegian question-answering dataset for machine reading comprehension. The dataset consists of 4,752 manually created question-answer pairs. We benchmarked several multilingual and Norwegian monolingual language models on the dataset and compared them against human performance.

We developed a method for poetry writing applicable to many languages. We illustrated the method using Finnish as an example. The method involves generating poetry one line at a time using a sequence-to-sequence neural model that has been fine-tuned for this purpose.

Availability of the dissertation

An electronic version of the doctoral dissertation will be available on the e-thesis site of the University of Helsinki at http://urn.fi/URN:ISBN:978-952-84-0105-6.

Printed copies will be available on request from Sardana Ivanova: sardana.ivanova@helsinki.fi.