Researchers teach artificial intelligence to be fluent in Finnish dialects

Finnish dialects create a lot of trouble when interacting with computers, since it is impossible to speak a language without speaking in a dialect of some sort. A research group has built artificial intelligence models that can automatically detect, normalize and generate Finnish dialects.

Collecting data to help AI understand dialectal Finnish and Swedish has been in the news recently. Computers usually understand Finnish only in its normative standard form, known as kirjakieli, and they need examples of spoken language to improve their algorithms. Since it is impossible to speak a language without speaking in a dialect of some sort, Finnish dialects create a lot of trouble when interacting with computers.

The methods devised by the research group of Mika Hämäläinen, Niko Partanen, Khalid Alnajjar and Jack Rueter from the University of Helsinki take this further and enable an AI to be fluent in the Finnish dialects. The results were published in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing by The Association for Computational Linguistics.

Computers can speak in 23 subdialects

Within the paradigm of computational creativity, the researchers have developed a method for converting standard Finnish into any one of 23 Finnish subdialects. Computers should not only be able to understand dialectal Finnish, but they should also be able to express themselves in a dialect.

– With our method, an intelligent system such as a robot can say akku on lopussa (battery is low), for example in Etelä-Karjala dialect akku o lopussa, Etelä-Satakunta dialect akku ol lopus or Länsi-Uusimaa dialect akku o lopus, Hämäläinen says.

For example, the widely used Google Translate fails to translate the dialectal Finnish sentence Oisko sulla jotai esimerkkei siit (Do you happen to have some examples of that). It produces a completely incorrect “English” translation, Oisko sulla something like that, because it has been built to work exclusively on standard Finnish. The same phenomenon can be observed with any AI tool that supports Finnish, such as Apple Siri or dictation in macOS.

Dialects are detected from both spoken audio and text

The research shows that detecting dialects is a difficult task when relying on plain text alone. Dialect identification is easier when the model also has access to audio, because many dialects are marked by distinctive phonetic properties. The group's latest research therefore deals with detecting dialects from both spoken audio and text.

– The process of normalizing dialects to standard text has many benefits. It allows analyzing dialectal materials using tools for standard Finnish, and we can also use the normalized version as a search item when we want to find something from the dialectal materials, says Khalid Alnajjar.
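
The search use case mentioned above can be illustrated with a minimal sketch. It is not the research group's model: the normalize function below is a hypothetical stand-in backed by a small lookup table, filled only with the three dialectal sentences quoted earlier in this article, and the names build_index and search are likewise assumptions made for illustration. The point is only that once dialectal sentences are indexed by their normalized form, a query in standard Finnish can find all of them.

```python
# Toy data: the dialectal renderings quoted in this article, each paired
# with its standard Finnish form. A real system would use a trained
# normalization model instead of a lookup table (assumption for this sketch).
NORMALIZED = {
    "akku o lopussa": "akku on lopussa",  # Etelä-Karjala
    "akku ol lopus": "akku on lopussa",   # Etelä-Satakunta
    "akku o lopus": "akku on lopussa",    # Länsi-Uusimaa
}


def normalize(sentence: str) -> str:
    """Hypothetical stand-in for a normalizer; unknown input is returned unchanged."""
    return NORMALIZED.get(sentence, sentence)


def build_index(corpus: list[str]) -> dict[str, list[str]]:
    """Group the original dialectal sentences under their normalized form."""
    index: dict[str, list[str]] = {}
    for sentence in corpus:
        index.setdefault(normalize(sentence), []).append(sentence)
    return index


def search(index: dict[str, list[str]], query: str) -> list[str]:
    """Normalize the query too, so standard and dialectal queries both match."""
    return index.get(normalize(query), [])


corpus = list(NORMALIZED)
index = build_index(corpus)
hits = search(index, "akku on lopussa")
for hit in hits:
    print(hit)
```

Because the query is normalized as well, searching with one of the dialectal forms returns the same set of sentences as searching with the standard form.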

The researchers point out that the problem of understanding dialects is complex and that no model can understand natural language the way humans do. Still, the models they have created open up many interesting directions for research, such as the degree to which a dialect deviates from the norm and what the syntactic differences between language varieties are.

– With this we can improve the current state of Finnish natural language processing solutions and build AI models tailored for individuals. For example, we have already reached impressive results in speech recognition of one person’s speech, even in endangered languages, Niko Partanen says.

The research group has also developed a similar normalization methodology for the dialects of Swedish spoken in Finland (Hämäläinen et al., 2020b) and historical Finnish (Hämäläinen et al., 2021b).

The dialect generator can be tested online, and the code for the dialect normalizer and generator has been released openly on GitHub. The dialect identification code can be found on GitHub as well.

Research papers

Partanen, N., Hämäläinen, M., & Alnajjar, K. (2019). Dialect Text Normalization to Normative Standard Finnish. In W. Xu, A. Ritter, T. Baldwin, & A. Rahimi (Eds.), The Fifth Workshop on Noisy User-generated Text (W-NUT 2019): Proceedings of the Workshop (pp. 141–146). The Association for Computational Linguistics.

Hämäläinen, M., Partanen, N., Alnajjar, K., Rueter, J., & Poibeau, T. (2020a). Automatic Dialect Adaptation in Finnish and its Effect on Perceived Creativity. In F. A. Cardoso, P. Machado, T. Veale, & J. M. Cunha (Eds.), Proceedings of the 11th International Conference on Computational Creativity (ICCC’20) (pp. 204–211). Association for Computational Creativity.

Hämäläinen, M., Partanen, N., & Alnajjar, K. (2020b). Normalization of Different Swedish Dialects Spoken in Finland. In GeoHumanities'20: Proceedings of the 4th ACM SIGSPATIAL Workshop on Geospatial Humanities (pp. 24–27). ACM.

Hämäläinen, M., Alnajjar, K., Partanen, N., & Rueter, J. (2021a). Finnish Dialect Identification: The Effect of Audio and Text. In M-F. Moens, X. Huang, L. Specia, & S. Wen-tau Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (pp. 8777–8783). The Association for Computational Linguistics.

Hämäläinen, M., Partanen, N., & Alnajjar, K. (2021b). Lemmatization of Historical Old Literary Finnish Texts in Modern Orthography. In P. Denis [et al.] (Ed.), Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles (pp. 189–198). Association pour le Traitement Automatique des Langues.