In the Language Bank: Mika Hämäläinen

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Mika Hämäläinen tells us about his research on computational creativity and developing language technology for endangered languages.

Who are you?

I am Mika Hämäläinen, a postdoctoral researcher at the Department of Digital Humanities at the University of Helsinki. In 2020, I finished my PhD thesis on computational creativity, titled Generating Creative Language: Theories, Practice and Evaluation. The title describes my research interests well: I am interested not only in the technical implementation of language technology models, but also in how they relate to theories and real-world phenomena. Open source code and publishing research results as tools that are as easy to use as possible are very important to me.

What is your research topic?

I have researched computational creativity as well as language technology for endangered languages and for non-standard languages such as dialects and historical language forms. Computational creativity is a challenging research topic from the perspective of Artificial Intelligence (AI), as the aim is to develop computational models that are capable of producing new creative texts such as poetry (Hämäläinen & Alnajjar, 2019) or humour (Alnajjar & Hämäläinen, 2021). A machine shouldn’t just be able to output new text, but also be able to interpret its output on some meaningful level. For this purpose, we have developed analysis tools, such as the FinMeter library, which analyses Finnish poetry. The library can be used, for example, to analyse meter and interpret metaphors.

Language technology for endangered languages is very challenging, as modern language technology increasingly relies on massive text resources that are not readily available for these languages. The corpora of endangered languages also tend to contain a lot of variation, as the languages concerned may not have been subject to language planning to the same extent as, for example, Finnish. This kind of linguistic diversity is difficult from the perspective of machine learning: the more variation a corpus contains, the larger it needs to be for machine learning models to cope with the variation. Language technology for endangered languages therefore requires some ingenuity. We have successfully analysed the morphology (Hämäläinen et al., 2021a), morphosyntax (Hämäläinen & Wiechetek, 2020) and cognates (Hämäläinen & Rueter, 2019) of endangered languages by generating synthetic data for machine learning models. Data from endangered languages can be easily processed using the UralicNLP library that I have developed.
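The idea of generating synthetic training data can be illustrated with a minimal sketch. It is not the method used in the cited papers: the rule table, the tag strings and the function name generate_pairs are all hypothetical, and a handful of Finnish noun endings stands in for whatever rule-based generator would be used in practice. The point is only that a small set of rules can produce as many (word form, analysis) pairs as a model needs, even when no large annotated corpus exists.

```python
from itertools import product

# Hypothetical toy rule table (an assumption for illustration):
# morphological tag -> suffix for simple Finnish nouns without stem changes.
SUFFIXES = {
    "N+Sg+Nom": "",
    "N+Sg+Gen": "n",
    "N+Sg+Ine": "ssa",
    "N+Pl+Nom": "t",
}


def generate_pairs(stems, suffixes=SUFFIXES):
    """Return synthetic (surface form, analysis) pairs for every stem and tag."""
    return [
        (stem + suffix, f"{stem}+{tag}")
        for stem, (tag, suffix) in product(stems, suffixes.items())
    ]


# Two stems and four rules give eight training pairs.
pairs = generate_pairs(["talo", "auto"])
for form, analysis in pairs:
    print(form, analysis)
```

A model trained on such pairs learns to map forms to analyses; real morphology is of course far richer than this table suggests.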

Even in the case of vital languages, abundant variation is a headache for language technologists. I have done research on the normalisation of historical English language forms (Hämäläinen et al., 2018). Normalisation simply means that a computer converts historical, non-standard spellings into their modern standard forms. The English language normalisation tool Natas is available on GitHub. Since then, I have worked on the normalisation of Finnish (Partanen et al., 2019) and Finland Swedish dialects (Hämäläinen et al., 2020a), as well as on the generation of Finnish dialects (Hämäläinen et al., 2020b) based on the written language. These research results have been published in the Murre library. My most recent work has been the automatic recognition of Finnish dialects based on audio and text (Hämäläinen et al., 2021b).
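What normalisation means in practice can be shown with a toy example. The sketch below is not how Natas or Murre work; it only illustrates the input and output of the task. It assumes a tiny hypothetical modern lexicon and a single spelling alternation (historical English texts often interchange u and v), tries every u/v variant of a word, and returns the first variant found in the lexicon. The names MODERN_LEXICON, VARIANTS and normalise are made up for this illustration.

```python
from itertools import product

# Hypothetical miniature lexicon of modern spellings (assumption).
MODERN_LEXICON = {"love", "upon", "have", "over"}

# For each ambiguous letter, the spellings to try (original letter first).
VARIANTS = {"u": "uv", "v": "vu"}


def normalise(word, lexicon=MODERN_LEXICON):
    """Return a modern spelling of word, or word unchanged if none is found."""
    options = [VARIANTS.get(ch, ch) for ch in word.lower()]
    for candidate in map("".join, product(*options)):
        if candidate in lexicon:
            return candidate
    return word


for historical in ["loue", "vpon", "haue"]:
    print(historical, "->", normalise(historical))
```

Words that cannot be matched against the lexicon are returned unchanged, which is a common conservative choice in normalisation.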

How is your research related to Kielipankki?

The Samples of Spoken Finnish corpus has been absolutely crucial in building dialect models. Without this corpus, my research on Finnish dialects would simply have been impossible.

The data from the Language Bank has also been useful in the study of computational creativity. For example, the Finnish WordNet has been used in my poetry generator (Hämäläinen, 2018) and Opusparcus has been useful in producing creative dialogue (Alnajjar & Hämäläinen, 2019).

Publications

Alnajjar, K., & Hämäläinen, M. (2021). When a Computer Cracks a Joke: Automated Generation of Humorous Headlines. In Proceedings of the 12th International Conference on Computational Creativity (ICCC 2021) (pp. 292-299). Association for Computational Creativity.

Hämäläinen, M., Alnajjar, K., Partanen, N., & Rueter, J. (2021b). Finnish Dialect Identification: The Effect of Audio and Text. In M-F. Moens, X. Huang, L. Specia, & S. Wen-tau Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (pp. 8777-8783). The Association for Computational Linguistics.

Hämäläinen, M. (2020). Generating Creative Language: Theories, Practice and Evaluation. Helsingin yliopisto.

Alnajjar, K., & Hämäläinen, M. (2019). A Creative Dialog Generator for Fallout 4. In Proceedings of the 14th International Conference on the Foundations of Digital Games (Article 48). ACM.

Hämäläinen, M., & Alnajjar, K. (2019). Let’s FACE it: Finnish Poetry Generation with Aesthetics and Framing. In K. V. Deemter, C. Lin, & H. Takamura (Eds.), 12th International Conference on Natural Language Generation: Proceedings of the Conference (pp. 290-300). The Association for Computational Linguistics. https://doi.org/10.18653/v1/w19-8637

Hämäläinen, M., Partanen, N., Rueter, J., & Alnajjar, K. (2021a). Neural Morphology Dataset and Models for Multiple Languages, from the Large to the Endangered. In S. Dobnik, & L. Øvrelid (Eds.), Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa) (pp. 166-177). (NEALT Proceedings Series; No. 45), (Linköping Electronic Conference Proceedings; No. 178). Linköping University Electronic Press.

Hämäläinen, M., & Rueter, J. (2019). Finding Sami Cognates with a Character-Based NMT Approach. In A. Arppe, J. Good, M. Hulden, J. Lachler, A. Palmer, L. Schwartz, & M. Silfverberg (Eds.), Proceedings of the 3rd Workshop on Computational Methods in the Study of Endangered Languages: (Volume 1) Papers (pp. 39-45). The Association for Computational Linguistics.

Hämäläinen, M., Partanen, N., & Alnajjar, K. (2020a). Normalization of Different Swedish Dialects Spoken in Finland. In GeoHumanities’20: Proceedings of the 4th ACM SIGSPATIAL Workshop on Geospatial Humanities (pp. 24-27). ACM.

Hämäläinen, M., Partanen, N., Alnajjar, K., Rueter, J., & Poibeau, T. (2020b). Automatic Dialect Adaptation in Finnish and its Effect on Perceived Creativity. In F. A. Cardoso, P. Machado, T. Veale, & J. M. Cunha (Eds.), Proceedings of the 11th International Conference on Computational Creativity (ICCC’20) (pp. 204-211). Association for Computational Creativity.

Hämäläinen, M., & Wiechetek, L. (2020). Morphological Disambiguation of South Sámi with FSTs and Neural Networks. In D. Beermann, L. Besacier, S. Sakti, & C. Soria (Eds.), Proceedings of the 1st Joint SLTU and CCURL Workshop (SLTU-CCURL 2020) (pp. 36-40). European Language Resources Association (ELRA).

Hämäläinen, M., Säily, T., Rueter, J., Tiedemann, J., & Mäkelä, E. (2018). Normalizing early English letters to Present-day English spelling. In B. Alex, S. Degaetano-Ortlieb, A. Feldman, A. Kazantseva, N. Reiter, & S. Szpakowicz (Eds.), Proceedings of the 2nd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (pp. 87-96). (ACL Anthology; No. W18-45). The Association for Computational Linguistics.

Hämäläinen, M. (2018). Harnessing NLG to Create Finnish Poetry Automatically. In F. Pachet, A. Jordanous, & C. León (Eds.), Proceedings of the Ninth International Conference on Computational Creativity (pp. 9-15). Association for Computational Creativity (ACC).

Partanen, N., Hämäläinen, M., & Alnajjar, K. (2019). Dialect Text Normalization to Normative Standard Finnish. In W. Xu, A. Ritter, T. Baldwin, & A. Rahimi (Eds.), The Fifth Workshop on Noisy User-generated Text (W-NUT 2019): Proceedings of the Workshop (pp. 141-146). The Association for Computational Linguistics.

More information on the tools and corpora

The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps researchers in Finland to use, refine, preserve and share their language resources. The Language Bank of Finland is the collection of services that provides language materials and tools for the research community.