In the Language Bank: Tommi Jauhiainen

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Tommi Jauhiainen works as a Project Planning Officer in Kielipankki and he is currently starting his two-year post doc. Here, Tommi tells us about his research related to some language resources in Kielipankki.

Who are you?

I am Tommi Jauhiainen, and at the moment I work as a Project Planning Officer in Kielipankki. From the beginning of 2021, I will start as a post doc researcher on a grant from the Finnish Research Impact Foundation.

What is your research topic?

During the past ten years, my research has focused on language identification of text. On this topic, I completed my Master's thesis in 2010 and my PhD thesis in 2019. Language identification means comparing a text written in an unknown language against a set of given languages to determine which of them it is written in. A similar method can also be used to classify texts by subject area, for example.

The difficulty of language identification varies greatly depending on the situation. The task is easy when there are only a few clearly different languages to choose from, such as Finnish and Swedish, and when the texts are reasonably long, for example several sentences. When there are hundreds of languages to choose from, when the languages are close to each other (e.g. Kven and Meänkieli), or when the texts are short (e.g. single words only), it may be very difficult to identify the language.
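
To make the idea of comparing a text to a set of given languages concrete, here is a minimal sketch. It assumes a generic character-trigram model with add-one smoothing, which is a common textbook baseline and not necessarily the method used in the research described here. The training sentences, the language labels `fin` and `swe`, and all function names are made up for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased, space-padded text."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """Build one n-gram count table per language from {language: [texts]}."""
    return {lang: Counter(g for t in texts for g in char_ngrams(t, n))
            for lang, texts in samples.items()}


def identify(text, models, n=3):
    """Return the language whose model gives the text the highest smoothed log-probability."""
    vocab = len(set().union(*models.values()))  # distinct n-grams seen in any language
    grams = char_ngrams(text, n)
    scores = {}
    for lang, counts in models.items():
        total = sum(counts.values())
        # Add-one smoothing keeps unseen n-grams from producing log(0).
        scores[lang] = sum(math.log((counts[g] + 1) / (total + vocab)) for g in grams)
    return max(scores, key=scores.get)


# Tiny, hypothetical training data: three sentences per language.
SAMPLES = {
    "fin": ["tämä on suomenkielinen lause",
            "minä puhun suomea joka päivä",
            "kieli on tärkeä osa kulttuuria"],
    "swe": ["det här är en svensk mening",
            "jag talar svenska varje dag",
            "språket är en viktig del av kulturen"],
}
MODELS = train(SAMPLES)

print(identify("hän puhuu suomea ja lukee kirjaa", MODELS))
print(identify("hon talar svenska och läser en bok", MODELS))
```

With two clearly different languages and sentence-length input, even this toy model has plenty of distinctive trigrams to work with. A single word yields only a handful of trigrams, and closely related languages share most of theirs, which is one way to see why those settings are harder.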

Last year, our extensive survey of automatic language identification in texts was published in the Journal of Artificial Intelligence Research. We are also currently working on a textbook on the same topic. The book is expected to be published in the “Synthesis Lectures on Human Language Technologies” series by Morgan & Claypool in late 2021.

During and after my PhD research, I have participated in several international shared tasks that have focused on distinguishing between very close languages or dialects. In 2018, we won the shared tasks focusing on Swiss German dialects and Indo-Aryan languages, and last year we won a shared task focusing on different versions of Mandarin Chinese. I am also a member of the "" Centre of Excellence, in which context I have studied how cuneiform texts written in different dialects of Akkadian and Sumerian could be distinguished from one another. I organized an international shared task on this topic last year, and the winner was a .

In the forthcoming “Language Identification of Speech and Text” project, funded by the Finnish Research Impact Foundation, I will move towards the study of language identification in speech, in addition to text. Until now, the research fields of speech and text language identification have been relatively separate from each other, and my intention is to bring more collaboration between them.

How is your research related to Kielipankki?

Most of my PhD research was done in the project, which was part of the FIN-CLARIN research group that maintains Kielipankki. In the project, we searched the Internet for websites written in small Uralic languages, created a portal site for them, and compiled sentence corpora from the texts they contained. While harvesting the web and creating the sentence corpora, we used automatic language identification as part of the workflow. The portal site is now part of the tools maintained by Kielipankki, and the corpora can be found in Kielipankki in three different versions. The Wanca 2017 corpora are being used in the ongoing ULI (Uralic Language Identification) shared task, and the corpora will be published next year.

Publications related to Kielipankki:

Jauhiainen, H., Jauhiainen, T., & Linden, K. (2015). . In First International Workshop on Computational Linguistics for Uralic Languages: Proceedings of the Workshop (Vol. 2, pp. 87–98). (Septentrio Conference Series; Vol. 2015, No. 2). Septentrio Academic Publishing.

Jauhiainen, T., Linden, K., & Jauhiainen, H. (2015). . In Computational Linguistics and Intelligent Text Processing (Vol. Part I, pp. 633-643). (Lecture Notes in Computer Science; Vol. 9041). Springer International Publishing AG.

Jauhiainen, T., Linden, K., & Jauhiainen, H. (2016). . In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects: VarDial3, Osaka, Japan, December 12 2016 (pp. 153-162).

Jauhiainen, T., Linden, K., & Jauhiainen, H. (2017). . In 21st Nordic Conference of Computational Linguistics: Proceedings of the Conference (pp. 183-191). (Linköping Electronic Conference Proceedings; No. 31). Linköping University Electronic Press.

Jauhiainen, T., Jauhiainen, H., & Linden, K. (2018). . In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018) (pp. 66-75). The Association for Computational Linguistics.

Jauhiainen, T., Jauhiainen, H., & Linden, K. (2018). . In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018) (pp. 254-262). The Association for Computational Linguistics.

Jauhiainen, H., Jauhiainen, T., & Linden, K. (2019). Wanca in Korp: Text corpora for underresourced Uralic languages. In Proceedings of the Research data and humanities (RDHUM) 2019 conference: data, methods and tools (pp. 21-40). University of Oulu.

Jauhiainen, T., Linden, K., & Jauhiainen, H. (2019). , 25(5), 561-583. [135132491900038].

Jauhiainen, T. (2019). . University of Helsinki.

Jauhiainen, T., Jauhiainen, H., Alstola, T., & Linden, K. (2019). . In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2019) (pp. 89-98). The Association for Computational Linguistics.

Jauhiainen, T., Jauhiainen, H., & Linden, K. (2019). . In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2019) (pp. 178-187). The Association for Computational Linguistics.

Jauhiainen, T., Lui, M., Zampieri, M., Baldwin, T., & Lindén, K. (2019). . Journal of Artificial Intelligence Research, 65, 675-782.

Zampieri, M., Malmasi, S., Scherrer, Y., Samardžić, T., Tyers, F., Silfverberg, M. P., Klyueva, N., Pan, T-L., Huang, C-R., Ionescu, R. T., Butnaru, A., & Jauhiainen, T. S. (2019). . In Proceedings of the (pp. 1-16). The Association for Computational Linguistics.

Jauhiainen, H., Jauhiainen, T., & Linden, K. (2020). . In Proceedings of the 12th Web as Corpus Workshop (pp. 23-32). The Association for Computational Linguistics.

Gaman, M., Hovy, D., Ionescu, R. T., Jauhiainen, H., Jauhiainen, T., Linden, K., Ljubešić, N., Partanen, N., Purschke, C., Scherrer, Y., & Zampieri, M. (Accepted/In press). A Report on the VarDial Evaluation Campaign 2020.

Jauhiainen, T., Jauhiainen, H., Partanen, N., & Linden, K. (Accepted/In press). . In Proceedings of VarDial 2020.

Lindgren, M., Jauhiainen, T., & Kurimo, M. (2020). . In Proceedings of Interspeech 2020 (pp. 467-471).

The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps researchers in Finland to use, refine, preserve and share their language resources. Kielipankki is the collection of services that provides the language materials and tools for the research community.

10.12.2020

Tommi Jauhiainen
