In the Language Bank: Marja-Liisa Helasvuo
Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Marja-Liisa Helasvuo tells us about the digital language resources that have been compiled at the University of Turku. The collaboration has now evolved into a full-scale infrastructure of language data and resources.

Who are you?

I am Marja-Liisa Helasvuo, professor of Finnish language at the University of Turku. I studied Finnish language and general linguistics at the University of Helsinki, and I did my PhD in linguistics at the University of California, Santa Barbara. I have always been particularly interested in spoken language, and in my doctoral thesis I examined spoken Finnish from a crosslinguistic perspective.

What is your research topic?

My research has focused on grammar and human interaction. I have investigated a wide variety of data: everyday conversations between adults or between adults and children, online conversations, and other computer-mediated interactions. I have also studied written texts, from the oldest Finnish texts to more recent ones. I have explored a wide range of grammatical topics with the help of these resources.

I work at the Department of Finnish and Finno-Ugric Languages at the University of Turku. We have produced several digital corpora, starting with The Finnish Dialect Corpus of the Syntax Archive, whose compilation began in 1967. It was the first Finnish-language corpus to be compiled directly into a machine-readable format.

Since the Dialect Corpus, several others have followed: the Agricola Corpus, which contains all the works of Mikael Agricola from the 16th century, the Advanced Finnish Learners’ Corpus (LAS2) and the Corpus of Academic Finnish (LAS1). These are all grammatically coded and they are available in Kielipankki – the Language Bank of Finland (LAS1 will be available soon). In addition, we have produced several resources for Finno-Ugric languages. These materials have been collected in the Archive of Finnish and Finno-Ugric Languages. As we have produced many language resources in our organization, we also have many researchers who are interested in conducting corpus-based research. It’s always easy to ask a colleague for assistance when figuring out which corpus to use to study a particular topic.

Recently, we have been collaborating increasingly with the TurkuNLP research group. We established the UTU-Digilang infrastructure, which includes not only the Archive of Finnish and Finno-Ugric Languages, but also the Digilang portal, the Digilang long-term storage, and the TurkuNLP research group with its language resources and data tools. This collaboration has been very rewarding and I have learned a lot from it. I would like to see more collaboration of this kind in the future as well.

How is your research related to Kielipankki?

I have used language corpora in almost all my research. Many of these resources are available in Kielipankki.

I have been working on the ArkiSyn Corpus, which is available in Kielipankki. We received funding for the project from the Kone Foundation, which helped us to build a morphosyntactically annotated corpus. You can easily search it for all occurrences of a given word (e.g. all forms of the verb ajatella, ’think’) or all occurrences of a given grammatical form (e.g. all forms of the past tense).

Recently, my research has focused in particular on different kinds of fixed expressions, which occur frequently and mostly in the same form. For example, the verb ajatella ’think’ is a very common verb in everyday Finnish conversation. It almost always occurs in the 1st person singular and in the past tense (ajattelin ’I thought’). When we compared the results of the corpus search with the corresponding passages in the audio recordings, we found that although the expressions were transcribed as ’I thought’, they were in fact phonetically quite eroded. In most cases, the expression occurred in the form maattet. The first person singular pronoun minä ‘I’ was reduced to the m sound at the beginning, the first and second syllables of the verb ’think’ (ajat) were fused together (aat), and the reduced form of the word että ’that’ had stuck to the end. This type of phonetic reduction and crystallization of usage into a particular form is very common in fixed expressions.

In addition to ArkiSyn, I have also used the Suomi24 Corpus, the Agricola Corpus, The Finnish Dialect Corpus of the Syntax Archive and newspaper materials. The different corpora allow for different research topics.


Laury, Ritva, Marja-Liisa Helasvuo & Janica Rauma 2020. “When an expression becomes fixed: mä ajattelin että ‘I thought that’ in spoken Finnish”. – Ritva Laury & Tsuyoshi Ono (eds.), Fixed Expressions: Building language structure and social action, pp. 133–166. Pragmatics & Beyond New Series 315. Amsterdam: John Benjamins.

Helasvuo, Marja-Liisa 2019. “Free NPs as units”. Special issue “On the Notion of Unit in the Study of Human Languages”, guest editors Tsuyoshi Ono, Ritva Laury & Ryoko Suzuki. Studies in Language 43(2):301–328.

Laury, Ritva & Marja-Liisa Helasvuo 2016. “Disclaiming epistemic access with ‘know’ and ‘remember’ in Finnish”. Special issue “Grammar and negative epistemics in talk-in-interaction”, guest editors Jan Lindström, Yael Maschler & Simona Pekarek Doehler. Journal of Pragmatics 106:80–96.

Helasvuo, Marja-Liisa & Aki-Juhani Kyröläinen 2016. “Choosing between zero and pronominal subject: Modeling subject expression in the 1st person singular in Finnish conversation”. Corpus Linguistics and Linguistic Theory 12(2):263–299.

More information

The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps researchers in Finland to use, refine, preserve and share their language resources. The Language Bank of Finland is the collection of services that provides language materials and tools for the research community.