In the Language Bank: Therese Lindström Tiedemann
Therese Lindström Tiedemann tells us about her research on Swedish as a second language. There is a definite need to continue developing Finland-Swedish corpora to ensure that Finland-Swedish is also included in future studies of the Swedish language.

Who are you?

My name is Therese Lindström Tiedemann and I am a university lecturer in the Swedish language at the University of Helsinki. In addition to the Swedish language, I also work on general linguistics. I wrote my PhD thesis on the history of grammaticalisation as a concept in linguistics, i.e. within the history of linguistics.

What is your research topic?

In recent years, most of my research has been on Swedish as a second language. In my research I often use corpus linguistic methods. Together with colleagues, I have also tried to use crowdsourcing. I also do research on other topics such as grammaticalisation, the history of linguistics, the teaching of grammar and metalinguistic knowledge.

How is your research related to Kielipankki?

I have used Kielipankki’s resources mainly in connection with my research on Swedish as a second language and in teaching. For instance, I have used the Swedish subcorpus of the Topling corpus. Currently, I am managing our faculty’s part of the Digisvenska project, where we are creating a text corpus from the Digital Matriculation Examination in B1-Swedish in Finland (Swedish as a second language, learnt from year 6, or year 7 in the old curriculum). We aim to study how well the exam corresponds to the curriculum, as well as the fairness and transparency of the test results. Among other things, we will study how scores and marks in the exams relate to lexical breadth in the form of lexical variation (cf. vocabulary size), to verb conjugation and adverbial clause modifiers, and to linguistic accuracy, understood as closeness to the norm.
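
To illustrate what lexical variation can mean in practice, here is a minimal sketch in Python of one common measure, the type-token ratio (the number of distinct word forms divided by the total number of words). This is an assumption for illustration only: the measure actually used in the Digisvenska project is not specified here, and the function names and the sample sentence are hypothetical.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and return its word tokens (runs of letters, incl. å, ä, ö)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def type_token_ratio(text: str) -> float:
    """Return distinct word forms divided by total words, or 0.0 for an empty text."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical learner sentence: 7 tokens, 6 distinct types ("en" occurs twice).
sample = "Jag har en katt och en hund."
print(round(type_token_ratio(sample), 2))
```

The plain type-token ratio falls as texts get longer, so comparing essays of different lengths would normally call for a length-adjusted variant.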

A few years ago, Jan Lindström and I tried to study the Swedish word nog (lit. ‘enough’) using the Sinebrychoff corpus. However, in the end most of the work had to be done with a more comprehensive text version of the corpus rather than with the version available in Korp.

Swedish-language resources in Finland need developing

I also have a more general interest in the Swedish-language resources available in Kielipankki because of my research on Swedish, because I teach students of Scandinavian languages, and because I often use corpus-based methods. This is why it is important for me to know which corpora I can recommend to students and how they can be used. There is definitely a need to continue developing Finland-Swedish corpora to ensure that we can describe Finland-Swedish (Sw. ”finlandssvenska”) in a similar way to how we can describe Swedish as spoken in Sweden (Sw. ”sverigesvenska”), and that Finland-Swedish is also included in future studies of the Swedish language.

In the Finnish context, we can also see that some corpora contain both Finnish and Swedish. There is a need to consider the best way to study how and when Swedish is used in these corpora, and whether this is representative of how Swedish is used in these contexts in Finland. This applies, for example, to the corpus of parliamentary plenary sessions (Eduskunnan täysistunnot), where Swedish words are currently only tagged as foreign words. This hampers research on this part of the data. At the same time, we can clearly see that Swedish words dominate the list of words tagged as foreign in the plenary sessions. It would be interesting to see whether these parts could be annotated as Swedish, which would make it easier to study them from a Swedish perspective.

Besides the Swedish-language resources, I also have an interest in interoperability between different corpora and resources, transparency of research data and comparability between different sources for the Swedish language. Many of the Swedish-language corpora are available via Språkbanken Text (Sweden), and we need to be able to compare the corpora at Kielipankki with these. I therefore see a need for information about how comparable these corpora are, and whether the corpora in Kielipankki have been annotated in the same way. This is important to ensure that Finland-Swedish and other Swedish corpora located in Finland can be compared with Swedish corpora located in Sweden. This could give Finland-Swedish and second language Swedish (L2 Swedish) with Finnish as the first language (L1) a clear and fair place in research on Swedish and L2 Swedish in general.

As part of my work on corpora, my colleagues and I have also checked how well the automatic annotation works, especially on material produced by L2 speakers. We have checked the annotation of coursebook texts (written by L1 speakers but aimed at, or selected for, L2 learners), texts written by L2 learners, and texts written by L2 speakers that have been ”normalised” (e.g. given standardised spelling) to facilitate annotation, queries and comparisons. The results showed that texts written by learners are often annotated less accurately, although not always. Lemmatisation, word class tagging and sense disambiguation were good enough to be used in studies of L2 Swedish, even though sense disambiguation was more problematic than the other two. There were bigger problems with dependency analysis (cf. clause analysis, parsing), and multiword expressions also proved to be problematic, especially in learner writing. Still, multiword annotation was good enough for us to conclude that we can use it in our work, although users should be aware that some expressions may have been missed and that the annotation is based on the multiword expressions included in the Saldo lexicon and on how they are listed there. The results also showed that there was sometimes disagreement about whether a preposition should be seen as part of the expression or not.

I am very happy to see that more Swedish corpora have been added to Kielipankki in the last few years. I hope that even more Swedish corpora will be added to Kielipankki in the future and that they will be annotated in the same way as the Swedish corpora in Språkbanken Text (Sweden). I also hope that information about the data will be made accessible in such a way that students and researchers can easily find comparable material and know how representative the material is of a certain type of language (e.g. a dialect, newspaper writing).

Recently finished projects and some future steps

In the coming years I will be working on a project on the pseudonymisation of linguistic data (Mormor Karl är 27 år). Pseudonymisation means that information such as names of people, places, etc. is replaced with pseudonyms in the data when it might reveal who wrote the text. In this project we will study how pseudonymisation affects research data in the humanities. This is an important step towards open, reusable data, which is needed for reproducibility and for replication studies on data that has already been collected, while at the same time protecting people’s identity.
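
The basic idea of pseudonymisation can be shown with a minimal Python sketch. It assumes that the names to be replaced are already known and listed by hand; the function name, the mapping and the sample sentence are hypothetical and do not describe the tools or methods used in the project, where detecting the sensitive information is itself a large part of the task.

```python
import re


def pseudonymise(text: str, replacements: dict[str, str]) -> str:
    """Replace each listed name with its pseudonym, matching whole words only."""
    if not replacements:
        return text
    # Try longer names first so that a full name wins over a shorter name inside it.
    names = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(lambda match: replacements[match.group(1)], text)


# Hypothetical mapping from real names to pseudonyms.
mapping = {"Karl": "Erik", "Åbo": "Vasa"}
print(pseudonymise("Karl bor i Åbo med Karl.", mapping))
```

Using one fixed mapping keeps each pseudonym consistent throughout a text, so references to the same person or place can still be followed after the replacement.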

In connection with the project which I have just finished together with Elena Volodina, University of Gothenburg (L2 profiles – Development of lexical and grammatical competences in immigrant Swedish), we have released a dataset with manual morphological annotation of lexemes which are present in materials aimed at learners of Swedish as a second language or produced by speakers of Swedish as a second language (CoDeRooMor). This resource has now been updated and will be released as part of the resource Swedish L2 profiles during 2023. Swedish L2 profiles is a resource where you can search for e.g. a word, a tense, a morpheme or a word formation pattern to see how it is used at different proficiency levels (according to CEFR, the Common European Framework of Reference for Languages, Council of Europe), both in course books for Swedish as a second language and in learner essays at different CEFR levels. The resources which we have created are part of Språkbanken Text (Sweden), and are or will be openly accessible.

I have also been involved in the development of an annotation tool in relation to research on Swedish (Legato) and in the use of the CALL platform Lärka for the teaching of syntactic functions, word classes and semantic roles. I have used Lärka in teaching grammar, which meant that I could give feedback to the developers from that perspective. Together with Volodina I have also used the platform to collect anonymous data to study what students often get right or wrong when they practise these categories, which is useful for research on metalinguistic knowledge and the ability to analyse Swedish grammatically.

Apart from research related to Kielipankki’s resources and areas of interest, I am also the current project manager of Finland Swedish Online (FSO), an online course in Finland Swedish created at the University of Helsinki based on an Icelandic model (Icelandic Online). FSO is currently part of SAFMORIL, one of the K-Centres within CLARIN. One of my aims has been for FSO to be not only something which supports the learning of a language but also an opportunity to study language acquisition, by tracing the development of learners in FSO if they grant access to that information. (Icelandic Online has done research on this based on their data.)


