In the Language Bank: Sampo Pyysalo

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Sampo Pyysalo tells us about his research on natural language processing. Openly available large language models are necessary for developing ChatGPT-like tools for smaller languages such as Finnish as well.

Who are you?

I’m Sampo Pyysalo, University Research Fellow at the University of Turku.

What is your research topic?

My research is on machine learning approaches to natural language processing, with a particular focus on processing Finnish text and analyzing scientific literature in the biomedical domain. A lot of my recent work revolves around training large neural network models, including general “foundation” models such as FinBERT and FinGPT as well as task-specific models such as a named entity recognition system for Finnish. I also work on data, both compiling raw text resources for the unsupervised training of foundation models and running manual annotation efforts to create corpora for supervised training.

Large neural language models are central to a lot of state-of-the-art natural language processing and are the basis for tools such as ChatGPT, but most such models focus on English, and many of the best models are not publicly available. We believe that openly available Finnish models such as FinBERT and FinGPT are necessary to enable the creation of tools for processing Finnish with capabilities comparable to those of the tools available for English.

How is your research related to Kielipankki?

Creating large language models from scratch requires billions of words of text, and Finnish text collections of this size are not readily available. To compile sufficiently large corpora for language model training, we have drawn on various sources, including web crawls and several resources available through Kielipankki. We also distribute resources created by TurkuNLP through Kielipankki, among other channels.

In the near future, we hope that we will be able to provide access to the full text resources used to create our models for research purposes through Kielipankki to improve the replicability of our work and to make it easier for future efforts to create models for Finnish.

Publications

J. Luoma & LH. Chang & F. Ginter & S. Pyysalo. 2021. . In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), pages 135–144, Reykjavik, Iceland (Online). Linköping University Electronic Press, Sweden.

A. Virtanen & J. Kanerva & R. Ilo & J. Luoma & J. Luotolahti & T. Salakoski & F. Ginter & S. Pyysalo. 2019. . In CoRR, abs/1912.07076.

Corpora

(data available via GitHub)
(data available via GitHub)
The resource group in Kielipankki
The resource group in Kielipankki
resource group in Kielipankki

More information

FinBERT, a version of Google’s BERT deep transfer learning model for Finnish, developed by the TurkuNLP Group
FinGPT, generative GPT-3-like models for Finnish
, a Named Entity Recognition system for Finnish (based on FinBERT and a new NER annotation layer of the UD_Finnish-TDT treebank)

The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps researchers in Finland to use, refine, preserve and share their language resources. Kielipankki is the collection of services that provides language materials and tools for the research community.

18.9.2023

Sampo Pyysalo
