In the Language Bank: Jörg Tiedemann

The Language Bank of Finland offers a comprehensive set of resources, tools and services in a high-performance environment. Jörg Tiedemann tells us about his work on resource development and OPUS, the world’s largest collection of openly available parallel translation datasets with a wide language coverage.

Who are you?

My name is Jörg Tiedemann and I lead the language technology research group at the University of Helsinki. We are part of the Department of Digital Humanities and our students have a study track in the BA in Languages and the MA in Linguistic Diversity and Digital Humanities. My own background is in computer science from my undergraduate studies in Germany and computational linguistics from my doctoral studies in Uppsala, Sweden. My appointment as professor of language technology in Helsinki started in 2015, and I have enjoyed the multidisciplinary environment in our team ever since.

What is your research topic?

My main research interests are connected with multilingual natural language processing from various perspectives. A lot of my work has been devoted to application-oriented research, in particular in the field of machine translation (MT). Resource development has been a big part of my life and, already during my PhD, a lot of my time went into the collection and alignment of large, multilingual parallel corpora. For more than two decades, I have maintained OPUS, the world’s largest collection of openly available parallel translation datasets with a wide language coverage. This collection has been a main source for the development of translation technology worldwide, and its language coverage is unique and invaluable for research on inclusive NLP.

In recent years, we have pushed our efforts into the extension of the OPUS ecosystem to cover all aspects of MT development from data to tools and deployment. Pre-trained translation models are available from OPUS-MT, and software packages are released for data manipulation, training, distilling, deploying and evaluating models. Web interfaces, applications, professional translation toolkits such as OPUS-CAT and dashboards support research, development and use, and our resources are among the most popular ones on the Hugging Face model and data hub.

Another line of research is related to basic research on multilingual and cross-lingual NLP. The ERC project FoTran focused on representation learning with massively multilingual data, and we investigated transfer learning capabilities, modularity and interpretability of large neural translation models. We also looked at uncertainty modeling in another research project and currently focus, among other things, on the efficiency of NLP in order to reduce the ever-growing carbon footprint of language technology (within the GreenNLP project).

Finally, our research group is also devoting time to the development of large language models as part of the European projects HPLT and OpenEuroLLM. Our contribution to those projects is mostly connected to multilinguality and evaluation, two very important and challenging topics in the field as a whole. Our goal is to push support for otherwise under-represented languages, to improve multilingual evaluation and to reduce the effect of so-called “hallucinations” of generative AI.

How is your research related to Kielipankki – the Language Bank of Finland?

Most of our research is data-intensive and heavily depends on data collections, empirical evaluation and iterative training of models with compute-heavy machine learning. Language resources are essential in this process and we are both providers and users of the Language Bank of Finland. Even though most of our work is focused on machine learning and model development, we are also very much interested in making our resources available for research in the humanities. Many of the datasets we curate are directly relevant to linguistic research or, for example, translation studies. Similarly, linguistic resources are essential for training, tuning and evaluating neural language models. Furthermore, such language models are becoming essential tools in the humanities as well, and their influence will steadily grow in linguistic studies, social sciences and various fields of traditional humanities.

Selected publications

Tiedemann, J., Aulamo, M., Bakshandaeva, D. et al. 2024. Democratizing neural machine translation with OPUS-MT. Language Resources and Evaluation 58, 713–755.

Mikko Aulamo, Nikolay Bogoychev, Shaoxiong Ji, Graeme Nail, Gema Ramírez-Sánchez, Jörg Tiedemann, Jelmer van der Linde, and Jaume Zaragoza. 2023. HPLT: High Performance Language Technologies. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, pages 517–518, Tampere, Finland. European Association for Machine Translation.

Jörg Tiedemann and Ona de Gibert. 2023. The OPUS-MT Dashboard – A Toolkit for a Systematic Evaluation of Open Machine Translation Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 315–327, Toronto, Canada. Association for Computational Linguistics.

Tiedemann, J. 2022. From open parallel corpora to public translation tools: The success story of OPUS. In E. Volodina, D. Dannélls, A. Berdicevskis, M. Forsberg & S. Virk (eds.), LIVE and LEARN: Festschrift in honor of Lars Borin. Research Reports from the Department of Swedish, Multilingualism, Language Technology, no. GU-ISS-2022-03, University of Göteborg, Göteborg, pp. 133–138.

Resources

Projects

The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps researchers in the social sciences and humanities to use, refine, preserve and share their language resources. The Language Bank of Finland is the collection of services that provides language materials and tools for the research community.