His current work includes, among other things, fine-tuning large language models that are optimized for Finnish and Nordic languages. These openly available LLMs have been created through successful academia-enterprise collaboration.
Who are you?
I am Aku Rouhe. For several years, I did research in the Aalto University Speech Recognition research group, and defended my doctoral thesis there this past February. After Aalto, I moved to Silo AI (now owned by AMD), where I work with large language models (LLMs) – I have moved from speech to text. My interest in language also carries over into my free time in the form of creative writing.
What is your research topic?
In my doctoral thesis, I compared end-to-end models with more traditional multi-model decomposed systems. In recent years, both academia and commercial deployments in speech recognition have largely moved to end-to-end models. However, my work showed that multi-model decomposed systems remain a competitive alternative, for instance in terms of recognition accuracy. Indeed, the main advantage of end-to-end models is probably their simplicity.
End-to-end models often require vast training resources. Thus, it was important for me to study end-to-end models applied to under-resourced languages as well.
My current work at Silo is on fine-tuning large language models such as Poro and Viking, which are optimized for Finnish and Nordic languages. These LLMs were developed in a collaborative research project between Silo and TurkuNLP.
How is your research related to Kielipankki?
End-to-end models hunger for data, so large corpora are needed. I was involved in compiling the Aalto Finnish Parliament ASR Corpus 2008-2020, which consists of Finnish Parliament plenary session recordings, and also in the Lahjoita Puhetta project, where volunteers donated their speech to produce the Puhelahjat corpus. I got to combine both of these large speech corpora in an article that was published when I was finalizing my PhD, at a time when I was involved with the LAREINA project. Nowadays, the Finnish speech recognition resources are respectable for a language spoken by so few.
Recent publications
Rouhe, A., Grósz, T., Kurimo, M. 2024. Principled Comparisons for End-to-End Speech Recognition: Attention vs Hybrid at the 1000-Hour Scale. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 623-638, 2024.
Virkkunen, A., Rouhe, A., Phan, N. et al. 2023. Finnish parliament ASR corpus. Lang Resources & Evaluation 57, 1645–1670 (2023).
Moisio, A., Porjazovski, D., Rouhe, A. et al. 2023. Lahjoita puhetta: a large-scale corpus of spoken Finnish with some benchmarks. Lang Resources & Evaluation 57, 1295–1327 (2023).
Rouhe, A., Virkkunen, A., Leinonen, J., Kurimo, M. 2022. Low Resource Comparison of Attention-based and Hybrid ASR Exploiting wav2vec 2.0. Proc. Interspeech 2022, 3543–3547.
Corpora
More information
- Speech Recognition research group at Aalto University
- LAREINA – Language Resource Infrastructure for AI (2023–25)
- Donate Speech (Lahjoita puhetta) campaign (2020–24)
- Poro and Viking language models (Hugging Face)
The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps the researchers of Social Sciences and Humanities to use, refine, preserve and share their language resources. The Language Bank of Finland is the collection of services that provides the language materials and tools for the research community.