Who are you?
I am a Professor in Speech and Language Processing and leader of the
What is your research topic?
For my PhD dissertation 25 years ago, I developed neural network algorithms to make automatic speech recognition more accurate and more robust. Training statistical models to recognize speech sounds requires large amounts of speech material in which the sounds are aligned with the corresponding text. At that time, very few such corpora were available, so the research team had to collect and process the data themselves. Once we had developed automatic methods for aligning speech and text, it became possible to utilize larger data sets such as audiobooks and radio and television news (e.g.,
However, sufficient accuracy cannot be reached just by modeling individual speech sounds, since they do not occur in isolation: in practice they are modified to fit the word and sentence context. Therefore, the speech recognizer must also be provided with a model of the language in question. On the basis of the language model, the recognizer decides which words and sentences the observed speech sound sequences represent. Training the language model requires huge quantities of text covering a wide variety of language use. For training the Finnish speech recognizer, we have used, e.g., the
When it is possible to automatically convert read-aloud speech and dictation into text with sufficient accuracy, this technology can be used in dictation services as well as in many other useful applications, such as transcribing planned speeches or respeaking presentations or television programmes. However, I am even more interested in the natural and spontaneous speech that we all use in our everyday conversations and storytelling. Since free speech is the most efficient means of communication for humans, it is of utmost importance, when developing Artificial Intelligence systems that are to communicate with people, to have an automatic speech recognizer that can understand this kind of speech.
The challenges in training models of conversational speech lie in the huge amount of variation in speech and in the limited availability of carefully transcribed resources of natural speech that are suited for training the recognizers. Since written language differs from spoken language in many ways, it is in practice necessary to create the text resources by transcribing speech first.
How is your research related to Kielipankki?
When training the first conversational speech recognizer, we used the
At the moment, we are preparing two new corpora of free speech for publication: an extension of the
Publications related to Kielipankki
Mikko Kurimo (1997). Using Self-Organizing Maps and Learning Vector Quantization for Mixture Density Hidden Markov Models. PhD thesis, Helsinki University of Technology, Espoo, Finland.
Mikko Kurimo, Vesa Siivola, Teemu Hirsimäki, Janne Pylkkönen, Reima Karhila, Peter Smit, Seppo Enarvi, André Mansikkaniemi, Matti Varjokallio, Ulpu Remes, Heikki Kallasjoki, Sami Keronen, Katri Leino, Ville T. Turunen & Kalle Palomäki (the authors' names are in no particular order, except that the project leader is listed first). 2000–2016. AaltoASR: an open-source automatic recognizer for unlimited-vocabulary continuous speech, Aalto University.
Seppo Enarvi & Mikko Kurimo (2013).
André Mansikkaniemi, Peter Smit & Mikko Kurimo (2017).
Juho Leinonen, Sami Virpioja & Mikko Kurimo (2021).
Peter Smit, Sami Virpioja & Mikko Kurimo (2021).
More information on the aforementioned resources in Kielipankki
(Lahjoita Puhetta)
The