Initially, the environment was used to test how WhisperX, a transcription tool built on OpenAI's Whisper speech recognition model, performs on material containing atypical speakers. A key problem was WhisperX's tendency to remove speech disfluencies from the transcript, even though disfluencies are essential to our research. WhisperX also does not distinguish between different speakers in a conversation. Still, it proved to be a good and fast tool for raw speech transcription. Speech recognition models are typically developed for data with a single speaker and no overlapping speech, whereas our data is conversational, with two or more speakers who often talk over each other. For this reason we explored speaker separation tools, but this proved challenging.
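For illustration, a minimal sketch of the kind of WhisperX pipeline we tested is shown below; the file name, model size, and batch size are illustrative choices rather than our exact configuration.

```python
import whisperx

device = "cuda"  # or "cpu"
audio_file = "conversation_sample.wav"  # illustrative file name

# Load a Whisper model through WhisperX; "large-v2" is one common choice.
model = whisperx.load_model("large-v2", device, compute_type="float16")

# Transcribe the recording. Note that Whisper tends to normalize away
# disfluencies (filled pauses, restarts, repetitions), and the raw
# transcript carries no speaker labels.
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=16, language="fi")

for segment in result["segments"]:
    print(f"{segment['start']:.2f}-{segment['end']:.2f}: {segment['text']}")
```

Assigning speech to individual speakers would require a separate diarization step, for example with pyannote.audio; a minimal call looks roughly like the sketch below (the pretrained pipeline requires a Hugging Face access token, shown here as a placeholder).

```python
from pyannote.audio import Pipeline

# Pretrained diarization pipeline; access requires a Hugging Face token.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="hf_..."
)
diarization = pipeline("conversation_sample.wav")

# Print who speaks when; overlapping turns show up as overlapping spans.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s-{turn.end:.1f}s: {speaker}")
```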
The research got off to a good start in June, when four research assistants joined the project. Based on the spring tests, a wav2vec speech recognition model trained on Finnish speech material was chosen as the basis for the repair recognition model. The accuracy of repair recognition turned out to depend on both the quality and the quantity of the material, but the results were cautiously promising. Although a lot of annotated data already exists, adding more data seems to further improve the model. In the autumn, we began collecting additional data under controlled recording conditions, with a separate microphone for each speaker, which makes speaker separation easier. Annotating this data is fast compared to the earlier material, as the analyses from the beginning of the year gave us a clear picture of which features matter for the model.
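The report does not fix the exact model architecture, but one plausible setup is fine-tuning a pretrained Finnish wav2vec 2.0 checkpoint with a frame-classification head that tags repair segments. The sketch below assumes that setup; the checkpoint name and the binary label scheme are hypothetical.

```python
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2ForAudioFrameClassification

# Hypothetical checkpoint name; any wav2vec 2.0 model pretrained on
# Finnish speech could stand in here.
checkpoint = "some-org/wav2vec2-base-finnish"

feature_extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = Wav2Vec2ForAudioFrameClassification.from_pretrained(
    checkpoint,
    num_labels=2,  # assumption: binary frame labels, repair vs. no repair
)

# One five-second 16 kHz utterance (random noise as a stand-in for real audio).
waveform = torch.randn(16000 * 5)
inputs = feature_extractor(
    waveform.numpy(), sampling_rate=16000, return_tensors="pt"
)

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, frames, num_labels)

# Frame-level predictions: which ~20 ms frames the model tags as repair.
frame_labels = logits.argmax(dim=-1)
print(frame_labels.shape, frame_labels.sum().item(), "frames tagged as repair")
```

In practice the classification head would be trained on the annotated conversational data before predictions are meaningful; the sketch only shows the model wiring.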
In addition to the repair recognition model, the project has mapped a number of AI-based tools that can be utilized in future speech-language research.