Machine learns to assess speech with the help of 18 human raters

Over 2000 Swedish speech samples were assessed in December 2020. The data helps the multidisciplinary DigiTala research project develop automatic assessment of non-native speech.

The main goal of the DigiTala project is to develop a digital tool for assessing spoken language skills automatically. To be of help to humans, the algorithm needs relevant training data. Assessments from human raters are an important part of this data.

In December, 18 Swedish language experts assessed more than 2000 speech samples derived from short reaction tasks. The speech data was collected in the first phase of the DigiTala project in 2015–2017. The speakers were upper secondary school students learning Swedish as a second national language.

Experienced experts rated speech samples in Moodle

Most experts were recruited from among the Swedish raters of YKI. In addition, some Swedish teachers as well as four Swedish experts of the DigiTala project participated in the assessments.

Each speech sample was assessed by at least two different raters. This way all the assessments were linked together. The systematic overlap makes it possible to analyse inter-rater agreement and the overall quality of the assessments. Every recruited rater assessed approximately 280 speech samples.
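As a rough illustration of what such an overlap allows, the sketch below computes Cohen's kappa for every pair of raters over the samples they both rated. It is not the DigiTala project's actual analysis: the rater names, sample IDs, and levels are made up, and the functions `cohen_kappa` and `pairwise_kappa` are hypothetical helpers written for this example.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of categorical ratings."""
    n = len(a)
    # Share of samples on which the two raters chose the same level.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each rater's own level frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[level] * count_b[level] for level in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one and the same level throughout
    return (observed - expected) / (1 - expected)


def pairwise_kappa(ratings):
    """ratings maps rater -> {sample_id: level}; returns kappa per rater pair."""
    result = {}
    for r1, r2 in combinations(sorted(ratings), 2):
        shared = sorted(set(ratings[r1]) & set(ratings[r2]))
        if len(shared) < 2:
            continue  # too little overlap to compare this pair
        result[(r1, r2)] = cohen_kappa(
            [ratings[r1][s] for s in shared],
            [ratings[r2][s] for s in shared],
        )
    return result


# Made-up ratings on the holistic scale, for illustration only.
ratings = {
    "rater_1": {"s1": "A1", "s2": "A2", "s3": "B1", "s4": "B1"},
    "rater_2": {"s1": "A1", "s2": "A2", "s3": "B1", "s4": "B2"},
    "rater_3": {"s3": "B1", "s4": "B1", "s5": "A2"},
}

for pair, kappa in pairwise_kappa(ratings).items():
    print(pair, round(kappa, 2))
```

Plain kappa treats the proficiency levels as unordered categories. For an ordered scale such as A1–B2, a weighted variant that penalises larger disagreements more heavily is often a better fit.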

The raters assessed the speech samples in an online environment created on a Moodle course platform. The platform included assessment guidelines and criteria as well as examples from different proficiency levels (A1–B2). These so-called reference samples help the assessors, who can compare each speech sample with them.

Holistic & analytic ratings for each speech sample

The rating criteria include a holistic proficiency scale and five analytic scales. The proficiency scale was developed on the basis of the previous upper secondary school curriculum (NCC2003). The descriptions of the curriculum are in line with the more technical focus of the DigiTala project, while the current curriculum (LOPS2015) focuses on interaction. Both curricula are Finnish applications of the Common European Framework of Reference.

The analytic rating criteria are based on the goals of our research project. The focus of the project is on assessing the pronunciation, vocabulary, grammar, and fluency of L2 speech.

Rater training in Zoom

Before the Moodle assessments, an hour-long training session was organized in Zoom. During the session the raters familiarized themselves with the assessment guidelines and criteria and assessed a few example samples anonymously. The results of these example assessments were then discussed. The purpose of the training was to increase agreement between the raters and thus the reliability of the assessments.

After the Moodle assessments, the raters answered a questionnaire. Their answers provide important information that can be used to improve the suitability of the test tasks and rating criteria.

More information on human assessments: anna.vonzansen@helsinki.fi

15.3.2021

Heini Kallio, Anna von Zansen

