Automating theory-driven text annotation with supervised machine learning

Andrey Indukaev (University of Helsinki)

Recent advances in machine learning are gradually being exported to the social sciences and humanities, and the analysis of large volumes of textual data is one of the important areas of application for these methods. In this talk, I present the use of supervised machine learning for scaling up a theory-driven annotation of texts. The annotation is carried out according to "Justification Theory" by Luc Boltanski and Laurent Thévenot, an influential social theory providing a typology of socially shared perspectives on value and worth. A BERT model is trained on multi-class, multi-label annotations produced by expert annotators on texts scraped from a petition platform. In this talk, I primarily focus on methodological issues revolving around the question: what can state-of-the-art supervised machine learning bring to research characterized by an interpretative approach? In this kind of research, the researcher's analytical categories and research objects are constructed and revised in light of the empirical examination of the webs of meaning in which the studied phenomenon is embedded. Some researchers claim that unsupervised machine learning has a particular affinity to this type of research. I suggest, however, that the supervised machine learning workflow could be adjusted to fit the specificities of interpretative research. The first adjustment is to give special attention to an essential but often overlooked activity: the production of training data. I suggest that this process not only relies heavily on interpretation, but may also lead to advances in the interpretative analysis of data, i.e. the update of the analytical categories. In particular, measures of inter-annotator agreement could be used not only to assess the quality of annotated data, but also to reveal problematic annotation categories and, as a consequence, weak points in the theories that inform category selection.
Second, I suggest that the results of automated annotation can provide insights that go beyond merely increasing the scale of interpretative analysis.

Presentation slides

Aalto HELDIG DH pizza seminar on Friday 20 November 2020 at 12.00 (Zoom)