Elaine Zosa defends her PhD thesis on Analysis of News Media with Topic Models
On Monday the 6th of March 2023, M.Sc. Elaine Zosa defends her PhD thesis on Analysis of News Media with Topic Models. The thesis is related to research done in the Department of Computer Science and in the Discovery Research group.

M.Sc. Elaine Zosa defends her doctoral thesis Analysis of News Media with Topic Models on Monday the 6th of March 2023 at 13:00 in the University of Helsinki Language Centre building, Festive Hall (Fabianinkatu 26, 3rd floor). Her opponent is Professor Krista Lagus (University of Helsinki) and the custos is Professor Hannu Toivonen (University of Helsinki). The defence will be held in English.

The thesis of Elaine Zosa is a part of research done in the Department of Computer Science and in the Discovery Research group at the University of Helsinki. Her supervisors have been Professor Hannu Toivonen (University of Helsinki) and Senior AI Scientist Mark Granroth-Wilding (Silo AI).

Analysis of News Media with Topic Models

The news is a detailed record of events, issues, and opinions published daily in every country around the world. In addition to daily news content, many national libraries are digitising their historical newspaper collections. This wealth of material presents many opportunities for scholars in the humanities and social sciences, but the sheer volume of these datasets makes them difficult to organise and explore. Thus, there is a need for computational methods to make this data more accessible to a wide range of users. Topic modelling, a method to extract latent themes from a document collection, is particularly suitable for this task because it produces interpretable outputs, requires minimal supervision, and is designed with language-independent assumptions.

In this thesis we develop and apply topic models to news collections to facilitate the analysis of news media. In particular, we focus on three aspects of news: (1) it is multilingual, being published in different languages around the world; (2) it is diachronic, changing over time; and (3) it is multimodal, containing images and other types of data in addition to text.

We propose two novel topic models for multilingual, diachronic, and multimodal datasets. The first is a topic model designed to model the evolution of topics across multiple languages. The second is a neural topic model for multilingual and multimodal data. We also propose a multilingual topic labelling method that maps news topics to concepts in a language-agnostic news ontology.

We then investigate the use of topic models in the analysis of historical news. First, we use topic models to trace the evolution of societal discourses in a collection of historical newspapers that covers a long period of time. We find that while topic models are fairly reliable at grasping the core of certain discourses, design choices in the models can lead them to obscure discontinuities in the data. Therefore, humanistic interpretation still plays a very important role when using topic models to uncover historical discourses. Second, we evaluate the performance of embedding-based topic models in news collections with optical character recognition (OCR) noise resulting from the digitisation of historical newspapers. We find that incorporating word embeddings into topic models alleviates the negative impact of OCR noise, especially as noise levels increase.

Availability of the dissertation

An electronic version of the doctoral dissertation is available on the e-thesis site of the University of Helsinki at http://urn.fi/URN:ISBN:978-951-51-8842-7.

Printed copies will be available on request from Elaine Zosa: elaine.zosa@helsinki.fi.