M.Sc. Jarkko Lagus defends his doctoral thesis "Transformations and document similarities in word embedding spaces" on Friday the 2nd of June 2023 at 13:00 in the University of Helsinki Physicum building, Auditorium E204 (Gustaf Hällströmin katu 2, 2nd floor). His opponent is Professor Filip Ginter (University of Turku) and the custos is Associate Professor Arto Klami (University of Helsinki). The defence will be held in English.
The thesis of Jarkko Lagus is a part of research done in the Department of Computer Science and in the Multi-source Probabilistic Inference group at the University of Helsinki. His supervisor has been Associate Professor Arto Klami (University of Helsinki).
Transformations and document similarities in word embedding spaces
Natural language processing (NLP) studies ways to automatically analyze and extract information from natural language data. It has applications everywhere from text generation to information retrieval. Most modern NLP methods function on top of representations that are often learned using complex embedding models and vast amounts of textual data. A key feature of such embedding models is the capability to learn generic representations that allow the transfer of information into a multitude of different language tasks. Once the representations are learned, only slight fine-tuning or a simple model is often needed in order to use them in novel tasks.
A major theme in NLP is document similarity measurement, which has applications in areas such as document retrieval, clustering, and classification. In this work, we build upon the idea of matrix representations created out of pre-existing word representations. We extend the existing works by focusing on efficient low-rank representations and introducing novel matrix-based metrics for document comparison. The main motivation for using matrix-based representations, instead of the more popular vector representations, is the capability to retain more information. We show that while matrix representations require more resources, dimensionality reduction methods can be used to make them relatively memory-efficient and computationally fast. We use both contextual and static embeddings to conduct the experiments on different kinds of document-level tasks, showing that the same methods work on both types of embeddings, with the static embeddings exhibiting greater benefits.
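As a rough illustration of the idea of low-rank matrix representations, the following sketch stacks word vectors into a document matrix, keeps only its leading singular directions, and compares two documents by the overlap of those subspaces. It is not the method of the thesis: the vocabulary is filled with random vectors instead of real pretrained embeddings, the names low_rank_basis, subspace_similarity and doc_matrix are hypothetical, and the principal-angle overlap is a generic subspace measure chosen only for demonstration.

```python
import numpy as np


def low_rank_basis(doc_matrix, rank):
    """Return the top-`rank` right singular vectors (rank x dim) of a document matrix."""
    # Rows are word vectors; the leading right singular vectors form an
    # orthonormal basis of the directions the document's words span most strongly.
    _, _, vt = np.linalg.svd(doc_matrix, full_matrices=False)
    return vt[:rank]


def subspace_similarity(basis_a, basis_b):
    """Mean squared cosine of the principal angles between two subspaces."""
    # The singular values of A @ B.T are the cosines of the principal angles.
    cosines = np.linalg.svd(basis_a @ basis_b.T, compute_uv=False)
    return float(np.mean(cosines ** 2))


# Assumption: random vectors stand in for real pretrained word embeddings.
rng = np.random.default_rng(0)
dim = 50
words = ["cat", "dog", "pet", "fur", "stock", "bank", "loan", "rate"]
vocab = {w: rng.normal(size=dim) for w in words}


def doc_matrix(tokens):
    """Stack the word vectors of a tokenized document into an (n_words x dim) matrix."""
    return np.stack([vocab[t] for t in tokens])


doc_a = doc_matrix(["cat", "dog", "pet", "fur"])
doc_b = doc_matrix(["dog", "cat", "fur", "pet"])  # same words, different order
doc_c = doc_matrix(["stock", "bank", "loan", "rate"])

rank = 2  # each document is stored as a (rank x dim) basis instead of (n_words x dim)
sim_ab = subspace_similarity(low_rank_basis(doc_a, rank), low_rank_basis(doc_b, rank))
sim_ac = subspace_similarity(low_rank_basis(doc_a, rank), low_rank_basis(doc_c, rank))
print(f"same words: {sim_ab:.3f}, different words: {sim_ac:.3f}")
```

In this toy setting the stored basis has a fixed size regardless of document length, which is the kind of memory saving that dimensionality reduction can offer for matrix representations.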
In addition to efficient matrix representations and metrics, we study manipulations and transformations on embedding spaces, introducing methods to extract sentiment and grammar information directly out of pretrained representations. Since word embedding spaces are often learned in such a way that linear relationships are mostly preserved, transformations in such spaces can offer simple arithmetic tools for analyzing and extracting information. We show how these linear relationships can be used to extract abstract information and how the usage of lightweight non-linear transformations allows the removal of unwanted biases, in the form of grammatical information, from the representations.
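To make the notion of simple arithmetic in an embedding space more concrete, the sketch below estimates a direction as the difference between the mean vectors of two word groups and scores new vectors by projecting onto it. This is only a toy under stated assumptions and not the procedure used in the thesis: the vectors are synthetic, built around a hypothetical hidden axis (true_axis), and the helper names toy_vector and sentiment_score are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 20

# Assumption: a hidden axis used only to generate synthetic "embeddings".
true_axis = np.zeros(dim)
true_axis[0] = 1.0


def toy_vector(polarity):
    """Synthetic word vector: a shift along the hidden axis plus Gaussian noise."""
    return polarity * 3.0 * true_axis + rng.normal(scale=0.5, size=dim)


positive = np.stack([toy_vector(+1) for _ in range(20)])
negative = np.stack([toy_vector(-1) for _ in range(20)])

# Difference of group means gives a single direction separating the two groups.
direction = positive.mean(axis=0) - negative.mean(axis=0)
direction /= np.linalg.norm(direction)


def sentiment_score(vec):
    """Signed projection of a vector onto the estimated direction."""
    return float(vec @ direction)


score_pos = sentiment_score(toy_vector(+1))
score_neg = sentiment_score(toy_vector(-1))
print(f"unseen positive: {score_pos:.2f}, unseen negative: {score_neg:.2f}")
```

With real pretrained embeddings the separation would depend on how well the relationship is actually preserved linearly, which is an empirical question rather than something this sketch demonstrates.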
The focus of this thesis is on the efficient usage of pretrained representations and on developing simple and efficient methods that can be integrated as parts of pre-existing processing pipelines. To evaluate the generalizability of the methods more broadly, Finnish, a morphologically rich language, is used as a counterpart to English.
Availability of the dissertation
An electronic version of the doctoral dissertation will be available on the e-thesis site of the University of Helsinki at http://urn.fi/URN:ISBN:978-951-51-9301-8.
Printed copies will be available on request from Jarkko Lagus: jarkko.lagus@helsinki.fi.