Eliel Soisalon-Soininen defends his PhD thesis on Neural Transfer Learning for Truly Low-Resource Natural Language Processing

On Thursday the 6th of July 2023, M.Sc. (Tech) Eliel Soisalon-Soininen defends his PhD thesis on Neural Transfer Learning for Truly Low-Resource Natural Language Processing. The thesis is related to research done in the Department of Computer Science and in the Discovery Research group.

M.Sc. (Tech) Eliel Soisalon-Soininen defends his doctoral thesis "Neural Transfer Learning for Truly Low-Resource Natural Language Processing" on Thursday the 6th of July 2023 at 12 o'clock in the Porthania building of the University of Helsinki, Auditorium PIII (Yliopistonkatu 3, 1st floor). His opponent is Professor Liviu Dinu (University of Bucharest, Romania) and the custos is Professor Hannu Toivonen (University of Helsinki). The defence will be held in English.

The thesis of Eliel Soisalon-Soininen is part of research conducted in the Department of Computer Science and in the Discovery Research group at the University of Helsinki. His supervisors have been Professor Hannu Toivonen (University of Helsinki) and Senior AI Scientist Mark Granroth-Wilding (Silo AI).

Neural Transfer Learning for Truly Low-Resource Natural Language Processing

The vast majority of the world's languages are low-resource, lacking the data resources required for advanced natural language processing (NLP) based on data-intensive deep learning. Furthermore, annotated training data can be insufficient in some domains even within resource-rich languages. Low-resource NLP is crucial for both the inclusion of language communities in the NLP sphere and the extension of applications over a wider range of domains. The objective of this thesis is to contribute to this long-term goal, especially with regard to truly low-resource languages and domains.

We address truly low-resource NLP in the context of two tasks. First, we consider the low-level task of cognate identification, since cognates are useful for the cross-lingual transfer of many lower-level tasks into new languages. Second, we examine the high-level task of document planning, a fundamental task in data-to-text natural language generation (NLG), where many domains are low-resource. Thus, domain-independent document planning supports the transfer of NLG across domains. Following recent encouraging results, we propose neural network models for these tasks, using transfer learning methods in three low-resource scenarios.

We divide our high-level objective into three research tasks characterised by different resource conditions. In our first research task, we address cognate identification in endangered Sami languages of the Uralic family, given scarce labelled training data. We propose a Siamese convolutional neural network (S-CNN) and a support vector machine (SVM), which we pre-train on unrelated Indo-European data because the Sami languages lack high-resource close relatives. We find that S-CNN performs best at direct transfer to Sami, and adapts quickly when fine-tuned on a small amount of Sami data. In our second research task, we address a scenario in which only unlabelled data is available for adapting S-CNN from Indo-European to Uralic data. We propose both discriminative adversarial networks and pre-trained symbol embeddings, finding that adversarial adaptation outperforms an unadapted model, while symbol embeddings are beneficial when languages have disparate orthographies.

In our third research task, we address document planning in data-to-text generation of news, in a domain with no annotated training data whatsoever. We propose distant supervision, automatically constructing labelled data from a news corpus, and train a neural model for sentence ordering, a task related to document planning. We examine Siamese, positional, and pointer networks, and find that a variant of S-CNN results in generation with higher human-perceived quality than heuristic baselines. 
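
As a rough illustration of the distant supervision idea, the sketch below builds labelled sentence-ordering examples from already-published documents without any manual annotation. The pairwise formulation, the function name build_ordering_pairs, and the toy corpus are assumptions made for this example only; the thesis may construct its training data differently.

```python
import random
from typing import List, Tuple

# Hypothetical helper, not taken from the thesis: each document is a list of
# sentences in their published order, and that order is the only supervision.
def build_ordering_pairs(
    documents: List[List[str]], seed: int = 0
) -> List[Tuple[str, str, int]]:
    """Return (sentence_a, sentence_b, label) triples.

    label is 1 if sentence_a directly precedes sentence_b in the source
    document, and 0 if the adjacent pair has been swapped.
    """
    rng = random.Random(seed)
    pairs = []
    for sentences in documents:
        for i in range(len(sentences) - 1):
            first, second = sentences[i], sentences[i + 1]
            if rng.random() < 0.5:
                pairs.append((first, second, 1))  # original order kept
            else:
                pairs.append((second, first, 0))  # order swapped
    return pairs


# Toy corpus standing in for a news corpus (an assumption for illustration).
toy_corpus = [
    ["The council met on Monday.", "It approved the budget.", "Work starts in May."],
    ["Prices rose in June.", "Analysts expect a slowdown."],
]
training_pairs = build_ordering_pairs(toy_corpus)
```

Pairs of this kind could then be used to train a binary classifier, such as a Siamese network, to judge which of two sentences should come first.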

The contributions of this thesis include addressing novel low-resource scenarios for two NLP tasks in which the potential of deep learning has not been fully explored. We propose novel approaches to these tasks using neural models in combination with transfer learning, and our experiments evaluate their performance against baselines. Finally, although we acknowledge that rule-based methods and heuristics might still be superior to deep learning in truly low-resource scenarios, our approaches are more language- and domain-independent, supporting a wider coverage of NLP across languages and domains.

Availability of the dissertation

An electronic version of the doctoral dissertation will be available on the e-thesis site of the University of Helsinki at http://urn.fi/URN:ISBN:978-951-51-9342-1.

Printed copies will be available on request from Eliel Soisalon-Soininen: eliel.soisalon-soininen@helsinki.fi.