What is it about?
Word embeddings, also known as word vector representations, have recently become very popular in Natural Language Processing. A particularly interesting feature of word embeddings is their ability to encode information about word similarity. For example, when calculating the nearest neighbors of the vector for BANANA, one would expect to find the vectors of several other fruits. Such a powerful and completely unsupervised method offers an interesting opportunity to study the lexicon of now-extinct ancient languages, such as Akkadian, the language of the Babylonians and Assyrians documented in Mesopotamian cuneiform sources from 2400 BCE to the first century CE.
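The nearest-neighbor idea can be illustrated with a minimal sketch. The three-dimensional vectors below are invented purely for illustration (real embeddings are learned from a corpus and typically have hundreds of dimensions), and the function names are hypothetical, not part of any library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbors(word, vectors, k=2):
    """Return the k words whose vectors are most similar to `word`."""
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    # Highest cosine similarity first
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

# Toy vectors, made up for this example (assumption, not real data)
toy_vectors = {
    "banana": [0.90, 0.10, 0.00],
    "apple":  [0.80, 0.20, 0.10],
    "pear":   [0.85, 0.15, 0.05],
    "king":   [0.00, 0.90, 0.40],
    "river":  [0.10, 0.00, 0.95],
}

neighbors = nearest_neighbors("banana", toy_vectors, k=2)
for neighbor, score in neighbors:
    print(f"{neighbor}\t{score:.3f}")
```

With these toy values, the two closest neighbors of "banana" are the other two fruits, while "king" and "river" score far lower.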
The project undertaken here aims to build a gold standard for evaluating word embeddings for the Akkadian language as part of a six-month online workshop organized by Ludwig-Maximilians-Universität München and the University of California, Berkeley. Several gold standards have been published for modern languages, often by translating previously published English gold standards (like WS353) into other languages. However, in the case of Akkadian, translating an existing gold standard is out of the question due to the chronological and cultural gap between the ancient Mesopotamian civilizations and the modern world. For this reason, we are creating a new gold standard consisting of 300 word pairs, manually ranked according to their similarity by seven experts of the Akkadian language working independently.
Why is it important?
Word embeddings can be produced with several different neural methods (fastText, Word2vec, GloVe, etc.) and count-based methods (word association measures combined with matrix factorization). However, it is very difficult to determine which methods are the most suitable for a given data set, and which hyperparameters within a single model produce the most reliable results.
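To make the count-based family more concrete, here is a minimal sketch of one common word association measure, positive pointwise mutual information (PPMI), computed from windowed co-occurrence counts. The choice of PPMI, the window size, and the toy English sentences are all assumptions made for illustration; the text above does not specify which measure is used. The matrix factorization step (for example, a truncated SVD of the resulting matrix) is omitted here.

```python
import math
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count (word, context) pairs within a symmetric window."""
    pair_counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    pair_counts[(word, tokens[j])] += 1
    return pair_counts

def ppmi(pair_counts):
    """Positive PMI for every observed (word, context) pair.

    PMI(w, c) = log2( P(w, c) / (P(w) * P(c)) ); pairs with a
    non-positive PMI are dropped, i.e. treated as zero.
    """
    total = sum(pair_counts.values())
    word_totals = Counter()
    context_totals = Counter()
    for (word, context), n in pair_counts.items():
        word_totals[word] += n
        context_totals[context] += n

    scores = {}
    for (word, context), n in pair_counts.items():
        pmi = math.log2(n * total / (word_totals[word] * context_totals[context]))
        if pmi > 0:
            scores[(word, context)] = pmi
    return scores

# Toy corpus, invented for this example (assumption, not real data)
toy_sentences = [
    ["king", "builds", "temple"],
    ["king", "builds", "palace"],
    ["river", "floods", "field"],
]

scores = ppmi(cooccurrence_counts(toy_sentences, window=1))
```

Each row of the resulting sparse word-by-context matrix can already serve as a high-dimensional word vector; factorization then compresses these rows into dense, low-dimensional embeddings.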
Building a gold standard for Akkadian allows us to measure the performance of several models. This is important not only for finding the best of the already published models, but also for improving them by taking into account certain peculiarities of Akkadian texts. As word embeddings have many uses in machine learning applications (e.g. machine translation, syntactic parsing, disambiguation of lemmatization/POS-tagging), the ability to produce state-of-the-art word embeddings will likely be a valuable asset for future research in Computational Assyriology.
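Word-similarity gold standards are commonly used by correlating a model's similarity scores with the human scores for the same word pairs, typically with Spearman's rank correlation. The sketch below assumes that convention; the text above does not state which metric the project uses, and the human and model scores are invented placeholders rather than project data.

```python
import math

def average_ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical scores for four word pairs (assumption, not real data)
human_scores = [9.0, 7.5, 3.0, 1.0]    # e.g. averaged expert ratings
model_a_sims = [0.80, 0.60, 0.40, 0.10]  # same ordering as the humans
model_b_sims = [0.60, 0.80, 0.10, 0.40]  # partly different ordering

rho_a = spearman(human_scores, model_a_sims)
rho_b = spearman(human_scores, model_b_sims)
print(f"model A: {rho_a:.2f}, model B: {rho_b:.2f}")
```

Under this setup, the model whose ranking agrees more closely with the human ranking receives the higher correlation, which gives a single number for comparing models or hyperparameter settings.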
Preliminary evaluation of the models has raised some interesting philological and technical research questions, which arise from the agreement or disagreement between the models and the human reviewers. Especially intriguing are those pairs of words that the human reviewers perceived as nearly synonymous, but which the models considered very dissimilar. Where do these contradictions come from? Are the models missing something, or do humans perceive connections based on out-of-domain knowledge? What about the cases where the humans did not see a connection but the models did?
We hope to answer these questions, as well as to publish the gold standard, by the end of 2020, when the LMU-UCB workshop concludes. The people involved in the creation of the gold standard are Aleksi Sahala (Helsinki), David Bamman (Berkeley), Niek Veldhuis (Berkeley), Jamie Novotny (Munich), Poppy Tushingham (Munich), Giulia Lentini (Munich), Beatrice Baragli (Munich), Joanna Töyräänvuori (Helsinki), Johannes Bach (Helsinki) and Saana Svärd (Helsinki).