In the Language Bank: Aleksi Sahala

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Aleksi Sahala tells us about his research on the development and application of Natural Language Processing (NLP) methods for annotating and analyzing ancient text data.

Who are you?

I am Aleksi Sahala, a postdoctoral researcher in Assyriology and Language Technology. I am currently working at the University of Helsinki in an Academy of Finland-funded project, “The Origins of Emesal”, where our goal is to use computational methods to investigate how Emesal, the only known language variety of Sumerian, came to be and evolved over time.

I did my master’s degree in Assyriology and Computational Linguistics, and in 2021 I finished my PhD thesis “”. In 2022, I was a visiting scholar at the University of California, Berkeley, and in 2024 I will visit the University of Innsbruck in Austria. I have also worked in close co-operation with the at the University of Helsinki.

What is your research topic?

My research focuses on the development and application of NLP (Natural Language Processing) methods for annotating and analyzing ancient text data. My particular interest lies in the Mesopotamian cuneiform texts written in Sumerian (3200 BCE – 100 CE) and Akkadian (2500 BCE – 100 CE). Analysis of Sumerian and Akkadian texts is challenging not only because of data sparsity and the fragmentary nature of the primary sources, but also because of the complexity of the cuneiform writing system and of inflectional morphology. In theory, most words can occur in several thousand different forms, each of which can also be spelled in several different ways.

My main focus has been the development of a pipeline that can linguistically annotate raw transliterations of cuneiform texts so that these texts can be used for data analysis and visualization. This allows for the analysis of thousands of transliterated texts simultaneously and, for example, the visualization and study of how different words, concepts or entities are related to each other on a larger scale. Although Assyriologists have digitized over 20,000 Akkadian and over 100,000 Sumerian texts in various text corpora, these texts have mostly been studied qualitatively by close reading. A more computational approach makes it easier to reveal larger patterns within specific groups of texts.

I have developed a finite-state morphology for Akkadian (), as well as a language-independent neural lemmatizer and tagger with special support for cuneiform languages (). In addition, I have built a word-embedding-based tool for analyzing semantic relationships of words in sparse and fragmentary data sets ().

My current project focuses on Emesal, a liturgical variant of the Sumerian language, which is only attested in writing after Sumerian was no longer used as a vernacular. Although it is known that Emesal was used in liturgical contexts, such as lamentations, and occasionally to indicate the direct speech of goddesses and women, its origins and evolution are still widely debated. None of the Emesal texts was written entirely in this language variant. Instead, they were written in Sumerian, with Emesal words used here and there as keywords to indicate that the current line or passage should be read in this dialect. The rules behind this code-switching, if any ever existed, remain largely unknown. We hope that a larger-scale analysis of Emesal texts could reveal patterns that explain exactly what kinds of environments triggered the use of Emesal words, how the use of this language variant was introduced in written documents, and how it evolved over its 2,000-year history.

How is your research related to Kielipankki?

Kielipankki has been co-operating with the Centre of Excellence in Ancient Near Eastern Empires by in Korp concordance service. My responsibilities have been collecting these data sets, converting them into a Korp-compatible format, and developing tools for annotating and harmonizing them with the existing resources in such a way that they can be used together efficiently for quantitative analysis.

Recently, we have been working on the harmonization, lemmatization and tagging of , a collection of Neo-Babylonian administrative and legal documents.

Publications

Alstola, T., Zaia, S., Sahala, A., Jauhiainen, H., Svärd, S., & Lindén, K. (2019). . Journal of Cuneiform Studies, 71(1), 159–180.

Alstola, T., Jauhiainen, H., Svärd, S., Sahala, A., & Lindén, K. (2023). . In The Routledge Handbook of Emotions in the Ancient Near East. Taylor & Francis.

Bennet, E. & Sahala, A. (2023). . In Proceedings of the Ancient Natural Language Processing Workshop at RANLP 2023.

Ihalainen, P. & Sahala, A. (2020). Evolving Conceptualisations of Internationalism in the UK Parliament. Digital Histories, 199.

Luukko, M., Sahala, A., Hardwick, S., & Lindén, K. (2020). . In Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories. The Association for Computational Linguistics.

Sahala, A. J. A. (2017). Johdatus sumerin kieleen. Suomen itämainen seura.

Sahala, A., Silfverberg, M., Arppe, A., & Lindén, K. (2020). . In Proceedings of the Twelfth Language Resources and Evaluation Conference (pp. 3886–3894).

Sahala, A., Silfverberg, M., Arppe, A., & Lindén, K. (2020). . In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020). European Language Resources Association (ELRA).

Sahala, A. (2021). . PhD Thesis. University of Helsinki.

Sahala, A., & Töyräänvuori, J. (2022). Kirjoitustaidon kehittyminen. In Svärd, S. & Töyräänvuori, J. (eds.), Muinaisen Lähi-idän imperiumit. Kadonneiden suurvaltojen kukoistus ja tuho, pp. 49–69. Gaudeamus, Helsinki.

Sahala, A., & Svärd, S. (2022). . In The Routledge Handbook of the Senses in the Ancient Near East. Taylor & Francis.

Sahala, A., Alstola, T., Valk, J., & Lindén, K. (2023, June). . In Selected papers from the CLARIN Annual Conference 2022 (pp. 111–119).

Sahala, A. & Lindén, K. (2023). A Neural Pipeline for Lemmatizing and POS-tagging Cuneiform Languages. In Proceedings of the Ancient Natural Language Processing Workshop at RANLP 2023.

Svärd, S., Jauhiainen, H., Sahala, A., & Lindén, K. (2018). . Case Studies on Archaeological Data, Objects, Texts, and Digital Archiving, 2, 224–256.

Svärd, S., Alstola, T., Jauhiainen, H., Sahala, A., & Lindén, K. (2020). . In The Expression of Emotions in Ancient Egypt and Mesopotamia (pp. 470–502). Brill.

Tools

, OpenNMT-based neural lemmatizer and tagger, available for Ancient Greek, Latin and various cuneiform languages.
, Finite-state morphology of Akkadian, specifically Babylonian dialect.
, Hyper-parametrized tool for creating PMI+SVD based word embeddings from sparse or fragmentary data sets.

Corpora

More information

The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps researchers in Finland use, refine, preserve and share their language resources. Kielipankki – The Language Bank of Finland is the collection of services that provides language materials and tools for the research community.

13.11.2023

Aleksi Sahala
