In the Language Bank: Aleksi Sahala

Kielipankki – The Language Bank of Finland is a service for researchers using language resources. Aleksi Sahala tells us about his research on the development and application of Natural Language Processing (NLP) methods for annotating and analyzing ancient text data.

Who are you?

I am Aleksi Sahala, a postdoctoral researcher in Assyriology and Language Technology. I am currently working at the University of Helsinki in the Academy of Finland-funded project “The Origins of Emesal”, where our goal is to use computational methods to investigate how Emesal, the only known language variety of Sumerian, came to be and evolved over time.

I did my master’s degree in Assyriology and Computational Linguistics, and in 2021 I finished my PhD thesis “Contributions to Computational Assyriology”. In 2022, I was a visiting scholar at the University of California, Berkeley, and in 2024 I will visit the University of Innsbruck in Austria. I have also worked in close co-operation with the Centre of Excellence in Ancient Near Eastern Empires at the University of Helsinki.

What is your research topic?

My research focuses on the development and application of NLP (Natural Language Processing) methods for annotating and analyzing ancient text data. My particular interest lies in the Mesopotamian cuneiform texts written in Sumerian (3200 BCE – 100 CE) and Akkadian (2500 BCE – 100 CE). Analysis of Sumerian and Akkadian texts is challenging not only because of data sparsity and the fragmentary nature of the primary sources, but also because of the complexity of the cuneiform writing system and of the inflectional morphology. In theory, most words can occur in several thousand different forms, each of which can also be spelled in several different ways.

My main focus has been the development of a pipeline that can linguistically annotate raw transliterations of cuneiform texts so that these texts can be used for data analysis and visualization. This allows for the analysis of thousands of transliterated texts simultaneously and, for example, the visualization and study of how different words, concepts or entities are related to each other on a larger scale. Although Assyriologists have digitized over 20,000 Akkadian and over 100,000 Sumerian texts in various text corpora, these texts have mostly been studied qualitatively by close reading. A more computational approach makes it easier to reveal larger patterns within specific groups of texts.

I have developed a finite-state morphology for Akkadian (BabyFST), as well as a language-independent neural lemmatizer and tagger with special support for cuneiform languages (BabyLemmatizer). In addition, I have built a word-embedding-based tool for analyzing semantic relationships of words in sparse and fragmentary data sets (PMI Embeddings).
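To give a rough idea of what PMI-based word association looks like, here is a minimal sketch in Python. It is not the implementation of the PMI Embeddings tool: it only computes positive pointwise mutual information (PPMI) scores from co-occurrence counts and leaves out the dimensionality-reduction step that turns such scores into dense embeddings. The function name `ppmi_scores`, the window size and the toy English "corpus" are all assumptions made for illustration.

```python
import math
from collections import Counter


def ppmi_scores(sentences, window=2):
    """Return {(word, context): PPMI} for tokenized sentences.

    Hypothetical helper for illustration only.
    """
    # Count ordered (word, context) pairs within a symmetric window.
    pair_counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[(word, tokens[j])] += 1

    # Marginal counts for words and contexts.
    total = sum(pair_counts.values())
    word_counts = Counter()
    context_counts = Counter()
    for (word, context), n in pair_counts.items():
        word_counts[word] += n
        context_counts[context] += n

    # PMI = log2( P(w, c) / (P(w) * P(c)) ); keep only positive values.
    scores = {}
    for (word, context), n in pair_counts.items():
        pmi = math.log2(n * total / (word_counts[word] * context_counts[context]))
        if pmi > 0:
            scores[(word, context)] = pmi
    return scores


# Toy data: invented English placeholders, not real cuneiform lemmas.
corpus = [
    ["king", "builds", "temple"],
    ["king", "rules", "city"],
    ["priest", "enters", "temple"],
]
scores = ppmi_scores(corpus)
for pair in [("king", "builds"), ("king", "temple")]:
    print(pair, round(scores[pair], 3))
```

In a real setting the tokens would be lemmas produced by an annotation pipeline, and the resulting word-by-context matrix would typically be reduced to a small number of dimensions before comparing words with each other.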

My current project focuses on Emesal, a liturgical variety of the Sumerian language, which is attested in writing only after Sumerian was no longer used as a vernacular. Although it is known that Emesal was used in liturgical contexts, such as lamentations, and occasionally to indicate the direct speech of goddesses and women, its origins and evolution are still widely debated. No Emesal text was written entirely in this language variety; the texts are in Sumerian, with Emesal words used here and there as keywords to indicate that the current line or passage should be read in this dialect. The rules behind this code-switching, if any ever existed, remain largely unknown. We hope that a larger-scale analysis of Emesal texts will reveal patterns that explain exactly what kinds of environments triggered the use of Emesal words, how this language variety was introduced into written documents, and how it evolved over its 2,000-year history.

How is your research related to Kielipankki?

Kielipankki has been co-operating with the Centre of Excellence in Ancient Near Eastern Empires by annotating cuneiform texts and publishing them in the Korp concordance service. My responsibilities have been collecting these data sets, converting them into a Korp-compatible format, and developing tools for annotating them and harmonizing them with the existing resources so that they can be used together efficiently for quantitative analysis.

Recently, we have been working on the harmonization, lemmatization and tagging of Achemenet, a collection of Neo-Babylonian administrative and legal documents.


Alstola, T., Zaia, S., Sahala, A., Jauhiainen, H., Svärd, S., & Lindén, K. (2019). Aššur and his friends: a statistical analysis of Neo-Assyrian texts. Journal of Cuneiform Studies, 71(1), 159–180.

Alstola, T., Jauhiainen, H., Svärd, S., Sahala, A., & Lindén, K. (2023). Digital Approaches to Analyzing and Translating Emotion: What Is Love?. In The Routledge Handbook of Emotions in the Ancient Near East. Taylor & Francis.

Bennet, E. & Sahala, A. (2023). Using Word Embeddings for Identifying Emotions Relating to the Body in a Neo-Assyrian Corpus. In Proceedings of the Ancient Natural Language Processing Workshop at RANLP 2023.

Ihalainen, P. & Sahala, A. (2020). Evolving Conceptualisations of Internationalism in the UK Parliament. Digital Histories, 199.

Luukko, M., Sahala, A., Hardwick, S., & Lindén, K. (2020). Akkadian treebank for early Neo-Assyrian royal inscriptions. In Proceedings of the 19th International Workshop on Treebanks and Linguistic Theories. The Association for Computational Linguistics.

Sahala, A. J. A. (2017). Johdatus sumerin kieleen. Suomen itämainen seura.

Sahala, A., Silfverberg, M., Arppe, A., & Lindén, K. (2020). BabyFST: Towards a finite-state based computational model of Ancient Babylonian. In Proceedings of the Twelfth Language Resources and Evaluation Conference (pp. 3886–3894).

Sahala, A., Silfverberg, M., Arppe, A., & Lindén, K. (2020). Automated phonological transcription of Akkadian cuneiform text. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020). European Language Resources Association (ELRA).

Sahala, A. (2021). Contributions to Computational Assyriology. PhD Thesis. University of Helsinki.

Sahala, A., & Töyräänvuori, J. (2022). Kirjoitustaidon kehittyminen. In Svärd, S. & Töyräänvuori, J. (eds.), Muinaisen Lähi-idän imperiumit. Kadonneiden suurvaltojen kukoistus ja tuho, pp. 49–69. Gaudeamus, Helsinki.

Sahala, A., & Svärd, S. (2022). Language technology approach to “seeing” in Akkadian. In The Routledge Handbook of the Senses in the Ancient Near East. Taylor & Francis.

Sahala, A., Alstola, T., Valk, J., & Lindén, K. (2023, June). Lemmatizing and POS-tagging Akkadian with BabyLemmatizer and Dictionary-Based Post-Correction. In Selected papers from the CLARIN Annual Conference 2022 (pp. 111–119).

Sahala, A. & Lindén, K. (2023). A Neural Pipeline for Lemmatizing and POS-tagging Cuneiform Languages. In Proceedings of the Ancient Natural Language Processing Workshop at RANLP 2023.

Svärd, S., Jauhiainen, H., Sahala, A., & Lindén, K. (2018). Semantic Domains in Akkadian Texts. CyberResearch on the Ancient Near East and Neighboring Regions: Case Studies on Archaeological Data, Objects, Texts, and Digital Archiving, 2, 224–256.

Svärd, S., Alstola, T., Jauhiainen, H., Sahala, A., & Lindén, K. (2020). Fear in Akkadian texts: New digital perspectives on lexical semantics. In The Expression of Emotions in Ancient Egypt and Mesopotamia (pp. 470–502). Brill.


  • BabyLemmatizer, an OpenNMT-based neural lemmatizer and tagger. Pretrained models are available for Ancient Greek, Latin and various cuneiform languages.
  • BabyFST, a finite-state morphology of Akkadian, specifically the Babylonian dialect.
  • PMI-Embeddings, a hyper-parametrized tool for creating PMI+SVD-based word embeddings from sparse or fragmentary data sets.


More information

The FIN-CLARIN consortium consists of a group of Finnish universities along with CSC – IT Center for Science and the Institute for the Languages of Finland (Kotus). FIN-CLARIN helps researchers in Finland to use, refine, preserve and share their language resources. The Language Bank of Finland is the collection of services that provides language materials and tools for the research community.