New semi-automatically annotated corpus of Babylonian texts

The Centre of Excellence in Ancient Near Eastern Empires has published an annotated text corpus of some 6,000 Babylonian texts from the sixth and fifth centuries BCE. The texts have been semi-automatically lemmatized and part-of-speech tagged.

Lemmatized and part-of-speech-tagged texts are essential for many applications in computational Assyriology. However, the number of such texts from the Neo-Babylonian and Persian periods has been small in comparison to the wealth of texts from the Neo-Assyrian period, available on the Open Richly Annotated Cuneiform Corpus (Oracc). To remedy this situation, Team 1 at the Centre of Excellence in Ancient Near Eastern Empires has created a linguistically annotated corpus of some 6,000 Babylonian texts primarily from the sixth and fifth centuries BCE.

The texts have been semi-automatically lemmatized using Aleksi Sahala’s BabyLemmatizer. As the training data from Oracc did not cover all the lemmas and word forms attested in the corpus, the output of the lemmatizer was improved by manually supplying the words it had not seen in the training data. Data from the Prosobab database (Waerzeggers, Groß, et al. 2019) was used to lemmatize previously unattested personal names. With relatively few manual corrections, the accuracy of the lemmas is 94% and the accuracy of the part-of-speech tags is 96%.

The corpus consists of two sub-projects. The first contains all Babylonian cuneiform texts available on Achemenet in December 2020. These include texts from the Murašû archive, CT 55, YOS 7, Jursa 1999, and Strassmaier’s Camb, Cyr, and Dar. The second sub-project is titled “Babylonian Administrative and Legal Texts” (BALT). The core of this group consists of some 2,600 texts from the legacy data of the late János Everling, including texts published in AnOr 8, CT 49, GCCI 1–2, Nbk, TuM 2/3, UCP 9/1, UCP 9/3, UCP 9/12, VS 3, and YOS 17. In addition, BALT contains Babylonian letters published in Levavi 2018 and Hackl et al. 2011 and 2014, and legal texts from Sippar published in Waerzeggers 2014.

The annotated data is available in various formats that are suitable for both traditional and computational Assyriological research.

  1. CoNLL-U files are intended for computational analysis: Achemenet and BALT.
  2. Korp allows simple and complex searches in the data and presents the results as a keyword-in-context concordance list. Korp also provides metadata for each text and offers statistical information on the search results: Achemenet on Korp and BALT on Korp.
  3. Using lexical networks, the user can explore the semantic connections between the words that co-occur in the corpus. The use of lexical networks and Korp in tandem allows the user to study Akkadian semantics in an easy and intuitive way.
  4. Part of the BALT corpus (Everling’s legacy data, Levavi 2018, and Waerzeggers 2014) is also available as an Oracc project. BALT includes all the typical features of an Oracc project, but it is the first major Oracc project that has been created using a predominantly automatic process of annotation.
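
As an illustration of the kind of computational analysis the CoNLL-U files allow, the following minimal Python sketch reads CoNLL-U data and counts how often each lemma occurs with each part-of-speech tag. It relies only on the standard ten-column, tab-separated CoNLL-U layout. The sample sentences, the tag names (NOUN, ADP, VERB), and the helper names `row` and `read_conllu` are assumptions made for the example and are not taken from the corpus; the published files may use a different tag set.

```python
from collections import Counter


def row(idx, form, lemma, upos):
    """Build one CoNLL-U token line: 10 tab-separated columns, unused ones set to '_'."""
    return "\t".join([str(idx), form, lemma, upos] + ["_"] * 6)


# Invented sample data for illustration only (not taken from the corpus).
SAMPLE = "\n".join([
    "# sent_id = demo-1",
    row(1, "šarri", "šarru", "NOUN"),
    row(2, "ana", "ana", "ADP"),
    row(3, "bīti", "bītu", "NOUN"),
    "",
    "# sent_id = demo-2",
    row(1, "šarru", "šarru", "NOUN"),
    row(2, "iddin", "nadānu", "VERB"),
    "",
])


def read_conllu(text):
    """Yield each sentence as a list of (form, lemma, upos) tuples."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines carry sentence-level metadata; skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            # Skip lines that do not follow the 10-column layout.
            continue
        if "-" in cols[0] or "." in cols[0]:
            # Skip multiword-token ranges and empty nodes.
            continue
        sentence.append((cols[1], cols[2], cols[3]))
    if sentence:
        yield sentence


# Frequency of each (lemma, part-of-speech) pair across all sentences.
lemma_counts = Counter(
    (lemma, upos)
    for sent in read_conllu(SAMPLE)
    for _form, lemma, upos in sent
)

for (lemma, upos), count in lemma_counts.most_common():
    print(f"{lemma}\t{upos}\t{count}")
```

To apply the same counting to a downloaded file, the contents of the file can be read as UTF-8 text and passed to `read_conllu` in place of `SAMPLE`.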

The Neo-Babylonian text corpus was created at the Centre of Excellence in Ancient Near Eastern Empires, hosted by the University of Helsinki and funded by the Research Council of Finland (decision nos. 312051, 336673, and 352747). The corpus was created by Tero Alstola, Aleksi Sahala, Jonathan Valk, and Matthew Ong. Linda Leinonen, Matias Sakko, Senja Salmi, and Repekka Uotila assisted in cleaning the data and creating metadata.

The authors thank Johannes Hackl, Bojana Janković, Michael Jursa, Yuval Levavi, Martina Schmidl, Caroline Waerzeggers, and the Achemenet project for permission to use their transliterations. We also thank the NaBuCCo and Prosobab projects for providing us with other types of data. Finally, thanks are due to Niek Veldhuis and Heidi Jauhiainen for their help at various stages of the project.

Literature

Hackl, Johannes, Bojana Janković, and Michael Jursa. 2011. “Das Briefdossier des Šumu-ukīn.” KASKAL 8: 177–221.

Hackl, Johannes, Michael Jursa, and Martina Schmidl. 2014. Spätbabylonische Privatbriefe. With contributions by Klaus Wagensonner. Alter Orient und Altes Testament 414/1. Münster: Ugarit-Verlag.

Jursa, Michael. 1999. Das Archiv des Bēl-rēmanni. Uitgaven van het Nederlands Historisch-Archaeologisch Instituut te Istanbul 86. Istanbul: Nederlands Historisch-Archaeologisch Instituut.

Levavi, Yuval. 2018. Administrative Epistolography in the Formative Phase of the Neo-Babylonian Empire. Dubsar 3. Münster: Zaphon.

Waerzeggers, Caroline. 2014. Marduk-rēmanni: Local Networks and Imperial Politics in Achaemenid Babylonia. Orientalia Lovaniensia Analecta 233. Leuven: Peeters.

Waerzeggers, Caroline, Melanie Groß, et al. 2019. “Prosobab: Prosopography of Babylonia (c. 620–330 BCE).” Leiden University.