From Sherds of Pottery to Machine-readable Hieroglyphic Texts

My research project, From Sherds of Pottery to Open Egyptological Data, aims to promote the digital study of ancient Egyptian hieroglyphic texts. I am an Egyptologist by training and a member of ANEE. I previously worked with ANEE Assyriologists on the study of Akkadian texts using digital methods, where I was responsible for pre- and post-processing the text data and for visualizing the analysis results. My current project started in 2021 with funding from the Finnish Cultural Foundation. Since the beginning of 2022, I have been able to focus on the project thanks to a three-year grant from the Kone Foundation.

Using digital methods to study texts requires that the texts be in a machine-readable format. Assyriologists have several corpora of machine-readable cuneiform texts at their disposal. Several ANEE researchers use texts that can be freely downloaded to one's own computer from the Open Richly Annotated Cuneiform Corpus online service. There is no similar service in Egyptology, although certain online portals, which are themselves built on corpora of machine-readable texts, can be used to search for phrases containing particular words.

Hieroglyphic texts are more complex in structure than texts written in many other writing systems. Hieroglyphic signs usually form groups; for example, smaller signs are placed above or below an oblong one (figure 1), and sometimes one sign is even drawn on top of another. Despite this complexity, Egyptologists have long been producing machine-readable hieroglyphic texts using special hieroglyphic text editors. With these programs, the hieroglyphs can be arranged as they are in the original text, and an image of the hieroglyphic text can be produced for use in, for example, a book. The hieroglyphs are entered using codes based on a standard classification of hieroglyphs, Gardiner's sign list. The signs are classified into lettered categories according to what they represent, and each sign has a number within its category (figure 2). The encoding produced with these codes is machine-readable, but since it is stored in a binary file, it cannot be read without a program built for that purpose. It hasn't even occurred to Egyptologists to publish the encoded texts, because they have no other use for the codes: they read the hieroglyphs directly as transliterated words.
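To illustrate how such sign codes are structured, here is a minimal Python sketch that splits a code into its category and its number. The function name parse_gardiner_code and the small category table are my own illustrative choices and are not taken from any hieroglyphic editor; the table lists only a handful of the categories in Gardiner's sign list.

```python
import re

# A few categories from Gardiner's sign list (illustrative subset only).
CATEGORIES = {
    "A": "Man and his occupations",
    "D": "Parts of the human body",
    "G": "Birds",
    "N": "Sky, earth, water",
    "Aa": "Unclassified",
}

# Category letter(s), sign number, and an optional variant letter (as in N35A).
CODE_PATTERN = re.compile(r"^(Aa|[A-Z])(\d+)([A-Za-z]?)$")


def parse_gardiner_code(code):
    """Split a sign code such as 'G17' into category, number and variant."""
    match = CODE_PATTERN.match(code)
    if match is None:
        raise ValueError(f"not a recognizable sign code: {code!r}")
    category, number, variant = match.groups()
    return {
        "category": category,
        "description": CATEGORIES.get(category, "unknown"),
        "number": int(number),
        "variant": variant,
    }


if __name__ == "__main__":
    for code in ["G17", "N35A", "Aa1"]:
        print(code, parse_gardiner_code(code))
```

For example, G17 (the owl) is parsed as sign number 17 in category G, "Birds".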

Since there is still no working method for the text recognition of hieroglyphic texts, I produce encoded hieroglyphic texts by hand with a text editor called JSesh. In addition, I build tools for processing and publishing machine-readable texts. One of the tools helps convert a binary file containing encoded text into a text file. The project's main goal is to build a workflow for the semi-automatic transliteration of encoded hieroglyphic texts. For that, I have created language models from two available text corpora: the Ramses Translitteration Corpus and Thesaurus Linguae Aegyptiae. The language models consist of all word forms in the texts, together with their frequencies and transliterations. The first task is to divide the text into words, because hieroglyphic texts do not indicate word or sentence boundaries. The language models are then used to transliterate each word. It is likely that not all word forms in a sentence to be transliterated can be found in the language models. In that case, one can examine parts of the unknown word and the transliterations attested for them and, for example, see which of those transliterations is most likely to follow the previous word.
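The two steps, dividing the sign sequence into words and then transliterating each word, can be sketched in Python as follows. This is only a toy illustration under my own assumptions, not the project's actual implementation: the LEXICON entries and their frequencies are invented, the transliterations are written in plain ASCII, and the function name segment_and_transliterate is hypothetical. The sketch uses dynamic programming to pick the division whose words are jointly most frequent, and marks signs it cannot match with "?".

```python
import math

# Toy language model: a sequence of sign codes -> {transliteration: frequency}.
# All entries and counts are invented for illustration.
LEXICON = {
    ("G17",): {"m": 120},
    ("O1", "Z1"): {"pr": 80},
    ("O1",): {"pr": 10},
    ("N35",): {"n": 200},
    ("D21",): {"r": 150},
    ("N5", "Z1"): {"ra": 60, "hrw": 25},
}

TOTAL = sum(sum(readings.values()) for readings in LEXICON.values())
MAX_WORD_LEN = max(len(signs) for signs in LEXICON)
# Penalty for a single sign that matches no known word form.
UNKNOWN_SCORE = math.log(0.5 / TOTAL)


def segment_and_transliterate(signs):
    """Return a list of (sign tuple, transliteration) pairs for the best division."""
    # best[i] holds (score, words) for the first i signs.
    best = [(0.0, [])] + [(-math.inf, [])] * len(signs)
    for end in range(1, len(signs) + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            chunk = tuple(signs[start:end])
            readings = LEXICON.get(chunk)
            if readings:
                # Take the most frequent transliteration of this word form.
                translit, freq = max(readings.items(), key=lambda item: item[1])
                score = math.log(freq / TOTAL)
            elif end - start == 1:
                translit, score = "?", UNKNOWN_SCORE
            else:
                continue
            candidate = best[start][0] + score
            if candidate > best[end][0]:
                best[end] = (candidate, best[start][1] + [(chunk, translit)])
    return best[len(signs)][1]


if __name__ == "__main__":
    for signs, translit in segment_and_transliterate(["G17", "O1", "Z1"]):
        print(" ".join(signs), "->", translit)
```

The sketch always takes the most frequent transliteration of a word form and ignores its neighbours. Choosing between competing transliterations based on the previous word, as described above, would require the model to also store how often words follow one another.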

There is no tradition in Egyptology of making research data available to other researchers, let alone of promoting its reuse by publishing it under an open license. That is why the hieroglyphic texts produced in my project will be published openly in a machine-readable format. The tools will also be published for other researchers to use.