Recent Developments in Digital Assyriology - Zoom Workshop, Helsinki, August 26-27 2020

Aleksi Sahala on August 26, 17.00-17.20, in session 1: Creating and enriching text data.

ABSTRACT: BabyFST – A Finite-State Based Morphological Analyzer for Akkadian

Although Akkadian is a fairly well-resourced language with several large text corpora available, it still lacks a proper tool for automatic morphological analysis. Morphological analyzers have proved useful in corpus linguistics, especially for languages with complex and somewhat opaque morphology.

In this paper we describe BabyFST, a general finite-state based morphological model for Babylonian, a southern dialect of the Akkadian language. The model is capable of providing morphological analysis for different stages of Babylonian, with limited support for Assyrianisms. The model is implemented in the LEXC and XFST formalisms, which can be compiled into finite-state transducers by using compilers such as Foma (Huldén 2009) and HFST (Lindén et al. 2009). The proposed system can recognize most of the morphological features of the Akkadian language (number, gender, case, construct state, mood, tense, person, verbal affixation and verbal stem including -t- and -tan- infixation), as well as lemmatize and part-of-speech tag the input data.
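To make the idea of lemmatization plus morphological tagging concrete, the following is a minimal sketch in plain Python. It is a toy lookup, not a finite-state transducer and not BabyFST itself: the one-entry lexicon, the function name `analyze`, and the tag names (`+N`, `+Sg`, `+Nom` and so on) are assumptions made up for illustration. It relies only on the standard singular case endings of the noun šarru 'king' (šarrum, šarram, šarrim).

```python
# Toy illustration only: hypothetical lexicon and tag names, not BabyFST's.
STEMS = {"šarr": "šarru"}  # stem -> lemma (šarru 'king')

# Singular case endings with mimation (-m)
CASE_ENDINGS = {"um": "Nom", "am": "Acc", "im": "Gen"}


def analyze(word):
    """Return all analyses of `word` as lemma+tag strings (possibly empty)."""
    analyses = []
    for stem, lemma in STEMS.items():
        if word.startswith(stem):
            ending = word[len(stem):]
            if ending in CASE_ENDINGS:
                analyses.append(f"{lemma}+N+Sg+{CASE_ENDINGS[ending]}")
    return analyses


print(analyze("šarrum"))  # ['šarru+N+Sg+Nom']
print(analyze("šarram"))  # ['šarru+N+Sg+Acc']
```

A real analyzer compiled from LEXC and XFST sources encodes the same kind of stem and affix inventory, together with rewrite rules, in a single transducer, so that it also covers verbal morphology such as infixation, which a prefix-and-suffix lookup like this one cannot handle.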

The best performance is achieved if the input data is transcribed (as it is in most cases in Oracc), but the system also supports automatic transcription of transliterated texts by using LSTM neural networks and abstract pattern mapping that is able to generalize syllabic transliterations into transcription.
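As a rough illustration of what turning a syllabic transliteration into a transcription involves, here is a naive rule-based sketch in Python. It is not the LSTM model or the pattern mapping described in the auto-transcriber paper; the function name `naive_transcribe` and the single merging rule (collapse an identical vowel at a sign boundary) are assumptions chosen for illustration.

```python
# Naive sketch, not the method of the auto-transcriber paper.
VOWELS = set("aeiuāēīūâêîû")


def naive_transcribe(translit):
    """Join hyphen-separated signs, merging a repeated vowel at sign boundaries."""
    signs = translit.split("-")
    out = signs[0]
    for sign in signs[1:]:
        # CV-VC spellings such as ru-um write a single syllable (rum)
        if out and sign and out[-1] == sign[0] and out[-1] in VOWELS:
            out += sign[1:]
        else:
            out += sign
    return out


print(naive_transcribe("šar-ru-um"))  # šarrum
print(naive_transcribe("a-wi-lum"))   # awilum
```

The second example shows the limit of a purely mechanical rule: the expected transcription is awīlum, and the vowel length cannot be read off the signs alone.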

BabyFST paper: http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.479.pdf

Auto-transcriber paper: http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.433.pdf

Oraccnlp Github: https://github.com/asahala/oraccnlp

BabyFST Github: https://github.com/asahala/babyfst