The data used for the graphs was downloaded as JSON files from the Open Richly Annotated Cuneiform Corpus (Oracc) in February 2019. For the analysis we used a dataset of 7,346 texts that are tagged in Oracc as written in “Akkadian.” These texts were written primarily in the Neo-Assyrian period (c. 930–612 BCE) in both Assyria and Babylonia, but earlier and later texts are also included. The texts belong to several genres, of which royal inscriptions are the most prominent in terms of word count.
We standardized the spellings of divine and place names and removed duplicate texts following the procedure explained in Alstola et al. (2019). We used only the dictionary forms, as defined in Oracc (following the Concise Dictionary of Akkadian), of content words (nouns, verbs, and adjectives); all other words were replaced with an underscore character as a placeholder. Since neither the cuneiform script nor the Oracc metadata indicates sentence endings, the text of each document is handled as one continuous line of text.
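The placeholder step described above can be illustrated with a minimal sketch. It is not the project's actual code: the `(lemma, pos)` input format, the function name `to_line`, and the part-of-speech tag names `N`, `V`, and `AJ` are assumptions made for the example, and the sample lemmas are invented English glosses.

```python
# Assumed tag names for nouns, verbs, and adjectives; adjust to the real data.
CONTENT_POS = {"N", "V", "AJ"}


def to_line(tokens, placeholder="_"):
    """Turn one document into a single line of text.

    tokens: list of (lemma, pos) pairs (a hypothetical input format).
    Content words keep their dictionary form; every other word is
    replaced with the placeholder so that word distances are preserved.
    """
    return " ".join(
        lemma if pos in CONTENT_POS else placeholder
        for lemma, pos in tokens
    )


# Invented sample document using English glosses instead of Akkadian lemmas.
sample = [("king", "N"), ("of", "PRP"), ("fear", "V"), ("in", "PRP"), ("great", "AJ")]
sample_line = to_line(sample)
```

Keeping a placeholder instead of deleting the word means that two content words separated by a function word do not end up looking adjacent.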
We used two language technology methods, Pointwise Mutual Information (PMI) and fastText, to study the semantic domains in which lexemes occur. PMI detects words that co-occur frequently in the dataset. For example, the word “to fear” may co-occur with the words “dark,” “spider,” and “panic.” These patterns reflect syntagmatic relationships between lexemes. fastText, on the other hand, can be used to find words that appear in similar semantic contexts. “To be angry,” “to rage,” and “to be furious” are examples of words that are not necessarily used together but are likely to appear in similar contexts. Such relations can be described as paradigmatic.
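To make the PMI idea concrete, here is a minimal sketch using the standard definition PMI(a, b) = log2(p(a, b) / (p(a) p(b))). The window size, the way probabilities are estimated from raw counts, and the function name `pmi_scores` are assumptions for illustration; the procedure actually used in the project is described in Alstola et al. (2019) and may differ in these details.

```python
import math
from collections import Counter


def pmi_scores(tokens, window=2, placeholder="_"):
    """Compute PMI for word pairs that co-occur within a forward window.

    tokens: one document as a list of lemmas and placeholders.
    Returns a dict mapping alphabetically sorted pairs (a, b) to PMI values.
    """
    word_counts = Counter(t for t in tokens if t != placeholder)
    pair_counts = Counter()
    for i, word in enumerate(tokens):
        if word == placeholder:
            continue
        # Look at the next `window` positions; placeholders still take up space.
        for other in tokens[i + 1 : i + 1 + window]:
            if other != placeholder and other != word:
                pair_counts[tuple(sorted((word, other)))] += 1

    n_words = sum(word_counts.values())
    n_pairs = sum(pair_counts.values())
    scores = {}
    for (a, b), count in pair_counts.items():
        p_ab = count / n_pairs
        p_a = word_counts[a] / n_words
        p_b = word_counts[b] / n_words
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


# Invented toy input: "fear" and "dark" co-occur more often than "dark" and "king".
toy = "fear dark _ fear dark _ king city _ king city".split()
toy_scores = pmi_scores(toy)
```

In this toy input, the pair ("dark", "fear") receives a higher score than ("dark", "king"), which matches the intuition that PMI rewards words that appear together more often than their individual frequencies would predict.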
From all the lexemes in our dataset, we chose those that appear at least 5 times. With PMI and fastText we then produced, for each of these 4,930 lexemes, a list of the semantically most similar words. These lists were then visualized with Gephi. Both kinds of graphs (paradigmatic graphs produced with fastText and syntagmatic graphs produced with PMI) are also viewable in English versions. In brief, this is how we created the four graphs above.
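The frequency threshold and the per-lexeme lists can be sketched as follows. This is an illustration only: the function names, the cut-off of `k` neighbours per lexeme, and the pair-score input format are assumptions, and the scores could come from either method (PMI values or fastText similarities). Training the fastText model and exporting the lists to Gephi are not shown.

```python
from collections import Counter, defaultdict


def frequent_lexemes(lines, min_count=5, placeholder="_"):
    """Return the set of lexemes that occur at least min_count times."""
    counts = Counter(
        token
        for line in lines
        for token in line.split()
        if token != placeholder
    )
    return {word for word, count in counts.items() if count >= min_count}


def top_neighbours(scores, vocab, k=10):
    """Build, for each lexeme in vocab, a list of its k best-scoring partners.

    scores: dict mapping (a, b) pairs to a similarity or association score.
    Ties are broken alphabetically so that the output is deterministic.
    """
    neighbours = defaultdict(list)
    for (a, b), score in scores.items():
        if a in vocab and b in vocab:
            neighbours[a].append((b, score))
            neighbours[b].append((a, score))
    return {
        word: sorted(pairs, key=lambda item: (-item[1], item[0]))[:k]
        for word, pairs in neighbours.items()
    }
```

Each resulting list is effectively a set of weighted edges from one lexeme to its neighbours, which is the kind of structure a graph tool such as Gephi can display.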
We have used this dataset and these methods to study emotion words, and the results have been published in several articles (Alstola et al. 2019; Alstola et al. in preparation; Svärd et al. 2021). All articles connected to the project are listed below, with links to full-text articles when possible.
Alstola, Tero, Heidi Jauhiainen, Saana Svärd, Aleksi Sahala, and Krister Lindén. In preparation. “Digital Approaches to Analyzing and Translating Emotion.”
Svärd, Saana, Tero Alstola, Heidi Jauhiainen, Aleksi Sahala, and Krister Lindén. 2021.
DOI:
Alstola, Tero, Shana Zaia, Aleksi Sahala, Heidi Jauhiainen, Saana Svärd, and Krister Lindén. 2019. “
DOI:
Svärd, Saana, Heidi Jauhiainen, Aleksi Sahala, and Krister Lindén. 2018. “
DOI: