From Text to Knowledge

The Helsinki Centre for Digital Humanities (HELDIG) was launched with a kick-off symposium on Oct 6, 2016, attended by some 200 friends of Digital Humanities. The first HELDIG Digital Humanities Summit in 2017 provided a snapshot of activities within the centre and its collaboration network, and the overarching theme of the HELDIG Digital Humanities Summit 2018 was Infrastructures for Digital Humanities.

The special theme of HELDIG Summit 2019 is From Text to Knowledge. Most content used in DH research and applications is available originally in more or less unstructured textual form, e.g., as books, articles, newspapers, legislation and legal documents, web pages, social media discussions, parliamentary materials, letters, and folklore. Even in databases, such as collections in museums, archives, and libraries, much of the content may be in unstructured textual form. A key task in using texts computationally is to extract – in one way or another – structure and meaning from strings using methods such as linguistic analysis and natural language understanding, named entity recognition and linking, relation and event extraction, data mining, machine learning, topic modelling, and reference and network analysis.

To set the context, the day starts with the keynote "Square pegs and round holes: addressing the mismatch between humanities questions and the state-of-the-art in language technology" by Dr. Marieke van Erp from KNAW, Amsterdam. This is followed by presentations on tools and infrastructures for digitizing, processing, and analyzing large text data, and on services for publishing textual datasets.

In the afternoon, we have the keynote "Legislative data portals and linked data quality" by Prof. Jose Emilio Labra Gayo from the University of Oviedo. The day continues with presentations of research projects and applications using digital humanities approaches for analyzing texts. To end the day, bubbles and nibbles are served in a social networking event.

Keynote  
1 Square pegs and round holes: addressing the mismatch between humanities questions and the state-of-the-art in language technology Dr. Marieke van Erp
KNAW, Amsterdam

The use of computational methods in humanities research is gaining popularity and leading to new insights. But as we move from distant reading methods to deeper language understanding, we find that many state-of-the-art language technology tools don't behave quite as advertised in publications. The corpora humanities scholars investigate display a wide range of language phenomena, and humanities scholars do not necessarily have the same goals when they apply these language technology tools as the computational linguists who developed them. The variety in time span, genre, digitisation quality and corpus heterogeneity shows the gap between the two research domains.

In this talk, I will discuss several projects in which we needed to address the mismatch between language technology tools and the humanities research objectives, and how we can go forward in fitting our computational methods to the diversity of humanities research questions.

Bio: Marieke van Erp leads the Digital Humanities Lab at the Royal Netherlands Academy of Arts and Sciences Humanities Cluster in Amsterdam, the Netherlands. Her research is focused on combining natural language processing and semantic web applications in the digital humanities domain. She previously worked on the European NewsReader project, which was aimed at building structured indexes of events from large volumes of financial news, and on the CLARIAH project, a large Dutch project to develop infrastructure for humanities research.

Tools and Infrastructures  
2 Publishing linguistic typological data at UH Kaius Sinnemäki
University of Helsinki
3 Words and actions Andrey Indukaev and Daria Gritsenko
University of Helsinki

Words and actions project, presentation
The project aims at developing new theoretical perspectives and computational tools that help researchers study the use of ideas – concepts, theories, ideational constructs – in political argumentation. Digital communication has extended the space of public debate, resulting in increased diversity of perspectives presented in various online fora. From the technical perspective, this means a deluge of data. Both circumstances complicate the study of the ideational dimension of political action in contemporary digital societies. To address these issues, the 'Words and Actions' project blends the perspectives of the disciplines studying ideas and political actions, such as political sociology and conceptual history, with those of computational linguistics and computer science, to advance theoretical concepts and their empirical applications in large and diverse corpora.

4 Nokia – Person, Location, Organisation, Product, Event, Time or Common Noun? – Automatically Recognizing and Categorizing Names in Finnish Text Krister Lindén
University of Helsinki

FINER is a tool for recognizing named entities, e.g., person, location, organization, product, event and date, as well as many of their subcategories. It has been developed and tested on various data sets such as museum object metadata descriptions, YLE film metadata, Finnish court cases, Wikipedia articles, and Finnish technology-related news articles. The FINER toolkit outperforms two recently published state-of-the-art neural network architectures. The systems were evaluated on two external data sets consisting of Digitoday and Wikipedia articles, corresponding to in-domain and out-of-domain evaluation sets, respectively. The best performance was yielded by the FINER toolkit with total F1-scores of 85.20 and 79.91, respectively.
FiNER and Finnish Tagtools 1.4
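For readers unfamiliar with the metric quoted above, F1 is the harmonic mean of precision and recall. The following is a minimal sketch of how such a score could be computed for entity spans, reported here on a 0–1 scale rather than the 0–100 scale used in the abstract. The function name and the toy data are invented for illustration and are not taken from FINER or its evaluation setup.

```python
def f1_score(gold, predicted):
    """Return (precision, recall, F1) for two collections of (start, end, label) spans."""
    gold, predicted = set(gold), set(predicted)
    # A prediction counts as correct only if span boundaries and label both match.
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): token-offset spans with entity labels.
gold = {(0, 1, "ORG"), (4, 6, "PER"), (9, 10, "LOC"), (12, 13, "DATE")}
predicted = {(0, 1, "ORG"), (4, 6, "PER"), (9, 10, "ORG")}

precision, recall, f1 = f1_score(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

In this toy case the third prediction has the right span but the wrong label, so it counts against both precision and recall. This is an exact-match convention, and other evaluation schemes may score such cases differently.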

5 Digitized newspaper clippings Tuula Pääkkönen
National Library of Finland
In this presentation we go through the clipping functionalities of the National Library's https://digi.kansalliskirjasto.fi service and the kinds of newspaper clipping collections that already exist. We invite you to make use of the more than 15 million digitized pages of newspapers and journals.
6 Handwritten text recognition & search platform for historical court records Sampo Viiri and Ville-Pekka Kääriäinen
National Archives of Finland

https://transkribus.eu/r/kansallisarkisto/ platform, presentation
The National Archives of Finland have recently developed and published a platform where you can search and browse automatically transcribed documents from 19th-century Finnish Renovated District Court Records. The platform was created by the National Archives of Finland within the Horizon 2020 project READ (Recognition and Enrichment of Archival Documents). The READ project (2016–2019) was an EU-funded international collaboration between 14 partners drawn from the domains of computer science, archives and humanities research. The aim of the project was to develop handwritten text recognition and keyword spotting technologies.

7 Learning to understand languages with neural networks Jörg Tiedemann
University of Helsinki
FoTran project, presentation
Natural language understanding is the “holy grail” of computational linguistics and a long-term goal in research on artificial intelligence. The aim of the project is to develop models that learn to understand human languages by training on implicit information given by large collections of human translations. Translations are considered as alternative explanations providing additional views on information encoded in natural language. In FoTran we apply massively parallel data sets to acquire language-agnostic meaning representations that can be used for reasoning with natural languages and for other downstream tasks that require a deep understanding of the linguistic input.
8 Linked Open Data Service about Historical Finnish Academic People in 1640–1899 Petri Leskinen and Eero Hyvönen
University of Helsinki and Aalto University

AcademySampo – Finnish Academic People 1640–1899 project, presentation
The Finnish registries "Ylioppilasmatrikkeli" 1640–1852 and 1853–1899 contain detailed biographical data about virtually every academic person in Finland during the time period. We present first results on transforming these registries into a Linked Open Data service using the FAIR principles. The data is based on the student registries of the University of Helsinki, formerly the Royal Academy of Turku, that have been digitized, transliterated, and enriched with additional data about the people from various other registries. Our goal is to transform this largely textual data into Linked Open Data using named entity recognition and linking techniques, and to enrich the data further based on links to internal and external data sources and by reasoning new associations in the data. The data will be published as a Linked Open Data service on top of which tools for searching, browsing, and analyzing the data in biographical and prosopographical research are provided.

9 Exploring the Identity of the Ruling Elite in Cuneiform Text Heidi Jauhiainen
University of Helsinki

Centre of Excellence on Ancient Near Eastern Empires, presentation
The Centre of Excellence on Ancient Near Eastern Empires aims to describe the identity of the ruling elite in ancient Mesopotamia, modern-day Iraq, c. 3000 – c. 500 BCE. Our main sources are cuneiform texts from transcribed clay tablets. Here we present some preliminary results, e.g., a study on divine names demonstrating how the conquering Assyrian elite promoted their identity as rulers through the worship of their main deity Assur, and how he was integrated into the pantheon of established deities at the time. We describe how we arrived at the results by analysing semantically similar words in Akkadian texts using methods such as neural networks and social network analysis, while hermeneutically evaluating the results with Korp, provided by the Language Bank of Finland in FIN-CLARIN.

10 Overcoming Civil Wars? Comparative Conflict Resolution Models for Generational Recovery Jussi Pakkasvirta
University of Helsinki

This research will analyze different civil wars from an international and comparative perspective. It develops – with a large amount of data – a novel theoretical framework and methods, and fresh research ideas to apply in modern post-conflict resolution. The main research focus is on trans-generational transmission of trauma and generational recovery – especially on the memories of the third post-conflict generation. The proposal will also create ‘heritage solutions’ (post-memory tools of forgetting and forgiveness), which could be used in or after actual civil wars around the world. These conflict resolution models can be social, political and cultural practices, which will be identified through specific data-oriented research questions.

11 Studying pseudo-history and text reuse in Finnish and Russian internet discussions Reima Välimäki, Heta Aali, Mila Oiva, Anna Ristilä and Harri Hihnala
University of Turku

The Ancient Finnish Kings: a computational study of pseudohistory, medievalism and history politics in contemporary Finland and Russia (2019–21) project, presentation

12 Time Machine – Finland’s agenda in the massive European flagship project Tomi Ahoranta
National Archives of Finland
Time Machine project, presentation
Time Machine is a massive European initiative that aims at solving some of the biggest challenges of our time in the field of digital cultural heritage. If everything goes smoothly, the project will begin in 2021. In my presentation, I will explain what the project is all about and what we are doing in Finland to connect to it. Please visit www.timemachine.eu to find out more about the intended project. You are also welcome to join our team at https://www.timemachine.eu/registration/.
Keynote  
13 Legislative data portals and linked data quality Prof. Jose Emilio Labra Gayo
University of Oviedo

The talk will be divided into two parts. The first will describe the History of the Law project developed for the Chilean National Library of Congress: a production system that employs Semantic Web technologies, Akoma-Ntoso, and tools that automate the marking of plain text to XML and RDF, enriching and linking documents. These semantically annotated documents enable the development of specialized political and legislative services and the extraction of information for a legislative data portal. The second part will be about linked data quality, with more emphasis on RDF data description and validation.

Bio: Dr. Jose Emilio Labra Gayo from the University of Oviedo, Spain, is the main researcher of the WESO (WEb Semantics Oviedo) research group, which applies semantic technologies to different domains such as public administrations, eProcurement, and life sciences. He was a member of the W3C Data Shapes working group, is a co-author of the “Validating RDF data” book (http://book.validatingrdf.com), and maintains the online RDF validation service RDFShape (http://rdfshape.weso.es).

Applications  
14 ’Metoo machine’ for those who do not campaign Minna Ruckenstein
University of Helsinki
Lääketutka website, presentation
Building on the findings of a study of antidepressants and their life-effects, taking advantage of a computational tool, Medicine Radar, this presentation argues that large datasets can be used for uncovering first-person experiences that need more attention.
15 Disappearing Discourses: Avoiding anachronisms and teleology with data-driven methods in studying digital newspaper collections Elaine Zosa, Simon Hengchen, Lidia Pivovarova, Jani Marjanen and Mikko Tolonen
University of Helsinki
Research on the past tends to focus on topics that are relevant for the present. Recent unsupervised approaches allow for changing the perspective and concentrating on concepts that were important at the time of original publication even if they have since become less central. We claim that there is great potential in looking for themes that disappeared once new topics and values took over in the public sphere as they capture relevant parts of the historical experiences of past readers. This paper aims at identifying such disappearing discourses by using dynamic topic modeling for the collection of Finnish newspapers from the 19th to the early 20th century.
16 Bibliographic Data Harmonization in Research Leo Lahti & Helsinki Computational History Group
University of Turku and University of Helsinki
Helsinki Computational History Group, Turku Data Science Group, presentation
The research potential of bibliographic metadata collections has been recognized for decades, but questions of data representativeness, completeness, and reliability have posed challenges for large-scale research use. Structured metadata collections can also remarkably complement and support the analysis of other digital data streams, such as full texts or audiovisual material. We showcase how a systematic algorithmic framework for large-scale data harmonization has helped us to overcome these challenges and generate new insights into broad historical patterns of knowledge production in Finland and Europe.
17 New words in early English letters: How to find them and what they can reveal Tanja Säily, Eetu Mäkelä, Mika Hämäläinen
University of Helsinki
We apply a big-data approach to analysing the use of new vocabulary in a sociohistorical corpus. Our contribution is threefold: (1) we study a wider range of neologisms than previous corpus-based research has done; (2) to enable such a large-scale investigation, we develop a semi-automated pipeline; and (3) while building upon existing historical research and resources, we cover a wider social spectrum, as previous work is biased towards published texts by well-known authors. We present a case study of 17th-century neologisms identified through our pipeline. In addition to analysing their social embedding (who used them and why), we will discuss problems and solutions regarding the pipeline under development, including our methods of spelling normalization required to map the words across the resources.
18 Visualizing Mito: From text to a procedural view of Japanese intellectual historiography Aliz Horvath
University of Chicago
Digital humanities constitutes a rapidly developing “field”, but it is still primarily dominated by inquiries focused on Western themes and texts. In this talk, however, I will introduce a possible application of digital tools, specifically data visualizations, in the context of the intellectual history of historiography in East Asia. Focusing on the procedural study of the Japanese Mito School, a controversial scholarly group that compiled the Dai Nihonshi (The History of Great Japan, 1657-1906), the most monumental history writing product in Japan, my project explores the shifting dynamics of intellectual history and historiography, as well as the significance of foreign elements in the formation of nationalism in an East Asian context. Due to the monumentality of the overarching theme, more specifically the length of the Dai Nihonshi, the 250-year-long compilation process, and the high number of contributors (more than 150 individuals), I developed a hybrid and integrated methodology to process the large amount of data by intertwining the close reading of the Dai Nihonshi and the individual records of the compilers with the embedded visualizations of the authors’ biographical details. My presentation will explain how the nature of dealing with non-Latin scripts affected the research process that led from text to knowledge.
19 Tracing democratization in (big and messy) digital newspaper archives Turo Hiltunen, Turo Vartiainen and Minna Palander-Collin
University of Helsinki
Democratization, Mediatization and Language Practices project, presentation
The British Library Newspapers database provides a plentiful source for linguistic research with its 2 million newspaper pages of national and regional newspapers from 1732 to 1950. The data exist as OCRed text files, which, in principle, constitute an ideal source for exploring how societal changes such as democratization and changes in language practices are intertwined. In practice, however, it has been a complex process to tame the data to the extent that it can be used to answer our sociolinguistic research questions. We shall elaborate on this process and methodological issues including the granularity of data, lack of linguistic annotation, quality of scanned documents, and representativeness.
20 Internet Folklore and Online Mediated Identity – A Netnography Study in Nyishi Community, Arunachal Pradesh Deepika Kashyap
University of Tartu and University of Hyderabad
Modern technologies and innovations have transformed the culture and tradition of the Nyishi community. They have created a new identity for Nyishi people through the internet. The penetration of the internet or the WWW (world wide web) has allowed the folk to express and represent their “lore” (culture, custom and tradition) to a wider audience. The internet has also opened up a new mode of communication in which people can create and circulate messages easily. With the advancement of technology and the internet, many cultures, traditions, and folklore around the world have been revived and are reaching people across geographical barriers and time limits. Nowadays people from many communities are coming online, creating a space of their own, and expressing their identity, culture, and agency. Nyishi folklore is also gathering pace with the help of internet technology.
21 Language Technology for Publishing and Using Finnish Legislation and Case Law on the Semantic Web Minna Tamper, Arttu Oksanen, Sami Sarsa, Jouni Tuominen, Aki Hietanen and Eero Hyvönen
Aalto University, Ministry of Justice, Edita Ltd and University of Helsinki

ANOPPI, APPI, LawSampo and Semantic Finlex project, presentation
This talk overviews ongoing research on publishing and using Finnish legislation and case law on the Semantic Web, joint work of Aalto University, University of Helsinki (HELDIG), Ministry of Justice, and Edita Publishing Ltd: ANOPPI is a tool for anonymizing legal documents to be published; APPI is a tool for automatic semantic annotation of the documents; Semantic Finlex is the Linked Data service to be used for applications and data analysis of legal documents; LawSampo is the semantic portal to be used by citizens and other end users.

22 Four Generations of Publishing and Using Texts in Digital Humanities:
Forging Sampo Portals in the Digital Age
Eero Hyvönen
University of Helsinki and Aalto University

LODI4DH project and Sampo Portals, presentation
This talk presents the vision and longstanding work since 2003 on creating semantic portals on top of a Linked Open Data Infrastructure for Digital Humanities in Finland. As examples of applications of the infrastructure, the "Sampo" model and series of semantic portals are considered: CultureSampo (2008, culture and collections), TravelSampo (2011, cultural tourism), BookSampo (2011, fiction literature), WarSampo (2015, WW2 war history), BiographySampo (2018, biography and prosopography), NameSampo (2019, toponomastic research), WarVictimSampo 1914–22 (2019, civil war history), AcademySampo (2020, historical academic persons and networks), FindSampo (2020, archaeology and citizen science), and LawSampo (2020, legislation and case law). Many "sampos" are widely used on the web. For example, BookSampo had 2 million and WarSampo 230 000 users in 2018, and tens of thousands of people have already used the more recent systems NameSampo and BiographySampo. The presentation analyses this development in terms of four generations of semantic portals separated by paradigm shifts. As an example, BiographySampo is presented on video to illustrate features of all four generations.

23 How to Use Linked Data Infrastructure for Digital Humanities? – Practical View Jouni Tuominen
University of Helsinki and Aalto University
This talk presents a practical view on how Linked Data can be used for Digital Humanities, with a focus on SPARQL queries for accessing the data for analysis purposes.
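To give a flavour of what accessing Linked Data with SPARQL can look like, here is a minimal sketch that builds a query and the corresponding request URL without sending it. The endpoint address and the helper names are hypothetical and are not taken from the talk; only `rdfs:label` (a standard RDF Schema property) and the `query` parameter of the SPARQL 1.1 Protocol are real conventions.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: substitute the address of the SPARQL service you use.
ENDPOINT = "https://example.org/sparql"


def build_label_query(search_term, limit=10):
    """Build a SPARQL query for resources whose rdfs:label contains search_term."""
    # Escape backslashes and double quotes so the term is a valid string literal.
    escaped = search_term.replace("\\", "\\\\").replace('"', '\\"')
    return f"""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?resource ?label WHERE {{
  ?resource rdfs:label ?label .
  FILTER(CONTAINS(LCASE(STR(?label)), LCASE("{escaped}")))
}}
LIMIT {int(limit)}
""".strip()


def build_request_url(endpoint, query):
    """Encode the query as a GET request, as described in the SPARQL 1.1 Protocol."""
    return endpoint + "?" + urlencode({"query": query})


query = build_label_query("Helsinki", limit=5)
url = build_request_url(ENDPOINT, query)
print(query)
print(url)
```

Sending the resulting URL with an `Accept: application/sparql-results+json` header would typically return the bindings as JSON, which can then be loaded into an analysis tool. The request itself is left out here so that the sketch runs without network access.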
24 Current work in Helsinki Computational History Group Mikko Tolonen
University of Helsinki
Helsinki Computational History Group, presentation
25 FCAI Special Interest Group in Language, Speech and Cognition Jörg Tiedemann
University of Helsinki

Human intelligence is based on language and cognition. Intelligent interaction with machines and human-like reasoning require language processing. Language and speech technology as well as cognitive science provide the essential models for proper language understanding and generation at the core of AI. The purpose of the special interest group on Language, Speech and Cognition (FCAI-SIG-LSC) is to build a platform for discussing advances in natural language processing in relation to human cognition and the development of artificial intelligence.

FCAI-SIG-LSC will address topics that study and develop the ability of machines to communicate with humans and understand human-to-human communication, to perform complex reasoning and to learn from spoken interaction and linguistic data.

The goals of FCAI-SIG-LSC are

  • to organize events that focus on the connections between language technology, cognitive science and AI development
  • to support collaborations between the commercial sector and academic research institutes in the development of language and speech technology
  • to provide a platform for discussing the impact of natural language processing and AI on society
  • to establish a research programme on language, speech and cognition under the umbrella of FCAI
