See also the showreel of completed and ongoing research at http://j.mp/hssci.
Led by Eetu Mäkelä, the HSCI research strand sits at the intersection of digital humanities, computational social science, human-computer interaction and data science. In our opinion, applying modern data processing to complex social and historical data works best when done in collaboration between the social scientists and humanities scholars who have the questions, the computer scientists who deeply understand the methods, and the institutions that own and best understand the data used.
At its best, collaboration also has something unique to offer each of these groups inside their own field of study: scholars in the humanities and social sciences are able to tackle questions too labour-intensive for manual study, computer scientists encounter new and challenging use cases for their tools and algorithms, and institutions gain valuable insight into, and feedback on, their own data collections and the way they present them.
In this vein, we’ve set out to map and attack the gap between the needs of the interpretive process of inquiry inherent to the humanities and social sciences, and existing computational tools. In particular, we’ve identified the following current shortcomings:
- Primary sources and databases in the humanities, or e.g. the online forums used by social scientists, often do not directly contain the information of interest to the scholar. Thus, research by necessity includes large amounts of interpretation and inference to produce the final data on which conclusions are based. Current tools mostly do not support this, instead only providing quantitative views on top of the original data [2–12].
- Tools need to support contextualization [3, 6, 11–12] and fluid movement between close and distant reading [3, 6, 11–12, 14–15]. Due to the complexity of the data as well as of the process of interpretation, no aggregate view of data can be trusted without the possibility to look deeper into what goes into the numbers. On the other hand, data-based contextualization on the level of close reading allows discovery of related information and patterns that might otherwise have been missed.
- Tools also need more nuanced handling of the ever-present, multiple types of uncertainty, incompleteness and contradiction inherent in humanities and social data [16–21].
- Available datasets have often been created for other purposes. When approaching a new dataset, it is difficult to determine whether information relevant to an enquiry really is there and how it is encoded. Even more importantly, data often arises from a long process of selection and transformation (e.g. in the humanities, from a set of physical books that remain, through selective microfilming and automatic transcription or manual curation, to a dataset; or in the social sciences, through opportunistic and thus incomplete mining of social media). All of these steps may introduce bias due to uneven sampling, changing descriptive practices or algorithms. Without facilities for detecting and handling such bias, any results based on the material may be faulty [22–25].
- No single tool can encompass the whole variety of different types of datasets, nor of different research needs and processes. Even in the context of a single project, different data sources of interest often have different types of objects or exhibit different levels of structure, and thus have different needs for processing to yield unified data. Yet, most current tools are rigid and overbearing, and do not allow fluidly moving data between tools.
Turned around, the core research questions currently on the agenda for the research strand are thus:
- How to make the complex human datasets that are available understandable, and how to reveal the biases hidden therein?
- How to handle the ever-present uncertainty, incompleteness and contradictions in human data?
- How to track bias and uncertainty through multiple layers of analysis?
- How to integrate qualitative and quantitative steps of processing into a unified whole?
- How to fluidly move between steps in the analysis – giving context to individual entries and documents, while also allowing peeking behind any aggregate views or counts to look at their constituents?
- How to enable scholars to store and ground their interpretations as layers on top of the data?
- How to develop the tools in such a way as to allow them to fit the needs of a scholarly research process instead of dictating boundaries (be they limits on data sets or processes)?
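As a rough illustration of the question about peeking behind aggregate views, the following Python sketch keeps, for every count, the records it was computed from. The `Record` fields, the sample data and the function name `aggregate_with_provenance` are assumptions made for this example only and do not describe any existing tool or dataset.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """A hypothetical catalogue entry; all fields are illustrative."""
    record_id: str
    city: str
    decade: int
    title: str


def aggregate_with_provenance(records, key):
    """Group records by key(record), keeping the constituents of each group.

    Returning the records themselves (not just a count) is what lets a
    reader move from the distant view back to individual entries.
    """
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    return dict(groups)


# Made-up sample data, for illustration only.
SAMPLE = [
    Record("r1", "London", 1700, "An Essay"),
    Record("r2", "London", 1700, "A Treatise"),
    Record("r3", "Paris", 1700, "Lettres"),
    Record("r4", "London", 1710, "A Sermon"),
]

groups = aggregate_with_provenance(SAMPLE, key=lambda r: (r.city, r.decade))

# Distant reading: the aggregate view is just the size of each group.
counts = {k: len(v) for k, v in groups.items()}

# Close reading: peek behind one count at the entries that produced it.
london_1700_titles = [r.title for r in groups[("London", 1700)]]
```

The design choice in this sketch is simply that aggregation never discards its inputs: a count is derived from a group on demand, so any number shown to a scholar can in principle be traced back to the entries behind it.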
To ensure the tools developed meet the needs of end users, they are developed in an iterative fashion, in close interaction with humanities and social science projects. This counteracts a problem common to many digital humanities and cross-disciplinary projects in general, whereby a lack of genuine buy-in from both sides ends up stagnating the project and preventing true cross-disciplinary utility [9, 26–31].
A key insight here is also that requirements are simultaneously gathered from multiple projects that operate in different subfields, but deal with similarly formatted data. Through this, solutions devised will be forced to transcend individual scenarios, making them versatile enough to enable their later use in new fields with similar workflows.
In particular, collaboration partners have been chosen so that 1) all have readily available big data sources of interest, 2) the sources range from fully structured to mostly unstructured and 3) the research questions posed against those sources range from simple to advanced, providing workable avenues of attack. These collaboration projects are listed next, along with their qualities.
Reassembling the Republic of Letters is an EU COST Action bringing together a total of 186 people from 33 countries interested in studying early modern and Enlightenment social networks through correspondence and prosopographical information. Primary contacts for collaboration in the network are the University of Oxford (Howard Hotson), which leads the action, and the Humanities + Design laboratory at Stanford University (Nicole Coleman and Dan Edelstein). Here, data of interest is mostly in structured form, but is institutionally and geographically spread out and follows varying data models. A sampling of different types of datasets to which access has been obtained includes Early Modern Letters Online, Six Degrees of Francis Bacon, French Book Trade in Enlightenment Europe, the French national bibliography, the English Short Title Catalogue (ESTC), the Oxford Dictionary of National Biography and Wikidata. Examples of data-centric research queries would be: What backgrounds did people going on a Grand Tour come from? What did Samuel Hartlib’s network of connections look like? Which cities published the most philosophy in French? Do the stated places of publication correspond with actual places of publication?
A similar project that also deals in mostly structured data but in a different domain is the Mapping Manuscript Migrations transatlantic partnership, which seeks to track the movements of medieval and Renaissance manuscripts. The partnership is again headed by the University of Oxford (Toby Burrows, Kevin Page, Pip Willcox), but also includes the University of Pennsylvania (Lynn Ransom), the Institut de Recherche et d’Histoire des Textes (Hanno Wijsman), and Aalto University (Eero Hyvönen). Here, structured data comes from many manuscript collections, among them the Schoenberg Database of Manuscripts, databases of the Bodleian Library, and the French Bibale and Medium. Examples of research queries are: Where have manuscripts from the Phillipps collection ended up? Where did they come from? How many hands did they pass through?
Expanding from purely structured data into large collections of unstructured text, the NewsEye EU project as well as the COMHIS project at the University of Helsinki (Mikko Tolonen) seek to analyze the quantity and content of early modern public communication. In the projects, this is to be done both through metadata, including the ESTC, Fennica and Kungliga bibliographies, and through full-text collections such as Early English Books Online (EEBO), Eighteenth Century Collections Online (ECCO), and the newspaper collections of British Library Newspapers (BLN), the Burney collection, the Nichols collection, the Delpher Dutch newspaper collection, the Finnish national library newspaper collection and the Europeana newspapers collection. Example queries are: What were the quantity ratios between newspaper and book publishing in different cities at different times? What about different book sizes? Which places published the most philosophical literature? Which discourses is the term “public” associated with? How does nationalism appear in the media, and how does this differ between countries?
A different need for combining unstructured and structured data comes from the STRATAS project (Terttu Nevalainen, University of Helsinki), where social metadata associated with the Corpus of Early English Correspondence (CEEC) is being interfaced with the texts themselves in an effort to study the social aspects of historical language variation and change. Here, in addition to CEEC, other historical full-text materials are also useful as comparisons, such as EEBO, ECCO and the BLN, Burney and Nichols newspaper collections. Further, structured sources for historical language information need to be queried, such as the Oxford English Dictionary (OED), the Middle English Dictionary, and the Historical Thesaurus of English. Example queries are: When did the modern way of spelling “cannot” as one word originate? Which social group adopted it first? Do men or women create new words more often? What is the influence of social class on this?
Finally, in the FLOPO project with Anu Koivunen, we are charting the changing relationship between politics and journalistic media in the last 20 years by utilizing the news output of Helsingin Sanomat (the biggest Finnish newspaper), the Finnish public broadcaster YLE, and regional news from Alma Media. In this project, besides the need to delineate particular policy discourses, more complex entities and features (such as framing, style and polarisation) need to be extracted from the unstructured data. Example queries are: who gets to decide how politics are discussed in the Finnish public sphere? How are different types of actors framed in the media, in relation to different policy topics and at different points in time? What is the agency of the media itself?
Further collaboration is also under preparation with multiple parties. First, with Kati Kallio and Mari Sarv, on researching interactions between thematic, geographic and linguistic patterns in Finnish and Estonian folk poems, based on the Finnish folk poetry database SKVR and the Estonian Folklore Archives. Second, with Marjo Särkkä-Tirkkonen, text and structured metadata in the EU DOOR database are being utilized to discover how fame is argued in different parts of Europe in the hopes of gaining protected geographic status for food items. Finally, with Ilona Pikkanen and Elsi Hyttinen, collaboration is being prepared on dissecting the fiction that was published on the pages of Finnish newspapers in the 19th century, based on the collections held by the Finnish national library.
1. Watts, D.J.: Computational social science: Exciting progress and future directions. The Bridge on Frontiers of Engineering. 43, 5–10 (2013).
2. Mahrt, M. & Scharkow, M.: The Value of Big Data in Digital Media Research, Journal of Broadcasting & Electronic Media, 57:1, 20–33 (2013).
3. Lewis, S.C., Zamith, R. & Hermida, A.: Content Analysis in an Era of Big Data: A Hybrid Approach to Computational and Manual Methods, Journal of Broadcasting & Electronic Media, 57:1, 34–52 (2013).
4. Grimmer, J. & Stewart, B.: Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts, Political Analysis, 21:3, 267–297 (2013).
5. Fuchs, C.: From digital positivism and administrative big data analytics toward critical digital and social media research!, European Journal of Communication, 32:1, 37–49 (2017).
6. boyd, d., Crawford, K.: Critical Questions for Big Data. Inf. Commun. Soc. 15, 662–679 (2012).
7. van Zundert, J.: If you build it, will we come? Large scale digital infrastructures as a dead end for digital humanities, Historical Social Research/Historische Sozialforschung, 165–186 (2012).
8. Isoaho, K., Gritsenko, D. & Mäkelä, E.: Topic modelling and text analysis for qualitative policy research, under review in Policy Studies Journal.
9. Poole, A.: The conceptual ecology of digital humanities, Journal of Documentation, 73:1, 91–122 (2017).
10. Grimmer, J.: We Are All Social Scientists Now: How Big Data, Machine Learning, and Causal Inference Work Together, Political Science & Politics, 48:1, 80–83 (2015).
11. Wallach, H.: Computational Social Science ≠ Computer Science + Social Data, Communications of the ACM, 61:3, 42–44 (2018).
12. Hill, M.J.: Invisible interpretations: reflections on the digital humanities and intellectual history. Global Intellectual History. 1, 130–150 (2016).
13. Drucker, J.: Graphical Approaches to the Digital Humanities. In: A New Companion to Digital Humanities. pp. 238–250 (2015).
14. Mäkelä, E., Lindquist, T., Hyvönen, E.: CORE - A Contextual Reader based on Linked Data. In: Proceedings of Digital Humanities 2016, long papers. pp. 267–269. Kraków, Poland (2016).
15. Jänicke, S., Franzini, G., Cheema, M.F., Scheuermann, G.: On Close and Distant Reading in Digital Humanities: A Survey and Future Challenges. In: Borgo, R., Ganovelli, F., and Viola, I. (eds.) Eurographics Conference on Visualization (EuroVis) - STARs. The Eurographics Association (2015).
16. Uboldi, G., Caviglia, G.: Information Visualizations and Interfaces in the Humanities. In: New Challenges for Data Design. pp. 207–218. Springer, London (2015).
17. Wang, X., He, Y.: Learning from Uncertainty for Big Data: Future Analytical Challenges and Strategies. IEEE Systems, Man, and Cybernetics Magazine. 2, 26–31 (2016).
18. Rowe, W.D.: Understanding Uncertainty. Risk Anal. 14, 743–750 (1994).
19. Lustick, I.S.: History, Historiography, and Political Science: Multiple Historical Records and the Problem of Selection Bias. Am. Polit. Sci. Rev. 90, 605–618 (1996).
20. Plewe, B.: The Nature of Uncertainty in Historical Geographic Information. Trans. GIS. 6, 431–456 (2002).
21. Pink, S., Ruckenstein, M., Willim, R., Duque, M.: Broken data: Conceptualising data in an emerging world. Big Data & Society. 5, 2053951717753228 (2018).
22. boyd, d., Crawford, K.: Critical Questions for Big Data. Inf. Commun. Soc. 15, 662–679 (2012).
23. Kaplan, R.M., Chambers, D.A., Glasgow, R.E.: Big data and large sample size: a cautionary note on the potential for bias. Clin. Transl. Sci. 7, 342–346 (2014).
24. Lazer, D., Kennedy, R., King, G., Vespignani, A.: The Parable of Google Flu: Traps in Big Data Analysis. Science. 343, 1203–1205 (2014).
25. Cihon, P., Yasseri, T.: A Biased Review of Biases in Twitter Studies on Political Collective Action. Frontiers in Physics. 4, 34 (2016).
26. Dombrowski, Q.: What Ever Happened to Project Bamboo?, Literary and Linguistic Computing, Volume 29, Issue 3, pp. 326–339 (2014).
27. Hine, C.: New Infrastructures for Knowledge Production: Understanding E-science. Information Science Publishing (2006).
28. Kemman, M. & Kleppe, M.: User Required? On the Value of User Research in the Digital Humanities, Selected Papers from the CLARIN 2014 Conference, October 24-25, 2014, Soesterberg, The Netherlands, pp. 63–74 (2015).
29. Edmond, J. & Garnett, V.: APIs and Researchers: The Emperor’s New Clothes?, International Journal of Digital Curation, 10(1), pp. 287–297 (2015).
30. Haythornthwaite, C., Lunsford, K.J., Kazmer, M.M., Robins, J. & Nazarova, M.: The generative dance in pursuit of generative knowledge, Proceedings of the 36th Annual Hawaii International Conference on System Sciences (2003).
31. Karasti, H.: Infrastructuring in participatory design, Proceedings of the 13th Participatory Design Conference on Research Papers - PDC ’14, pp. 141–150 (2014).