See also the showreel of completed and ongoing research at http://j.mp/hssci.
Led by Eetu Mäkelä, the HSSCI research strand sits at the intersection of digital humanities, computational social science, interaction design and data science. In our opinion, applying modern data processing to complex social and historical data works best when done in collaboration among the social scientists and humanities scholars who have the questions, the computer scientists who deeply understand the methods, and the institutions that own and best understand the data used.
At its best, collaboration also has something unique to offer each of these groups inside their own field of study: scholars in the humanities/social sciences are able to tackle questions too labour-intensive for manual study, computer scientists encounter new and challenging use cases for their tools and algorithms, and institutions gain valuable insight into, and feedback on, their own data collections and the way they present them.
In this vein, we’ve set out to map and attack the gap between the needs of the interpretive process of inquiry inherent to the humanities and social sciences, and existing computational tools. In particular, we’ve identified the following current shortcomings:
- Primary sources and databases in the humanities, or e.g. the online forums used by social scientists, often do not directly contain the information of interest to the scholar. Thus, research by necessity includes large amounts of interpretation and inference to produce the final data that conclusions are based on. Current tools mostly do not support this, instead only providing quantitative views on top of the original data [2, 3].
- Tools need to support contextualization, and fluidly moving between close and distant reading [5, 6] – due to the complexity of the data as well as the process of interpretation, no aggregate view of data can be trusted without the possibility to look deeper into what goes into the numbers. On the other hand, data-based contextualization on the level of close reading allows discovery of related information and patterns that might otherwise have been missed.
- Tools also need more nuanced handling of the ever-present, multiple types of uncertainty, incompleteness and contradictions inherent in humanities and social data [3, 7–11].
- Available datasets have often been created for other purposes. Faced with a new dataset, it is difficult to determine whether information relevant to an enquiry really is there and how it is encoded. But even more importantly, data also often arises from a long process of selection and transformation (e.g. in the humanities, from a set of physical books that remain, through selective microfilming and automatic transcription or manual curation, to a dataset, or in the social sciences, through opportunistic and thus incomplete mining of social media). All of these steps may introduce bias due to uneven sampling, changing descriptive practices or algorithms. Without facilities for detecting and handling such bias, any results based on the material may be faulty [12–15].
- No single tool can encompass the whole variety of different types of datasets, nor of different research needs and processes. Even in the context of a single project, different data sources of interest often have different types of objects or exhibit different levels of structure, and thus have different needs for processing to yield unified data. Yet, most current tools are rigid and overbearing, and do not allow fluidly moving data between tools.
Turned around, the core research questions currently on the agenda for the research strand are thus:
- How to make complex humanities/social science datasets understandable, and reveal the biases hidden therein.
- How to fluidly move between close and distant reading – giving context to individual entries and documents, while also allowing peeking behind any aggregate counts to look at their constituents.
- How to handle the ever-present uncertainty, incompleteness and contradictions in humanities/social data – this requires developments in terms of a) statistics and algorithms to better handle and quantify this uncertainty, b) visualizations capable of highlighting these various types of uncertainty and c) user interaction that e.g. allows exploring different possible values.
- How to enable scholars to store and ground their interpretations as layers on top of the data, both to enable reuse as well as improve reproducibility. This also has to happen in an easy enough manner that they will actually do so.
- How to develop the tools in such a way as to allow them to fit the needs of a scholarly research process instead of dictating boundaries.
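As a rough illustration of the second and third questions above, the following minimal Python sketch shows one way an aggregate count could stay linked to its constituents while also exposing date uncertainty. It is not a description of any tool developed in the strand: the `Record` class, the `count_by_city` function and the sample records are all hypothetical, and modelling uncertainty as a simple year interval is an assumption made only for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical bibliographic record with an uncertain publication year."""
    title: str
    city: str
    year_min: int  # earliest plausible publication year
    year_max: int  # latest plausible publication year


def count_by_city(records, start, end):
    """Group records by city, keeping the constituents behind each count.

    A record is 'certain' if its whole year interval lies inside
    [start, end], and 'possible' if the interval merely overlaps it.
    Records whose interval does not overlap the period are left out.
    """
    groups = defaultdict(lambda: {"certain": [], "possible": []})
    for r in records:
        if start <= r.year_min and r.year_max <= end:
            groups[r.city]["certain"].append(r)
        elif r.year_min <= end and r.year_max >= start:
            groups[r.city]["possible"].append(r)
    return dict(groups)


# Invented sample data, for illustration only.
records = [
    Record("Treatise A", "London", 1700, 1700),
    Record("Treatise B", "London", 1695, 1705),
    Record("Pamphlet C", "Edinburgh", 1710, 1710),
    Record("Pamphlet D", "Edinburgh", 1750, 1760),
]

groups = count_by_city(records, 1700, 1720)
for city, g in sorted(groups.items()):
    low = len(g["certain"])
    high = low + len(g["possible"])
    # Distant reading: a count range instead of a single number.
    print(f"{city}: between {low} and {high} records")
    # Close reading: the individual records behind that range.
    for r in g["certain"] + g["possible"]:
        print(f"  - {r.title} ({r.year_min}-{r.year_max})")
```

Reporting a range rather than a single number keeps the uncertainty visible in the aggregate view, and because each group retains its records, a user can always step from the count back to the individual entries.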
To ensure the tools developed meet the needs of end users, they are developed in an iterative fashion, in close interaction with humanities and social science projects. A key insight here is that requirements are simultaneously gathered from multiple projects that operate in different subfields, but deal with similarly formatted data. Through this, the solutions devised are forced to transcend individual scenarios, making them versatile enough to enable their later use in new fields with similar workflows.
At present, the main collaboration partners have been chosen mostly from the field of humanities, so that 1) all have readily available big historical data sources of interest, 2) the sources range from fully structured to mostly unstructured and 3) the research questions posed against those sources range from simple to advanced, providing workable avenues of attack. These collaboration projects are listed next, along with their qualities (in addition to these, smaller projects inside the sphere of the social sciences are being monitored for later inclusion).
Reassembling the Republic of Letters is an EU COST Action joining together a total of 186 people from 33 countries interested in studying early modern and Enlightenment social networks through correspondence and prosopographical information. Primary contacts for collaboration in the network are the University of Oxford (Howard Hotson), which leads the action, and the Humanities + Design laboratory at Stanford University (Nicole Coleman and Dan Edelstein). Here, data of interest is mostly in structured form, but institutionally and geographically spread out, as well as with varying data models. A sampling of different types of datasets to which access has been obtained includes Early Modern Letters Online, Six Degrees of Francis Bacon, French Book Trade in Enlightenment Europe, the French national bibliography, the English Short Title Catalogue (ESTC), the Oxford Dictionary of National Biography and Wikidata. Examples of data-centric research queries would be: What backgrounds did people going on a Grand Tour come from? What did Samuel Hartlib’s network of connections look like? Which cities published the most philosophy in French? Do the stated places of publication correspond with actual places of publication?
A similar project that also deals in mostly structured data but in a different domain is the Mapping Manuscript Migrations transatlantic partnership, which seeks to track the movements of medieval and Renaissance manuscripts. The partnership is again headed by the University of Oxford (Toby Burrows, Kevin Page, Pip Willcox), but also includes the University of Pennsylvania (Lynn Ransom), the Institut de Recherche et d’Histoire des Textes (Hanno Wijsman), and Aalto University (Eero Hyvönen). Here, structured data comes from many manuscript collections, among them the Schoenberg Database of Manuscripts, databases of the Bodleian Library, and the French Bibale and Medium. Examples of research queries are: Where have manuscripts from the Phillipps collection ended up? Where did they come from? How many hands did they pass through?
Expanding from purely structured data into large collections of unstructured text, the COMHIS project at the University of Helsinki (Mikko Tolonen) seeks to analyze the quantity and content of early modern public communication in 18th-century Britain and 19th-century Finland. This is to be done both through metadata, including the ESTC, Fennica and Kungliga bibliographies, and through full-text collections such as Early English Books Online (EEBO), Eighteenth Century Collections Online (ECCO), and newspaper collections including British Library Newspapers (BLN) and the Finnish national library newspaper collection. Example queries are: What were the quantity ratios between newspaper and book publishing in different cities at different times? What about different book sizes? Which places published the most philosophical literature? Which discourses is the term “public” associated with?
Finally, a different need for combining unstructured and structured data comes from the STRATAS project (Terttu Nevalainen, University of Helsinki), where social metadata associated with the Corpus of Early English Correspondence (CEEC) is being interfaced with the texts themselves in an effort to study the social aspects of historical language variation and change. Here, in addition to CEEC, other historical full-text materials are also useful as comparisons, such as EEBO, ECCO and the BLN, Burney and Nichols newspaper collections. Further, structured sources for historical language information need to be queried, such as the Oxford English Dictionary and the Historical Thesaurus of English. Example queries are: When did the modern way of spelling “cannot” as one word originate? Which social group adopted it first? Do men or women create new words more often? What is the influence of social class on this?
1. Watts, D.J.: Computational social science: Exciting progress and future directions. The Bridge on Frontiers of Engineering. 43, 5–10 (2013).
2. Hill, M.J.: Invisible interpretations: reflections on the digital humanities and intellectual history. Global Intellectual History. 1, 130–150 (2016).
3. Drucker, J.: Graphical Approaches to the Digital Humanities. In: A New Companion to Digital Humanities. pp. 238–250 (2015).
4. Mäkelä, E., Lindquist, T., Hyvönen, E.: CORE - A Contextual Reader based on Linked Data. In: Proceedings of Digital Humanities 2016, long papers. pp. 267–269. Kraków, Poland (2016).
5. Jänicke, S., Franzini, G., Cheema, M.F., Scheuermann, G.: On Close and Distant Reading in Digital Humanities: A Survey and Future Challenges. In: Borgo, R., Ganovelli, F., and Viola, I. (eds.) Eurographics Conference on Visualization (EuroVis) - STARs. The Eurographics Association (2015).
6. Uboldi, G., Caviglia, G.: Information Visualizations and Interfaces in the Humanities. In: New Challenges for Data Design. pp. 207–218. Springer, London (2015).
7. Wang, X., He, Y.: Learning from Uncertainty for Big Data: Future Analytical Challenges and Strategies. IEEE Systems, Man, and Cybernetics Magazine. 2, 26–31 (2016).
8. Rowe, W.D.: Understanding Uncertainty. Risk Anal. 14, 743–750 (1994).
9. Lustick, I.S.: History, Historiography, and Political Science: Multiple Historical Records and the Problem of Selection Bias. Am. Polit. Sci. Rev. 90, 605–618 (1996).
10. Plewe, B.: The Nature of Uncertainty in Historical Geographic Information. Trans. GIS. 6, 431–456 (2002).
11. Pink, S., Ruckenstein, M., Willim, R., Duque, M.: Broken data: Conceptualising data in an emerging world. Big Data & Society. 5, 2053951717753228 (2018).
12. Boyd, D., Crawford, K.: Critical Questions for Big Data. Inf. Commun. Soc. 15, 662–679 (2012).
13. Kaplan, R.M., Chambers, D.A., Glasgow, R.E.: Big data and large sample size: a cautionary note on the potential for bias. Clin. Transl. Sci. 7, 342–346 (2014).
14. Lazer, D., Kennedy, R., King, G., Vespignani, A.: The Parable of Google Flu: Traps in Big Data Analysis. Science. 343, 1203–1205 (2014).
15. Cihon, P., Yasseri, T.: A Biased Review of Biases in Twitter Studies on Political Collective Action. Frontiers in Physics. 4, 34 (2016).