How to Deal with Data: Introduction

Terttu Nevalainen, Research Unit for Variation, Contacts and Change in English, University of Helsinki
Susan M. Fitzmaurice, School of English Literature, Language and Linguistics, The University of Sheffield


Most of the volumes so far published in the Studies in Variation, Contacts and Change in English can be classified as corpus linguistics, and several forthcoming volumes in the series are informed by corpus linguistic thinking. The contributions to this volume are solidly empirical, and most of them indeed refer to corpora, but they also broaden the use of digital data sources in linguistic studies in various ways. This issue is thus devoted to showcasing how different approaches to dealing with data can provide ways to explore a range of different questions about language use across time and space. We are interested in examining the nature of the problems and challenges that kinds of material not ordinarily treated as linguistic data sources pose for the researcher. We will consider how these kinds of material might be considered as potential sources of data for analysis and interpretation.

The questions addressed include: what counts as data in linguistic research, how can users of corpora and digital databases learn more about their data, and how can several media be integrated in research? The authors introduce novel data sources, analyse multimodal corpora, and describe text corpora in visual terms. These resources and approaches invite new research questions and enhance the assessment and interpretation of the results obtained. The authors also seek to demonstrate that drawing upon the techniques and methods adopted in allied or sister disciplines opens up different perspectives on the material considered to be data for linguistic analysis.

1. Growing variability of data sources

Corpus linguistics had a slow beginning but its subsequent growth has been phenomenal. Back in the early 1960s, a small group of linguists realized the research potential of electronic text corpora in the humanities, and the Brown Corpus, compiled under the direction of W. Nelson Francis and Henry Kučera, was published in 1964. Now the use of electronic corpora is the rule rather than the exception in many fields of linguistic inquiry, and corpus linguistics constitutes a methodology that is frequently viewed as a branch of linguistics of its own. [1]

Apart from corpora, a wide selection of digital archives and databases has become available in recent years. These new resources consist of pretty much everything available in digital form, including the internet itself (see e.g. Kehoe & Gee 2007). More finite sources such as printed books and newspapers are also being tapped not only by linguists but also by researchers in other fields across the humanities, and social and formal sciences (e.g. Michel et al. 2011). Linguistic interests in these new resources are similarly wide-ranging, from applied to computational linguistics or, as in this volume, from finding evidence for language change in its social contexts to reconstructing linguistic usage that crosses traditional geographical and discourse boundaries.

The “digital turn” in the humanities has enabled the use and creation of less conventional data sources for research purposes. The contributions to this volume discuss both purpose-built and nontraditional sources, highlighting their research potential and seeking solutions to challenges inherent in analysing digital data for linguistic purposes.

2. Access and stability

Established and innovative data sources may differ in several respects that influence the work carried out using them. While corpora can in principle be open ended – monitor corpora, for example, are updated regularly (Davies 2010) – the use of stable corpora facilitates data sharing and enables the easy replicability of research results. The idea of stable corpora lies behind corpus archives such as the Oxford Text Archive (OTA), which “develops, collects, catalogues and preserves electronic literary and linguistic resources for use in Higher Education, in research, teaching and learning”. Corpora are similarly the building block in integrated research infrastructure projects such as CLARIN (Common Language Resources and Technology Infrastructure). The mission of CLARIN states that the resources and services planned are to be persistent, i.e. “accessible for many years so that researchers can rely on them”.

There are also stable databases that do not constitute corpora in the traditional sense but have been compiled, for example, with historical rather than linguistic research in mind. A case in point, discussed by Philip Shaw in this volume, is coin inscriptions, which can offer useful information for the study of the languages used in Anglo-Saxon England.

However, some of the new resources introduced by both Susan Burnes and Susan Fitzmaurice do not represent finite data sets but rather a selection of research-relevant material retrieved from the ever-increasing body of data on the internet such as electronic newspapers. Many newspapers will be archived and available for later use (for example, via ProQuest, see the ProQuest archive of Historical Newspapers: The Guardian (1821-2003) and The Observer (1791-2003)). However, readers’ comments on the content and editorial perspectives presented in op-ed articles that are posted on the internet as material specific to a newspaper’s online edition are far less likely to be preserved as a cumulative resource. Indeed, this highly interactive register, unique to online environments, is peripheral to the editorial content of the news it accompanies. At the same time, it is a new source of speakers’ communicative behaviour and habits, and as such a potentially rich new data source. The highly contingent nature of these more ephemeral data sources means that the research carried out using them must take place more or less in real time, and that researchers need to create their own archives.

Spoken language constitutes a different kind of access issue in that it is typically reduced to the written medium in corpora and other digital databases. The reasons for this are partly practical: to facilitate searches by enabling the researcher to use the same set of tools to retrieve data from mixed-medium corpora. However, copyright and personal privacy issues may also prevent the inclusion of actual spoken data in corpora, or make it necessary to anonymize the spoken data (Hasund 1998).

The spoken record is emphasized by Simo Ahava, who discusses the procedures that need to be adopted to combine sound and text in an electronic corpus. A key issue that emerges in the regional dialect material he explores is how to solve the basic conflict between the primacy of the audio material and the ease of data retrieval from written text. He notes in a striking comment that the Helsinki Archive of Regional English Speech (HARES) and the Helsinki Corpus of British English Dialects are “both only interpretative devices intended to make best use of the audio by whatever technological means have been available during corpus compilation”. He points to a major problem in the analysis of spoken data that could be extended to other data sources; this is the tendency of researchers to treat the actual data source as secondary to the processing of those data. For example, in constructing a corpus of spoken language for reading rather than auditory access, it is possible to accord primary importance to the transcription of the data in place of easy access to the spoken data of which the transcription is simply (but not unproblematically) a record.

3. Towards more visual resources

Despite their differences in media, structure, mark-up, preservation, and availability, there are a number of issues shared by research based on empirical sources. Even structured corpora are seldom fully balanced. This is typically the case with diachronic corpora, where data have been preserved randomly and can undergo significant changes over time. In order to be able to interpret their findings, corpus users need to know not only the structure of their corpus but also the distribution of the corpus data over time. As shown by Harri Siirtola, Terttu Nevalainen, Tanja Säily and Kari-Jouko Räihä, one way to enhance the use of historical corpora is to explore their extralinguistic composition and linguistic structure in visual terms. These techniques, applied to the Parsed Corpus of Early English Correspondence (PCEEC), range from mosaic, scatter and bean plots to two-dimensional density plots and tag clouds. They are found particularly useful as a means of exploring the diachronic distribution of the kinds of data included in a corpus before its linguistic features are studied in more detail.

Visualization techniques can similarly be used to show the metadata included in a database or a corpus, and these metadata are relevant for the interpretation of particular research questions. Maps, for example, provide a means of anchoring the data sources investigated in their localities of origin. Philip Shaw draws maps to show the regional scatter of the Anglo-Saxon coin finds he analyses, and so does Simo Ahava to indicate the provenance of the spoken data in the HARES corpus. Maps are similarly introduced by Joan Beal and Karen Corrigan to locate their regional corpora of Tyneside English and by Nuria Yáñez-Bouza in her work on the provenance of English 18th-century grammars and grammar writers. Beal and Corrigan explore how the varying effects of time and place on language use can be retrieved from electronic databases, and Yáñez-Bouza is concerned with these issues in relation to descriptive and prescriptive comments on language use.

Maps lie at the heart of Chris Montgomery’s work on perceptual dialectology, which explores to what extent lay people can systematically recognize regional accents. The study is based on an experimental procedure which produces a rich data source the researcher needs to come to grips with. To create visual representations of the data generated by his experiments, Montgomery develops a set of techniques for drawing “starburst” charts. These visualization techniques considerably facilitate the tasks of exploring the data and estimating factors like the magnitude of accent placement errors.

4. New data, different questions

The identification in new places, objects and media of materials that can yield fresh data for linguistic research raises the possibility of asking new and different kinds of questions about language. A number of the papers pursue this possibility. The exploration of the pictorial interpretation of linguistic datasets, as exemplified in the contributions by Shaw, Montgomery and Siirtola, Nevalainen, Säily and Räihä, raises fresh questions along the way. For example, Philip Shaw’s examination of the range of spellings used to represent the personal names of kings in their coinage reveals that moneyers very likely had preferred spellings to distinguish their coinage from that of other moneyers. This finding raises the prospect of investigating the roles of agency and ownership in using variation in name representation in order to mark origin and provenance of a material object.

Chris Montgomery’s starburst charts, which capture visually the difference between where people locate dialects and their actual geographical provenance, invite the exploration of how far these data can illuminate connections between informants’ attitudes to dialects and their apparent sense of geography. This kind of investigation necessitates the mapping of spatial relations onto cognitive and affective relations. The visualization techniques explored by Harri Siirtola, Terttu Nevalainen, Tanja Säily and Kari-Jouko Räihä allow the researcher literally to see gaps as well as patterns in the materials that are the basis for research data and potentially explore their effects on the research designed in consequence. Simo Ahava draws attention to variant forms normally edited out from corpus transcripts, and shows how an audio corpus can reveal new structural patterns in spoken dialect grammars.

Joan Beal and Karen Corrigan subject established electronic corpora developed for particular sociolinguistic projects to fresh investigations of language variation and change. Specifically, they mine the databases for comparative evidence of syntactic variation despite the great variability of the size, structure and materials of the databases. This work demonstrates that data may be created out of diverse linguistic materials.

Sue Burnes’ investigation of the types of metaphor deployed in British and French newspaper coverage of elections departs markedly from mainstream corpus linguistic analyses. She constructed a database of conceptual metaphors with French and English examples from the newspapers as a basis for exploring how metaphors map between specific events, such as political elections, and the broader notion of conflict.

Nuria Yáñez-Bouza tackles the question of the geographical reach and impact of the normative grammar craze in the second half of the eighteenth century from a novel perspective. She examines the distribution of book publication in order to question the impact of geography on language change.

Susan Fitzmaurice’s study of the linguistic assertion of national identity by commentators in response to online news stories raises the question of how to tackle the investigation of the relationship of language and speaker variables when the linguistic data cannot be reliably tied to their producers. It is possible that in treating internet language as data, the traditional social variables used in sociolinguistic study need to be effaced by the discourse and ideological environment in which the language is produced. In short, it forces us to re-evaluate the importance of the notion of authenticity as far as linguistic data is concerned.


Some of the ideas and data explored in a number of the papers collected in this volume were first shared at a one-day conference on data, funded by the Leverhulme Trust in 2008 at the University of Sheffield. That conference was the culmination of Terttu Nevalainen’s semester-long visit to Sheffield and collaboration with Susan Fitzmaurice sponsored by the Leverhulme Trust. The activities during that busy semester included intensive postgraduate and research seminars at which the problems of data in the study of language change took centre stage. Terttu also delivered a series of lectures on provocative questions of linguistic analysis and research methodology, inspiring the development of this volume. We are pleased to be able to present some of the results of that visit together in eVarieng, a forum that reflects the commitment of the researchers to meeting the challenge of exploring new kinds and sources of data.


[1] For some of the ongoing debates in the field, cf. the special issue of the International Journal of Corpus Linguistics 15:3 (2010).



Davies, Mark (2010). “The Corpus of Contemporary American English as the first reliable monitor corpus of English.” Literary and Linguistic Computing (2010), doi: 10.1093/llc/fqq018.

Hasund, Kristine (1998). “Protecting the innocent: The issue of informants’ anonymity in the COLT Corpus.” In: Antoinette Renouf (ed.), Explorations in Corpus Linguistics, 13-28. Amsterdam & Atlanta: Rodopi.

Kehoe, Andrew & Matt Gee (2007). “New corpora from the web: making web text more ‘text-like’.” In: Päivi Pahta, Irma Taavitsainen, Terttu Nevalainen & Jukka Tyrkkö (eds.), Studies in Variation, Contacts and Change in English, Volume 2: Towards Multimedia in Corpus Studies.

Michel, Jean-Baptiste, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, William Brockman, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak & Erez Lieberman Aiden (2011). “Quantitative analysis of culture using millions of digitized books.” Science, 14 January 2011, Vol. 331, 176-182. doi: 10.1126/science.1199644 (published online ahead of print: 16 December 2010).