Studies in Variation, Contacts and Change in English
Volume 14

Principles and Practices for the Digital Editing and Annotation of Diachronic Data

Edited by Anneli Meurman-Solin, Research Unit for Variation, Contacts and Change in English (VARIENG), and Jukka Tyrkkö, University of Tampere

Publication date: 2013


Part I: On the evolution of three major corpus projects in English historical linguistics

Rissanen, Matti & Jukka Tyrkkö
The Helsinki Corpus of English Texts (HC)

The compilation of the Helsinki Corpus of English Texts (HC) was initiated in the early 1980s and the corpus was completed and publicly distributed in 1991. Its size is c. 1.5 million words, and it covers the periods from Early Old English to the end of Early Modern English, i.e., to the beginning of the eighteenth century. The Corpus is structured chronologically and by sociolinguistic, dialectal and genre-based parameters. After two decades, it is still used in various parts of the world as a “diagnostic” corpus giving useful indications of the first thousand years of the development of the English language. The results given by the Helsinki Corpus can be easily supplemented from other larger and/or more focused historical corpora compiled in Helsinki and elsewhere.

Nevala, Minna & Arja Nurmi
The Corpora of Early English Correspondence (CEEC400)

The Corpus of Early English Correspondence started as one 2.6-million-word corpus and has over the past two decades developed into a corpus family covering 400 years of English letters from the beginning of the fifteenth century to the end of the eighteenth. CEEC400, as the current version is called, contains 5.2 million words of personal correspondence written by Englishmen and Englishwomen from all the literate social ranks. The corpus has been designed to be as socially representative as possible, and variables taken into account in compilation included writer’s gender, social status, and geographic origin, as well as the relationship between writer and recipient. While the corpus is based on edited letter collections, the compilation team has endeavoured to discover reliable editions presenting the language of the original manuscripts in as authentic a way as possible. The corpora are accompanied by a sender and recipient database, which contains information about the social backgrounds of the informants as well as a letter database recording details of each letter. Some of this is also coded into the corpora (e.g., relationship between writer and recipient). A part of the corpus (PCEEC) has been linguistically annotated and released for the use of the wider scholarly community.

The research carried out using the corpus started with the basic question: is it possible to apply the methods of present-day sociolinguistics to historical data? Once this question was answered in the affirmative, the scope of research questions could be widened. Within the compilation team approaches have ranged from stratificational and interactive sociolinguistics to socio-pragmatics, but the corpus has also provided suitable material for more linguistically focused studies by scholars around the world.

Taavitsainen, Irma & Päivi Pahta
The Corpus of Early English Medical Writing (1375–1800) – a register-specific diachronic corpus for studying the history of scientific writing

This article deals with the Corpus of Early English Medical Writing (CEEM) that consists of Middle English Medical Texts (MEMT), Early Modern Medical Texts (EMEMT), and Late Modern English Medical Texts 1700–1800 (LMEMT). CEEM is a register-specific diachronic corpus, designed to serve as research material for the diachronic study of English as the special language of science and medicine in its disciplinary embedding. The corpus contributes to the new digital infrastructure for linguistic research by making available a carefully selected body of texts. The first two sub-corpora have been publicly available for a number of years, and have already proved useful not only for linguistic research but also for philological, medical and cultural study of the register. CEEM reflects a broad view of science and medicine with fuzzy borders as it takes the medieval, early modern and late modern views into account. MEMT contains a great deal of material that was earlier unknown to researchers and pushes the timeline of English scientific writing back to the large pan-European boom of the vernacularization of science that began in the late medieval period (c. 1375–); traditionally the beginnings of English scientific writing were placed in the Royal Society period in the seventeenth century. EMEMT introduces several novel features to corpus design and makes contextualization and multimodal approaches possible. It also includes normalized versions of texts for the application of advanced corpus linguistic tools. In the late modern period the volume of available material increases considerably. This poses an interdisciplinary challenge and collaboration with medical historians is necessary to ensure representativeness of corpus data. LMEMT will follow the principles set by earlier work and provide a representative corpus of c. 2 million words.

Part II: Material features in manuscripts. Correspondence and trial proceedings in focus

Meurman-Solin, Anneli
Visual prosody in manuscript letters in the study of syntax and discourse

The study draws on the author’s experience as the compiler and annotator of the Corpus of Scottish Correspondence (CSC), which comprises royal, official, and family letters dating from 1500–1715 written by informants representing the various areas of Scotland. It defines the concept of “visual prosody” and illustrates how information about such apparently purely non-linguistic features as variation in the realisation of punctuation marks, in spacing, and in character shapes may be indispensable in the linguistic analysis of historical documents. The discussion sets off with illustrations of how the editing principles and practices have changed since the major collections of Scottish correspondence were published as part of family memoirs in the nineteenth century. Since the CSC corpus consists of diplomatically transcribed original manuscripts exclusively, the data permits us to see that there is much more significant information available, for example in utterance boundaries, for reconstructing patterns of syntax and discourse, even though the widely spread practices of digital editing applied to the majority of modern corpora have left it unannotated. The study suggests that a closer examination of features of visual prosody may have important implications for how we write the grammars of prose based on methods of philological computing.

Walker, Terry & Merja Kytö
Features of layout and other visual effects in the source manuscripts of An Electronic Text Edition of Depositions 1560–1760 (ETED)

An Electronic Text Edition of Depositions 1560–1760 (ETED) comprises faithful manuscript transcriptions of 905 depositions from various regions in England, presented in a collection of searchable computer files in different formats. In this article we first briefly describe the features of layout rendered in ETED and then focus on those that have not been represented in the edition. Among the features investigated are the use made of empty space on the manuscript pages, space lines, alignment, different types of indentation, and features of handwriting, e.g. the use of font changes and large and/or embellished letters.

We distinguish two types of depositions in ETED, church court documents and criminal court documents. The former were copied down in the locality of the court in question at various stages of the court process while the latter were compiled in different localities and sent up to the appropriate local, county or regional court. We show that this difference is one of the factors influencing the choices that scribes made in their layout practices. We also show that while there is a great deal of variation in the visual effects used, it is also clear that scribes aimed at distinguishing the different components of the depositions and highlighting important information, e.g. the date of the document and the names of the parties involved. Finally, we comment on the benefits of editors coding for layout features in the interest of computerized searches and other research purposes.

Sairio, Anni & Minna Nevala
Social dimensions of layout in eighteenth-century letters and letter-writing manuals

In this article, we present four case studies which explore the influence of letter-writing manuals on private letters written in eighteenth-century England. We focus on the ways in which layout features and contemporary instructions may be considered to reflect the social relationship between the writer and the recipient, and on the aspects of gender and social status in particular. We are particularly interested in the use of space and the positioning of the different parts of the letter (the salutation, the body of the letter, and the subscription). Our purpose is to look at such features as the correlation between the layouts and formal instructions; the correlation between the use of paper and the social status of the recipient; and the overall influence of a relationship between the writer and the recipient on layout choices in eighteenth-century letters.

The sample of four letters dates from the 1740s to the latter half of the 1780s, and the analysis is based on photographs and transcripts made of the original manuscripts in the Montagu family papers. The writers and the recipients belong to the same social network, in which Elizabeth Robinson Montagu (1718–1800), a well-educated literary hostess of the time, appears to have been a central figure. The letters chosen for the analysis consist of one written to Montagu in her youth by her close friend Lady Margaret Bentinck, the Duchess of Portland, in c. 1742, a letter Elizabeth Montagu wrote to her husband in 1757, a letter which a genteel Bluestocking woman wrote to a male aristocrat and fellow Bluestocking in c. 1771, and a letter Elizabeth Montagu received from her heir and nephew in c. 1786.

The study shows that letter-writing manuals promoted the recognition of variability in the status of the correspondents. Instructions given for written communication were particularly sensitive to changes within the social hierarchy. The four eighteenth-century letter-writers were undoubtedly fully competent in what Whyman (2009) refers to as epistolary literacy. Perhaps some of the rules were outdated or otherwise ignored, as social groups formed their own conventions. But some of the divergence in the material seems to be conscious. A close relationship between letter-writers clearly overrules certain norms of correspondence in the eighteenth century, which would be a new development from the previous centuries.

Meurman-Solin, Anneli
Features of layout in sixteenth- and seventeenth-century Scottish letters

The study illustrates some features of layout in the letters of the Corpus of Scottish Correspondence (CSC), which comprises royal, official, and family letters dating from 1500–1715 written by informants representing the various areas of Scotland. The focus is on such material features of letters as size, ranging from a short note to multi-page reports or narratives, and the general layout, for example, the width of margins. Somewhat more extensively, the study discusses the positioning of the conventionalised components of epistolary prose, such as terms of address, time and place of writing, and greetings to other members of the addressee’s family or other networks. Since the CSC is based on diplomatically transcribed original manuscripts of letters, it also provides valid data for studying one of the most intriguing visually observable features, the use of spacing in various functions. Even though there is a lot of information about the textual and social significance of spacing, frequently also referred to in contemporary letter-writing manuals, the CSC data shows that further study is required to fully understand the complex interplay between the use of socially motivated signs of politeness and deference and the evolution and establishment of a particular register of writing.

Part III: Paratextual properties in early printed title-pages

McConchie, R. W.
Some reflections on Early Modern printed title-pages

Not all that appears in a printed book is the creation of the author, least of all the title-page, over which the author may have no control at all. My chapter deals with elements of the title-pages of earlier printed books, and the way in which they might be outright language elements, relate to language in some way, or suggest a linguistic meaning or interpretation. A complex of features is employed by publishers and printers, ranging from words to typography and to images, all of which might find a place in a taxonomy of such features. While title-pages are often very conventional in their presentation, they also introduce a new and often unique work to a reader, and are thus intended to have an impact beyond that of a typical page of the work they introduce. Not only that, owners of books also leave their own individual markings and annotations on title-pages particularly. This chapter surveys at least some of these features, suggesting the ways, both obvious and subtle, in which they convey meaning. The corpus annotator who wishes to account for more than simply words as linguistic items and syntactic structures will need to be sensitive to the paratext of a book.

Ratia, Maura
Investigating genre through title-pages: Plague treatises of the Stuart period in focus

The present article deals with the genre of plague treatises of the Stuart period. The focus is on title-pages and how information about the genre can be gained by investigating them. In the study of historical texts and genres, title-pages as well as other paratextual elements have only recently started to attract attention from scholars. Plague treatises are intriguing as they typically combine elements from both medical and religious writing. I chose to look at texts that contain religious discourse on their title-pages to see how it correlates with the content. Qualitative assessment of the material was done by analysing textual labels featured on the title-page and the type of discourse these labels referred to. The results showed that textual labels, especially with regard to defining medical content, were quite accurate. In contrast, religious argumentation was at times only subtly advertised. The linguistic analysis was complemented by the visual, i.e. examining highlighted items on the title-page. The most prominent item was the topical label plague, but primary textual labels were also highlighted, which suggests that textual labels were considered important. In contrast, headline titles and religious genre labels were generally not highlighted, and were thus regarded as secondary or additional elements in the text and in the genre of plague treatises.

Part IV: New approaches to digital editing

Honkapohja, Alpo
Manuscript abbreviations in Latin and English: History, typologies and how to tackle them in encoding

This article discusses the theoretical and practical problems related to encoding manuscript abbreviations in TEI P5 XML. Encoding them presents a challenge, because the correspondence between the orthographic sign indicating abbreviation and what the sign stands for is more complex than in non-abbreviated words. The article consists of a review of the terminology used to describe the abbreviations, looking at their history from antiquity to abolition and taxonomies of abbreviations in paleographical handbooks between 1745 and 2007. It discusses the editorial treatment of abbreviations in printed editions and relates it to the terminology used in the handbooks, offering criticism of that terminology from a linguistic and editorial point of view and considering how best to represent the abbreviations in TEI P5 mark-up. Traditional taxonomies of abbreviation divide the abbreviations into groups based on the shape of the abbreviating symbol or the position of the abbreviated content. Some of the distinctions, such as the one between contractions and suspensions, are not at all relevant for digital encoding. However, the system outlined in this article allows for tagging them in a way which will enable quantitative corpus study of them. The data comes mainly from The Trinity Seven Planets, a TEI P5-based digital edition.

Meurman-Solin, Anneli
Taxonomisation of features of visual prosody

The study suggests that we require an annotation theory and an annotation language for a wider range of features of texts than those traditionally annotated in digital corpora. The focus is on observations the writer made in the long process of editing digitally and annotating original manuscripts of Scottish letters dating from 1500–1715. While, for example, the identification of linguistically significant features of visual prosody is a challenge as such, a valid reconstruction of their variational space in synchrony and diachrony and the creation of a variationist taxonomy for annotating them is an exercise in which corpus linguists do not yet have a lot of experience. Such simplistic dichotomies as the polarisation between “default” and “marked” reflect the tendency to let frequency or distinctiveness influence the annotation language; yet concepts such as these are too crude to be theoretically valid or pragmatically useful. After discussing options such as taxonomies based on statistical salience or prototype theory, the study argues for three taxonomisation principles in the annotation of paratextual features: the parameter values are purely descriptive; they are dynamic in the sense that the values are sensitive to variation and change even over a long time-span; and the values contain information provided by cross-disciplinary research. Thus, the theoretical validity of a taxonomy is questionable if description and interpretation are intermingled in one way or another; no feature can be assumed to be a permanent member of a particular category, switching category being a well-documented phenomenon; variables depicting the continuum of wider contexts relevant in the history and production circumstances of a particular text permit a valid interpretation of the findings. In taxonomies of this kind, the annotation system is based on ordering the descriptive components into strings of properties. A system like this faithfully records the inherent potential for variation and change within each component and also reflects patterns of interrelatedness between components.

Claridge, Claudia
From page to screen: The relevance of encoded visual features in the Lampeter Corpus

Visual aspects of texts are potentially relevant both for textual interpretation and for general linguistic insights. Contemporary evidence shows that Early Modern English writers and printers were aware of this, and thus may have taken special care with at least some of these aspects. The visual aspects treated here, as represented in the Lampeter Corpus, are title-page layout and typography, typographical changes in running text, in particular the change to blackletter type, and end-of-line hyphenation or word separation. While the latter two are reliably retrievable from the Lampeter Corpus annotation, page layout cannot be comprehensively studied on the basis of the existing corpus annotation scheme.

Tyrkkö, Jukka, Ville Marttila & Carla Suhr
The Culpeper Project: Digital editing of title-pages

With a few notable exceptions, traditional corpus linguistic methods have focused on linguistic rather than peritextual features, while the enterprise of digital editing has paid more attention to the latter. Over the last decade, the rising prominence of the field of digital humanities has served to bring together these two disciplines by means of common annotation schemata and, more crucially still, has fostered a renewed sense that a comprehensive understanding of the primary source is useful to both linguists and archivists. Today, by means of careful digital editing, many contextual and co-textual features of the artefact can be recorded in a searchable and quantifiable format along with the text itself.

This article presents a TEI XML-based system of peritextual annotation developed by the authors as part of the Gatekeepers of Knowledge project. In addition to discussing the annotation model in some detail, we present some of the first findings of a pilot study on the title-pages of books by the seventeenth-century medical author Nicholas Culpeper. The pilot project will demonstrate the usefulness of the system of annotation and the preliminary findings will support the observation made in earlier scholarship that Culpeper’s main publishers can be effectively divided into two competing branches.