Series title: Studies in Variation, Contacts and Change in English
Volume 14 – Principles and Practices for the Digital Editing and Annotation of Diachronic Data
Publication date: 2013


Anneli Meurman-Solin and Jukka Tyrkkö

This volume aims to offer a useful and necessary evaluation of the current state of the art in the principles and practices of digital editing, seen from a corpus linguistic and philological perspective. While it offers a window into the evolution of scholarly perspectives in the Research Unit for Variation, Contacts and Change in English (VARIENG) in Helsinki over the last two decades, it also reports on work by highly experienced corpus compilers in some other research communities. Research progresses in ebbs and flows, and it appears that the time has once again come for paratextual features to be included in linguistic study. Today, the research questions in this branch of study can be operationalised in terms of searchable metadata and detailed annotation and taxonomisation of visual features, available for large-scale diachronic and synchronic studies like never before. At the same time, however, it is prudent to keep in mind that the new methodologies are best used in a theoretically and methodologically well-documented and transparent fashion. The articles in this volume contribute to that end, highlighting some of the principles the authors and editors have come to consider useful and the practical applications of those principles in digital editing.

The introduction is structured into four parts: Part I (Meurman-Solin) describes the shared goals of the articles on three major corpus projects at the VARIENG research unit; Part II (Meurman-Solin) focuses on the studies which examine a variety of such visual features of historical manuscripts the annotation of which can provide relevant evidence for linguistic analysis; Part III (Tyrkkö) discusses paratextual properties of early printed texts ranging from title-pages to texts representing quite distinctive genre traditions; and Part IV (Tyrkkö) introduces both established and recently launched projects which aim to develop new principles and practices for the annotation of the material features of historical texts in a digitally retrievable way. The shared goal in observing the material features of historical texts is to identify those which are significant in the analysis of linguistic features and therefore should be taxonomised and annotated in databases primarily compiled for language research.

The articles of this volume have been written by scholars who have personally compiled corpora containing ‘unconventional data’ (Beal, Corrigan & Moisl 2007) and are therefore thoroughly acquainted with the complex problems that compilers have to solve in the digital editing and annotation of original historical texts. While dealing with the large degree of variation attested in such texts is a challenge as such, deciding what to include in a digital edition is an equally intriguing problem, seeing that, besides what is traditionally seen as language, a wide range of non-linguistic features are also immediately observable in examining any original manuscript or early printed text. Information about these features has only randomly been provided in the existing corpora (cf., however, Claridge in this volume).

In the twenty-first century, historians have written quite extensively about historical documents. For example, as regards historical correspondence, there is a lot of recent research on the so-called material features of early letters (e.g., Bannet 2005, Barton & Hall 1999, Daybell 2001, 2012, Daybell & Gordon forthcoming, and Schneider 2005; for information on the material readings of early modern culture, see also Daybell & Hinds 2010). However, these studies rarely address the question of what linguistic significance their findings might have. Thus, Daybell (2012), the most recent work, does not seem to be aware of the abundant research by linguists in recent years, nor does he mention activities among corpus linguists in the field his own research focuses on. Similarly, although historical linguists working on language data derived from early printed books make passing references to the books themselves, few afford any appreciable amount of attention to the production circumstances of those books or to the paratextual features of the printed copy. Thus, despite the recent emergence of the cross-disciplinary field of digital humanities, it is safe to say that at present the two research communities of linguistics and material text studies are not yet in regular contact with each other, even though they draw on the same data.

Among linguists, the cross-disciplinary approach has been adopted somewhat more widely, as illustrated by studies such as Fitzmaurice (2002), Nevala (2004), Nevalainen (2001), Nevalainen & Raumolin-Brunberg (1996, 2003), Nurmi, Nevala & Palander-Collin (2009), Sairio (2009), Suhr (2011), and Diemer (2012). Also worth mentioning are new projects such as Paratext on the Page at the University of Turku.

The general aim of the volume is to remind historical linguists of the complexity of historical data (see for example Lass 2004), part of that complexity perhaps getting lost in the digitisation process. In the linguistic analysis of historical texts we constantly draw on our knowledge of the evolution of writing practices in the various grammars and genre traditions, including how these are reflected in the material features of writing and printing. The problem is that features reflecting the culture and practices of writing in the original manuscripts and early printed texts are either not reproduced at all in the corpora or remain unannotated in a way that would make this information digitally retrievable.

Now that diachronic corpora provide us with large quantities of data, errors resulting from the misinterpretation of insufficiently contextualised linguistic items may be multiplied in the analysis. In other words, if we digitally edit and annotate only the language of texts, and reproduce non-linguistic features such as layout, script type, and contracted and abbreviated word-forms quite imprecisely, or not at all, the consequences may be serious in both quantitative and qualitative analysis. We think that, in improving diachronic corpora further, a much wider range of material features should be described, taxonomised, and annotated and that sophisticated tools should be developed for retrieving this information as directly linked to linguistic items.

Since for example limitations of space or technicalities related to the reproduction of historical documents as digital images or the creation of hyperlinks to other online resources do not impose restrictions on the authorial and editorial process, online publishing in the e-series Studies in Variation, Contacts and Change in English provides us with an excellent forum for offering a particularly rich variety of illustrations in discussing the various topics. We hope that these illustrations will offer useful examples not only for researchers but also for students in the field of historical linguistics.

Part I: On the evolution of three major corpus projects in English historical linguistics

The syntheses of three major corpora compiled at the VARIENG research unit at the University of Helsinki draw on the long-time experience of their compilers, digitisers, and annotators, who specialise in creating carefully structured databases representative in long diachrony. The general goal of the articles is to describe what theories and methods the compilation principles and practices are based on and how the compilers’ thinking and technologies have developed over time. The leitmotif in all the articles is the compilers’ awareness of the protean nature of corpora, continuous evaluation, change, and expansion being part and parcel of corpus compilation processes. We hope that the readers will find it useful to study the compilers’ own assessment of how the original goals have been achieved, why particular changes have been made, and what they think can be named as particular advantages or disadvantages of a particular corpus. The readers are also reminded of matters related to time and economy in corpus compilation projects.

The articles provide information about the following properties of the databases:

  • corpus type (e.g. multi-genre/single genre; synchronic/diachronic);
  • size of the database;
  • time period;
  • place or region the data originates from;
  • genre(s) and text type(s);
  • availability (CD-ROM, online, restricted access, etc.) at present;
  • description of where, by whom, and how the database has been used internationally;
  • research literature based on the database.

The reader is reminded of the fact that more detailed information about the corpora is provided by the CoRD site.

The critical assessment of each corpus aims to provide information of the following kind:

  • the aim of the corpus project as defined when the compilation process began;
  • comments on this aim and the way it relates to how the corpus has been used;
  • an evaluation of the present relevance of the database in general terms;
  • an evaluation of the representativeness of the database in more detail; a summary of the compilers’ views on where the caveats are in the database; the compilers may also report on how scholars have assessed the representativeness of the database and what other data sources in their view complement it appropriately;
  • comments on comparability between the database and other corpora;
  • an evaluation of the quality of the texts from the perspective of text history;
  • an evaluation of the authenticity of the texts (e.g., non-autograph letters or copies, instead of original letters, reducing validity as data; in early printed works, the printers’ policies and practices affecting the language, etc.);
  • an assessment of how the language-external variables coded into the corpus have succeeded in guiding the corpus users in their interpretation of their linguistic findings (e.g., in corpora structured by language-external variables, there is the risk of presenting claims about the conditioning of genre or gender, even though texts representing a particular genre or female informants as a group may form internally quite heterogeneous data categories);
  • an assessment of how general practices in a particular tradition of writing may affect the data, especially the balance required for a statistically valid account of salient features (e.g., in using correspondence as data, it is necessary to take into account such general practices as the widespread preference of secretary hand in sixteenth-century formal letters written by members of the higher ranks and professionals or the much earlier adoption of italic among circles close to the royal court, since both of these practices have a major influence on the choice of linguistic variants);
  • information about other corpora which usefully complement the database, especially those that have been compiled in recent years;
  • information about forthcoming corpora, those compiled by colleagues in Helsinki or abroad; how these relate to the earlier ones (e.g., the forthcoming one is directly comparable, revised, larger, applies new compilation and digitisation principles, has been improved as regards representativeness by region, fills a gap, is focused on a particular genre, introduces a new annotation system);
  • views on tasks for future work in corpus linguistics and philological computing.

The above structure for describing a corpus may provide useful guidelines for introducing other corpus projects.

In their article Rissanen and Tyrkkö trace the evolution of the Helsinki Corpus of English Texts (c. 750–1700) from the pioneering diachronic corpus structured by a wide range of language-external variables to its updated XML version, which came out in 2012. Nevala and Nurmi describe the CEEC family of corpora, providing information about the ongoing expansion and annotation of the databases of early English correspondence (1403–1800) and how thorough knowledge of English social history has permitted the research group to create a finely-graded taxonomy of geographically, demographically, and socially relevant variables. Taavitsainen and Pahta give a thorough account of the compilation principles and practices of the MEMT family of corpora of medical texts (1315–1800), showing, for example, how studies based on these corpora have permitted the redating of the development of scientific writing and the understanding of how rich and multi-layered the history of medical writing is, for example, as regards genre and register types.

Part II: Material features in manuscripts. Correspondence and trial proceedings in focus

In discussing a range of non-linguistic features of letter-writing such as materials and tools of letter-writing and the social significance of layout practices and choice of script type, Daybell (2012: 2) uses the umbrella concept of “the material rhetorics of the manuscript page”. His work, based on the examination of over 10,000 manuscript letters (Daybell 2012: 85), highlights the importance of keeping in mind the social practices of letter-writing that, for example, complicate matters related to such highly relevant questions as the identification of authorship. The same writer may use different script types, the social distance between the writer and the addressee or the level of formality or other circumstances influencing the choice; variables such as these also influence the layout of letters (Daybell 2012: 86–95) and, most importantly, their language.

While historians such as Daybell describe the material and social circumstances of letter-writing in great detail, their conclusions are usually not based on a statistical analysis of the findings, nor do they suggest criteria for taxonomising the variation they report on. For example, the remarks on the positioning of the place and date of writing in letters are general (see Daybell 2012: 104–105) and do not allow us to trace the evolution in this particular practice from the position as part of the body of the text at the end, to a separate position at the end, and finally to the fixed position on the right at the beginning, a development recorded in the Corpus of Scottish Correspondence (Meurman-Solin b in this volume). Similarly, the weakening of the significance of social signs and their replacement by conventionalised ways of structuring a text according to genre-related expectations is usually not recorded by historians in sufficient detail to permit us to identify patterns of variation and record the pace and direction of change (cf. Nevala 2004, Sairio & Nevala in this volume).

Besides the provision of statistically significant evidence, a major difference between research conducted by historians and by linguists at present is the fact that, among linguists, there is a vivid interest in developing principles and practices for annotating paralinguistic features, so that they can be retrieved in computer-assisted research. By contrast, book historical and other non-linguistic philological scholarship is often more concerned with discussing paratextual features separately from texts, which means that the semantic and pragmatic effects of typography and layout cannot be evaluated in a systematic way.

The articles discussing manuscript data in this volume deal with the visual features of trial proceedings (Walker & Kytö) and correspondence (Sairio & Nevala and Meurman-Solin a and b), all four articles drawing on the long-time experience of their authors in the transcription, digitisation, and study of historical manuscripts. Walker and Kytö describe the layout features and visual effects in manuscripts recording depositions presented in church court and criminal court cases, edited digitally and annotated by the Electronic Text Edition of Depositions 1560–1760 (ETED) team. These recorded deponents’ testimonies represent the various regions of England. The authors have written widely on the texts of the ETED corpus (see the references of this article); as regards the recording procedures of trials applied by scribes, see also Huber 2007.

Walker and Kytö state that the aim of the ETED project was “to produce an edition that was faithful to the manuscript texts insofar as this – within the scope of the project – was technically possible and meaningful for linguistic study while also enabling the edition to function as a searchable electronic corpus”. In their article they provide important information about their principles and practices in the selection of layout and other visual features they have decided to annotate. The readers will certainly find it useful to study the two annotation systems, the use of particular symbols (e.g., a tilde or special fonts used to reproduce characters indicating abbreviations) and the use of editorial comments in angle brackets, the two systems complementing each other very nicely.

There is less variation in the layout features of church court depositions than in the criminal court records, as the former were preserved in bound books or bundles, and they were written down by fewer scribes. Moreover, these scribes presumably benefited from model documents and other instructional material. From the perspective of the goal of the present volume, a particularly interesting finding is that, since depositions are utilitarian records, the range of visual devices (for example, those used for indexing and organising information and highlighting the various components of depositions) is different from that attested in correspondence (Sairio & Nevala and Meurman-Solin a and b in this volume) and pamphlets (Claridge in this volume). The illustrations will permit the comparison of the font changes and the use of large and/or embellished characters in the depositions and in title-pages representing other genres (McConchie and Ratia in this volume).

Sairio and Nevala focus on the influence of letter-writing manuals on layout in a small selection of eighteenth-century private letters and examine the social dimensions of the practices reflected, for example, in the use of deferential space and the positioning of such standard components of a letter as the place and time of writing and the signature. Even though the manuals have played an important role in the evolution of this particular genre (e.g., Mitchell & Poster 2007), the authors record variability in how the rules are applied, pointing out education and degree of formality as some of the factors conditioning the writing practices. The norms may be ignored in letters where there is a close relationship between the writer and the recipient.

The findings in Sairio & Nevala can be compared with those in Meurman-Solin’s article on layout in the letters of the Corpus of Scottish Correspondence (CSC). For example, according to The Art of Letter-Writing (ALW) (1762), as discussed by Sairio & Nevala, “sending greetings in a postscript even to one’s friends might be considered ‘Levity’, or have ‘the Appearance of having almost forgotten them’ (1762: 17)” and shows “disrespect and indifference”, yet this practice is quite frequent in the Scottish letters. In fact, that the conventions in letter-writing change considerably over time is well evidenced by comparing the rules in ALW and the practices in the CSC letters dating from 1500–1715. This is reflected, for example, in leaving a large space between the body of the letter and the signature in numerous sixteenth- and early seventeenth-century Scottish letters, a practice which is much less frequent in late-seventeenth and eighteenth-century letters. The 2007 version of the CSC is shown not to provide sufficient evidence of when and how the structuring of the body of the text into paragraphs evolved, the ongoing extension of the corpus perhaps solving this problem.

One of the central concerns in this volume is to assess and discuss critically which visual features of manuscripts, traditionally left unannotated, can be shown to be quite relevant in linguistic analysis, and frequently even indispensable for producing a valid interpretation of historical grammar and lexis. Meurman-Solin’s study of features of visual prosody, such as punctuation devices, spacing, and marked character shapes, in the CSC corpus highlights their importance in the identification of structures of syntax and discourse. The study illustrates normalisation and modernisation practices recorded in earlier editions and compares them with the principles and practices of philological computing. Modernisation was considered acceptable quite widely due to the fact that correspondence was frequently published as part of family memoirs or similar works, and the editors assumed that the great majority of their readers would be either historians or other people interested in letters as historical documents. Meurman-Solin shows that using diplomatically transcribed original manuscripts as data, while also paying close attention to visually observable devices, permits the scholar to write a new grammar of epistolary prose, in principle, every time he or she reads a new idiolect. These idiosyncratic grammars then form the basis for the understanding of how letters were written in the sixteenth to eighteenth centuries as regards the sequencing of chunks of discourse, the choice of connective devices, the ordering of communicative acts, and the positioning of given and new information (Meurman-Solin 2011, 2013). Further thoughts on how various manuscript features affecting linguistic analysis can be translated into categories are presented in Meurman-Solin’s article on taxonomisation (Meurman-Solin c in this volume).

Part III: Paratextual properties in early printed title-pages

The so-called Printing Revolution started in Europe in the mid to late fifteenth century and the rate at which the new medium was adopted was prodigious, as is well-attested in book historical literature. In England, the first printing press was set up in London by William Caxton in 1476. Over the sixteenth century the number of printing houses increased quickly, as did the availability of books printed in English. The significance of mass-produced vernacular text to literacy and to the development of English is hard to overestimate. Not only did printing hasten the standardisation of English spelling (see Tyrkkö forthcoming), but it also introduced a rapidly growing segment of society to what written text looks like. One element of this new written standard was the title-page, a textual innovation of the late fifteenth century described by book historians such as Elizabeth Eisenstein (1979: I, 106) as “the most significant new feature associated with the printed book format”.

For the philologist, printed texts present a whole new set of variables to consider. The challenges associated with the uncertain genealogies of manuscripts are replaced by the equally complicated production circumstances of the early modern marketplace of books. As discussed in detail by authorities such as McKerrow (1967) and Bland (2010), the early modern printing house was a hive of activity with craftsmen of various descriptions working in concert to produce commercially viable products. Some aspects of the printed book, such as the title-page and the illustrations, fell almost completely into the remit of the printer and publisher (see Shevlin 1999: 57). In recent years, more and more attention has been afforded to aspects of the book-production process that have often gone unnoticed, especially by historical linguists. For example, increasing awareness of the role played by correctors in shaping the language of printed books will undoubtedly affect studies of Early Modern English that are based on printed sources (see Grafton 2011). The role of the author, traditionally seen as unproblematic, is thus paradoxically at once both central and curiously peripheral. For the linguist studying historical varieties of language, the author naturally stands at the centre of interest with all his or her sociocultural, regional, and idiosyncratic characteristics. Against that premise, to know that many of the features seen on the page were in actual fact collaboratively, if not solely, produced by the printing house raises the question of whether it is reasonable to interpret the primary data available to us as representing only the author’s use of language. This broader view of textual history challenges us to realise that only a cross-disciplinary methodological approach may permit us to draw conclusions about the language of texts of this kind.

While paleographers and philologists working on manuscripts have always been keenly aware of the importance of paratextual features, linguists working on printed primary data have typically tended to focus on only the text itself. This may be in part due to the apparent simplicity of printed text, a notion contested by all the relevant contributions in this volume, or, paradoxically, to the complexities of the production circumstances of printed texts, which often make the correct attribution of responsibility for specific features difficult. The common adoption of corpus linguistic methodologies in historical linguistics has arguably exacerbated the issue further by divorcing the corpus edition from the original text, preserving only the basic level of linguistic information and discarding all the features that are, or were, considered too difficult or even impossible to replicate on the computer screen. Accordingly, the two articles in Part III discuss the features of early printed title-pages, focusing in particular on the importance of typography and layout to linguistic and philological studies.

Even more so than choice of type, layout has been largely ignored in the linguistic studies of early printed texts. The article by McConchie begins by noting that while a vast amount of scholarship already exists on virtually all aspects of the history of the book, collaboration between bibliographers and historical linguists has not been particularly far-reaching. To remedy this state of affairs, McConchie discusses the significance of seemingly minute details of layout and typography using seven early modern title-pages as evidence and reveals the wealth of information hidden beneath the most obvious level of interpretation. The central concept here is the illocutionary force of visual features, which McConchie argues is evident in the use of a wide range of typographic features. The examples provided include shifts in type which are shown sometimes to supersede linguistic integrity, as in the case of proper names split into successive lines which undergo a type shift, and the use of blank lines and point size to highlight an author’s name. A digital edition that would omit such visual information would invariably mislead the scholar. McConchie also raises the issue of emblems and other graphic representations, showing that they, too, have great potential significance in the overall analysis of the printed page.

One of the primary functions of the early modern title-page was that of an advertisement. To attract potential customers and to communicate clearly the topic of the book, printers would often follow formulaic and genre-specific traditions in designing title-pages. The contribution by Ratia focuses on the relationship between the overt message communicated by a title-page and the genre of the book it represents, and asks whether this relationship was always a straightforward one or whether specific features may have been used to mislead the customer. For primary data of her case study, Ratia looks at a corpus of 15 plague treatises printed in the seventeenth century. Plague treatises form a distinct subgenre of early modern medical writing, and Ratia defines her primary data further by only including books that appear to convey religious overtones on the title-page. Using lexical choice and the size of type as indicators of affective language use, Ratia’s analysis shows that despite similarities, the texts belong to three distinct discursive types, only one of which is genuinely religious. The study also demonstrates how visual and textual prominence was given to specific lexical items for reasons such as emphasis and foregrounding, and that many of these practices were repetitive enough to be regarded as genre-indicating features. For example, while medical key words were afforded visual prominence, religious terms, although present in a secondary role, were clearly downplayed. Ratia’s analysis at once draws attention to the significance of typographic and layout features in the comprehensive analysis of the full linguistic sense of the title-page, and demonstrates how such conclusions would be impossible to reach if the researcher did not have access to the full paratext of the original books.

Part IV: New approaches to digital editing

In recent years, the emergent field of digital humanities has served to bring closer the previously distinct disciplines of corpus linguistics and digital editing. This has come about in part through shared computational methodologies and in part by the realisation on both sides that great benefits can be gained if unified methodologies are employed. The articles in the fourth and final part of this volume present new approaches in this area from the linguistic perspective.

One of the traditional concerns of manuscript studies has been the interpretation of abbreviations, abundant as they were in medieval scripts. The related matter of how best to handle such abbreviations in modern editions is the topic of Honkapohja’s article, in which the author takes a detailed look at the digital editing of medieval manuscripts. According to Honkapohja, one of the central challenges in the digital editing of manuscripts is that much of the scholarly terminology and, by extension, of the taxonomies it represents, can be traced back to the late nineteenth and early twentieth centuries. The use of these concepts in modern, data-driven corpus linguistic research is inherently problematic. As Honkapohja argues, “This requires ways of representing the data in such a way that it can be quantified and used as reliable evidence, and the traditional paleographical terminology is not necessarily the optimal way for approaching it”.

The study addresses the many theoretical and practical challenges posed by manuscript abbreviations. A historical overview of various types of abbreviations is provided first, followed by a review of the taxonomical treatment of abbreviations in standard reference works spanning 200 years and a discussion of the theoretical positions taken by editors and scholars on the topic. In the second half of the paper Honkapohja presents his own XML-based annotation system, based in part on the model developed by the Digital Editions for Corpus Linguistics project (see Honkapohja, Kaislaniemi & Marttila 2009). Honkapohja’s model for annotating the Trinity Seven Planets manuscripts is presented with considerable attention to detail, addressing both of the key issues identified by the author: the representation of the sign of abbreviation and the representation of the abbreviated content. The model serves both purposes well, thereby extending the applicability of corpus linguistic methods to manuscripts.
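The two issues named above, recording the sign of abbreviation and recording the abbreviated content, can be illustrated with a minimal sketch. It does not reproduce Honkapohja’s own annotation system, which is not quoted here; it assumes instead a generic TEI-style encoding in which <am> marks the abbreviation sign, <ex> marks the letters supplied by the editor, and <choice> pairs the two readings. The sample word, the tilde used as the sign, and the function name read_word are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample: one abbreviated word and one plain word.
# The element names follow TEI conventions (choice, abbr, expan, am, ex);
# this is an assumed encoding, not the scheme described in the article.
SAMPLE = """<text>
  <w><choice><abbr>com<am>~</am>on</abbr><expan>com<ex>m</ex>on</expan></choice></w>
  <w>profit</w>
</text>"""


def read_word(w):
    """Return (abbreviated form, expanded form, abbreviation signs) for one <w>."""
    choice = w.find("choice")
    if choice is None:
        # Unabbreviated word: both readings are identical, no signs recorded.
        text = "".join(w.itertext())
        return text, text, []
    abbr = choice.find("abbr")
    expan = choice.find("expan")
    signs = [am.text or "" for am in abbr.iter("am")]
    return "".join(abbr.itertext()), "".join(expan.itertext()), signs


words = [read_word(w) for w in ET.fromstring(SAMPLE).iter("w")]
for abbreviated, expanded, signs in words:
    print(abbreviated, expanded, signs)
```

Because both readings are kept side by side, a search tool could in principle query either the form as written or the expanded word, and count the signs separately from the words they occur in.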

There are two different but closely related issues that researchers have to deal with when it comes to paratextual features. The first of these is annotation itself, which in the most basic sense refers to the descriptive metadata that will add new layers of searchable and quantifiable data to the text itself. Examples of largely unproblematic layers of annotation include metadata on the document itself and descriptive metadata on the visual and material aspects of the original artifact. The more controversial issue is taxonomisation, by which we mean the division of descriptive data into groups based on well-argued and clearly communicated categories. In corpus linguistics, the latter task in particular has raised objections on the grounds that pre-determined categories may impose particular interpretations and as such guide research excessively. A classic example of a potentially controversial layer of annotation would be part-of-speech tagging, which necessarily requires that the tags are assigned following a specific taxonomy of word classes. On the other hand, it is important to make a distinction between descriptive and interpretative taxonomies, the former cataloguing distinct features of note, the latter assigning them with a semantic, pragmatic, or linguistic significance.

Meurman-Solin’s article on the taxonomisation of features of visual prosody addresses the topic by presenting a compelling set of arguments for why descriptive taxonomies of visual features are useful. Based on extensive experience, the article presents a context-independent theoretical framework for the underlying principles of taxonomy which can be applied to any corpus-linguistic or digital editing project that includes the visual component of the historical documents in question. According to Meurman-Solin, “The main challenge in annotating visual prosody is how to taxonomise, that is, how to translate a large degree of variation into retrievable variant types without losing information which is relevant from the perspective of the perceived range of research topics”. An important underlying principle of the taxonomical theory Meurman-Solin presents is the claim that “an annotation taxonomy only functions as a tool in data retrieval”. Discussing polarisation, frequency, and membership of a particular discourse or text community as three key dimensions of taxonomy, Meurman-Solin shows that the descriptive system used for visual paratext does not need to, and indeed should not, extend to semantic interpretations of the features’ functions or meanings.

As with manuscript studies, one of the challenges in large-scale empirical studies of early printed paratext has been the lack of systematic descriptions that would allow scholars to access the data in a searchable and quantifiable form. In book historical scholarship, the most natural approach would be to compile a descriptive database such as the comparative database of typographic features of Dutch handpress books compiled by Proot (2012), using established features such as names of type and layout features as variables (or fields) and physical measurements thereof as values. For the book historian, the main point of interest would typically be whether or not a particular feature appears in a book, while the historical linguist would almost invariably be interested in the relationship between those features and the text itself. As noted at the beginning, such descriptive taxonomies or methods thereof have not been adopted by historical linguists to any great measure. The last two articles in this volume address this issue, presenting two approaches to the corpus annotation of printed title-pages from the Early Modern period.

The Lampeter corpus, compiled by Josef Schmied, Claudia Claridge and Rainer Siemund and released in 1999, was the first linguistic corpus to feature sophisticated annotation of the paratextual features of early printed texts. The Lampeter corpus was annotated in SGML following the guidelines of the Text Encoding Initiative (TEI) and, in part, developed in cooperation with members of TEI. Claridge’s article in this volume discusses the linguistic relevance of the visual elements annotated into the Lampeter corpus. Quoting Moxon’s Mechanick Exercises (1683), famously the earliest account of the inner workings of an early modern printing house, Claridge begins by noting the care and attention that contemporary printers and publishers afforded to the visual presentation of the printed book. Although such features are more accessible than ever before through facsimile images, such images are often not sufficient for real research tasks because the features of interest are neither searchable nor, consequently, easily quantifiable. The article discusses three separate but closely related topics, namely page layout, typography, and word separation. In each case the author explains the practices followed in the Lampeter corpus and the reasons behind them, and gives select examples of how the annotation aids research and helps us preserve potentially significant features which more conventional corpus editions typically omit. For example, Claridge provides statistical evidence on the use of blackletter type in the Lampeter corpus. As Claridge argues, the text-typological and diachronic variation observed not only informs us about the overall use of a particular type but also suggests how salient words printed in that type would have been to the eye of the contemporary reader.

The final article in the volume, by Tyrkkö, Marttila and Suhr, introduces the work of the Gatekeepers of Knowledge project and presents a pilot study that focuses on the title-pages of books associated with one prolific seventeenth-century medical author, Nicholas Culpeper (1616–1654). Culpeper was the first English medical writer to become a bestselling author, and his books were printed regularly for several decades after his early death. Taking as a premise that the function of the early modern title-page was to serve as an advertisement, the authors ask whether the many printers who profited from Culpeper’s name, some legitimately and others by less scrupulous means, can be seen to have created a recognisable style that would identify a book as a Culpeper volume.

Following a brief historical background to Culpeper and the associated printers and publishers, the article discusses two separate issues: the annotation system developed for the Gatekeepers project, and a small sampling of findings based on the annotated data. An adaptation of TEI P5 XML, the annotation scheme includes a number of innovations, among them the precise recording of type sizes and spacing in page layout, and the use of identifier elements for persons and places that appear in the documents. As the authors discuss, the analysis required that highly detailed measurements, down to one fifth of a millimetre, were taken of each of the 100 title-pages. The method is extensive in detail and, consequently, time-consuming to carry out in practice, but the findings show that such fine-grained annotation can be useful for some research tasks. In the spirit of open access, the authors make available a copy of the project’s own annotation guidelines and the actual corpus itself. Using quantitative and statistical evidence, the authors then demonstrate how adequately annotated paratextual data can be queried and analysed with corpus linguistic methods. The case study shows how different printing houses developed and maintained specific styles, often from one generation to another, but also how the Culpeper brand came to mean that certain initially idiosyncratic features were established to such an extent that nearly all printers made use of them.
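The kind of query described here can be illustrated with a small sketch. It does not use the Gatekeepers annotation scheme or its data: the element names (corpus, titlePage, line), the attributes (printer, type, height, the latter in millimetres), the printer labels and all the values are invented for the example, as are the function names.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict
from statistics import mean

# Invented mini-corpus; names and measurements are illustrative only.
CORPUS = """<corpus>
  <titlePage printer="printerA">
    <line type="roman" height="6.2">FIRST LINE</line>
    <line type="blackletter" height="4.0">Second line</line>
  </titlePage>
  <titlePage printer="printerB">
    <line type="roman" height="5.0">FIRST LINE</line>
    <line type="italic" height="3.4">Second line</line>
  </titlePage>
  <titlePage printer="printerA">
    <line type="roman" height="6.0">FIRST LINE</line>
  </titlePage>
</corpus>"""


def mean_height_by_printer(xml_text, type_name):
    """Mean recorded height (mm) of lines set in one type, per printer."""
    heights = defaultdict(list)
    for page in ET.fromstring(xml_text).iter("titlePage"):
        for line in page.iter("line"):
            if line.get("type") == type_name:
                heights[page.get("printer")].append(float(line.get("height")))
    # round to one decimal to hide floating-point noise in the averages
    return {printer: round(mean(values), 1) for printer, values in heights.items()}


def type_counts_by_printer(xml_text):
    """Number of lines set in each type, per printer."""
    counts = defaultdict(Counter)
    for page in ET.fromstring(xml_text).iter("titlePage"):
        for line in page.iter("line"):
            counts[page.get("printer")][line.get("type")] += 1
    return counts


roman_heights = mean_height_by_printer(CORPUS, "roman")
type_counts = type_counts_by_printer(CORPUS)
```

Once measurements are stored as attribute values rather than described in prose, comparisons between printing houses reduce to grouping and averaging of this kind, which is what allows stylistic consistency to be assessed quantitatively.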


Bannet, Eve Tavor. 2005. Empire of Letters: Letter Manuals and Transatlantic Correspondence, 1680–1820. Cambridge: Cambridge University Press.

Barton, David & Nigel Hall, eds. 1999. Letter-Writing as Social Practice (Studies in Written Language and Literacy 9). Amsterdam & Philadelphia: Benjamins.

Beal, Joan C., Karen P. Corrigan & Hermann L. Moisl, eds. 2007. Creating and Digitizing Language Corpora, Vol. 2: Diachronic Databases. Basingstoke: Palgrave Macmillan.

Bland, Mark. 2010. A Guide to Early Printed Books and Manuscripts. Malaysia: Wiley-Blackwell.

Daybell, James. 2001. Early Modern Women’s Letter-Writing in England, 1450–1700. Basingstoke: Palgrave Macmillan.

Daybell, James. 2012. The Material Letter in Early Modern England: Manuscript Letters and the Culture and Practices of Letter-Writing, 1512–1635. Basingstoke: Palgrave Macmillan.

Daybell, James & Andrew Gordon, eds. Forthcoming. Cultures of Correspondence in Early Modern Britain.

Daybell, James & Peter Hinds, eds. 2010. Material Readings of Early Modern Culture, 1580–1700. Basingstoke: Palgrave Macmillan.

Diemer, Stefan. 2012. “Orthographic annotation of Middle English Corpora”. Outposts of Historical Corpus Linguistics: From the Helsinki Corpus to a Proliferation of Resources (Studies in Variation, Contacts and Change in English 10), ed. by Jukka Tyrkkö, Matti Kilpiö, Terttu Nevalainen & Matti Rissanen. Helsinki: Research Unit for Variation, Contacts, and Change in English.

Eisenstein, Elizabeth. 1979. The Printing Press as an Agent of Change. 2 volumes. Cambridge: Cambridge University Press.

Fitzmaurice, Susan. 2002. The Familiar Letter in Early Modern English. Amsterdam & Philadelphia: Benjamins.

Grafton, Anthony. 2011. The Culture of Correction in Renaissance Europe. London: The British Library.

Huber, Magnus. 2007. “The Old Bailey Proceedings, 1674–1834: Evaluating and annotating a corpus of 18th- and 19th-century spoken English”. Annotating variation and change (Studies in Variation, Contacts and Change in English 1), ed. by Anneli Meurman-Solin & Arja Nurmi. Helsinki: Research Unit for Variation, Contacts, and Change in English.

Lass, Roger. 2004. “Ut custodiant litteras: Editions, corpora and witnesshood”. Methods and Data in English Historical Dialectology (Linguistic Insights: Studies in Language and Communication 16), ed. by Marina Dossena & Roger Lass, 21–48. Bern: Peter Lang.

McKerrow, Ronald B. 1967 [1927]. An Introduction to Bibliography for Literary Students. Oxford: Clarendon Press.

Meurman-Solin, Anneli. 2011. “Utterance-initial connective elements in early Scottish epistolary prose”. Connectives in Synchrony and Diachrony in European Languages (Studies in Variation, Contacts and Change in English 8), ed. by Anneli Meurman-Solin & Ursula Lenker. Helsinki: Research Unit for Variation, Contacts, and Change in English.

Meurman-Solin, Anneli. 2012. “The connectives and, for, but, and only as clause and discourse type indicators in 16th- and 17th-century epistolary prose”. Information Structure and Syntactic Change in the History of English (Oxford Studies in the History of English 2), ed. by Anneli Meurman-Solin, María José López-Couso & Bettelou Los. New York: Oxford University Press.

Mitchell, Linda C. & Carol Poster, eds. 2007. Letter-Writing Manuals and Instruction from Antiquity to the Present: Historical and Bibliographic Studies. Columbia, SC: University of South Carolina Press.

Moxon, Joseph. 1683. Mechanick Exercises, or, The Doctrine of handy works. Applied to the Art of Printing. London: Printed for Joseph Moxon.

Nevala, Minna. 2004. Address in Early English Correspondence: Its Forms and Socio-Pragmatic Functions (Mémoires de la Société Néophilologique de Helsinki LXIV). Helsinki: Société Néophilologique.

Nevalainen, Terttu. 2001. “Continental conventions in early English correspondence”. Towards a History of English as a History of Genres, ed. by Hans-Jürgen Diller and Manfred Görlach, 203–224. Heidelberg: Universitätsverlag C. Winter.

Nevalainen, Terttu & Helena Raumolin-Brunberg, eds. 1996. Sociolinguistics and Language History: Studies Based on The Corpus of Early English Correspondence (Language and Computers 15). Amsterdam & Atlanta: Rodopi.

Nevalainen, Terttu & Helena Raumolin-Brunberg. 2003. Historical Sociolinguistics: Language Change in Tudor and Stuart England. (Longman Linguistics Library). London: Longman.

Nurmi, Arja, Minna Nevala & Minna Palander-Collin, eds. 2009. The Language of Daily Life in England (1400–1800) (Pragmatics and Beyond New Series 183). Amsterdam: Benjamins.

Proot, Goran. 2012. “Towards a typographical atlas of the handpress book produced in the Southern Low Countries in the Early Modern period: Aims, methodology and results”. Conference paper presented on June 21, 2012 at SHARP 2012, Washington D.C.

Sairio, Anni. 2009. Language and Letters of the Bluestocking Network: Sociolinguistic Issues in 18th-century English (Mémoires de la Société Néophilologique de Helsinki LXXV). Helsinki: Société Néophilologique.

Schneider, Gary. 2005. Culture of Epistolarity: Vernacular Letters and Letter Writing in Early Modern England, 1500–1700. Newark, DE: University of Delaware Press.

Shevlin, Eleanor F. 1999. “‘To reconcile book and title, and make ’em kin to one another’: The evolution of the title’s contractual functions”. Book History 2(1): 42–77.

Suhr, Carla. 2011. Publishing for the Masses: Early Modern English Witchcraft Pamphlets (Mémoires de la Société Néophilologique de Helsinki LXXXIII). Helsinki: Société Néophilologique.

Tyrkkö, Jukka. Forthcoming. “Printing houses as communities of practice: Orthography in early modern medical books”. Communities of Practice in the History of English, ed. by Joanna Kopaczyk & Andreas Jucker. Amsterdam: Benjamins.