General introduction [1]

Helena Raumolin-Brunberg and Terttu Nevalainen

Helena Raumolin-Brunberg & Terttu Nevalainen, "Historical Sociolinguistics: The Corpus of Early English Correspondence", 2007, Palgrave Macmillan

1. Background

The Corpus of Early English Correspondence (CEEC) was compiled within the 'Sociolinguistics and Language History' research project, which was funded by the Academy of Finland and the University of Helsinki in 1993-97. After that date, the researchers concerned with this project formed the core of the 'Historical Sociolinguistics' team in the Research Unit for Variation and Change in English (VARIENG) at the University of Helsinki, which was chosen as one of the national Centres of Excellence by the Academy of Finland for 2000-05 and 2006-11. During this period the CEEC has been enlarged, and work on grammatical annotation and methodological development will continue.

The general aim of the research project was to test the applicability of sociolinguistic methods to historical data, work which could not be carried out without a corpus designed specifically for this purpose. The team originally identified the following five requirements for the data to be used: (1) the size of the corpus should be sufficient for research on morphological variation and change, (2) information on the social background of the writers and their audiences should be readily accessible, (3) the language used should represent private writing and relate closely to the spoken idiom, (4) there ought to be easy access to the material, which should be available or made easily available in a computerized form, and (5) the corpus should cover a period of time long enough for diachronic comparisons (Nevalainen and Raumolin-Brunberg 1996: 39). The chronological range from Late Middle to Early Modern English was chosen in view of our research interests and experience.

The team decided to focus on personal letters, because they fulfil the requirements better than most other text types. Correspondence is available from the 1410s onwards, the amount of data increasing with time. The people who wrote and received letters are relatively easy to identify. Previous research has shown that the language of correspondence often resembles spoken registers more closely than most other types of writing (see Biber 1995: 283-300).

In retrospect, the original five requirements have kept their validity during the more than ten years of project work. The linguistic phenomena to be studied have been expanded to areas other than morphology, such as syntax, pragmatic phraseology and grammaticalized lexemes.

The letters were predominantly digitized from edited letter collections by scanning. Some edited material, such as the Johnson letters from 1542-53, was not available in printed form, and consequently, its computerization involved the more laborious method of keying in the text. The members of the team also edited some letters to be included in the corpus (see Keränen 1998; Nevala 2001).

The CEEC corpora today consist of the five units listed below. The first two corpora have been completed, the CEEC representing the original corpus, and the CEECS the texts that were not under copyright restrictions in 1998. The Supplement contains additional material which was not available before 1998 or does not fulfil the criteria used, e.g., by having modernized spelling. The CEECE and CEEC Supplement are still incomplete, and the figures concerning their sizes are estimates (Laitinen 2002, Kaislaniemi 2006). The tagging and parsing of the PCEEC has been conducted as a joint project between the Universities of York and Helsinki (see Taylor, this volume).

  1. Corpus of Early English Correspondence (CEEC), 1998 version, c. 1410-1681, 2.7 million words, 96 letter collections, 778 informants
  2. Corpus of Early English Correspondence Sampler (CEECS), c. 1410-1681, 450,000 words, 23 letter collections, 194 informants
  3. Corpus of Early English Correspondence Supplement (CEEC Supplement), c. 1410-1681, c. 435,000 words, 18 letter collections, c. 90 informants (in progress)
  4. Corpus of Early English Correspondence Extension (CEECE), 1681-1800, c. 2.1 million words, c. 77 letter collections, c. 310 informants (in progress)
  5. Parsed Corpus of Early English Correspondence (PCEEC), c. 1410-1681, c. 2.2 million words, 84 collections, 666 informants

Only two corpora have been released for general use, i.e. the CEECS in 1998 and the PCEEC in 2006. The CEEC, its supplement, and the CEECE can be used at the VARIENG Research Unit in Helsinki both by the members of the compiling teams and by visitors who have been granted permission to use them.

The compilation of these corpora has been team work from beginning to end. All members have participated in the selection of material, including library visits in Finland and abroad, while the junior members have been responsible for the scanning, coding and proof-reading. Arja Nurmi has been responsible for the PCEEC part-of-speech annotation. The teams have consisted of the following members:

CEEC, CEECS, PCEEC: Terttu Nevalainen (leader), Jukka Keränen, Minna Nevala (née Aunio), Arja Nurmi, Minna Palander-Collin and Helena Raumolin-Brunberg.

CEECE and the Supplement: Terttu Nevalainen (leader), Samuli Kaislaniemi, Mikko Laitinen, Minna Nevala, Arja Nurmi, Minna Palander-Collin, Helena Raumolin-Brunberg, Anni Sairio (née Vuorinen) and Tuuli Tahko.

In the following discussion we shall focus particularly on issues that turned out to be problematic in the creation of the CEEC. These include the use of edited letter collections, socioregional representativeness, and the encoding of extralinguistic information. It is important to bear in mind that most of the relevant decisions were made in the first part of the 1990s with the technological facilities available at that time. Our later research has shown that the system that was developed was practical and is still applicable to the more recent sister corpora of the CEEC.

2. General guidelines

The guidelines for the compilation and sampling of the CEEC naturally go back to the aims of the 'Sociolinguistics and Language History' project. The purpose was to create a versatile corpus which would constitute material for various types of research in historical sociolinguistics and would have good socioregional and quantitative coverage of informants and linguistic data. These general guidelines have remained valid even in the later expansion of the corpus.

2.1. Time span

The period covered by the CEEC ranges from the 1410s to 1681. Several factors influenced the decision to settle on this period. The early decades of the 15th century provide the first letters written in English in any large quantities, and therefore served as a natural point of departure. Fixing the cut-off point was more difficult. The need to guarantee a sufficient number of successive generations of letter-writers required coverage of the 17th century. As one potentially important factor was the possible connection between the Civil War and the rate of language change, the CEEC was extended until the last decades of the 17th century.

Research carried out on the CEEC (see e.g. Nevalainen and Raumolin-Brunberg 2003) strongly suggested that the corpus should be extended to 1800. The findings indicated that some grammatical innovations, such as the introduction of 'its', only started to spread in the 17th century. There is considerable interest in monitoring these changes through the following century. The 18th century also offers more material by women and provides new types of external conditioning, such as the influence of prescriptive grammars.

2.2. Choice of informants

In order to fulfil the requirements of the best social coverage possible, the corpus team systematically looked for data from both sexes, young and old, and all sections of the social hierarchy. It was clear from the outset that several hundred informants were needed to provide sufficient amounts of data for all the cells that the correlative analyses would require, especially when the linguistic variable was not particularly frequent. Several synchronic sociolinguistic studies of present-day English are based on interviews with 50 to 100 informants (Labov 2001: 39), but we needed more, as the CEEC was to be used for real-time diachronic research covering about 270 years.

The sampling models for guaranteeing adequate socioregional representativeness were acquired from the writings of well-known social historians, such as Laslett (1983) and Wrightson (1982, 1991, 1994). In dealing with extralinguistic data and their subsequent analysis, our general policy was to rely on social historians to make sure that our classifications and explanations were based on contemporary viewpoints and social realities (see e.g. Nevalainen and Raumolin-Brunberg, eds., 1996; 2003).

It was relatively easy to find letters by noblemen and gentlemen as well as professionals such as lawyers and clergymen, but gaining access to the ranks below the gentry was a real challenge. Throughout the 270-year period the CEEC covers, especially its first half, the rate of full literacy, i.e., both reading and writing, was very low among the lower social strata. According to Reay (1998: 40), only 15-20 per cent of labourers used signatures instead of marks in various documents in 1580-1700, while the gentry and professionals were 100-per-cent literate at the same time. The study of signatures as opposed to marks is the standard method of assessing the level of literacy. Since reading and writing were taught in succession as separate skills, the proportion of signatures constitutes the minimum estimate of those being able to read.

This socially stratified pattern of literacy was reflected in the availability of material. There were so many letters by higher-ranking informants that their inclusion in the corpus could be limited by regional criteria, for instance, whereas all data by non-gentry informants was welcomed.

Full literacy was also rare among women. According to Cressy (1980: 119-21), the overall literacy of women was still low in the middle of the 17th century, as only c. 5% of women used a signature instead of a mark. The scarcity of women's letters led to the decision to include all available material by women, even if some was not autograph. Tables 1 and 2 give the figures concerning the gender and social division of the CEEC informants.

Table 1. Informants: gender

                      Men           Women         Women's share  Total
Number of informants  610           168           26%            778
Running words         2.26 million  0.45 million  17%            2.71 million
Number of letters     4,973         1,066         18%            6,039

Table 2. Informants: social status (percentages)

                  Men  Women  Total
Royalty             2      6      3
Nobility           12     23     15
Gentry             35     56     39
Clergy             16      6     14
Professionals      14      4     11
Merchants          10      2      8
Other non-gentry   11      3     10
Total             100    100    100

Not surprisingly, full regional coverage was not a possible goal. Most speakers of rural dialects belonged to the illiterate majority of the population. In the choice of material, our practical solution was to concentrate whenever possible on four broad areas: London, East Anglia, the North (the counties north of Lincolnshire) and the Court. The areas are self-explanatory except for the Court, by which we mean the royal family, its courtiers as well as diplomats and high administrative officers, many of whom lived in Westminster. All these areas were relatively well represented from the 15th to 17th centuries, offering a fair amount of diachronic continuity, sometimes even within one and the same family, such as the East Anglian Pastons and Bacons. These regional priorities provided an opportunity to study the supralocalization of linguistic innovations. The regional division of informants is given in Table 3.

Table 3. Informants: regional division (percentages)

             Men  Women  Total
The Court      9      5      8
London        15      8     14
East Anglia   17     17     17
North         14      9     12
Other areas   45     61     49
Total        100    100    100

Somewhat surprisingly, when compiling the CEECE we also encountered serious problems in finding lower-ranking informants from the 18th century. While full literacy was much more widespread then than in the previous century (Cressy 1980: 177) and hence there are letters written by all ranks, it seems that the letters of non-gentry informants, despite being preserved in various archives, have not been edited and published in printed collections to the same extent as those from the previous centuries. The 18th century also witnessed a new type of letter-writer, people active in literary circles. Their correspondence has inspired research into the language of these well-known social networks. [2]

2.3. Sampling and quantity of data

As the above discussion indicates, the Corpus of Early English Correspondence consists of judgement samples selected on the basis of extralinguistic criteria. The aim was to cover as broad a range of the language as was reasonably possible under the circumstances. However, with an uneven diachronic coverage of the various social strata, we could at best only aim at what Leech (1993: 13) calls a balanced corpus. [3]

Apart from social representativeness, the quantitative coverage of the corpus was an important issue. Not only did we try to secure enough data from the different social strata, but the contribution of individual informants also had to be sufficient. Whenever possible, a minimum of ten letters per writer was selected. However, occasionally writers of fewer letters would be valuable informants, especially when they came from the lower ranks or were women. In these cases linguistic material provided by several people could be pooled for the study of social stratification or regional usage. In some fortunate cases it was possible to secure letters by the same writer over long periods of time, up to 50 years or more in the case of John Holles and Elizabeth, Queen of Bohemia in the CEEC, as well as Lady Mary Wortley Montagu and Roger Newdigate in the CEECE. In these cases, enough material was selected to cover the entire letter-writing career as evenly as possible.

2.4. Authenticity

The primary material of the CEEC corpora consists of letters available in edited collections. A few manuscripts were also edited by the team members and included in the corpus. The use of edited letters means that we had no control over the philological decisions taken by the editors of the letter collections. Nevertheless, the authenticity of the data was given the highest priority in the compilation process.

The authenticity of a letter involves three separate issues: 1) authorship of the original; that is, whether it was actually written by the person in whose name it was composed or whether it was the work of a secretary or scribe; 2) the extent to which the details of the writer's social background can be identified, and 3) editorial policy as regards the original; for instance, with respect to spelling. A decision was made to include information on all these aspects of authenticity in the coding scheme of the corpus, and to provide each letter with a qualitative specification. The four codes used were as follows:

A = autograph letter in a good original-spelling edition; writer's social background recoverable

B = autograph letter in a good original-spelling edition; part of the writer's background information missing

C = nonautograph letter (secretarial work or copy) in a good original-spelling edition; writer's social background recoverable

D = doubtful or uncertain authorship; problems with the edition, the writer's background information, or both.

In retrospect, it might have been wise to divide the C-letters into two groups, those written by secretaries and those edited from copies, as these represent different types of authenticity (see 2.4.1., below). The large majority of the letters were encoded either as A or C. Some problematic material with D coding was included but later checked against the originals, after which an appropriate new coding could be given.

2.4.1. Authorship

The authorship of the letters was not always unambiguous. The ideal case was a carefully edited collection of autograph letters which were actually delivered to their intended recipients and written personally by people whose social background details were known. Collections like this exist, but not in very large numbers. Cases in point are the letters of the Barrington family (1628-32) and the letters by Dorothy Osborne to her future husband, Sir William Temple (1652-57).

There are many collections in which the majority of letters, although not all, are based on autograph sources. A typical instance is a collection pertaining to the individual who was the recipient of the autograph letters but, as far as his or her own writing is concerned, the only material that remains is a collection of drafts or a letter-book of copies. This is the case with the family letters of the diarist Samuel Pepys (1663-80), for instance. Copies of letters sent by Pepys are found in a letter-book written in a secretary's hand, which is interspersed by corrections in Pepys's own writing.

One step removed, there are entire collections of letters edited from copies, such as The Letters of John Holles, 1587-1637, which were edited by P.R. Seddon from four letter-books copied by the eldest son of John Holles at various times. Going further still, the letters written by the members of the Plumpton family from Yorkshire (1480-1550) were edited in 1839 from early seventeenth-century copies.

As in the case of Pepys, it was customary for people in high administrative offices and of the highest ranks to employ secretaries. This was more or less the rule for royal letters, at least the nonprivate ones. At the other end of the social scale, the lower and middle sections of society and women in particular had to rely on secretarial help because of their inability to write. A decision had to be made on how to deal with nonautograph letters in general and drafts and copies in particular. It was relatively easy to decide how to deal with drafts and copies that were written by the sender personally. They were treated like authentic letters; in any case they were autograph and could have been delivered to the recipient. The letter-book copies in a secretary's hand but corrected by the sender were also given high priority.

As far as uncorrected secretarial letters are concerned, there is no way of knowing whether they were dictated or written in accordance with some general instructions from the sender. They cannot be considered really representative of the sender's language. However, during the course of the compilation process it became clear that there are periods and social ranks which would fall beyond our reach if all copies were ignored. Consequently, some of them were included in the corpus, with a specific code indicating that they were copies (see 2.4., above).

2.4.2. Editions

In order to preserve authorial authenticity, only original-spelling editions were used. This means that some modernized but otherwise excellent collections were excluded from the CEEC proper, but these were placed in the CEEC Supplement. These letters would not serve as reliable material for grammatical research, let alone phonology, but do provide good data for sociopragmatic studies. Although modern-spelling editions were excluded, minor changes were accepted, such as the modernization of capitalization and punctuation and the expansion of abbreviations.

The actual editorial quality of the collections varies considerably. Some have been made for historians by historians without any philological training, while others combine outstanding historical and linguistic expertise. The earliest editions date from the first half of the 19th century, and the latest from the early 2000s. Few collections have been re-edited, although the editions might be inadequately documented and cover only a small selection of the material available. Recent publications usually give extensive accounts of the editorial principles used, while some of the older ones hardly provide any information at all. A minority of the editions also fail to give any account of the autograph status of the material. All doubtful cases of authorship, whether due to editorial oversight or lack of the necessary background information, were recorded in the coding scheme. In addition, the corpus team checked suspicious editions against the original letters, and made spot-checks on many others to establish their reliability.

2.5. Copyright

The use of edited collections meant that most of the CEEC material was under copyright restrictions. The regulations in the European Union protect copyright until 70 years have elapsed from the death of the copyright holder. Most of the editions were done in the 20th century and the editors, or the publishers in some cases, consequently still held the copyrights in the 1990s.

The compilation work was carried out without attention to possible problems in copyright clearance, since it was known that the corpus could be used as private research material without restrictions. However, copyright clearance would have been needed for the release for general use. At the completion of the CEEC, the corpus team felt obliged to release as much material as possible quickly, which led to the creation of the CEEC Sampler, a corpus of 450,000 words, containing all collections free of copyright restrictions. (For details, see Nurmi 1998). This corpus has a relatively even chronological coverage of the full CEEC period c. 1410-1680, and is especially suited to studies of high-frequency phenomena. (For the comparison between the CEEC and the CEECS, see Nurmi 2002a).

The relatively laborious copyright clearance process was started when new funding for corpus work was made available in the Research Unit for Variation and Change in English in 2000. The acquisition of copyright clearance not only meant contacting well-established publishers but also tracing the copyright holders of books whose publishers had disappeared from the market. Some publishers, such as local record societies, meet very rarely, and it took a considerable time to receive their answers. Luckily, the vast majority of copyright holders allowed the inclusion of their material in the corpus without fees. After this process, the tagging and parsing of approximately 2.2 million words of letters could begin with the goal of releasing the Parsed Corpus of Early English Correspondence (PCEEC) for public use in 2006. The size of this corpus, created in cooperation between the Universities of York and Helsinki, is largely dictated by the available resources.

3. Formats and storage methods

The CEEC corpora consist of text files in two different formats. Both formats, the collection-based one and that based on personal files, vary in file size. The largest collections include the Paston Letters from Norfolk (1425-1519?) of approximately 240,000 words, and the Johnson Letters from London (1542-53) of nearly 200,000 words. The smallest collections amount to a few thousand words, e.g., Henry VIII's love letters to Anne Boleyn. (For the collections, see Nevalainen and Raumolin-Brunberg 2003: Appendix III). The two versions so far released for general use, the Corpus of Early English Correspondence Sampler (CEECS) and the Parsed Corpus of Early English Correspondence (PCEEC), are given in the collection format, and the other CEEC corpora also exist as collections.

The personal file format consists of a large number of relatively small files, each covering all the letters written by a single informant. These files give access to individual usage and, seen against the background information of the sender database (see 4.3., below), serve as ideal data for sociolinguistic research. However, it was necessary to make some modifications to the principle of storing the letters of each individual in separate files. Since creating very small files did not seem feasible, a limit was placed at 2000 running words. The letters by people providing fewer than 2000 words of data were stored in chronologically divided collection files. Both the CEEC and the CEECE are available in this personal-file format.
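The storage rule described above can be illustrated with a minimal sketch. It is not the procedure the team used: the `Letter` record, the file-name prefixes and the 40-year width of the chronological bins are all hypothetical, since the text states only the 2000-word limit and that the remaining letters were stored in chronologically divided collection files.

```python
from collections import defaultdict
from typing import NamedTuple


class Letter(NamedTuple):
    """Hypothetical minimal record for one letter."""
    writer: str
    year: int
    words: int


def assign_files(letters, min_words=2000, period_years=40):
    """Group letters into personal files or pooled collection files.

    min_words is the limit stated in the text; period_years is an
    assumed bin width for the chronologically divided collection files.
    """
    # Total running words contributed by each writer.
    totals = defaultdict(int)
    for letter in letters:
        totals[letter.writer] += letter.words

    files = defaultdict(list)
    for letter in letters:
        if totals[letter.writer] >= min_words:
            # Enough data: the writer gets a personal file.
            key = f"personal:{letter.writer}"
        else:
            # Too little data: pool the letter by period of writing.
            start = letter.year - letter.year % period_years
            key = f"collection:{start}-{start + period_years - 1}"
        files[key].append(letter)
    return dict(files)


sample = [
    Letter("Writer A", 1600, 1500),
    Letter("Writer A", 1610, 600),
    Letter("Writer B", 1605, 300),
]
files = assign_files(sample)
```

In this sketch the threshold applies to a writer's total contribution, so both letters by the hypothetical Writer A end up in one personal file even though neither reaches 2000 words alone.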

Apart from the Research Unit for Variation, Contacts and Change in English (VARIENG) in Helsinki, in which all the CEEC corpora are stored, the CEECS is preserved in the Oxford Text Archive (OTA) and the International Corpus Archive for Modern and Medieval English (ICAME) in Bergen, Norway. Both server versions and CD-ROMs exist and can be acquired from these bodies, together with the manual (Nurmi 1998) in an electronic format. The PCEEC is deposited at the Oxford Text Archive.

4. Structure

Apart from the running text, the CEEC corpora contain some text-internal codes. Moreover, grammatical annotation has recently been added to a large part of the CEEC. The text-external background information on the CEEC and CEECE informants has been stored in separate databases.

4.1. Text-internal coding

As regards text-internal coding, the Corpus of Early English Correspondence follows the principles of the Helsinki Corpus of English Texts (HC). A detailed description of the conventions is given in the manual to the HC (Kytö 1996: 18-40). These codings include 'foreign language' (\.....\), 'emendation' [{.....{], 'editor's comment' [\.....\], 'our comment' [^.....^] and 'heading' [}.....}]. Old letters such as þ 'thorn' and ȝ 'yogh', which are occasionally found in Middle English letters, are coded with a plus sign preceding a letter, for instance 'thorn' is +t and 'yogh' +g. Superscripts are marked with an = sign on both sides of the letters printed in superscript, as in Ma=tie= for Matie. The CEEC coding differs from the Helsinki Corpus to the extent that the line divisions which appear in the printed letter collections are not encoded. The printed letters do not necessarily follow the line divisions in the original letters, and there is no reason to repeat the decisions made by the editors.
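For readers who process the plain-text files themselves, the conventions described above can be handled along the following lines. This is only a sketch under stated assumptions: it covers just the two letter codes and the comment and superscript delimiters mentioned in this section, the function names are hypothetical, and the `<sup>` output markup is an arbitrary choice rather than part of the corpus format. The full set of codes is documented in the Helsinki Corpus manual.

```python
import re

# Only the two letter codes mentioned in the text are mapped here.
LETTER_CODES = {"+t": "þ", "+g": "ȝ"}

# Superscripts carry an '=' sign on both sides, as in Ma=tie=.
SUPERSCRIPT = re.compile(r"=([^=\s]+)=")

# 'Our comment' [^...^] and 'editor's comment' [\...\].
COMMENTS = re.compile(r"\[\^.*?\^\]|\[\\.*?\\\]", re.DOTALL)


def decode_letters(text):
    """Replace the plus-sign codes with the corresponding characters."""
    for code, char in LETTER_CODES.items():
        text = text.replace(code, char)
    return text


def mark_superscripts(text, open_tag="<sup>", close_tag="</sup>"):
    """Rewrite =...= as tagged superscript (the tags are an assumption)."""
    return SUPERSCRIPT.sub(
        lambda m: f"{open_tag}{m.group(1)}{close_tag}", text
    )


def strip_comments(text):
    """Remove compilers' and editors' comments from the running text."""
    return COMMENTS.sub("", text)


decoded = decode_letters("+tat")
marked = mark_superscripts("Ma=tie=")
```

Stripping comments before counting words or searching keeps editorial material out of the results; whether to do so depends on the research question.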

4.2. Corpus annotation

Apart from the corpus-internal coding described above, information has been inserted on every letter. One of the cocoa-format parameters (see 4.3., below), the 'text identifier', has been used to show the values of a number of letter-specific variables. These vary between the collection-based and personal-file-based formats, including properties such as authenticity, writer, year of writing, relationship of the recipient to the writer, and source and page number in the source collection. This information, which can be displayed in connection with every occurrence of the linguistic features under examination, has proved helpful in analysis of the findings.

As mentioned above, the grammatical annotation of the CEEC was carried out in cooperation between the Universities of York and Helsinki. Although advanced computer programs have been developed for automatic parsing for present-day written English, historical data have particular problems which have turned out to be quite difficult to deal with. The unstable spelling systems pose immediate problems for handling texts created in the past, and even subtle grammatical differences may play havoc with an analysis based on present-day English.

The program chosen for the tagging and parsing of the CEEC, the Penn Treebank, has proved its strength in the parsed versions of the Helsinki Corpus of English Texts (HC). All three sections of the HC, with some additions, have been parsed by using the Treebank, resulting in the York-Toronto-Helsinki Parsed Corpus of Old English Prose (see Taylor, this volume), the Penn-Helsinki Parsed Corpus of Middle English II, and the Penn-Helsinki Parsed Corpus of Early Modern English.

The annotation system provides part-of-speech tagging and syntactic parsing. Automatic processes are supplemented by manual corrections. The parsing system uses a limited tree representation in the form of labelled parentheses. The main goal of the annotation is not to provide grammatical analysis but to facilitate automatic searching for syntactic constructions. Hence, while the user is not required to agree with the grammatical analysis, the annotated corpus can be used for research in any syntactic framework. (For details, see Taylor, this volume).
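How a labelled-parenthesis representation supports automatic searching can be shown with a small sketch. The sample string and its labels (IP, NP, PRO, VBP) are purely illustrative and are not taken from the PCEEC; the functions are hypothetical helpers, not part of any search tool distributed with the corpus.

```python
import re


def parse_bracketing(text):
    """Turn a labelled-parenthesis string into nested lists.

    Each node becomes [label, child, ...]; words are plain strings.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token          # a label or a word
        node = []
        while tokens[pos] != ")":
            node.append(read())
        pos += 1                  # skip the closing parenthesis
        return node

    return read()


def find(tree, label):
    """Collect every node in the tree that carries the given label."""
    found = []
    if isinstance(tree, list):
        if tree and tree[0] == label:
            found.append(tree)
        for child in tree:
            found.extend(find(child, label))
    return found


# Illustrative example only; the labels are assumptions.
tree = parse_bracketing("(IP (NP (PRO I)) (VBP pray) (NP (PRO you)))")
noun_phrases = find(tree, "NP")
```

Because the search operates on labels and structure rather than on a particular theory of grammar, the same tree can serve queries framed in different syntactic frameworks.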

4.3. Letter- and writer-specific information

It was clear from the beginning of the corpus work that sufficient background information was indispensable for sociolinguistic analysis. To meet the immediate needs of recording who wrote which letter to whom, a form for each letter was completed. This form contains the basic social information about the sender of the letter and its recipient, including name and life span, date of writing, social status and occupation, education, domicile and migration history. Information is also given on the relationship between the sender and the recipient, as well as the content and the authenticity of the letter. Further comments could also be added to the forms. These letter forms comprise the basic database of the CEEC, CEECE and CEEC Supplement.

There were alternative ways to computerize these files. At the time the decision was made, the Text Encoding Initiative (TEI), a coordinating body for the encoding of language corpora, had suggested two different ways of presenting participant information (Table 4, Johansson 1994: 208). Both represent a text header, but alternative 1 is of a freer type, while alternative 2 is more structured.

Table 4. Participant coding (TEI)

Alternative 1:

<participant id = P1 sex = F age = 'mid'>

<p> Female informant, well-educated, born in Shropshire, UK, 12 Jan 1950, of unknown occupation. Speaks French fluently. Socioeconomic status B2 in the PEP classification scheme. </p>


Alternative 2:

<participant id = P1 sex = F age = 'mid'>

<birth date = '1950-01-12'>

<date> 12 Jan 1950 </date>

<place> Shropshire, UK </place>

<firstLang> English </firstLang>

<langKnown> French </langKnown>

<residence> Long term resident of Hull </residence>

<education> University postgraduate </education>

<occupation> Unknown </occupation>

<socecstatus source = PEP code = B2>


The CEEC could also have followed the model that had been created for the Helsinki Corpus of English Texts, namely a 25-parameter text header in the cocoa format (Kytö 1996: 40-56). These parameters contain information which would not be useful in a single-genre corpus, but some of the variables such as 'sex', 'age' and 'social rank of author' certainly could have been. It was apparent that neither this format nor the TEI model really corresponded to the requirements of the CEEC users. There was a need to create a file on the informants which could allow searches of the data within given parameter values or combinations of them. The solution involving text headers would have required the encoding of parameter values for over 6,000 letters, leading to a corpus with a large number of interruptions. In addition, the technical problems in making searches on the basis of the cocoa-format headers needed to be resolved. Our decision was to create a separate sender database (see also Raumolin-Brunberg 1997).

Table 5. Sender database: the parameters

  1. Last name
  2. First name
  3. Title
  4. Year of birth
  5. Year of death
  6. First letter
  7. Last letter
  8. Sex
  9. Rank
  10. Father's rank
  11. Social mobility
  12. Place of birth
  13. Main domicile
  14. Migrant
  15. Education
  16. Religion
  17. Number of letters
  18. Number of recipients  
  19. Kind of recipients
  20. Number of words
  21. Letter contents
  22. Letter quality
  23. Collection
  24. Career
  25. Migration history
  26. Extra
  27. Complete

Table 5 lists the extralinguistic variables encoded in the sender database. The dBase program makes use of different types of data, including characters and numerical and logical data. Most of the parameter values are exclusive, but some also allow combinations of values, and there are three fields (Nos. 24-26) that represent open files, where any amount of information can be stored. With the help of this database, searches, indexing and counts can be made by using different combinations of parameter values. It is possible, for instance, to find all people with university education who wrote at least ten letters between 1590 and 1620. The following gives a brief description of the variables used.

The parameter 'name' comprises two fields, since surname and given name are coded separately for the creation of correct alphabetical indices. 'Title' is there to identify the person; for example, the Duke of Norfolk or the Bishop of Norwich. 'Year of birth' allows us to calculate the age of the writer at a specified point in his or her lifetime. 'Year of death' is less important, but it is known for more writers than their birth year, and allows us to place the person in the correct chronological context. 'First letter' and 'last letter' give the time span of writing. 'Sex' does not need further clarification. 'Rank' describes the social stratum the writer belongs to: nobility, gentry, merchants, and so on. If the person's social status changes in his or her lifetime, the highest social position is given here. On this principle one and the same person may have had different social positions in the database.

'Father's rank' and 'social mobility' are both there to indicate the person's social mobility. 'Father's rank' characterizes intergenerational social mobility, but 'social mobility' can also show intragenerational mobility; i.e., a situation where someone is raised to higher status during his lifetime, such as a gentleman elevated to the nobility. Geographical issues appear under 'place of birth', 'main domicile' and 'migrant', providing information for the study of supralocalization and dialectal variation. 'Education' contains information on schooling, such as university education or apprenticeship.

Among the numerical fields, 'number of letters' is self-explanatory. 'Number of recipients' is important for studies of register or stylistic variation. 'Kind of recipients' gives information about the relationships between the sender and the recipients; this code is used to distinguish family members from strangers, etc. 'Number of words' needs no explanation. 'Letter contents' gives a general idea of what type of letters the person wrote, for instance, private, official, business, or news. 'Letter quality' gives the authenticity classification described in section 2.4 above (A, B, C, D). 'Collection' indicates the short title of the letter collection or collections from which we have taken that person's letters.

Then follow three so-called memo fields, open files with no limits to the quantity of text. 'Career' gives the details of a person's career, 'migration history' information on where that person lived and moved, and 'extra' includes any other comment we might want to make. It has mostly been used to describe kinship relations. Finally, 'complete' was added to ease the filling-in process: the value 'yes' was given when the file had been completed and no additions were expected.
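The kind of search mentioned above (writers with university education and at least ten letters between 1590 and 1620) can be illustrated with a minimal Python sketch. It is not the dBase implementation used for the CEEC: the field names, the plain-text education value and the sample records are all invented for illustration. Because a sender record stores only a total letter count and the years of the first and last letter, the sketch assumes that a writer qualifies only when the whole writing span falls within the period.

```python
from dataclasses import dataclass


@dataclass
class Sender:
    """A few of the 27 sender parameters; field names are illustrative."""
    last_name: str
    first_name: str
    first_letter: int  # year of the earliest letter in the corpus
    last_letter: int   # year of the latest letter in the corpus
    education: str     # plain-text stand-in for the coded value
    n_letters: int     # total number of letters by this sender


def find_senders(senders, education, min_letters, start, end):
    """Select senders by education, letter count and writing span.

    The record stores only a total letter count, so this sketch keeps
    a sender only when the whole span first_letter..last_letter lies
    inside start..end.
    """
    return [
        s for s in senders
        if s.education == education
        and s.n_letters >= min_letters
        and start <= s.first_letter
        and s.last_letter <= end
    ]


# Invented records, for illustration only (not CEEC data).
SAMPLE = [
    Sender("Alpha", "Anne", 1592, 1610, "university", 14),
    Sender("Beta", "Ben", 1595, 1605, "university", 4),
    Sender("Gamma", "Giles", 1580, 1618, "university", 30),
    Sender("Delta", "Dora", 1600, 1615, "apprenticeship", 12),
]

hits = find_senders(SAMPLE, "university", 10, 1590, 1620)
print([f"{s.last_name}, {s.first_name}" for s in hits])
```

A search that counted letters within an arbitrary period exactly would need letter-level dates rather than the sender-level totals modelled here.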

Table 6 illustrates the database by showing the record of one informant, Philip Gawdy, who was born in 1562 and died in 1617. The letters included in the corpus date from between 1579 and 1616. He was male (M) and represented the lower gentry (GL), which was also his father's rank. His social mobility was nil (N). His place of birth was Norfolk (F), but his main domicile was London (L), to which he had migrated (YL 'yes London'). He was educated at the Inns of Court (HI 'high, Inns of Court') and was an Anglican by religion (A). The corpus contains 42 letters by him, addressed to five different recipients, including nuclear family (FN) and nonnuclear family (FO, 'family other'). The size of Gawdy's contribution is 23,493 words, of mixed contents (M), including news (N) and private matters (P). The letters have been edited from autograph letters, hence qualifying as (A). The short title of the collection is GAWDY. The open file under 'career' says 'younger son of younger son', and the 'mighist' file contains a comment on his move from Norfolk to London. In a recent project, the letter- and correspondent-based databases have been linked to the letter texts and a new user-friendly search system has been created.

Table 6. dBase record: Philip Gawdy

FLETT 1579
LLETT 1616
NWORDS 23493
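As a rough illustration of how such a record can be used, the following Python sketch stores the values given in the text for Philip Gawdy and derives his approximate age at the first and last letter. Only FLETT, LLETT and NWORDS are field names taken from the record itself; all other key names are hypothetical, and the glosses cover only the codes explained in the text.

```python
# Values as given in the text for Philip Gawdy. Key names other than
# FLETT, LLETT and NWORDS are hypothetical.
GAWDY = {
    "last_name": "Gawdy",
    "first_name": "Philip",
    "birth": 1562,
    "death": 1617,
    "FLETT": 1579,
    "LLETT": 1616,
    "sex": "M",
    "rank": "GL",
    "father_rank": "GL",
    "social_mobility": "N",
    "place_of_birth": "F",
    "main_domicile": "L",
    "migrant": "YL",
    "education": "HI",
    "religion": "A",
    "n_letters": 42,
    "n_recipients": 5,
    "NWORDS": 23493,
    "letter_quality": "A",
    "collection": "GAWDY",
}

# Glosses for the codes explained in the text, keyed by field and code,
# since the same letter can mean different things in different fields.
GLOSSES = {
    ("sex", "M"): "male",
    ("rank", "GL"): "lower gentry",
    ("father_rank", "GL"): "lower gentry",
    ("social_mobility", "N"): "nil",
    ("place_of_birth", "F"): "Norfolk",
    ("main_domicile", "L"): "London",
    ("migrant", "YL"): "yes, London",
    ("education", "HI"): "high, Inns of Court",
    ("religion", "A"): "Anglican",
}


def gloss(record, field):
    """Return the gloss for a coded field, or the raw value if unknown."""
    code = record[field]
    return GLOSSES.get((field, code), code)


def age_at(record, year):
    """Approximate age in a given year (year minus year of birth)."""
    return year - record["birth"]


print(gloss(GAWDY, "rank"), gloss(GAWDY, "education"))
print(age_at(GAWDY, GAWDY["FLETT"]), age_at(GAWDY, GAWDY["LLETT"]))
```

Since only the year of birth is recorded, the age obtained this way may be off by one year for any individual letter.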

5. Distribution and end-user issues

The CEEC corpora are distributed on a noncommercial basis for research purposes. As mentioned above, the CEECS and the PCEEC are available in the Oxford Text Archive (OTA). The International Computer Archive of Modern and Medieval English (ICAME) has included the CEECS in its corpus CD-ROM together with other corpora; the CD-ROM is available for a small fee. The manuals are available in an electronic format.

The CEEC corpora can be used with the standard tools, such as the WordCruncher (see Kytö 1996: 65-67), TACT (developed by Ian Lancashire and his colleagues), WordSmith (developed by Mike Scott), and Corpus Presenter (Hickey 2003). The tagged and parsed version, PCEEC, can be employed with the search engine called CorpusSearch, developed for the purpose at the University of Pennsylvania. This program uses queries created by the corpus user (for further information, see Taylor, this volume).

Those who employ the CEEC corpora need to sign a user declaration and are expected to acknowledge their data sources.

6. Research on the CEEC

The compilation of the CEEC, which lasted five years, was partly guided by the results of the pilot studies that were carried out during the corpus work. In other words, the principles of data acquisition were continually tested in different ways. Seeing that variables such as social status and gender proved to be relevant to language change was a spur for the continuing work of compilation. Our pilot studies in Nevalainen and Raumolin-Brunberg (eds., 1996) also dealt with interactional sociolinguistics, using politeness theory to shed light on the development of address forms. These pilot studies were instrumental in helping us locate sections of society that had been inadequately covered by the corpus.

The pilot studies also showed that the information collected in the sender database formed a valuable source for the analysis of extralinguistic factors. However, we already realized at this early stage that the use of the sender database required a great deal of historical background reading by its users. In order to avoid naive interpretations, the database users had to have a fairly good command of the sociohistorical realities of Early Modern England. Although this is one of the reasons for not making the entire database publicly available, the facts concerning the writer's identity have been placed in the 'text identifier' line as a header to every letter, as mentioned in 4.2.

As of spring 2005, the CEEC has provided data for nearly one hundred publications. Three doctoral dissertations based on it, A Social History of Periphrastic do (Nurmi 1999), Grammaticalization and Social Embedding: I think and methinks in Middle and Early Modern English (Palander-Collin 1999) and Address in Early English Correspondence: Its Forms and Socio-Pragmatic Functions (Nevala 2004), illustrate the versatility of the corpus. The CEEC has offered material for novel analyses of previously thoroughly studied grammatical changes like the periphrastic do as well as research on grammaticalization and politeness in a sociopragmatic framework.

Our monograph Historical Sociolinguistics: Language Change in Tudor and Stuart England (Nevalainen and Raumolin-Brunberg 2003), offering the results of ten years of work on CEEC data, demonstrates how 14 morphosyntactic changes spread among the population of England. The volume reveals that external factors such as gender, region and social status played a significant role in the diffusion of these changes. On the whole, the CEEC has proved to be a useful tool for research in historical morphology and syntax especially. [4] In a joint project with the Helsinki Institute for Information Technology (HIIT), we have started to develop new computational methods for analysing real-time language change by using the CEEC as test material.

The sampler version CEECS has found its way into the hands of many historical linguists, who have often used it as one linguistic source among many. [5] No doubt the Parsed Corpus of Early English Correspondence (PCEEC) will provide material for more sophisticated studies of historical syntax.


[1] The research reported here was supported in part by the Academy of Finland Centre of Excellence funding for the Research Unit for Variation and Change in English at the Department of English, University of Helsinki.

We would like to thank Dr. Arja Nurmi for her help in providing updated material for this article.

[2] Articles on networks in 18th-century correspondence, such as Bax (2000, 2002), Fitzmaurice (2000) and Tieken-Boon van Ostade (1996, 2000a,b), are based on material collected by the researchers for these particular studies. The CEECE will provide excellent material for the continuation of their work.

[3] The term 'balanced corpus', often used to refer to a balance of genres, text types and styles, has been employed here in a slightly different sense. In our corpus, which comprises one genre only, balance has been sought in social representativeness, in terms of social rank, gender, geography and age, and in the type of correspondence, based on the contents of the letters, such as news, love, family matters and business. Furthermore, we have aimed at balance by looking for different relationships between the writers, in other words, fathers, sons, mothers, daughters, lovers and friends writing to each other. (See also Kennedy 1998: 62-63; Meyer 2002: xi-xvi.)

[4] Most of the publications based on the CEEC have appeared in journals and conference proceedings. In recent articles, the members of the 'Historical Sociolinguistics' team have dealt with the following topics: sociolinguistic patterns of language change (Nevalainen 2000, 2002b,c, 2003, forthcoming a; Nevalainen and Raumolin-Brunberg 2000, 2002; Nurmi 2002b, 2003a,b; Raumolin-Brunberg 2000), letter writing (Nevalainen 2001, 2002a, 2004; Nevala and Palander-Collin 2005), corpus linguistics (Nurmi 2002a), standardization of English (Nevalainen 2003a), history of vernacular universals (Nevalainen 2006b), stable variation in history (Raumolin-Brunberg 2002), patterns of interaction in a historical perspective (Palander-Collin 2002), 18th-century social networks (Sairio 2005), code-switching (Nurmi and Pahta 2004), language change in adulthood (Raumolin-Brunberg 2005a), indefinite pronouns and their anaphora (Laitinen 2004).

[5] See e.g. Koivisto-Alanko and Rissanen (2002), Kahlas-Tarkka and Kilpiö (2002) and Heikkinen and Tissari (2002).


