1.4 Letters as a data source

This section aims to highlight aspects of letters which relate to their use as a data source in linguistic research. The points raised are based on the compiler's own research in the fields of historical linguistics, historical sociolinguistics, stylistics, pragmatics and dialectology, amongst others.

As amply evidenced by the Corpus of Early English Correspondence (CEEC; in 1996 2.4 million words) and its expanded versions, a database exclusively containing letters can be carefully structured according to socio-linguistically relevant variables such as the writer's social rank, gender, age, social and geographical mobility, and education (Nevalainen and Raumolin-Brunberg 1996, 2003). A large database and a large number of informants are required for a balanced account of social stratification. While the size of the CEEC is adequate for the application of the methodological principles and practices that have been specifically tailored to make it a valid tool for historical sociolinguistics, the much smaller size of the present version of the CSC only permits the use of more general classifying criteria (for information on size, see Nurmi 2002). Therefore, instead of a more detailed categorization, the writers' social rank, for instance, has been defined by the parameter values nobility and gentry, professional (excluding clergy), clergy and other. The auxiliary database including this information is being compiled in co-operation with historians.

Another distinctive feature of letters as a data source is related to their communicative function and circumstances of production. The data represents on-line language use in an explicitly interactive communicative situation. There is relatively little or no editing. There is a rich variety of idiolectal grammars, many of them virtually unaffected by standardizing trends. Since some features of the visual prosody in the manuscript texts (spacing, marked character shapes, etc.) have been digitized using an annotation system designed for this purpose, the on-line processing of thoughts in the interactive communicative act of addressing a recipient is recorded. Even hesitation, deletions, insertions and corrections have been signalled as such in the digitized texts (see Section 3.1 Visual prosody and Section 3.4 Commentary).

Corrections of this kind can, of course, be seen as evidence of some degree of editing, but usually it is not possible to find evidence that a sequence of versions was prepared before the acceptance of the one to be sent to the addressee. Although they remain mostly unedited, letters do not necessarily represent unplanned discourse. This is because the form of a letter is quite strictly regulated by genre-specific schemata and the conventions of epistolary discourse (e.g. Nevala 2004: 37-40). In contrast with texts representing other genres in diachronic corpora, of which only a sample can be included, the whole text of a letter can be examined. This is highly significant for research which adopts the semantic-pragmatic approach, and it also permits a detailed analysis of text structure with reference to discourse strategies. Earlier research has shown that it is particularly politeness strategies in general and formulaic language use which condition the choice of linguistic features (e.g. Meurman-Solin 2000 on the introduction of the relative who and Bergs 2005 on morphosyntactic variation in the Paston letters). For information on stylistic literacy as reflected in early correspondence, see Meurman-Solin (2001) and Meurman-Solin and Nurmi (2004).

The ethnography of communication, including both situational aspects and those which define the participant relationship, can be reconstructed using both language-external factors and indirect evidence, the latter provided, for example, by the choice of terms of address and discourse strategies conveying respect. The role of letter-writing manuals will also have to be taken into account in interpreting the data (cf. Nevala 2004: 33-36).

Unlike the Helsinki Corpus of English Texts and the Helsinki Corpus of Older Scots, the CSC does not contain information about the variables of “level of formality” and “participant relationship” (see Kytö [1991] 1996 and Meurman-Solin 1993: 180-183). However, the user can define the parameter values of these two variables by using the information provided about the writer/informant and the addressee at the beginning of each text (see the discussion of the parameters %I and %A in Section 2.4 Language-external information in the text files).
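As an illustration only, a user-defined "participant relationship" value might be derived along the following lines. This is a minimal sketch, not a description of the CSC file format: it assumes that the %I and %A parameters each occupy one line near the start of a text file and that the value follows the marker on the same line. The function names, the sample header and the lookup table are all hypothetical; the actual layout is described in Section 2.4.

```python
def read_participants(lines):
    """Collect the first %I (writer/informant) and %A (addressee) values.

    Assumption: each parameter sits on its own line, marker first,
    value after it. The real CSC header layout may differ.
    """
    found = {"%I": None, "%A": None}
    for line in lines:
        stripped = line.strip()
        for marker in found:
            if found[marker] is None and stripped.startswith(marker):
                # Drop the marker and any separator characters after it.
                found[marker] = stripped[len(marker):].strip(" :=\t")
        if all(value is not None for value in found.values()):
            break  # both parameters seen; no need to read the letter body
    return {"informant": found["%I"], "addressee": found["%A"]}


def participant_relationship(informant, addressee, lookup):
    """Return the user's own label for a writer/addressee pair.

    `lookup` is a table supplied by the user, since the corpus itself
    does not code this variable.
    """
    return lookup.get((informant, addressee), "unclassified")


# Hypothetical header lines, invented for this sketch.
sample = [
    "%I Hypothetical Writer",
    "%A Hypothetical Addressee",
    "body of the letter ...",
]

people = read_participants(sample)
labels = {("Hypothetical Writer", "Hypothetical Addressee"): "family"}
relationship = participant_relationship(
    people["informant"], people["addressee"], labels
)
```

The lookup table is kept separate from the parsing step so that the same header information can be reused for a second user-defined variable, such as level of formality.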

Digital pictures of the letter manuscripts have not been provided in the present version of the CSC, and no attempt is made to describe the letters as physical objects. However, the transcription contains information about features such as the position of the date and place of writing, the positioning of the text on a folio or a number of folios, the use of margins, and any additional text positioned after the signature. The recipient's address on the folded folio is also given whenever there is one on the original. It is obvious, however, that this information is insufficient for a full reconstruction of a letter as an object; for example, information about wax seals attached to letters is not provided.


