1.4 Letters as a data source
This section aims to highlight aspects of letters which relate to their use as a data source in linguistic research. The points raised are based on the compiler's own research in the fields of historical linguistics, historical sociolinguistics, stylistics, pragmatics and dialectology, amongst others.
As amply evidenced by the Corpus of Early English Correspondence (CEEC; in 1996 2.4 million words) and its expanded versions, a database exclusively containing letters can be carefully structured according to sociolinguistically relevant variables such as the writer's social rank, gender, age, social and geographical mobility, and education (Nevalainen and Raumolin-Brunberg 1996, 2003). A large database and a large number of informants are required for a balanced account of social stratification. While the size of the CEEC is adequate for the application of the methodological principles and practices that have been specifically tailored to make it a valid tool for historical sociolinguistics, the much smaller size of the present version of the CSC only permits the use of more general classifying criteria (for information on size, see Nurmi 2002). Therefore, instead of a more detailed categorization, the writers' social rank, for instance, has been defined by the parameter values nobility and gentry, professional (excluding clergy), clergy and other. The auxiliary database including this information is being compiled in co-operation with historians.
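The coarse social rank classification described above can be illustrated with a minimal sketch. It assumes only the four parameter values named in the text; the names `SocialRank`, `rank_distribution` and the example writers are hypothetical and do not reflect the structure of the auxiliary database.

```python
from collections import Counter
from enum import Enum


class SocialRank(Enum):
    """The four coarse rank values named in the text."""
    NOBILITY_AND_GENTRY = "nobility and gentry"
    PROFESSIONAL = "professional (excluding clergy)"
    CLERGY = "clergy"
    OTHER = "other"


def rank_distribution(writers):
    """Count writers per rank; `writers` maps a writer id to a SocialRank."""
    counts = Counter(writers.values())
    # Report every category, including those with no writers,
    # so that gaps in the stratification are visible.
    return {rank: counts.get(rank, 0) for rank in SocialRank}


# Hypothetical writers, invented purely for illustration.
example_writers = {
    "writer_a": SocialRank.NOBILITY_AND_GENTRY,
    "writer_b": SocialRank.CLERGY,
    "writer_c": SocialRank.NOBILITY_AND_GENTRY,
}

distribution = rank_distribution(example_writers)
for rank, count in distribution.items():
    print(f"{rank.value}: {count}")
```

A tally of this kind makes it easy to see which categories are thinly represented, which is the reason a small corpus can support only general classifying criteria.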
Another distinctive feature of letters as a data source is related to their communicative function and circumstances of production. The data represents on-line language use in an explicitly interactive communicative situation. There is relatively little or no editing. There is a rich variety of idiolectal grammars, many of them virtually unaffected by standardizing trends. Since some features of the visual prosody in the manuscript texts (spacing, marked character shapes, etc.) have been digitized using an annotation system designed for this purpose, the on-line processing of thoughts in the interactive communicative act of addressing a recipient is recorded. Even hesitation, deletions, insertions and corrections have been signalled as such in the digitized texts (see Section 3.1 Visual prosody).
Corrections of this kind can, of course, be seen as evidence of some degree of editing, but usually it is not possible to find evidence that a sequence of versions was prepared before the acceptance of the one to be sent to the addressee. Although they remain mostly unedited, letters do not necessarily represent unplanned discourse. This is because the form of a letter is quite strictly regulated by genre-specific schemata and the conventions of epistolary discourse (e.g. Nevala 2004: 37-40). In contrast with texts representing other genres in diachronic corpora, of which only a sample can be included, the whole text of a letter can be examined. This is highly significant for research which adopts the semantic-pragmatic approach, and it also permits a detailed analysis of text structure with reference to discourse strategies. Earlier research has shown that the choice of linguistic features is conditioned particularly by politeness strategies in general and by formulaic language use (e.g. Meurman-Solin 2000 on the introduction of the relative who and Bergs 2005 on morphosyntactic variation in the Paston letters). For information on stylistic literacy as reflected in early correspondence, see Meurman-Solin (2001) and Meurman-Solin and Nurmi (2004).
The ethnography of communication, including both situational aspects and those which define the participant relationship, can be reconstructed using both language-external factors and indirect evidence, the latter provided, for example, by the choice of terms of address and discourse strategies conveying respect. The role of letter-writing manuals will also have to be taken into account in interpreting the data (cf. Nevala 2004: 33-36).
Unlike the Helsinki Corpus of English Texts and the Helsinki Corpus of Older Scots, the CSC does not contain information about the variables of “level of formality” and “participant relationship” (see Kytö 1996 and Meurman-Solin 1993: 180-183). However, the user can define the parameter values of these two variables by using the information provided about the writer/informant and the addressee at the beginning of each text (see the discussion of the parameters %I and %A in Section 2.4 Language-external information in the text files).
Digital pictures of the letter manuscripts have not been provided in the present version of the CSC, and no attempt is made to describe the letters as physical objects. However, the transcription contains information about features such as the position of the date and place of writing, the positioning of the text on a folio or a number of folios, the use of margins, and any additional text positioned after the signature. The recipient's address on the folded folio is also given whenever there is one on the original. It is obvious, however, that this information is insufficient for a full reconstruction of a letter as an object; for example, information about wax seals attached to letters is not provided.
Bergs, Alexander 2005. Social Networks and Historical Sociolinguistics. Studies in Morphosyntactic Variation in the Paston Letters (1421-1503). Berlin and New York: Mouton de Gruyter.
Kytö, Merja (comp.) 1996. Manual to the Diachronic Part of the Helsinki Corpus of English Texts: Coding Conventions and Lists of Source Texts. Third ed. Helsinki: Department of English, University of Helsinki.
Meurman-Solin, Anneli 1993. Variation and change in early Scottish prose. Studies based on the Helsinki Corpus of Older Scots. (Annales Academiae Scientiarum Fennicae, Diss. Humanarum Litterarum, 65). Helsinki.
Meurman-Solin, Anneli 2000. 'Geographical, socio-spatial and systemic distance in the spread of the relative who in Scots'. In: Generative Theory and Corpus Studies: A Dialogue from 10ICEHL, ed. Ricardo Bermúdez-Otero, David Denison, Richard M. Hogg and C. B. McCully. Berlin: Mouton de Gruyter, 417-438.
Meurman-Solin, Anneli 2001. 'Women as Informants in the Reconstruction of Geographically and Socioculturally Conditioned Language Variation and Change in the 16th and 17th Century Scots'. Scottish Language 20: 20-46.
Meurman-Solin, Anneli and Arja Nurmi 2004. 'Circumstantial Adverbials and Stylistic Literacy in the Evolution of Epistolary Discourse'. In: Language Variation in Europe. Papers from ICLaVE2, ed. Britt-Louise Gunnarsson, Lena Bergström, Gerd Eklund, Staffan Fridell, Lise H. Hansen, Angela Karstadt, Bengt Nordberg, Eva Sundgren and Mats Thelander. Uppsala: Universitetstryckeriet, 302-314.
Nevala, Minna 2004. Address in Early English Correspondence. Its Forms and Socio-Pragmatic Functions. Mémoires de la Société Néophilologique de Helsinki, LXIV. Helsinki: Société Néophilologique.
Nevalainen, Terttu and Helena Raumolin-Brunberg 1996. 'The Corpus of Early English Correspondence'. In: Sociolinguistics and Language History. Studies Based on the Corpus of Early English Correspondence, ed. Terttu Nevalainen and Helena Raumolin-Brunberg. Amsterdam: Rodopi, 39-54.
Nevalainen, Terttu and Helena Raumolin-Brunberg 2003. Historical Sociolinguistics. London: Longman.
Nurmi, Arja 2002. 'Does size matter? The Corpus of Early English Correspondence and its sampler'. In: Variation Past and Present. VARIENG Studies on English for Terttu Nevalainen, ed. Helena Raumolin-Brunberg, Minna Nevala, Arja Nurmi and Matti Rissanen. (Mémoires de la Société Néophilologique de Helsinki, 61). Helsinki: Société Néophilologique, 173-184.