Preface and acknowledgements

Anneli Meurman-Solin, University of Helsinki, Finland

With links to each section of the manual.

This site contains the first manuscript-based and lexico-grammatically tagged digital database of early Scottish epistolary prose texts, with software for data retrieval and presentation as well as a comprehensive introduction and manual. The database is theoretically and methodologically innovative, in the sense that it provides alternative tagging systems representing different degrees of elaboration and indicates features of the visual prosody of the manuscript originals. The general approach to linguistic data is thoroughly variationist, with a high degree of idiolectal, local and regional variation fully recorded in the diplomatically transcribed and digitized texts of the corpus (see Part I). Auxiliary information about language-external variables related to the texts and their authors and addressees will permit the use of the corpus for historical sociolinguistic research (for language-external information provided in the text file, see Section 2.4 Language-external information in the text files, and the auxiliary databases 'Data arranged by time and space, with word counts', 'Index of Sources' and 'CSC informants').

The idea behind the compilation of the Corpus of Scottish Correspondence (CSC) draws on a long-term exchange of ideas between researchers active in the scholarly community of the International Computer Archive of Modern and Medieval English (ICAME) and on team-work in the Research Unit for Variation, Contacts and Change in English, funded by the Academy of Finland and the University of Helsinki, in the area of compiling diachronic corpora. The following corpora could be mentioned here: the Helsinki Corpus of English Texts, the Helsinki Corpus of Older Scots, the Corpus of Early English Correspondence Sampler, the Corpus of Early English Medical Texts (Middle English Medical Texts, compiled by Irma Taavitsainen, Päivi Pahta and Martti Mäkinen. John Benjamins, 2005; CD-ROM), and the Parsed Corpus of Early English Correspondence (PCEEC), which is stored at the Oxford Text Archive and ICAME. However, while the above-named corpora are primarily based on editions, the CSC consists exclusively of diplomatically transcribed and digitized versions of original manuscripts, and therefore closely resembles the corpora which function as databases for the Edinburgh historical atlases, the Linguistic Atlas of Early Medieval English (LAEME, being compiled by Margaret Laing and Roger Lass), covering the period c. 1150 to c. 1300, and the Linguistic Atlas of Older Scots (LAOS, being compiled by Keith Williamson), phase 1, c. 1380 to c. 1500.

The creation of coherence in the theoretical and methodological approaches of the LAEME, LAOS and CSC databases requires close long-term collaboration. LAEME and LAOS are concerned with the reconstruction of the diatopic-diachronic patterns of the medieval Anglic vernaculars of England and Scotland. The basic methodology applied to these atlases derives from that used to make A Linguistic Atlas of Late Medieval English (LALME, McIntosh, Samuels and Benskin 1986). However, the methodology created for the LALME has been developed further, so that the databases of linguistic material are lexico-grammatically tagged corpora of full texts, diplomatically edited, rather than questionnaire-delimited sets of isolated word-forms (Williamson 1992/93). Furthermore, the "fit-technique", a method of interpolating texts of unknown provenance into a dialect continuum, has been computerized (Williamson 2000, Laing and Williamson 2004).

The compilation of a corpus of Scottish correspondence was motivated by my awareness that royal, official and family letters were a data source with unique properties for research that seeks to reconstruct both past language use and social and cultural practices (see Section 1.1 Reconstruction of text languages). Correspondence can be considered a unique source in the sense that it offers both linguists and historians a wide range of informants representing different degrees of linguistic, stylistic and socio-cultural literacy; the idiolects and group-lects also reflect the influence of geographical and social distance and mobility (see Section 1.4 Letters as a data source).

A number of other factors influenced the decision-making process during the creation of the CSC (see Section 1.2 Electronic data sources for Older Scots). Since three geographical areas, East Anglia, London and the North of England, are well represented in the Corpus of Early English Correspondence (CEEC), the focus on Scotland seemed very relevant. (The Court has been defined as a fourth area, more social than geographical. For more information on the CEEC, see Nevalainen, Terttu and Helena Raumolin-Brunberg (eds) 1996. Sociolinguistics and Language History. Studies based on the Corpus of Early English Correspondence. Amsterdam - Atlanta, GA: Rodopi; Nevalainen, Terttu and Helena Raumolin-Brunberg 2003. Historical sociolinguistics. London: Longman.) In order to trace the diachronic developments and diffusion of numerous linguistic features in the history of English, directly comparable data originating from the various areas of Scotland is required.

The general purpose of the CSC is to offer the international academic community a tool for both teaching and research which will permit the study of a wide range of letters representing different types of speech, discourse and text community in sixteenth-, seventeenth- and early eighteenth-century Scotland (see Section 1.3 Dimensions of space, time and social milieu). This new tool has been designed to function as a useful data source for historical dialectology, historical sociolinguistics, historical pragmatics and historical stylistics, but it will also provide a rich resource for topics such as political and socio-economic history, cultural studies, women's studies, genealogy, and the history of Scottish handwriting. Since this manuscript-based corpus also presents a coherent view on how methods of philological computing can be applied to historical documents in modern corpus linguistics, it may be used as one of the standard tools in courses on linguistic and literary computing and manuscript studies (see Section 2.3 Transcription and digitization).

The CSC comprises approximately 500,000 words of running text representing royal, official, and family letters, based on original manuscripts dating from the period 1500-1730 and originating from the various areas of Scotland (see Section 2.2 Selection of data for the CSC). The size is comparable to the Corpus of Early English Correspondence Sampler. In addition to information on the compilation, digitization and tagging principles and practices applied to the database, the Manual for the CSC contains a full description of the theoretical approach used, which is reflected in how variation, variability and change have been conceptualized, and of the implications this has for the system of lexico-grammatical tagging applied to the data (Part III and Part IV). Since language use in early Scottish letters is strongly conditioned by the writers' geographical and social mobility and the types of social network they are involved in, rather than just their geographical origin, the corpus data have not been translated into a linguistic atlas (Meurman-Solin 2000a-c, 2001). Thus, information has been provided about the geographical area the writers originate from and the place where a particular letter was written, but in order to define the variables of social mobility and socio-economic distance, the user will have to consult research on Scottish political, social and economic history.

The distinctive profile of this database has been created by applying a variationist approach to the tagging of linguistic and, to some extent, non-linguistic features. Instead of tagging and parsing systems in which a restricted set of conventional category labels is used to classify linguistic items word by word, either by word-class or syntactic function, the general approach in the CSC draws on principles of notional grammar, emphasizing phenomena such as categorial fuzziness and polyfunctionality, indicating potential for membership on a particular cline – one depicting nouniness or adverbhood, for example – and signalling relations between the constituent parts of collocates (see Section 3.2 Principles of tagging and Section 3.3 Practices of tagging). A system of this kind is particularly relevant in tagging certain language varieties, such as the idiolects of less-trained and inexperienced female writers in early Scotland, in which the influence of standardizing trends is barely visible.

This manual has a dual function. Firstly, it will provide practical information aimed at making the compiler's decisions as transparent as possible for new users of the database. It will illustrate alternative ways of creating research shapes of the base corpus to ensure the best possible fit between the research question and the data (for the concept of 'research shape', see Section 2.1 Protean corpora: multidimensionality, flexibility, and transparency), and of designing searches which will take full advantage of the varying degrees of elaboration in the tagging (see Section 3.2 Principles of tagging and Section 3.3 Practices of tagging). While the keys to the tags and comments will instruct the user how to interpret the tagging language (see Key to tags and Key to comments), the manual will explain the compiler/tagger's grammar, as well as her thinking with regard to features such as comments indicating zero realisation or functioning as semantic disambiguators (see Section 3.4 Commentary). Section 3.1 Visual prosody examines the digitization of the visual prosody present in the manuscript originals (i.e., non-linguistic features such as manuscript layout, paragraph structure, punctuation, particular character shapes and spacing). Digital photographs of some manuscripts have been used as illustrations, but the present version of the CSC does not contain digital pictures of all the manuscripts.

Secondly, in Part IV of the Manual there are a number of case studies published exclusively on this site. These will provide further information on the theoretical and methodological approach to the synchronic and diachronic description of past stages of language use, with a focus on the shaping of language systems and subsystems over space, time and social milieu using comprehensive inventories created by searches based on a component or a string of components in the tags. The first paper examines how the combination of typology-oriented quantitative methods and the methods of pragmatics and discourse analysis permits us to create a variationist taxonomy of discourse anaphora. The second paper, entitled 'Annotating variational space over time' and available online in the series Studies in Variation, Contacts and Change in English, discusses how the fuzziness and polyfunctionality of linguistic categories can be dealt with in tagging data that reflect variation and change over time.

I would like to thank Doctor Margaret Laing (University of Edinburgh), Professor Roger Lass (University of Cape Town) and Doctor Keith Williamson (University of Edinburgh) at the Institute for Historical Dialectology, University of Edinburgh, for permitting me to benefit from their unique expertise in the creation of manuscript-based diachronic corpora. A contract between the universities of Edinburgh and Helsinki has allowed me to use software created by Doctor Keith Williamson in the tagging of the CSC texts. I would like to acknowledge his very important role in the process of developing the theoretical and methodological approach applied to the CSC. I would also like to thank members of the Research Unit for Variation, Contacts and Change in English (VARIENG) for their unfailing support. Without the research assistants provided by the VARIENG Research Unit it would not have been possible to complete this project; Johanna Lahti, Ulla Paatola, Elina Sorva († 2006), Riikka Tuomi, Turo Vartiainen and Minna Åkerman participated in transcribing the manuscripts. Elina Sorva was responsible for a major part of the transcription and digitization work, and also achieved a high level of expertise as a tagger during the more than three years that she participated in the project. In addition to tagging, Turo Vartiainen assisted me in the writing of the Manual, and Olga Timofeeva helped in tagging and the final editorial work. I am greatly indebted to the Helsinki Collegium for Advanced Studies (University of Helsinki) for funding my research during the period 2002-2007. Saara Paatero-Burtsov, a research assistant at the Collegium, transcribed a considerable number of manuscript letters in 2002-2003, and Jenni Laitinen and Eeva Hohti helped me in the creation of the auxiliary databases. Tuuli Tahko converted the Manual to html.

While my sincerest thanks go to all the participants in this project, I would like to dedicate the CSC to the memory of my greatly loved and much appreciated friend and colleague Elina Sorva.


