2.4 Language-external information in the text files

Language-external information about the letters and their writers and addressees is provided in two forms. Firstly, language-external information is presented at the beginning of each file (each letter is a separate file in the CSC corpus). This information is structured according to a set of parameters. Secondly, an auxiliary database is being created with more detailed information about the informants (CSC informants). The compilation of the latter requires interdisciplinary expertise, and a revised version of it will be made available at a later date.

The file-initial information is structured as follows:

# 149
{%MS: NAS GD 3/5/49}
0 0 0
{%ST: a copy in the CSC archive
%DA: 1613 March 17
%CO: Letter by Jane Drummond to Anna Livingston
%BI: previously edited in the Memoirs of the Montgomeries, vol. 1, 33: 190-191
%IF: Jane Drummond
%AF: Anna Livingston, Countess of Eglinton
%HD1: autograph, italic
%LC: Perthshire, unspecified
%FN: DrummondJane61301
%WC: 476}

Each file begins with the symbol # followed by an identification number (149 in the above example). The various parameters providing language-external information are introduced by the percentage sign and marked with a symbol consisting of two upper-case characters and a colon. In addition, the parameter HD 'hand' categorizes autograph by 1 and non-autograph by 2.
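
As a rough illustration of this layout, the following Python sketch reads the parameters of a file-initial header into a dictionary. The function name parse_header and the regular expressions are assumptions made for this example and are not part of the CSC distribution; the sample text is the header quoted above.

```python
import re

# Sample header copied from the example in this section.
SAMPLE = """# 149
{%MS: NAS GD 3/5/49}
0 0 0
{%ST: a copy in the CSC archive
%DA: 1613 March 17
%CO: Letter by Jane Drummond to Anna Livingston
%BI: previously edited in the Memoirs of the Montgomeries, vol. 1, 33: 190-191
%IF: Jane Drummond
%AF: Anna Livingston, Countess of Eglinton
%HD1: autograph, italic
%LC: Perthshire, unspecified
%FN: DrummondJane61301
%WC: 476}
"""

# Assumed pattern: '%', two upper-case letters, an optional 1 or 2 (for HD),
# a colon, then the value; curly brackets may open or close the line.
PARAM = re.compile(r"^\{?%([A-Z]{2}[12]?):\s*(.*?)\}?$")
FILE_ID = re.compile(r"^#\s*(\d+)$")


def parse_header(text):
    """Return a dict with the file id and every %XX: parameter found."""
    header = {}
    for line in text.splitlines():
        line = line.strip()
        id_match = FILE_ID.match(line)
        if id_match:
            header["id"] = int(id_match.group(1))
            continue
        param_match = PARAM.match(line)
        if param_match:
            header[param_match.group(1)] = param_match.group(2).strip()
    return header


header = parse_header(SAMPLE)
print(header["id"], header["FN"], header["HD1"])
```

Lines that are neither the identification number nor a parameter (such as the line of zeros) are simply skipped in this sketch.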

%MS: stands for 'manuscript' and gives the reference for each document as catalogued in the various archives. NAS is the acronym for the National Archives of Scotland, NLS for the National Library of Scotland, and BL for the British Library. GD refers to the Gifts and Deposits kept at the NAS, whereas Adv is an abridged form of the Advocates' Library and Dep of Deposits at the NLS. MS occurs as part of the parameter value only if it is also part of the reference in the catalogues.

This information is followed by the year of writing, given as a three-digit number (686 for 1686). The zeros on the following line may later be replaced by coordinates based on the Ordnance Survey maps, which will allow the production of digital maps, permitting the presentation of data extracted from the CSC in the format of a linguistic atlas. Since the CSC and the Edinburgh Corpus of Older Scots apply the same system of coordinates, they could be used as a combined data source. However, a completed and revised auxiliary database containing information about language-external variables will be required for the localization of the informants, and the coordinates will be added when all this information, which will necessarily draw on multi-disciplinary expertise, becomes available (see Section 1.3).

The following set of comments in curly brackets contains information which is also provided in corresponding parameters below:

{+Perthshire+} {+unspecified+} %LC: Perthshire, unspecified
{=DrummondJane=} %FN: DrummondJane61301

{+Perthshire+} permits the grouping together of all documents localized in the region of Perthshire. Since this particular letter does not contain information about the place in which it was written, this is pointed out by {+unspecified+}; in numerous letters there is a place-name here (e.g. {+Perth+}).

{=DrummondJane=} allows the identification of all letters by a particular writer, exactly the same form of the name also occurring as the first part in the filenames of those letters (see %FN: below).
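
A minimal sketch of how these two comment types might be collected for grouping purposes is given below. The function name extract_comments and the patterns are assumptions for this example; optional spaces inside the brackets are tolerated.

```python
import re

# {+...+} marks a region or place, {=...=} marks the writer.
LOCALITY = re.compile(r"\{\+\s*(.*?)\s*\+\}")
WRITER = re.compile(r"\{=\s*(.*?)\s*=\}")


def extract_comments(text):
    """Collect the values of all {+...+} and {=...=} comments in a text."""
    return {
        "locality": LOCALITY.findall(text),
        "writer": WRITER.findall(text),
    }


sample = "{+Perthshire+} {+unspecified+}\n{=DrummondJane=}"
print(extract_comments(sample))
```

Grouping letters by region or by writer then amounts to comparing the first locality value, or the writer value, across files.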

%ST: stands for 'status' and describes whether a letter was transcribed in situ or whether the transcription is based on a photocopy or photograph of the original manuscript in the CSC archive. Letters transcribed in situ have only been rechecked once (cf. Section 2.3).

%DA: specifies the date of a letter, in the order year (e.g. 1686), month (e.g. April) and day (e.g. 13). It should be noted that in the Index of Sources the same information is coded as follows: year (3-digit), month (2-digit) and day (2-digit); thus, 6860413 stands for the thirteenth of April 1686. If a particular piece of information is doubtful, a question mark follows it (686?0413 in the case of a doubtful year, 68604?13 in the case of a doubtful month, and 6860413? in the case of a doubtful day). Zeros are used when information is missing (6860400). In the case of undated letters, the first or second half of a century has been given as the approximate date, drawing on information about the birth and death of the writer (?6000000 referring to the first half of the seventeenth century and ?6500000 to the second). An approximate date suggested in earlier editions or in the catalogues may also contain c for circa (c6000400).
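
The date codes used in the Index of Sources can be decoded mechanically. The sketch below is one possible reading of the conventions just described; the names IndexDate and decode_index_date are hypothetical, and the code assumes that the omitted first digit of the year is always 1 and that a leading question mark always signals a half-century estimate for an undated letter.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Optional prefix ('?' or 'c'), then year (3 digits), month (2) and day (2),
# each optionally followed by '?' when that component is doubtful.
CODE = re.compile(
    r"^(?P<prefix>[?c]?)"
    r"(?P<year>\d{3})(?P<yq>\??)"
    r"(?P<month>\d{2})(?P<mq>\??)"
    r"(?P<day>\d{2})(?P<dq>\??)$"
)


@dataclass
class IndexDate:
    year: int                   # full year; leading digit 1 is an assumption
    month: Optional[int]        # None when coded as 00 (missing)
    day: Optional[int]          # None when coded as 00 (missing)
    doubtful: Tuple[str, ...]   # components followed by a question mark
    approximate: str            # '', 'half-century' or 'circa'


def decode_index_date(code):
    m = CODE.match(code)
    if m is None:
        raise ValueError(f"not a recognised date code: {code!r}")
    flags = (("year", m["yq"]), ("month", m["mq"]), ("day", m["dq"]))
    doubtful = tuple(name for name, flag in flags if flag)
    approximate = {"": "", "?": "half-century", "c": "circa"}[m["prefix"]]
    return IndexDate(
        year=1000 + int(m["year"]),
        month=int(m["month"]) or None,
        day=int(m["day"]) or None,
        doubtful=doubtful,
        approximate=approximate,
    )


for code in ("6860413", "686?0413", "6860400", "?6500000", "c6000400"):
    print(code, decode_index_date(code))
```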

%CO: refers to 'contents'. In the file-initial parameter, information is restricted to naming the writer and the addressee. In the Index of Sources there may also be information about the topics discussed in a particular letter. When earlier editions or catalogues provide a summary of the contents, that summary has usually been given here. The compiler has not produced any summaries of her own, given that her longer-term plan is to take advantage of software being developed for the semantic annotation of the texts.

%BI: for 'bibliographical data' mostly focuses on providing references to earlier editions of the document.

%IF: stands for 'informant, female', the person who signed the letter. The other values of this parameter are %IM: 'informant, male' and %IR: 'informant, royal'.

There are a few letters in the CSC in which there are two signatures. In these, the name of the person in whose hand the letter is written is positioned first in the description of the informant. It should be noted that in non-autograph letters the informant signing the letter is not its writer. The user is advised to use the parameters %IF/IM/IR: and %HD1/2: in conjunction, in order to distinguish informants represented by autograph letters from those whose own hand only appears in the signature (and sometimes also the letter-closing formula). For more information, see %HD: below.

The file-initial parameter only contains a limited amount of information on the informant. The user is advised to consult the Index of Sources for further information.

In the present version of the CSC, informants have been described by extracting information from various sources, focusing on basic facts that can be directly related to the definition of the informants with reference to the variables of time, space and social milieu. As stated in Section 1.3 Dimensions of space, time and social milieu, time and space have been considered more important than other variables which have been viewed as relevant in recent research in historical sociolinguistics. Lack of balance as regards social stratification in the present version of the CSC, there being too few informants representing the lower social classes, has prevented the formalization of parameter values related to social milieu. In other words, being a linguist, the compiler has been reluctant to translate prosopographical information into a compartmentalized and compartmentalizing system of social indices without first consulting researchers in the fields of social and economic history and cultural studies. The provision of information without the suggestion that this information can be used in a straightforward way to 'explain' the linguistic findings is a very conscious policy in the CSC (cf. Meurman-Solin 2001).

%AF: is an abbreviation of 'addressee, female'. The other values of this parameter are %AM: 'addressee, male' and %AR: 'addressee, royal'.

This information is usually based on the address written on one side of a folded letter. When the manuscript does not have an address, a suggestion is sometimes recorded in the entry in the catalogues. Since any address on the manuscript is transcribed at the end of a given text file as part of the letter, the user will know which source of information has been used.

%HD: provides information about handwriting, stating whether a letter is autograph or non-autograph. When there are autograph and non-autograph passages in a particular letter, two parameters are used: %HD1: for autograph and %HD2: for non-autograph. The comments {hand1>} and {hand2>} in the text indicate where the autograph and non-autograph passages begin. The numbers 1 and 2 do not refer to the order in which the two hands occur in a text; instead, hand1 is always autograph and hand2 non-autograph (see also parameter %FN: below).

For example, in letters in which the body of the letter has been written by an amanuensis and the final formula and signature, for instance, by the informant, the text-initial parameters include the following information:
%HD1: autograph, italic
%HD2: non-autograph, secretary
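
To illustrate how the {hand1>} and {hand2>} comments could be used, the sketch below splits a text into passages labelled by hand. The function name split_by_hand is an assumption, and the sample string is invented purely for illustration; it is not taken from the corpus.

```python
import re

# A marker opens a passage; hand1 is always autograph, hand2 non-autograph.
MARKER = re.compile(r"\{hand([12])>\}")
LABELS = {"1": "autograph", "2": "non-autograph"}


def split_by_hand(text):
    """Return (label, passage) pairs in the order the passages occur."""
    # re.split with one capturing group yields:
    # [text before first marker, hand, passage, hand, passage, ...]
    parts = MARKER.split(text)
    return [
        (LABELS[hand], passage.strip())
        for hand, passage in zip(parts[1::2], parts[2::2])
    ]


# Invented sample: body by an amanuensis, closing formula by the informant.
sample = "{hand2>} body of the letter {hand1>} closing formula and signature"
print(split_by_hand(sample))
```

Any text preceding the first marker is ignored here; whether such text occurs in the corpus files is not stated in this section.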

%LC: for 'locality' specifies the region to which the writer has been localized and quotes the name of the place in which a particular letter was written. The following regions occur as values of the parameter %LC:

  • Aberdeenshire
  • Angus
  • Argyllshire
  • Ayrshire
  • Borders
  • Fife
  • Lanarkshire
  • Lothian
  • Moray
  • Perthshire
  • Ross (i.e. Cromarty and Ross)
  • South West (Dumfries and Galloway)
  • Stirlingshire
  • Sutherland

It is obvious that this system closely reflects the practical issues in the selection process (see Section 2.2 Selection of data for the CSC). The areas in the South-East are more densely demarcated, whereas in the North and the South-West localization is suggested with reference to larger areas. The fact that Sutherland is named among these regions reflects the compiler's interest in the Gordon family of Sutherland, and can also be explained by the easy access to the Sutherland deposits in the National Library of Scotland. The localization in %LC will be revised according to the forthcoming auxiliary database (CSC informants), which will contain more detailed information about the informants.

Instead of suggesting a region, two possible parameter values of %LC: specify the writer's adherence to the royal court (%LC: Court) or a professional coalition (%LC: Professional). In the present version of the CSC, the latter category is heterogeneous, containing members of the clergy, the army and the legal profession, for example. The decision not to localize professional people in this first version of the CSC is based on the general assumption that the language of members of the clergy, for example, will reflect a geographically defined variety only partly, if at all, being influenced by the shared properties of conventionalized professional discourse. Since the present version also has too few informants in the category 'Professional', a more refined categorization will be postponed until the second expanded version of the CSC.

As regards the place in which a particular letter was written, it should be noted that the place-name has either been replaced by its modern equivalent or it is quoted (using single quotation marks) in the form in which it appears in the original manuscript. The latter practice has been adopted only if a place-name remains unidentifiable. When names of castles, palaces, residences or institutions appear in a letter (e.g. Stirling Castle, Holyrood House, Whitehall), these have been replaced by names of cities (Stirling, Edinburgh, London). The parameter value 'unlocalized' is used when the writer is only known by name and further information about him/her remains unknown.
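
The possible values of %LC: can be checked with a few lines of code. The sketch below assumes that the parameter value consists of a region (or Court, Professional or unlocalized) optionally followed by a comma and a place-name, as in the example header, and that Ross and South West appear in that short form; both points are assumptions, as is the function name parse_locality.

```python
# Regions listed above; the short forms 'Ross' and 'South West' are assumed.
REGIONS = {
    "Aberdeenshire", "Angus", "Argyllshire", "Ayrshire", "Borders", "Fife",
    "Lanarkshire", "Lothian", "Moray", "Perthshire", "Ross", "South West",
    "Stirlingshire", "Sutherland",
}
# Values that replace a region for courtiers and professional people.
NON_REGIONAL = {"Court", "Professional"}


def parse_locality(value):
    """Split an %LC: value into region and place and classify the region."""
    region, _, place = value.partition(",")
    region = region.strip()
    place = place.strip() or None
    if region in REGIONS:
        kind = "region"
    elif region in NON_REGIONAL:
        kind = "non-regional"
    elif region == "unlocalized":
        kind = "unlocalized"
    else:
        kind = "unknown"
    return {"region": region, "place": place, "kind": kind}


print(parse_locality("Perthshire, unspecified"))
```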

%FN: stands for 'filename' and thus gives the filename of the letter in the corpus. This filename will appear in form lists, item lists, and concordances, for example, indicating where a particular occurrence has been attested. The filenames have been selected with the following practices as guidelines: earls are referred to in terms of their position in the line of succession, e.g. 10Angus, 11Angus, 12Angus for the 10th, 11th, and 12th Earls of Angus; the title Lord is represented by L in filenames (11LFleming, the 11th Lord Fleming), the title Marquis by M, Countess by Cs and Duchess by Ds (4MHuntly, 12CsSutherland, 3DsHamilton). The information given in other filenames is structured in the following order:

CrichtonEE64701
  Family name, followed by the initial of the family into which a female informant has married: CrichtonE (Crichton, married into the family of the Earls of Eglinton)
  Initial of first name: E (Elizabeth)
  Year of writing (a three-digit number): 647
  Number in a sequence of letters by the same writer in a particular year (a two-digit number): 01
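
The ordering just described can be expressed as a simple pattern. The following sketch separates the name part from the year and the sequence number; it does not attempt to split the name part further, since names take different forms (compare CrichtonEE and DrummondJane). The function name parse_filename is hypothetical, and the full year is reconstructed on the assumption that the omitted first digit is 1.

```python
import re

# Name part, then a three-digit year, then a two-digit sequence number.
FILENAME = re.compile(r"^(?P<name>.+?)(?P<year>\d{3})(?P<seq>\d{2})$")


def parse_filename(filename):
    """Split a filename such as CrichtonEE64701 into its components."""
    m = FILENAME.match(filename)
    if m is None:
        raise ValueError(f"unexpected filename: {filename!r}")
    return {
        "name": m["name"],
        "year": 1000 + int(m["year"]),  # assumes the missing digit is 1
        "sequence": int(m["seq"]),
    }


print(parse_filename("CrichtonEE64701"))
print(parse_filename("DrummondJane61301"))
```

Filenames that carry a hand code (h followed by digits) do not fit this pattern and raise an error in this sketch.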

In the case of letters written in different hands but signed by the same informant, the filename indicates in which hand a particular letter was written. For example, there are four letters by James Melville, 5th son of George, 4th Lord and 1st Earl of Melville, in the CSC. These are written in three different hands, which is indicated by 'h' followed by the number of the hand; the final number indicates whether the letter is the first written in that hand in a particular year, or the second, or the third, and so on:

MelvilleJ676h11 = the first letter in hand 1 in 1676
MelvilleJ682h21 = the first letter in hand 2 in 1682
MelvilleJ682h22 = the second letter in hand 2 in 1682
MelvilleJ650h31 = the first letter in hand 3 in 1650

When there are two different hands in a particular letter (see %HD: above), the filename indicates this using the code h followed by the numbers of each hand. For example:

11LFleming686h121 = the first letter in hands 1 and 2 in 1686

It should be noted, however, that two hands are not indicated in filenames if only the signature (a name) is autograph.
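
Filenames with a hand code can be decoded along the same lines. The sketch below assumes, on the basis of the examples above, that each hand is identified by a single digit and that the sequence number after the hand digits is also a single digit; the function name parse_hand_filename is hypothetical.

```python
import re

# Name part, three-digit year, 'h', one digit per hand, one-digit sequence.
HAND_FILENAME = re.compile(
    r"^(?P<name>.+?)(?P<year>\d{3})h(?P<hands>\d+)(?P<seq>\d)$"
)


def parse_hand_filename(filename):
    """Decode filenames such as MelvilleJ682h22 or 11LFleming686h121."""
    m = HAND_FILENAME.match(filename)
    if m is None:
        raise ValueError(f"no hand code in filename: {filename!r}")
    return {
        "name": m["name"],
        "year": 1000 + int(m["year"]),  # assumes the missing digit is 1
        "hands": [int(digit) for digit in m["hands"]],
        "sequence": int(m["seq"]),
    }


print(parse_hand_filename("MelvilleJ682h22"))
print(parse_hand_filename("11LFleming686h121"))
```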

The tagged texts are of type .tag. In addition, the last symbol of all filenames is e, which indicates that a text originally written in Helsinki format has been converted into Edinburgh format before subjecting it to tagging. The main difference between the two formats is the Edinburgh practice of using upper case, only contracted forms being rendered in lower case.

%WC: is an abbreviation of 'word count'. The totals were calculated with the standard word-count tool in Microsoft Word.


Meurman-Solin, Anneli 2001. 'Structured Text Corpora in the Study of Language Variation and Change'. Literary and Linguistic Computing, 16/1:5-27.