Annotating variational space over time

Anneli Meurman-Solin, Research Unit for Variation, Contacts and Change in English, University of Helsinki

1. Category labels versus co-ordinates in space and time

Electronic databases have been constantly improving as regards their quantitative and qualitative validity and relevance. This general remark applies to corpora representing present-day language varieties as well as those comprising historical texts. Their size, balance and representativeness have improved; authenticity has become a widely-discussed issue; and the definition of language-external variables is increasingly based on knowledge provided by interdisciplinary research. Nonetheless, compromises of various kinds still seem to be inevitable in the lexico-grammatical annotation of these databases, especially in how linguistic features can be tagged to take into account continuous variation and change. Tagging which is sensitive to the inherent fuzziness and polyfunctionality of linguistic features is necessary in particular for data covering a long time-span.

Quality of digital resources

  • Size, balance, representativeness
  • Authenticity of data
  • Language-external variables
  • Annotation: neat category labels based on earlier non-corpus-based grammatical descriptions

As discussed in more detail in the Manual to the Corpus of Scottish Correspondence (CSC), 1500-1730, the most widespread practice in the tagging of corpus data is still for the annotation to be based on pre-corpus-linguistics grammatical descriptions. In other words, we may be using category labels for data retrieval which have been created and defined in non-corpus-based grammatical descriptions or standard reference works which draw on a limited set of data or even on a corpus which is not representative of the language variety being tagged. In an approach of this kind, compromises of various kinds offer practical solutions, but are undoubtedly in disharmony with the current theory and methods of data-based and data-oriented English historical linguistics.

The obvious reason for resorting to automatic (i.e. non-interactive) or semi-automatic tagging is that, quite justifiably, we may find it necessary to make the process less time-consuming. Despite the perceived ambiguity or polyfunctionality of particular linguistic features, we may also resort to imposing neat category labels on such data. This is often because we do not want to create a categorization system which is unduly complex, or, as we often put it, not sufficiently user-friendly.

As the compiler and annotator of the CSC corpus, I have aimed to develop a method which permits the analysis of variation and change both synchronically and diachronically. This tagging has two main goals: at the level of synchronic analysis, annotation should permit the mapping of the variational space of linguistic features, and at the level of diachronic analysis, it should permit the tracing of historical continua over a long time-span. The system is based on providing information which reflects varying degrees of elaboration, depending on the range of variation and change recorded in the data. In addition to variation reflected in the use of a single lexical or grammatical item, variational space may be shared by a group of features which are seen to be related to one another according to structural, semantic, morpho-syntactic, discoursal or textual criteria. [1] Cf. the Edinburgh linguistic atlases Laing 1993, Laing 2002, Laing 2004, Laing & Williamson 2004, Williamson 1992/93, Williamson 2000, Williamson 2001, Williamson 2004, Williamson 2005 and Stenroos in this volume.

Historical continua can be reconstructed using a set of properties in the tag of an item which function as co-ordinates for mapping processes of change. A core property string may consist of the primary word-class (that of the source), followed by a property defined by the immediate structural context and one defined by the discourse function. This practice permits, for example, the tracing of grammaticalization and lexicalization processes, and also makes transparent those relations between features that, due to the different pace or intensity of change, would otherwise be difficult or impossible to retrieve.

Key concepts

  • Synchronic: variational space
  • Diachronic: historical continua

2. The tagged CSC

The Corpus of Scottish Correspondence (CSC) is a corpus of sixteenth-, seventeenth- and early eighteenth-century correspondence which comprises diplomatically transcribed and digitized original manuscripts of royal, official and family letters. The language this collection of letters represents is referred to as Scottish English. This is a language variety which reflects a great degree of heterogeneity; a relatively high degree of uniformity resulting from standardization is only attested from the mid-seventeenth century (Aitken 1971, Devitt 1989, Meurman-Solin 1993, Meurman-Solin 2000a, Jones 1997, Corbett et al. 2003).

The CSC is a web-based resource; in addition to the digitized manuscripts, which include file-initial language-external parameters, the following auxiliary databases are also available on this website: the Manual, the Index of Sources, a catalogue of CSC informants, Data arranged by time and space, with word counts, a Key to tags, a Key to comments, Software for data retrieval, analysis and presentation, and a bibliography, Further reading on Older Scots.

Corpus of Scottish Correspondence (CSC) 1500-1730

  • The CSC comprises diplomatically transcribed and digitized editions of original manuscripts of royal, official and family letters by both male and female writers originating from the various areas of Scotland.
  • The data are tagged using software developed by Keith Williamson (Institute for Historical Dialectology, University of Edinburgh).
  • The CSC is a web-based resource, with a manual, auxiliary databases containing language-external information and software for data retrieval and presentation.

I have found the annotation of correspondence especially interesting because epistolary prose represents online language use in an interactive communicative setting. A further benefit is that, in contrast to editions, which usually introduce modern punctuation, the digitized manuscripts permit the analysis of sentence and clause structure. [2]

The following digitized manuscript will illustrate the type of document included in the CSC corpus:

Manuscript data

Published by the kind permission of the Trustees of Sir David Ogilvy of Inverquharity, Bt.

Trustie and luiffing freind Ze sall witt that I \ haue appointit to be vpoun ye grund of contravertit \ land betuix Pittarro and edzell the xxviij of apryll \ nixtocum And Becaus that I resolue to speik \ wt my tennentis of keremure Anent the libertie \ of yar burt THairfor I thocht meit to aduerteis \ zow that I will be wt zow ane nicht ayer in the \ weik efter pasche qlk will be my furthgoing Or - \ ellis in my heamcumming qlk will be in ye weik efter \ lawsounday And this I thocht gude to mak zow \ aduertisit of And sua to farder occasioun I rest \ Zouris assurit freind \ V D Erll Anguss \ {adjacent>} Edinburgh xxxj Martij 1606 {end} (NAS GD205/1/34 William Douglas, 10th Earl of Angus to Sir John Ogilvy of Inverquharity)

Some features of visual prosody (see the CSC Manual: Visual prosody) [3] in this manuscript affect grammatical categorization, and certain variants also suggest categorial fuzziness. There are a number of instances of the connective and, the text-structuring function of which can be identified by the considerably extended shape of the initial character, which is clearly different from the shapes this character has elsewhere in the text. These extended shapes have been digitized in upper case, preceded by an asterisk (e.g. *AND). [4]

3. The theoretical approach to variationist annotation

The general theoretical approach to the tagging of the CSC data can be defined with reference to the following key elements:

Key features of the variationist typological approach to annotation

  • Continued variation and variability hypothesis
  • Relevant relations between features identified on the basis of structural and semantic criteria
  • Distinction between textual, discoursal, sentential, clausal and phrasal levels of analysis
  • Discourse basis of categorization, with fuzziness and polyfunctionality as inherent features in both synchrony and diachrony
  • Inclusion of zero-realisations in variational paradigms
  • Variation resulting from elaborated to compressed processing of information

3.1 Continued variation and variability hypothesis

The tagging is designed to reflect a profoundly variationist perspective. The shape of Scots, conditioned by time, place and social milieu (see the CSC Manual: Dimensions of space, time and social milieu), is assumed to reflect continued variation and variability at the idiolectal, local, regional, and supraregional levels, resulting in a high degree of language-internal heterogeneity, which is further increased by contact with a number of other languages and language varieties. A sophisticated tagging system is required for identifying and analysing such complex patterns of variation and for tracing multidirectional processes of change, including the assessment of their relative intensity in terms of time and space.

While stressing continued variation and change, it should be noted that, in applying the variationist approach to the reconstruction of historical varieties, no pattern of variation can be recorded fully. Due to the scarcity of texts from the early periods in particular, there are usually significant gaps in our database. Only in ideal circumstances can all the members of a particular pattern of variation be attested as alternatives in the data; in such cases the information provided by the tags would permit the creation of comprehensive inventories of such patterns. In practice, if only fragments rather than a balanced and representative collection of texts are available, we will also have to consider the implications of potential but unattested variants; one way we might achieve this would be to draw on typological information about corresponding systems in later periods, or those representing other languages or language varieties. Thus, while even highly sophisticated data-retrieval systems will not permit us to depict the full scale of variation in Older Scots, the corpus-based variationist approach, with its deeply ingrained emphasis on being data-driven and data-oriented, will provide a more reliable synchronic and diachronic description of this language variety.

3.2 Relevant relations between features identified on the basis of structural and semantic criteria

In the CSC system, structural (including contextual) and semantic properties are viewed as of primary importance. The wide range of clause-combining devices and major changes in their inventories (cf. Kortmann 1997) provides an appropriate example to illustrate the application of semantic criteria to the identification of features which create links between chunks of text, whether these are phrases, clauses, sentences, paragraphs or passages that can be analysed as units on the basis of their communicative function. In the analysis of connectives, the system provides information about features which have been interpreted as having semantic potential to build relations between clausal structures or larger units of text, irrespective of whether such links reflect grammaticalization.

For example, rather than suggesting the syntactic category of conjunction, that is, claiming that a grammaticalization process has been completed, the tag of items such as end in the phrase to the end that indicates that, in the context in which this feature has been attested, it has the semantic potential to function as a connective:





n = primary word-class
av = in an adverbial PrepP
cj = with a connective function
c = with the complementizer that
post = introducing a subordinate clause occurring after the main clause

The meaning 'an intended result of an action; an aim, purpose' (OED end n 14.a-b) permits the investigation of this connective phrase as one of several linking devices with the meaning of purpose.

For more information on the structure of the tags in the examples provided by this paper, see the CSC Manual: The structure of a tag.

Considering semantic properties as primary also solves the problem of structural and syntactic ambiguity:

Dispelled ambiguity

  • We shall wait unto the time we receive your lordship's answer to this letter.




    {zero that<}

  • Cf. variation with to the time (that), until the time (that), until (that), till (that), to

Annotation problems caused by ambiguity have been solved by providing information about both the prototypical (which is usually also the earliest) use of a particular item and its use in discourse in the text being tagged. In the sentence above, the category membership of the word time is described in terms of a noun (n) being used in a prepositional phrase as an adverbial (av) functioning as a connective in the semantic role of time (cj). The element '-c' is co-referential with the comment {zero that<}, both indicating that there is no complementizer (cf. unto the time that). Using tags of this kind, it is possible to create an inventory in which connectives such as to the time (that), until the time (that), until (that), till (that) and to ('until') are part of the same variational pattern (see also Meurman-Solin 2002 and Brinton 2007).

For more examples, see the CSC Manual: Connectives.

In the present approach, structural and semantic properties are considered primary, whereas syntactic properties are interpreted as secondary. In other words, the core function of structural and semantic information is descriptive — descriptive at the micro-level — while that of syntactic information is interpretative — interpretative at the macro-level, i.e. intended to identify grammatical rules and constraints.

Thus, all features that have the semantic potential to be interpreted as combining devices are included in inventories of connectives that are analysed in great detail, such inventories being complemented by reduced and zero realisations of links. Tags are elaborated in order to facilitate data retrieval for these comprehensive inventories. The various realisations of links are then subjected to a typology-oriented examination within their semantic category, the ultimate result of the analysis permitting the structuring of the information into variationist typologies, i.e. typological accounts of linguistic systems and subsystems which apply variationist principles.

Another example illustrating the primacy of semantic criteria in the annotation can be provided by the pattern of variation in the realisation of indefinite pronouns. In order to trace change over time as reflected in the history of indefinite pronouns, elements such as thing, one, body and man have been grouped into a variational pattern marked by property strings and contextual indicators of the following kind:

• anything



• anybody



• anyone



• any man



Again, instead of classifying an item such as anyone into the discrete category of pronoun, the co-ordinates listing the core properties of the two elements separately make their fuzziness transparent, and the annotator is able to refrain from fixing their position on the grammaticalization continuum.

3.3 Distinction between textual, discoursal, sentential, clausal and phrasal levels of analysis

The annotation system applied to the CSC distinguishes between the textual, discoursal, sentential, clausal and phrasal levels of analysis in areas which have been subjected to elaboration. This principle was primarily formulated to create an annotation system which would be sensitive to idiosyncrasies of early epistolary prose. Moreover, since devices such as regularized punctuation and the use of capital letters do not always permit the identification of sentence structure in early letter manuscripts, annotation is required to provide information for the retrieval of data relevant to a particular research question. The solutions in the present annotation system are of course just the first step, and longer-term experimentation by several new users will undoubtedly suggest how to develop the process further.

Since the use of explicit combining devices [5] at the textual, discoursal and sentential levels is a salient characteristic in the CSC letters, the annotation indicates sentence boundaries using a tag-internal comment {ts} 'text-structuring' for the text-level connectives and, but, so and for, and also marks the absence of a link using tag-external comments such as {zero pre} 'no text-level connective linking the preceding context to a simple sentence or the first clause of a complex sentence' and {zero post} 'no text-level connective relating the main clause to the preceding clause'.

This annotation practice can be illustrated as follows:

'zit iff I get zour will I hoip to giwe hir ll [ladyship] contentment in consciens'



{zero that<}


$get/vsjps11<P+{cond}_GET $/vsjps11<P+{cond}_0




{zero post}


$hope/vps11<P+_HOIP $/vps11<P+_0







$content/n{rc}>pr_COnTENT+MEnT $-ment/xs-n{rc}>pr_+MEnT



The adverb yet, a logical connector in the role of concession, functions here as a text-structuring connective, whereas the tag-external comment {zero post} indicates that the second unit in a correlative pair such as if...then, recorded elsewhere in the data, is absent.
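
The tags quoted in this example can be taken apart mechanically. The following is a minimal sketch in Python, not part of the CSC software: it assumes, on the basis of the tags printed in this paper only, that a tag has the shape `$lexel/grammel_FORM`, that the lexel may be empty (as in the tags for inflectional endings) and that the form after the underscore is optional. The names `Tag`, `TAG_RE` and `parse_tag` are illustrative inventions; the CSC Manual (The structure of a tag) remains the authority on the actual format.

```python
import re
from typing import NamedTuple, Optional


class Tag(NamedTuple):
    """One tag split into its parts (hypothetical container, not CSC software)."""
    lexel: str            # lexical element, possibly with a {comment}; may be empty
    grammel: str          # string of grammatical co-ordinates and comments
    form: Optional[str]   # manuscript form as digitized, if present


# Assumed shape: "$" + lexel + "/" + grammel + optional "_" + form.
# The grammel is taken to end at the first underscore.
TAG_RE = re.compile(r"^\$(?P<lexel>[^/]*)/(?P<grammel>[^_\s]*)(?:_(?P<form>\S+))?$")


def parse_tag(tag: str) -> Tag:
    """Split a single tag string into lexel, grammel and form."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"not a recognisable tag: {tag!r}")
    return Tag(match.group("lexel"), match.group("grammel"), match.group("form"))


# Tags copied from the example above.
examples = [
    "$get/vsjps11<P+{cond}_GET",
    "$/vsjps11<P+{cond}_0",
    "$hope/vps11<P+_HOIP",
    "$content/n{rc}>pr_COnTENT+MEnT",
    "$-ment/xs-n{rc}>pr_+MEnT",
]
parsed = [parse_tag(t) for t in examples]
```

Keeping the grammel as an unanalysed string at this stage is deliberate: how its internal co-ordinates should be read is a matter for the annotation manual rather than for a simple splitter.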

A set of connectives may be members of the same semantically-defined inventory, but their membership in a syntactically-defined one is not as straightforward as the traditional categorization into conjunctions and adverbs would suggest (see Meurman-Solin 2004: 188-192 on the distinctive features of such subordinators of cause as since, as, because and for and connective adverbs such as therefore at the level of text and information structure). The assessment of for in particular requires a careful analysis of the data (Rissanen 1989, Rissanen 1999, Kohnen 2007, Lenker 2007). In this context, the theoretically significant aspect is that, when the qualifications of these items for membership in a particular category at the levels of text and information structure are assessed, it becomes clear that they are not members of the same patterns of variation (Meurman-Solin 2004). In the CSC, for is frequently used as a text-structuring connective, i.e. at the same level as and, but and so. For examples, see the CSC Manual: Connectives. As regards information structure, topic-forming connectives are marked as such by the comment {tf} (see also Meurman-Solin & Pahta 2006 on the connectives seeing and considering introducing clauses containing given information).

An obvious case for distinguishing between clausal and sentential levels is the variational pattern of relative markers in the CSC data, there being relatives which introduce clauses as postnominal modifiers as well as those functioning as independent constituents at the sentence level, the latter of which lack a syntactically-defined antecedent (Meurman-Solin 2007). In addition, there are relatives whose function should be defined by criteria related to discourse and text structure. For example, such relative compounds as wherefore, whereupon and whereat have been attested in text-structuring functions, in which they signal the appearance of a communicative act of a particular kind, most frequently a request or a wish, or function as cohesive elements in narratives. The discoursal level of analysis focuses on links marking sequences of communicative acts in a letter, whereas the textual level of analysis draws the user's attention to devices used to structure the letter as a whole. Thus, while the relative phrase until which (time) often introduces a polite letter-closing formula, wherefore may be used as a text-structuring element.

3.4 The discourse basis of categorization, with fuzziness and polyfunctionality as inherent features in both synchrony and diachrony

The interactive tagging, based on the application of software created by Keith Williamson, Institute for Historical Dialectology, University of Edinburgh, permits the annotator to analyse and interpret each occurrence of an item in context. In epistolary prose, the discourse function of linguistic features reflects the communicative setting of letter-writing closely, and numerous choices of expression can be related to conventionalized language use in general, and to politeness strategies in particular. For example, the frequent use of independent (i.e. unattached to a main clause) non-finite clauses with the first-person subject left implicit is a distinctive feature of epistolary prose, requiring a tailored tagging practice. The core constituent in these non-finite clauses is usually a present participle of an optative verb or a verb of volition. [6]

In their important article on the discourse basis for lexical categories, now republished in the OUP reader Fuzzy Grammar, Hopper & Thompson (2004: 248) discuss the integration of 'the notional side of categories with their pragmatic function in language use'. While they accept the broad correlation that, for example, 'certain prototypical percepts of thing-like entities will be coded in a grammatical form identifiable as N' (ibid.: 249), they set out 'to show that semantic congruence is actually rooted in predictable pragmatic (discourse) functions.' Moreover, in their view, even though semantic features which are used to assign 'concrete, stable things' (such as visibility) to Ns and 'kinetic, effective actions' (such as movement) to Vs are relevant, these features 'do not seem to be adequate for assigning a given form to its lexical class' (ibid.: 251). This is because '[p]rototypicality in linguistic categories depends not only on independently verifiable semantic properties, but also — and perhaps more crucially — on linguistic function in discourse.' This study also summarises the previous research on the notion of basic categories, pointing out that the claim that the two nuclear categories are N and V, and even that these can be considered universal, is generally accepted. [7]

In the CSC system, the string of co-ordinates in the grammel provides the user with information about both the lexical class and the discourse function of the tagged item; it may also contain comments which permit semantic specification or disambiguation. However, this information does not permit straightforward categorization of the item; instead, the tag demarcates a 'variational space' within which an item can be examined in a valid and relevant way. The concepts of 'potential for category membership' and 'variational space' play a key role in the creation of comprehensive inventories for the study of continuous variation and change. See also the CSC Manual: Principles of tagging.

Discourse-based tagging can be illustrated with the following three examples: Right honorable as a term of address is tagged $right/av, $honour/aj-n{ho}-voc and $-able/xs-aj-n{ho}-voc, in which {ho} comments on the honorific function, 'voc' being an abbreviation of vocative. The two core category names connected with a hyphen in the tag (here aj-n) indicate the nominal use of an adjective.

Similarly, in the variational pattern of conform to and conformand / conforming to ($conform/aj-pr>pr, $to/pr<aj-pr and $conform/vpsp-aj-pr>pr, and $to/pr<vpsp-aj-pr), the core property co-ordinates aj-pr and vpsp-aj-pr function as co-ordinates for the tracing of the historical continuum of these items and other prepositions originating from verb forms.

Multi-unit connectives such as as soon as provide links between the correlative pair $as/av>cj and $as{time}/cj<av, with soon being assigned to the category of adverbs occurring in a connective function by the core properties av-cj in the grammel. The varying semantic roles of connectives representing this type (cf. as / so long as, as much as, as far as, etc.) are specified by a comment in the lexel (e.g. in as long as, as{cond} is distinguished from as{time}). See also the CSC Manual: The structure of a tag.
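
The way a grammel lists several co-ordinates rather than one category label can be made concrete with a short sketch. The Python below is an illustration under stated assumptions, not the CSC retrieval software: it assumes that comments appear in curly brackets, that hyphens separate the properties, and that everything from the first `<` or `>` onwards is a contextual indicator. The names `Grammel` and `split_grammel` are hypothetical.

```python
import re
from typing import List, NamedTuple


class Grammel(NamedTuple):
    """A grammel split into parts (hypothetical structure for illustration)."""
    properties: List[str]  # hyphen-separated co-ordinates, e.g. ['aj', 'n', 'voc']
    comments: List[str]    # contents of {...} comments, e.g. ['ho']
    context: str           # assumed contextual indicator, e.g. '>pr'; '' if none


COMMENT_RE = re.compile(r"\{([^{}]*)\}")


def split_grammel(grammel: str) -> Grammel:
    """Separate comments, property co-ordinates and contextual indicator."""
    comments = COMMENT_RE.findall(grammel)
    bare = COMMENT_RE.sub("", grammel)          # grammel without its comments
    pointer = re.search(r"[<>]", bare)          # first contextual pointer, if any
    if pointer:
        core, context = bare[:pointer.start()], bare[pointer.start():]
    else:
        core, context = bare, ""
    properties = [p for p in core.split("-") if p]
    return Grammel(properties, comments, context)


# Grammels taken from the three examples above.
honorific = split_grammel("aj-n{ho}-voc")
conforming = split_grammel("vpsp-aj-pr>pr")
first_as = split_grammel("av>cj")
```

With the properties held as a list, an item such as conforming to can be retrieved both with the verb forms (vpsp) and with the prepositions (pr), which is the point of not assigning it to a single category.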

The tagging principles have been influenced by the discussion of notional or conceptual properties, with elaborated tags making the interrelatedness of the members of a particular notional category explicit (Anderson 1997, Jackendoff 2002). Even though the conventional categorization of items into parts of speech is applied to the tags, they also succeed in indicating fuzziness and polyfunctionality by referring to a number of co-ordinates on a cline rather than insisting on membership in a single category. Scalar concepts such as 'nouniness' and 'adverbhood' reflect this approach. I would like to mention that one of my personal research interests is the examination of the discourse basis of nouniness, enabling me to relate the discussion of how discourse analysis meets typology to parameters in Lehmann 1988. These are relevant in depicting the continuum from maximal elaboration to maximal compression in clause linkage, nominalization — resulting from what Lehmann calls desententialization — being closer to the compressed end of the cline (see below and the CSC Manual: Principles of tagging and Nominalization).

Even though the 2007 version of the CSC only permits the study of developments in epistolary prose up to 1730 (for other electronic data sources for Scots, consult the CSC Manual: Digitized data for Scots), the tagging system has been designed to enable the tracing of developments up to Present-Day Scottish English. The present research interests of the compiler/tagger, namely the reconstruction of historical continua in areas such as clause-combining devices, pronominal reference systems (relatives and demonstratives in particular) and nominalization, are reflected in the tagging, with these features being subjected to a particularly great degree of elaboration. The tracing of continua requires that the source of an item or collocate continues to be transparent over time, even though a later grammaticalization or reanalysis, for example, would permit recategorization. Thus, anyway is tagged as follows:

• anyway

{zero pr}



In addition to the tag-external comment on the absence of a preposition (i.e. {zero pr}), which permits the user to differentiate between prepositional and non-prepositional adverbials, the constitutive elements are made transparent in all uses of this item, even in contexts in which it is used as a discourse particle.

Similarly, even though there is evidence of the grammaticalization of a number of well-established connectives which incorporate nouns in Scots by the sixteenth century, these have been given tags that indicate their development over time. This decision makes it possible for the users of the database to create inventories of all clause-combining devices which incorporate nouns, irrespective of the degree of grammaticalization or lexicalization they may reflect in the various contexts in which they have been attested. In fact, it may sometimes be impossible to define exactly where a particular instance should be positioned on the historical continuum. Thus, the well-established subordinator because, which usually occurs in the CSC in univerbated form and without the complementizer that, is tagged as follows:

• because (that)



{zero that<}

The grammel pr-cj, attached to the preposition by, relates this element to all prepositions that introduce a nominalization of some kind, in this instance the nominalization of a causal process, the comment {rc} standing for 'reduced clause'. The adverbial function of by cause is also indicated, while the third core property records the use of the phrase as a connective, this time as a subordinator without the complementizer that (-c and {zero that<}); 'emb' indicates embedding, and 'post' the sentence-final position of the subordinate clause.

The same nominalization also occurs with a prepositional complement:

• because of




The degree of lexicalization does not affect the practice of providing co-ordinates for tracing processes of change from source to attested usage. For example, the units of compound nouns (see the CSC Manual: Affixation and compounding) are tagged separately even if the combination has undergone considerable semantic change:

• gentlemen


$man/npl-k<aj_+MEN ~

• wellfare


$fare/n{rc}-k<av_+FAIR [8]

To sum up this section, the CSC annotation praxis makes fuzziness and polyfunctionality explicit by referring to co-ordinates on a cline. As a result, there is no need to insist on membership of a single category for a given word. The order of these co-ordinates is carefully controlled, and the hierarchy between different types of information is transparent. Core properties such as form, word-class and function precede components providing contextual information. On the one hand, the co-ordinates permit the positioning of a particular feature in variational space, while on the other they trace developments over time. In other words, no tag is merely an interpretation of a particular occurrence in a particular context, but provides information on all the different stages of development of a given item, faithfully reflecting historical continua. In this system, ambiguity is no longer a problem, because ambiguity, or fuzziness, has been dispelled by providing all the co-ordinates necessary to reflect variation and change over time. In order to trace developments that take place over a long time-span, such as grammaticalization processes, it is necessary to indicate the various stages, beginning with the origin, listing properties perceived in the analysis of pragmatic inferences, identifying examples which provide evidence of an ongoing process of grammaticalization, and, in some clearcut cases, stating the grammaticalized use.

The examples given above are intended to illustrate that, in the majority of cases, categorial fuzziness can be dispelled by describing the properties of a feature with reference to its origin as well as its context of occurrence, positioning it on a cline. It is suggested that this practice will allow the tagger to depict the data faithfully, without resorting to the kind of streamlining which is unavoidable in traditional tagging and parsing methods. Since tags should be as theory-neutral as possible, and the role of each individual tagger as an interpreter of linguistic data should remain as low-profile as possible, the use of descriptive clines of properties is recommended to permit the creation of more valid inventories of features than more compartmentalizing tagging practices (see also Wallis in this volume).

If ambiguity cannot be made transparent by tagging, tag-external comments such as {ambiguity>} and {syntactic merger>} have been added. These alert the user that they should re-examine the problematic structure which follows the comment.

For a more detailed discussion of fuzziness, see the CSC Manual: Principles of tagging.

For more examples, see the CSC Manual: Practices of tagging.

3.5 Inclusion of zero-realisations in variational paradigms

In the CSC annotation system, zero-realisations are included in patterns of variation (see the CSC Manual: Principles of tagging). In other words, a zero-realisation is one of the attested variants on the cline from zero to reduced or elliptical to 'full' variants, this cline being construed on the basis of corpus evidence. The term zero-realisation is assumed to be more appropriate than the terms commonly used in grammars which specify the deleted feature (e.g. zero complementizer and that-deletion). In the CSC, a zero-realisation is indicated when a variational pattern comprising both explicitly expressed variants and those left implicit has been attested in the data.

Zero-realisations are primarily indicated using comments in curly brackets (cf. the discussion of grammels introduced by a zero at the end of this section). Ellipted items are indicated using comments such as {zero v}, {zero aux}, {zero that}, and {zero S}. The following example contains three zero-realisations, the first two being the relative pronoun and the verb; the clause type is a verbless relative structure, an alternative to the finite clause 'who are next (to) my sovereign':

'all vtheris nixt my souerane'


$other/pnpl>R_VTHER+IS $/plpn>R_+IS

{zero rel}


{zero v}


{zero pr<aj-pr}



In the above example, the comment {zero rel} is followed by a detailed description of the zero relative. Grammels of these relatives are always preceded by 0.

The third instance of a zero-realisation in this example illustrates the practice of emphasising attested variational patterns in tags; the tag indicates that the prepositional type next to has also been recorded in the CSC. Rather than ask the users of the database to design searches based on a list of lexels reflecting variation of this kind, the combination of information in the grammel and the comment permits them to create a full inventory of prepositional and non-prepositional items.
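As a rough illustration of how such comments could feed an inventory, the following Python sketch collects the {zero ...} comments from a tagged line. It rests on assumptions not stated in this article: that tags begin with $ and contain no internal whitespace, and that tag-external comments are delimited by curly brackets. The function name is hypothetical, and the sample line merely strings together the tag and comments that appear in the example above.

```python
import re
from collections import Counter

# Assumption: a token is either a tag ("$..." with no internal whitespace)
# or a tag-external comment in curly brackets, which may contain spaces.
TOKEN = re.compile(r"\$\S+|\{[^{}]*\}")

def zero_comments(annotated: str) -> list[str]:
    """Return the text of every tag-external {zero ...} comment, in order."""
    return [tok[1:-1] for tok in TOKEN.findall(annotated)
            if tok.startswith("{zero ")]

# Illustrative line built from the tag and comments shown in the example.
sample = "$other/pnpl>R_VTHER+IS $/plpn>R_+IS {zero rel} {zero v} {zero pr<aj-pr}"

# Tally the comment types, e.g. as a first step towards an inventory.
counts = Counter(zero_comments(sample))
print(counts)
```

Braces inside a tag (such as {ho} or {rc}) are not picked up, because the whole tag is consumed as one token before the comment pattern is tried.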

Similarly, the variation between so that and that — either in the semantic role of result or that of purpose — is indicated in the following way:

{zero av>cj}


{zero av>cj}


The combination of the comment {zero av>cj} and the elements cj<av in the grammel permits the retrieval of all connectives sharing these components.

The recorded variants of the verb please, which mostly occurs at the beginning of a letter, suggest that instead of analyzing it as an impersonal verb, the following tagging practice provides the most efficient tool for data retrieval:

'plesit the saming yt'

{zero formal S}

$please/vsjpt13<S-_PLES+IT $/vsjpt13<S-_+IT


$same/pn<T_SAmING ~

{zero vi<S}



The omission of the predicate verb (usually the verb to know) from a postposed nominal infinitive clause with a nominal that-clause object is quite frequent in the CSC, the variant being it please NP [the addressee] to know / to be informed that ... .

The following example contains a directive:

'get as exact information & knowledge of euery thing as you can possible'

{zero pre}














{zero vi}

$possible/av_POSSIBLE $-ly/xs-av_0

Directives of this kind are frequently related to the preceding context using adverbs such as so or therefore, or the connective and, which in some idiolects may introduce virtually every new proposition. The absence of a link of any kind is here indicated by the comment {zero pre}.

Correlative pairs of connectives (e.g. although ... yet, as ... so, since ... therefore) are considered to be variants representing a high degree of explicitness on the continuum which depicts the clause-combining system in the CSC data, while the opposite end of the cline is defined as the absence of explicit links:

'althought deutie and obligation did not ingadge me yett gratitude oblidges me to lay hold on euerie ocasion'












$oblige{cause}{lat}/vps13<n+_OBLIDG+ES $/vps13<n+_+ES









The absence of yet in this example would be marked with the comment {zero post}.

A limited set of linguistic features has been selected for an experiment in which, in addition to the comment indicating zero-realisation, a separate lexel + grammel introduced by zero (e.g. 0RO for the zero-realisation of a relative pronoun as object) is provided. In participial relative clauses, for instance, the absence of a subject relative pronoun is indicated as follows:

'the lordis of my souerane lordis and maisteris counsale chosin in parliament'



$lord/npl>pr_LORD+IS $/pln>pr_+IS




$lord/nG{ho}_LORD+IS $/Gn{ho}_+IS


$master/nG{ho}_MAISTER+IS $/Gn{ho}_+IS


{zero rel}


{zero aux}

$choose/venpp{pass}{rel}_CHOS+IN $/venpp{pass}{rel}_+IN



Another device for indicating zero-realisation is the use of a zero only in the slot provided for variants attested in the text, i.e. the slot following the sign _. For example, in accordance with the so-called Northern Subject-Verb Rule, uninflected verb forms in certain contexts have a grammel of the following kind:

'my frendis know'


$friend/npl_FREND+is $/pln_+is

$know/vps23<npl+_KNOW $/vps23<npl+_0

In this example, the fact that the verb know is uninflected in this particular context (immediately preceded by an NP subject) is signalled by the inclusion of a zero in the slot in which the inflectional morpheme is usually positioned. This instance could be compared with the following:

'and prays you'


{zero S}

$pray/vps11<P-_PRAY+S $/vps11<P-_+S


The verb pray appears in a suffixed form, and has no immediately preceding subject (for more detailed information, see Meurman-Solin 1992).
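The contrast between the two verb tags above can be made concrete with a small Python sketch. It assumes that a tag has the layout $lexel/grammel_FORM, which is how the examples in this section appear to be built, and that the attested-variant slot is whatever follows the last underscore. The names Tag, parse_tag and has_zero_form are hypothetical and are not part of the CSC tools.

```python
import re
from typing import NamedTuple

class Tag(NamedTuple):
    lexel: str    # may be empty, as in the suffix tags beginning "$/"
    grammel: str
    form: str     # the slot following the sign "_"

# Assumption: "$" + lexel + "/" + grammel + "_" + form, with the form slot
# taken to start after the LAST underscore in the tag.
TAG = re.compile(r"^\$(?P<lexel>[^/]*)/(?P<grammel>.*)_(?P<form>[^_]*)$")

def parse_tag(token: str) -> Tag:
    """Split one tag into its three slots; raise if the layout differs."""
    match = TAG.match(token)
    if match is None:
        raise ValueError(f"not in the assumed tag layout: {token!r}")
    return Tag(**match.groupdict())

def has_zero_form(tag: Tag) -> bool:
    """True when the attested-variant slot contains only a zero."""
    return tag.form == "0"

for token in ["$know/vps23<npl+_KNOW", "$/vps23<npl+_0",
              "$pray/vps11<P-_PRAY+S", "$/vps11<P-_+S"]:
    tag = parse_tag(token)
    print(tag, has_zero_form(tag))
```

Only the suffix tag of know reports a zero in the form slot; the suffix tag of prays carries +S instead.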

The morpheme -ly in open-class adverbs is tagged as a suffix (/xs-av), and its absence is interpreted as a zero-realisation.

• exceeding

$exceed/vpsp-aj-av_EXCEED+ING $/vpsp-aj-av_+ING $-ly/xs-vpsp-aj-av_0

• exceedingly

$exceed/vpsp-aj-av_EXCEED+ING+LY $/vpsp-aj-av_+ING+$-ly/xs-vpsp-aj-av_+LY

The tag for the verb form is positioned before those for word-class and function in tags of participial adjectives as adverbs.
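A user wishing to separate suffixless adverbs from those in -ly might proceed along the lines of the following Python sketch. It assumes that the suffix tag always begins with $-ly/ and that a lone 0 after its last underscore marks the zero-realisation, as in the two examples above. The function name is hypothetical.

```python
import re
from typing import Optional

# Assumption: the suffix tag starts with "$-ly/" and its realisation slot
# follows the last "_" of that tag; "0" marks the zero-realisation.
LY_SLOT = re.compile(r"\$-ly/\S*_(\S*)")

def ly_realisation(tag_line: str) -> Optional[str]:
    """Classify the -ly slot of one word's tags, or None if there is none."""
    match = LY_SLOT.search(tag_line)
    if match is None:
        return None
    return "zero" if match.group(1) == "0" else "suffixed"

# The two tag lines given above for exceeding and exceedingly.
exceeding = "$exceed/vpsp-aj-av_EXCEED+ING $/vpsp-aj-av_+ING $-ly/xs-vpsp-aj-av_0"
exceedingly = ("$exceed/vpsp-aj-av_EXCEED+ING+LY "
               "$/vpsp-aj-av_+ING+$-ly/xs-vpsp-aj-av_+LY")

print(ly_realisation(exceeding))
print(ly_realisation(exceedingly))
```

The pattern searches for $-ly/ anywhere in the line, so it also finds the suffix tag when it is joined to the preceding tag, as in the exceedingly example.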

3.6 Variation from elaborated to compressed processing of information

The volume Connectives in the History of English (Lenker & Meurman-Solin 2007) includes a number of studies which illustrate how comprehensive corpus-based inventories permit us to identify continua in diachronic developments that would remain undetected without concepts such as fuzziness, polyfunctionality and reanalysis in our toolkit. A particularly useful framework for an inventory of systems of clause linkage emerged from the typological investigation of the most important aspects of complex sentence formation in the languages of the world by Lehmann (1988). Lehmann defines six parameters, ranging from two maximally-elaborated paratactic clauses with finite verbs and no syntactic embedding at one end, to a single clause containing an embedded non-finite predicate with no complementizer or other element signalling embedding at the other. The possibilities thus range from a pole of 'maximal elaboration' to a pole of 'maximal compression (or condensation)'. The six parameters are as follows: i) hierarchical downgrading of the subordinate clause (from weak parataxis to strong embedding); ii) main clause syntactic level of the subordinate clause (from high sentence to low word); iii) desententialization of the subordinate clause (from weak clause to strong noun); iv) grammaticalization of the main verb (from weak lexical verb to strong grammatical affix); v) the interlacing of two clauses (from weak clauses adjunct to strong clauses overlapping); and vi) the explicitness of the linking (from maximal syndesis to maximal asyndesis).

In my present research, the connectives examined range from text-structuring adverbs and various types of conjunctions to compressed and zero realizations of links (cf. Meurman-Solin 2004).

Variational space of connectives

  • conjunctions (incl. single-word items and conjunctive phrases reflecting varying degrees of grammaticalization and varying degrees of integration and subordination)
  • polyfunctional text-structuring elements (e.g. and, but, for and so)
  • connective adverbials (correlative and non-correlative)
  • relatives as sentence-level connectives
  • non-finite and verbless adverbial clauses
  • desententialization (resulting from the nominalization of adverbial clauses)
  • grammaticalization of superordinate verbs with a subordinative potential (e.g. causative and optative verbs)
  • inverted word order indicating subordination
  • zero-realization
  • non-linguistic text-structuring devices indicating type of connection

Since a number of connective types have been discussed earlier in this paper, I would like to focus here on nominalization and the grammaticalization of the superordinate predicate with causative constructions. Nominalizations can be retrieved using the comment {rc} 'reduced clause', which is attached to the lexical class of an item:

'but ony delayis or ofputtingis'



$delay/npl{rc}-av_DELAY+IS $/pln{rc}-av_+IS



$put/vnpl{rc}-k-av<av_PUT+TING+is $/vnpl{rc}-k-av<av_+TING+is $/plvn{rc}-k-av<av_+is

A preposition with a nominalization as its complement is given the grammel pr-cj to stress the fuzziness between prepositions and conjunctions in this context.
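The retrieval of nominalizations by means of {rc} can be sketched in Python as follows. The sketch assumes that the grammel is the part of a tag between the first slash and the last underscore, and it keeps only tags with a non-empty lexel so that each word is reported once rather than once per suffix tag. Both choices, and the function names, are illustrative assumptions rather than documented CSC practice.

```python
def grammel_of(tag: str) -> str:
    """Return the part between the first "/" and the last "_" (assumed layout)."""
    _, _, body = tag.partition("/")
    return body.rpartition("_")[0]

def reduced_clause_heads(tags: list[str]) -> list[str]:
    """Keep tags whose grammel carries {rc} and whose lexel is not empty."""
    hits = []
    for tag in tags:
        lexel = tag[1:].partition("/")[0]   # text between "$" and "/"
        if lexel and "{rc}" in grammel_of(tag):
            hits.append(tag)
    return hits

# Tags taken from examples in this article; the friend tag has no {rc}.
tags = [
    "$delay/npl{rc}-av_DELAY+IS",
    "$/pln{rc}-av_+IS",
    "$put/vnpl{rc}-k-av<av_PUT+TING+is",
    "$friend/npl_FREND+is",
    "$come/vn{rc}-av>pr>pr_COM+ING",
]
for hit in reduced_clause_heads(tags):
    print(hit)
```

The same filter could be narrowed further, for instance to grammels that also contain -av, to separate adverbial nominalizations from other reduced clauses.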

Lehmann discusses nominalizations as examples of desententialization. He points out that 'the more a verb gets nominalized, the more it starts behaving like an ordinary noun. It is in this sense that we may speak of the increasing nominality (or 'nouniness') of subordinate clauses, when they are reduced by desententialization' (Lehmann 1988: 197; see Figure 3, p. 200):

'at my coming from Ireland to Edinburgh'



$come/vn{rc}-av>pr>pr_COM+ING $/vn{rc}-av>pr>pr_+ING





'your happy delivery of a young Charles'








As regards causative constructions, Lehmann (1988: 201-202) uses the Italian example 'Ho fatto prendere a mio figlio un'altra professione' ('I had my son choose another profession') to illustrate how a verb 'combines directly with a subordinate verb to yield an analytic causative verb'. In the CSC, the following tagging permits the creation of an inventory of varying causative structures which reflect different degrees of grammaticalization:

have + object + bare infinitive

'[which I would] have you compare [diligently with him]'



{zero im}


have + object + present participle

'[your grace would have] had my wife coming [to your grace]'




$come/vpsp-av_COM+ING $/vpsp-av_+ING

see + object + past participle

'[I ask you] to sie it sold'





cause + nominal that-clause

cause + object + bare infinitive

'cause ann receaue it'


{zero that&Oinf}




This last example illustrates a case of ambiguity which can only be resolved by providing the two alternative readings in a comment. Since the verb cause mostly occurs with a to-infinitive complement, in this particular case the tagger has opted for the first reading in providing the grammel for receaue, but of course this decision is not based on detailed research. In investigating infinitives and present subjunctives, the user should also submit to a detailed analysis all occurrences after the comment {zero that&Oinf}.
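A search of this kind could be approximated with the Python sketch below, which returns the items that follow each occurrence of a given comment. Since the full tags of the example are not reproduced here, the sample line is a simplified stand-in in which plain word forms replace the CSC tags; the position of the comment in it, the window size and the function name are all assumptions made for illustration.

```python
import re

# A token is a brace-delimited comment (which may contain spaces) or any
# other run of non-whitespace characters (a tag or a plain word form).
TOKEN = re.compile(r"\{[^{}]*\}|\S+")

def after_comment(annotated: str, comment: str, window: int = 2) -> list[list[str]]:
    """For each occurrence of {comment}, return the next `window` non-comment items."""
    tokens = TOKEN.findall(annotated)
    target = "{" + comment + "}"
    hits = []
    for i, tok in enumerate(tokens):
        if tok == target:
            following = [t for t in tokens[i + 1:]
                         if not (t.startswith("{") and t.endswith("}"))]
            hits.append(following[:window])
    return hits

# Hypothetical stand-in: word forms instead of tags, comment placement assumed.
sample = "cause ann {zero that&Oinf} receaue it"
print(after_comment(sample, "zero that&Oinf"))
```

Each hit is a short context that still has to be examined by hand, since the comment signals that the tagger's reading is only one of two alternatives.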

4. The praxis of variationist annotation

The general assumption in the present praxis is that users of the CSC database are interested in tracing developments which take place over a relatively long time-span, such as grammaticalization processes. Therefore, they will hopefully benefit from an annotation system which indicates the various stages of these developments, starting with the origin, listing properties perceived in the analysis of pragmatic inferences, identifying examples which provide evidence of an ongoing process of grammaticalization, and perhaps even indicating the grammaticalized use.

Co-ordinates on a cline

  • The tags indicate fuzziness and polyfunctionality by referring to co-ordinates on a cline (no need to insist on membership of a single category).
  • The order of these co-ordinates is carefully controlled, and the hierarchy between different types of information is transparent; core properties precede components providing contextual information.
  • The co-ordinates permit the positioning of a particular feature in variational space and the tracing of developments over time.
  • No tag is merely an interpretation of a particular occurrence in a particular context, but provides information about all the different stages, faithfully reflecting historical continua.

It should be stressed that the CSC annotation does not function as a grammatical description; instead, the sole purpose of the praxis adopted here is to facilitate data retrieval for diachronic studies dealing with variation and change in data which does not permit analysis within predefined and easily identifiable units such as sentences. While the praxis indicates links between features which frequently co-occur (e.g. multi-word verbs and their complementation), it does not deal with idiomaticity and related semantic change. Prepositional ditransitive verbs which have developed from object + adverbial collocates (Meurman-Solin 2000b) are tagged by marking the units in these collocates with arrows, leaving the adverbial origin unmarked only in cases in which the function of a prepositional object is the only attested one in the database:

'to putt ane close to ane affair'








The use 'to put a close somewhere' has not been recorded in the CSC, and ane affair is analysed as prepositional object. See also the CSC Manual: The verb phrase and its complementation.

The examples in this article are intended to illustrate that, in the majority of cases, categorial fuzziness can be dispelled by describing the properties of a feature with reference to its origin as well as its context of occurrence, positioning it on a cline. It is suggested that this practice will allow the tagger to depict the data faithfully, without resorting to the kind of streamlining which is unavoidable in traditional tagging and parsing methods. As stressed above, since tags should be as theory-neutral as possible and the role of each individual tagger as an interpreter of linguistic data should remain as low-profile as possible, the descriptive clines of properties are intended to permit the creation of more valid inventories of features than more compartmentalising tagging practices.


[1] Phonological relations can be reconstructed using an index of etymologies of the kind developed for the LAEME.

[2] During the process of transcribing, digitizing and tagging the manuscript originals of Renaissance letters by writers originating from Scotland, it became obvious that the data was quite different from that available in editions. As in numerous editions of other — literary and non-literary — historical documents, editorial principles applied to letters permitted 'normalization', i.e., modernization, of various kinds, the typical areas subjected to tacit editorial interference being the expansion of contracted forms, and the application of modern rules in punctuation and the choice of lower and upper case. The study of a categorially fuzzy and polyfunctional feature such as connectives is however dependent on the authenticity of sentence structure in the data. (Meurman-Solin 2007: 263)

[3] The concept of visual prosody refers to non-linguistic features, such as properties of manuscript layout, paragraph structure, punctuation, particular character shapes, and spacing; these features are marked with comments in the digitized manuscripts.

Non-linguistic features

  • damage
  • number of folio
  • line-break
  • punctuation
  • paragraph structure
  • spacing
  • marked character shape
  • insertion
  • cancellation
  • correction
  • position in margin, before or after the body of the letter
  • idiosyncratic features of a particular hand
  • change of hand

[4] Given that punctuation in historical documents is not sufficiently regularized to allow the reconstruction of clause and sentence structure, these illustrations permit us to draw the conclusion that a thorough understanding of both implicit and explicit connectivity and visual prosody is necessary to learn to identify these structures. In text languages, a careful examination of the visual features of the original texts may provide useful, even indispensable, information for a reliable reading. However, the assessment of their relevance is by no means easy, and converging evidence of various kinds will have to be provided to create valid criteria for such an analysis. (Meurman-Solin 2007: 266)

[5] The explicit marking of semantic relations between sentences and clauses is a strikingly common feature in early letters, zero marking being much less frequent in the historical data examined than in similar registers in Present-day English (Meurman-Solin 2004). As regards the CSC data, the highly elaborate clause-linkage system can be illustrated using the following extract from a late seventeenth-century letter. As illustrated by this extract, the reconstruction of the full variational pattern of clause linkage requires the signalling of the presence or absence of text-level connectives (underlined and in green, absence being indicated by [ZERO]). Text-level adverbial connectives have been highlighted in red, and subordinating connectives in yellow. Zero complementizers have been indicated with the comment [zero].

as for our Couk. that man [zero] you feid \ at Edinburgh you kno befor you went from Chanain \ I told you [zero] I wad give him his live becase his \ conditions wase tou much . and sins I wase in\forming my self about. him, and thay tell \ me [zero] he is a {an unidentified word; correction unclear} illretired Cheild \\ yet I writ to gorge biset [zero] if he wase satisfyd wt \ the 6 pound I wase content to tak him. but \ sinse I cam hear he cam and ofired him self \ and I told him that I had advertisd him befor \ that I thoght his condition tou much, yett \ wase content to give him mor then 6 pound \ if he wad tak fei for all becase his casawalatis \ wad a bein mor then twelve pound. and when \ we give casawalatis pipoll talks of it mor \ then it is wirth. hou ivir he wold not {an inserted word unclear} \ a jot, so I gave him ovir, and I have non at \ all for the preasent, for he that wase \ at scello wase not good. and robin whait \ will not ingage [ZERO] he is but a weik lad. but he \ comse in and douse any thing [zero] we disair, him \ so if you could get Cheif perswadit, I think [zero] \ what he hes learned in inglend wold do \ well anof in scotland, but if he, will not \ I think [zero] it is best [zero] you seik for on therfor betir give a good fei to on that can do well \ then on that can not. if you war thinking \ to stay in the south, I fansi [zero] a woman \ kouk wad do best, becase we wold not \ have brewing or beking or kiling of meat \ and awoman I belive wad fei for les then a \ man. not having that to do, and she wold \ help to dres roums or any other thing \ [zero] war to be doon. but if we stay hear a man \ I think wad do best, but when you fei \ him , give him nathing but fei for \ when thay get other casawalatis \ {f1vb} thay straiv to spend mor then thay wad do otherway{ins}se? \ [ZERO] we gave 6 pound last. and you kno 8 pound wase the \ most that ivir ve gave but you may straiv to get them \ ase cheap ase you can. 
[ZERO] I onse heard m=rs= havircemp say that \ the duchis of latherdeall had awoman Couk that servd hir \ hous . for ase grit a family ase it wase, if it be a man kouk \ and if we stay in the north. I will give him aman and a \ litill booy undir him. and whither man or woman, if we \ stay at Edinburgh I will give them only a lase undir them. if \ I get the undir kouk of my oun Choosing I will pay the \ fei and if thay choos them thay most pay the fei them \ selvs. but I rether choose them my selfs becase I \ will get them cheapest. [ZERO] you may tak yr sistirs adve?is \ in this. [ZERO] that man that you feid at Edinburgh wase serving hir \ onse sa she will kno if he be ill or not, I haue trubiled \ you tou much with this subjek.

(NLS MS Dep 175/70/Bundle 3/1990; Margaret Mackenzie, 1688; see Meurman-Solin 2004: 178-179)

[6] 'beseiking als that the eternal god haif zour grace in keiping'

$beseek{cause}{lat}/vpsp{indep}_*BESEIK+ING $/vpsp{indep}_+ING







$have{n}/vsjps13<cnp+{nom}>pr-cj_HAIF $/vsjps13<cnp+{nom}>pr-cj_0




$keep/vn{rc}-av_KEIP+ING $/vn{rc}-av_+ING <

[7] Hopper & Thompson's view on nounhood (2004: 251) stresses the scalar nature of the concept:

'An apparently prototypical N like fox is not in fact prototypical in all instances of its use. It is this variability in N-like properties over discourse which suggests that any absolute, non-contextual division of the lexicon into 'noun stems' and 'verb stems' will have only a limited validity.'


'We hope to present evidence here that the lexical semantic facts about Ns and Vs are secondary to their DISCOURSE ROLES; and that the semantic facts (perceptibility etc.) which are characteristic features of prototypical Ns and Vs are in fact derivative of (and perhaps even secondary to) their discourse roles. This morpho-syntactic evidence will show that the extent to which prototypical nounhood is achieved is a function of the degree to which the form in question serves to introduce a participant into the discourse.'

[8] The element 'k' in tags is attached to head items in compounds.


