3.2.2 The rationale of elaborating tags

The decision to apply a system of elaborated tags to the CSC data is based both on knowledge of the language-external factors that are reflected in the types of data the CSC comprises (see also sections 1.3 Dimensions of space, time and social milieu, 1.4 Letters as a data source and 2.2 Selection of data for the CSC) and on the compiler/tagger's interest in systematically detecting patterns of variation and change.

The CSC provides data for the reconstruction of a wide range of idiolectal and local grammars of which no previous description is available. In addition to the conditioning imposed by time, space and social milieu, the generally high degree of variation in correspondence is further increased by significant differences in the informants' linguistic and stylistic competence. Considering that many informants can be categorized as inexperienced or poorly trained writers (Meurman-Solin 2001, Meurman-Solin and Nurmi 2004), it may be more appropriate to refer to variation in their linguistic and stylistic literacy. Data extracted from original manuscripts provide evidence of hesitation as regards spelling; this feature is particularly salient in the case of less frequent loanwords, but in many idiolects even a highly frequent lexical or grammatical item can have a number of variants used interchangeably even in the same letter. These letters also include other evidence of low literacy levels, such as numerous cancellations and insertions, redundancy in the system of anaphoric reference, and consecutive syntactic mergers with incompatible syntactic structures. These features reflect less-experienced writers' struggles not only with polite epistolary formulae, but also with the written medium in general. These writers also quite frequently use spellings which may be based on pronunciation (Meurman-Solin 1999, 2000, 2001, 2005).

In manuscript data, which is naturally free from standardizing trends adopted by early printers or later editors, the great degree of variation is a major challenge, calling for a rigorous philological approach in the reconstruction of patterns of variation and change. In principle, each idiolect will have to be examined on its own, without assuming that there will be similarities or, for that matter, particular differences between members of a particular speech, discourse or text community (Meurman-Solin 2004c; see also Section 1.3 Dimensions of space, time and social milieu).

Numerous women writers use their writing skills as members of speech communities. Sixteenth-century Scottish women, for instance, 'mainly used their writing skills for writing letters to their relatives, and, somewhat later, for keeping accounts and summarizing the daily events in their personal diaries. In this case, language use can be assumed to be essentially conditioned by the restricted social functions of writing' (Meurman-Solin 2001: 16). (For information on Scottish women's literacy and education, see Marshall 1983 and Houston 1985.) The higher ranks of the gentry may have had regular correspondence with their peers, government officials and the court, and these social networks created channels for the spread of features specific to epistolary discourse. Professional writers such as lawyers lived in a world constructed by shared texts and conventionalized practices. Categories like these, despite their crudeness, may suggest that it is possible to detect some uniformity in the choice of linguistic features between members of the same community type in a particular geographical area at a particular time. Yet the main corpus-based findings of my earlier research have confirmed that no straightforward correlation between language-external variables and linguistic preferences can be assumed; instead, a very sensitive tool is required to trace patterns from the micro- to the macro-level, with time and space (with distance integrated in the latter) as the primary variables and others such as community type and gender as secondary.

In assessing the quality of data in order to choose the annotation praxis, a particularly useful diagnostic tool has been provided by the various pilot studies in my work towards creating variationist typologies of connectives (Meurman-Solin 2002, 2004a, 2004b). In contrast to research based on, for example, edited letters that have been normalized by introducing modern punctuation, descriptive work drawing on manuscript data has shown that there are significant differences in how thoughts are processed in terms of sentences and clauses in different time periods. Since punctuation in historical documents is not sufficiently regularized to allow the reconstruction of clause and sentence structure, a thorough understanding of both implicit and explicit connectivity and visual prosody is necessary to enable us to assess these structures.

While I drew essentially on the system of basic tagging created at the IHD, these aspects of the CSC data motivated me to apply an elaborated annotation system which aims to provide data retrieval tools that facilitate discoursal and textual as well as morpho-syntactic analyses. The chief motivation for this elaboration was to provide sophisticated instruments with which to identify all the potential members of patterns of variation, describe their structural and contextual features as accurately as possible, discover the shape of a particular linguistic system or subsystem by drawing on such data, and finally interpret that shape with reference to linguistic and extralinguistic factors. It is also suggested that a methodological approach of this kind will permit the formulation of a theory of variationist typology.

The rationale for elaborating tags is that the property strings selected will permit the user to trace variation and change in data covering a long time-span and representing geographically extensive and socially complex areas. The varying realisations of a particular linguistic feature attested in the data can be grouped together using software tailored to search for a particular component or a set of components in the tags. I believe that, by using semantic criteria rather than those defined by modern syntactic theory, it is possible to identify all the members of a particular pattern of variation. Elaborated tags make the interrelatedness of the members of a particular notional category explicit. However, they do not provide conclusive evidence for categorization or sophisticated syntactic analysis. Instead, the properties indicated in the tags can be selected, and a system of strings of structural and contextual properties can be constructed in order to perform precisely defined searches.
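
The component-based retrieval described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the software or the tag syntax actually used with the CSC: the Tag class, the component labels (pron, rel, generic) and the sample entries with modernized forms are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    """A simplified, hypothetical tag: a lexel, a set of property components, a form."""
    lexel: str
    components: frozenset
    form: str


def search(tags, required):
    """Return the tags whose components include every component in `required`."""
    wanted = frozenset(required)
    return [tag for tag in tags if wanted <= tag.components]


# Hypothetical sample data; the labels are illustrative, not CSC notation,
# and the forms are modernized placeholders rather than attested spellings.
SAMPLE = [
    Tag("who", frozenset({"pron", "rel", "generic"}), "who"),
    Tag("whoever", frozenset({"pron", "rel", "generic"}), "whoever"),
    Tag("who", frozenset({"pron", "rel"}), "who"),
    Tag("that", frozenset({"rel"}), "that"),
]

# Group the realisations that share one semantic component.
generic_items = search(SAMPLE, {"generic"})

# Narrow the search by requiring a set of components at once.
relative_pronouns = search(SAMPLE, {"rel", "pron"})
```

Searching on a single component gathers every item that shares a notional feature, whatever its structure; adding further components narrows the result to a precisely defined subset.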

The choice of features subject to elaboration in the CSC reflects the present tagger's theoretical and methodological thinking, as well as the research questions she has considered particularly intriguing in her own work during the process of compiling and annotating the corpus. For example, my earlier research on syntax has suggested such focus areas for elaboration as structural relations, polyfunctionality and categorial fuzziness. However, each user can define the appropriate degree of elaboration by ignoring details that are not relevant to a particular study, or, alternatively, design a search list by selecting only the relevant tags from the comprehensive inventory of tags that appear in the corpus. Users will also be able to revise or refine the tags in various ways to suit their particular study. In this sense, flexibility is an integral feature of the CSC system. In fact, since the system of basic tagging in the CSC is largely the same as in the databases created for the two above-mentioned linguistic atlases, it is possible to use only this basic information in data retrieval. In other words, the user can select components ranging from basic to elaborated information according to the search type they consider most appropriate for a particular research question. The fact that the elaborated tagging is optional and that there is potential for users to interact with the system will hopefully ensure that the CSC is flexible enough to be used for various types of research. Section 3.3 Practices of tagging illustrates the tagging practices by type of linguistic feature, permitting the user to examine how information about core properties specified in accordance with principles of simple tagging can be enriched by providing more detailed information in either the lexel or the grammel of a tag.

The following example illustrates this focus on patterns of variation and change. The inventory of items that share the notional feature of generic animate reference consists of variants such as the following:

  • who (that)
  • who(so)ever
  • any man/person who
  • a person who
  • he/they who [with generic reference]

It is useful to relate items such as these to each other by adding a semantic component to the tags. Let us consider the following example, extracted from the Helsinki Corpus of Older Scots (HCOS):

'Herfor quha Þat has nocht luf & frende he has nathing'
(HCOS 1490 Porteous Noblenes, 178)

In this example, 'who that' is used with generic reference, the proverbial text addressing any person who lives without love and friends. WHO and THAT are tagged separately, the splitting of elements being necessary because of the wide range of variant collocational patterns. Generic reference is signalled by the semantic comment '{+h0}'. Relations between tags are signalled explicitly to permit searching by pairs of tags: in addition to the link between the pronoun WHO and the relative element THAT, there is a link between WHO and HE.

The zero is positioned in the slot for number in the comment in order to indicate generic reference. It should be noted, however, that a zero in this position is multifunctional, since it can also refer to an indefinite antecedent or a collective. This decision is justified in the CSC data, where generic reference is an infrequent feature compared with early instructive or legal texts in the HCOS.
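
The two retrieval ideas in this example, searching by pairs of linked tags and reading the zero in the number slot, can be sketched as follows. The sketch rests on assumptions that are not CSC notation: that the number slot is the last character inside the braces of the comment, that the comment is attached to WHO, and that links can be stored as pairs of token numbers. All function and variable names are hypothetical.

```python
# Readings a zero in the number slot may signal, as described in the text.
ZERO_READINGS = ("generic reference", "indefinite antecedent", "collective")


def number_slot(comment):
    """Return the number slot, assumed here to be the last character inside the braces."""
    if not (comment.startswith("{") and comment.endswith("}")) or len(comment) < 3:
        raise ValueError(f"not a braced comment: {comment!r}")
    return comment[-2]


def possible_readings(comment):
    """A zero is multifunctional, so return every reading it may signal."""
    return ZERO_READINGS if number_slot(comment) == "0" else ()


# Tokens of the Porteous example: token number -> (lexel, comment or None).
# Attaching the comment to WHO is an assumption made for this sketch.
TOKENS = {1: ("who", "{+h0}"), 2: ("that", None), 3: ("he", None)}

# Explicit links between tags: WHO-THAT and WHO-HE.
LINKS = [(1, 2), (1, 3)]


def linked_pairs(tokens, links, first, second):
    """Return the links that join a `first` lexel to a `second` lexel."""
    return [(a, b) for a, b in links
            if tokens[a][0] == first and tokens[b][0] == second]


who_he = linked_pairs(TOKENS, LINKS, "who", "he")
readings = possible_readings(TOKENS[1][1])
```

Because the zero is multifunctional, a search of this kind returns candidates for generic reference rather than confirmed instances; the context of each hit still has to be examined.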

Economy of tags is an important issue even in elaborated tagging. This can be illustrated by the present tagger's conceptualization of particular properties as default features of a particular linguistic structure. For example, the positioning of the antecedent adjacent to the relative pronoun in relative constructions has been considered a default feature, while non-adjacency is explicitly indicated in the grammel (see Section Comments within tags).