3.2.1 Introduction

Electronic databases have been improving continuously as regards their quantitative and qualitative validity and relevance. This general remark applies to corpora representing present-day language varieties as well as those comprising historical texts. Their size, balance and representativeness have improved, authenticity has become a widely discussed issue, and there is at least some consensus that language-external variables should be defined by drawing on information provided by interdisciplinary research. Nevertheless, it seems to me that compromises have sometimes been made in the annotation of corpora. For example, a tagging system may rely on pre-corpus linguistics grammatical descriptions, resort to automatic (i.e., non-interactive) tagging, or impose neat category labels on linguistic features (cf. Meurman-Solin 2007, http://www.helsinki.fi/varieng/journal/index.html).

The compilation principles and practices applied to the Corpus of Scottish Correspondence (CSC) aim to bring as much philological rigour to annotating the linguistic and non-linguistic features of the texts as to ensuring the authenticity of the data, which is achieved by including exclusively diplomatically transcribed digital manuscript sources. An endeavour of this kind was motivated by the following observations: there is always a risk that tagging may have a streamlining effect on complex data, possibly even distorting evidence by applying overly rigid rules in categorization and thereby ignoring the inherent fuzziness of categories; it may also simplify complex patterns of variation or fail to reflect processes of change over a long time-span. These problems may be particularly challenging in the tagging of corpora for the study of those regional and local varieties of English which differ significantly from previously described standardized varieties. One of these is Scots, an internally quite heterogeneous variety of English.

The obvious reason for resorting to automatic or semi-automatic tagging is that it makes the process less time-consuming for the compiler/tagger. Even if the categorial ambiguity of particular linguistic features has been perceived, an excessively streamlined tagging system may not provide any instruments for dealing with such ambiguity. Problems of this kind are often left unsolved because the tagger does not want to create a categorization system which is unduly complex, or, as we often put it, not sufficiently user-friendly. At present, the most widespread practice in the tagging of corpus data is for the annotation to be based on pre-corpus linguistics grammatical descriptions. In other words, the category labels we use in data retrieval are based on non-corpus-based grammatical descriptions, or on standard reference works which draw on a limited set of data, or even on a corpus which is not representative of the language variety being tagged. In an approach of this kind, compromises are probably unavoidable.

Nonetheless, by ignoring the inherent fuzziness of categories in our tagging, we may lose a lot of important information. For example, we may simplify complex patterns of variation and fail to trace processes of change over a long time-span, especially those which reflect category change. I have become acutely aware of this problem in diachronic studies of grammaticalization and lexicalization processes.

The general approach in the tagging of the CSC is virtually the same as that adopted in the databases created for the Linguistic Atlas of Early Middle English (LAEME) by Margaret Laing (Laing 1993, 2002, 2004; Laing and Williamson 2004), and for the Linguistic Atlas of Older Scots (LAOS) by Keith Williamson (Williamson 1992/93, 2000, 2001, 2004, 2005) at the Institute for Historical Dialectology (IHD), University of Edinburgh. A contract signed by the universities of Edinburgh and Helsinki has permitted the use of software developed by Williamson in the tagging process. The general theoretical assumptions on which the tagging is based are also shared by these three databases.

Due to a somewhat different range of research questions, triggered by the genre-specific properties of epistolary prose, the data in the CSC has been subjected to more elaborate tagging than the system applied to the Edinburgh Corpus of Older Scots for the LAOS and to the texts in the LAEME. The collaborating compilers of the three databases see the application of alternative tagging systems as highly appropriate. The fact that the software created by Williamson has been applied in somewhat different ways in the three databases illustrates how flexible and multi-dimensional it is. Such flexibility is a great advantage to new users, and the three alternatives demonstrate how each user can interactively revise a particular set of tags to search for data which is relevant to the research question at hand. The differences in application also demonstrate the principle that tagging should be sensitive to the idiosyncratic language use in each database. To achieve such sensitivity, a thorough knowledge of the language variety or varieties being tagged is required. In our projects we have benefited from the fact that the roles of transcriber, digitizer and tagger were combined.

I would like to remind users of the CSC database that the chief aim of the lexico-grammatical annotation of the data is to enable maximally reliable data retrieval. Thus, the annotation does not draw on thorough research on the linguistic features and systems and therefore will always have to be considered as merely a tool for the creation of relevant inventories.

This section will examine the principles of tagging, while Section 3.3 Practices of tagging aims to make the tagger's grammar transparent by illustrating the feature-specific practices. I will start by discussing the factors which motivated me to make various degrees of elaboration available in the system of tagging that is applied to the present version of the CSC (Section 3.2.2). I will then examine a number of focus areas in my theoretical and methodological approach (Section 3.2.3), including a discussion of how ambiguity, fuzziness and polyfunctionality have been dealt with, and finally describe the general structure of tags, introducing the principles which guide the choice of lexels and the formulation of grammels (Section 3.2.4).