2.1 Protean corpora: multidimensionality, flexibility, and transparency


Protean, A.
adj. a. Of or pertaining to Proteus; like that of Proteus; hence, taking or existing in various shapes, variable in form; characterized by variability or variation; variously manifested or expressed; changing, varying.
proteanism n.,
capacity for change; changeableness, variability.

The Corpus of Scottish Correspondence (CSC) was created not only within the research environment of the Research Unit for Variation, Contacts and Change (VARIENG), University of Helsinki, but also in close cooperation with the Institute for Historical Dialectology (IHD) at the University of Edinburgh, where there is long-standing expertise in the creation of linguistic atlases of text languages. The tagged databases produced as part of the Edinburgh-Helsinki collaboration represent a new genre of electronic corpora. Firstly, they are Protean in the sense that they can be continually revised and expanded. Secondly, building on the basic format of a particular corpus, the data can be manipulated into a virtually unrestricted number of structures, "research shapes", in order to achieve the best possible validity and relevance for specific, user-defined investigations. In other words, a corpus re-shaped for a particular study may contain only those parts of the base corpus which the user considers appropriate for dealing with a particular research question. In fact, unevenness as regards the validity and relevance of data is an inherent quality of most electronic databases, even though this may not be explicitly indicated or carefully explained in the manuals. For example, some parts of a database may be shown to be more valid than others in terms of language-external criteria, and the user should perhaps exclude texts which may weaken scholarly argumentation in a particular study (on the assessment of representativeness, see Meurman-Solin 2001).

Thirdly, proteanism is reflected in the tagging system tailored for the CSC (see Section 3.2 Principles of tagging and Section 3.3 Practices of tagging). Since information in the tags is hierarchically ordered, it is possible to decide what degree of specification or refinement is required for a particular search. The user can also rearrange this information or re-tag the linguistic features under investigation by using the software provided on this website (Software). Thus, the tags allow refinement and enrichment, so that, even though the basic information in the tag is structural and semantic, the additional information provided about contextual properties and the commentary (see Section 3.4 Commentary) also permit the user to search for syntactic and textual features.
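The idea of choosing the degree of specification for a search can be illustrated with a minimal sketch. It uses the three grammels that appear in the tagged example in Section 2.1.2; the function name `select_by_prefix` and the reading of a shorter prefix as a less specific query are assumptions made for illustration, not a description of the software provided with the CSC.

```python
from typing import Iterable, List

# Grammels copied from the tagged example in Section 2.1.2.
# Treating a shorter prefix as a "coarser" query is an assumption of this sketch.
GRAMMELS = [
    "vpsp{indep}",
    "vsjps13<cnp+{nom}>pr-cj",
    "vn{rc}-av",
]


def select_by_prefix(grammels: Iterable[str], prefix: str) -> List[str]:
    """Return the grammels that begin with the given prefix."""
    return [g for g in grammels if g.startswith(prefix)]


# A coarse query retrieves every grammel that starts with "v" ...
coarse = select_by_prefix(GRAMMELS, "v")
# ... while a more specific prefix narrows the inventory.
finer = select_by_prefix(GRAMMELS, "vpsp")

print(len(coarse), finer)
```

The same data thus yield a broad or a narrow inventory, depending on how much of the tag the user chooses to specify.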

This section will discuss how such key properties of electronic corpora as flexibility, multi-dimensionality and transparency are reflected in the digitization and tagging principles and practices of the CSC (for a detailed description of the tagging system, see Part III).

In my view, a good balance between the type of research question asked and the type of data retrieved can be created by compiling transparent, flexible and multi-dimensional corpora. Transparency in a corpus allows the user to assess carefully and critically the validity and relevance of each text with regard to specific user-defined linguistic investigations. Flexibility in a corpus allows the user to manipulate the data into a specific form in order to achieve the best possible fit between the data and the theoretical and methodological approach. Multi-dimensionality in a corpus allows the user to restructure the data by re-creating an appropriate frame of reference based on how language-external variables have been conceptualized and defined.

2.1.1 A corpus and its various forms

As stated above, Protean databases can be reshaped and restructured by the user to achieve the best possible validity and relevance for the study of a specific research topic. While the earlier-generation corpora can perhaps be seen as carefully structured end-products of compiler-defined corpus compilation projects, Protean corpora such as the CSC are databases of digitized texts which, in addition to the basic format, can also exist in various user-defined forms. As Keith Williamson (p.c.) has put it, we would like to put over the image of a corpus as a set of texts in a relative, multi-dimensional universe.

According to this approach, the basic format of the corpus remains separate from the various “research forms”. However, even the basic format may undergo changes as part of the ongoing process of development, being revised and expanded at regular intervals. The Protean character of a database of this kind draws on a kind of flexibility which is achieved as follows: it is possible to take a copy of all or part of the basic format and alter the tagging itself, or add to the tagging information at any level from word-morpheme upwards. Flexibility of this kind is an important asset in the study of syntax in particular, and indispensable in diachronic studies. It is also possible to add information that relates to extra-linguistic factors and analyse, define or group these factors in a different way. We may want to create a form which comprises strong witnesses only, with the degree of strength assessed by how precisely a language-external variable, or a set of them - features related to textual history or those related to informants, for instance - can be defined (see Section 1.1 Reconstruction of text languages and Section 1.3 Dimensions of space, time and social milieu).
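The separation between the basic format and a research form can be sketched as follows. This is a minimal illustration under stated assumptions: the class `Text`, the function `make_research_form`, the text identifiers and the variable names are all hypothetical, and "strong witness" is approximated here simply as a text for which every required language-external variable has a value.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass
class Text:
    text_id: str                         # hypothetical identifier
    variables: Dict[str, Optional[str]]  # language-external variables (hypothetical keys)
    tags: List[str] = field(default_factory=list)


def make_research_form(base: Iterable[Text], required: Iterable[str]) -> List[Text]:
    """Deep-copy those texts for which every required variable is defined."""
    required = list(required)
    return [
        copy.deepcopy(t)
        for t in base
        if all(t.variables.get(name) for name in required)
    ]


# Invented base corpus; the tags are taken from the example in Section 2.1.2.
base = [
    Text("t1", {"place": "known", "date": "known"}, ["$keep/vn{rc}-av_KEIP+ING"]),
    Text("t2", {"place": None, "date": "known"}, ["$have{n}/vsjps13<cnp+{nom}>pr-cj_HAIF"]),
]

form = make_research_form(base, ["place", "date"])

# The user may add to the tagging or to the extra-linguistic information
# in the research form; the basic format is not affected.
form[0].tags.append("{user comment}")
form[0].variables["social_rank"] = "user-defined value"

print([t.text_id for t in form], base[0].tags)
```

Because the research form is a deep copy, any re-tagging or regrouping of variables remains local to the study for which the form was created.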

In principle, reshaping of this kind is also possible with many of the existing corpora, but in their cases the reshaping is restricted primarily to the selection and classification of texts according to language-external variables. For example, the user may wish to research academic prose written by women in the age-range of 40 to 60 in the state of California, and he or she may then proceed by extracting a sub-corpus out of existing electronic corpora. However, proteanism is more deeply integrated in multi-dimensional corpora, as all aspects of these may be reshaped and redefined; perhaps the most important difference is that in creating a new form, new knowledge, whether related to texts, informants, the tagger's grammar or language-external variables, can be immediately keyed in. We also consider it useful that the user can autonomously define what he or she considers to be relevant knowledge. Thus, compilers provide users with as much information as is available to them, but each corpus user can then critically examine the implications that information has for the specific study at hand. As I have argued elsewhere (Meurman-Solin 2001), in my view, over-structuring corpora by using hypothetical knowledge in the definition of language-external variables is not useful. Each user should formulate the definitions in full accordance with his or her theoretical and methodological approach. In general, the creation of a separate – non-integrated – database containing information about the texts is welcome, as the availability of more detailed information about texts allows the application of a less compartmentalized and more scalar way of conceptualizing language-external variables in corpora (see Section 1.3 Dimensions of space, time and social milieu).
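The sub-corpus example mentioned above, combined with the idea of a separate database of information about the texts, might be sketched as follows. All identifiers and values are invented for illustration; the point is only that the definition of a variable (here, an age band) is supplied by the user as a predicate rather than being fixed by the compiler.

```python
from typing import Any, Callable, Dict, List

Metadata = Dict[str, Any]

# Separate, non-integrated table of information about texts, keyed by text id.
# Every id and value below is invented for this sketch.
METADATA: Dict[str, Metadata] = {
    "a01": {"genre": "academic prose", "sex": "f", "age": 45, "state": "California"},
    "a02": {"genre": "academic prose", "sex": "f", "age": 33, "state": "California"},
    "a03": {"genre": "fiction", "sex": "f", "age": 52, "state": "California"},
    "a04": {"genre": "academic prose", "sex": "m", "age": 58, "state": "Oregon"},
}


def select(metadata: Dict[str, Metadata],
           predicate: Callable[[Metadata], bool]) -> List[str]:
    """Return the ids of the texts whose metadata satisfy the user's predicate."""
    return sorted(tid for tid, m in metadata.items() if predicate(m))


def academic_women_40_60_california(m: Metadata) -> bool:
    """User-defined selection corresponding to the example in the text."""
    return (
        m["genre"] == "academic prose"
        and m["sex"] == "f"
        and 40 <= m["age"] <= 60
        and m["state"] == "California"
    )


subcorpus = select(METADATA, academic_women_40_60_california)

# Because age is stored as a scalar value, another user can draw the
# boundary elsewhere without any change to the stored information.
over_fifty = select(METADATA, lambda m: m["age"] >= 50)

print(subcorpus, over_fifty)
```

Storing scalar values rather than pre-assigned category labels leaves the categorization itself to each user.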

I see the compilation of Protean corpora as an ongoing process, with dynamic interaction and critical reanalysis between the stages of compiling, experimenting, revising, restructuring and expanding. The most important requirement in creating a multi-dimensional corpus is that the principles regulating the structure, as well as those guiding compilation, digitization and annotation practices, are as transparent as possible. Transparency is enhanced, firstly, by introducing some degree of hierarchical ordering into how language-external variables have been conceptualized. We have decided to consider time and space to be central, since the creation of a frame of reference consisting of diachronically- and diatopically-anchored texts is essential for the identification of valid information that can be used to position other texts in the linguistic and extralinguistic worlds reconstructed in the corpus. Ideally, in addition to being localizable and having a particular date, anchor texts will be spread relatively evenly over time and space and have a relatively fixed social and communicative function; in addition, it will be possible to reconstruct the profiles of the writers, whether members of discourse communities such as professional coalitions or individuals whose language is only available in private documents, on the basis of reliable, preferably direct, evidence.

In the Edinburgh Institute for Historical Dialectology, the notion of “primary witness” is used where the LALME refers to anchor texts. A primary witness is a text that can be localized on the basis of prima facie extra-linguistic evidence of an association with a place and given date – where there is (for the time being) no contradictory evidence. That is, any anchor text or primary witness is a working hypothesis. A “secondary witness” is a text which is localized linguistically, as it lacks any or sufficient extra-linguistic indications of its provenance and/or date (Williamson 2000, 2001, Laing and Williamson 2004).
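The distinction between primary and secondary witnesses, and the status of each as a working hypothesis, can be sketched as a simple rule. The class `Evidence`, its field names and the third label `"unlocalized"` are assumptions introduced for this illustration; they are not categories used by the Institute for Historical Dialectology.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Evidence:
    place: Optional[str] = None            # extra-linguistic indication of place
    date: Optional[str] = None             # extra-linguistic indication of date
    contradicted: bool = False             # contradictory evidence has come to light
    linguistically_localized: bool = False


def witness_status(e: Evidence) -> str:
    """Classify a text on the basis of the evidence currently available."""
    if e.place and e.date and not e.contradicted:
        return "primary"
    if e.linguistically_localized:
        return "secondary"
    return "unlocalized"  # hypothetical label for texts that are neither


# Invented example: a text with prima facie evidence of place and date.
letter = Evidence(place="place-A", date="date-A")
before = witness_status(letter)

# New contradictory evidence changes the status; the text can then only
# be localized linguistically, if at all.
revised = replace(letter, contradicted=True, linguistically_localized=True)
after = witness_status(revised)

print(before, after)
```

The status is recomputed from the evidence rather than stored, which reflects the provisional character of any localization.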

The notion of witnesses is useful, as it allows the possibility of new evidence that may alter the strength of the case for localizing a text and, indeed, the shape of the corpus – the pattern in which texts float in their multi-dimensional space cluster. The idea is that texts will be positioned in the multidimensional space of a corpus world according to a set of coordinates that have been defined either in binary terms, i.e., in terms of a dichotomy, or, if possible, through use of a scalar system. In principle, a text is thus not a permanent member of a specific group or category, filling a slot in the compiler's schema. A text is floating in the corpus space and can be fixed for the purpose of a specific study by showing that there is a valid relation between the text, the language-external variables defining it, and the research question. The user of the corpus may see the various dimensions in terms of a hierarchical system, finding some of them particularly relevant and others marginal or not valid for a specific research hypothesis. The user may also see some dimensions as more closely interrelated than others; he or she may claim that some binary variables are independent, while others form a network in which the conditioning effect of one is dependent on the converging effect of another.

Thus, I would like to suggest that the fourth generation of corpora will combine three important properties. Firstly, we define language-external variables rigorously, benefiting from information provided by various interdisciplinary forums. Secondly, we see corpora as consisting of sub-corpora that are defined not in terms of time periods, for instance, but in reference to degrees of validity and relevance as regards their usefulness for the study of a specific research question. Thirdly, instead of marketing corpora as completed products, we see the compilation as an ongoing process, and therefore view expansion and revision as inherent characteristics of this work.

I see these three properties as interrelated. Our understanding of the complex nature of language-external variables has increased, so that we are more aware of their scalar nature, for instance, and, as a result, find some of the traditional category labels less useful, sometimes even misleading. While some variables can be defined quite precisely, others are still based on hypotheses or knowledge which draws on as yet only partially-reconstructed stages of social, cultural and economic history. Research questions can be examined using data with varying degrees of relevance, depending on how thorough our knowledge is of specific language-external factors.

2.1.2 Transparency of the theoretical and methodological approach

In addition to the careful assessment of whether the basic form of a corpus provides relevant data for the study of a specific topic, an assessment of the validity of the corpus is also necessary in order to ensure that there is no theoretical and/or methodological contradiction between the approaches of the corpus compiler and the corpus user. It is perhaps not altogether unjustifiable to ask whether methods developed by modern sociolinguistics, dialectology or discourse stylistics, for instance, can be applied to data that has not been compiled with the theoretical framework of these fields of study in mind. Perhaps the best way to illustrate what I mean is to refer to the example of a corpus that has been structured to rigorously reflect recent theoretical and methodological developments in historical sociolinguistics. I consider the Corpus of Early English Correspondence to be such a corpus (Nevalainen and Raumolin-Brunberg 1996, 2003).

In my own corpus-linguistic work, I ask what theoretical and methodological implications different text annotation systems might have for inventories of particular linguistic features when applied to digital databases. How is our ability to understand linguistic systems affected by the use of quasi-automatic taggers and parsers, which may re-establish and redistribute conventionalized ways of understanding, analysing and categorizing linguistic data? Ideally, tags should guide us towards the reassessment of our criteria for linguistic categorization, rather than provide data as categorized by criteria based on preconceived properties of linguistic features (for information on flexibility and transparency in the CSC tagging system, see Section 3.2 Principles of tagging and Section 3.3 Practices of tagging).

The elaborated tagging system in the CSC aims to be agnostic with respect to schools of modern formal syntactic theory. Attempts to revise the guidelines for philological computing have been motivated by the following observations: while electronic databases have constantly improved as regards their quantitative and qualitative validity and relevance, compromises have sometimes been made in tagging by relying on pre-corpus-linguistic descriptions, resorting to automatic (i.e., non-interactive) tagging, or imposing neat category labels on the data. The main principle in the CSC tagging system is that as little linguistic theory should be integrated into a tagged corpus as possible. In other words, the tagging in the base corpus should, as far as is possible, remain neutral with respect to formal theories, particularly those of syntax, as tags reflecting assumed syntactic properties will inevitably suggest membership in a preconceived grammatical system. Meurman-Solin (2004a: 187) summarizes the principles for tagging connectives in the CSC as follows:

this tagging system aims at indicating item-specific or collocate-specific structural features which have been interpreted as having semantic potential to indicate relations between clauses, irrespective of degree of grammaticalization. The rationale for not providing information about syntactic properties is that these are interpreted as secondary, while structural and semantic properties are considered primary. In other words, the core function of structural and semantic information is descriptive – descriptive at the micro-level – while that of syntactic information is interpretative – interpretative at the macro-level, i.e., intended to identify grammatical rules and constraints. The description provided by the tags may contain information on various levels of language use, including discoursal and textual features.

In cases in which it has not been possible to avoid theory-specific practices, these must be made as transparent as possible by providing detailed information about the tagging principles and practices. The main function of the tags is that they permit the creation of comprehensive inventories which are valid for the study of variation and change. Thus, they ensure reliable data searches, rather than supporting a particular grammatical analysis.

The principle of transparency is also applied to the way in which ambiguous instances have been tagged. As discussed in more detail in Section 3.2 Principles of tagging and Section 3.4 Commentary, categorial fuzziness and polyfunctionality are dealt with by using a cline of co-ordinates reflecting the different readings in the grammel of a tag. Thus, instances of 'any man' can be integrated into the inventory of indefinite pronouns by positioning the term on the cline of nouniness and pronounhood using the co-ordinates 'n-pn':



Ambiguity can also be indicated by a comment which makes the alternative readings explicit:

$beseek{cause}{lat}/vpsp{indep}_*BESEIK+ING $/vpsp{indep}_+ING


{zero that&Oinf}





$have{n}/vsjps13<cnp+{nom}>pr-cj_HAIF $/vsjps13<cnp+{nom}>pr-cj_0




$keep/vn{rc}-av_KEIP+ING $/vn{rc}-av_+ING

The comment {zero that&Oinf} specifies the two possible readings of the complement of the verb beseek. The alternative consisting of a nominal that-clause object with that-deletion has been randomly chosen as the one presented first. This order is reflected in how the rest of the clause elements have been tagged, i.e. the predicate verb of the proposed that-clause is analysed as a present subjunctive. The other alternative is the reading of the complement as a Latinate object + bare infinitive construction. Choosing just one of these alternatives would mean imposing a particular grammatical analysis and ignoring the other. See also Section 3.4 Commentary.
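For readers who wish to process such lines themselves, the following minimal sketch reads the tagged items and the comment shown above. The pattern `$.../grammel_FORM` is inferred from the example only; the label `lexel` for the part before the slash, the function names, and the assumption that `&` separates the alternative readings in a comment are all introduced for this illustration. For the authoritative description of the format, see Part III.

```python
import re
from typing import List, NamedTuple


class Tag(NamedTuple):
    lexel: str    # part before the slash (label assumed for this sketch)
    grammel: str  # part between the slash and the underscore
    form: str     # part after the underscore


# Pattern inferred from the example lines only.
TAG_RE = re.compile(r"\$(?P<lexel>[^/\s]*)/(?P<grammel>[^_\s]+)_(?P<form>\S+)")
COMMENT_RE = re.compile(r"^\{(?P<body>[^{}]*)\}$")


def parse_tags(line: str) -> List[Tag]:
    """Return every tagged item found in one line."""
    return [Tag(m["lexel"], m["grammel"], m["form"]) for m in TAG_RE.finditer(line)]


def parse_comment(line: str) -> List[str]:
    """Split a comment line into its alternative readings (assumes '&' as separator)."""
    m = COMMENT_RE.match(line.strip())
    return m["body"].split("&") if m else []


# Lines copied from the example above.
tags = parse_tags("$beseek{cause}{lat}/vpsp{indep}_*BESEIK+ING $/vpsp{indep}_+ING")
readings = parse_comment("{zero that&Oinf}")

for tag in tags:
    print(tag)
print(readings)
```

Keeping both readings as a list, rather than selecting one, mirrors the principle that the tagging should not impose a single grammatical analysis.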

Attributes such as Protean, multi-dimensional, flexible and transparent usefully remind us of the risks of objectifying language varieties in the compilation of corpora in the way that the dictionary industry often does (Benson 2001: 21). I have discussed these risks elsewhere (Meurman-Solin 2004b), so I will just summarize some of my comments here. As also pointed out in Section 1.1 Reconstruction of text languages, there is a tendency to objectify or reify regional varieties, assuming that they form relatively homogeneous, even relatively self-contained, entities or systems; to historicize them by emphasising socio-political rather than linguistic factors and by presenting these factors as legitimizing the naming and describing of regional varieties in a certain way; and to create hierarchies, analysing a regional variety chiefly in reference to a standardized variety or adopting the comparative method in examining less prestigious varieties (Milroy 1999). See also Williamson 2004.

These tendencies may regulate processes of analysis by which linguists are trying to identify some order in heterogeneity, i.e., some relatively consistently preferred practices in data that otherwise chiefly give evidence of heterogeneity and continued variation. Attempts to demarcate areas as the territories of specific varieties may divert our attention from the examination of ordered heterogeneity which can be observed only by crossing such artificial boundaries. By defining a text community in terms of which written texts verifiably had a social and communicative function among the literate members of that community, and by using such a representative compilation of texts as data (cf. Section 1.3.4), it is possible to deobjectify and dehistoricize a language variety. To refer to Scots as an example, a particularly important consequence of de-reification in the description of this language is that variation and variety resulting from contact between varieties and languages on Scottish soil will be given due attention.


