The Research Unit for Variation, Contacts and Change in English (VARIENG) at the Department of English, University of Helsinki, hosted the 27th conference of the International Computer Archive of Modern and Medieval English (ICAME) from the 24th to the 28th of May, 2006. The pre-conference workshop focused on the topic of corpus annotation, the aim being to examine the theory and praxis of various types of annotation from both the tagger's and the user's perspective. In addition to disseminating information about recent developments in the area, the workshop was intended to provide a forum for discussing the problems users have had with annotated data. We also wanted to shed light on the more theoretical concerns related to the principles and practices of linguistic categorization, particularly dealing with categorial fuzziness, and invited Professor David Denison (University of Manchester) to give a keynote speech on 'Playing tag with category boundaries'.
One of the key areas discussed at the workshop was semantic-pragmatic tagging, the focus of, among others, Dawn Archer's paper on annotating historical texts. For semantic-pragmatic tagging, see the UCREL Semantic Analysis System (USAS) (http://ucrel.lancs.ac.uk/usas/), Archer & Culpeper 2003, Archer et al. forthcoming, Culpeper & Kytö 2000 and Löfberg et al. 2004.
Given the very limited time available for the workshop, only six papers could be included in the programme. Therefore, in addition to four of the papers presented in Helsinki, six new contributions on corpus annotation were solicited for the present publication. This volume on Annotating variation and change is the first in the e-publication series Studies in Variation, Contacts and Change in English, launched by VARIENG in 2007. The series is part of the e-VARIENG project, which provides a new platform not only for the publication of digital resources and virtual teaching materials, but also for disseminating information on international cooperation and achievements in the field of corpus linguistics.
VARIENG has been involved in the production of numerous corpora, the pioneering work having focused on the compilation of the Helsinki Corpus of English Texts (HC) and the Helsinki Corpus of Older Scots (HCOS). More recently, the Corpus of Early English Correspondence Sampler (CEECS), Middle English Medical Texts on CD-ROM, and the web-based Corpus of Scottish Correspondence (CSC) have been made available internationally. The Helsinki Corpus of British English Dialects (HD) came out in 2006. The tagging and parsing of the HC resulted in the York-Toronto-Helsinki Parsed Corpus of Old English Prose, the Penn-Helsinki Parsed Corpus of Middle English, and the Penn-Helsinki Parsed Corpus of Early Modern English. The editors of this volume have also had experience of corpus annotation: Arja Nurmi worked on the Parsed Corpus of Early English Correspondence (PCEEC) in collaboration with Ann Taylor, University of York, using POS-tagging and parsing based on the Penn Treebank, and Anneli Meurman-Solin worked on the Corpus of Scottish Correspondence (CSC), using software created by Keith Williamson, University of Edinburgh.
Until recently, corpora were usually compiled first and then tagged and parsed in a separate project at a later stage. Nowadays, there are a relatively large number of corpora in which annotation is integrated into corpus creation. The rationale is that the theoretical and methodological approach reflected in the compilation principles and practices is then also applied coherently and consistently to the annotation of the texts. Such corpora include A Linguistic Atlas of Early Middle English (LAEME) and A Linguistic Atlas of Older Scots (LAOS); see Beal et al., Meurman-Solin and Stenroos in this volume.
The prevailing trend is to enhance comparability between corpora by applying internationally recognized standards during annotation. The prioritizing of standardized practices can certainly be justified now that researchers tend to use a wide range of corpora when examining a particular linguistic feature in a particular study.
However, flexible and interactive annotation systems also have quite obvious advantages, such as permitting the researcher to revise and/or elaborate the tags in order to achieve an ideal fit between data retrieval and a particular research question. The interactive dimension of annotation is, of course, only possible if software for revision and elaboration is made available as part of the package; for examples of this, see the LAEME and LAOS projects mentioned above. Ideally, corpus annotation would include both a fairly standardized basic annotation of various levels of language and a mechanism which allowed elaboration by the end user.
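The combination described above, a standardized basic annotation plus a mechanism for elaboration by the end user, can be illustrated with a minimal sketch. It is not a description of the LAEME or LAOS software; the class name, the tag labels (MODAL, NOUN) and the "modality" layer are all assumptions invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """One corpus token with a fixed base tag and open-ended user layers."""
    form: str
    base_tag: str  # standardized tag (hypothetical label), never overwritten
    user_tags: dict = field(default_factory=dict)  # researcher's elaborations

    def elaborate(self, layer, value):
        """Add or revise a user-defined annotation layer for this token."""
        self.user_tags[layer] = value


def retrieve(tokens, base_tag=None, **user_criteria):
    """Return tokens matching the base tag and all user-layer criteria."""
    hits = []
    for token in tokens:
        if base_tag is not None and token.base_tag != base_tag:
            continue
        if all(token.user_tags.get(k) == v for k, v in user_criteria.items()):
            hits.append(token)
    return hits


# Invented example data: two modals and a noun.
tokens = [Token("may", "MODAL"), Token("must", "MODAL"), Token("letter", "NOUN")]
tokens[0].elaborate("modality", "epistemic")
tokens[1].elaborate("modality", "deontic")

all_modals = retrieve(tokens, base_tag="MODAL")
deontic_only = retrieve(tokens, base_tag="MODAL", modality="deontic")
```

Because the base tag is stored separately from the user layers, retrieval by the standardized tag remains comparable across corpora, while the elaborated layer can be revised to fit a particular research question.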
Annotation systems have also been developed which can be applied across languages (e.g. Pilz et al. 2007).
While linguistic annotation is still primarily based on lexico-grammatical tagging, semantic, pragmatic and text-linguistic information can also be inserted into tags or provided in tag-external comments. Both types of comment are typically bracketed in a particular way to distinguish the information they provide from lexico-grammatical annotation and to allow them to be used in data retrieval. In manuscript-based corpora, bracketed comments can also be used to indicate features of visual prosody, but they could probably be used for annotating the typographical features of present-day texts as well.
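How bracketing keeps comments distinct from lexico-grammatical tags while leaving them available for retrieval can be sketched as follows. The notation is purely hypothetical and does not reproduce any of the annotation schemes discussed in this volume: tokens are written as word_TAG, and comments as {type: text} in curly brackets.

```python
import re

# Hypothetical notation, invented for this sketch:
#   word_TAG        lexico-grammatical tag attached with an underscore
#   {type: text}    tag-external comment in curly brackets
LINE = "I_PRON pray_VERB {pragm: request} you_PRON send_VERB {visual: line break} word_NOUN"

COMMENT = re.compile(r"\{(\w+):\s*([^}]*)\}")


def split_layers(line):
    """Separate bracketed comments from the tagged tokens of one line."""
    comments = [(m.group(1), m.group(2).strip()) for m in COMMENT.finditer(line)]
    # Remove the comments, then split each remaining token at its last underscore.
    stripped = COMMENT.sub(" ", line)
    tokens = [tuple(item.rsplit("_", 1)) for item in stripped.split()]
    return tokens, comments


def comments_of_type(line, comment_type):
    """Retrieve only the comments of one type, e.g. pragmatic ones."""
    _, comments = split_layers(line)
    return [text for kind, text in comments if kind == comment_type]


tokens, comments = split_layers(LINE)
pragmatic = comments_of_type(LINE, "pragm")
```

The same bracket convention serves both purposes named above: the comments can be stripped so that lexico-grammatical searches are unaffected, or they can themselves be the target of a search.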
Text-initial parameters which define the values for a range of language-external variables and a highly structured system of classification according to these criteria are no longer the only option for providing information for the non-linguistic annotation of texts. Several recently compiled corpora offer detailed language-external information in a resource that is kept separate from the database of texts. Such resources are searchable, define the parameter values for each text in the corpus, and permit the texts to be grouped according to a selected variable or set of variables. The fact that these auxiliary files are kept separate means that they can be continuously revised and expanded as new knowledge about the texts is made available.
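A separate, searchable resource of this kind can be sketched in a few lines. The text identifiers, variable names and values below are invented for illustration and do not come from any of the corpora mentioned here.

```python
from collections import defaultdict

# Hypothetical metadata resource, stored apart from the corpus texts.
METADATA = {
    "text001": {"period": "1540-1570", "genre": "letter", "writer_sex": "female"},
    "text002": {"period": "1540-1570", "genre": "letter", "writer_sex": "male"},
    "text003": {"period": "1600-1630", "genre": "letter", "writer_sex": "male"},
}


def group_texts(metadata, *variables):
    """Group text identifiers by the values of one or more variables."""
    groups = defaultdict(list)
    for text_id, values in sorted(metadata.items()):
        key = tuple(values.get(variable) for variable in variables)
        groups[key].append(text_id)
    return dict(groups)


by_period = group_texts(METADATA, "period")
by_period_and_sex = group_texts(METADATA, "period", "writer_sex")

# New knowledge about a text is recorded here; the text file itself is untouched.
METADATA["text003"]["writer_rank"] = "gentry"
```

Since the texts are referred to only by identifier, the parameter values can be revised or extended without re-editing the corpus files, which is the advantage of keeping the auxiliary files separate.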
The ten papers in this volume provide a mixture of annotators' and end-users' points of view, illuminating new ways of benefiting from annotated corpora as well as pinpointing some of the problems of existing annotation schemes. Ideas for developing new types of annotation are also discussed.
The volume begins with a study by David Denison, who, through five case studies of on-going change in Present-day English, investigates the way in which various part-of-speech tagging schemes tackle (or fail to tackle) the problems of words which switch word class category. He illustrates his points using five different corpora, and discusses the problems of integrating linguistic indeterminacy into a set list of tags. Denison also suggests that it may be possible to catch linguistic change in the annotation process.
This point raised by Denison leads us to Sean Wallis, who provides an experienced corpus annotator's view, and develops ideas on how the elaboration and refinement of annotation schemes depend on the methodological position adopted by the annotator. Wallis further proposes that annotated corpora can become essential tools and targets of experimental linguistics, if we allow for the fact that the results of our research will inform the amendment of the annotation schemes used.
A further study of the elaboration of annotation, particularly in the context of language change, is provided by Anneli Meurman-Solin, who explores the various problems related to the creation and annotation of the Corpus of Scottish Correspondence. The scheme employed for this corpus involves the annotation of many different levels of language, from phrasal to textual. This paper tackles head-on the problems of fuzziness and polyfunctionality combined with change in progress.
Continuing the discussion of the challenges of annotating historical materials, Merja Stenroos discusses the special questions raised by the Middle English Grammar project, ranging from how to interpret letter shapes in manuscripts to how to define spelling units and headwords for the corpus. The inclusion of relevant extralinguistic information in a suitable format is also central to achieving the aims of the project.
The problems of dealing with language varieties which reflect a high degree of variation are equally evident when compiling and annotating a corpus of non-standard spoken English, as Joan Beal, Karen Corrigan, Nicholas Smith and Paul Rayson show in their exploration of the creation and annotation of the Newcastle Electronic Corpus of Tyneside English. Both the transcription of dialectal forms and the part-of-speech tagging of this particular variety presented the team with many challenges.
Also working with spoken language, albeit that of the past, is Magnus Huber, in his project to create a sociolinguistic corpus of Late Modern spoken English from the Old Bailey Proceedings. Some of the problems are similar to those investigated by Stenroos and Beal et al., in that the process needs to include information on such issues as methods of shorthand and the publication process in order to assess the closeness of the corpus material to spoken language. Furthermore, a layer of socio-pragmatic annotation must be included. As the nature of written representations of spoken language is always problematic, Huber also tests the degree of spokenness of his material and the consistency of scribal practices by comparing the numbers of contracted forms across scribes and printers.
Continuing in the field of socio-pragmatic and pragmatic annotation, Minna Palander-Collin, an experienced corpus compiler herself, approaches the question of annotation from an end-user's point of view, discussing which variables are desirable to have coded in a corpus in order to make it easier to systematically study issues such as the interactional features of language and their social embedding in the context of speaker/writer and hearer/reader. She illustrates her points through discussion of two research questions.
In the case of a corpus as extensive as the British National Corpus, questions concerning annotation also exist on a larger scale. Ylva Berglund describes the changes and improvements made when converting the BNC from SGML to XML. The size of the corpus is only one of the challenges facing annotators; the different levels of annotation included (text provenance and reader/speaker and audience information as well as part-of-speech tagging) and the inclusion of spoken as well as written language each presented their own problems, not just in annotation, but also in the creation of retrieval tools.
There is still a great deal of work to be done in the search for ways of annotating more complex linguistic features. A type of wish list is presented by Arja Nurmi, who looks at the ways in which complex linguistic features could be retrieved and elaborated on through annotation. Her example case is the semantic category of modality, and she discusses both the helpfulness of the available annotation schemes for retrieving examples from corpora and the requirements of an elaborated semantic scheme for the annotation of modal expressions.
In their discussion of the annotation of another complex linguistic feature, Anna Feldman and Katya Arshavskaya explore the ways in which temporal categorisation can be applied to Russian. In the pursuit of cross-linguistic relevance, they begin with a descriptive model designed for English and compare their results with those for English. Their long-term aim is to develop an algorithm which will allow clauses to be recognised according to aspect.
We would like to thank Ms Tuuli Tahko, Ms Tanja Säily, Ms Tuula Chezek, Mr Mikko Alapuro and Mr Turo Vartiainen for their work as HTML editors. Tanja Säily also designed the cascading stylesheets for the series. Mr Jukka Tyrkkö worked as a coordinator in the launching process for the series.
Corpus of Early English Correspondence Sampler (CEECS)
Corpus of Scottish Correspondence (CSC)
Helsinki Corpus of British English Dialects (HD)
Helsinki Corpus of English Texts (HC)
Helsinki Corpus of Older Scots (HCOS)
Linguistic Atlas of Early Middle English (LAEME)
Linguistic Atlas of Older Scots (LAOS)
Middle English Medical Texts, compiled by Irma Taavitsainen, Päivi Pahta and Martti Mäkinen. University of Helsinki 2005. CD-ROM. http://www.helsinki.fi/varieng/CoRD/corpora/CEEM/
Parsed Corpus of Early English Correspondence (PCEEC)
Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME)
Penn-Helsinki Parsed Corpus of Middle English (PPCME)
York-Toronto-Helsinki Parsed Corpus of Old English Prose
Archer, Dawn, Jonathan Culpeper & Mathew Davies. forthcoming. "Text-linguistics preprocessing". Corpus Linguistics: An International Handbook, ed. by Anke Lüdeling, Merja Kytö & Anthony McEnery. Berlin: Mouton de Gruyter.
Archer, Dawn & Jonathan Culpeper. 2003. "Sociopragmatic annotation: New directions and possibilities in historical corpus linguistics". Corpus Linguistics by the Lune: Studies in honour of Geoffrey Leech, ed. by Andrew Wilson, Paul Rayson & Anthony McEnery, 37-58. Frankfurt: Peter Lang.
Culpeper, Jonathan & Merja Kytö. 2000. "Data in historical pragmatics: Spoken interaction (re)cast as writing". Journal of Historical Pragmatics 1 (2): 175-199.
Löfberg, Laura, Jukka-Pekka Juntunen, Asko Nykänen, Krista Varantola, Paul Rayson & Dawn Archer. 2004. "Using a semantic tagger as dictionary search tool". Proceedings of the 11th EURALEX (European Association for Lexicography) International Congress, vol. 1, ed. by Geoffrey Williams & Sandra Vessier, 127-134. Lorient: Faculté des Lettres et des Sciences Humaines, Université de Bretagne Sud.
Pilz, Thomas, Andrea Ernst-Gerlach, Sebastian Kempken, Paul Rayson & Dawn Archer. 2007. "The identification of spelling variants in English and German historical texts: Manual or automatic?" Literary and Linguistic Computing. http://dx.doi.org/10.1093/llc/fqm044