Annotation and methods

Matti Rissanen and Jukka Tyrkkö
Research Unit for Variation, Contacts and Change in English (VARIENG), University of Helsinki

The papers included in the section focusing on corpus annotation introduce and discuss the applications of annotation and innovative annotation schemes, with special emphasis on answering the question of how annotation can improve the results of corpus-based research. General principles of annotation are discussed in detail and new programs are introduced. Hot-topic issues raised in several papers include the usefulness of spelling standardization and the challenges of applying tools developed for Present-day English to historical varieties.

Dawn Archer’s contribution, “Corpus annotation – A welcome Addition or an Interpretation Too Far?”, emphasizes the new openings and enrichment that sophisticated annotation systems offer to corpus studies, while pointing out that dissenting voices have also been heard. Arguing that corpus annotation is, despite its challenges, a “welcome addition”, Archer reminds us that a corpus linguist ought to be aware of both the production histories of the texts they study and the historical, social and political histories of the time periods from which they come. The article also discusses the characteristics and features of the most recent annotation systems developed in Lancaster, such as the Variant Detector VARD and the Historical Semantic Tagger. (Two short videos from Archer’s Conference plenary are included.)

Stefan Diemer’s extensive article “Orthographic annotation of Middle English Corpora” introduces a new encoding system for annotating Middle English corpora with manuscript information. Diemer’s annotation system pays attention to orthographic variation, scribal variation and manuscript properties, such as material, line spacing, decoration, etc. The author discusses the shortcomings of the current TEI P5 standard for the purposes of spelling research, and illustrates his system in detail by describing the Wycliffe Spelling Corpus (WSC) project at Technical University Berlin.

Erwin Komen’s article “Coreferenced corpora for information structure research” introduces new methods of annotating coreferential information in XML. He presents two computer programs: “Cesax”, for the semi-automatic annotation of coreferential relations, and “CorpusStudio”, which helps in retrieving results from texts annotated in this way. Both programs are freely available from the author. Komen illustrates the programs with examples of investigations focused on the referents of prepositional phrases in texts covering the period from Old to Early Modern English.

The article “CEECing the baseline: Lexical stability and significant change in a historical corpus” by Jefrey Lijffijt, Tanja Säily and Terttu Nevalainen focuses on the problems of applying standard statistical tests based on the bag-of-words model, such as the chi-square test and the log-likelihood ratio test, to corpora that span long timelines. The authors argue that the bootstrap test is preferable to bag-of-words tests because the resampling involved provides an estimate of the variation in frequency counts. Another important contribution is the discussion of concurrent significance testing of multiple hypotheses, where the authors point out that uncorrected tests tend to lead to false findings of significance. Using this methodology, they investigate the diachronic continuity of the 17th-century part of the Corpus of Early English Correspondence by analysing the frequencies of all lexical items in the corpus over time. While the overall picture is that of relative stability, the authors show how historical events, such as the English Civil War, had an effect on the speed and quality of language change as well as on cultural vocabulary.
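
The contrast between bag-of-words tests and resampling can be made concrete with a minimal Python sketch. It is not the authors' implementation: the function names, the choice of whole texts as the resampling unit, the smoothed two-sided p-value and the use of a Bonferroni correction as a stand-in for multiple-hypothesis correction are all assumptions made for illustration.

```python
import random


def relative_frequency(texts, word):
    """Occurrences of `word` per token across a list of tokenised texts."""
    total = sum(len(text) for text in texts)
    hits = sum(text.count(word) for text in texts)
    return hits / total if total else 0.0


def bootstrap_p_value(corpus_a, corpus_b, word, n_resamples=2000, seed=0):
    """Two-sided bootstrap p-value for a frequency difference.

    Each corpus is a list of texts, each text a list of tokens. Whole texts
    are resampled with replacement (an assumption of this sketch), so the
    spread of the word across texts is reflected in the result.
    """
    rng = random.Random(seed)
    wins = 0.0
    for _ in range(n_resamples):
        sample_a = rng.choices(corpus_a, k=len(corpus_a))
        sample_b = rng.choices(corpus_b, k=len(corpus_b))
        freq_a = relative_frequency(sample_a, word)
        freq_b = relative_frequency(sample_b, word)
        if freq_a > freq_b:
            wins += 1.0
        elif freq_a == freq_b:
            wins += 0.5  # ties count as half a win
    p_one_sided = wins / n_resamples
    # Smoothed two-sided p-value; it can never be exactly zero.
    p_two_sided = 2 * min(p_one_sided, 1 - p_one_sided)
    return (1 + n_resamples * p_two_sided) / (1 + n_resamples)


def significant_after_bonferroni(p_values, alpha=0.05):
    """Flag p-values that survive a Bonferroni correction (illustrative)."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]
```

Because whole texts are resampled, a word that is frequent in only one or two letters yields unstable frequency estimates and therefore a larger p-value, whereas a bag-of-words test would treat every occurrence as independent evidence.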

Gerold Schneider’s article “Adapting a parser to historical English” addresses the challenges of using PDE parsers with historical varieties of English. Working with the Pro3Gres parser, Schneider identifies some of the major reasons for parsing mistakes, such as non-standard sentence length and punctuation, and discusses methods of improving accuracy, such as spelling standardization and the addition of new parsing rules for specific lexical items such as “but” and “lest”.

In his article “Tagging old with new: an experiment with corpus customization” Atro Voutilainen discusses the problems encountered when taggers and parsers developed for Present-day Standard English are used in the grammatical annotation of diachronic or dialect corpora. He outlines the process of preparing the corpus for tagging by stripping existing annotation from corpora and translating non-standard forms into standard varieties, and discusses how the tagged version is combined with the original corpus. Voutilainen illustrates his method using an extract taken from an eighteenth-century English corpus. The usefulness of the approach is demonstrated by the tagger’s error rate decreasing by more than half. The method can also be developed for more accurate analysis of syntactic structures.

The PowerPoint presentation of the Conference paper “Innovators of Early Modern English spelling change: Using DICER to investigate spelling variation trends” by Alistair Baron, Paul Rayson and Dawn Archer introduces a new web-based tool for exploring spelling variant patterns. DICER analyses spelling variant standardisations to discover and quantify character edit rules, which represent spelling decisions made by authors, scribes, editors and publishing houses.
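
The idea of deriving character edit rules from variant and standardised spellings can be illustrated with a short Python sketch. DICER's own algorithm is not described here, so the sketch substitutes the standard library's difflib alignment; the function names and the example word pairs are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def edit_rules(variant, standard):
    """Return the character edits that turn `variant` into `standard`.

    Each rule is a tuple (operation, variant_chars, standard_chars), where
    operation is "replace", "delete" or "insert" as reported by difflib.
    This alignment is a stand-in, not DICER's actual procedure.
    """
    matcher = SequenceMatcher(None, variant, standard, autojunk=False)
    rules = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            rules.append((tag, variant[i1:i2], standard[j1:j2]))
    return rules


def count_rules(pairs):
    """Count how often each edit rule occurs over (variant, standard) pairs."""
    counts = Counter()
    for variant, standard in pairs:
        counts.update(edit_rules(variant, standard))
    return counts


# Hypothetical variant/standard pairs, for illustration only.
pairs = [("vnto", "unto"), ("vp", "up"), ("doe", "do"), ("musick", "music")]
for rule, frequency in count_rules(pairs).most_common():
    print(rule, frequency)
```

Counting rules over many pairs, grouped for instance by date or printer, is one way in which trends in spelling decisions could be quantified.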


