Studies in Variation, Contacts and Change in English
Volume 10

Outposts of Historical Corpus Linguistics: From the Helsinki Corpus to a Proliferation of Resources

Edited by Jukka Tyrkkö, Matti Kilpiö, Terttu Nevalainen & Matti Rissanen
Research Unit for Variation, Contacts and Change in English (VARIENG), University of Helsinki

Publication date: 2012


Old English and Anglo-Latin

Akimoto, Minoji
Rivalry between expect and hope with particular reference to their constructional developments

According to Dixon (2005: 409), the verbs hope and expect belong to the WANTING type. Quirk et al. (1985: 1114), on the other hand, classify these comment clauses into three types in which I hope is said to express the speaker’s emotional attitude, and I expect is said to express the speaker’s tentativeness over the truth value of the matrix clause. Historically, hope is a native English word, already used actively in the Old English period, while expect derived from Latin ex(s)pectare (‘await’) and came into English about the middle of the 16th century. Irrespective of their provenance, in present-day English these two verbs behave similarly, with some differences in meaning, taking the to-infinitive and that or zero clauses. At the same time, they differ in some functions. While hope is often used in the form of I hope (that), expect is more often used in the form of ‘expect to do something’. I hope is also more frequent in parenthetical use than I expect.

This paper discusses the functional development of these two verbs from late Middle English to present-day English based on the Helsinki Corpus, the Archer Corpus and the FLOB Corpus, with particular attention to the process of competition through which the two verbs, despite their apparent similarity in meaning and function, have developed in different directions.

Heikkinen, Seppo
Elision and hiatus in early Anglo-Latin grammar and verse

A central prosodic feature of nearly all Latin verse, classical and medieval alike, is the avoidance of hiatus, where a word with a final vowel (or, in classical verse, a final m) is followed by a word with an initial vowel (or, in classical verse, an initial h). In classical Latin, hiatus is eliminated by a process known as elision, or synaloephe, where the final vowel of the preceding word is fused with the following one or left unpronounced. The avoidance of hiatus is a feature of most medieval verse as well, but early insular verse forms a notable exception: hiatus is abundant in many rhythmical Irish and Anglo-Latin hymns, and even in much of Anglo-Latin hexameter poetry elision is not systematically observed. This is a noticeable feature, above all, in the verse of Aldhelm and his followers, although Aldhelm himself gives a detailed description of elision in his treatise on metrics (De metris). Bede took an opposite line in his strenuous avoidance of hiatus in his own verse. In his metrical treatise De arte metrica he went so far as to condemn hiatus as a ‘pagan’ feature, a view which he attempted to corroborate with examples of Vergil’s artful deviations from the rules of elision. Bede’s reluctance to recognise hiatus as a contemporary rather than a pre-Christian feature reflects his attempts to show Christian verse in as favourable a light as possible, although, in a roundabout way, he probably also tried to regularise the prosodic practices of Anglo-Latin verse. The present paper studies this discrepancy between Anglo-Saxon prosodists’ views on hiatus and the prevailing verse technique of the same period and suggests that some seeming idiosyncrasies in Aldhelm’s use of elision are, in fact, probably based on certain lesser-known practices of Late Latin hexameter verse.

Lowrey, Brian
Early English causative constructions and the “second agent” factor

This paper examines the types of complement constructions associated with the principal periphrastic causative verbs in Old English. It seeks to show how complement selection is determined, to some extent at least, by semantic factors, and by some of the causative universals identified by Song (1996). Complementation patterns occurring with the implicative causatives, for instance, are different to those associated with common non-implicative manipulative verbs. And among the causatives themselves, individual verbs tend to select different kinds of complement, associated with different types of causative meaning. One parameter that proves to be particularly relevant to the choice of both verb and complement is agentivity, not only that of the causer, an element that has often been highlighted in previous work, but also that of the causee, the “second agent,” and the extent to which causer and causee interact. It is argued here that this factor plays an important role in the distribution of causatives in Old English, as well as driving some of the changes which have taken place within the causative group. Indeed, this paper shows that three different causative verbs, (ge)don, make and cause, have undergone much the same type of evolution, at different stages in the history of English, each in its turn becoming increasingly associated with fully agentive contexts.

Sówka-Pietraszewska, Katarzyna
On the development of a prepositional object construction with give verbs, motion verbs and Latinate verbs in English

This study focuses on the development of the prepositional object construction selected by ‘give’ verbs and ‘motion’ verbs in OE and ME. Specifically, it shows that with the erosion of the morphological case system this construction started to develop two underlying variants of meaning. Each variant was determined by the lexical meaning of a verb. As a result, the variant with ‘give’ verbs realized a caused possession meaning, while the variant with ‘motion’ verbs realized a caused motion meaning. Next, the analysis reassesses the approaches to Latinate double object verbs and proposes a new perspective on approaching their lexical meaning realization in the prepositional object construction.

Yanagi, Tomohiro
Ditransitive alternation and theme passivization in Old English

The purpose of this paper is twofold: (i) empirically, on the basis of data retrieved from the York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE; Taylor et al. 2003), to show (a)symmetries between the Theme and Goal arguments in ‘Theme passive’ constructions of Old English (OE); and (ii) theoretically to examine what syntactic positions the Theme and Goal arguments can occupy in clause structure. In the Theme passive construction of OE, the Theme argument (Direct Object) is marked with nominative case, while the Goal argument (Indirect Object) is marked with dative case. This type of passive is not allowed in present-day English, in which the Goal argument is usually marked with nominative case in ditransitive passive constructions.

It is often said that OE allows both ‘Goal-Theme’ and ‘Theme-Goal’ orders in double object constructions. The frequencies of the two word orders are quite close to each other, as observed by Koopman (1990a) and Allen (1995). Given this, it would be expected that in passive constructions with ditransitive verbs, both the ‘Goal-Theme’ and ‘Theme-Goal’ orders are observable with close frequencies. However, the frequencies of these two orders are not so close, unlike those in active double object constructions. Through a study of the YCOE, the present paper shows that when both the Theme and Goal are nominal, the ‘Theme-Goal’ order is preferred over the ‘Goal-Theme’ order; and that when both arguments are pronominal, on the other hand, the nominative Theme precedes the dative Goal much more frequently and the Theme rarely follows the Goal. In addition, a pronominal argument, whether it is Theme or Goal, tends to precede the other nominal argument in the Theme passive construction.

Areal and regional variation: Mapping out new territory

Anderwald, Lieselotte
Throve, pled, shrunk: The evolution of American English in the 19th century between language change and prescriptive norms

This paper investigates the evolution of some morphological differences between American and British English in the 19th century, and examines potential prescriptive influences on these developments. In particular, variable past tense forms in u/a-verbs (sing, shrink, drink) are compared to verbs in <u> (spin, sling) which show very little variation. This type of variation between different strong forms is compared to variation between strong and weak forms (in the verbs thrive and plead), and to the evolution of dynamic have gotten. In all cases, American grammars can be shown to differ strikingly from British grammars; however, evidence for direct prescriptive influence cannot be found. Instead, 19th-century grammars (at least for the phenomenon of past tense verbs) seem to mirror, in most cases with considerable time lag, what was happening in the language of the time.

Brinton, Laurel J., Stefan Dollinger & Margery Fee
Balanced corpora and quotation databases: Taking shortcuts or expanding methodological scope?

This paper focuses on the utility of a dictionary-based database for the purposes of linguistic research. The database is the product of digitizing a historical dictionary, A Dictionary of Canadianisms on Historical Principles (Avis et al. 1967), as part of an updating process. The citations of headwords in context (some 30,000) from the first edition of the dictionary have been combined with newly collected historical and present-day quotations (some 36,000) to form the Bank of Canadian English (BCE). Despite its medium size of 2.3 million words, it is the largest structured historical database of Canadian English.

After describing aspects of the process of collecting data for the BCE that distinguish it from other quotation databases, we present two test cases. First, with respect to deontic modality, we see a rise in have to and a decline in, although not obsolescence of, must in real time in the BCE that is consistent with apparent-time studies of contemporary Canadian English. Second, we examine the expansion of the progressive passive in Canadian English. The BCE provides quite early examples of the progressive passive and produces patterns of frequency quite similar to those shown by COHA for US English. The early appearance of the progressive passive in Canadian English casts doubt on one influential view concerning its origin. Based on these two case studies we argue that, beyond mere comparability, (semi‑)structured databases such as the BCE perform as well as historical balanced corpora in the area of historical regional dialectology.

Evans, Mel
A sociolinguistics of Early Modern spelling? An account of Queen Elizabeth I's correspondence

This article considers the potential of studying Early Modern spelling variation using a corpus-based historical sociolinguistic framework. In the first half of the paper the theoretical and methodological facets are outlined and an approach established, before the practicalities of spelling analysis are explored using a case study of the spelling system of Queen Elizabeth I. The results provide strong theoretical support for the integration of spelling into historical sociolinguistics, with Elizabeth’s background providing correlates between her education and age and her spelling practice. Whilst constrained to an idiolectal study, the analysis also suggests that the methodological difficulties attached to historical spelling are not insurmountable if certain precautions are taken in compiling a corpus. The article finds that spelling is a significant dimension of Early Modern English that warrants further, substantial investigation.

Jensen, Vibeke
The consonantal element (th) in some Late Middle English Yorkshire texts

This paper examines the distribution of variant spellings for the consonantal element (th) within a corpus of 43 Late Middle English Yorkshire texts: 21 religious prose texts and 22 legal documents. The consonantal element corresponding to PDE /θ, ð/, here defined as the variable (th), shows much variation in Middle English. It also shows specific developments in the northern area, which raise interesting questions about orthographic practices and sound-to-spelling relationships. The variant spellings within each text have here been recorded, and an analysis of their chronological and geographical distribution has been carried out. The study is based on two types of data: a manual questionnaire survey of entire texts or large samples of them, and a subset of the Middle English Grammar Corpus (Stenroos et al. 2011), which consists of transcriptions of 3,000-word samples of the same texts. All of the texts were localised in the Linguistic Atlas of Late Mediaeval English (McIntosh, Samuels and Benskin 1986). There is within the religious material no clearly discernible correlation between type of spelling and date, or between spelling and geographical localisation. For the documents, on the other hand, there is a slight increase of the <th> spelling over time. On the whole, the material shows strong evidence for a ‘Northern system’ that distinguishes between voiced and voiceless fricatives. At the same time, the material shows a clear distinction between grammatical words and lexical words, and variation between <y/þ> and <th> could therefore tentatively be interpreted as lexical rather than phonemic.

Stenroos, Merja & Kjetil V. Thengs
Two Staffordshires: real and linguistic space in the study of Late Middle English dialects

The purpose of this paper is to problematize the kinds of geographical distribution shown in Middle English dialect maps. It presents two sets of dialect maps of the late medieval county of Staffordshire, based on different approaches to medieval dialect geography. The first set is based on a corpus of local documents, organized according to the geographical provenance of the texts. The other one, based on the Middle English Grammar Corpus (MEG-C), represents the kind of maps published in the Linguistic Atlas of Late Mediaeval English or LALME, where the texts are localized on linguistic grounds. In the terminology of Williamson (2000: 119–120), the two sets of maps represent “geographical space” and “linguistic space” respectively; accordingly, the localizations on each kind of map mean quite different things and should be kept distinct. The paper compares the distributions shown in the two sets of maps and discusses their implications for the study of linguistic variation in Middle English. It is not suggested that one kind of map is “better” than the other; rather, they are complementary in the sense that they answer very different research questions.

Annotation and methods

Archer, Dawn
Corpus annotation: a welcome addition or an interpretation too far?

‘[...] annotation schemes and tools have become
an important dimension of corpus linguistics...’

The above quotation is taken from the Helsinki Corpus Festival website. The website goes on to highlight how annotation schemes and tools can help us to address a variety of issues … syntactic, semantic … pragmatic and sociolinguistic. The implicit message, then, seems to be that annotation – ‘the practice of adding interpretative linguistic information to a corpus’ (Leech 1997: 2), whether automatically or by manual means – will result in a “value added” corpus (Leech 2005). The value of annotation is further echoed by McEnery and Wilson’s (2001: 32) statement that ‘a corpus, when annotated’ becomes ‘a repository of linguistic information’ which, rather than being left implicit, is ‘made explicit’, and their later proposal that such ‘concrete annotation’ makes retrieving and analysing information contained in the corpus “quicker” and “easier”.

There are dissenting voices, of course. The late John Sinclair (2004: 191) believed corpus annotation to be ‘a perilous activity’, which somehow adversely affected the “integrity” of the text. He was particularly aggrieved by the possibility that researchers would only observe their corpus data ‘through the tags’ and hence miss ‘anything the tags [were] not sensitive to’ (p. 191). Susan Hunston is less critical than Sinclair. However, she does suggest that one of the strengths of corpus annotation – the ability to retrieve specific data systematically – might also prove to be a weakness if the researcher remains oblivious to the possibility that their research questions are being shaped, to some extent, by the categories used during the retrieval process (2002: 93).

In this article, I will provide some background in regard to the ongoing “annotation debate” before moving on to assess the strengths and weaknesses of my own use of annotation – manual and automatic – at the part-of-speech, semantic and (socio-)pragmatic level, using datasets which represent both present-day English and English of times past. My aim is to explore/suggest ways in which our texts are being enriched and might be further enriched – as opposed to being “contaminated” – by the annotation process (cf. Sinclair 2004); such that we become “sensitive annotation users” (and/or developers). I will also offer some blue-skies thinking in regard to an automated Historical Semantic Tagger which allows you to “shape the tags”, to some extent; and will ask whether such exploits – if achievable – might serve to transcend a distinction currently fuelling the “annotation debate”… that of corpus-based vs. corpus-driven.

Diemer, Stefan
Orthographic annotation of Middle English Corpora

This article presents examples from an image-linked Middle English research corpus with an alternative set of tags and description elements for spelling and spelling-related variation. It is argued that inclusion of these additional description elements is useful in order to examine spelling variation. The term spelling is used to indicate orthographic variation on a graphemic, lexical and morphosyntactic level. Spelling variation includes: abbreviations, scribal variation, additions, narrow and broad script, multi-level writing, cancellations, on-the-fly corrections and multiple corrections. Spelling-related features are any related manuscript properties (such as layout and preservation) that may influence spelling. Spelling-related variables included are: material and background variation, script type and size, line spacing, decorations, layout, deterioration, glossing and punctuation. The examples are taken from the Wycliffe Spelling Corpus (WSC), which is currently being compiled at Technical University Berlin. It combines text editions and original manuscript images from texts by writers associated with the English religious reformer John Wycliffe during the second half of the 14th century. While several of the variables are already integrated and tagged as parts of existing corpora, the proposed sets of tags and description elements allow optimization of corpora for both qualitative and quantitative orthographic research. The integration of the original manuscript images also facilitates further differentiation by referring back to the original and will be useful for research in related fields such as verb morphology or syntax.

Komen, Erwin
Coreferenced corpora for information structure research

Research in the area of information structure requires texts that are not only annotated syntactically, but have also been enriched with coreference information, from which the information status of constituents can be derived. This paper describes two computer programs that facilitate building a corpus of such enriched texts: "Cesax", which can be used to semi-automatically add coreferential information, and "CorpusStudio", which facilitates corpus research projects on enriched texts. The applicability of these two tools for information structure research is illustrated by a set of queries that look for prepositional phrases that introduce a new referent in a text. The proportion of PPs that do this grows from 54% in Old English to over 75% in late Modern English, the length distribution of the coreferential chains they introduce remains roughly the same throughout, and the proportion of those that introduce protagonist-like participants, being no more than 8% at any time, rises from Old English to Middle English, after which it gradually decreases again.

Lijffijt, Jefrey, Tanja Säily & Terttu Nevalainen
CEECing the baseline: Lexical stability and significant change in a historical corpus

Being able to trace language change in corpus data is premised on the assumption that the corpora providing the evidence remain comparable over time. General-purpose corpora such as the Helsinki Corpus of English Texts have been compiled using as closely similar text selection criteria as possible within each major period and even across periods. At the same time, we know that genres evolve over time, and diachronic continuity cannot be achieved over periods as long as the thousand years covered by the Helsinki Corpus, or even over shorter stretches of time.

We explore the diachronic continuity of a single-genre corpus, the 17th-century part of the Corpus of Early English Correspondence, by analysing the frequencies of all lexical items in the corpus over time. Our aim is to test the core assumption that a single-genre corpus can provide relatively homogeneous data over time. We review the effects of using particular statistical tests and their parameter choices for identifying significant changes and study the potential effects of the English Civil War (1642–1651) on the ongoing language change as well as on the use of war-related vocabulary, defined in terms of the Historical Thesaurus of the Oxford English Dictionary.

Our findings align well with previous research in that the choice of statistical test matters and that the Civil War did have an impact. We also find that in general the corpus appears to be fairly stable over time, thus providing the premised continuity.

Schneider, Gerold
Adapting a parser to historical English

The automatic syntactic analysis of natural language texts has made impressive progress. But when applied to historical linguistic texts, parser performance drops considerably. We first report parsing performance on several time periods in the Archer corpus. Second, we implement and evaluate a variety of adaptations to historical texts. We address spelling issues, adapt local grammar rules, and conduct an analysis of errors. Parser performance increases due to our adaptations. This pilot study also tests a number of more global extensions to address freer word order and different punctuation in earlier texts. We give an outlook on possible applications.

Voutilainen, Atro
Tagging old with new: an experiment with corpus customisation

There is a growing number of solutions for linguistic annotation of corpora that represent a present-day standard written variety of a natural language. Corpus linguists interested in diachronic or dialect corpora have a more limited choice of annotation solutions: annotation software built for a standard variety of a language generally provides an unacceptably low analysis accuracy when applied to a diachronic or dialect corpus. This paper outlines a simple semiautomatic solution to provide a more accurate annotation for a diachronic corpus while using a tagger originally built for a standard variety of the language: problematic words in the corpus are identified and translated (“standardised”), the tagger is run on the standardised corpus, and the resulting more accurate analysis is combined with the original corpus. An informal evaluation on an 18th-century English letter corpus is reported with promising results.
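
The three steps named in the abstract (standardise, tag, recombine) can be illustrated with a minimal sketch. This is not the author's implementation: the spelling map, the toy lexicon and the function names below are hypothetical stand-ins, and a real experiment would substitute an actual tagger built for the standard variety.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical spelling map (historical form -> present-day form); illustration only.
STANDARD_FORMS: Dict[str, str] = {"doe": "do", "hath": "has", "vpon": "upon"}


def toy_tagger(tokens: List[str]) -> List[str]:
    """Stand-in for a tagger built for the standard variety (an assumption)."""
    lexicon = {"do": "VERB", "has": "VERB", "upon": "ADP", "i": "PRON"}
    return [lexicon.get(token.lower(), "UNK") for token in tokens]


def tag_via_standardisation(
    tokens: List[str],
    tagger: Callable[[List[str]], List[str]] = toy_tagger,
    mapping: Dict[str, str] = STANDARD_FORMS,
) -> List[Tuple[str, str]]:
    # Step 1: translate ("standardise") the problematic words.
    standardised = [mapping.get(token.lower(), token) for token in tokens]
    # Step 2: run the standard-variety tagger on the standardised tokens.
    tags = tagger(standardised)
    # Step 3: combine the resulting tags with the original spellings.
    return list(zip(tokens, tags))


if __name__ == "__main__":
    print(tag_via_standardisation(["I", "doe", "vpon"]))
```

In this sketch the original spellings are preserved in the output, so the annotated corpus still reflects the historical text while the tags come from the standardised forms.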

Variation in Context

Cesiri, Daniela
Investigating the development of ESP through historical corpora: the case of archaeology articles written in English during the Late Modern period (and beyond?)

A number of recent linguistic contributions focus on the study of English for Specific Purposes (ESP), a discipline that studies the use of English for communicating specific and specialized knowledge in the global context. However, as specialised terms often enter the general lexicon as well, diachronic linguistic inquiry is essential to study the development of ESP as well as the evolution of general English.

Specific terms become, then, part of the general public’s linguistic repertoire, contributing to the spread of ‘scientific’ lexis and to the popularisation of specialized knowledge. One example of a discipline that awaits further linguistic investigation is archaeology, a field that is becoming increasingly popular among the general public both due to the desire to rediscover our ancient past and also thanks to the spread of popularised publications, journals, television programmes and movies (Clack & Brittain 2007). The investigation of a historical corpus of archaeology texts and essays is therefore important for studying the evolution of the discipline’s specific discourse in English and how the language of archaeology in English has evolved to become a distinct branch of ESP.

A previous contribution (Cesiri forthcoming) considers the linguistic features of present-day cultural heritage research articles (of which archaeology constitutes an important part). In continuation of this study, my article will seek to investigate the linguistic features characterising publications in English on archaeology. I will consider in particular the beginning of the discipline, that is to say the nineteenth and early twentieth centuries, which are the core centuries in the development of scientific techniques in archaeology and in its gaining proper academic status. The study will use a corpus of archaeology texts and the corpus analysis software Wordsmith Tools 5.0 (Scott 2008). Finally, the results from this study will be compared with those from Cesiri (forthcoming): this will be essential in the investigation of disciplinary and linguistic evolution in the field of archaeology as a distinct type of ESP.

Hillberg, Sanna
Relativiser that in Scottish English news writing

Relativisation is a well-researched area of English syntax. However, the main interest has been in spoken language, and investigations of relativisation strategies in written language are scarce. My forthcoming PhD thesis seeks to fill this gap. In this paper I discuss some preliminary findings of my work with respect to the use of the relativiser that in written news texts in Highland and Island Scottish English (HIScE) and Lowland Scottish English (LScE). These findings are compared with those in Standard English and Irish English.

As my database I use the Corpus of Scottish English On-line Press News (CSEOPN), which is a corpus I have compiled for the purposes of my PhD study, and the written news reportage sections of two components of the widely-used International Corpus of English (ICE), viz. ICE-Great Britain Release 2 and ICE-Ireland.

The findings show that differences arise in the use of the relativiser that between regional varieties of educated English of the British Isles, suggesting that variation in relativisation is not “an exclusive right” of spoken dialects, but extends to standard written varieties as well.

Rodríguez-Puente, Paula
Talking “private” with phrasal verbs: A corpus-based study of the use of phrasal verbs in diaries, journals and private letters

This paper focuses on the analysis of English phrasal verbs in diaries, journals and letters, the three genres of ARCHER 3.1 (A Representative Corpus of English Historical Registers, 1650–1999) which can be regarded as closer to oral or colloquial language. The aim is to investigate whether, just like in Present-Day English, these constructions can be associated with informal, speech-related registers in earlier stages of the language or, on the contrary, their presence in a text type is conditioned by other factor(s) and, if so, which one(s). Results show that, as already noted by Thim (2006), the contents of the text may prompt the use of phrasal verbs to convey predominantly literal meanings. However, this statement applies not only to earlier stages of the language but also to contemporary English. I further argue that the particular idiolects and individual preferences of the authors of the texts are especially important for the frequency of phrasal verbs in a text, as well as the changing characteristics of text types over time.

Vásquez-Lopez, Vera
The role of the audience in the use of action nominalizations in early modern scientific English

Attention has often been drawn to the frequent use of nominalizations in Early Modern scientific English (Gotti 2006: 679, Banks 2008). However, those texts typically classified as ‘scientific writing’ cannot be considered homogeneous or ‘stationary’ (Halliday and Martin 1993: 54), due to the diversity of their intended audiences and the different areas of knowledge represented.

In this paper, my working hypothesis is that such factors may have played a role in the kinds of nominalizations preferred in different scientific texts. Romance nominalizations would be more prevalent in more academic texts, directed to a learned, professional audience, whereas native nominalizations ending in –ing would occur more frequently in more popular texts. In order to test this, I analyze three categories of medical writing, each with different intended audiences (see Bennett 1970: 141–145, Taavitsainen 2001: 88), namely, academic treatises, surgical treatises and remedybooks.