Corpus Linguistics and Variation in English: Focus on Non-Native Englishes

Magnus Huber and Joybrato Mukherjee
University of Giessen


The 31st annual conference of the International Computer Archive of Modern and Medieval English (ICAME) was held at the University of Giessen, Germany, from 26th to 30th May 2010. As there is a strong corpus linguistic research tradition at Giessen’s Department of English, particularly in the areas of varieties of English, learner English and the history of English, we were very pleased and honoured to host ICAME 31. Over the past years, the research teams in Giessen have compiled corpora in diverse areas, for example the Sri Lanka and Ghana components of the International Corpus of English (ICE-SL and ICE-GH, written parts), the German component of the Louvain International Database of Spoken English Interlanguage and the Old Bailey Corpus, a corpus of 18th and 19th century spoken English. The conference topic “Corpus linguistics and variation in English” both reflected our interest in these areas and provided room for a large variety of papers, held together by a common interest in language variation and varieties of English.

We were overwhelmed by the response to our call for papers. In total, we received almost 200 abstracts. After a double-blind review process we were able to accept 60 full papers, 28 work-in-progress reports, 22 posters and 4 software demonstrations. The conference provided a multifaceted view on how the description and analysis of various dimensions of variation in English can profit from corpus data and corpus-based methodology, partly also in combination with other methods.

The three workshops at the beginning of the conference approached this theme from text linguistic, corpus methodological and historical perspectives: “News, (new) media, and corpora: from methodology to theory”, “Corpus Linguistics on the Web: Introducing the WebCorp Linguist’s Search Engine” and “Investigating Earlier Spoken English. Papers based on the Old Bailey Corpus”. Five plenary papers (published in the companion volume) showed how corpus linguistics can shed light on, and profit from, the study of the evolution of English vernaculars, language contact, stylistics, grammaticalization and the use of quantitative-statistical methods. The conference sessions were equally diverse thematically, but all demonstrated how corpus linguistic methods can help us to understand language variation and varieties of English. The topics ranged from dialects of English and New Englishes through English as a Foreign Language and English for Specific Purposes to the history of English. Within these broad areas, the papers addressed specific research questions on the descriptive levels of phonetics and phonology, morphology and syntax, text linguistics and discourse and pragmatics.

A selection of the papers at ICAME 31 is published in two complementary volumes, focussing on two different aspects of the conference. One volume is entitled Corpus Linguistics and Variation in English: Theory and Description (Mukherjee & Huber 2012) and includes 18 papers, all of them addressing relevant theoretical issues and representing a wide range of descriptive studies of English. The present volume is more focussed in topic and includes 13 papers that represent the strong interest at ICAME 31 in Outer Circle varieties of English and learner Englishes. The papers are theoretical as well as descriptive in orientation and provide a significant contribution to corpus linguistics and research into non-native Englishes from a variationist perspective. The papers address three main themes:

  1. The first six papers in this volume investigate different varieties of English, from English as a Native Language through English as a Second Language to English as an International Language.
  2. The following four contributions examine English as a Foreign Language in the workplace and the learning environment.
  3. The last three articles focus on translation corpora.

In “As the case may be: A corpus-based approach to pronoun case variation in subject predicative complements in British and American English” Georg Maier takes a closer look at the choice of pronoun forms in simple subject predicatives and subject predicatives also functioning as focus of it-clefts. Based on an analysis of the British National Corpus and the Corpus of Contemporary American English, Maier’s study shows that the choice of the subject or object form depends not only on the pronoun’s position but also on its function and partly on the mode of discourse. The study also illustrates cross-varietal differences.

Andrea Sand investigates “Singapore weblogs: Between speech and writing” and demonstrates that written computer-mediated communication promotes the use of features associated with spoken, non-standard varieties of New Englishes. Sand focuses on three features of Singaporean English: the use of discourse particles, zero subjects/objects/copulas and quotative like. The findings show that these features do indeed occur in the weblogs, albeit with lower frequencies than in conversations.

“The verb-complementational profile of offer in Sri Lankan English” is the topic of Tobias Bernaisch’s contribution. He compares the verb complementation of ditransitive offer in Sri Lankan English with that in the neighbouring Indian English and the historical input variety British English. Bernaisch’s analysis of the three ICE corpora as well as larger newspaper corpora indicates that there are clearly identifiable differences between the verb-complementational profiles of offer in the three varieties. He concludes that Sri Lankan English appears to be beginning to develop its own, variety-specific norms.

Stephanie Hackert, Dagmar Deuber, Carolin Biewer and Michaela Hilbert take a wider perspective in “Modals of possibility, ability and permission in selected New Englishes”. They investigate the use of modals in Fiji, India, Singapore, Trinidad, Jamaica and the Bahamas and compare it to British English, basing their analysis on the category “private conversations” in the respective ICE corpora. The results indicate that New Englishes show greater variability in can/could usage than British English. The observed patterns are explained through a variety of influences, including local creoles, socio-cultural phenomena, language-internal constraints and learners’ errors.

The Englishes of two Inner Circle countries (Great Britain and New Zealand) and two Outer Circle countries (Fiji and India) are analyzed in Gerold Schneider and Lena Zipp’s “Discovering new verb-preposition combinations in New Englishes”. The aim is to discover new prepositional verbs and to evaluate manual and semi-automated retrieval methods. The results of the two approaches are compared and illustrated by new verb-preposition combinations from ICE-India and ICE-Fiji.

Ruth Osimk-Teasdale considers the possibilities and challenges of part-of-speech (POS) tagging the Vienna Oxford International Corpus of English (VOICE) in “Applying existing tagging practices to VOICE”. She reviews existing L2 tagging systems and evaluates their suitability for VOICE, in which – in contrast to other L2-approaches – diverging ELF forms are considered ‘different’ rather than ‘deficient’. Specifically, Osimk-Teasdale’s paper focuses on unconventional forms and their functions in ELF interactions and the question of how these can be accounted for by POS tagging.

In “A phraseological comparison of international news agency reports published online: Lexical bundles in the English-language output of ANSA, Adnkronos, Reuters and UPI” Federico Gaspari looks at reports in English published online by four international news agencies. The paper compares lexical bundles in the texts produced by two Italian agencies with those in texts from the UK/USA agencies, the latter two serving as a native English control corpus. Gaspari discusses lexical bundles that are only found in the Italian subcorpora and shows that the overall usage of 4-word lexical bundles is much higher in the Italian texts, suggesting that these mediated news texts are much more formulaic than their native/original counterparts.

Data-driven learning is the topic of Nikoletta Rapti’s “Data-driven grammar teaching and adolescent EFL learners in Greece”. The paper reports on a corpus-based grammar study designed for a group of Greek adolescent learners of English and studies the effect of such an approach on motivation and learning results. The study compares an experimental group working on concordance-based tasks with a control group taught with a conventional grammar textbook. Rapti stresses the importance of teacher mediation and suggests that data-driven learning should first be introduced as a complement to conventional methods, until learners have become more comfortable with corpus work.

Stefanie Dose evaluates the authenticity and suitability for the classroom of scripted dialogue in “Flipping the script: A Corpus of American Television Series (CATS) for corpus-based language learning and teaching”. Her paper assesses the naturalness of the scripted dialogue in the CATS corpus by comparing the frequencies of indicators of spoken language (fillers and the discourse marker well) with those of naturally occurring dialogue. Dose finds that while there are some differences, these are smaller than expected. She concludes that television dialogue might be an ideal compromise between naturally occurring speech and artificial textbook dialogues.

“How fluent are advanced German learners of English (perceived to be)?” asks Sandra Götz and compares “Corpus findings vs. native-speaker perception”. The aim of her paper is to assess the overall oral proficiency of advanced German learners of English, as measured by temporal fluency and accuracy. Götz identifies five prototypical learners in the German component of LINDSEI and has them rated by 50 native speakers. A comparison of these ratings with quantitative corpus findings demonstrates that native speakers of English judge learners mainly on the basis of their accent and pragmatic features, which do not correlate significantly with the corpus-derived indicators of fluency and accuracy.

Anne-Line Graedler introduces the Norwegian-English Student Translation corpus in “NEST – a corpus in the brooding box”. Her paper discusses the design and compilation of the corpus, which will contain translations from Norwegian into English produced by students of English at colleges and universities in Norway. Her article starts with an overview of learner translation corpora and outlines the principles and procedures for the selection of students, the collection of texts and the original texts to be translated. Graedler also provides samples of student translations from NEST and highlights possible future research applications.

“Semantic prosody in a cross-linguistic perspective” is the title of Signe Oksefjell Ebeling’s contribution to this volume. It explores the negative semantic prosody of cause in the English-Norwegian Parallel Corpus to show the usefulness of bidirectional corpora for the study of the evaluative meaning of extended lexical units. Ebeling concludes that there are no Norwegian lexical items that exactly match cause in terms of negative semantic prosody. The most commonly used translation is typically used in neutral contexts in Norwegian, while another common correspondence is not used in negative contexts to the same degree as cause.

Thomas Egan examines the concepts of ‘throughness’ and ‘betweenness’ in English, Norwegian, German and French. His study “Between and through revisited” is based on the English-Norwegian Parallel Corpus and the Oslo Multilingual Corpus. The tokens of English through and between are classified by semantic domains, like space and time, and their translations are coded for syntactic congruence or divergence. Egan proposes semantic networks for both prepositions and compares the translations of Norwegian mellom and gjennom into English, French and German with one another.

We would like to thank the many volunteers who made ICAME 31 a great success. Special thanks go to the German Science Foundation (DFG) and the University of Giessen for their generous financial support of the conference. We are grateful to the peer reviewers for their thorough reviews of the articles submitted for publication in the conference proceedings, and to the authors for taking the reviewers’ constructive comments seriously. We are also indebted to the editor-in-chief, Terttu Nevalainen, and the editorial board for accepting this volume for publication in the VARIENG Studies in Variation, Contacts and Change in English series. The present volume would not have seen the light of day without the help of the editorial assistants Sebastian Schmidt in Giessen and Joseph McVeigh in Helsinki, who prepared the authors’ manuscripts for publication on the Web.


German Science Foundation (DFG):


ICE Corpora:

Louvain International Database of Spoken English Interlanguage (LINDSEI):

Old Bailey Corpus:

Vienna Oxford International Corpus of English (VOICE):


Mukherjee, Joybrato & Magnus Huber, eds. 2012. Corpus Linguistics and Variation in English: Theory and Description (Language and Computers: Studies in Practical Linguistics 75). Amsterdam: Rodopi.