Studies in Variation, Contacts and Change in English
Volume 6

Methodological and Historical Dimensions of Corpus Linguistics

Edited by Paul Rayson,1 Sebastian Hoffmann2 & Geoffrey Leech3

1 University Centre for Computer Corpus Research on Language, Lancaster University
2 Department of English Studies, University of Trier
3 Department of Linguistics and English Language, Lancaster University

Publication date: 2011


Andersen, Gisle
Corpora as lexicographical basis – The case of anglicisms in Norwegian

This paper reports on the status of ongoing corpus building and lexicographical work within the framework of the Norwegian Newspaper Corpus project. Specifically, it describes the workflow, tools and methods used in the identification and analysis of new anglicisms in Norwegian. Surveying the lexical borrowing from English serves a variety of purposes, including lexical acquisition, the extraction of terminology, language technology, and more general linguistic purposes such as surveying the amount and inventory of English loan words in various usage domains. Observations from such a survey may form an empirical basis for language policy decisions, such as considering efforts for preventing domain loss. While previous work in Norwegian lexicography has generally relied on manual methods for excerpting new words, and for identifying anglicisms among them, the current project is an effort to develop tools which automatise the process of identifying, segmenting and analysing new loan words from English. A system for corpus-based language monitoring has been set up at Uni Digital (formerly Unifob AKSIS), in close cooperation with lexicographers at the University of Oslo. The paper briefly describes the overall workflow and focuses especially on alternative methods for identifying anglicisms (lexicon-based, n-gram-based, combinatory methods). The paper also presents some main trends and statistics regarding the use of anglicisms, as well as future plans for exploitation of this material.

Brekke, Magnar
Exploiting salience and fuzzy matching in evaluating term candidates in comparable corpora

This paper reports on an ongoing experiment to develop a methodology for capturing term candidates in comparable corpora by exploiting their textual salience or significance in evaluating the statistical likelihood of a given vocabulary unit turning up in a specific text. For this experiment the specific texts were selected from the KB-N English Corpus of Comparable Text, one American, one British/European, both from the subdomain of taxation and representing the genre of regulation. Visual inspection of parallel frequency lists sorted by salience ratio revealed a limited but significant lexical overlap as well as interesting divergences. By moving from exact matching to “fuzzy” matching and combining captured items into fuzzy groups of lexically related but morphologically divergent forms the degree of terminological correspondence between the two comparable texts was seen to increase considerably, strengthening the case for a given term candidate or even bringing an otherwise unnoticed term candidate above the salience threshold. Implementation of the approach tested and described here would significantly increase the amount of text that can be brought within the scope of automatic term capture.

Damascelli, Adriana Teresa
An e-based environment for teaching and learning English for the social services

This paper presents a project under development at the University of Torino that enables students of the degree course in “Social Services” to learn English with special reference to their profession. The project consists of the development of an e-based teaching and learning environment. It is motivated by the lack of English language teaching and learning materials related to the profession of the social worker, at least on the Italian market, where attention is drawn more to English for other purposes (e.g. business, law, and medicine). The aim is to provide language learners with resources which can be used alongside those which are currently available. These consist of two manuals which have been prepared specifically: one is addressed to beginners and learners with basic notions of English grammar and lexis, whereas the other manual is used with students whose level of English is pre-intermediate.

For the realisation of the present project, corpus linguistics methodology has been applied. A corpus of about three million words has been built. The texts included in the corpus deal with topics connected to the profession of social worker. Some have been selected for reading comprehension activities and provide the learners with links to the concordances of some highlighted terms which are considered relevant to the language of the profession. The availability of a key term list has been useful for the building of a glossary of reference.

Gardner, Anne-Christine
Word formation in Early Middle English: Abstract nouns in the Linguistic Atlas of Early Middle English

Since the early 1990s historical word formation, in particular derivation in Early Middle English, has increasingly attracted scholarly interest in the form of more general approaches to productivity and semantics. The present study focuses on the derivational patterns available to speakers and aims to identify factors which could influence the speakers’ choices. For this purpose abstract formations in Early Middle English will be investigated, specifically (near) synonyms involving the Germanic suffixes -dom, -hood, -ness and -ship in which various suffixes can be attached to the same base without any or with only little differentiation in meaning. Abstract nouns ending in -lac, its Scandinavian cognate -leikr and -reden are also taken into consideration since – despite their subsequent, virtually complete demise – they still form an observable part of the lexicon and are represented in doublets such as fairness ~ fairleikr and fellowship ~ fellowreden. Regional and temporal variation, as well as the influence of text types, are shown to be factors which may have motivated the choice of suffixes in such synonymous derivations. The corpus-linguistic analysis is based on the Linguistic Atlas of Early Middle English, 1150–1325 (LAEME), the most recent corpus dedicated to the early period of Middle English. Owing to the patchiness of records, the texts are grouped in a way similar to the prototypical text categories proposed by the Helsinki Corpus in order to facilitate the comparison of data across space, time and text type. LAEME as a new research tool also offers an opportunity to re-examine previous statements which have to be amended in the light of new data.

Gather, Kirsten
Object dislocation in English hymns between 1500 and 1900 – A corpus-based study

Hymns form a fundamental part of Christian worship, and have been a very stable genre for at least four centuries, while other verse genres have changed considerably. My study is concerned with a linguistic peculiarity that applies to many English hymns: they show a considerable number of syntactic dislocations. We find objects, complements, and obligatory adverbials that are moved in front of the predicate verb instead of remaining in their ‘proper’ position.

Miles Coverdale, for instance, offers clauses such as “Thus wyll I all thy synnes forgyue” (1535). Nahum Tate and Nicholas Brady versify “On ev’ry side, thy hand I find” (1696). About 1780, we find examples by the Methodist Charles Wesley, such as “You for higher ends were born”, and Henry Baker’s Hymns Ancient and Modern (1861) provides “So we the father’s help will claim”.

Syntactic dislocation is a striking characteristic of many hymn texts. As objects are the largest and most prominent group of dislocated obligatory constituents, this study is concerned with object dislocation. The textual basis is a pilot corpus of hymns (64,000 words) dating from 1500 to 1900. It comprises text samples from the most widespread hymnals of the time, as well as selections taken from lesser-known works.

The study shows that object dislocation is by no means a uniform process. It varies, for instance, in the type of dislocated phrase and the involvement of auxiliaries. The most prominent factors for dislocating the object are, however, metre and rhyme. There is a close connection between syntactic and prosodic factors.

Up to now, there have been no corpus-linguistic studies of English hymns, as verse is usually considered too artificial to form an object of research. My study may be seen as a first step towards the linguistic coverage of a poetic genre.

Hiltunen, Turo & Jukka Tyrkkö
Existential there constructions in early medical texts

Despite the extensive literature on the syntax of the existential there construction (ETC), little information is available on how it is used in specific registers in the history of English. This paper explores the use of the ETC in one specific domain, namely vernacular scientific writing in Middle and Early Modern English, and relates the findings to broader developments in the history of medical writing. Using the Middle English Medical Texts (MEMT) and Early Modern English Medical Texts (EMEMT) corpora as primary material, we examine the frequency of use of the ETC and compare four of its phraseological patterns (choice of verb, tense, use of modals, polarity) across different categories of medical writing. The results are compared to data from previous research on the history of the construction.

The main findings of this paper are twofold. Firstly, while the overall frequency of the ETC is relatively stable in medical writing from the 15th to the 17th centuries, considerable variation can be observed between different categories of medical writing and, in particular, between individual texts. Secondly, significant variation is seen in the phraseological patterns of the construction; notably, the relative frequencies of constructions formed with the verb BE and constructions with negative polarity increase from ME to EModE. Comparing these results to contemporary reference data suggests that these developments are, at least to some extent, specific to the register of medical writing, reflecting developments in both medical discourse and the underlying styles of thought.

Kehoe, Andrew & Matt Gee
Social tagging: A new perspective on textual ‘aboutness’

Society is increasingly dependent on digital information. Much of this is available online free of charge but metadata is at a premium. This has encouraged the emergence of a new online phenomenon known as social (or collaborative) tagging. The predominant social tagging site is Delicious, which allows users to assign keywords (or ‘tags’) to their bookmarks (favourite web pages) to describe their content. These tags are then shared with other users, who can search the collection by tag. However, many of the linguistic problems which exist in traditional keyword search remain. Most research on tagging to date has been conducted by information scientists, but this paper describes new work which is examining social tagging from a corpus linguistic perspective. Our discussion compares the new, text-external aboutness indicators offered by social tagging with text-internal aboutness indicators. We illustrate how we are using this multi-layered approach to aboutness both to make better sense of the existing social tagging and to suggest guidelines for better tagging practice. Our work aims to reconcile the worlds of formal textual analysis and intuition.

Kohnen, Thomas, Tanja Rütten & Ingvilt Marcoe
Early Modern English religious prose – A conservative register?

In this paper, we challenge the generally alleged status of Early Modern English religious prose as a conservative language variety resistant to language change and linguistic innovation. We show that a uniform description of religious language as conservative does not reflect the actual language use in the various religious genres and that Early Modern religious prose forms a continuum rather than a solid archaic block with regard to the developing standard variety. Tracing the development of thou vs. you, -th vs. -s, be vs. are, and the which vs. which in prayers, catechisms, religious biographies and sermons, we show that most religious genres follow the general development of the language, though sometimes later and to a lesser extent. Only a few linguistic features are clearly diagnostic of religious language, and few genres have preserved these features exclusively in the religious domain.

Laitinen, Mikko
Contacts and variability in international Englishes: Compiling and using the Corpus of English in Finland

The past few decades have witnessed a substantial spread of English as the language of international communication. One of the consequences of this growth of English world-wide is that there is a need to develop new corpora that make it possible to describe these expanded uses of English. To illustrate the expansion of English, this article first presents an overview of the linguistic situation of present-day Finland, a country in which English has long been used as a foreign language with no institutional status. The various uses of English in Finland are illustrated by drawing on the results of the recent large-scale survey of English in Finland, carried out at the Research Unit for Variation, Contacts and Change in English (VARIENG). These survey results provide useful background information for an on-going corpus compilation process of the Corpus of English in Finland. The article then presents the results of three case studies that explore morphological and grammatical variability in this material.

Lehmann, Hans Martin & Gerold Schneider
A large-scale investigation of verb-attached prepositional phrases

This paper explores verb-attached PPs as adjuncts or parts of verbal frames with the help of a large-scale valency-database generated from the output of a dependency parser. Our investigation is based on more than 240 million words of American and British English. We combine measures of surprise with measures of lexical diversity in order to study fixedness of various types. We also use these measures to explore the cline from verbal complement to adjunct and test statistical measures of surprise as a means of distinguishing complements from adjuncts. We calculate measures of surprise and variability for verb-preposition, verb-object-PP and other combinations in order to identify and locate verbal idioms.

Meijs, Willem & Susan Blackwell
Loaded words: Evolving interpretations of ‘anti-semitic’ and ‘anti-semitism’ in dictionary definitions and in public discourse

Some words are loaded with connotative associations that make them highly sensitive elements in public discourse, especially political and legal discourse. This is certainly the case with the words anti-semitic and anti-semitism.

While Semites and semitic were originally used to refer to a broad ethnic category that included both Arabs and Jews, their derivatives anti-semitic and anti-semitism came to be applied, from first use, almost exclusively to people of Jewish ethnicity or religion, meaning roughly ‘hatred of / hostility towards Jews’. In some quarters over the past few decades there has been a further semantic shift, involving an extension of the meaning of anti-semitism to include criticism of, or hostility towards, the state of Israel. This paper traces these semantic shifts both in evolving dictionary definitions and in public discourse as evidenced in the Bank of English and the World-Wide Web.

More recently still, the European Monitoring Centre on Racism and Xenophobia (EUMC) devoted some emphasis in its 2004 report to the lack of a common definition of anti-semitism, and promptly offered one. The resulting “EUMC Working definition” has been taken up throughout the EU: for instance in the British Parliament through the report of its All-Party Parliamentary Group against Antisemitism (September 2006). This did not volunteer a definition of its own but concluded: “We recommend that the EUMC Working Definition of antisemitism is adopted and promoted by the Government and law enforcement agencies.”

Our study revealed that the terms semite, anti-semitic and anti-semitism are the focus of much linguistic contention, particularly in the UK. We found that collocational patterns in data culled from the World-Wide Web fluctuated widely from one year to another: disturbingly, this appeared to be largely due to the influence of pressure groups and documents in the public eye at the time. We conclude by drawing some salutary lessons for linguists and lexicographers.

Mudraya, Olga & Paul Rayson
The language of over-50s in online dating classified ads

This paper reports on a study examining key words and key semantic domains in the data collected from the online classified ads on the dating website KindredSpirits. We use the Wmatrix web-based corpus processing software tool for linguistic analysis, in order to compare the language of men looking for women, men looking for men, women looking for women, and women looking for men. The age group under investigation is the over-50s.

Linguistic research into the language of online dating ads is still scarce. The vocabulary and semantics of the online dating ads have not yet been investigated, although a number of studies in psychology and evolutionary anthropology have identified important personal trait categories, such as age, physical attractiveness, resources (current or future earning potential), and commitment to the relationship (Bereczkei & Csanaky 1996; Bereczkei et al. 1997; Greenlees & McGrew 1994; Wiederman 1993), as well as entertainment and social skills (Miller 1998). Robin Dunbar was involved in a series of evolutionary psychology investigations of different categories of words in Lonely Hearts advertisements (Waynforth & Dunbar 1995; Pawłowski & Dunbar 1999a; Pawłowski & Dunbar 1999b; Pawłowski & Dunbar 2001) that found that men and women attached different levels of importance to the following five categories of traits: attractiveness, resources, commitment, social skills and sexiness.

This paper describes the results arrived at using our corpus-based methodology and compares them with those in Pawłowski and Dunbar’s (2001) study. In our data, all five of Pawłowski and Dunbar’s categories appear as statistically significant key semantic domains, and we find other statistically significant categories. Being happy, energetic and enjoying life appear at the top of our list. As in Pawłowski and Dunbar’s (2001) study, sexiness is not statistically significant in either of the heterosexual groups, although the sexual relationship category is statistically significant for homosexual men. However, even in this subgroup, general relationships based on friendship appear to be more important than sexual relationships.


