Studies in Variation, Contacts and Change in English
Volume 2

Towards Multimedia in Corpus Studies

Edited by Päivi Pahta, Irma Taavitsainen, Terttu Nevalainen & Jukka Tyrkkö
Research Unit for Variation, Contacts and Change in English (VARIENG), University of Helsinki

Publication date: 2007


Blackwell, Susan
Variations in "motherese" pronoun usage

The register used by adults when addressing young children, variously termed "baby talk", "motherese", "caretaker talk" and "child-directed speech", is known to involve alterations in pronoun usage, but little research has been conducted on this subject since Wills (1977).

This paper presents research which examines mother-child dyads using the CHILDES database. Two of the groups studied involve normally developing children, one set acquiring Dutch (Groningen corpus) and the other English (Manchester corpus). The other two groups involve English-speaking children in the USA with Down syndrome and with autism (Flusberg corpus).

While the Manchester mothers were found to be remarkably homogeneous in their 'deviant' pronoun usage, the other mothers exhibited more variation. Interestingly, the mothers of the autistic children modified their language in different ways from the mothers of those with Down syndrome. This paper discusses the extent of the variation and its possible causes, and concludes with some observations about the efficacy of pronoun modifications in assisting the speech of children with communication disorders.

Gries, Stefan Th. & Caroline V. David
This is kind of / sort of interesting: variation in hedging in English

One of the domains where corpus linguistics has been particularly successful is the analysis of variation in the choice of lexical items, where the choice is governed by the context around the slot into which one of several functionally similar items is to be inserted. In this study, we investigate the variation found in two near-synonymous hedging expressions - kind of and sort of - on the basis of data from contemporary British English. We first retrieved all instances of kind of and sort of from the British National Corpus World edition. As a second step, we annotated each instance for:

  1. the lexical item(s) that the hedging expression modified;
  2. the word class(es) instantiated by these expressions;
  3. the medium and the register of the instance.

Finally, we investigated the resulting multidimensional table using distinctive collocate/collexeme analysis (cf. Church et al. 1994, Gries 2003, Gries and Stefanowitsch 2004) and techniques for the analysis of multidimensional contingency tables to determine how and to what extent the two expressions differ. Our discussion of the results focuses on factors that govern the choice of hedge; the factors include (i) factors external to language (viz., the situationally/contextually defined register or text type of the utterance(s) in question) and (ii) factors internal to language (viz., the so far unnoticed preferences of kind of and sort of to be used together with particular lexical items and semantic fields).
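The kind of comparison described above can be illustrated with a minimal sketch. It assumes that distinctiveness is assessed with a two-sided Fisher exact test on a 2x2 table per collocate (the collocate versus all other words, crossed with the two expressions). The function names, the placeholder collocates and all counts below are hypothetical illustrations, not data or code from the study.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all marginal totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more probable than the observed one.
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_obs * (1 + 1e-9)
    )


def distinctive_collocates(freq_a, freq_b):
    """Rank collocates by how strongly they prefer expression A or B."""
    total_a, total_b = sum(freq_a.values()), sum(freq_b.values())
    results = []
    for word in sorted(set(freq_a) | set(freq_b)):
        a, b = freq_a.get(word, 0), freq_b.get(word, 0)
        p = fisher_exact_two_sided(a, b, total_a - a, total_b - b)
        # Expected frequency with A if the word had no preference.
        expected_a = (a + b) * total_a / (total_a + total_b)
        preferred = "A" if a > expected_a else "B"
        results.append((word, preferred, p))
    return sorted(results, key=lambda item: item[2])


# Hypothetical counts of collocates after expression A and expression B.
counts_a = {"collocate_1": 30, "collocate_2": 10, "collocate_3": 5}
counts_b = {"collocate_1": 5, "collocate_2": 12, "collocate_3": 40}

for word, preferred, p in distinctive_collocates(counts_a, counts_b):
    print(f"{word}: prefers {preferred}, p = {p:.3g}")
```

Smaller p-values indicate collocates whose distribution across the two expressions departs more strongly from chance; the sketch deliberately ignores the register and word-class dimensions of the annotation.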

Kehoe, Andrew & Matt Gee
New corpora from the web: making web text more 'text-like'

In this paper we discuss the first stages in the development of the WebCorp Linguist's Search Engine. This tool makes the web more useful as a resource for linguistic analysis by enabling users to search it as a corpus on a vast scale. We report on how the Search Engine has been designed to overcome the limitations of our existing WebCorp system by bypassing commercial search engines and building web corpora of known size and composition. We examine in detail the nature of text on the web, beginning with a discussion of HTML format and the development of tools to extract the main textual content from HTML files whilst maintaining sentence and paragraph boundaries. We move on to look at other file formats, such as PDF and Microsoft Word, in an attempt to ascertain whether these offer the linguist different kinds of textual content for the building of corpora.
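As a rough illustration of extracting textual content from HTML while keeping paragraph boundaries, the following sketch uses Python's standard html.parser module. It is not the WebCorp implementation: the class name, the function name and the choice of which tags count as block-level or skippable are assumptions made for this example, and sentence splitting is not attempted.

```python
from html.parser import HTMLParser

# Assumed tag sets for this sketch; a real tool would need a fuller list.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6",
              "li", "br", "blockquote", "tr"}
SKIP_TAGS = {"script", "style"}


class ParagraphExtractor(HTMLParser):
    """Collects visible text, starting a new paragraph at block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._buffer = []
        self._skip_depth = 0

    def flush(self):
        # Join buffered text and normalise internal whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        # Ignore text inside script and style elements.
        if self._skip_depth == 0:
            self._buffer.append(data)


def extract_paragraphs(html):
    """Return the visible text of an HTML string as a list of paragraphs."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.paragraphs


sample = (
    "<html><head><style>p{}</style><script>var x = 1;</script></head>"
    "<body><h1>Title</h1><p>First <b>sentence</b>. Second\n sentence.</p>"
    "<p>Another paragraph.</p></body></html>"
)
for paragraph in extract_paragraphs(sample):
    print(paragraph)
```

Inline tags such as b leave the current paragraph open, so emphasis within running text does not split it; navigation menus and other page furniture are not filtered out in this sketch.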

Kohnen, Thomas
From Helsinki through the centuries: the design and development of English diachronic corpora

This paper gives an overview of the major issues connected with the design and development of English diachronic corpora. It addresses challenges of diachronic corpus design (for example, corpus size, the erratic distribution of surviving texts, genre continuity, and the lack of pragmatic and sociolinguistic information about texts) and it gives a survey of the most influential English diachronic corpora, tracing a major development from "long and thin" to "short and fat" corpora. In addition, it suggests some directions for corpus design which include research connected with the compilation of the Corpus of English Religious Prose at the University of Cologne. These suggestions deal with the sections and functions of texts as possible links between corpora, the concept of a "stratified corpus", and the distinction between "producer genres" and "receiver genres" in diachronic corpora.

Kytö, Merja, Peter Grund & Terry Walker
Regional variation and the language of English witness depositions 1560-1760: constructing a 'linguistic' edition in electronic form

This article has a twofold aim: to introduce ongoing work on an electronic edition of English witness depositions from the period 1560-1760; and to demonstrate in two case studies that this edition is particularly appropriate for studies of regional variation in Early Modern English. The first part of the article outlines the background and methodology of the project. It stresses the need for an accurate, large-scale electronic database of witness depositions based on transcriptions from the original manuscripts. These manuscripts originate from a variety of regions across England. In the second part of the article, the two case studies illustrate the influence of region on language use. Regarding the third person neuter pronoun forms (hit, it, 't, and him), the older form hit is only found in the North-west. In the choice between was and were with third person plural subjects, was is only frequent in the North, in particular the North-east.

Markus, Manfred
Wright's English Dialect Dictionary computerised: towards a new source of information

The computerised versions of the OED and of many recent learners' dictionaries have demonstrated the great advantages of electronic versions of big dictionaries over their alternatives in book form. The present paper investigates the possibilities and limits of computerised versions of regionalectal data as collected in the second half of the 19th century up to Joseph Wright's famous English Dialect Dictionary (1898-1905). Inspired by projects such as Ian Lancashire's compilation of dictionaries from the Early Modern English period, the present project, called SPEED ('Spoken English in Early Dialects') and supported by a grant of the Austrian Research Fund, will, in its first phase, provide a digitised databank version of Wright's unfairly neglected dictionary and encourage exploitation of the data which will then be available for dialectology, historical linguistics of spoken English and historical lexicology/phraseology. The present paper is mainly a broad description of the Dictionary's structure, i.e. it is concerned with the eight main parameters of its entries, the problems involved in these parameters or fields when they are transferred into a database, and the large amount of information stored in them.

Ooi, Vincent B.Y., Peter K.W. Tan & Andy K.L. Chiang
Analyzing personal weblogs in Singapore English: the Wmatrix approach

The blog, as a new text type, is a very recent and phenomenally successful online genre that is receiving worldwide attention. Of the many types of blogs (adventure, travel, political etc.) available, the personal weblog/journal has emerged as one that is arguably the most interesting. We therefore propose to test the hypothesis that the personal blogs of younger teens and maturing adults (such as undergraduates) in Singapore, though separated by only a few years in age, will reveal their respective online identities/cultures. A further hypothesis is that, since on-going research shows that gender is a significant sociolinguistic variable, the linguistic styles of males and females in these two groups may also be sufficiently distinct. In order to test these two hypotheses, two corpora representing Singaporean teenage and undergraduate personal blogs respectively are compiled for the study. For the corpus analysis, we propose to use an integrated corpus linguistic tool, Wmatrix, which affords word frequency profiles, lexico-grammatical patterning, part-of-speech annotation and semantic content analysis. As users of the software, we pose some challenges for future versions of Wmatrix to handle computer-mediated patterning and varieties of English (other than British and American) for the personal blog and other newer text types of online communication.

Renouf, Antoinette & Jayeeta Banerjee
The search for repulsion: a new corpus analytical approach

In the last two decades, our research has centred on word collocation and its role in the construction of meaning in text. In this paper, we propose that there is a 'force', which we call 'repulsion', that operates in an opposing way to that of lexical collocation. By 'repulsion', we mean the intuitively-observed tendency in conventional language use for certain pairs of words not to occur together. We write within the context of a large-scale study, which has the goal of establishing how repulsion operates in text and whether it has the status of an objective and measurable 'force'. We are interested in identifying the process of actual distancing between words, rather than just enumerating the instances where word co-occurrence is prohibited by other factors such as grammatical norms, and we further wish to make a clear distinction between cases of 'indifference' and of active repulsion. This is a hitherto unexplored aspect of language in use, and we hope to develop an objective 'lexical repulsion' measure, capable of providing insights into text creation which will be of use in lexicology, language pedagogy and NLP.

Schmied, Josef
The Chemnitz Corpus of Specialised and Popular Academic English

The Chemnitz Corpus of Specialised and Popular ACademic English (SPACE) is a parallel corpus, although it contains neither texts in several languages (all texts are in English) nor from different times (all texts were published after 2000) nor from different genres (if academic writing can be considered one genre). Still, it can be used to compare academic English from different academic disciplines as well as for different readerships. The defining criterion for our new Chemnitz SPACE Corpus is that the texts included are represented in pairs, where more or less the same content is explicitly referred to in scholarly articles (as part of the specialised expert-to-expert communication of a specific discipline) and in the derived popular versions (as part of the broader journalist-to-layperson communication in general academia). The new corpus will be a useful basis for many theoretical and applied research questions, as it allows us, for instance, to compare linguistic complexity on different levels, lexical, semantic and syntactic. In each case, we might assume that (usually professional) journalists might have simplified the original academic article to adapt the same form for the less specialised reader without losing too much content.

This contribution introduces the rationale behind the SPACE Corpus, its context and set-up, and it illustrates its usefulness in analysing complex phenomena across texts and domains in several small case studies.


