Studies in Variation, Contacts and Change in English
Volume 13

Corpus Linguistics and Variation in English: Focus on Non-Native Englishes

Edited by Magnus Huber and Joybrato Mukherjee
University of Giessen

Publication date: 2013


Maier, Georg
As the case may be: A corpus-based approach to pronoun case variation in subject predicative complements in British and American English

The distribution of pronoun case forms in Present-Day English has been the subject of a lively linguistic debate. There are variable contexts which seem to permit a choice between the use of subjective and objective pronoun forms. This paper provides a survey of the distribution of British and American English pronoun case forms in two of these variable contexts, i.e. simple subject predicatives and subject predicatives also functioning as focus of it-clefts. Based on data from the British National Corpus and the Corpus of Contemporary American English, this study tests longstanding hypotheses on the distribution of pronoun case forms and also sheds more light on the mechanisms determining the choice of pronoun case forms in variable contexts. First, the analysis of the data shows that it is not only the position of the pronoun relative to the finite verb but also its function which may influence its form. Additionally, it is demonstrated that the mode of discourse may significantly influence the choice of a pronoun form in some contexts but not in others. Furthermore, this study also reveals significant cross-varietal differences in the distribution of pronoun case forms and emphasises the need for larger cross-varietal databases in order to analyse infrequent morphosyntactic phenomena quantitatively.

Sand, Andrea
Using translation corpora to explore synonymy and polysemy

The aim of the present paper is to shed light on the question of whether computer-mediated communication promotes the use of non-standard varieties of New Englishes in writing, e.g. Colloquial Singapore English or Singlish. The database consists of a corpus of weblogs from Singapore as well as informal speech and writing from the Singaporean sub-corpus of the International Corpus of English. The exploratory analysis of the data focuses on three features which are commonly associated with non-standard, spoken Singaporean English. First, the use of discourse particles, such as lah or lor, is considered, as these are regarded as very typical of Colloquial Singapore English usage. On the level of syntax, zero constituents, i.e. zero subjects, zero objects and zero copulas, are examined to shed light on the question of whether these substrate-influenced features of Colloquial Singapore English are also used in weblogs, where the immediate context of face-to-face interaction is missing. Finally, quotative like, which is associated with the spoken English of younger speakers worldwide, is studied as an example of a recent innovation on a global scale. The analysis reveals that these features are indeed used by the bloggers, but with lower frequencies than in conversations. This is in line with studies of weblogs elsewhere in the Anglophone world.

Bernaisch, Tobias
The verb-complementational profile of offer in Sri Lankan English

The present pilot study investigates verb complementation in Sri Lankan English, a hitherto largely neglected variety in corpus-based studies of New Englishes. With a focus on the ditransitive verb offer, Sri Lankan English is studied in comparison to both British English, the historical input variety of Sri Lankan English, and Indian English, which may exert epicentral influences on other South Asian varieties of English (cf. Leitner 1992). Based on the Sri Lankan, British and Indian components of the International Corpus of English (ICE) and larger (partly web-derived) newspaper corpora, the frequencies and distributions of the verb-complementational patterns of offer are analysed with regard to the various meanings of the verb, a covariate with the potential to influence the choice of a particular syntactic pattern (cf. Bresnan & Hay 2008). The results of this pilot study indicate that there are clearly identifiable differences between the verb-complementational profiles of offer in the three varieties under scrutiny. In the light of these findings and with reference to Schneider’s (2003, 2007) model of the evolution of postcolonial Englishes, the present paper finds first indications on theoretical as well as on empirical grounds that Sri Lankan English might begin to develop variety-specific norms on the lexicogrammatical level of language organisation.

Hackert, Stephanie, Dagmar Deuber, Carolin Biewer & Michaela Hilbert
Modals of possibility, ability and permission in selected New Englishes

This paper investigates the use of modals of possibility, ability and permission in six New Englishes (Fiji, Indian, Singapore, Trinidadian, Jamaican and Bahamian English), with British English considered for comparison. The data are drawn from the text category “private conversations” in the respective ICE corpora. Based on the framework developed by Deuber (2010a), we analyse the quantitative distribution of can/could as well as these modals’ uses and meanings and also contrast them with other forms expressing possibility, ability and permission, i.e. be able to and may/might. In general, varieties of English spoken as a second language or dialect appear to show greater variability in the usage of can/could than native varieties of English. Whereas in Trinidadian and Bahamian English, could occurs considerably more often than can, Jamaican English most closely resembles British English, and the Asian Englishes under study strongly prefer can. In order to explain the patterns found, we take into consideration not only influence from the local Creole languages but also socio-cultural phenomena, topic constraints, the idiomatic usage of can in Singapore English and finally, but possibly marginally, learner errors in the case of Fiji English.

Schneider, Gerold & Lena Zipp
Discovering new verb-preposition combinations in New Englishes

The grammatical description of New Englishes is a relatively young field but at the same time one that benefitted much from recent developments in corpus linguistics. Standard reference corpora such as the International Corpus of English (ICE) have made it possible to research grammatical phenomena even in smaller outer circle varieties of English. In the field of grammar, innovations typically start out at the intersection of grammar and lexis. We investigate verb-preposition combinations in four corpora of first and second language varieties of English, among them the preliminary version of the written component of ICE Fiji. Our focus is on what has been termed ‘new prepositional verbs’ (cf. Mukherjee 2009, Nesselhauf 2009), i.e. novel combinations of verbs and prepositions.

We compare a manual and a semi-automated approach to the study of new verb-preposition combinations. The manual approach consists of a surface search for prepositions followed by a careful manual filtering process. The semi-automated approach is a corpus-driven investigation using parsed corpora and detecting variation-specific prepositional collocations. Typically, the advantage of manual searches is that precision is very high; the disadvantage is that the investigation is time-consuming and recall can be incomplete, because the scope of investigations may have to be restricted. The advantage of automatic, parse-based methods is that they are fast and corpus-driven, which may increase recall; the disadvantage is that error-rates are high, which seriously affects precision. We discuss similarities and differences in the results of the two approaches and show examples of new verb-preposition combinations from ICE India and ICE Fiji that the two approaches deliver. We conclude that both methods validate, but also complement each other.
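The trade-off between precision and recall described above can be made concrete with a minimal sketch. It is not the authors' evaluation procedure: the function name `precision_recall`, the gold-standard set and the two result sets are all hypothetical, and the item labels are placeholders rather than attested verb-preposition combinations.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved items against a gold set."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Precision: share of retrieved items that are genuine hits.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall: share of genuine items that were actually retrieved.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical gold standard of genuinely new combinations (placeholder labels).
gold = {"combo_a", "combo_b", "combo_c", "combo_d"}

# Assumed outcome of a restricted manual search: few hits, all correct.
manual_hits = {"combo_a", "combo_b"}

# Assumed outcome of a parse-based search: more hits, but with parser noise.
parser_hits = {"combo_a", "combo_b", "combo_c", "noise_1", "noise_2", "noise_3"}

manual = precision_recall(manual_hits, gold)
parser = precision_recall(parser_hits, gold)
print("manual:", manual)
print("parser:", parser)
```

With these invented sets, the manual search is fully precise but finds only half of the gold items, while the parse-based search finds more of them at the cost of lower precision, which mirrors the complementarity the abstract points to.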

Osimk-Teasdale, Ruth
Applying existing tagging practices to VOICE

Research on English as a lingua franca (ELF) has attracted increasing interest in recent years. An important development in this was the release of the Vienna Oxford International Corpus of English (VOICE) in May 2009, the first free-to-use resource containing over one million words of naturally occurring transcribed spoken ELF conversations. To increase its usability, we are currently investigating the possibilities of assigning part-of-speech (POS) tags to the VOICE Corpus. The nature of the data makes this a highly challenging task and raises a number of issues.

The main issue addressed in this paper is the extent to which existing tagging systems of available L2 corpora can be applied to a corpus like VOICE. The first part of this paper is a review and evaluation of POS tagging practices which have been used previously for spoken learner corpora, with reference to their applicability to VOICE. The POS tagging practices commonly used in SLA research have traditionally taken an error-based approach, viewing divergent L2 forms as steps in acquiring the target language system, and therefore deficient in relation to L1 output. The approach to ELF adopted in the VOICE project, on the other hand, takes a ‘difference’ rather than ‘deficiency’ perspective and unconventional forms are considered to be communicatively motivated rather than mere shortcomings of the target language form or ‘errors’. In this respect, ELF speakers are viewed as goal-oriented users of the language in their own right, and clearly differentiated from learners of English. Clearly, this position needs to be reflected in POS practices adopted for an ELF corpus like VOICE.

Part two of the paper then addresses in particular the issue of unconventional forms in connection with the form-function relationship that is realized in ELF interactions (cf. Seidlhofer 2009a) and discusses how this central issue in ELF might be accounted for with available tagging systems, e.g. SLAtagging (Rastelli 2009). The paper raises broader questions about how L2 language (cf. Cook 2002), whether conceived of as a learner language (EFL, SLA) or user language (ELF), can be adequately pre-processed in terms of POS tagging. This is tested in a pilot study undertaken on a small sample of VOICE data.

Gaspari, Federico
A phraseological comparison of international news agency reports published online: Lexical bundles in the English-language output of ANSA, Adnkronos, Reuters and UPI

This paper presents a study of the lexical bundles (LBs) used in the English-language reports published online by four international news agencies: ANSA and Adnkronos (both based in Italy), Reuters and United Press International (UPI) – headquartered in the UK and the USA, respectively. Given that they provide insights into the phraseological make-up of news texts, LBs are considered revealing indicators of the complex strategies at work in the news-making process: the LBs found in the ANSA and Adnkronos online news reports (which result from an elaborate process of linguistic mediation combining translation, non-native writing, cross-linguistic summarisation, editing and adaptation) are compared with those used by Reuters and UPI (representing the native/original benchmark), in search of commonalities as well as divergences, and peculiar usage patterns displayed by individual news agencies.

Only one 4-word LB occurs in all four sources (“the end of the”), and the analysis looks more closely at those that are over- and under-represented across the corpora, considering their discursive functions to account for the observed discrepancies. The results bring to the fore the distinctive phraseological features of the mediated reports published in English by the two Italian news agencies: three LBs (“the head of the”, “as well as the” and “is one of the”) are present both in the ANSA and Adnkronos data, but they do not feature in the Reuters and UPI corpora; conversely, no LB used in both native/original corpora is simultaneously absent from the ANSA and Adnkronos data.

In addition, the overall usage of 4-word LBs is much higher in the ANSA and Adnkronos corpora compared to the Reuters and UPI data, suggesting that the mediated news texts are much more formulaic than their native/original counterparts. Another interesting finding of this study is that ANSA and Adnkronos taken individually have the highest number of LBs that are not attested in the data of any of the other three news agencies, which points to a more idiosyncratic use of phraseology in the mediated news reports.
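The kind of comparison described in this abstract can be illustrated with a minimal sketch of 4-word bundle extraction. It is not the author's procedure: the function name `four_word_bundles`, the raw-frequency cut-off `min_freq=2` and the two toy texts are assumptions made for illustration only. Studies of lexical bundles normally apply normalised frequency thresholds and dispersion criteria to full corpora.

```python
from collections import Counter


def four_word_bundles(text, min_freq=2):
    """Return the set of 4-word sequences occurring at least min_freq times."""
    tokens = text.lower().split()
    # Count every contiguous 4-token window in the text.
    grams = Counter(tuple(tokens[i:i + 4]) for i in range(len(tokens) - 3))
    return {" ".join(gram) for gram, count in grams.items() if count >= min_freq}


# Toy stand-ins for a mediated and a native/original corpus (invented text).
mediated_text = (
    "the head of the party said the head of the union left "
    "at the end of the day before the end of the week"
)
native_text = "at the end of the day and at the end of the week"

mediated = four_word_bundles(mediated_text)
native = four_word_bundles(native_text)

shared = mediated & native         # bundles attested in both sources
only_mediated = mediated - native  # bundles specific to the mediated text
only_native = native - mediated    # bundles specific to the native text

print("shared:", sorted(shared))
print("only mediated:", sorted(only_mediated))
print("only native:", sorted(only_native))
```

Set intersection and difference are enough to separate shared bundles from source-specific ones. Comparing overall usage across corpora of different sizes would additionally require frequencies normalised per number of words.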

Rapti, Nikoletta
Data-driven grammar teaching and adolescent EFL learners in Greece

Data-driven learning (DDL) is a ground-breaking approach and is attracting increasing interest in language teaching and learning, but there is a need for more empirical research with younger, lower level students and a wider focus on aspects of language learning before concordances find their way into EFL classrooms. This paper reports on a corpus-based grammar study which was conducted in Greece with a group of adolescent learners and investigates the impact of DDL on motivation and the learning of grammar. To this end, concordance-based tasks were designed in printout form for use with the experimental group, whereas a conventional grammar textbook was used with the control group. The empirical evidence drawn from the qualitative data and test performances underlines the significance of teacher mediation and suggests applying DDL in the classroom as a complement to conventional approaches initially, until learners become more comfortable with corpus exploration and the concordance format.

Dose, Stefanie
Flipping the script: A Corpus of American Television Series (CATS) for corpus-based language learning and teaching

In the past decade, linguists have increasingly advocated the use of spoken corpora for language learning and teaching to remedy the lack of spoken models in the EFL classroom (e.g. Mauranen 2004; Zorzi 2001). However, the nature of linguistic corpora (built with other than educational aims) is often not suitable for classroom use. The present paper introduces the Corpus of American Television Series (CATS), which was recently compiled at Giessen University as part of a larger, ongoing research project (Dose in prep.; see also Dose 2012). CATS consists of some 160,000 words of transcribed spoken language from four contemporary American television series and is designed with educational aims in mind, i.e. for the teaching and learning of spoken grammar. The hallmark of the corpus is that it consists of scripted speech, that is, spoken language which is based on a script. Scripted language differs in many respects from naturally occurring language. For instance, it has been noted that it is characterized by a lower frequency of performance phenomena, and so this poses the question of how ‘spoken’ it generally is and whether this type of (supposedly) ‘inauthentic’ language can still be a useful model for the EFL classroom. A first analysis of some ‘indicators of spoken style’, viz. fillers (uh, uhm) and the discourse marker well, will give some indication of where the scripted speech as represented in CATS can be situated on the continuum between spoken and written language. While there are indeed discrepancies compared to ‘real’ corpora of spoken language, it will be shown that the differences are not as marked as expected. Television dialogue might eventually turn out to be the perfect middle ground between the “messiness” (Meunier 2002) of real-language material and the scripted, often stiff and stilted textbook dialogues when it comes to finding new appropriate and accessible spoken models for teaching spoken grammar in the EFL classroom.

Götz, Sandra
How fluent are advanced German learners of English (perceived to be)? Corpus findings vs. native-speaker perception

The present study sets out to investigate the degree of fluency in the broad sense (i.e. the overall oral proficiency) of advanced German learners of English. In order to do so, a quantitative analysis of the German component of the LINDSEI corpus with regard to the learners’ temporal fluency and accuracy was conducted. It led to a selection of five learner reference types: the most accurate one, the least accurate one, one with a very good temporal fluency, one with a very poor temporal fluency and one with an average performance with regard to both variables. These five learners were rated by fifty native speakers regarding their overall oral proficiency, as well as six other central perceptive fluency variables (i.e. idiomaticity, register, lexical diversity, sentence structure, accent and pragmatic features). These ratings were analyzed in order to investigate possible correlations between the native speakers’ ratings of the learners’ overall oral proficiency and any of the other investigated variables. The analysis of these ratings reveals that the highest overall ratings do not correlate significantly with the most fluent or the most accurate learner (according to the findings of the corpus analysis), but with the most nativelike performances in the perceptive variables ‘accent’ and ‘pragmatic features’.

Graedler, Anne-Line
NEST – a corpus in the brooding box

This paper describes the design and compilation of data for the Norwegian-English Student Translation corpus (NEST). Still in the beginning stages, the prospective corpus will contain translations from Norwegian into English produced by students of English at Norwegian colleges and universities. A brief discussion of learner translation corpora is followed by an outline of the principles and procedures applied in the collection of texts, the contributing students, and the source texts for translation. Some samples of data from the collection of student translations are given as an illustration and indication of possible future research applications.

Ebeling, Signe Oksefjell
Semantic prosody in a cross-linguistic perspective

On the basis of data from a bidirectional translation corpus, viz. the English-Norwegian Parallel Corpus (ENPC), this paper aims to explore the negative semantic prosody of cause in a cross-linguistic perspective. Semantic prosody can be defined as the evaluative meaning of extended lexical units. In an article from 1995, Stubbs identifies the negative semantic prosody of cause on the basis of monolingual data, while Berber Sardinha (2000) substantiates this claim on the basis of comparable cross-linguistic data. The present paper will draw on both of the aforementioned studies, and others, in an attempt to show how bidirectional corpora can be applied to shed new light on the study of semantic prosody.

Both the noun and the verb uses of cause will be analysed in order to determine semantic prosody and lexicogrammatical patterns; Norwegian translations will be recorded in each case and serve as translational mirrors in a similar analysis going from Norwegian originals into English translations. This procedure will enable us to establish to what extent the most commonly used Norwegian correspondences (translations and sources) share the negative semantic prosody of cause.

The bidirectional method reveals that there is no Norwegian correspondence that matches cause in terms of negative semantic prosody. For instance, the most commonly used verb translation få (x til å) (‘get (x to)’) is typically used in neutral contexts in original texts. Although the third-most common verb correspondence føre til (‘lead to’) has a preference for negative contexts, it is not used in such environments to the same degree as cause. Furthermore, føre til is most commonly translated into lead to and not cause, suggesting that føre til and cause have different semantic prosodies.

Egan, Thomas
Between and through revisited

This paper examines the concepts of ‘throughness’ and ‘betweenness’ as they are encoded by the English prepositions through and between and their equivalents in three other languages: Norwegian, German and French. The data consist of translated texts in the English-Norwegian Parallel Corpus and the Oslo Multilingual Corpus. All tokens of through and between in English-language original texts in the English-Norwegian Parallel Corpus are assigned to one of several classes according to the semantic domain of the predication. Examples of such domains are space, perception and time. The translations of each token are then categorised as either syntactically congruent or divergent. The congruent tokens are further divided into those employing the Norwegian prepositions gjennom and mellom, which correspond closely to through and between in their spatial senses, and those employing alternative prepositions. Statistical tests are employed to show whether there are any significant differences between the various semantic classes in terms of translation equivalents. The results of these tests are used to propose semantic networks for both prepositions. Translations of Norwegian expressions containing the preposition mellom into English, French and German are compared to one another, and then contrasted with translations of Norwegian gjennom. Finally, the results of the present investigation are compared to those of Kennedy.