Studies in Variation, Contacts and Change in English

Volume 20 – Corpus Approaches into World Englishes and Language Contrasts

“I don’t have communicate ability”: Deviations in an L2 multimodal corpus of academic English from an EMI university in China – errors or ELF?

Yu-Hua Chen, Coventry University
Simon Harrison, City University of Hong Kong
Robert Weekly, University of Nottingham Ningbo China


Deviations in language forms which are different from the norm (or commonly recognized as native-speaker standards) are often labelled as ‘errors’ by language teachers or researchers in the areas of second language acquisition or language learning. Similar non-standard forms, however, are referred to as ‘features’ in other contexts such as English as a Lingua Franca (ELF). In this paper, we argue that the notions of ‘error’ and ‘ELF’ are not always mutually exclusive, and the attribution very much relies on the context. Non-standard use of part-of-speech forms, for example, is one of the most common deviation types we identify in an L2 corpus (e.g. “I don’t have communicate ability” or “they will lead to the bad influence on the economic”). In comparison, similar ‘non-codified’ examples are also found in the VOICE corpus (e.g. “do you arrived there”, “the rest are protect area”), one of the most well-known ELF corpora. By presenting a selection of such examples extracted from the written, spoken, and multimodal components of an L2 corpus (the Corpus of Chinese Academic Written and Spoken English) from an EMI (English Medium Instruction) university in China, this paper will discuss the options regarding how we, as researchers and practitioners, can reconcile different views towards deviation and consider the implications for teaching, learning and assessment. We argue that ‘errors’ do not play as important a role in spontaneous speech as they do in academic writing, and it is also believed that in many respects the difference between an L2 English learner and an ELF speaker is contextual: when learners leave the classroom and use English, they immediately become ELF speakers, proficient or not.


Deviations in language forms which are different from the norm (or commonly recognized as native-speaker standards) are often labelled as ‘errors’ in second language studies or learner corpus research (e.g. Dagneaux et al. 1998; Nicholls 2003; Hemchua & Schmitt 2006). A whole range of deviations at lexical, lexico-grammatical or pronunciation levels are commonly found in the Corpus of Chinese Academic Written and Spoken English (CAWSE), which we have been compiling since 2016 (Chen et al. submitted a). CAWSE is an ongoing project which aims to build a large collection of students’ L2 English samples from The University of Nottingham Ningbo China (UNNC), an EMI (English Medium Instruction) university in China. Non-standard use of part-of-speech forms, for example, is one of the most common deviation types we have identified so far (e.g. “I don’t have communicate ability” or “they will lead to the bad influence on the economic”). It appears that when similar deviations occur in other contexts, however, they are often considered natural features of an English variety. For example, the above instances of deviations (or ‘errors’ from the perspective of learner language) are very similar to ‘non-codified’ examples found in the VOICE corpus (e.g. “do you arrived there”, “the rest are protect area”), one of the well-known corpora of English as a Lingua Franca (Osimk-Teasdale & Dorn 2016; Pitzl et al. 2008; VOICE 2013). This observation illustrates Hirschmann et al.’s point (2007: 14) that “in learner language, a deviation might be analyzed as an error; in other varieties, it might be analyzed as a feature”. This poses challenges for us: in order to transcribe, annotate, and tag such deviations in a unique corpus like CAWSE, what factors need to be taken into account to decide whether we are dealing with errors or features of a variety? Whether features of Chinese English could be considered errors or innovations is related to several status factors.
Li (2010) highlights five factors that Bamgbose (1998) argues are important in determining whether a language feature is an error or an innovation. These include demographics, geography, authoritativeness, codification and acceptability. Li (2010) goes on to argue that the internet is increasingly influencing the question of what constitutes a language feature, both in terms of being an authority for checking language and a conduit to spread new features. However, Li (2010) also implies that the acceptability of a new feature is still determined by L1 users if they appropriate the language feature for their own use as well as L2 users. By presenting a selection of such examples extracted from the written, spoken, and multimodal components of CAWSE, this paper will discuss the options regarding how we, as researchers and practitioners, can reconcile different perspectives about deviations and consider the implications for teaching, learning and assessment. The three different corpora of CAWSE will also be briefly introduced.

One of the richest concentrations of World Englishes (WE) around the globe is in Asia (Smith 1998), where English has been developing in China since the 17th century (Bolton 2002). The students in our corpus have been learning English since primary school or middle school, and they scored high enough on the Chinese national exam (often known as ‘Gaokao’) to study at the Sino-British university UNNC, where English is the medium for all teaching and learning activities. The Gaokao admission scores at UNNC in general significantly exceed the first-class benchmark (‘yi ben’ - for top universities in China), and the Chinese students admitted through Gaokao still need to attend the preliminary-year program (from which most of the corpus data was drawn). Despite these students’ high Gaokao scores, their English language proficiency may vary, but our method of data collection ensures that the language samples cover the full range from the lowest to the highest levels (see Section 3 Corpus and Methodology). In this context, we would expect to find some features of ‘Chinese English’ (Xu 2010; Xu et al. 2017), which will be discussed in detail later. Furthermore, the university has an international faculty with a number of its degree programs also attracting international students from countries where English is not the first language, such as Russia, Turkey and Indonesia. The university therefore provides the context for an English as a Lingua Franca (ELF) setting (Seidlhofer 2011).

Despite fulfilling the conditions for both a Chinese variety of English and ELF, the CAWSE corpus can also be seen as representative of learner English. All the data in our corpus have been collected during the preliminary-year courses of the undergraduate program or the pre-sessional course of the Master’s program, and these courses are delivered by certified English language teachers and focus on English for Academic and Specific Purposes (EAP or ESP). Many of the teaching materials, classroom activities, and assessments are explicitly aimed at enhancing the language proficiency and academic skills of the students. From this perspective, the students at UNNC, or at least those whose L2 English samples we have collected, may be not so much competent users of a ‘legitimate’ variety of English as English language learners, whose deviations are often a target for correction.

In this paper, we will therefore use the dual character (language learning / lingua franca) of the CAWSE corpus to argue that the notions of ‘errors’ and ‘ELF’ are not always mutually exclusive, and the attribution relies heavily on the context. In addition, it is also believed that in many respects the difference between these speakers is contextual: when learners leave the classroom and use English, they immediately become ELF speakers, proficient or not. Although data processing (including transcription and annotation) and validation work such as double-checking in our corpus is still a work in progress, a number of patterns have already emerged during the annotation of forms at lexico-grammatical and pronunciation levels that deviate from the so-called ‘native’ varieties of English. Since UNNC is fundamentally part of The University of Nottingham based in the UK, we have used British English as a point of reference. This paper will report those findings arising from samples of this EMI corpus across registers, modalities, and settings. Our examples will be taken from the transcription and annotation work conducted so far on three sub-corpora: i) the subcorpus of written academic English, ii) the subcorpus of spoken academic English (transcription of audio recordings primarily taken from spoken assessment tasks), and iii) the multimodal subcorpus based on video recordings of pair and group-work filmed in authentic classroom settings. In presenting this array of samples, we acknowledge that writing and speaking should be treated differently because speaking is often spontaneous without the luxury of selecting words or forms carefully whereas academic writing tends to conform much more strictly to standards and traditional ‘norms’. Nevertheless, by juxtaposing examples from the three different sub-corpora, the purpose of this paper is also to examine the notion of deviation more broadly. 
Formats such as written texts, spoken presentations, and group work activity each provide a different context or ‘ecology’ for seeing how deviations might manifest.

The paper will continue with a discussion of the differences between ELF and World Englishes (Section 2), focusing on the notion of Chinese English, which provides the background for turning to the notion of errors in English Language Teaching (ELT) and Second Language Acquisition (SLA). We then explain the methods we have used to build CAWSE and to annotate the data on which our study of deviations has been based (Section 3). The results are structured to show a selection of corpus examples from the perspectives of errors vs. ELF, Chinese English, and the multimodality of the corpus (Section 4), before concluding the paper (Section 5).



The global spread of English has resulted in the emergence of multiple norms, yet the perception that English is a single entity with a correct version tends to persist among teachers and learners (Seidlhofer 2011). Frameworks such as WE or ELF have remained a curiosity for many of those engaged in English language teaching, and they may not be aware of the practical applications of these terms for the classroom (e.g. Kirkpatrick 2003, 2007). The argument against these frameworks is that learners need consistent models to learn (Swan 2012), while it is also noted that teachers are constrained by institutional policies and assessment practices (García 2009; Seidlhofer 2011). Traditionally within English language teaching there is a belief that there are correct and incorrect language features which are related to the language model, and the incorrect ones are those that deviate from the standard model (Tupas 2010). While it is recognized that there is considerable variation within the so-called native varieties of English, and that actual speakers of Standard English constitute only a small minority of English speakers in the UK (Britain 2010), this social dialect of English is often presented as one of the native varieties that L2 speakers of English should adhere to. These beliefs about a language system being correct or incorrect not only relate to English, but are entrenched within a standard language ideology (Milroy & Milroy 2012; Nelson 2011; Seidlhofer 2011; Walker 2010), which makes it difficult for teachers, students and other stakeholders to see unfamiliar, new, or non-standard uses of English as anything but incorrect, even though such non-standard forms are also common in spoken English or vernacular varieties of native-speaker English.

There have been numerous studies of varieties of English around the world. These studies have identified language features that are specific to the variety and different from the Englishes which they may be related to through colonialism or the transfer of cultural products. These World Englishes are usually defined in national terms or ‘named indigenous codes’ (Xu & Deterding 2017: 118), such as Singapore English (Deterding 2007), and Indian English (Pingali 2009), which are both similar to and different from other varieties in terms of grammar, lexis, pronunciation or pragmatics. Varieties of English from Hong Kong, Singapore, and India, for example, share features of negation that are not common in either British or American English (Nelson 2004).

While Chinese English has been examined by several academics (Bolton 2002, 2006; Xu et al. 2017), it is still regarded as a developing variety. The term Chinese English has been in use for years, albeit interchangeably with China English, or even Chinglish, which implies that the variety has yet to reach a point of wide acceptance within the academic community. As Chinese English continues to develop and becomes more elaborately codified, there is increased pressure for standardization to raise teachers’ awareness of the local variety (Hamid & Baldauf Jr 2013). Although Hamid and Baldauf Jr (2013) note that it was opposition to the processes of standardization that initially brought WE into being, they also highlight the belief that it is necessary to standardize new varieties of English and to distinguish them from ‘native’ Englishes.

Early ELF research followed a similar research path to WE by identifying emerging language features across different English varieties. This was unsurprising because ELF had emerged out of the WE paradigm, and at the time it was seemingly the only paradigm available (Jenkins 2015). Based on the Lingua Franca Core for pronunciation and lexico-grammar features from VOICE, a well-known ELF corpus, it was initially thought that identifying language features of ELF could have provided a basis for an ELF variety similar to national varieties in WE research (Jenkins 2000; Seidlhofer 2011). The VOICE corpus contains 120 hours or one million words of spoken ELF interactions, which cover a wide range of speech events in terms of domains and functions (VOICE 2013). Another corpus of similar size, ELFA, also comprises a structured collection of transcribed speech but with the sole focus of academic ELF (Mauranen et al. 2010). Despite these studies based on corpus data and other ELF research, however, it soon became apparent that one of ELF’s defining features is variability, and therefore a continued research goal of codification was not considered worthwhile by some researchers (Baird et al. 2014; Jenkins 2015). Consequently, while arguing that correct language or norms were not necessarily derived from British or American English, initially both in WE and ELF research, there was still an underlying belief that there were correct forms within language but that these were contextually dependent. Therefore, an Indian English speaker may deviate from an American speaker in terms of pronunciation and lexico-grammar, but both the speakers’ language is equally accurate and/or deviant in different contexts. While codification of contextually correct nonstandard forms remains an aspect of WE research, ELF research has progressed to a dismissal of determining what constitutes an error as this is “possibly not an ELF compatible way of thinking about language” (Cogo & Dewey 2012: 78). 
In this respect, the current conceptualization of language by ELF researchers shares more in common with multilingual research and theories such as translanguaging (Li 2018) and translingual practices (Canagarajah 2013) than it does with WE research. How today’s multilingual speakers flow seamlessly across different languages or creatively mix and blend them together further challenges the rigid distinctions and hierarchies often implicit in notions from WE, such as varieties, codes, and inner/outer circles. We are therefore also interested in what Sifakis and Tsantila (2019: 1) describe as “the function of English as a contact language in communications involving primarily non-native users of English from various international, multilingual and heterogeneous settings, to which each user brings a variety of English that he or she is most familiar and comfortable with and employs various strategies in order to communicate effectively”.

Despite the challenge to the conception of errors by WE and ELF research, researchers working within a traditional Second Language Acquisition (SLA) paradigm continue to promote a dualistic approach to the study of language. For example, Swan (2012: 386) dismisses ELF grammar features identified in the VOICE corpus as “stock entries in accounts of ‘typical learner errors’ published half a century ago”. However, this observation highlights the fundamental essence of ELF. These grammar features are more susceptible to language change in both native and non-native varieties of English, which over time could stabilize, within a community of speakers, to become the language norm. These features are also more resistant to corrective feedback in the language classroom than other language features (Ellis 2007; Long et al. 1998; Sheen 2007), which also underlines the obvious similarities between ELF and EFL (English as a Foreign Language) speakers (Seidlhofer 2011). It may therefore seem unnecessary for us to compare instances of deviations between ELF and EFL data in this paper. However, to the best of our knowledge, there does not seem to be systematic investigation with empirical data in the literature to support the above argument, particularly in the academic context.

It should also be noted that in ELF research, there appears to be a greater focus on speech than on writing, and we argue that there may be different standards for written and spoken registers of ELF. Depending on the linguistic and cultural backgrounds of the speakers, ELF speech may be characterized by variability and unpredictability; yet to what extent such features in an EFL/SLA setting are similar or different in the written and spoken registers in students’ language production will need to be examined carefully before any reconciliation of approaches can be discussed. A selection of such instances from the written, spoken and multimodal components in the CAWSE corpus will be exemplified in Section 4.


As the CAWSE corpus is situated in the Chinese context, it is necessary for us to provide some background concerning the development of English in China or Chinese English in terms of language proficiency, variation, and ideology.

According to Bolton (2002), the first reports of Chinese speakers using English during the 17th century included observations of pronunciation difficulties that drew feelings ranging from pity to ire. By the mid-18th century, a legitimate China Pidgin English (CPE) had emerged, but this was derogatorily referred to at the time as ‘broken English’, ‘jargon’, or ‘mixed dialect’ (Bolton 2002: 184). The treaty-port era of the 19th century facilitated the spread of this CPE, yet increased contact with other varieties of English at this time, such as the American English spoken by missionaries, led to its pronunciation and grammar being looked down upon and ultimately abandoned. English Language Teaching in China surged with the opening up policies of the latter half of the 20th century, paving the way for the increased importance of English in China today. In China, English is now typically taught as a foreign language (EFL) with models of native speaker Englishes being preferred, and teaching methods being primarily exam oriented. Therefore, teachers are traditionally often trained to look at language primarily with regard to the notion of correctness with a significant focus on error correction (Cogo & Dewey 2012). 

For the last fifty years, English in China has received increasingly systematic treatment by linguists as a legitimate variety of Asian or World Englishes (Xu 2010; Bolton 2006; Xu et al. 2017). The argument for Chinese English is that varieties of English in Asia are for both international and intra-regional communication, such as the medium of communication for the ASEAN (Association of Southeast Asian Nations), therefore questioning the traditional emphasis on native speakers (Kirkpatrick 2003). Chinese English has subsequently been described on different linguistic levels (see Xu 2010), including phonology (Siqi & Sewell 2012; Deterding 2017), syntax (Xu 2008; Ai & You 2017; Liang & Li 2017), pragmatics and discourse patterns (Kirkpatrick & Xu 2002; Ren 2017), and non-standard forms which deviate from native norms may be considered features of Chinese English. The linguistic description of Chinese English has therefore informed the development of our transcription conventions for CAWSE when we attempted to identify such features in the data, which will be discussed later.

From the Chinese Pidgin English documented among 17th century traders in Canton to embracing Chinese English as an emerging variety, the notion of non-standard forms (or ‘deviations’) seems inseparable from the spread and development of English in China. China has played an important role in developing the WE perspective, and account needs to be taken of the rich sociolinguistic history of English in China (Bolton 2002, 2006), the way ELT has been introduced and developed in China, and the way English has been repurposed in an array of forms. As mentioned, defining a deviant form is dependent on whether it is seen from an ELF/WE or EFL perspective (Jenkins 2006, 2014). An EFL perspective is influenced by second language research, which holds that language which does not conform to Standard English is deficient or an interlanguage, and ELT teachers are typically trained to correct non-standard forms which deviate from the norms. It is our belief that scrutinizing L2 English production in the CAWSE corpus will shed some light on our understanding of such deviation forms in this error-or-ELF debate.



As mentioned earlier, the Corpus of Chinese Academic Written and Spoken English (UNNC-CAWSE) is an ongoing corpus project building a large collection of students’ L2 English samples from one of the EMI (English Medium Instruction) universities in China. Collaborating with the university’s English language centre, which also functions as a gatekeeper, data collection has focused primarily on the undergraduate preliminary-year program and the Master’s pre-sessional program. These two programs have provided the context for building three distinct sub-corpora: i) the subcorpus of written assessment (1.5 million words), ii) the subcorpus of spoken assessment (recordings of 63.8 hours), and iii) the multimodal subcorpus of classroom recordings (24 hours). The subcorpus of written assessment contains 1,282 exam scripts and 770 essays. The exam scripts were handwritten and needed to be manually digitized, but the essays were originally submitted online and could be processed directly (after anonymization and editing). The subcorpus of spoken assessment contains the transcriptions of 123 oral interviews and 262 oral presentations, both originally recorded under assessment conditions for quality assurance. The multimodal subcorpus contains approximately 17 hours of seminar discussions from the preliminary-year and pre-sessional program classrooms (88 audio/video clips), as well as 7 hours of student-led sessions of spoken English practice arranged by the centre for any students seeking additional English language support (31 audio/video clips) (for more detail of the CAWSE multimodal subcorpus, see Stevens et al. in press).

The data collection phase of CAWSE has now finished, but processing (including transcription and annotation) and validation work such as double-checking is still ongoing. Samples of the data are currently available through our project website (which can be found at the end of this chapter) in plain text, while other formats of the data will be made available as the project progresses, including a selection of audio/video recordings. Note that despite the EMI setting, almost the entire corpus is collected from L1 speakers of Mandarin Chinese, although some of them might have a different Chinese dialect as their mother tongue. To ensure the representativeness of the Chinese student population of the campus, the written and spoken assessment materials were sampled from each of the available score bands from the lowest to the highest grades wherever possible (see Chen et al. submitted a).


When our team started transcribing the data in all three subcorpora, we came across instances of usage that deviated from widely adopted standard Englishes, in this case British/American English used as a reference for CAWSE. These instances were sometimes difficult to transcribe, e.g. pronunciation variation, and discussing them became a recurrent agenda item at our team meetings. As such deviations were a salient feature in CAWSE, a taxonomy with a series of tagging conventions was then developed to transcribe and annotate deviations. The deviation forms in the corpus show a wide range of variability from phonological, lexical, grammatical, semantic to pragmatic levels, and it would require an enormous amount of time and manpower to identify all those instances. On the basis of the resources we have, it was decided early on that we had to be selective and focus on a few types of deviations only: pronunciation (for spoken, only if the deviation hinders comprehensibility), orthographic (for written), and lexical or lexico-grammatical (for both written and spoken, including inflectional or derivational, word choice, and part of speech deviations).

Besides the contribution to the error-or-ELF debate, there are also practical implications for deviation forms annotated in the corpus. A large corpus like CAWSE is designed to be equipped with the functionality of search queries, that is, a user can type in a word or a phrase, and then the corresponding concordance lines of the target word/phrase will be displayed. In the case of non-standard forms, such as “discuss” misspelled as “disscuss” as identified in our written subcorpus of exam scripts, the deviation form “disscuss” would not turn up in the results for “discuss”, since the user would not be able to predict the various deviant forms in which the word may appear. To annotate a deviation, we follow the tradition of Corpus Linguistics by inserting a pair of tags composed of an opening tag and a closing tag, and both the original deviation and the standard form are recorded during the transcription process. The coding for the deviation “disscuss” would therefore look like <dv>disscuss {discuss}</dv> with the standard form “discuss” inside curly brackets {}.
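
The paired-tag convention lends itself to simple programmatic handling. The following is a minimal sketch in Python, not the tooling actually used in the CAWSE project. The names `DV_PATTERN`, `parse_deviations`, `normalise` and `search_standard_form` are hypothetical, and the regular expression assumes that tags are never nested and that whitespace before the opening curly bracket is optional.

```python
import re

# Assumed shape of an annotation: <dv>original{standard}</dv>,
# with optional whitespace between the original form and "{".
DV_PATTERN = re.compile(r"<dv>(?P<original>.*?)\s*\{(?P<standard>[^{}]*)\}</dv>")


def parse_deviations(line):
    """Return (original, standard) pairs for every <dv> annotation in a line."""
    return [(m.group("original"), m.group("standard"))
            for m in DV_PATTERN.finditer(line)]


def normalise(line):
    """Replace each annotated deviation with its standard form."""
    return DV_PATTERN.sub(lambda m: m.group("standard"), line)


def search_standard_form(lines, query):
    """Return the original lines whose normalised text contains the query word."""
    word = re.compile(r"\b" + re.escape(query) + r"\b", re.IGNORECASE)
    return [line for line in lines if word.search(normalise(line))]
```

Under these assumptions, a query for “discuss” would also retrieve a line containing <dv>disscuss {discuss}</dv>, because the search runs over the normalised text while the returned concordance line keeps the original deviation.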

In addition to annotation of the above deviations, we kept in mind the possibility that instances of deviation could also contain examples of features which may reflect Chinese English as an emerging variety. This is why we used preliminary tagging of Chinese transliterations and semantic translations of Chinese idioms (discussed below) in the corpus, which are coded in a similar way to the tagging of deviant forms described above but with English translations provided if possible. The development of our transcription conventions has been documented in detail in Chen et al. (submitted b).


In this section, we will first present deviations from the written and spoken subcorpora. This allows us to evaluate the notion of error with English as a Lingua Franca (Section 4.1) and Chinese English (Section 4.2). An example from the multimodal subcorpus then shows how students and teachers deal with deviations dynamically in the classroom, broadening the scope of behaviors that may be viewed as deviations in the context of English Medium Instruction (Section 4.3).


Part of the subcorpus of academic written English comes from handwritten exam scripts, and probably because of the absence of a spell checker under test conditions, a large number of deviations at the orthographical level were found in the exam scripts. Following the taxonomy of spelling errors in Bestgen and Granger (2011), those spelling deviations can be divided into two groups: misspelling and word formation. The former refers to any words that do not exist in the online Oxford Dictionary used as a reference for the CAWSE project, and often this includes cases such as one or more letters missing or misplaced in an existing word such as ‘disscuss’ mentioned earlier. The latter refers to word formations which do not conform to the rules of orthography, e.g. adding a suffix ‘-ly’ to an adverb ‘almost’ or the plural morpheme ‘-s’ to an adjective ‘different’, neither of which is allowed in English. In addition to misspelling and word formation, we also added three new types of deviations identified in CAWSE to the taxonomy proposed by Bestgen and Granger (2011): proper nouns, compound words, and incorrect in the context. It appears that some of the Chinese students struggle with the spelling of proper nouns such as the names of well-known figures (e.g. ‘Steve Jobs’) or cities (e.g. ‘Seattle’), or they may be confused about whether they should combine words into a compound word (e.g. ‘copy right’ as two words as opposed to one compound) or split them (e.g. ‘eventhough’ as one word). The last category, ‘incorrect in the context’, refers to cases where the spelling of a word looks legitimate (with reference to the Oxford dictionary) but the word does not make sense in the context. One such example can be seen below.

(1) The mermaid as a well-known image once was criticized due to the <dv>nicked{naked}</dv> body.

It is clear from Example (1) that the original word ‘nicked’ is a possible misspelling which happens to be another word, and another word ‘naked’ that we added in the curly brackets {} appears to make more sense in this context. The whole taxonomy of deviations at orthographical and lexical levels with more examples can be found in Chen et al. (submitted).
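
The five orthographic categories can be approximated as a first-pass triage over annotated (original, standard) pairs. The sketch below is purely illustrative and is not the procedure used to build the CAWSE taxonomy, where categories were assigned by human annotators. The function `triage`, the `SUFFIXES` list, and the `lexicon` argument (a plain set of known words standing in for a reference dictionary) are all hypothetical, and the rules are crude heuristics: for instance, a capitalized standard form is simply assumed to be a proper noun.

```python
# Hypothetical suffix list used only to guess at word-formation deviations.
SUFFIXES = ("ment", "ly", "ed", "s")


def triage(original, standard, lexicon):
    """Guess a category for one annotated deviation (heuristic sketch only)."""
    # Same letters, different spacing: 'copy right' vs 'copyright'.
    if original != standard and original.replace(" ", "") == standard.replace(" ", ""):
        return "compound word"
    # Assumption: a capitalized standard form signals a proper noun.
    if standard[:1].isupper():
        return "proper noun"
    # The written form is itself a real word: 'nicked' for 'naked'.
    if original.lower() in lexicon:
        return "incorrect in the context"
    # A known word plus a suffix: 'almost' + '-ly'.
    for suffix in SUFFIXES:
        if original.endswith(suffix) and original[:-len(suffix)].lower() in lexicon:
            return "word formation"
    # Everything else is treated as a plain misspelling.
    return "misspelling"
```

With a toy lexicon containing ‘nicked’, ‘almost’ and ‘discuss’, the pair from Example (1) would fall under ‘incorrect in the context’, ‘almostly’ under word formation, and ‘disscuss’ under misspelling. Any output of such heuristics would still need to be checked manually.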

Although the taxonomy of deviations is developed for the written subcorpus, interestingly, some similar examples at different levels are also identified in the spoken subcorpus of CAWSE. Those include deviations at the level of word formation, and the example of deviation at derivational level can be seen in (2) below, while one example of inflectional deviation is given in (3), both of which are similar to the instances discussed in the written subcorpus.

(2) I stay in the library <dv>alonely{alone}</dv>
(3) …so many <dv>childrens{children}</dv>

Other similar examples include deviations at the lexical or lexico-grammatical level. An instance of deviation involving an incorrect word class can be seen in Example (4) below, while an example related to collocation or incorrect word choice is provided in (5). Again, similar instances are also observed in the written subcorpus.

(4) some music bands will er: maybe <dv>creative{create}</dv> some music
(5) sometimes we buy a <dv>purchase{product}</dv>

A unique type of deviation found in the spoken data is when it is unclear whether the deviation is a result of an incorrect word choice as in a lexico-grammatical error or a case of mispronunciation. One typical example is the confusion of word class, particularly for references to academic subjects. For those references, the forms of an adjective and a noun only differ at the ending of a word such as ‘economic vs. economy’ or ‘linguistic vs. linguistics’ (see examples (6) and (7) below). Since this is spoken data, it is sometimes difficult for us to determine whether the speaker just drops a consonant of the word or actually uses an incorrect word class, even though stress pattern could sometimes be used to help clarify the issue.

(6) maybe it's very bad for the <dv>economic{economy}</dv> er: to the country
(7) I think intelligence have seven types hh first one is <dv>linguistic{linguistics}</dv>

After the taxonomy of deviations was developed for CAWSE, the instances of deviation were compared with the VOICE corpus (Vienna-Oxford International Corpus of English), one of the most representative ELF corpora (VOICE 2013). From the outset, VOICE was defined as a corpus of spoken ELF, which aimed to collect only fluent speech data from proficient ELF speakers in the domains of education, leisure, and professional activity. In comparison, the CAWSE corpus is a collection of academic written and spoken samples of L2 English from an EMI university in China, and based on our experience, the language proficiency of the students in CAWSE covers a wide range of levels. We acknowledge that ELFA (a corpus of academic ELF) rather than VOICE (a corpus of general ELF) might at first glance seem more suitable for comparison with CAWSE. VOICE, however, was chosen for two reasons. First, the spoken data in CAWSE comprise interviews and presentations recorded as part of the assessment in the preliminary-year program at an EMI campus in China. These are not comparable with ELFA, where seminar discussions (33%), PhD defense discussions (20%) and lectures (14%) account for two-thirds of the corpus (Mauranen et al. 2010). Considering the lower level of the Chinese students’ English proficiency and the ‘generic’ nature of the assessment tasks in CAWSE, VOICE, with its more generic coverage of domains, is a better choice. Second, the non-standard forms identified in VOICE are documented in detail in Pitzl et al. (2008) and Osimk-Teasdale and Dorn (2016), and we believe these forms would not differ significantly from those in the spoken ELF found in ELFA.

Despite these fundamental differences between the two corpora, similar deviations are found in a range of the areas described earlier (Table 1). First of all, deviation in word class is also prominent in VOICE. As reported by Osimk-Teasdale and Dorn (2016), utterances such as “the rest are protect area” in VOICE include cases where the word class of an item, here ‘protect’, is ambiguous, which posed great challenges for part-of-speech tagging in the VOICE project. As mentioned earlier, similar deviations are found in both the written and spoken subcorpora of CAWSE. Another type of deviation shared by VOICE and CAWSE involves words which do not follow the word-formation rules of English morphology. For example, for verbs with irregular past tense or past participle forms, e.g. think-thought-thought, rise-rose-risen, some of the ELF speakers in VOICE and Chinese students in CAWSE tend to resort to regular forms, i.e. thinked or rised. Similarly, instances where the plural morpheme -s is added to uncountable or mass nouns (e.g. informations), which is not considered legitimate in Standard English, are identified in both VOICE and CAWSE. Other similar lexical deviations include adding or omitting a syllable in an existing word, e.g. summamary for summary in VOICE, markly for markedly in CAWSE. Sometimes newly created words may look legitimate because they follow the rules of English morphology, although they cannot be found in a dictionary. For example, the addition of the nominal suffix -ment to ‘increase’ creates a new word ‘increasement’, and this word is found in both VOICE and CAWSE.

Word class
  VOICE
  • the rest are protect area
  CAWSE
  • they will lead to the bad influence on the economic
  • I don’t have communicate ability
Word formation
  VOICE
  • Verb: thinked {thought}, feeled {felt}
  • Noun: advices {advice}, fundings {funding}, informations* {information}
  • One syllable (missing or redundant): summamary {summary}, compy {company}
  • New word: pronunciate, unformal, increasement*
  CAWSE
  • Verb: rised {risen}, bited {bitten}
  • Noun: datas {data}, informations* {information}
  • One syllable (missing or redundant): markly {markedly}, accomdadation {accommodation}, angrous {dangerous}
  • New word: cleanity, increasement*, unfamous

Table 1. Similar deviations shared by VOICE and CAWSE.
* Identical deviation forms found in both VOICE and CAWSE are marked with an asterisk.
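
The identical forms can be recovered mechanically from the word-formation items listed in Table 1. The short Python sketch below is only an illustration based on the forms printed in the table, not on the full corpora, and the variable names are hypothetical.

```python
# Word-formation deviations as listed in Table 1 (not the full corpus data).
voice_forms = {
    "thinked", "feeled", "advices", "fundings", "informations",
    "summamary", "compy", "pronunciate", "unformal", "increasement",
}
cawse_forms = {
    "rised", "bited", "datas", "informations",
    "markly", "accomdadation", "angrous",
    "cleanity", "increasement", "unfamous",
}

# Forms attested in both lists, sorted for a stable display order.
shared_forms = sorted(voice_forms & cawse_forms)
print(shared_forms)  # ['increasement', 'informations']
```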

Despite the geographical and discourse differences between VOICE and CAWSE, the findings here suggest that L2 speakers in both corpora share certain linguistic features, at least in the typology of deviations identified here. Note that the written and spoken subcorpora of CAWSE are composed of assessment materials including exam scripts, coursework and interviews. The deviations identified in the assessment scripts are likely to be flagged by teachers as ‘errors’, but comparable forms are considered ‘unconventional’ or ‘lexical innovations’ in the ELF corpus VOICE (Osimk-Teasdale & Dorn 2016; Pitzl et al. 2008). In contrast, the word-choice or collocational deviations common in CAWSE (e.g. buy a purchase, discussed earlier) do not, to the best of our knowledge, seem to have been discussed much in the literature based on VOICE, which could be a direction for future research.


The tagging of the spoken data is still ongoing at the time of writing, but two types of features have been preliminarily annotated in CAWSE which seem to support the notion, discussed earlier, that some of the features found in the data could be identified as features of Chinese English: Chinese transliterations and semantic translations of Chinese idioms. These instances are annotated in a way similar to that described earlier, but with translations provided in the curly brackets and with the tag <lv> (lexical variation) instead of <dv> (deviation). Chinese transliteration refers to Chinese characters rendered in Romanized form, known as pinyin. This type of linguistic feature has been reported in He and Li (2009) and is also common in our CAWSE data. One such example is Gaokao, which refers to the national university entrance exam in China (see the use of Gaokao in example (8) below). Considering the academic focus of CAWSE, it is not surprising that this expression occurs quite frequently in our corpus.

(8) all the students need to take <lv>Gaokao {national university entrance exam}</lv>

The second type of feature involves the use of Chinese idioms. It differs from the transliteration example just discussed in that the idioms are semantically translated by the Chinese students in the corpus. Examples (9) and (10) show how the L2 speakers incorporated these Chinese idioms into their utterances.

(9) so I think by emphasising the improvement of supervision mechanism to restrict the politicians we can actually created environment to <lv>strangle the corruption in cradle {kill corruption at the earlier stage/把腐败扼杀在摇篮里}</lv>
(10) I think a <lv>craft attitude {craftsmanship/匠人精神}</lv> is also required er let me explain that the film maker should not only consider their work as commodity only as entertainment but also artistic products with valuable meanings.
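
The lexical-variation tags in (8) to (10) could be read in a similar way. The Python sketch below is a minimal illustration which assumes that the curly brackets hold an English gloss, optionally followed by a slash and the Chinese source expression, as in (9) and (10). The pattern, the function name parse_lexical_variation and the dictionary keys are hypothetical.

```python
import re

# Assumed shape: <lv>expression {English gloss/Chinese source}</lv>,
# where the "/Chinese source" part is optional, as in example (8).
LV_PATTERN = re.compile(
    r"<lv>\s*(?P<expr>[^{}<>]+?)\s*\{(?P<gloss>[^{}]*)\}\s*</lv>"
)


def parse_lexical_variation(utterance):
    """Return one dict per <lv> annotation found in the utterance."""
    results = []
    for match in LV_PATTERN.finditer(utterance):
        content = match.group("gloss")
        # Split on the last slash; without a slash the whole content is the gloss.
        gloss, sep, source = content.rpartition("/")
        if not sep:
            gloss, source = content, None
        results.append({
            "expression": match.group("expr"),
            "gloss": gloss.strip(),
            "source": source.strip() if source else None,
        })
    return results


gaokao = parse_lexical_variation(
    "all the students need to take "
    "<lv>Gaokao {national university entrance exam}</lv>"
)
craft = parse_lexical_variation(
    "I think a <lv>craft attitude {craftsmanship/匠人精神}</lv> is also required"
)
print(gaokao)
print(craft)
```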

Translation or not, the Chinese speakers here seem to assume that the reader/listener understands those expressions. To what extent this is true may require more research in this area. [1]


Recall that the multimodal subcorpus is based on video recordings of pair and group work filmed in authentic classroom settings. The examples of deviation in this subcorpus may be similar to those described above for the subcorpus of spoken assessment, but the classroom recordings provide additional perspectives on the context of such deviation. We see how peers clearly orient to a pronunciation problem as an error to be fixed, and how, when they switch to Chinese to fix it, the teacher characterizes their codeswitching as a failure to conform to the English-only policy, even though codeswitching and other translanguaging practices might be considered a form of ELF skill (Sifakis & Tsantila 2019).

Our example begins with the mispronunciation of the letter ‘g’ as ‘j’ in Transcript 1 below (transcription conventions in Appendix 1). This occurs in a groupwork activity, as a student is reading her answers to an exercise in the textbook. The answers run from the letter ‘a’ to ‘i’. When one of the students, Li, reads out her answers (line 1), she pronounces ‘g’ incorrectly as ‘j’ (/dʒeɪ/). This is immediately flagged as an error by one peer, Bai, who provides the standard form ‘g’ (/dʒi/; line 2). In the next turn (line 3), Bai’s correction is upgraded by another student in the group, Sun, who repeats the standard form more loudly (‘G!’) before switching to Chinese to say “j 什么鬼” (meaning ‘J what the hell?’). The teacher, evidently overhearing this conversation, can then be heard in the recording interjecting “English ladies English” (line 5).

Transcript 1
1 Li i a e j c
2 Bai gee
3 Sun gee:::! j 什么鬼? {j what ghost ha?}
4 Li gee:::
5 T English ladies (.) English

This example initiates a language-related episode (where students discuss or even question the language they use) within which students collaborate to resolve the language issue (Swain & Lapkin 1998; Ohta 2001). It also tells us that in this context not just language production but also language practices such as codeswitching are treated as ‘deviation’, in this case a behavior deviating from the language policy of an EMI campus. This is better illustrated in the segment that follows, Transcript 2. We join the students an instant later, as the struggling student Li requests clarification in Chinese by saying “那这个呢” (this one then?) whilst drawing a letter J with her finger (line 6). All students reply in chorus with the correct pronunciation “J” (line 7). Li then replies “oh ok”, a change-of-state marker suggesting the problem has been resolved (line 8). Her peer Sun offers a final form of assistance by suggesting she can remember the pronunciation by thinking of the Korean dance routine “gee gee”, which she says whilst jogging her upper body (line 9). Again, the teacher asks the students to adhere to the English-only policy, “ahem English (.) ladies (.) English (.) please” (line 11), then reframes this as a formal reprimand about speaking ‘English in class’ (lines 12–14).

Transcript 2
6 Li 那这个呢 ((draws letter J with finger in space in front of her)) {this one then?}
7 All J:::
8 Li oh ok
9 Sun gee你就记住那个gee 那个韩国那个东西 ((dancing motions)) {you just remember that gee, that Korean one thing}
10 Li oh oh oh ok
11 T ahem English (.) ladies (.) English (.) please
12 T right guys come on I said (.) after week one
13 I shouldn’t have to remind you about this
14 English in class please

Over the course of this episode, our attention has been drawn to another form of ‘deviation’ which does not figure in our written and spoken typologies, but which nevertheless makes a valuable contribution to the discussion in this paper. Current ELF research has, in the main, moved beyond a descriptive approach to language in terms of pronunciation or lexico-grammatical features, and recognizes ELF as variable and fluid. What this effectively means is that any language feature has the potential to be either an error or ELF, and it is societal norms and perhaps intelligibility, here in the form of peer correction and in-group agreement, that determine whether the feature is perceived as an error rather than an element of ELF. We also see that certain languaging practices, such as codeswitching and speaking in the L1, which students use to resolve language issues, may constitute deviations that teachers often try to fix in the classroom context. It seems that deviation may take on different characteristics, depending on whether the speaker and the receiver perceive the feature or action as a deviation. This example also demonstrates, within the classroom, the pragmatic features of repair and negotiation which are common in ELF interactions (Cogo & Dewey 2012).


The examples in this paper have shown that the traditional conception of two languages (L1/L2) as mutually exclusive is perhaps not a suitable way of thinking about English in the globalized context of an EMI campus in China. Similarly, the notion of errors appears to have been dismissed, or at least challenged, by WE or ELF research, and we hope that the comparison of deviations identified in both VOICE and CAWSE has shed some light on this view. We discussed how terms such as ‘deviant’ and ‘natural’ reflect language ideologies, and we illustrated how these ideologies are embodied in ‘English-only’ policies that may affect how teachers and students negotiate and evaluate variation in language use during the interactive seminars recorded for our corpus. We acknowledge that ‘deviations’ in the written and spoken registers in this academic context should probably be treated differently, because misspellings or ‘innovative’ word formations (e.g. ‘informations’ or ‘increasement’) are more likely to be perceived as errors than as non-standard forms in academic writing, which appears to have much more rigid criteria for acceptable norms. On the other hand, we would also argue that such standards should perhaps be relaxed to some extent for L2 speaking in an EFL/EMI classroom, since non-standard forms such as “I don’t have communicate ability” identified in the CAWSE corpus are also commonly found in proficient ELF speech. In our corpus, almost all of the students are L1 Mandarin speakers, which in a way makes CAWSE less open to variation than ELF corpora such as VOICE or ELFA but also makes it possible to study Chinese English in this EMI setting.

Whether the speaker is considered a learner of English or a user of a legitimate variety is only one among many factors that determine the perspective taken on deviations. These factors may also include the context of language use, the dominant approach to language teaching and learning (e.g. behaviorist, cognitivist, or socio-cultural), the perceptions of students or teachers, and the influence of language ideologies, but it has not been possible to address these additional perspectives within the scope of this chapter. Furthermore, the range of behaviors deemed to deviate from explicit or implicit norms in the EMI setting should be broadened from production to include practices (such as code-switching, using the L1, etc.). This would require in-depth studies to quantify how much L1 is used, when, and by whom. What this chapter presents is a natural outcome of the corpus-building process, and we hope that future research projects can follow up on our findings once the corpus has been completed.


The UNNC CAWSE project was funded by the Ningbo 3315 Talent Scheme in China and the Matched Funding from the University of Nottingham Ningbo China.


[1] Note that because of limited resources, we have not been able to explore a much wider range of features identified in Chinese English such as because-therefore sequencing reported in the literature (e.g. Xu 2010), but this would certainly be another potential direction for future research.


Oxford Dictionary online: https://www.lexico.com/en


CAWSE = The UNNC Corpus of Chinese Academic Written and Spoken English. 2019. Director: Yu-Hua Chen; Researchers: Simon Harrison, David Oakey, Shanru Yang, Godwin Ioratim-Uba, Michael Stevens & Qianqian Zhou. https://cawse.transcribear.com

ELFA = The Corpus of English as a Lingua Franca in Academic Settings. 2008. Director: Anna Mauranen. http://www.helsinki.fi/elfa/

VOICE = The Vienna-Oxford International Corpus of English (version 2.0 XML). 2013. Director: Barbara Seidlhofer; Researchers: Angelika Breiteneder, Theresa Klimpfinger, Stefan Majewski, Ruth Osimk-Teasdale, Marie-Luise Pitzl & Michael Radeka. http://www.univie.ac.at/voice/


Ai, Haiyang & Xiaoye You. 2017. “Lexis-grammar interface in Chinese English: A corpus study of the prototypical ditransitive verb GIVE”. Researching Chinese English: The State of the Art, ed. by Zhichang Xu, Deyuan He & David Deterding, 49–60. New York: Springer.

Baird, Robert, Will Baker & Mariko Kitazawa. 2014. “The complexity of ELF”. Journal of English as a Lingua Franca 3(1): 171–196.

Bamgbose, Ayo. 1998. “Torn between the norms: Innovations in world Englishes”. World Englishes 17(1): 1–14.

Bestgen, Yves & Sylviane Granger. 2011. “Categorising spelling errors to assess L2 writing”. International Journal of Continuing Engineering Education and Life-Long Learning 21(2/3): 235–252.

Bolton, Kingsley. 2002. “Chinese Englishes: From Canton jargon to global English”. World Englishes 21(2): 181–199.

Bolton, Kingsley. 2006. Chinese Englishes. Cambridge: Cambridge University Press.

Britain, David. 2010. “Grammatical variation in the contemporary spoken English of England”. The Routledge Handbook of World Englishes, ed. by Andy Kirkpatrick, 37–58. London & New York: Routledge.

Canagarajah, Suresh. 2013. Translingual Practice. New York: Routledge.

Chen, Yu-Hua, Simon Harrison, David Oakey, Godwin Ioratim-Uba, Michael Stevens & Shanru Yang. Submitted. “Beyond borders, beyond words: Developing a multimodal corpus of L2 academic English from an EMI university in China”.

Chen, Yu-Hua, Simon Harrison, Michael Stevens, Qianqian Zhou & Radovan Bruncak. Submitted. “Developing transcription and annotation conventions for a multimodal EMI corpus UNNC-CAWSE: Issues, challenges and prospects”.

Cogo, Alessia & Martin Dewey. 2012. Analysing English as a Lingua Franca: A Corpus-Driven Investigation. London: Continuum.

Dagneaux, Estelle, Sharon Denness & Sylviane Granger. 1998. “Computer-aided error analysis”. System 26(2): 163–174.

Deterding, David. 2007. Singapore English. Edinburgh: Edinburgh University Press.

Deterding, David. 2017. “The pronunciation of English in Guangxi: Which features cause misunderstandings?” Researching Chinese English: The State of the Art, ed. by Zhichang Xu, Deyuan He & David Deterding, 17–31. New York: Springer.

Ellis, Rod. 2007. “The differential effects of corrective feedback on two grammatical structures”. Conversational Interaction in Second Language Acquisition: A Collection of Empirical Studies, ed. by Alison Mackey, 339–360. Oxford: Oxford University Press.

García, Ofelia. 2009. Bilingual Education in the 21st Century. Chichester: Wiley-Blackwell.

Hamid, M. Obaidul & Richard B. Baldauf Jr. 2013. “Second language errors and features of world Englishes”. World Englishes 32(4): 476–494.

He, Deyuan & David Li. 2009. “Language attitudes and linguistic features in the China English debate”. World Englishes 28(1): 70–89.

Hemchua, Saengchan & Norbert Schmitt. 2006. “An analysis of lexical errors in the English compositions of Thai learners”. Prospect 21(3): 3–25.

Hirschmann, Hagen, Seanna Doolittle & Anke Lüdeling. 2007. “Syntactic annotation of non-canonical linguistic structures”. Proceedings of the Corpus Linguistics Conference (CL2007), ed. by Matthew Davies, Paul Rayson, Susan Hunston & Pernilla Danielsson. Birmingham: University of Birmingham. http://ucrel.lancs.ac.uk/publications/CL2007/

Jenkins, Jennifer. 2000. The Phonology of English as an International Language. Oxford: Oxford University Press.

Jenkins, Jennifer. 2006. “English pronunciation and second language speaker identities”. The Sociolinguistics of Identity, ed. by Tope Omoniyi & Goodith White, 75–91. New York: Continuum.

Jenkins, Jennifer. 2014. English as a Lingua Franca in the International University: The Politics of Academic English Language Policy. London & New York: Routledge.

Jenkins, Jennifer. 2015. “Repositioning English and multilingualism in English as a Lingua Franca”. Englishes in Practice 2(3): 49–85.

Kirkpatrick, Andy. 2003. “English as an ASEAN Lingua Franca: Implications for research and language teaching”. Asian Englishes 6(2): 82–91.

Kirkpatrick, Andy. 2007. World Englishes: Implications for International Communication and English Language Teaching. Cambridge: Cambridge University Press.

Kirkpatrick, Andy & Zhichang Xu. 2002. “Chinese pragmatic norms and ‘China English’”. World Englishes 21(2): 269–279.

Li, David C.S. 2010. “When does an unconventional form become an innovation?” The Routledge Handbook of World Englishes, ed. by Andy Kirkpatrick, 617–633. London & New York: Routledge.

Li, Wei. 2018. “Translanguaging as a practical theory of language”. Applied Linguistics 39(1): 1–23.

Liang, Jianli & David C.S. Li. 2017. “Researching collocational features: Towards China English as a distinctive new variety”. Researching Chinese English: The State of the Art, ed. by Zhichang Xu, Deyuan He & David Deterding, 61–75. New York: Springer.

Long, Michael H., Shunji Inagaki & Lourdes Ortega. 1998. “The role of implicit negative feedback in SLA: Models and recasts in Japanese and Spanish”. The Modern Language Journal 82(3): 357–371.

Mauranen, Anna, Niina Hynninen & Elina Ranta. 2010. “English as an academic lingua franca: The ELFA project”. English for Specific Purposes 29(3): 183–190.

Milroy, James & Lesley Milroy. 2012. Authority in Language: Investigating Standard English. New York: Routledge.

Nelson, Cecil L. 2011. Intelligibility in World Englishes. New York: Routledge.

Nelson, Gerald. 2004. “Negation of lexical have in conversational English”. World Englishes 23(2): 299–308.

Nicholls, Diane. 2003. “The Cambridge Learner Corpus - error coding and analysis for lexicography and ELT”. Proceedings of the Corpus Linguistics Conference (CL2003), ed. by Dawn Archer, Paul Rayson, Andrew Wilson & Tony McEnery, 572–581. Lancaster: Lancaster University. http://ucrel.lancs.ac.uk/publications/CL2003/

Ohta, Amy. 2001. Second Language Acquisition Processes in the Classroom: Learning Japanese. Mahwah, NJ: Lawrence Erlbaum.

Osimk-Teasdale, Ruth & Nora Dorn. 2016. “Accounting for ELF: Categorising the unconventional in POS-tagging the VOICE corpus”. International Journal of Corpus Linguistics 21(3): 372–395.

Pingali, Sailaja. 2009. Indian English. Edinburgh: Edinburgh University Press.

Pitzl, Marie-Luise, Angelika Breiteneder & Theresa Klimpfinger. 2008. “A world of words: Processes of lexical innovation in VOICE”. View 17(2): 21–46.

Ren, Wei. 2017. “Pragmatics in Chinese graduate students’ English gratitude emails”. Researching Chinese English: The State of the Art, ed. by Zhichang Xu, Deyuan He & David Deterding, 109–124. New York: Springer.

Seidlhofer, Barbara. 2011. Understanding English as a Lingua Franca. Oxford: Oxford University Press.

Sheen, Younghee. 2007. “The effects of corrective feedback, language aptitude, and learner attitudes on the acquisition of English articles”. Conversational Interaction in Second Language Acquisition: A Collection of Empirical Studies, ed. by Alison Mackey, 301–322. Oxford: Oxford University Press.

Sifakis, Nicos C. & Natasha Tsantila. 2019. English as a Lingua Franca for EFL Contexts. Bristol: Multilingual Matters.

Siqi, Li & Andrew Sewell. 2012. “Phonological features of China English”. Asian Englishes 15(2): 80–101.

Smith, Larry. 1998. “English is an Asian language”. Asian Englishes 1(1): 172–174.

Stevens, Michael, Yu-Hua Chen & Simon Harrison. In press. “Issues and challenges in building a multimodal corpus of L2 English”. Variation in Time and Space: Observing the World through Corpora (Diskursmuster - Discourse Patterns), ed. by Ingo H. Warnke & Beatrix Busse. Berlin: De Gruyter.

Swain, Merrill & Sharon Lapkin. 1998. “Interaction and second language learning: Two adolescent French immersion students working together”. The Modern Language Journal 82(3): 320–337.

Swan, Michael. 2012. “ELF and EFL: Are they really different?” Journal of English as a Lingua Franca 1(2): 379–389.

Tupas, Ruanni. 2010. “Which norms in everyday practice: And why?” The Routledge Handbook of World Englishes, ed. by Andy Kirkpatrick, 567–579. Abingdon: Routledge.

Walker, Robin. 2010. Teaching the Pronunciation of English as a Lingua Franca. Oxford: Oxford University Press.

Xu, Zhichang. 2008. “Analysis of syntactic features of Chinese English”. Asian Englishes 11(2): 4–31.

Xu, Zhichang. 2010. Chinese English: Features and Implications. Hong Kong: Open University of Hong Kong Press.

Xu, Zhichang & David Deterding. 2017. “The playfulness of ‘new’ Chinglish”. Asian Englishes 19(2): 116–127.

Xu, Zhichang, Deyuan He & David Deterding, eds. 2017. Researching Chinese English: The State of the Art. Singapore: Springer.


:: elongated sound
(.) micropause
((gesture)) action or gesture in double parentheses
{translation} translation for codeswitching


University of Helsinki