Applying existing tagging practices to VOICE

Ruth Osimk-Teasdale
VOICE Project, Department of English, University of Vienna


Research on English as a lingua franca (ELF) has attracted increasing interest in recent years. An important development in this respect was the release of the Vienna-Oxford International Corpus of English (VOICE) in May 2009, the first free-to-use resource containing over one million words of naturally occurring transcribed spoken ELF conversations. To increase its usability, we are currently investigating the possibilities of assigning part-of-speech (POS) tags to the VOICE corpus. [1] The nature of the data makes this a highly challenging task and raises a number of issues.

The main issue addressed in this paper is the extent to which existing tagging systems of available L2 corpora can be applied to a corpus like VOICE. The first part of this paper is a review and evaluation of POS tagging practices which have been used previously for spoken learner corpora, with reference to their applicability to VOICE. The POS tagging practices commonly used in SLA research have traditionally taken an error-based approach, viewing divergent L2 forms as steps in acquiring the target language system, and therefore as deficient in relation to L1 output. The approach to ELF adopted in the VOICE project, on the other hand, takes a ‘difference’ rather than ‘deficiency’ perspective: unconventional forms are considered to be communicatively motivated rather than mere failures to attain target-language forms, i.e. ‘errors’. In this respect, ELF speakers are viewed as goal-oriented users of the language in their own right, and clearly differentiated from learners of English. Clearly, this position needs to be reflected in the POS tagging practices adopted for an ELF corpus like VOICE.

Part two of the paper then addresses in particular the issue of unconventional forms in connection with the form-function relationship realized in ELF interactions (cf. Seidlhofer 2009a) and discusses how this central issue in ELF might be accounted for with available tagging systems, e.g. SLA-tagging (Rastelli 2009). The paper raises broader questions about how L2 language (cf. Cook 2002), whether conceived of as a learner language (EFL, SLA) or a user language (ELF), can be adequately pre-processed in terms of POS tagging. This is tested in a pilot study undertaken on a small sample of VOICE data.

1. Introduction

The aim of this paper is to highlight some issues that arise in the POS-tagging of VOICE data and outline their theoretical implications by reference to a small pilot study. In section 2, the user perspective adopted in ELF research is discussed as especially relevant to ELF research in general, and to the compilation of ELF corpora more specifically.

Section 3 addresses an issue that arises from the very nature of this user data, namely that of a heightened number of uncodified items, resulting from the language output of the multilingual ELF speakers. Two types of uncodified items are especially relevant when undertaking the task of POS tagging ELF: on the one hand, that of <pvc>s (pronunciation variation and coinages), i.e. of words not present in our reference dictionary, and on the other hand, that of non-ENL relationships of paradigmatic forms and syntagmatic functions. Such instances of linguistic innovation are usually viewed positively in native speaker corpora but considered erroneous in L2 corpora in comparison to native speaker data. It has been argued in ELF research that such comparison has the effect of actually concealing interesting manifestations of L2 user language.

In section 4, two traditional tagging practices for L2 data, namely error tagging and SLA-tagging, are discussed with regard to their relevance to ELF data. Firstly, it is argued that if adopting a user perspective is a key criterion for POS tagging VOICE, the perspective of deviation from TL norms usually assumed in error tagging for learner corpora is not suitable for an ELF corpus. Secondly, the approach of SLA-tagging with TreeTagger is discussed and compared to the requirements for POS-tagging ELF. Despite the differences between these approaches, TreeTagger is proposed as potentially suitable for tagging VOICE.

To investigate the performance of TreeTagger on VOICE, a small, initial pilot study was undertaken, which is described in section 5. Both in general and on <pvc>s, the performance of TreeTagger was found to be a reasonable starting point for POS-tagging VOICE. At the same time, the pilot study pointed to a number of areas in which the tagger’s performance could be improved.

2. English as a lingua franca – the user perspective adopted in VOICE

It is widely acknowledged that the English language today is used in a multitude of situations all over the globe. Rather than calling this ‘English’, which conveys the impression of being a single language variety, sociolinguistics acknowledges the variety of forms and functions English serves today by referring to it as ‘Englishes’ (e.g. McArthur 1998, Jenkins 2009). It has been recognized that the uses of English go far beyond those in inner and outer circle countries (Kachru 1985), where English has official status, to the expanding circle, where it has no such official status. More recently, it has been recognized that these speakers too are users and not merely learners of English. In expanding circle settings, English is often the “common means of communication between speakers from different first-language backgrounds” (VOICE Project Website) and therefore a lingua franca for all interlocutors. This definition makes it clear that the main focus in ELF research is on issues to do with language ‘usage’ in all kinds of contexts for real-life purposes rather than on language learning, a distinction which we will return to later.

Given the crucial role of English as a global means of communication in the expanding circle, it is not at all surprising that the nature of these communications in English as a lingua franca (ELF) has been of increasing interest to researchers, and that this research has gathered momentum within the last decade. Research into ELF has primarily been concerned with spoken language, often highly interactive, and has focused on different domains of usage, largely in higher education (e.g. Björkman 2009, Breiteneder 2009, Ranta 2009, Smit 2010) and business (Ehrenreich 2009, Pitzl 2005, 2010, Pullin-Stark 2009). [2] More recently, the usage of ELF in other domains has also been investigated, such as in written business communication (e.g. Bjørge 2007, Clayson-Knollmayr 2010, Kankaanranta 2005) or private conversations, e.g. couples’ talk (Gundacker 2010, Kloetzl 2010). Such work has often been based on individual data collections of different sizes. In more recent years, however, a number of ELF corpora have been completed or are currently being compiled, e.g. the Tübingen English as a Lingua Franca corpus (TELF) and the Asian Corpus of English (ACE). Two corpora which have been completed are the Corpus of English as a Lingua Franca in Academic Settings (ELFA) and the free-to-use Vienna-Oxford International Corpus of English (VOICE), which was released in May 2009.

The 151 speech events in VOICE, with a total size of over one million words, comprise interactions between 753 individual speakers with 49 different first languages. The speech events are taken from 3 different domains (professional, educational and leisure) and classified into 10 different speech event types (e.g. conversations, service encounters, working group discussions). [3] All speech events fulfil the criteria of being spoken, naturally occurring, varyingly interactive, face-to-face and non-scripted interactions in which participation is self-selected, i.e. they would also have taken place had no one been there to record them (VOICE Project Website; Breiteneder et al. 2006: 164 ff.; Breiteneder et al. 2009).

3. Theoretical considerations for POS tagging VOICE – the challenge of uncodified items

After the release of VOICE 1.0 Online, the main focus in the current stage of the project has been on increasing the applicability and usability of the corpus. Part-of-speech or POS tagging, i.e. assigning a word class category to each word in the corpus, offers one possibility to achieve this. However, applying it to a corpus of spoken ELF presents a number of significant challenges. Many of these challenges are of course not unique to the POS tagging of VOICE. With regard to spoken language, for example, other corpora face similar questions: how to deal with incomplete starts or repetitions, how to implement standardized discourse markers (e.g. hesitations, backchannels etc.), or how to chunk highly interactive spoken discourse (containing overlaps etc.)? On top of these issues, however, the nature of ELF data in itself poses a considerable challenge. The input, shaped by multilingual speakers, includes features such as code-switching, the production of non-codified items and the matching of paradigmatic forms with syntagmatic functions which do not correspond to those in English as a native language (ENL). This paper will focus specifically on the aspects of non-codified items and form-function relationships.

Any corpus of natural language will have to deal with issues of non-codified language, although to different degrees depending on the type of corpus: spoken corpora inherently contain more non-codified items than written corpora, and dialect corpora more than those consisting of standardized spoken language, e.g. scripted media discourse. But in any natural language, speakers will explore what is possible in the language system in order to suit their communicative needs and therefore coin new items. [4] Newly coined items are, by definition, not codified. Given the lingua-cultural settings of a ‘contact language’ such as ELF, it is perhaps not surprising that a considerable number of such items are observed in VOICE. [5] Where speakers engage with real-life tasks in a language that is not native to them, the creative exploitation of linguistic resources is often an important strategy for achieving their communicative goals (cf. e.g. Pitzl 2009, 2011).

In the VOICE corpus, the annotation <pvc> (pronunciation variation and coinages) is used to highlight items which were not defined in the reference dictionary, the 7th edition of the Oxford Advanced Learner’s Dictionary (henceforth OALD7), and which show “[s]triking variations on the levels of phonology, morphology and lexis as well as ‘invented’ words” (VOICE Project 2007: 4). [6] In total, 1141 types (2116 tokens) have been annotated with the <pvc> tag in VOICE 1.0 (VOICE 2011). By capturing any items not found in OALD7, the <pvc> tag provided a starting point for later analysis (Breiteneder et al. 2006: 181). It is important to note that the decision on the definition of the <pvc> tag was “purely operational” (Pitzl et al. 2008: 5), and that the words captured under the <pvc> tag may include a range of different items, some of which might indeed be found in other corpora or dictionaries. [7] Besides the many newly coined words, such as luckywise, specialized terminology, e.g. the medical term cytokines, and presumed slips of the tongue (examples (4) and (5)) are also included. However, the vast majority of these <pvc>s constitute newly coined words which follow the rules of attested word formation processes such as affixation, analogy, reduction, borrowing and addition, to name the most common (Pitzl et al. 2008), but have mostly not been codified in ENL (examples (1) to (3)).

(1) S1: i was just like (.) <pvc> putted </pvc> things in (LEcon565:378, L1=ita-IT) [8]
(2) S1: it was like a surreal <pvc> inscenation </pvc> or something (LEcon573:76; L1=ger-DE)
(3) S1: with a diverse <pvc> linguistical </pvc> (.) group. (EDwsd303:387, L1=dut-NL)
(4) S2: one could say that er in a period of economic <pvc> unsiternity {uncertainty} <ipa> ʌnsɪˈtɜːrnɪtɪ </ipa> </pvc> o:f er (PRpan294:91, L1=slo-SK)
(5) S4: the LAST but one paragraph it says <reading_aloud> a <pvc> comprison {comparison} <ipa> ˈkɔmprɪzən </ipa> </pvc> (.) <pvc> comprision {comparison} <ipa> ˈkɔmprɪʒən </ipa> </pvc> of the evaluation guidelines to final evaluation reports </reading_aloud> (POmtg403:174, L1= hun-HU)

Because of the practical definition of the <pvc> tag, certain variations potentially relevant for assigning POS tags are not included if the word form itself is an existing entry in the reference dictionary. As Pitzl et al. (2008: 26) state, “[a]n ‘existing word’ may also be used with an entirely new or different meaning or it can be used in another syntactic category. Yet, as long as the word itself can be found in the OALD7 it is not tagged as a <pvc>.” (see examples (6) to (11))

The implication is, therefore, that a second aspect is not captured by the <pvc> tag, namely non-ENL form-function mapping, i.e. cases where the form exists but does not co-occur with its codified function. This aspect possibly poses an even greater challenge when dealing with ELF data. This is not to suggest, of course, that this phenomenon is unique to ELF or non-native speaker discourse in general, for form-function relationships are often not clear-cut in ENL either. Although paradigmatic forms and syntagmatic functions do mostly agree in ENL, many unclear cases arise, e.g. the inflection -ing (usually attached to verbs, but -ing forms can also function as adjectives or nouns; Atwell 2008: 507), or the differences between prepositions and particles, e.g. in here or in there (Johansson 2004: 11). However, as ELF data has been shown to contain more variation than corpus data where the speakers share a common L1, this is also to be expected in the area of form-function relationships. Examples (6) to (11) demonstrate such instances of non-ENL form-function mapping in VOICE data (plain style, bold highlighting is the author’s). These include the use of a zero morpheme where Standard English requires 3rd person present tense marking, e.g. (6) and (7), zero marking of plural nouns as in (8), non-ENL past tense constructions, e.g. (9) and (10), as well as forms of adjectives which occur in typical adverb positions as in (11).

(6) S2: you can TASTE it it taste even of milk but it's FINE. (LEcon566:183, L1=ita-IT)
(7) S1: the waiter is NOT somebody who hate americans (EDsed31:964, L1=ger-AT)
(8) S2: one can apply (at) this italian agency for (.) two month (.) (PRcon550:30, L1=slv-SI)
(9) S1: i don't want erm let's say this way i also didn't spoke to [first name34] in france (PBmtg27:527, L1=ger-DE)
(10) S1: in [org11] video camera has break down (PBmtg27:653, L1=ger-DE)
(11) S11: but er (1) we are <3> complete different </3> (EDsed31:1134, L1=ita-IT)

For tagging purposes, we can therefore posit two main types of non-codified items in VOICE data. First, those which fulfil the criteria of a grammatical category (through word-formation processes, e.g. affixation), but happen not to have been codified in our reference dictionary (realisations of the virtual language, annotated as <pvc>s in VOICE). And secondly, those items with an existing form entry in the reference dictionary which does not co-occur with its codified function (not annotated at present in VOICE).

Again, the examples we come across are, of course, not entirely unique to ELF. The same or similar occurrences can be found in any corpus of naturally occurring English, as examples (12) and (13) taken from native English corpora show.

(12) Yeah I know our Lorraine has took the phone over. (BNC: KBE 308) [9]
(13) It didn't hurt when I fell over cos the road's real soft and spongy (COCA: A74 2982)

However, while in native corpora cases like these are usually viewed positively as interesting instances of dialectal variation, when produced by L2 language users they have often been regarded as erroneous approximations to the target language (TL) system that fall short of a native speaker standard. Rastelli (2009: 58) describes this as “systematicity (or lack) in how learners map forms and functions and in how they gradually develop their knowledge of the TL categories out of the available input”. From an ELF perspective, however, rather than being regarded as learning strategies, as they often are in learner corpus research, “the very same utterances can be regarded as communication strategies” (Seidlhofer 2001: 144). Similarly, rather than being dismissed as “misspelled, badly uttered, incomprehensible and non-interpretable items” (Rastelli 2009: 57), unconventional forms are taken to be interesting manifestations of a lingua franca in use, as they shed light on the underlying communicative functions they serve. Seidlhofer (2009b: 241) argues that

the crucial challenge has been to move from the surface description of particular features, however interesting they may be in themselves, to an explanation of the underlying significance of the forms: to ask what work they do, what functions they are symptomatic of.

The adoption of this user perspective for all of Kachru’s (1985) circles, including expanding circle contexts, is a key characteristic of ELF research, with all the implications this has for how the users’ languages are viewed and how this view is implemented in research practices. Speakers in ELF conversations are recognized as language users making deliberate linguistic moves, rather than as learners applying “compensatory actions […] problematic exchange or helping strategies to make up for non-understanding” (Cogo 2009: 259), as they are often viewed in the literature on learner language. Although not always followed through consistently, this recognition generally guides the methodological approaches and presumptions of ELF researchers. It is of interest which strategies, forms and functions the speakers use to achieve their goals, rather than how they fail to do this in exactly the same way that a (rather small) group of native speakers speaking a standard variety would. [10] Based on her longitudinal investigation of tertiary classroom discourse, Smit (2009: 202) argues that an approach which focuses on describing L2 language as deviating from L1 actually misses out on describing features of ELF discourse. Therefore, the advantage of viewing ELF speakers as language users rather than learners is that it sheds light on a number of aspects which might otherwise be overlooked or disregarded.

For example, ELF speakers have been shown to use their multilingual repertoires successfully in order to go about their daily business and achieve their intended (e.g. communicative) goals. Klimpfinger (2009: 352, 359 ff.) identifies four main functions of code-switching in ELF, namely “specifying an addressee, signalling culture, appealing for assistance, and introducing another idea”. In another area, the adaptation of certain lexical forms, such as 3rd person singular zero marking, has been shown to serve the function of accommodation between interlocutors (Cogo & Dewey 2006: 84), and, more generally, “to acknowledge understanding, ensure the smooth development of the conversation, the synchrony of its delivery, and alignment” (Cogo 2009: 259). Similarly, for most unconventional items of both kinds described above for VOICE, whether newly coined words or non-ENL form-function correspondences, there are indications that the forms are nevertheless functionally motivated. Pitzl et al. (2008: 40 ff.) demonstrated for 247 <pvc> items in a sub-corpus totalling 250,042 words that the four main functions of these forms were to make expressions clearer, more regular or more efficient, or to coin a word where there is a lexical gap in English.

4. Possible approaches to assign POS tags to VOICE data

Given this recognition of the functional motivation for formal variation, the methodological question arises of how to deal with both types of non-codified items in POS tagging VOICE, and of the applicability of existing tagging practices in accounting for them. As mentioned earlier, most L2 corpora are concerned with learner data, and traditionally investigate how the language of the learners deviates from a TL norm, e.g. through ‘over’- and ‘underusage’ of grammatical categories (Granger 2008: 267). Such learner corpora commonly assign ‘error tags’ to any items which deviate from the TL. [11] Moreover, forms are ‘error-tagged’ according to the assumed TL goal whenever a difference between the produced forms and TL functions arises. For example, non-use of 3rd person -s, as in *he jump, would be tagged as an error according to TL function. However, error tagging, with its implicit view of L2 users’ language as erroneous and deficient compared to L1 language, is clearly not suitable for an ELF corpus, where speakers seek to communicate rather than to conform to the norms of a TL community. [12]

An alternative to error tagging is SLA-tagging, as carried out by Rastelli (2009), who used an L1 tagger (TreeTagger) to tag L2 language with the goal of investigating the interlanguage system. Rastelli (2009: 61) criticises the practice of error tagging, i.e. the mapping of functions onto deviant forms, for committing the comparative fallacy and the closeness fallacy. The comparative fallacy means that one language system is investigated in reference to the expected target language output; the closeness fallacy refers to the danger of assigning a function that is similar to the target language output in form, even though the reasons for this resemblance are covert. Although there are parallels between the approach of SLA-tagging, with its notion of regarding non-codified items as interlanguage, and the theoretical considerations for tagging ELF data, SLA-tagging still differs considerably from the ‘difference’ approach adopted in ELF. Table 1 shows a comparison of the three approaches discussed: error tagging, SLA-tagging and tagging ELF.

| Approach | Speaker is viewed as… | Non-codified items | Mapping of forms and functions | Goals |
| --- | --- | --- | --- | --- |
| Error-tagging | Language learner | Errors | ⇒ error tag is attached according to assumed TL goal, e.g. *he jump → 3rd person singular | Investigate the language of learners (a) to gain new insights for language pedagogy and SLA, (b) by using methods from SLA which compare the learner language to a native standard (CA, CEA) |
| SLA-tagging | L2 user (though TL is the goal) | Manifestations of interlanguage | ⇒ ‘virtual categories’: form and functional interpretation kept separate in order to investigate the interlanguage system and its mechanisms | (a) Investigate systematicity of form-function mapping and the development of TL categories with learners, (b) reveal unexpected features of the learner language |
| ELF-tagging | Language user (ENL not necessarily the goal) | Manifestations of virtual language | ⇒ tagging as yet unexplored | Development of a tagging system which is meaningful for ELF research, e.g. form/function usage for communicative purposes |

Table 1. Comparison of the approaches of Error tagging, SLA-tagging and the prospect of ELF tagging.

For the purpose of POS tagging a corpus of Italian L2 speakers, instead of error-tagging non-codified items, Rastelli used TreeTagger (cf. Schmid 1994). TreeTagger is a probabilistic L1 tagger which assigns tags according to lexical root, morphology and context, using a binary decision tree. Moreover, TreeTagger assigns confidence rates to tags, e.g. a target word is assigned a 70% likelihood of being a noun and a 10% likelihood of being an adjective (cf. Figure 1). With analytical corpus tools, the researcher can then decide whether to output tags with high or low confidence rates. Rastelli (2009: 60) views the POS tags which TreeTagger assigns as “virtual” categories, rather than “psychological realia in a learner's mind” (2009: 59). He suggests that, in order to better investigate the learners’ language systems, the matching of forms and TL functions should be done at the later, analytical, stage. The method he used in his study was to treat low confidence rates of tags as indicators of form-function mismatches and as a starting point for analysing the interlanguage system. [13]
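Rastelli’s heuristic of treating low confidence rates as pointers to possible form-function mismatches can be sketched in a few lines of Python. This is an illustrative sketch only, not part of any actual tagging pipeline: the tab-separated token–tag–probability input format is an assumption, though the probability values are taken from Table 2 in section 5.

```python
# Illustrative sketch: flag tokens whose best tag falls below a
# confidence threshold, following Rastelli's idea of treating
# low-probability tags as candidates for form-function mismatch.
# The "token<TAB>tag<TAB>probability" input format is an assumption.

def flag_low_confidence(tagged_lines, threshold=0.6):
    """Return (token, tag, probability) triples below the threshold."""
    flagged = []
    for line in tagged_lines:
        token, tag, prob = line.split("\t")
        if float(prob) < threshold:
            flagged.append((token, tag, float(prob)))
    return flagged

sample = [
    "putted\tVVN\t0.989772",       # probabilities as in Table 2
    "dimensioned\tVVD\t0.501130",
    "re-enrol\tNN\t0.682976",
]
print(flag_low_confidence(sample))  # only 'dimensioned' falls below 0.6
```

Raising or lowering the threshold directly controls how many items are passed on to manual analysis, which is why the tag probability settings matter in the pilot study below.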

Figure 1. A sample decision tree (taken from Schmid 1994)

The approach of SLA-tagging used by Rastelli is novel in the sense that it tries to view L2 user language in its own right, and it exemplifies the successful application of an L1 tagger to L2 data. However, the method of using low confidence rates to reveal non-standard form-function relationships might not always yield entirely reliable results for other types of data. This might be especially true when the tagger is applied to data of a different kind than that which TreeTagger was trained on, as the probabilities the tagger outputs in this method are entirely dependent on the original training material. A second, related problem to be aware of is that for words in the fullform lexicon, TreeTagger only selects between the available tag possibilities. So, for example, if text is only assigned the tag noun (NN) in the lexicon, it will not be tagged as a verb even if it occurs as such in the corpus. While this is a problem for any kind of corpus, it might especially restrict analyses of L2 data, where it is necessary to keep an open mind for non-codified form-function relationships. Rastelli (2009: 63) grants that the TL form-function match from the training material might be “too restrictive” for L2 data and suggests that “if we want to keep the idea of ‘virtual categories’, more permissive conditions need to be introduced in the decision tree and high/low probability vectors need to be consequently adjusted and fine-tuned”.
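The fullform-lexicon restriction just described can be made concrete with a toy example. The lexicon entries below are invented for illustration and greatly simplified; the point is only that a known word can never receive a tag its lexicon entry does not list, while unknown words remain open to several classes.

```python
# Toy illustration (invented entries, much simplified): a word found in
# the fullform lexicon can only receive one of the tags listed for it,
# so "text", listed only as NN, can never be tagged as a verb, while an
# unknown word remains open to several word classes.

FULLFORM_LEXICON = {
    "text": {"NN": 1.0},
    "run": {"VV": 0.7, "NN": 0.3},
}

def candidate_tags(word):
    if word in FULLFORM_LEXICON:
        return set(FULLFORM_LEXICON[word])  # restricted to listed tags
    return {"NN", "VV", "JJ"}               # illustrative open classes

print(candidate_tags("text"))     # 'VV' is unreachable for "text"
print(candidate_tags("liquidy"))  # unknown word: several candidates
```

This is exactly the sense in which the training material’s form-function match can prove “too restrictive” for L2 data: verbal uses of a noun-only lexicon entry are ruled out before tagging even begins.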

Although these aspects have to be borne in mind, they do not necessarily rule out TreeTagger as a viable tool for POS-tagging VOICE. Firstly, TreeTagger has been shown to achieve excellent results in comparison with other taggers, e.g. those operating on Hidden Markov Models (HMM) and Memory-based Learning (Nøklestad & Søfteland 2007), HMM and the Brill Tagger (Allauzen & Bonneau-Maynard 2008) and a Trigram Tagger (Schmid 1994). Secondly, TreeTagger has been shown to achieve high accuracy even with a small training corpus (Schmid 1994), a necessity as VOICE is a rather specialized, small-size corpus and any training data would need to be created from the ground up, since there is currently no POS tagged ELF training data available. TreeTagger might therefore achieve better results than other taggers trained on such a small amount of data. Furthermore, TreeTagger does not require input which has been chunked into sentences, which is advantageous for VOICE, as the chunking of spoken, interactive data would not be unproblematic and would require a time-consuming additional stage of annotation preceding POS tagging. TreeTagger also seems a promising option with regard to the treatment of non-codified items in ELF data: it has been shown to deal particularly well with unknown words (Volk & Schneider 1998), as well as with data from “particularly difficult [i.e. not standardized] genres”, such as online fora and TV episode guides (Giesbrecht & Evert 2009), and it has been used on L2 data to separate form and function (Rastelli 2009). [14] Another path worth exploring for the application of TreeTagger to ELF data might be to take the non-error approach pursued by Rastelli (2009), while bearing in mind the limitations already mentioned, to reveal unexpected items in L2 data by outputting “low confidence rates” (Rastelli 2009: 62). Finally, TreeTagger uses the Penn tagset. Compared to other common tagsets for English, e.g.
CLAWS7 with 137 tags, the Penn tagset is rather small with only 36 different tags. This means that the assumed word class categories are relatively broad. For classifying spoken, highly interactive ELF data with predesigned L1 categories, such broad categories seemed more suitable than a very fine-grained tagset, as they decrease the number of tag ambiguities resulting from the nature of the data. The Penn tagset served as a starting point, and is currently being adapted and extended to suit the needs of our data (cf. Osimk 2011).

5. Pilot study: Test-tagging VOICE – example: <pvc>s

In order to determine the accuracy of TreeTagger on VOICE data, a pilot study on a small sample of the VOICE corpus was undertaken. For reasons of feasibility, this study focuses on examples of the first type of non-codified items mentioned, namely those annotated with the <pvc> tag. Two aspects of the performance of TreeTagger on VOICE data were to be measured: the first was how well TreeTagger performed on unknown items, i.e. those annotated as <pvc>. The second objective was to carry out preliminary tests on the overall tagging performance of TreeTagger on VOICE data, in order to gain more insight into problematic and error-prone areas to be aware of in the subsequent tagging process.

5.1. Test sample & methodology

The test sample consisted of all utterances containing a <pvc> from 4 speech events taken from the leisure domain (LEcon565, LEcon566, LEcon573, LEcon575), with a total of 543 tokens. The same two speakers participated in all 4 events, one an L1 Italian and the other an L1 German speaker. The two speakers are a couple, conversing informally about topics such as their professional life, preparing food, travel, common friends and plans for the weekend. The test sample was chosen to obtain an initial, tentative impression of how TreeTagger – a tagger which was originally trained on L1 data but had been used on L2 data before – performed on VOICE data. Although certainly not representative of the whole corpus, the sample was a useful example of VOICE data for the purposes of this pilot study, as it contained a large number of <pvc>s and was of medium interactivity. The sample was later expanded with other data complementary in terms of interactivity, speech event type and speakers’ first languages.

The speech events were chunked according to utterances, as utterances are units which can be automatically retrieved from the corpus mark-up. We extracted all utterances containing a <pvc> item from the 4 speech events, resulting in 20 utterances with a total of 543 words, of which 21 were annotated as <pvc>. All tokens of the 20 utterances were tagged manually, and then with TreeTagger in GATE, using the Penn Treebank tagset for both. TreeTagger was run under standard settings; only the threshold for tag probabilities was deliberately kept low (0.00001), to ensure that the output included all possible tags, even those with a low probability. The manually assigned tags served as the reference standard, henceforth referred to as the ‘gold standard’. The tag TreeTagger output with the highest probability was then compared to the manually assigned tag.
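The evaluation step then amounts to a straightforward token-by-token comparison. The sketch below is a minimal illustration of that comparison, assuming the tags are represented as two parallel lists; it reproduces the overall figure reported in section 5.2 (459 of 543 tags in agreement).

```python
# Minimal sketch (assumed list representation): accuracy is the share
# of tokens where TreeTagger's highest-probability tag matches the
# manually assigned gold-standard tag.

def accuracy(gold_tags, tagger_tags):
    assert len(gold_tags) == len(tagger_tags)
    agree = sum(g == t for g, t in zip(gold_tags, tagger_tags))
    return agree / len(gold_tags)

# With the pilot-study figures (459 of 543 tags in agreement):
gold = ["X"] * 543
tagged = ["X"] * 459 + ["Y"] * 84
print(round(accuracy(gold, tagged) * 100, 1))  # 84.5
```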

5.2. Results and discussion

Overall, 459 of the 543 tags assigned by TreeTagger were in agreement with the manual tags, an accuracy of 84.5%. A number of areas could be identified in which TreeTagger assigned a different tag than the manual annotation. One example is the non-capitalized ‘i’, which was tagged as a proper noun (NP) rather than a personal pronoun (PP). This is something to be aware of in tagging VOICE, as capitalization is used exclusively for signalling stress in the corpus. Secondly, base form verbs in questions were tagged as present tense verbs (VVP) or present tense ‘have’ (VHP) rather than as base form verbs (VV) or base form ‘have’ (VH), e.g. (14) and (15).

(14) S2: how do you say_VVP anachrom (LEcon565:23)
(15) S1: do you mind (if i have_VHP) it a bit more (LEcon566:233)

An accuracy of 84.5%, although relatively low in comparison with the accuracy rates commonly reported for L1 data, can be regarded as a reasonable starting point for the further tagging process. This is especially true considering that TreeTagger was trained on a set of data from the Penn Treebank, which differs from VOICE in many respects, both in the spoken, highly interactive character of the data and in the nature of ELF.

With regard to the 21 <pvc>s contained in the retrieved utterances, the manual tag and the TreeTagger tag agreed in 12 cases and disagreed in 9. The <pvc>s are listed in Table 2.

Pvc  Word with immediate co-text          Manual tag  TreeTagger tag (probability)
1    is it spanishy or                    JJ          JJ 0.929969
2    or portuguesey whatever shop         JJ          NN 0.730337 *
3    you say anachrom                     NN          NN 0.532445
4    just like putted things in           VVD         VVN 0.989772 *
5    is slightly liquidy but              JJ          JJ 0.794834
6    is more liquidy yeah                 JJ          NN 0.889958 *
7    it isn't liquidy i think             JJ          JJ 0.914492
8    never then chinesey ones             JJ          JJ 0.905536
9    have something liquidy then          JJ          NN 0.975937 *
10   not really softish huh               JJ          JJ 1.000000
11   just like claustrophobicy get        JJ          NN 0.987108 *
12   they look all frenchers              NNS         NNS 0.918265
13   but slutty                           JJ          JJ 1.000000
14   and those slutty forty year          JJ          JJ 1.000000
15   a surreal inscenation or something   NN          NN 0.996255
16   grey zone anyways                    RB          RB 1.000000
17   are not dimensioned we can't         JJ          VVD 0.501130 *
18   did not re-enrol he probably         VV          NN 0.682976 *
19   students don't re-enrol for          VV          NN 0.809839 *
20   you didn't re-enrol                  VV          NN 0.510386 *
21   create the sotteck socket            NN          NN 0.720744

Table 2. The 21 extracted <pvc> items with immediate co-text. Cases in which the manually assigned tag and the tag assigned by TreeTagger disagree are marked with an asterisk (*).

In the tagging process, TreeTagger uses a fullform, a suffix and a default lexicon. The fullform lexicon is searched first; if the word cannot be found there, the suffix lexicon is searched for the ending of the unknown word (cf. Schmid 1994). Each item in the suffix lexicon has a number of associated word class probabilities. Whether or not TreeTagger assigns a correct tag thus depends on the probabilities stored in the suffix lexicon, as well as on the preceding co-text of the word. The overall probabilities are then calculated using equivalence classes (Schmid 1999). This explains why liquidy is tagged as an adjective (JJ) after adverbs, as in (16) and (17), but as a noun (NN) after the comparative more (18) and the noun something (19).

(16) is_VBZ slightly_RB liquidy_JJ but
(17) it is_VBZ n't_RB liquidy_JJ i think
(18) is_VBZ more_RBR liquidy_NN yeah
(19) have_VHP something_NN liquidy_NN then
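The interaction of suffix probabilities with the preceding co-text can be illustrated with a deliberately simplified sketch. All probabilities below are invented for illustration, and the bigram-style scoring is a stand-in for TreeTagger's actual decision-tree model of tag contexts (Schmid 1994):

```python
# Simplified illustration of how an unknown word's tag can depend on both
# its suffix and the preceding tag. The probabilities are invented;
# TreeTagger itself estimates tag contexts with a binary decision tree.

SUFFIX_LEX = {"y": {"NN": 0.6, "JJ": 0.4}}   # word-class probabilities for -y
TRANSITIONS = {                               # P(tag | preceding tag), invented
    ("RB", "JJ"): 0.5, ("RB", "NN"): 0.2,
    ("NN", "JJ"): 0.2, ("NN", "NN"): 0.4,
}

def guess_tag(prev_tag, word):
    """Pick the tag maximizing suffix probability * transition probability."""
    for suffix, tag_probs in SUFFIX_LEX.items():
        if word.endswith(suffix):
            return max(tag_probs,
                       key=lambda t: tag_probs[t] * TRANSITIONS.get((prev_tag, t), 0.0))
    return "NN"  # default-lexicon fallback for unknown endings

print(guess_tag("RB", "liquidy"))  # JJ  (after an adverb, as in (16) and (17))
print(guess_tag("NN", "liquidy"))  # NN  (after a noun, as in (19))
```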

Generally, the result of 12 out of 21 correctly assigned tags shows that the suffix lexicon, which was generated from native speaker data, is at least to some extent applicable to ELF data. In the 9 cases where the tag assigned by TreeTagger did not correspond to the manually assigned tag, TreeTagger assigned a different tag which is also possible according to the word's form. This supports Pitzl et al.'s (2008) findings that <pvc>s are coined according to attested word formation processes. However, the accuracy is too low to allow <pvc>s to be tagged automatically. To improve the overall tagging result, it will be more promising in the further process either to tag <pvc>s semi-automatically (with manual checks) or to assign tags manually and then add the items to TreeTagger's lexicon. The experience of tagging the <pvc> items for the sample corpus showed that manual tagging is also a viable option: even though <pvc>s are by definition not codified in the reference dictionary, they could be assigned POS tags unambiguously by considering both form and co-text.
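Turning manually tagged <pvc>s into lexicon entries could look roughly like the following sketch. The item list is hypothetical; the line format (word followed by tab-separated tag and lemma) follows TreeTagger's documented lexicon file format, though the exact retraining workflow would need to be checked against the TreeTagger documentation:

```python
# Sketch: generating supplementary lexicon entries from manually tagged
# <pvc> items. The items are illustrative examples taken from Table 2;
# the tag-lemma line format follows TreeTagger's lexicon files.

manually_tagged_pvcs = [
    ("liquidy", "JJ", "liquidy"),
    ("re-enrol", "VV", "re-enrol"),
    ("spanishy", "JJ", "spanishy"),
]

def lexicon_lines(entries):
    """One line per word: word<TAB>tag<SPACE>lemma."""
    return ["{}\t{} {}".format(word, tag, lemma) for word, tag, lemma in entries]

for line in lexicon_lines(manually_tagged_pvcs):
    print(line)
```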

We then wanted to analyse to what extent the general tagging performance of TreeTagger could be improved if some problematic areas (which could be added to the lexicon in the further tagging process) were removed. To do this, the sample was re-evaluated while ignoring 11 VOICE-specific items (IPAs, incomplete words, proper nouns, breathing, unintelligible speech). We also ignored the 21 <pvc>s, as their frequency in the retrieved utterances (3.9%) is much higher than in the overall corpus (0.2%) and thus not representative, and also because they may be manually added to the lexicon in the further tagging process. With the 21 <pvc>s and 11 VOICE-specific items ignored, the accuracy increased slightly to 87.5%. This result is considered a reasonable starting point for the further tagging process, but it also means that it will be necessary to expand the manually annotated material before a more thorough analysis of the error types made by TreeTagger on VOICE data can be conducted.
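The effect of the filtering on the accuracy figure can be reproduced arithmetically. Note that the split of the 32 removed tokens into correct and incorrect ones is inferred here so that both reported figures (84.5% before, 87.5% after) come out; it is not stated explicitly in the text:

```python
# Reproducing the accuracy figures before and after filtering.
# total/correct are the reported results; removed_correct = 12 is
# inferred from the two reported percentages, not stated in the paper.

total, correct = 543, 459
removed_total = 21 + 11        # <pvc>s plus VOICE-specific items
removed_correct = 12           # inferred split of the removed tokens

before = correct / total
after = (correct - removed_correct) / (total - removed_total)

print(round(before * 100, 1))  # 84.5
print(round(after * 100, 1))   # 87.5
```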

6. Conclusion

Assigning word class categories to VOICE data presents many challenges, arising from the very multi-faceted nature of ELF data. In this paper, some of these challenges were outlined and a number of initial issues concerning the POS tagging of non-codified items in our data were addressed. In tagging a sample corpus, we investigated how TreeTagger performed on our data in general, and on <pvc>s in particular. The tagging accuracy of 84.5% with TreeTagger, whilst leaving considerable room for improvement, does provide a basis for the further tagging process. The performance of TreeTagger on <pvc>s, although not sufficient for fully automatic tagging, showed that the suffix lexicon trained on native speaker data can generally be applied to ELF data. Further work will tackle the question of how to deal with non-ENL form-function relationships in tagging VOICE, which could only be briefly touched on in this paper. This is of course a question which has major theoretical and practical implications and will have to be investigated in greater depth in developing a practical tagging process that will capture the distinctive features of ELF. Finally, it should be added that apart from the results described, it has become apparent that the very process of applying existing POS tagging practices to VOICE is a discovery procedure which raises awareness of features of ELF that might otherwise not be apparent. Trying to categorise these features with more precision brings out more clearly the essentially creative, elusive and variable character of ELF as naturally occurring language use.


[1] Note that this paper was written in 2010 when the POS tagging process was in its early stages. Since then, a POS-tagged version of VOICE has been published online and for download (cf.

[2] A good overview of ELF research can be found in Mauranen & Ranta (2009).

[3] More meta information can be found in the text headers as well as in the speaker information pop-up.

[4] cf. what Widdowson (1997), Seidlhofer & Widdowson (2009) and Seidlhofer (2011) refer to as ‘virtual language’.

[5] This follows Firth (1996: 240), who defines English as a lingua franca as “a ‘contact language’ between persons who share neither a common native tongue nor a common (national) culture, and for whom English is the chosen foreign language of communication.”

[6] The OALD7 was chosen as a reference manual in the early stages of the compilation of VOICE for various reasons (for a detailed discussion, see Breiteneder et al. 2006: 179ff. and Pitzl et al. 2008: 25). One of the main reasons was that it served as “a stable and shared point of reference for practical reasons” (Breiteneder et al. 2006: 180), meaning that it was considered up-to-date at the time when VOICE was being compiled, and that a CD-ROM of the OALD7 could be made available to all transcribers, ensuring spelling consistency (Breiteneder et al. 2006: 180). Moreover, the OALD was chosen because its range of lexical entries was expected to correspond to the language usage of the speakers in VOICE (cf. Breiteneder et al. 2006: 180). It is important to note that it was not used authoritatively, but rather as a reference manual.

[7] cf. Pitzl et al. (2008: 22 ff.) for a more detailed account of the definition of the <pvc> tag and the guiding principles.

[8] LEcon565: ID of speech event; 378: Utterance number in the speech event; L1=ita-IT: First language of speaker, here: Italian, IT: corresponding country (abbreviation of languages according to the ISO 639-2 codes, abbreviation of corresponding countries according to the ISO 3166-1-alpha-2 codes).

[9] Data cited herein have been extracted from the British National Corpus Online service, managed by Oxford University Computing Services on behalf of the BNC Consortium. All rights in the texts cited are reserved.

[10] This is in line with the theoretical framework of some more recent approaches in SLA which view second language speakers as L2 users (e.g. Cook 2002), address the “monolingual bias” in SLA research (Ortega 2010) and call for investigating the language of these users in its own right. However, this theoretical view is not usually one that informs SLA studies, where a native speaker target is generally still implied.

[11] cf. Pravec (2002) and Granger (2008) for a comprehensive overview on learner corpora.

[12] The view expressed by Granger (2008: 260) that the foci of learner corpora and ELF corpora are merely “two sides of a coin” and that their speakers are differentiated only by level of proficiency does not hold for ELF. In Granger’s understanding, learner corpora deal with those “speakers who are still in the process of learning the language” (Granger 2008: 260), whereas ELF corpora deal with “proficient non-native speakers of English” (Granger 2008: 260). However, if we understand ELF as communication between non-native speakers for real-life purposes or “to go about their normal business”, as Granger (2008: 261) words it, this involves all levels of proficiency: a Spanish tourist with elementary knowledge of English ordering a coffee at Moscow airport, as well as Turkish and Swedish business people discussing complex financial issues. Regardless of their level of proficiency, all of these people are users of the language in the situations described.

[13] cf. also Atwell & Elliott (1987) for an illustration of how the likelihood of tag probabilities can be used to detect errors in texts.

[14] Both studies by Volk & Schneider (1998) and Giesbrecht & Evert (2009) were conducted on German data.


ACE Project Website = Asian Corpus of English, 28 Dec. 2012.

BNC = British National Corpus, 28 Dec. 2012.

COCA = The Corpus of Contemporary American English, 1990–present. 28 Dec. 2012.

ELFA Project Page =  Corpus of English as a Lingua Franca in Academic Settings, 28 Dec. 2012.

GATE = General architecture for text engineering. 28 Dec. 2012.

TELF Project Website = Tübingen English as a Lingua Franca. 28 Dec. 2012.

TreeTagger = TreeTagger - a language independent part-of-speech tagger. 28 Dec. 2012.

VOICE 2011 = The Vienna-Oxford International Corpus of English (version 1.0 XML). 2011. Director: Barbara Seidlhofer; Researchers: Angelika Breiteneder, Theresa Klimpfinger, Stefan Majewski, Marie-Luise Pitzl.


Allauzen, Alexandre & Hélène Bonneau-Maynard. 2008. “Training and evaluation of POS taggers on the French MultiTag corpus”. Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), Marrakech, May 2008, ed. by European Language Resources Association (ELRA), 3373–3377.

Atwell, Eric. 2008. “Development of tag sets for part-of-speech tagging”. Corpus linguistics: An international handbook, ed. by Anke Lüdeling & Merja Kytö, 501–527. Berlin & New York: Mouton de Gruyter.

Atwell, Eric & Stephen Elliott. 1987. “Dealing with Ill-formed English Text”. The Computational Analysis of English: A Corpus-based Approach, ed. by Roger Garside, Geoffrey Sampson & Geoffrey Leech, 120–138. London: Longman.

Bjørge, Anne Kari. 2007. “Power distance in English lingua franca email communication”. International Journal of Applied Linguistics 17(1): 60–80.

Björkman, Beyza. 2009. “From Code to Discourse in Spoken ELF”. In Mauranen & Ranta (eds.), 225–251.

Breiteneder, Angelika. 2009. English as a lingua franca in Europe. A natural development. Saarbrücken: VDM-Verlag Müller.

Breiteneder, Angelika, Marie-Luise Pitzl, Stefan Majewski & Theresa Klimpfinger. 2006. “VOICE recording - Methodological challenges in the compilation of a corpus of spoken ELF”. Nordic Journal of English Studies 5(2): 161–188.

Breiteneder, Angelika, Theresa Klimpfinger, Stefan Majewski & Marie-Luise Pitzl. 2009. “The Vienna-Oxford International Corpus of English (VOICE). A linguistic resource for exploring English as a lingua franca”. ÖGAI Journal 28(1): 21–26.

Clayson-Knollmayr, Beate. 2010. “’Drop me an e-mail when draft is ready.’ Register and style in ELF business e-mails”. Paper presented at the Third International Conference of English as a Lingua Franca, Vienna, May 2010.

Cogo, Alessia. 2009. “Accommodating Difference in ELF Conversations: A Study of Pragmatic Strategies”. In Mauranen & Ranta (eds.), 254–273.

Cogo, Alessia & Martin Dewey. 2006. “Efficiency in ELF Communication: From Pragmatic Motives to Lexico-grammatical Innovation”. Nordic Journal of English Studies 5(2): 59–93.

Cook, Vivian. 2002. “Background to the L2 User”. Portraits of the L2 User, ed. by Vivian Cook, 1–28. Clevedon: Multilingual Matters.

Firth, Alan. 1996. “The discursive accomplishment of normality: on 'lingua franca' English and conversation analysis”. Journal of Pragmatics 26: 237–259.

Ehrenreich, Susanne. 2009. “English as a Lingua Franca in Multinational Corporations – Exploring Business Communities of Practice”. In Mauranen & Ranta (eds.), 126–151.

Giesbrecht, Eugenie & Stefan Evert. 2009. “Is Part-of-Speech Tagging a Solved Task? An Evaluation of POS Taggers for the German Web as Corpus”. Proceedings of the 5th Web as Corpus Workshop (WAC5), ed. by Iñaki Alegria, Igor Leturia & Serge Sharoff. San Sebastian, Spain.

Granger, Sylviane. 2008. “Learner corpora”. Corpus linguistics: An international handbook, ed. by Anke Lüdeling & Merja Kytö, 259–275. Berlin & New York: Mouton de Gruyter.

Gundacker, Julia. 2010. “Why ELF as the language of couples? – The advantages and limitations of ELF”. Paper presented at the Third International Conference of English as a Lingua Franca, Vienna, May 2010.

Jenkins, Jennifer. 2009. World Englishes. A resource book for students (2nd edition). London: Routledge.

Johansson, Stig. 2004. “Corpus linguistics - past, present, future: A view from Oslo”. English Corpora Under Japanese Eyes, ed. by Junsaku Nakamura, Nagayuki Inoue & Tomoji Tabata, 3–24. Amsterdam: Rodopi.

Kachru, Braj. 1985. “Standards, Codification and Sociolinguistic Realism”. English in the World, ed. by R. Quirk, 11–34. Cambridge: Cambridge University Press.

Kankaanranta, Anne. 2005. “Hej, Seppo, could you pls comment on this!” - Internal Email Communication in Lingua Franca English in a Multinational Company. Jyväskylä: Jyväskylä University Printing House.

Klimpfinger, Theresa. 2009. “‘She’s mixing the two languages together’ – Forms and Functions of Code-Switching in English as a Lingua Franca”. In Mauranen & Ranta (eds.), 349–371.

Kloetzl, Svitlana. 2010. “A Love Affair with ELF: the case of linguistic hybridity in ELF couples talk”. Paper presented at the Third International Conference of English as a Lingua Franca, Vienna, May 2010.

Mauranen, Anna & Elina Ranta, eds. 2009. English as a Lingua Franca: Studies and Findings. Newcastle upon Tyne: Cambridge Scholars Publishing.

McArthur, Tom. 1998. The English languages. Cambridge: CUP.

Nøklestad, Anders & Åshild Søfteland. 2007. “Tagging a Norwegian Speech Corpus”. Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA 2007), Tartu, 2007, ed. by Joakim Nivre, Heiki-Jaan Kaalep, Kadri Muischnek & Mare Koit, 245–248.

Ortega, Lourdes. 2010. “The bilingual turn in SLA”. Paper presented at the AAAL Conference, Atlanta, March 2010.

Osimk, Ruth. 2011. “Developing tagging practices for VOICE – the challenge of variation”. Poster Display presented at the 4th International Conference of English as a Lingua Franca, Hong Kong, May 2011.

Pitzl, Marie-Luise. 2005. “Non-understanding in English as a lingua franca: examples from a business context”. Vienna English Working PaperS 14(2): 50–71.

Pitzl, Marie-Luise. 2009. “Diverging from existing norms: Creativity in ELF”. Paper presented at 2nd International Conference of English as a Lingua Franca, Southampton, Great Britain, 7 April 2009.

Pitzl, Marie-Luise. 2010. English as a lingua franca in international business: Resolving miscommunication and reaching shared understanding. Saarbrücken: VDM.

Pitzl, Marie-Luise. 2011. Creativity in English as a lingua franca: Idiom and metaphor. Ph.D. dissertation, University of Vienna.

Pitzl, Marie-Luise, Angelika Breiteneder & Theresa Klimpfinger. 2008. “A world of words: processes of lexical innovation in VOICE”. Views 17: 21–46.

Pravec, Norma A. 2002. “Survey of learner corpora”. ICAME Journal 26: 81–114.

Pullin-Stark, Patricia. 2009. “No Joke – This is Serious! Power, Solidarity and Humour in Business English as a Lingua Franca (BELF)”. In Mauranen & Ranta (eds.), 152–177.

Ranta, Elina. 2009. “Syntactic Features in Spoken ELF—Learner Language or Spoken Grammar?” In Mauranen & Ranta (eds.), 84–106.

Rastelli, Stefano. 2009. “Learner corpora without error tagging”. Linguistik online 38: 57–66.

Schmid, Helmut. 1994. “Probabilistic part-of-speech tagging using decision trees”. Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK, 44–49.

Schmid, Helmut. 1999. “Improvements in Part-of-Speech Tagging with an Application to German”. Proceedings of the ACL SIGDAT-Workshop, March 1995.

Seidlhofer, Barbara. 2001. “Closing a conceptual gap: The case for a description of English as lingua franca”. International Journal of Applied Linguistics 11: 133–158.

Seidlhofer, Barbara. 2009a. “Orientations in ELF research: form and function”. In Mauranen & Ranta (eds.), 37–59.

Seidlhofer, Barbara. 2009b. “Common ground and different realities: World Englishes and English as a lingua franca”. World Englishes 28(2): 236–245.

Seidlhofer, Barbara. 2011. Understanding English as a Lingua Franca. Oxford: OUP.

Seidlhofer, Barbara & Henry Widdowson. 2009. “Conformity and creativity in ELF and learner Englishes”. Dimensionen der Zweitsprachenforschung. Dimensions of Second Language Research. Festschrift for Kurt Kohn, ed. by Michaela Albl-Mikasa, Sabine Braun & Sylvia Kalina, 93–107. Tübingen: Narr Verlag.

Smit, Ute. 2009. “Emic Evaluations and Interactive Processes in a Classroom Community of Practice”. In Mauranen & Ranta (eds.), 200–224.

Smit, Ute. 2010. English as a lingua franca in higher education: a longitudinal study of classroom discourse. Berlin: de Gruyter Mouton.

VOICE Project. 2007. “Mark-up conventions”. VOICE Transcription Conventions 2.1. Availability of VOICE:

VOICE Project website.

Volk, Martin & Gerold Schneider. 1998. “Comparing a statistical and a rule-based tagger for German”. Proceedings of KONVENS-98, 1–4.

Widdowson, Henry. 1997. “EIL, ESL, EFL: global issues and local interests”. World Englishes 16: 135–146.