WrELFA corpus

The Written ELF in Academic Settings (WrELFA) project is a new opening in the ELF research field. The WrELFA corpus of written academic ELF was completed in 2015, drawn from academic genres including the institutional (PhD examiner reports), professional (unedited research papers) and digital media (research blogs).

Writing and publishing in English is all-important in the making of academic careers. Many academics for whom English is not the first language worry about their language – often about its correctness and 'native-likeness'. Yet the majority of the readers of academic research publications are not native speakers of English. Effective academic text nevertheless hinges more on the quality of its contents, the strength of its argument, and the coherence of its rhetorical organisation than on the details of correctness relative to Standard English. Reader responses to such matters are culturally variable, as was shown in Contrastive Rhetoric research in the 1990s.

The current world of academic writing and publishing is far more globalised than it was a decade or two ago. Yet we have no research evidence on the determinants of effectiveness in academic rhetoric in a world that is permeated by English as a lingua franca, and a constant flow of cultural influences from a variety of sources. Speed in publishing findings has also become a major issue in academia. New ways of making findings public are developing in a variety of online forms, which know no national or local boundaries.

Project WrELFA collects and analyses academic texts written in English as a lingua franca. The texts cover high-stakes genres in different fields, both published and unpublished. Among our target text types are evaluative reports, such as examiners' and peer reviewers' reports, and digital media such as research blogs. Our aim is to cover academic writing practices on a broad scale, ranging from texts circulated within academia to texts that reach the wider public.

Project director: Professor Anna Mauranen 
 

Corpus compilation began in late 2011 and was completed in early 2015. The WrELFA corpus consists of 1.5 million words drawn from three academic text types – unedited research papers (SciELF corpus, 759k words, 50% of total), PhD examiner reports (402k words, 26%), and research blogs (372k words, 24%). The target author is the academic user of English as a lingua franca (ELF), and the texts must not have undergone professional proofreading or checking by an English native speaker. It is thus a corpus of second-language use (SLU) in written scientific communication.

The corpus has been designed as a written complement to the spoken ELFA corpus with similar markup and metadata. WrELFA employs a broad binary categorisation of texts into the sciences (category “Sci”) and disciplines in social sciences and humanities (category “SSH”). Overall, the distribution of these categories is as follows:

Distribution of the broad binary categories in WrELFA:
category    no. of words   % of words
Sci              840 395          55%
SSH              692 933          45%
total          1 533 328         100%

Among the Sci texts, natural sciences are the best represented (63% of words) followed by medicine (22%) and agriculture & forestry (11%). The SSH texts are divided between social sciences (44%), humanities (36%) and behavioural sciences (18%).

Concerning first languages of the authors, at least 35 unique L1s are represented in the corpus (see a complete list of the L1s represented below), along with an undetermined number of blog commenters whose identities cannot be verified. As with other ELF corpora, English native speakers are included within the texts, in this case among the blog commenters and PhD examiners. Finnish is the largest L1, but with only 14% of total words, and the top 10 L1 categories (including unidentified blog commenters) make up 76% of the corpus:

The ten largest L1 categories in WrELFA:
     author L1           no. of words   % of words
 1   Finnish                  210 328          14%
 2   Czech                    148 880          10%
 3   English                  115 472           8%
 4   French                   112 568           7%
 5   Spanish                  111 636           7%
 6   Italian                  107 739           7%
 7   Swedish                   94 632           6%
 8   Chinese                   90 916           6%
 9   (blog commenters)         85 666           6%
10   Russian                   82 120           5%
     total                  1 159 957          76%
     other L1s                373 371          24%
     WrELFA total           1 533 328         100%
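
Each percentage in the table is the category's share of the corpus total, rounded to a whole number. As a minimal sketch, the figures can be rechecked in Python. The word counts are copied from the table; the names `top10`, `share` and `CORPUS_TOTAL` are illustrative only and are not part of the corpus tooling.

```python
# Word counts copied from the table above; all names are illustrative.
CORPUS_TOTAL = 1_533_328

top10 = {
    "Finnish": 210_328,
    "Czech": 148_880,
    "English": 115_472,
    "French": 112_568,
    "Spanish": 111_636,
    "Italian": 107_739,
    "Swedish": 94_632,
    "Chinese": 90_916,
    "(blog commenters)": 85_666,
    "Russian": 82_120,
}


def share(words, total=CORPUS_TOTAL):
    """Return the share of `total` as a percentage rounded to a whole number."""
    return round(100 * words / total)


top10_total = sum(top10.values())
other_l1s = CORPUS_TOTAL - top10_total

for l1, words in top10.items():
    print(f"{l1:<20}{words:>10,}{share(words):>5}%")
print(f"{'total':<20}{top10_total:>10,}{share(top10_total):>5}%")
print(f"{'other L1s':<20}{other_l1s:>10,}{share(other_l1s):>5}%")
```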

Finally, the authors in the corpus represent different stages of an academic career. Following the ELFA corpus categories, we distinguish between research students (completed master’s degree, but not yet a PhD), junior staff (post-doctoral researchers and early career academics) and senior staff (professors and senior scholars). In WrELFA as a whole, junior staff are best represented with 42% of total words. Senior staff contribute 30% of words, followed by research students with 11%. The remaining 17% include unknown roles (including blog commenters) and bloggers or PhD examiners who are employed outside the university sector.

WrELFA consists of three components:

  • SciELF corpus: a stand-alone subcorpus of research papers that have not undergone professional proofreading or checking by a native speaker of English.
  • PhD examiner reports: reports on submitted doctoral theses, collected from six faculties of the University of Helsinki over a two-year period.
  • Academic research blogs: a sample of posts and discussions from 40 different research blogs, all of which are maintained by L2 users of English.

If you want to acquire the WrELFA corpus, please send an e-mail with a very brief description of the intended use to our research assistant Nina Mikusova.

When citing the WrELFA corpus in publications, we recommend the following citation:

WrELFA 2015. The Corpus of Written English as a Lingua Franca in Academic Settings. Director: Anna Mauranen. Compilation manager: Ray Carey. http://www.helsinki.fi/elfa (last access).

Our first and foremost thanks are due to the authors of the examiner reports and SciELF papers, who generously gave us permission to use their texts as research materials. We also thank the deans of the six faculties at the University of Helsinki who helped us obtain the examiner reports from the Faculty Offices. Thanks are also due to the Language Services at the University of Helsinki for their help in finding articles at their pre-language revision stage. The contribution of our international partners to the SciELF corpus is also gratefully acknowledged.

The WrELFA corpus was financed by the GlobE Helsinki project and the ChangE Helsinki project, both of which have been funded by the Academy of Finland. Special thanks go to research assistants Ruut Kosonen and Jani Ahtiainen, who made a major contribution to collecting the data, processing the texts, and ensuring the quality of the corpus.

The WrELFA corpus includes more than 500 unique authors representing at least 37 first languages. The 17 most-represented L1 categories (i.e. those with at least 10,000 words) make up 95% of words in the corpus and are listed below. These figures include the large group of unidentified commenters on research blogs.

author L1                  words      %
Finnish                  210 328    14%
Czech                    148 880    10%
English                  115 472     8%
French                   112 568     7%
Spanish                  111 636     7%
Italian                  107 739     7%
Swedish                   94 632     6%
Chinese                   90 916     6%
(blog commenters)         85 666     6%
Russian                   82 120     5%
Dutch                     64 174     4%
German                    62 774     4%
Portuguese (Brazil)       62 048     4%
Romanian                  32 985     2%
Norwegian                 30 830     2%
Bengali                   23 957     2%
Danish                    14 145     1%
total                  1 450 870    95%
other L1s                 82 458     5%
corpus total           1 533 328   100%

The "other L1" category above includes bi-/multilingual authors as well as the following first languages:

  • Afrikaans
  • Arabic
  • Cantonese
  • Estonian
  • Filipino
  • Flemish
  • Georgian
  • Greek
  • Hebrew
  • Hindi
  • Hungarian
  • Japanese
  • Marathi
  • Neo-Aramaic
  • Polish
  • Serbian
  • Serbo-Croatian
  • Tamil
  • Turkish
  • Urdu
  • Vietnamese

Ethical considerations

Each of the three text types raises its own ethical considerations. Insofar as blogs are public texts with varying copyright restrictions, authors are identified and the blogs are fully cited in the file headers to honour the author’s ownership of the content. Bloggers were not contacted for permission to use their texts, as the texts are fully cited according to good academic practice. The source of the blog text is referenced and linked in each of the corpus headers. As these are fully attributed, public texts, bloggers’ names are not anonymised in the comment sections of the blogs.

In the case of blog comments, guest commenters are identified by an online handle which is sometimes (presumably) the commenter’s full name and at other times fully anonymous. In any case, all references to the commenter’s handle have been replaced with a reference code (C1, C2, etc.) both in the running text and in the comment headers. However, as this information is freely available online, the handles corresponding to each commenter code are listed in quotes in the file headers. In this way, running text that might be used as a linguistic example contains an anonymous code, with the original handle referenced indirectly.
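
The replacement of handles with reference codes can be sketched as follows. This is only an illustration of the idea, not the script used for WrELFA: the function name `pseudonymise`, the input format (a list of handle and text pairs in thread order) and the sample handles are all assumptions made for the example.

```python
import re


def pseudonymise(comments):
    """Replace commenter handles with codes C1, C2, ... (hypothetical helper).

    `comments` is a list of (handle, text) pairs in thread order.
    Returns the anonymised comments and a code-to-handle mapping of the
    kind that could be listed in a file header.
    """
    # Assign codes in order of first appearance in the thread.
    codes = {}
    for handle, _ in comments:
        if handle not in codes:
            codes[handle] = f"C{len(codes) + 1}"
    if not codes:
        return [], {}

    # Try longer handles first so that a handle contained in another
    # handle is not replaced inside it.
    by_length = sorted(codes, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(h) for h in by_length))

    anonymised = [
        (codes[handle], pattern.sub(lambda m: codes[m.group(0)], text))
        for handle, text in comments
    ]
    header = {code: handle for handle, code in codes.items()}
    return anonymised, header


# Invented sample thread, for illustration only.
thread = [
    ("stats_fan", "Nice post!"),
    ("Jane Doe", "I agree with stats_fan here."),
    ("stats_fan", "Thanks, Jane Doe."),
]
anonymised, header = pseudonymise(thread)
print(anonymised)
print(header)
```

Returning the mapping separately mirrors the arrangement described above, in which the running text carries only the code while the original handle is recorded in the file header.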

Turning to the PhD examiner reports, these are part of the public examination process and in themselves are public documents. However, they are not widely circulated and examiners would not normally expect that these sometimes unpolished texts would be of interest outside their practical function. For this reason, examiners were contacted via email for permission to use their texts and for L1 self-reporting. In cases where two examiners co-authored a report, both were contacted for permission and the text was included if both agreed.

Similarly, in the case of SciELF, permission and L1 self-reporting were sought from the first and/or corresponding author of the paper. While first author and corresponding author may not always coincide, our intent was to obtain permission from the researcher who was primarily responsible for the content. The SciELF texts with Finnish L1 authors were obtained through University of Helsinki Language Services, where employees of the university can submit their texts for professional language revision. All other texts and the accompanying permissions were obtained by our international partners and forwarded to Helsinki for processing.

In addition to obtaining these author permissions, examiner reports and SciELF articles have been thoroughly anonymised. We have tried to ensure that authors’ identities and affiliations cannot be gleaned from the texts themselves. Anonymisation tags replace the names of all authors and co-authors; names of examiners and candidates; names of supervisors and close collaborators; all titles of identifying publications; institutional affiliations and locations; and identifying bibliographic entries. In many cases, the names of research projects, funding bodies, or proprietary software have also been anonymised. In the case of SciELF articles, all personal names have been anonymised from Acknowledgement sections, and all references to co-authors and their bibliographic entries are also anonymised.

Criteria for inclusion in the corpus

Text collection began in late 2011 with Jan. 2011 as the focal starting date for texts. Blog posts were collected from the start of Jan. 2011, though texts from 2010 and 2012 are included when the number of suitable posts from a particular blog in 2011 was insufficient. The primary criterion for inclusion was that the author is neither a native speaker of English nor based professionally in an L1-English country. In addition, we did not include science journalism, which is mostly written by professional journalists, whether on digital platforms or not. Thus, all 40 bloggers are independent academics who blog about their own fields of interest and are based outside of an “Inner Circle” country.

Most of the included blogs were identified through researchblogging.org, a research blog aggregator, and the blogger’s L1 was typically determined through open online resources (“About Me”, CVs, LinkedIn, etc.). Bloggers were contacted when their first language status was in doubt. Only posts related to the blogger’s research field were considered (especially those dealing with published research), including conference reviews, discussions on professional life, and “metablogging” about research blogging itself. This was intended to exclude blog posts on non-professional interests or personal hobbies. Preference was given to longer, more developed posts and to those which included comments and discussion.

In selecting individual blog posts, a rough target was set of 5000 to 8000 words from each blogger, with no more than eight posts per blogger included (it was found that if 5000 words could not be reached with eight posts, the texts were not long enough to warrant additional processing time). As a general rule, eight complete posts were collected from each blog, with an average of 7,600 words per blog.
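
The word and post limits described above can be sketched roughly as follows. This is a minimal illustration in Python, not the procedure actually used in compiling WrELFA: the function name `select_posts`, its parameters, and the strategy of taking the longest posts first are assumptions made for the example. In practice the choice of posts also depended on topic and on the presence of comments.

```python
def select_posts(word_counts, max_posts=8, min_words=5000, max_words=8000):
    """Pick whole posts for one blogger (hypothetical helper).

    `word_counts` holds the word count of each candidate post.
    Returns the word counts of the chosen posts, or None when the
    lower target cannot be reached within `max_posts` posts.
    """
    chosen = []
    total = 0
    # Assumption: longer posts are taken first.
    for words in sorted(word_counts, reverse=True):
        # Stop at the post limit, or once the upper target is reached.
        if len(chosen) == max_posts or total >= max_words:
            break
        chosen.append(words)  # posts are kept whole, never truncated
        total += words
    return chosen if total >= min_words else None


print(select_posts([1000] * 10))   # eight posts of 1000 words each
print(select_posts([400] * 12))    # None: eight posts give only 3200 words
```

Because posts are kept complete, the total for a blogger can overshoot the upper target slightly, which is consistent with the target being described as rough.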

The PhD examiner reports were acquired as scanned PDFs of the original documents. After conversion to XML, the texts were carefully checked side by side with the original PDFs to ensure the accuracy of the XML. As this is a virtually unexplored genre of academic writing, it was decided to include examiner reports written by native speakers of English (as well as L2 users based in L1 English countries).

Finally, the SciELF texts were collected based on two main requirements: the author(s) should not have English as an L1, and the text should not have undergone professional proofreading services or language checking by an English native speaker. The large majority of these texts were obtained as drafts in a word processor format. In a few cases, we accepted published articles under the condition that the authors could verify that language revisions had not taken place as a condition for publication.

All files were grouped into a rough binary categorisation of the sciences (Sci) and social sciences & humanities (SSH). This categorisation is by no means unproblematic, but it tends to work best for the big picture, and a more fine-grained division would not be justified for a corpus of this size. In some disciplines, the appropriate category was not always obvious. In particular, there are 24 SciELF articles dealing with economics, which is the best represented discipline in SciELF with 108,552 words. In consultation with our partners, we classified these as Sci when the texts dealt with e.g. statistical modelling, big data, or heavily mathematical methodologies (n=10). On the other hand, economics texts that relied mainly on interviews, questionnaires, and more qualitative methods were classed as SSH (n=14).

On the more technical side, the corpus was compiled directly into TEI-compliant XML, with programming scripts for converting the master database into a minimally annotated .txt format (plain text) for use in concordance software, as well as an .rtf version (Rich Text Format) for word processing software. Unlike the plain text version, the .rtf output preserves text formatting (e.g. bold, italic) and active hyperlinks from blogs.
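
A conversion from TEI XML to plain text might look roughly like the following sketch, which uses only the Python standard library. It is not the WrELFA conversion script: the sample document, the function name `tei_to_plain_text`, and the assumption that running text sits in `<p>` elements inside `<body>` are invented for illustration, and the actual WrELFA markup may differ.

```python
import xml.etree.ElementTree as ET

# The standard TEI P5 namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample document, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc><titleStmt><title>Sample post</title></titleStmt></fileDesc>
  </teiHeader>
  <text><body>
    <p>This is a <hi rend="italic">hypothetical</hi> blog post.</p>
    <p>It has two
       paragraphs.</p>
  </body></text>
</TEI>"""


def tei_to_plain_text(xml_string):
    """Return the body paragraphs as plain text, one paragraph per line."""
    root = ET.fromstring(xml_string)
    paragraphs = []
    for p in root.iterfind(".//tei:body//tei:p", TEI_NS):
        # itertext() drops inline tags such as <hi> but keeps their text.
        text = "".join(p.itertext())
        # Collapse line breaks and indentation left over from the XML.
        paragraphs.append(" ".join(text.split()))
    return "\n".join(paragraphs)


print(tei_to_plain_text(SAMPLE))
```

Formatting such as italics is discarded here, as in the plain text output described above; an .rtf export would instead need to map elements such as `<hi>` to the corresponding formatting.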