The ELFA project

WrELFA corpus compilation principles

Ethical considerations

Each of the three text types raises its own ethical considerations. Insofar as blogs are public texts with varying copyright restrictions, authors are identified and the blogs are fully cited in the file headers to honour the author’s ownership of the content. Bloggers were not contacted for permission to use their texts, as they are fully cited according to good academic practice. The source of the blog text is referenced and linked in each of the corpus headers. As these are fully attributed, public texts, bloggers’ names are not anonymised in the comment sections of the blogs.

In the case of blog comments, guest commenters are identified by an online handle which is sometimes (presumably) the commenter’s full name and at other times fully anonymous. In any case, all references to the commenter’s handle have been replaced with a reference code (C1, C2, etc.) both in the running text and in the comment headers. However, as this information is freely available online, the handles corresponding to each commenter code are listed in quotes in the file headers. In this way, running text that might be used as a linguistic example contains an anonymous code, with the original handle referenced indirectly.
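The replacement of handles with reference codes can be illustrated with a short sketch. This is not the project's own script: the function name `replace_handles` and the example handles are hypothetical, and the sketch assumes that each file's handles are already known, unique, and listed in order of first appearance.

```python
import re

def replace_handles(text, handles):
    """Replace each commenter handle in `text` with a code (C1, C2, ...).

    `handles` is an ordered list of unique handles (an assumption of this
    sketch). Returns the rewritten text and a code-to-handle mapping that
    could be listed in a file header.
    """
    if not handles:
        return text, {}
    codes = {handle: f"C{i}" for i, handle in enumerate(handles, start=1)}
    # Try longer handles first so a handle containing another is matched whole.
    alternatives = sorted(codes, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(h) for h in alternatives) + r")(?!\w)"
    )
    # Single pass over the text, so inserted codes are never re-matched.
    new_text = pattern.sub(lambda m: codes[m.group(0)], text)
    header = {code: handle for handle, code in codes.items()}
    return new_text, header
```

Called on a comment such as "Thanks, JaneDoe! I agree with bob42 here." with the handles `["JaneDoe", "bob42"]`, the function returns the text with C1 and C2 in place of the handles, together with the mapping that links each code back to its handle.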

Turning to the PhD examiner reports, these are part of the public examination process and in themselves are public documents. However, they are not widely circulated and examiners would not normally expect that these sometimes unpolished texts would be of interest outside their practical function. For this reason, examiners were contacted via email for permission to use their texts and for L1 self-reporting. In cases where two examiners co-authored a report, both were contacted for permission and the text was included if both agreed.

Similarly, in the case of SciELF, permission and L1 self-reporting were sought from the first and/or corresponding author of the paper. While first author and corresponding author may not always coincide, our intent was to obtain permission from the researcher who was primarily responsible for the content. The SciELF texts with Finnish L1 authors were obtained through University of Helsinki Language Services, where employees of the university can submit their texts for professional language revision. All other texts and the accompanying permissions were obtained by our international partners and forwarded to Helsinki for processing.

In addition to obtaining these author permissions, examiner reports and SciELF articles have been thoroughly anonymised. We have tried to ensure that authors’ identities and affiliations cannot be gleaned from the texts themselves. Anonymisation tags replace the names of all authors and co-authors; names of examiners and candidates; names of supervisors and close collaborators; all titles of identifying publications; institutional affiliations and locations; and identifying bibliographic entries. In many cases, the names of research projects, funding bodies, or proprietary software have also been anonymised. In the case of SciELF articles, all personal names have been anonymised from Acknowledgement sections, and all references to co-authors and their bibliographic entries are also anonymised.

Criteria for inclusion in the corpus

Text collection began in late 2011 with Jan. 2011 as the focal starting date for texts. Blog posts were collected from the start of Jan. 2011, though texts from 2010 and 2012 are included when the number of suitable posts from a particular blog in 2011 was insufficient. The primary criterion for inclusion was that the author is neither a native speaker of English nor based professionally in an L1-English country. In addition, we did not include science journalism, which is mostly written by professional journalists, whether on digital platforms or not. Thus, all 40 bloggers are independent academics who blog about their own fields of interest and are based outside of an “Inner Circle” country.

Most of the included blogs were identified through a research blog aggregator, and the blogger’s L1 was typically determined through open online resources (“About Me”, CVs, LinkedIn, etc.). Bloggers were contacted when their first language status was in doubt. Only posts related to the blogger’s research field were considered (especially those dealing with published research), including conference reviews, discussions on professional life, and “metablogging” about research blogging itself. This was intended to exclude blog posts on non-professional interests or personal hobbies. Preference was given to longer, more developed posts and those which included comments and discussion.

In selecting individual blog posts, a rough target was set of 5,000 to 8,000 words from each blogger, with no more than eight posts per blogger included (it was found that if 5,000 words could not be reached with eight posts, the texts were not long enough to warrant additional processing time). As a general rule, eight complete posts were collected from each of the blogs, for an average of 7,600 words per blog.
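The selection rule for blog posts can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used: the function name `select_posts` is hypothetical, posts are assumed to be listed in order of preference, and the decision to stop once the upper target is reached is an assumption, since the target was only a rough one and complete posts were always collected.

```python
MIN_WORDS = 5000   # lower end of the rough target per blogger
MAX_WORDS = 8000   # upper end of the rough target per blogger
MAX_POSTS = 8      # no more than eight posts per blogger

def select_posts(word_counts):
    """Pick whole posts from one blog, given their word counts.

    `word_counts` lists posts in order of preference (an assumption).
    Returns (indices of selected posts, total words), or ([], 0) when
    eight posts cannot reach the lower target and the blog is skipped.
    """
    selected = []
    total = 0
    for index, count in enumerate(word_counts):
        # Stop at the post limit or once the upper target has been reached.
        if len(selected) == MAX_POSTS or total >= MAX_WORDS:
            break
        selected.append(index)
        total += count
    if total < MIN_WORDS:
        return [], 0
    return selected, total
```

For a blog whose posts each run to 950 words, the sketch selects eight posts for a total of 7,600 words. A blog with posts of 400 words each is skipped, because eight posts yield only 3,200 words.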

The PhD examiner reports were acquired as scanned PDFs of the original documents. After conversion to XML, each text was carefully checked side by side against the original PDF to ensure the accuracy of the XML texts. As this is a virtually unexplored genre of academic writing, it was decided to include examiner reports written by native speakers of English (as well as L2 users based in L1 English countries).

Finally, the SciELF texts were collected based on two main requirements: the author(s) should not have English as an L1, and the text should not have undergone professional proofreading services or language checking by an English native speaker. The large majority of these texts were obtained as drafts in a word processor format. In a few cases, we accepted published articles under the condition that the authors could verify that language revisions had not taken place as a condition for publication.

All files were grouped into a rough binary categorisation of the sciences (Sci) and social sciences & humanities (SSH). This categorisation is by no means unproblematic, but it tends to work best for the big picture, and a more fine-grained division would not be justified for a corpus of this size. In some disciplines, the appropriate category was not always obvious. In particular, there are 24 SciELF articles dealing with economics, which is the best-represented discipline in SciELF with 108,552 words. In consultation with our partners, we classified these as Sci when the texts dealt with e.g. statistical modelling, big data, or heavily mathematical methodologies (n=10). On the other hand, economics texts that relied mainly on interviews, questionnaires, and more qualitative methods were classed as SSH (n=14).

On the more technical side, the corpus is being compiled directly into TEI-compliant XML, with programming scripts for converting the master database into a minimally annotated .txt format (plain text) for use in concordance software, as well as an .rtf version (Rich Text Format) for word processing software. Unlike the plain text version, the .rtf output preserves text formatting (e.g. bold, italic) and active hyperlinks from blogs.
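As an illustration of the conversion to plain text, the following sketch extracts the running text of a TEI document with Python's standard library. It is not the project's conversion script: the function name `tei_to_plain_text` and the sample document are hypothetical, and the sketch assumes that the running text sits in `<p>` elements inside `<body>`, which may not match the corpus markup.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"  # standard TEI P5 namespace

# Hypothetical sample document, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc></teiHeader>
  <text><body>
    <p>This is <hi rend="bold">bold</hi> text.</p>
    <p>Second   paragraph.</p>
  </body></text>
</TEI>"""

def tei_to_plain_text(xml_string):
    """Return the body paragraphs as plain text, one paragraph per line."""
    root = ET.fromstring(xml_string)
    body = root.find(f".//{{{TEI}}}body")
    if body is None:
        return ""
    paragraphs = []
    for p in body.iter(f"{{{TEI}}}p"):
        # Drop inline markup (e.g. bold) and normalise whitespace.
        text = " ".join("".join(p.itertext()).split())
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)

print(tei_to_plain_text(SAMPLE))
```

Because inline elements such as `<hi>` are flattened, formatting like bold or italic is lost in this output, which corresponds to the difference between the plain text and .rtf versions described above.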

