The ELFA corpus is freely available to researchers in both plain text and XML formats.
The ELFA corpus was completed in 2008 and its development work is ongoing. Altogether, the corpus contains 1 million words of transcribed spoken academic ELF (approximately 131 hours of recorded speech). The data consists of both recordings and their transcripts, which are freely available to researchers. The recordings were made at the University of Tampere, the University of Helsinki, Tampere University of Technology, and Helsinki University of Technology.
The speech events in the corpus include both monologic events, such as lectures and presentations (33% of data), and dialogic/polylogic events, such as seminars, thesis defences, and conference discussions, which have been given an emphasis in the data (67%).
As for the disciplinary domains, the ELFA corpus is composed of social sciences (29% of the recorded data), technology (19%), humanities (17%), natural sciences (13%), medicine (10%), behavioural sciences (7%), and economics and administration (5%) (see also Mauranen, Hynninen & Ranta (2010), English as an academic lingua franca: The ELFA project, English for Specific Purposes 29, 183-190).
The speakers in ELFA also represent a wide range of first language backgrounds: the data comprises approximately 650 speakers with 51 different first languages from several continents (see the complete list below). Native English speakers account for 5% of the speech. Considering that the recordings were made at Finnish-speaking universities, the share of speech by Finnish mother tongue speakers is also relatively low, at 28.5%.
As a general principle, all data in the corpus is authentic in the sense that it is not elicited for research purposes but occurs naturally. It consists of complete speech events, i.e. complete individual sessions. Long monologues (e.g. conference presentations or course lectures) by native speakers of English are not transcribed, but if English native speakers are present in groups, this is coded. Sessions in which all speakers share an L1 are not included, nor are English language courses.
Compilation criteria are external, that is, they are not determined on the basis of linguistic register features, but by socially-based definitions of the prominent genres of the discourse community. The basic unit of sampling is the speech event type. This is a looser concept than genre, and therefore likely to be more appropriate, as the discourses represent a variety of events, some of which are much more firmly established as genres (e.g. lectures) than others (e.g. workshops).
Two fundamental selection criteria are genre/event type and discipline, whose primary categorisation follows the labels and descriptions given by the relevant discourse community (folk genres), for instance lecture in political history or seminar in economics. Other external criteria involve institutional hierarchies, which affect the speakers' interpersonal relations. Both peer sessions (student groups, conference presentations) and groups mixed with respect to academic status (lectures, seminars, other sessions with teacher + students) are included, with emphasis on the asymmetrical event types since they dominate the discourse community.
Most recordings are single events, but some seek to observe the chain-like nature of academic courses, to track changes in group cohesion and familiarity effects; for instance, we can expect more initial facework in a new group than in seminar or lecture sessions from later stages when degrees of familiarity/formality have been negotiated.
The main selection criteria for event types are related to their perceived importance in one way or another:
The selection of disciplines has been limited in part by their availability in the universities concerned, and in part by practical considerations: a broad division based on disciplinary domain suits the size of the corpus as well as most research questions likely to arise at this stage.
The only language-internal classification applied has been the distinction between monologic and dialogic speech. Both types are included, with an emphasis on dialogic events.
Finally, more information about the speakers has also been included in file headers, such as age group, gender, nationality and mother tongue.
The ELFA corpus project was funded by the Academy of Finland from 2004 to 2007.
The ELFA corpus is available in Kielipankki, the Language Bank of Finland. See CSC's ELFA corpus page for more information about the corpus and how to apply for access rights if needed. In short: the text corpus files (transcriptions) are available for download directly on Kielipankki's download page under the CC BY License; the audio data is available under restricted access, and you can find more about applying for access rights at Kielipankki's how-to page.
When citing the ELFA corpus in publications, we recommend the following citation:
ELFA 2008. The Corpus of English as a Lingua Franca in Academic Settings. Director: Anna Mauranen. http://www.helsinki.fi/elfa (date of last access).
Project leader: Anna Mauranen (University of Helsinki)
Team members / researchers: Elina Ranta (University of Tampere), Maria Metsä-Ketelä (University of Tampere)
Current assistants: Svetlana Vetchinnikova (University of Helsinki)
Former assistants: Mari Sihvonen (University of Tampere), Pirjo Surakka-Cooper (University of Helsinki), Niina Hynninen (University of Helsinki), Ray Carey (University of Helsinki), along with a number of short-term student assistants
The ELFA corpus includes roughly 650 speakers representing 51 first languages. The distribution of tokens among speakers of various first languages is as follows:
Languages | Tokens | % of tokens |
---|---|---|
Finnish | 301632 | 28.5 |
German | 85996 | 8.1 |
Russian | 69905 | 6.6 |
Swedish | 67485 | 6.4 |
Dutch | 58823 | 5.6 |
English | 53609 | 5.1 |
Danish | 39957 | 3.8 |
French | 37918 | 3.6 |
Italian | 31124 | 2.9 |
Romanian | 21420 | 2.0 |
Spanish | 20984 | 2.0 |
Portuguese | 19533 | 1.8 |
Polish | 19134 | 1.8 |
Lithuanian | 18215 | 1.7 |
Norwegian | 14984 | 1.4 |
Catalan | 14512 | 1.4 |
Bengali | 13722 | 1.3 |
Croatian | 13674 | 1.3 |
Czech | 13384 | 1.3 |
Akan/Twi | 12515 | 1.2 |
Somali | 12194 | 1.2 |
unknown | 11779 | 1.1 |
Swahili | 10910 | 1.0 |
Dagbani | 10237 | 1.0 |
Arabic | 9243 | 0.9 |
Persian/Farsi | 9242 | 0.9 |
Hindi | 8299 | 0.8 |
Chinese/Cantonese | 7667 | 0.7 |
Japanese | 6720 | 0.6 |
Kikuyu | 6324 | 0.6 |
Bulgarian | 5459 | 0.5 |
Hungarian | 4053 | 0.4 |
Estonian | 3193 | 0.3 |
Igbo | 3150 | 0.3 |
Greek | 2486 | 0.2 |
Dangme | 2364 | 0.2 |
Kihaya | 1936 | 0.2 |
Urdu | 1846 | 0.2 |
Uzbek | 1726 | 0.2 |
Nepali | 1705 | 0.2 |
Turkish | 1590 | 0.2 |
Efilo | 989 | 0.1 |
Yoruba | 989 | 0.1 |
Hausa | 989 | 0.1 |
Oromo | 940 | 0.1 |
Amharic | 749 | 0.07 |
Latvian | 666 | 0.06 |
Slovakian | 548 | 0.05 |
Icelandic | 337 | 0.03 |
Hebrew | 260 | 0.02 |
Berber | 113 | 0.01 |
Welsh | 102 | 0.01 |
The above figures were derived from the XML version of the ELFA corpus. Word tokens have been counted independently of the header metadata and all XML mark-up, with the exception of anonymised names, which have been counted as tokens. When a speaker has reported more than one first language, that speaker's tokens have been counted under each of those languages. Thus, the total number of tokens presented here is greater than in the corpus itself.
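The counting rules described above can be illustrated with a minimal Python sketch. It is not the script used to produce the table. The element and attribute names (`speaker`, `u`, `who`, `l1`, `name`), the semicolon-separated list of first languages, and the sample transcript are all hypothetical stand-ins, since the actual ELFA XML schema is not described here. Tokens are taken to be whitespace-separated strings.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical transcript: the tag and attribute names are assumptions made
# for illustration only and are not taken from the real ELFA XML files.
SAMPLE = """
<transcript>
  <header>
    <speaker id="S1" l1="Finnish"/>
    <speaker id="S2" l1="Swedish;Finnish"/>
  </header>
  <body>
    <u who="S1">well <pause/> i think <name>NAME</name> said so</u>
    <u who="S2">yes <overlap>that is</overlap> right</u>
  </body>
</transcript>
"""


def utterance_tokens(utterance):
    """Count whitespace-separated tokens in an utterance, ignoring mark-up.

    itertext() yields only text content, so tags contribute nothing, while
    the placeholder inside an anonymised <name> element counts as a token.
    """
    return len("".join(utterance.itertext()).split())


def tokens_per_l1(xml_text):
    """Credit each speaker's tokens to every first language they reported."""
    root = ET.fromstring(xml_text)
    # The header is read for speaker metadata only; its text is never counted.
    languages = {
        sp.get("id"): sp.get("l1", "unknown").split(";")
        for sp in root.iter("speaker")
    }
    counts = Counter()
    for u in root.iter("u"):
        n = utterance_tokens(u)
        for lang in languages.get(u.get("who"), ["unknown"]):
            counts[lang] += n
    return counts


def corpus_total(xml_text):
    """Count every utterance token exactly once."""
    root = ET.fromstring(xml_text)
    return sum(utterance_tokens(u) for u in root.iter("u"))


if __name__ == "__main__":
    per_l1 = tokens_per_l1(SAMPLE)
    print(dict(per_l1))
    print(sum(per_l1.values()), "tokens across languages;",
          corpus_total(SAMPLE), "tokens in the transcript")
```

In this toy transcript the second speaker reports two first languages, so that speaker's tokens are credited to both. The per-language counts therefore add up to more than the transcript's own token count, which is the same effect noted for the table.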
The proportion of speech by Finnish native speakers was kept to 28.5%. The proportion of native/bilingual English speakers amounts to 5.1% of speech in the corpus. Among English speakers, several regional varieties of English are represented: