The ELFA Project
Updated 9 January 2012
ELFA corpus
On this page you can find:
- Description of the ELFA corpus project
- Compilers
- Information on current research in the project
- Documents related to the data collection and processing
- University of Tampere ELFA corpus site
- How to obtain the corpus
Description of the ELFA corpus project
The ELFA corpus is now completed. Altogether, the corpus contains 1 million words of transcribed spoken academic ELF (approximately 131 hours of recorded speech). The data consists of both recordings and their transcripts, which will be available to researchers on request. The recordings were made at the University of Tampere, the University of Helsinki, Tampere University of Technology, and Helsinki University of Technology.
The speech events in the corpus include both monologic events, such as lectures and presentations (33% of the data), and dialogic/polylogic events, such as seminars, thesis defences, and conference discussions (67%), which have been given more emphasis in the data.
As for the disciplinary domains, the ELFA corpus is composed of social sciences (29% of the recorded data), technology (19%), humanities (17%), natural sciences (13%), medicine (10%), behavioural sciences (7%), and economics and administration (5%).
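The domain shares above can be turned into rough recording-time estimates. The following is a minimal sketch, assuming that the percentages refer to shares of the approximately 131 hours of recorded speech; the names `TOTAL_HOURS`, `DOMAIN_SHARES` and `hours_per_domain` are illustrative and not part of the corpus or its documentation.

```python
# Rough estimate of recorded hours per disciplinary domain.
# Assumption: the published percentages are shares of recording time.

TOTAL_HOURS = 131  # approximate total, as stated in the corpus description

# Percentage of the recorded data per disciplinary domain
DOMAIN_SHARES = {
    "social sciences": 29,
    "technology": 19,
    "humanities": 17,
    "natural sciences": 13,
    "medicine": 10,
    "behavioural sciences": 7,
    "economics and administration": 5,
}


def hours_per_domain(shares, total_hours):
    """Return the estimated hours per domain, rounded to one decimal."""
    return {
        domain: round(total_hours * pct / 100, 1)
        for domain, pct in shares.items()
    }


# Sanity check: the published shares should cover the whole corpus.
assert sum(DOMAIN_SHARES.values()) == 100

estimates = hours_per_domain(DOMAIN_SHARES, TOTAL_HOURS)
for domain, hours in estimates.items():
    print(f"{domain}: approx. {hours} h")
```

The figures are estimates only, since both the total and the percentages are rounded in the description.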
The speakers in ELFA also represent a wide range of first language backgrounds: the data comprises approximately 650 speakers with 51 different first languages, ranging from African languages (e.g. Akan, Dagbani, Igbo, Kikuyu, Somali, Swahili) to Asian languages (e.g. Arabic, Bengali, Chinese, Hindi, Japanese, Persian, Turkish, Uzbek) and European languages (e.g. Czech, Danish, Dutch, French, German, Italian, Lithuanian, Polish, Portuguese, Russian, Romanian, Swedish). Native English speakers account for 5% of the speech. Moreover, considering that the recordings were made at Finnish-speaking universities, the share of speech by Finnish mother tongue speakers is relatively low at 28.5%.
As a general principle, all data in the corpus is authentic in the sense that it was not elicited for research purposes but occurred naturally. It consists of complete speech events, i.e. complete individual sessions. Native speakers of English are excluded when possible, but if they are present in groups, this is coded. Sessions in which all speakers share an L1 are not included, nor are English language courses.
Compilation criteria are external, that is, they are not determined on the basis of linguistic register features, but by socially-based definitions of the prominent genres of the discourse community. The basic unit of sampling is the speech event type. This is a looser concept than genre, and therefore likely to be more appropriate, as the discourses represent a variety of events, some of which are much more firmly established as genres (e.g. lectures) than others (e.g. workshops).
Two fundamental selection criteria are genre/event type and discipline. Their primary categorisation follows the labels and descriptions given by the relevant discourse community (folk genres), for instance lecture in political history or seminar in economics. Other external criteria involve institutional hierarchies, which affect the speakers' interpersonal relations: both peer sessions (student groups, conference presentations) and groups mixed with respect to academic status (lectures, seminars, other sessions with teacher + students) are included, with emphasis on the asymmetrical event types since they dominate the discourse community.
Most recordings are single events, but some seek to observe the chain-like nature of academic courses, to track changes in group cohesion and familiarity effects; for instance, we can expect more initial facework in a new group than in seminar or lecture sessions from later stages when degrees of familiarity/formality have been negotiated.
The main selection criteria for event types are related to their perceived importance in one way or another:
- prototypicality: the extent to which genres are shared and named by most disciplines, for example lectures, seminars, thesis defences, conference presentations.
- influence: genres that affect a large number of participants, for example introductory lecture courses, plenary lectures.
- prestige: genres with high status in the discourse community, for example guest lectures, plenary lectures at conferences.
The selection of disciplines has been limited in part by their availability in the universities concerned, and in part by practical considerations: a broad division based on disciplinary domain suits the size of the corpus as well as most research questions likely to arise at this stage.
The only language-internal classification applied has been the distinction between monologic and dialogic speech. Both types are included, with an emphasis on dialogic events.
Finally, further information about the speakers has also been recorded and made available, such as age group, gender, nationality and mother tongue.
The ELFA corpus project was funded by the Academy of Finland 2004–2007.
Compilers
Project leader: Anna Mauranen (University of Helsinki)
Team members / researchers: Elina Ranta (University of Tampere), Maria Metsä-Ketelä (University of Tampere)
Current assistants: Ray Carey (University of Helsinki), Niina Hynninen (University of Helsinki)
Former student assistants: Mari Sihvonen (University of Tampere), Pirjo Surakka-Cooper (University of Helsinki), along with a number of short-term student assistants
Current research
The project director, Anna Mauranen, has written extensively on the topic. In addition, the following PhD and MA students currently work on the ELFA corpus.
PhD students
Maria Metsä-Ketelä explores vague expressions in ELFA.
Elina Ranta's research focuses on universal verb-syntactic features in spoken ELF.
MA students
Kirsi Turunen looks into code-switching in the ELFA corpus.
* * *
Publications
Publications related to the ELFA corpus project can be found under Publications.
Documents
- ELFA corpus consent form
- ELFA corpus speaker information form
- ELFA corpus transcription guide
- Example of transcribed data
University of Tampere ELFA corpus site
Please visit the following external site hosted by the University of Tampere to learn more about the ELFA corpus project:
http://www.uta.fi/ltl/en/english/research/projects/elfa.html
How to obtain the corpus
The ELFA Text Corpus is now available for research!
The ELFA Text Corpus CD-ROM can now be ordered for a fee of 100 EUR per individual licence. The licence is for non-commercial use and is valid for 6 months at a time. The text transcripts cover the entire ELFA Corpus of approximately 1 million transcribed words, formatted as 165 files (in DOC and TXT form). ELFA files converted to XML are available on special request.
To obtain the ELFA Text Corpus CD-ROM and the licence, please contact anna.mauranen(at)helsinki.fi or niina.hynninen(at)helsinki.fi for further instructions.
We are currently negotiating with the data storage services in order to publish an online version of the ELFA Text Corpus, and possibly to provide the corpus for downloading.
Students at the Department of Modern Languages are encouraged to contact niina.hynninen(at)helsinki.fi should they wish to use the ELFA Text Corpus for their theses or course work.
News
- The compilation of a database of written academic ELF (WrELFA) has started!
- Check out Journal of Pragmatics (2011, Vol. 43, No. 4) for recent research on ELF
- Now published: Helsinki English Studies Special Issue on English as a Lingua Franca!
- New project on Global English (GlobE) continues the ELF theme
- Check out English for Specific Purposes (2010, Vol. 29, No. 3) for recent research on ELF
- ELF Forum 2008 thematic volume published: English as a Lingua Franca: Studies and Findings, edited by Mauranen A. & Ranta E.