SciELF corpus

The SciELF corpus consists of research papers that have not been professionally proofread or checked by a native speaker of English.

All the papers are written by L2 users of English, and most are final drafts of unpublished manuscripts. It is thus a corpus of second-language use (SLU) in written scientific communication. Several international partners have contributed material to this corpus, which comprises 150 papers (759,300 words) by authors with ten different L1 backgrounds. The breakdown of these L1s is as follows:

The ten L1 categories in the SciELF corpus:
	first author's L1	no. of articles	no. of words	% of words	avg. words/article
1	Finnish	25	123153	16%	4926
2	Czech	22	109173	14%	4962
3	French	16	91186	12%	5699
4	Chinese	21	84807	11%	4038
5	Spanish	13	79038	10%	6080
6	Russian	13	71376	9%	5490
7	Swedish	13	60060	8%	4620
8	Italian	11	58685	8%	5335
9	Portuguese (Brazil)	12	56625	7%	4719
10	Romanian	4	25197	3%	6299
	Total	150	759300	100%	5062

In addition, we attempted to compile a sample of papers balanced between the sciences (labelled ‘Sci’) and the social sciences and humanities (labelled ‘SSH’). The article counts are close (78 Sci, 72 SSH), but the SSH texts turned out to be much longer on average, so SSH accounts for the larger share of the words:

Distribution of the broad binary categories in the SciELF corpus:
category	no. of articles	no. of words	% of total words	avg. words/article
Sci	78	326463	43%	4185
SSH	72	432837	57%	6012
Total	150	759300	100%	5062
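For readers who want to check or reproduce the derived columns, the following is a minimal sketch of the arithmetic, assuming the percentages and averages are simply rounded to the nearest whole number. The names `counts` and `summarise` are illustrative only and are not part of any SciELF tooling.

```python
# Figures copied from the table above: (no. of articles, no. of words).
counts = {
    "Sci": (78, 326463),
    "SSH": (72, 432837),
}


def summarise(counts):
    """Return {category: (% of total words, avg. words/article)}.

    Both values are rounded to the nearest integer, which is an
    assumption about how the published columns were produced.
    """
    total_words = sum(words for _, words in counts.values())
    return {
        name: (round(100 * words / total_words), round(words / articles))
        for name, (articles, words) in counts.items()
    }


summary = summarise(counts)
for name, (share, mean_length) in summary.items():
    # Tab-separated, mirroring the layout of the table.
    print(f"{name}\t{share}%\t{mean_length}")
```

The same function can be applied to the L1 table by replacing `counts` with the article and word counts for each first-author L1.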

Among the 326,463 words in the Sci category, most are drawn from the natural sciences (79%) and medicine (18%). The 432,837 words in SSH are drawn from social sciences (45%), humanities (34%), and behavioural sciences (21%). The academic roles of the first authors in SciELF are distributed as follows:

first author role	no. of articles	no. of words	% of words
Junior staff	86	418366	55%
Senior staff	34	172075	23%
Research student	17	107998	14%
Unknown	11	41116	5%
Masters student	2	19745	3%
Total	150	759300	100%

The SciELF corpus would not have been possible without the generous contributions of our international partners, who obtained texts and author permissions in their respective home countries. We gratefully acknowledge the following researchers:

Marina Bondi and Anna Stermieri, University of Modena and Reggio Emilia
Maria Kuteeva and Lisa McGrath, University of Stockholm
Pilar Mur-Dueñas, University of Zaragoza
Laura Muresan and Mirela Bardi, Bucharest University of Economic Studies
Lene Nordrum, Lund University
Wei Ren, Guangdong University of Foreign Studies
Elizabeth Rowley-Jolivet, Université d’Orléans
Tony Berber Sardinha, Catholic University of São Paulo
Irina Shchemeleva, St. Petersburg Higher School of Economics
Renáta Tomášková, University of Ostrava
Ying Wang, China Three Gorges University

Suggested citation

SciELF 2015. The SciELF Corpus. Director: Anna Mauranen. Compilation manager: Ray Carey. (last access).