NEST – a corpus in the brooding box

Anne-Line Graedler
Hedmark University College, Norway


This paper describes the design and compilation of data for the Norwegian-English Student Translation corpus (NEST). Still in the beginning stages, the prospective corpus will contain translations from Norwegian into English produced by students of English at Norwegian colleges and universities. A brief discussion of learner translation corpora is followed by an outline of the principles and procedures applied in the collection of texts, the contributing students, and the source texts for translation. Some samples of data from the collection of student translations are given as an illustration and indication of possible future research applications.

1. The conception of the project

Traditionally, translation into the foreign language has been a component of advanced language study programs in many countries (see e.g. Rydning 1994). In Norway, several universities and colleges offer translation into English as part of their portfolio for advanced students. The idea of creating an electronic corpus of student translations was conceived during the teaching of a translation course at the University of Oslo, the aim of which was [1]

to improve the students’ proficiency in English, to raise their awareness of important differences between Norwegian and English usage, to present and discuss some central translation problems, and to improve the students’ competence as translators (ENG1102 - Translation and Practical Exercises, An Introduction).

This paper describes the initial stages of the design and compilation of the Norwegian-English Student Translation corpus (NEST), a project still in progress. Section 2 briefly discusses the place of learner translation corpora in the broader context of learner corpora. In section 3, some considerations relevant to the data collection are outlined, and section 4 contains a description of the present situation with regard to contributing students and collected texts. At the time of writing, the interface to search the corpus is not yet in place. For this reason, any comprehensive corpus linguistic analysis of the data gathered so far, as well as any meaningful comparison with data from other corpora, is barely feasible at this stage. However, as an illustration, section 5 presents some samples of the raw data in the present corpus material.

2. Why build a NEST?

Easy access to electronic resources during the past two decades has greatly facilitated research into learner language. Most learner corpora are “collections of texts produced by foreign or second language learners” (Granger 2004: 124), compiled with the purpose of analyzing language learners’ interlanguage, often within the framework of contrastive interlanguage analysis or computer-aided error analysis (Granger 2002: 12). Learner translation corpora are “multiple translation corpora […] containing translations done by trainees rather than professional translators” (Castagnoli 2008: 36). As a subcategory of learner corpora, learner translation corpora can serve various functions: they can provide a useful pedagogical resource for teachers and students involved in a translation course, enabling the tracking of student progress and identification of individual or collective problems both of a linguistic and a translation-related nature (see e.g. Bowker & Bennison 2003); they can provide supplementary research data to already existing learner corpora, or to translation corpora of texts written by professional translators; and they can be seen as a window into the process of translation, allowing researchers to uncover specific features of various translation types, such as the translation of language for specific purposes (e.g. the ongoing compilation based on the Norwegian national translator’s exam [Translatøreksamen], TK-NHH Translatørkorpus; for a survey of learner translation corpora, see Castagnoli 2008: 37–42).

The primary aim of the corpus project NEST, described in this paper, is to provide supplementary data for researchers interested in learner language at an advanced level, enabling a comparison of the relatively free output of the type found in corpora such as the Norwegian component of the International Corpus of Learner English (argumentative essays) with learner output produced under the more constrained conditions of a translation task (cf. Kobayashi & Rinnert 1992). As a multiple translation corpus, NEST will also provide data for research on variation and choice in learner translation (cf. research on the Multiple Translations project, in Johansson 2007: 197–198).

3. Feathering the NEST: Corpus compilation and design

After having been put on ice for a few years, the NEST project idea was resumed in 2008, and the project was reported to the Norwegian Social Science Data Services. Requests for cooperation were sent to all Norwegian institutions of higher education that offer programmes in English, and several teachers signaled interest in the project. [2] From the fall semester of 2008, students taking part in translation courses at the University of Oslo, Sogn og Fjordane University College and the University of Tromsø began contributing texts to NEST.

3.1 Collection criteria

As one of the aims of NEST is to provide a supplement to existing learner corpora, the collection criteria need to be comparable to, if not fully consistent with, those of other corpora in some significant respects. Most learner corpora enable the correlation of linguistic data with potentially relevant extralinguistic factors, and NEST contains some information about the contributing students’ background, in addition to the translated texts. All the student translators submit a simple questionnaire with information about the following variables: [3]

  • Sex
  • Year of birth
  • Nationality
  • Period of residence in Norway (non-Norwegian citizens)
  • General language background:
    • First language (L1)
    • Home language, if different from L1
    • Main language of instruction during primary and secondary school
    • Competence in other foreign languages besides English (level of proficiency: Excellent/Good/Fair/Poor)
  • General study background:
    • No. of years of university level studies
    • Type of educational institution (university or college)
    • Which type of education/degree they are aiming at (BA, MA, Teacher Education, other)
  • Background in English:
    • No. of years of English in secondary school
    • Final grade in English at secondary school (written + oral)
    • No. of semesters of university level English studies
    • No. of completed ECTS credits in English
    • Preferred standard variety of written English (British English or American English)
  • Time spent in an English-speaking environment (length of stay, where and when)
  • Background in translation:
    • No. of semesters of university level translation studies
    • No. of completed ECTS credits in translation
    • Work experience as a translator (nature of the work, languages involved, for how long)
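For coding purposes, the questionnaire variables above could be stored as one structured record per contributor. The following is a minimal sketch in Python, not the coding scheme actually used in NEST: the class name ContributorProfile, the helper functions, all field names, and the sample values are hypothetical, and only a subset of the variables is shown.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ContributorProfile:
    """One questionnaire record (hypothetical field names, subset of variables)."""
    contributor_id: str                      # anonymized identifier (assumption)
    sex: str
    year_of_birth: int
    nationality: str
    first_language: str
    home_language: Optional[str] = None      # only if different from L1
    institution_type: Optional[str] = None   # "university" or "college"
    ects_english: Optional[float] = None
    ects_translation: Optional[float] = None
    preferred_variety: Optional[str] = None  # "British English" or "American English"
    # Other foreign languages mapped to self-reported proficiency
    # (Excellent/Good/Fair/Poor).
    other_languages: Dict[str, str] = field(default_factory=dict)


def age_at_translation(profile: ContributorProfile, year: int) -> int:
    """Approximate age in the year the translation was produced."""
    return year - profile.year_of_birth


def select_by_l1(profiles: List[ContributorProfile], l1: str) -> List[ContributorProfile]:
    """Extract the subset of contributors with a given first language."""
    return [p for p in profiles if p.first_language == l1]


# Invented sample record, for illustration only.
example = ContributorProfile(
    contributor_id="S001",
    sex="F",
    year_of_birth=1987,
    nationality="Norwegian",
    first_language="Norwegian",
    preferred_variety="British English",
)
print(age_at_translation(example, 2009))
```

Keeping the background variables in one record per contributor makes it straightforward to extract subsets of translations by, for instance, first language or number of completed credits.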

In addition, the possibility of tracking the progress of individuals or groups of learners is appealing from the point of view of translation teachers. To cater to this idea, which requires access to longitudinal data, the original intention was that all students contributing to NEST would translate a set text at the beginning and again at the end of a teaching term, along with any other texts produced in the way of ordinary translation course work. The set text was a magazine article written for a general readership, and was intended to capture some common linguistic challenges faced by Norwegian students of English. Unfortunately, convincing students and teachers to devote time to the administration and translation of texts beyond the requirements of their courses proved a difficult task. As the project lacked the resources to offer any kind of compensation, the idea of a common translation for all contributing students was therefore abandoned.

Differences in production constraints, such as the time allocated to the task, whether or not the students had access to dictionaries and other translation aids, and the effect of teaching which targets specific translation problems, can all be said to be relevant variables that ideally should be either kept constant, or at least be openly available for consideration in an analysis of the data. However, since the corpus material is being gathered from several different student groups and educational institutions, consistent information regarding these variables has proved difficult to get hold of. Apart from the questionnaire and the set text, no qualitative requirements are thus imposed on the inclusion of texts in the corpus. Rather, priority has been given to collecting a substantial number of texts, from which subsets may be extracted if desirable.

Some other unforeseen problems presented themselves at the outset of the data collection. Firstly, translation into the foreign language has recently been removed from many of the higher level English programs at Norwegian colleges and universities, which leaves relatively few sources from which texts could be harvested. Secondly, even getting students (or their teachers) to submit copies of the texts they were producing as part of their course requirements turned out to be a problem, and after six months with meagre results, the project was almost buried. However, after the initial setbacks, and thanks to the cooperation of a few interested teachers, the prospective corpus now contains data from more than 100 students, amounting to around 120,000 words. [4] The collection of texts will continue until December 2011.

4. The NEST eggs

This section briefly describes the contributors to the corpus material, and some aspects of the texts. Note that the presentation is merely intended to give an illustration of the kind of data contained in NEST at the time of writing (see endnote 4).

4.1 The learner translators

The contributing students are enrolled in English studies at four different institutions, two universities and two university colleges, located in different parts of Norway. All contributors have signed a consent form allowing their texts and some anonymized data to be stored and used for research purposes. As of May 2010, the NEST project included data from 93 female and 29 male students, the majority of whom were born after 1980. The peak year is 1987, which means that many of the students were approximately 22 years old at the time of translation (one student was 82 years old!).

A large majority of the students have Norwegian as their first language (L1). Two are bilingual Norwegian/English, two have English as their L1, and the rest are divided between several different L1s (Russian, Uighur, Ukrainian, Urdu, Vietnamese, Cantonese, Berber, German, Spanish, Ewe and Japanese).

All the students with a Norwegian educational background had between eight and thirteen years of English as a school subject prior to their university studies. The group of contributors is evenly divided between lower-level courses (i.e. they are first-year students of English at university level) and intermediate-level courses; none are enrolled in graduate level courses. Half of the students have completed fewer than 60 ECTS credits in English, while about one fifth have completed between 60 and 90 credits, and five students have 90 credits or more. Those who say they prefer to model their written language on British English slightly outnumber students who prefer American English (six to four). In addition to the English studies, some have studied translation at various levels; about one fifth have completed between 5 and 10 ECTS credits in translation, while three students have from 20 to 70 credits. Eighteen of the students report that they have held jobs that involved some element of translation, most of them from Norwegian into English. [5] Furthermore, a number of students report spending time in environments where they have used English on a daily basis: about one fifth for a duration of between six months and a year, and another fifth for one year or more.

The typical contributor is a 22-year-old female student from the University of Oslo. She has studied English for three years in upper secondary school prior to her university studies, with the average grade 5 in English (roughly corresponding to a B+/A-). She is in her first year of English studies, and has no previous experience with translation, either as an academic subject or professionally. She has visited English-speaking countries on holidays and shorter trips, and prefers to model her language on British English.

4.2 Source texts and translations

At present, the NEST material includes eighteen different Norwegian source texts, ranging in length from about 200 to 900 words (the mean length is 440 words). Most of the texts are factual prose texts written for a general readership in an everyday language and style. The texts in the more advanced level courses, in addition to being longer, display a wider variety of text types and styles, and include some samples of instructional texts and information pamphlets, as well as fictional prose.

All of the texts are authentic published texts, but a few have been slightly altered to incorporate specific translation challenges, or to avoid particular problems. This is also a factor in the selection of texts for translation practice in the first place; texts are usually chosen because they contain one or more specific challenges for the student translators, of a contrastive linguistic (grammatical or lexical) or translation-related nature.

The NEST project material numbers around 214 translations; approximately 121,000 words in total. Some of the Norwegian source texts have only a few translations, but more than half of the eighteen source texts have ten or more translations each. Two texts have close to 40.

During the summer of 2010, all the texts collected so far were coded with student background data from the questionnaires, prepared and run through the alignment program Translation Corpus Aligner (TCA) 2. [6]

5. Some samples from the NEST corpus data

As a search interface has not yet been installed, the raw data presented in this section have been extracted manually from the text files, and are therefore only intended as illustrations of the type of data contained in the corpus material, not as bona fide research findings.

5.1 Nice lexical teddy bears

A well-known feature of advanced learner language is the tendency to overuse certain general core lexical items, so-called lexical teddy bears (Hasselgreen 1994). This tendency may manifest itself in student translations as well, as evidenced by the following concordance from different texts in NEST containing the element nice:

(1) erience which Bjørnson recalls as  nice and bright while his wife thinks of it
on remembered this as a bright and  nice experience, but Mrs Karoline recalled
dates back to the Viking era, is a  nice farm with about 500 acres of land. In
a lively music in this incredibly  nice vacation paradise. Just about on
n old four-seat airplane with the  nice-sounding name The Wings of Hope.
he harness and landau is shiny and  nice. It is a ceremony Bjornson uses as an
the harness and the carriage were  nice and shiny in front of the doorstep. It
and the carriage are polished and  nice. That is a ceremony which Bjørnson
t the harness and the carriage are  nice and shiny. It is a ceremony Bjørnson
am Arrival of the guests It is a  nice summer's day with a little breeze at
came to the front door to show the  nice and shiny carriage. This is a
of Norway, close to Lillehammer),  nice weather with gentle breeze this day.
e harness and landau are shiny and  nice. This is a ceremony where Bjørnson
party, and everything sparkles so  nice. The dining table is quite a view,
o practice, which resulted in many  nice and tasteful food experiences for the
how that the harness and landau is  nice and shiny. It is a ceremony Bjørnson

Most of the occurrences of nice in example (1) are prompted by two source text adjectives, fin and god, and in several cases, the translation into nice represents an appropriate choice. In other cases, the choice could be an indication of lack of awareness of the many translation alternatives that exist for these adjectives in English, or of a limited lexical repertoire among students. An analysis of the source text elements that prompted these translations, as well as alternative choices that occur in other translations of the same source texts, may contribute to raising the students’ awareness of issues pertaining to e.g. lexical choice and collocation. It would also be of interest for students to compare data like these with data from corpora of professional translations and English original texts, such as those contained in the English-Norwegian Parallel Corpus (see Johansson 2007).
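Pending a search interface, a concordance of the kind shown in example (1) can be approximated with a short keyword-in-context script. The sketch below is only an illustration and not the procedure behind example (1), which was extracted manually; the function name kwic, the width parameter, and the sample string (pieced together from two of the concordance lines above) are assumptions.

```python
import re
from typing import List


def kwic(text: str, keyword: str, width: int = 35) -> List[str]:
    """Return keyword-in-context lines for whole-word, case-insensitive matches."""
    # Collapse line breaks and runs of whitespace so contexts stay on one line.
    flat = " ".join(text.split())
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        # Right-align the left context so that the keywords line up vertically.
        lines.append(f"{left:>{width}} {match.group(0)}{right}")
    return lines


sample = (
    "the harness and the carriage are nice and shiny. "
    "It is a nice summer's day with a little breeze"
)
for line in kwic(sample, "nice", width=30):
    print(line)
```

Because the pattern uses word boundaries, hyphenated forms such as nice-sounding are matched, whereas derived forms such as nicely are not; whether that is the desired behaviour depends on the research question.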

5.2 Translations of himmel

Another example illustrates lexical choice in the translation of a single Norwegian word, himmel, the meaning of which covers the core senses ‘sky’ and ‘heaven’. Examples (2) to (5) are taken from three different source texts; example (2) from a text about the hard life in a small island society in earlier times; example (3) from an article about the ancient Chinese art of drinking tea, and examples (4) and (5) from a popular book about the Northern lights. The first two texts have been translated by eighteen students each, and the third text by fourteen students. The translations marked (a) represent the most frequent choice of translation equivalent in the student translations.

(2) Og der vi står […] lever likevel tankene i oss om en tid da sliterne ute i havgapet levde under hardere forhold – men kanskje også en høyere himmel.
(2a) As we stand […] we are left with thoughts of a time when toilers way out at the mouth of the fjord lived in much harder circumstances – but perhaps also under a clearer sky.
(3) Ifølge en gammel kinesisk tradisjon skal teen plukkes tidlig om morgenen under en klar gråblå himmel, […]
(3a) According to an old Chinese tradition the tea should be gathered early in the morning bellow a blue gray sky, […]
(4) […] lysende skyer, gule og hvite, med lange stråler som lyste opp bakken. Noen sier at det er himmelens sverd, mens andre tror det er dype hull, […]
(4a) […] lightening clouds that were yellow and white, rays of light that lit up the ground. Some say that it was the sword of heaven, while others believe that it was deep holes […]
(5) Noen sier at det er himmelens sverd, mens andre tror det er dype hull, med store flammer, i himmelen.
(5a) Some say it is the sword of the heavens, whilst others believe it is a deep hole, with large flames, in the sky.

Only in example (3) does the word himmel unambiguously refer to the physical space above our heads. In the other examples the reference is uncertain or ambiguous. The translation choices for all four examples are summarized in Table 1. As the table shows, the translation equivalents range from an almost unanimous choice of sky/skies in the translations of example (3) to a much more varied picture in the translations of example (2), in which the word himmel could be interpreted as a metaphorical reference to living conditions or perhaps spiritual conditions, as well as to the sky above the island.

            Example (2)  Example (3)  Example (4)  Example (5)
sky                   9           17            4           10
heaven                -            -            6            4
heavens               3            -            4            -
skies                 5            -            -            -
ceiling               -            1            -            -
other [7]             1            -            -            -
Total                18           18           14           14

Table 1. Summary of choices of translation equivalents for himmel in examples (2)–(5)

Analyses of student translations of ambiguous words can raise both student and teacher awareness of ambiguity and of strategies for disambiguation in translation (and in other texts), as well as of the importance of contextual clues for the interpretation and translation of such items.

5.3 The Norwegian generic pronoun man and its English translations

The Norwegian generic pronouns man/en and their closest lexical equivalent in English, one, represent a challenge for Norwegian learners, as the Norwegian pronouns are both more frequent and have a wider stylistic range than their English equivalents (Hasselgård, Johansson & Lysvåg 1998: 139). A large Norwegian-English standard dictionary suggests the following translation possibilities for man: 1) incl. the person you are talking to: you (you shouldn’t talk while you’re eating); 2) incl. the person who is speaking: one (one does one’s best); 3) people in general: they, people (they say he’s very rich; people saw him climb the wall) (Henriksen & Haslerud 2001).

Johansson (2007) contains a study of the generic person in English, German and Norwegian, based on data from the expanded English-Norwegian Parallel Corpus, ENPC. As a number of the source texts in NEST contain uses of man, we can turn one of the research questions in Johansson (2007: 176) around, and ask: how is Norwegian generic man translated into English? The following examples are taken from two texts, both of which have fourteen different student translations. The first source text is an extract from a magazine article about life in Norway during the Second World War. All three examples from this text use the pronoun to refer to people in general, i.e. dictionary sense 3, cited above. In the first example (6), the referent is clearly identical with the subject in the preceding clause, as is reflected in the choice of the translation equivalent they in the majority of the student translations (exemplified by 6a):

(6) Blant annet startet avisen A/L Grisebingen på Ryen gård, man leide et større areal for potetdyrking på Voksen gård, […]
(6a) For example the newspaper started the shareholder’s company ”Grisebingen” on Ryen farm. They also rented a larger area for growing potatoes on Voksen farm, […]

Example (7) also refers to people in general, and this time people is the most frequent translation equivalent, as in example (7a):

(7) For ytterligere å øke den hjemlige beholdningen, la man ut på reiser.
(7a) To further increase the food supplies at home, people started travelling.

The third example (8) is an adapted biblical quote where one is the most frequent translation equivalent, as in (8a). Here, already existing renditions of the quote found on the internet may have affected the students’ choices, since they have had access to ICT aids in their work with the translation. [8]

(8) Langsomt gjennom de fem okkupasjonsårene høstet voksne, barn og ungdom nye erfaringer i å overleve. Man levde ikke av brød alene.
(8a) Slowly through the five years of invasion adults, children and young people gained new experiences in order to survive. One did not live by bread alone.

The choice of translation equivalents for man in examples (6)–(8) is summarized in Table 2. The table shows that rather than choosing randomly between alternatives given in a standard dictionary, the students’ choices in each case display a great deal of agreement. A study of the “deviant” choices might shed further light on this particular translation problem, as might a comparison of variation and style within one and the same translation.

                Example (6)  Example (7)  Example (8)
they                      9            1            2
people                    -           11            3
one                       2            1            9
passive voice             -            1            -
ellipsis                  2            -            -
other                     1            -            -

Table 2. Summary of choices of translation equivalents for man in examples (6)–(8).
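The degree of agreement among the student translators can be expressed as the share of translations that use the most frequent equivalent. The following minimal sketch computes this from the counts in Table 2; the variable and function names are assumptions, and the measure is offered only as one simple way of summarizing the table, not as an analysis carried out in the project.

```python
from collections import Counter
from typing import Tuple

# Counts transcribed from Table 2 (fourteen student translations per example).
table2 = {
    "(6)": Counter({"they": 9, "one": 2, "ellipsis": 2, "other": 1}),
    "(7)": Counter({"people": 11, "they": 1, "one": 1, "passive voice": 1}),
    "(8)": Counter({"one": 9, "people": 3, "they": 2}),
}


def agreement(counts: Counter) -> Tuple[str, float]:
    """Return the most frequent choice and its share of all translations."""
    choice, n = counts.most_common(1)[0]
    return choice, n / sum(counts.values())


for example, counts in table2.items():
    choice, share = agreement(counts)
    print(f"{example}: {choice} ({share:.0%})")
```

For example (7), for instance, people accounts for 11 of the 14 translations, i.e. roughly 79 per cent.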

The second source text is a recipe for a traditional Norwegian wafer cake, and exemplifies an instructive text type where Norwegian employs linguistic devices that are not usually found in corresponding texts in English, viz. the passive voice and the generic pronoun man. English recipes, by contrast, typically contain instructions using the imperative form of the verb, without an expressed subject. The meaning of man in most of the following examples corresponds most closely to dictionary sense 1 above, but the standard dictionaries lack pragmatic information about the use of man in this particular text genre. The translations provided in the (a) versions exemplify the most common translation equivalent in the corpus material:

(9) Deretter tar man i ca. 5 dl mel, ¼ ts ingefær eller kanel, et par sitrondråper.
(9a) Then add approx. 2 cups flour, ¼ tea spoon ginger or cinnamon, a couple of lemon drops.
(10) Så legger man i en full spiseskje av røren, jernet trykkes sammen, men så lett at røren ikke tyter ut på sidene.
(10a) Pour a full tablespoon of the mixture on the iron, close it, but gently enough to prevent spilling anything down the sides.
(11) Man kan også ha mer mel i røren så kakene blir tykkere.
(11a) You can also add more flour to the batter so that the cakes will be thicker.
(12) Da ruller man dem ikke, og slike kaker kalles ”avletter”.
(12a) Then you do not roll them onto a cone, and such cakes are called “avletter”.

The student translators’ choices are summarized in Table 3. Again, we see that there is a clear preference across the group for the same choice in examples (9) and (10), viz. the imperative verb form with no expressed subject. Example (11) contains a modal form which is not readily expressed in an imperative construction, and the negative form of (12) may be felt to appear too much like a warning if translated into an English imperative sentence.

                Example (9) Example (10) Example (11) Example (12)
you                       2            -            6            8
imperative               14           14            2            2
one                       -            -            3            -
they                      -            -            -            2
passive voice             -            -            1            3
other                     -            -            2            1

Table 3. Summary of choices of translation equivalents for man in examples (9)–(12).

Results of the kind illustrated in Tables 2 and 3 elucidate the kind of linguistic contexts that may create problems for Norwegian learners of English, either because of the existence of many alternative translation equivalents with no clear indication of which one is “better”, more appropriate or more idiomatic in English, or because the source text contains constructions which do not easily correspond to English equivalents, and thus emphasize contrastive differences.

6. Research potential

As a task type, translation may in some ways be regarded as a form of language elicitation experiment, where the student translator is forced to make choices that might be avoided in freer production. Given possibilities for advanced searches, and hence, more sophisticated analysis, the kind of data samples presented in the previous section represent an interesting supplement to the data contained in already existing learner and translation corpora. Part-of-speech (POS) and error tagging would open up possibilities for comparisons of vocabulary frequencies, error types, various types of unidiomatic language, and the use of discourse strategies. Comparisons might also shed light on the possible merits of including translation into the foreign language as part of an advanced language study program.

The texts can also be analyzed with respect to translation strategies such as simplification and explicitation, as well as the relationship between grammatical and lexical choices in the students’ tackling of translation problems.

Many of the contributed translations include commentaries written by the students pertaining to various translation challenges that have been chosen by their teacher, e.g. choice of lexical translation equivalents, word order, collocation, etc. This material, which at present only exists as separate text documents, represents a valuable source of additional information explaining individual choices, and could also be linked to the corpus and explored in combination with qualitative and quantitative analyses of the corpus material.


[1] The conception was in large part due to the late professor Stig Johansson and his unique ability to see research potential and encourage new corpus-based projects.

[2] In cooperation with Sylvi Rørvik (Hedmark University College).

[3] The questionnaire design was in part inspired by the profile used for the Longdale project (Longitudinal Database of Learner English), kindly made available to us by Sylviane Granger (August 2008; personal communication).

[4] Since the presentation of the paper on which this article is based (in May 2010), more student texts have been and are being submitted for inclusion in the corpus. However, as these texts have not yet been prepared and collated, they are not included in the following figures.

[5] Incidentally, the fact that university students of English are sometimes employed as translators may be a point in favor of reintroducing translation into the foreign language as part of the course portfolio of university language programs.

[6] Developed at AKSIS/UNIFOB, University of Bergen; see the reference entry “Translation Corpus Aligner (TCA) 2” for a fuller account (in Norwegian) of the alignment program.

[7] The category “other” here and in the next tables includes various alternative solutions, such as paraphrase, shortening, etc., where no direct correspondence to the source text element can be identified.

[8] Translations of the quote from Matthew 4.4 on the website listed in the references (“Matthew 4.4. Parallel translations”) most frequently contain man, but also include a few examples with people and one.


Bowker, Lynne & Peter Bennison. 2003. “Student Translation Archive: Design, development and application”. Corpora in Translator Education, ed. by Federico Zanettin, Silvia Bernardini & Dominic Stewart, 103–117. Manchester: St. Jerome Publishing.

Castagnoli, Sara. 2008. Regularities and Variations in Learner Translations: A Corpus-based Study of Conjunctive Explicitation. PhD dissertation, University of Pisa.

Castagnoli, Sara, Dragos Ciobanu, Kerstin Kunz, Natalie Kübler & Alexandra Volanschi. 2011. “Designing a learner translator corpus for training purposes”. Corpora, Language, Teaching, and Resources: From Theory to Practice, ed. by Natalie Kübler. Bern: Peter Lang.

“ENG1102 - Translation and Practical Exercises, An Introduction”. Course description, University of Oslo, 2009.

Granger, Sylviane. 2002. “A bird’s eye view of learner corpus research”. Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching, ed. by Sylviane Granger, Joseph Hung & Stephanie Petch-Tyson, 3–33. Amsterdam & Philadelphia: John Benjamins.

Granger, Sylviane. 2004. “Computer learner corpus research: Current status and future prospects”. Applied Corpus Linguistics: A Multidimensional Perspective, ed. by Ulla Connor & Thomas A. Upton, 123–145. Amsterdam & New York: Rodopi.

Hasselgård, Hilde, Stig Johansson & Per Lysvåg. 1998. English Grammar: Theory and Use. Oslo: Universitetsforlaget.

Hasselgreen, Angela. 1994. “Lexical teddy bears and advanced learners: A study into the ways Norwegian students cope with English vocabulary”. International Journal of Applied Linguistics 4(2): 237–258.

Henriksen, Petter & Vibeke Haslerud, eds. 2001. Engelsk stor ordbok. Oslo: Kunnskapsforlaget.

Johansson, Stig. 2007. Seeing through Multilingual Corpora. On the Use of Corpora in Contrastive Studies. Amsterdam & Philadelphia: John Benjamins.

Kobayashi, H. & C. Rinnert. 1992. “Effects of first language on second language writing: Translation vs. direct composition”. Language Learning 42: 183–215.

“Matthew 4.4. Parallel translations”. See also endnote [8].

Rydning, Antin Fougner. 1994. “Kan den interpretative oversettelse berike fremmedspråkpedagogikken?” Romansk Forum 1: 25–37.

“TK-NHH Translatørkorpus − Et multilingvalt fagspråklig parallellkorpus under oppbygging”. Norwegian School of Economics.

“Translation Corpus Aligner (TCA) 2”. Aksis, University of Bergen.