From Helsinki through the centuries: the design and development of English diachronic corpora

Thomas Kohnen, University of Cologne


This paper gives an overview of the major issues connected with the design and development of English diachronic corpora. It addresses challenges of diachronic corpus design (for example, corpus size, the erratic distribution of surviving texts, genre continuity, and the lack of pragmatic and sociolinguistic information about texts) and it gives a survey of the most influential English diachronic corpora, tracing a major development from "long and thin" to "short and fat" corpora. In addition, it suggests some directions for corpus design which include research connected with the compilation of the Corpus of English Religious Prose at the University of Cologne. These suggestions deal with the sections and functions of texts as possible links between corpora, the concept of a "stratified corpus", and the distinction between "producer genres" and "receiver genres" in diachronic corpora.


Ever since the newly evolving field of corpus linguistics developed a separate branch of historical studies, this branch has been firmly associated with the University of Helsinki. Matti Rissanen and his team, in compiling the diachronic part of the Helsinki Corpus, broke new ground in historical corpus linguistics and set the scene for all following diachronic English corpora. If one wanted to understand historical corpus linguistics as a trip to past centuries, one could say that trips of this kind have started, as far as electronic diachronic corpora are concerned, from Helsinki. In this paper I attempt to give an overview of the design and development of English diachronic corpora, employing the metaphor of the trip to the past. The paper falls into three parts. The first section ("Paving the way") deals with preparations for the trip and addresses major challenges of diachronic corpus design. The second section ("Tracks to the past") gives a survey of some of the most important English diachronic corpora. In the third section ("Tips on future trips") I will look ahead and suggest some directions for corpus design. These suggestions are mainly based on my own research and the experience of the compilations connected with the Corpus of English Religious Prose at the University of Cologne.

Whereas the first two sections aim at a compact overview (with all the restrictions that limited space imposes), the last part is intended to raise some novel questions which, I hope, will contribute to and advance the discussion on the future design of historical corpora.

1. Paving the way: challenges of diachronic corpus design

The principles of corpus design have always been a central issue in corpus linguistics, both synchronic and diachronic. They form an important part of most standard textbooks in the field (see, for example, the relevant chapters in Biber et al. 1998, Kennedy 1998, Meyer 2002 or Hunston 2002). I would like to start here by focussing on some selected general issues of corpus design from a purely synchronic perspective, and look at some statements which may spotlight the essential problems corpus linguists have to face.

A first major difficulty is connected with corpus size. Corpora, as principled collections of texts, should be large, but how large exactly must corpora be in order to qualify for valid linguistic research? Surveying the field, one gets the impression that even in the age of so-called second-generation mega corpora, researchers seem to be less confident about the "definite" size of corpora. Kennedy's statement may be symptomatic:

At this stage we simply do not know how big a corpus needs to be for general or particular purposes. (Kennedy 1998:68).

Another central concern is representativeness. It is a well-known tenet in corpus linguistics that all corpora must aim to be "representative" of the language or variety from which they stem if they want to serve the purposes of linguistic research. But can corpora be representative? When we are dealing with representativeness, not as an approximate aim, but as an absolute category, many researchers are very reserved. Mukherjee, for example, says in his state-of-the-art report:

Absolute representativeness is an unattainable aim. (Mukherjee 2004:114).

Many corpus designers say that a corpus may "improve" its representativeness if it has the correct balance or weighting. Balance involves a wide range of genres or registers and their appropriate proportionate representation in the corpus. Here it seems that the general attitude of corpus linguists is even more pessimistic. The following extract can be found in a recent textbook:

... any claim of corpus balance is largely an act of faith rather than a statement of fact. (McEnery, Xiao and Tono 2006:16).

After these three spotlights from a synchronic perspective we may return to historical corpora. What happens to the challenges plaguing synchronic corpus linguists if we move to the field of diachronic linguistics, that is, to corpora containing data from the Old English, Middle English and Early Modern English periods? Generally one has to say: Things get worse. This is mainly due to two problems: First, diachronic corpora cover a period of time, not a point in time, and, secondly, diachronic corpora (usually) have to put up with the inferior quality of the available historical data.

The question of the time frame of diachronic corpora and the dire consequences this may have for corpus size is a good illustration of how problems of synchronic corpora turn out to be even more serious in diachrony. A comparison with synchronic corpora is instructive here since it reveals the dimensions involved. The Brown Corpus, a typical "first-generation corpus", is based on written data stemming from one year, 1961, and contains one million words. If we assume one million words per year as a reasonable standard of corpus size, should each year in the history of the English language be represented by one million words? In this case the Helsinki Corpus, which basically stretches from the end of the eighth century till 1710, should contain something like 910 million words. A comparison with its real size (1.5 million words) shows the challenge involved. The Helsinki Corpus covers just 0.16 per cent of the imaginary corpus size. I would not care to argue that the Helsinki Corpus should in fact contain 910 million words, but the challenge in terms of corpus size must be seen against the background of possible claims to representativeness. Any claim to representativeness based on multi-genre corpora covering many centuries has to be rather modest.
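The back-of-the-envelope comparison above can be reproduced in a few lines. The sketch below is purely illustrative: the start year of 800 is an assumed round figure for "the end of the eighth century", and the benchmark of one million words per year is the hypothetical standard discussed in the text, not a recommendation for corpus design.

```python
# Hypothetical benchmark from the text: one million words per year
# (the size of the Brown Corpus, which samples the single year 1961).
WORDS_PER_YEAR = 1_000_000


def coverage(start_year, end_year, actual_words):
    """Return (benchmark size in words, actual size as a percentage of it)."""
    benchmark = (end_year - start_year) * WORDS_PER_YEAR
    return benchmark, 100 * actual_words / benchmark


# Assumption: "the end of the eighth century" is approximated as the year 800.
benchmark, percent = coverage(800, 1710, 1_500_000)
print(f"benchmark: {benchmark / 1e6:.0f} million words")
print(f"coverage: {percent:.2f} per cent")
```

Changing the assumed start year by a few decades alters the benchmark only marginally, so the order of magnitude of the gap remains the same.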

The second major problem, or rather complex of problems, is connected with the poor quality of the available historical data. There are many defects which are well-known and which have been mentioned quite often in the literature (see, for example, Rissanen 1992, Kytö and Rissanen 1992). [1] First of all, diachronic corpora must do without records of spoken language. We find only a few accurate and reliable representations of spoken discourse in the periods preceding the 20th century. Although there are systematic approximations to spoken language which try to repair the defects of the ways in which it was noted down in written documents (see, for example, Culpeper and Kytö 2000), the basic situation cannot be mended.

If we look at the available written data, we find that the documents which have come down to us are often quite fragmentary. Many texts got lost or were destroyed in the course of the centuries. In addition, many texts (especially verse texts) cannot be dated with any accuracy. If we manage to conjecture an approximate date, the date of composition and the date of the manuscript have to be distinguished, and they may be decades or even centuries apart. The twofold dating creates serious complications in the design of diachronic corpora. All this adds up to the basic question: How can the sketchy, erratic chronological distribution of text documents in the history of English be realigned with the requirement of a systematic, balanced structure of corresponding periods in a diachronic corpus?

The haphazard preservation of texts leads to another basic drawback in diachronic corpora, the inadequate representation of genres. Gaps in the documentation of genres, genre evolution, genre change and genre loss are common phenomena in the history of English. The number and distribution of different genres seem to change almost every century. But many if not most diachronic corpora are arranged according to categories of text type or genre (e.g. the Helsinki Corpus). How can the necessarily static pattern of genres in a diachronic corpus capture the fluctuation in the number and distribution of different genres across the centuries?

It goes without saying that the erratic distribution of surviving texts also creates difficulties with regard to the distribution of regional variants and dialects across time. Here a representative, balanced distribution seems hardly possible.

Another major problem that follows from the unsatisfactory quality of the data is the lack of pragmatic and sociolinguistic information about the texts which have come down to us. In several cases we even have to do without basic information about the context and setting, the purpose and aims of a text. Sometimes we know neither author nor addressee (most medieval texts are typically anonymous). If we manage to identify them, we often lack more detailed knowledge about their social and economic status or about their education. So, in the sampling of historical corpora, how are we to systematically control variables like age, gender, education and social relationships?

Most diachronic corpora, due to the problematic status of the data, rely on editions of texts. This may create even more difficulties. Editions of texts often present texts in a "synthetic" version which is different from any of the versions found in the various manuscripts. Editors make different assumptions and reach different decisions about the presentation of text, following their varying editorial policies. The question is: Can these "artificial" and "synthetic" editorial products still be seen as authentic evidence of language use?

After this brief outline of the challenges involved in the compilation of diachronic corpora one might be led to believe that historical corpora are necessarily unreliable and that diachronic corpus design is faced with insurmountable obstacles. I think this would be jumping to conclusions. First of all, it should be kept in mind that the evolution of historical corpus linguistics cannot be held responsible for the basic defects. Historical linguists have always struggled with bad data. Rather, it is due to the achievements of historical corpus linguists that these drawbacks have been highlighted in a systematic way and that solutions have been found to mend at least some of the defects of the historical data. In this way, historical corpus linguistics may even have enhanced the appreciation of the difficulties of corpus design in synchronic studies.

2. Tracks to the past: some major English diachronic corpora

This survey must start with a caveat. Given the limited space of this paper, it cannot even attempt to cover all major English diachronic corpora nor can it include detailed descriptions of them. [2] Rather, it is supposed to give a selection of important corpora with the primary aim of illustrating some of the basic problems pointed out in the preceding section and showing some of the solutions which have been offered. In addition, this small-scale survey forms the necessary background for the suggestions concerning corpus design which will be presented in the third section of this paper.

2.1 The Helsinki Corpus

As was mentioned in the introduction, the diachronic part of the Helsinki Corpus has been the pioneering, groundbreaking work in diachronic corpus compilation. Even after more than twenty years it can still be called the historical English corpus which is most comprehensive (in terms of the time covered and the text types included) and most accessible. [3] The Helsinki Corpus shows a serious concern for all major problems of diachronic corpus design and it will provide a solid general foundation for most corpus-based studies which aim at high-frequency items and their structural and / or lexical development in the history of English.

The basic outline of the Helsinki Corpus is well known and mentioned fairly often in the literature. One major aim in the compilation was to collect representative data both for synchronic studies of limited periods and diachronic studies of long-term developments. In order to achieve this aim, text samples representing an impressive number of different genres were arranged across eleven sub-periods. Apart from the text-type coding, the excerpts were also coded according to other labels, giving, where possible, parameter values for the dialect, the level of formality of the text, the relationship between the writer and the receiver, the author's age, sex and social status etc. It goes without saying that within these parameter values the compilers of the Helsinki Corpus tried to achieve a most balanced selection of excerpts.

One important innovation of the Helsinki Corpus is the concept of the so-called "diachronic text prototype" (e.g. secular instruction, religious instruction, non-imaginative narration, imaginative narration). Diachronic text prototypes comprise several genres. For example, the prototype "imaginative narration" covers fiction, romance, travelogue and geography. Diachronic text prototypes may serve to enhance genre continuity: If a certain text type is not found in a sub-period of the Helsinki Corpus, we may still retrieve excerpts representing the text prototype there.
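The retrieval logic implied by diachronic text prototypes can be sketched as a simple fallback lookup. This is a minimal illustration and not the coding scheme of the Helsinki Corpus itself: the single mapping entry merely repeats the example given above, and the sub-period labels, record fields and sample identifiers are invented placeholders.

```python
# Illustrative mapping only: the single entry repeats the example from the text.
PROTOTYPES = {
    "imaginative narration": ["fiction", "romance", "travelogue", "geography"],
}


def prototype_of(text_type):
    """Return the prototype that covers a text type, or None if it is unmapped."""
    for prototype, text_types in PROTOTYPES.items():
        if text_type in text_types:
            return prototype
    return None


def retrieve(excerpts, sub_period, text_type):
    """Return excerpts of the text type in a sub-period; if there are none,
    fall back to excerpts that share its diachronic text prototype."""
    in_period = [e for e in excerpts if e["period"] == sub_period]
    exact = [e for e in in_period if e["text_type"] == text_type]
    if exact:
        return exact
    prototype = prototype_of(text_type)
    if prototype is None:
        return []
    return [e for e in in_period if prototype_of(e["text_type"]) == prototype]


# Invented toy records; period labels and identifiers are placeholders.
excerpts = [
    {"id": "sample_a", "period": "P1", "text_type": "romance"},
    {"id": "sample_b", "period": "P2", "text_type": "fiction"},
]

# No fiction survives in the hypothetical sub-period P1, so the romance
# excerpt is returned as a representative of the same prototype.
fallback = retrieve(excerpts, "P1", "fiction")
```

The fallback makes the trade-off explicit: continuity across sub-periods is gained at the price of comparing text types that are related but not identical.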

Despite the undeniable and ground-breaking merits of the Helsinki Corpus, potential users should know possible limitations and shortcomings. In fact, the compilers of the Helsinki Corpus never denied possible defects and had an admirably sober attitude towards the possible restrictions of the results yielded by the corpus (on the "diagnostic" value of the results see, for example, Rissanen and Kytö 1993:3-4).

The shortcomings mainly concern the chronological distribution of the text excerpts and text type continuity (see Kohnen 2004a:104-108). Since the compilers wanted to permit synchronic as well as diachronic studies, we find some gaps on the one hand and accumulations of texts concentrated in short periods on the other. [4] Also, the compilers of the Helsinki Corpus decided to arrange the data according to the date of the manuscript, not the date of composition. With many syntactic and pragmatic studies the date of the composition seems preferable. A reorganisation of the corpus following the date of composition results in major changes in the Middle English sections, leaving, for example, the second half of the 15th century without religious treatises.

2.2 From "long and thin" to "short and fat"

An "indirect" achievement of the Helsinki Corpus has been that it has highlighted the basic limitations of large-scale, multi-genre diachronic corpora. Matti Rissanen called such corpora "long and thin" (2000: 9), which means that they cover a period which may be too long and contain data for each text type which may turn out to be too small. If we follow Rissanen's account and want to improve diachronic corpora, we can only conclude that the next generation of corpora had to be "short and fat".

Short and fat has indeed set the scene for the major follow-up corpus projects after the Helsinki Corpus. These corpora are short in that they are restricted to more limited time frames and they are fat in that they aim at a more extensive coverage of texts, often focussing on the evolution of specific genres or forms of communication. They also include data with more socio-historical information on addressors and addressees and they systematically approach spoken language. This survey contains select examples of short and fat corpora. They will illustrate the inclusion of socio-historical information, the more extensive coverage of text and the inclusion of spoken language.

A good example of the inclusion of socio-historical information and text coverage is the Corpus of Early English Correspondence. It contains ca 2.7 million words in about 6,000 letters written by 777 writers in the period between 1417 and 1681. [5] Letters have proved to be among the few historical genres which are open to sociolinguistic investigations. For many letters individual addressors and addressees can be identified whose age, socio-economic status and education are known. Within the framework of the Corpus of Early English Correspondence it has become possible to study the influence of sociolinguistic variables such as age, sex and education on language use and language change. Many studies based on the corpus have shown that the corpus-based socio-historical approach in fact yields valuable results (see, for example, Nevalainen and Raumolin-Brunberg 2003). So, this corpus may serve as an instructive illustration of how historical corpus linguists have found innovative solutions to the problems created by the defective historical data.

A good example of a follow-up corpus with a focus on genre evolution is the Corpus of Middle English Medical Texts, which is the first part of the Corpus of Early English Medical Writing (Taavitsainen et al. 2005; see also Taavitsainen and Pahta 2004). It covers the time between 1375 and 1500 and contains ca 500,000 words. The corpus follows the major sub-genres of medical writing. It comprises surgical texts (learned academic treatises), specialised texts (academic treatises on specific topics) and remedies (recipes, prognostications, herbals etc.), with extensive text coverage. Thus it allows a detailed investigation of the evolution of medical writing and the associated text types (see, for example, Taavitsainen 2002, 2005a and b).

The Lampeter Corpus of Early Modern English Tracts is another example of a corpus with a focus on a special text type (Schmied 1994; Schmied and Claridge 1997; Siemund and Claridge 1997). It contains about 1.1 million words in 120 texts by 120 authors, covering the period between 1640 and 1740. Tracts, which may be called a publication type rather than a text type, show a high degree of variability. The compilers have coped with this variability by subdividing the texts into several domains (religion, politics, economics / trade, science, law).

A last example of a genre-focused corpus is the ZEN (Zurich English Newspaper Corpus; Fries and Schneider 2000). It contains a collection of early English newspaper extracts ranging from 1661 to 1791. Here, representativeness is improved by coverage of the different sub-sections or sub-genres included in newspapers, that is, different kinds of news reports (e.g. home, foreign and ship news), advertisements, announcements, obituaries, letters etc.

There is one corpus among the follow-up corpora which systematically deals with one of the most notorious drawbacks of historical data, the lack of spoken records. The Corpus of English Dialogues (1560-1760) offers a principled selection of texts which may allow a systematic approximation to the spoken language of the period (see Culpeper and Kytö 1997, 2000). It contains those text types which show a close affiliation to spoken language in that they record spoken discourse. These include records based on authentic dialogues (trials, witness depositions) and constructed dialogues (comedies, (dialogues in) prose fiction, and (dialogues in) handbooks and manuals).

It goes without saying that there are many more corpora and corpus projects which attempt to cope with the challenges of diachronic corpus design but which - for reasons of space - cannot be included in this presentation. These are corpora that focus on a limited time span (for example, A Corpus of Nineteenth-Century English, Kytö et al. 2006, or A Corpus of Late Modern English Texts, de Smet 2005) and corpora which focus on a specific variety (e.g. The Helsinki Corpus of Older Scots, see Meurman-Solin 1993, 1995). There are tagged and parsed corpora [6] and there are other large-scale multi-genre corpora (e.g. ARCHER). [7]

2.3 Full-text databases and dictionary corpora

The rest of this part will be devoted to examples of two kinds of electronic databases which may not be called corpora in the full sense of the term and which are sometimes said to compensate for or even mend some of the drawbacks of diachronic corpora proper. These are full-text databases and electronic dictionaries that can be used as corpora.

The Dictionary of Old English Corpus is an impressive database with virtually all Old English writings which have come down to us, comprising 3.5 million words (see Healey 1999). This corpus can, however, be called neither balanced nor representative because it just reflects the state of haphazard text transmission that we face in Old English.

The Innsbruck Computer Archive of Machine-Readable English Texts (ICAMET) contains, apart from a small section of letters, a collection of 128 full-text databases comprising about 5.6 million words (see Markus 1991, 1994 and 1999). The major advantage of ICAMET is that it offers complete texts rather than excerpts of texts. Thus it makes text-linguistic and pragmatic analyses possible. On the other hand, restrictive copyright regulations prevent the free distribution of many complete texts contained in the corpus, and the coverage of the Middle English prose genres is often incomplete.

Other kinds of historical databases comprise the huge collections of citations which function as evidence and illustration in the standard historical dictionaries. Here I only mention the well-known electronic versions of the Middle English Dictionary and the Oxford English Dictionary. Used as a corpus, both electronic dictionaries provide a vast number of short text fragments, which together constitute a substantial collection of evidence of language use in a wide range of different domains and text types.

Several corpus linguists have raised doubts about the representativeness and balance of dictionary corpora. It seems rather difficult to specify the exact size of the corpora, the distribution of genres can hardly be traced, and the quotations are too short and difficult to contextualise. In addition they sometimes seem to be biased towards certain sub-periods (for example, Late Middle English in the case of the MED) or prominent authors (for example, Shakespeare in the case of the OED; see Hoffmann 2004). The great advantage of dictionary corpora is, however, the sheer size of the data, which makes them extremely useful for complementary investigations in those cases where the standard corpora do not yield enough examples (see, for example, Kohnen 2004b).

3. Tips on future trips: some directions for corpus design

The third part of this paper is devoted to some directions for future corpora. The suggestions I make will mainly reflect the difficulties and new challenges connected with the current compilation of the Corpus of English Religious Prose at the University of Cologne. In particular, I will mention possible disadvantages of short and fat corpora and the possibilities of linking them up; I will focus on one possible kind of link, the functions and sub-functions of the texts included in corpora; I will introduce the concept of a "stratified" corpus; and I will discuss representativeness and "receiver genres" in diachronic corpora.

3.1 Beyond "long and thin" and "short and fat": closing gaps and finding links

In the previous section I have shown how diachronic corpora have turned from long and thin to short and fat. Short and fat corpora are necessary to cope with the difficulties of corpus design; in particular, they reflect the need for more substantial text evidence in sub-periods of the English language. On the other hand, short and fat corpora leave at least two potential gaps.

First, the number of large single-genre corpora is somewhat limited to date and there remain quite a few registers or genres in the history of English which wait for extensive documentation in the form of a fat corpus. Secondly, compilers of single-genre corpora usually focus on limited time spans. Thus they run the risk of losing touch with the larger perspective of the complete history of the English language.

In my view, compilers of diachronic corpora should give up the dichotomy of long and thin and short and fat. The efforts of diachronic corpus compilation should be directed towards closing the gaps left so far by single-genre corpora and finding possible links between existing corpora.

The closing of gaps may turn out to be difficult, simply because some genres developed late and an extension backwards is impossible (e.g. in the case of private letters). Therefore it seems of paramount importance to select domains or genres which have the potential of covering larger portions of the history of English. Against this background it seems quite surprising that a specialised corpus of English religious prose has not been compiled so far. Religious genres cover the whole history of the English language. Especially until the 18th century, they seem to be among the most wide-ranging, most relevant and most accessible genres in the history of the English language. [8]

3.2 Possible links: sections and functions of texts included in diachronic corpora

If the implementation of links between existing and projected corpora is among the more urgent requirements of diachronic corpus design, corpus linguists must focus on possible factors which constitute such links. In my view, subsections and sub-functions of texts and text excerpts belong to such factors.

Subsections and their functions belong to the topics which are very much discussed in chapters on the compilation and design of corpora in textbooks on modern corpus linguistics: Should corpora contain complete texts or just samples of texts? The received wisdom on this point is usually that excerpts are manageable in size, contribute to comparability, orderliness and representativeness of corpora, whereas whole texts are in many cases simply too large for most corpora (see, for example, Mukherjee 2004:113). However, full texts have the great advantage of providing all subsections and sub-functions of the text. This may not only contribute to a complete picture of a text or text type, but also allow text-linguistic and discourse-functional analyses. The alternative of full text versus sample is often presented as a fixed dichotomy, with samples usually being preferred to whole texts (see, for example, Meyer 2002:38-40).

The compilation process of the Corpus of English Religious Prose will follow a compromise between a "whole-text" and an "excerpt" strategy. We will first identify and systematically account for the relevant, that is functionally differing subsections of the potential texts which are to be included in a corpus. Then we will only select those excerpts which represent these relevant subsections and sub-functions. In the end, the files included in the corpus should not contain whole texts but rather systematically reflect their prominent, functionally differing subsections. Thus, a multiple classification of text excerpts could be achieved: On the one hand, texts could be assigned to the ordinary text types (treatise, sermon etc.), while on the other hand, parts of these texts could be assigned to recurring sub-functions.

Texts of religious instruction may serve as an illustration here. When I tried to set up relevant categories for the classification of texts of religious instruction, I found that the term religious instruction refers to a complex unit which seems to comprise at least five textual sub-functions: exhortation (which aims at a change of the addressee's behaviour), exposition (the explanation of the Christian belief), exegesis (the interpretation of the Bible), narration (instruction by means of exemplary stories) and argumentation (methodical presentation of arguments). These sub-functions may result in various subsections in one text. The inspection of different texts of religious instruction has shown so far that the sub-functions often combine in an unforeseeable way, with changing proportions. In the Corpus of English Religious Prose we are planning to select extracts of texts which are representative of the occurring sub-functions. We would then code the texts of religious instruction according to their traditional text type and according to the sub-functions. Thus we could, for example, access all narrative sections in sermons and homilies or all argumentation sections in saints' lives and catechisms and so on.
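The double coding described above can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not the actual annotation scheme of the Corpus of English Religious Prose: the class name, field names and sample records are hypothetical, and only the five sub-function labels are taken from the text.

```python
from dataclasses import dataclass

# The five sub-functions of religious instruction named in the text.
SUB_FUNCTIONS = {"exhortation", "exposition", "exegesis", "narration", "argumentation"}


@dataclass(frozen=True)
class Excerpt:
    source: str        # hypothetical identifier of the text the excerpt comes from
    text_type: str     # traditional text type, e.g. "sermon" or "catechism"
    sub_function: str  # one of SUB_FUNCTIONS
    text: str

    def __post_init__(self):
        # Reject labels outside the closed set of sub-functions.
        if self.sub_function not in SUB_FUNCTIONS:
            raise ValueError(f"unknown sub-function: {self.sub_function}")


def select(excerpts, text_types, sub_function):
    """Return all excerpts with the given sub-function in any of the given text types."""
    return [e for e in excerpts
            if e.text_type in text_types and e.sub_function == sub_function]


# Invented placeholder records, not actual corpus files.
corpus = [
    Excerpt("sermon_01", "sermon", "narration", "..."),
    Excerpt("sermon_01", "sermon", "exhortation", "..."),
    Excerpt("homily_01", "homily", "narration", "..."),
    Excerpt("catechism_01", "catechism", "argumentation", "..."),
]

# All narrative sections in sermons and homilies.
narrative = select(corpus, {"sermon", "homily"}, "narration")
```

Because one source text can contribute several excerpts with different sub-functions, the same query works across text types without the whole text having to be stored.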

It seems to me that general recurring sub-functions lend themselves as excellent links between different single-genre corpora. Although most of the sub-functions will find their genre-specific manifestations in the various text types of religious prose, they can also be seen as instances of more general text functions which can be found in other, secular texts and text types. [9] And in this sense they can serve as links between different corpora.

3.3 The concept of a "stratified corpus"

One of the major difficulties in the analysis of texts belonging to the religious domain is the fact that some texts (e.g. the Bible and prayers of the Eucharist) clearly enjoy a privileged position whereas others are usually associated with a lower status in religious discourse (sermons, treatises, catechisms etc.). The concept of a "stratified corpus" may serve to reflect hierarchical distinctions like these in the design and structure of a corpus. A stratified corpus is a corpus which seeks to display the hierarchical distinctions of the discourse world in which the texts included in the corpus function. The idea of a stratified corpus is quite relevant for religious discourse but may turn out to be important in other domains as well.

If the Bible enjoys a special status in religious discourse, how can it be classified in a corpus? In many historical corpora, for example, the Helsinki Corpus, Biblical texts form a separate category or text type. At first sight, this does not seem to make much sense since the status of the Bible as a uniform genre is rather doubtful. The texts of the Bible comprise a great variety of different genres, for example, chronicles, poems, reports, narratives, prayers, laws etc. On the other hand, the canon of Biblical writings is, at least within the Christian tradition, seen as a coherent whole. It constitutes a common, definite basis, a fundamental frame of reference for theological thinking and religious practice. As a consequence, Biblical texts never change, apart from the possibilities of differing or new translations. How can this special status of Biblical writings be captured in the structure of a corpus?

The problem can be solved if the Bible and other types of religious texts are placed in the hierarchical structure of the discourse world in which these texts are supposed to function and make sense. The major interactants in this discourse world are God, as the transcendental authority, and the Christian community. The Christian community can be subdivided at least into two sub-groups, the "specialists" in religion, that is, the theologians and official representatives of the church, and the common laypeople, the ordinary members of the Christian community who do not hold a prominent position.

Within the network of these interactants there are three basic possibilities of communication which constitute the major spheres of religious discourse. The first one is God addressing the Christian community. This is what Christian tradition usually calls "God's word" or the text canon of the Bible. The second possibility is the Christian community addressing God. This is, in most cases, prayer. The third possibility is members of the Christian community addressing other members of the Christian community. In most cases this is what is called religious instruction and theological discussion. Theological discussion is usually restricted to the "specialists" in the field. It may be defined as the expert professional discourse among theologians. On the other hand, religious instruction will typically involve members of the clergy addressing ordinary laypeople, although other combinations are possible as well.

One of the advantages of the above account is that it reveals the categorical difference between the Bible and liturgical prayer on the one hand, and the genres of religious instruction and theological discussion on the other. The Bible and prayer are different not only because they are different genres but because they stem from a different sphere of religious discourse, which involves a transcendental authority. This privileged position in the world of religious discourse is shown by the fact that texts from the Bible are inflexible and fixed and that they constitute a fundamental point of reference and a common basis for justification. In addition, their superordinate position is revealed by the fact that they recur as fixed elements in other (subordinate) religious writings. [10] They are prototypical manifestations of what Crystal and Davy (1969:147-172) call "the language of religion". [11]

What are the consequences for the design and compilation of a Corpus of English Religious Prose if it is to be conceived as a stratified corpus? First of all, the Bible and prayers, due to their mainly fixed form and their special status as different spheres of religious communication, should form two background sub-corpora that serve as points of reference for the main part of the diachronic corpus. The main part of the corpus should be based on the third sphere of religious communication comprising theological discussion and religious instruction. The background corpora should be linked to the main section of the corpus so that citations and references in the "ordinary" texts could be accessed. The present design of the Corpus of English Religious Prose will follow these lines.
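The stratified design outlined above may be illustrated with a minimal sketch. It assumes a simple in-memory model and is not a description of the actual encoding of the Corpus of English Religious Prose: the class names, field names and text identifiers are all hypothetical placeholders. The sketch only shows how two background sub-corpora (Bible and prayers) could be linked to the main section so that citations in the "ordinary" texts can be followed in both directions.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical labels for the three spheres of religious discourse.
BIBLE = "bible"    # background sub-corpus: God addressing the community
PRAYER = "prayer"  # background sub-corpus: the community addressing God
MAIN = "main"      # main section: religious instruction, theological discussion


@dataclass
class Text:
    text_id: str
    stratum: str  # BIBLE, PRAYER or MAIN
    genre: str
    # Identifiers of background texts cited or referred to in this text.
    cites: list[str] = field(default_factory=list)


@dataclass
class StratifiedCorpus:
    texts: dict[str, Text] = field(default_factory=dict)

    def add(self, text: Text) -> None:
        self.texts[text.text_id] = text

    def background_for(self, text_id: str) -> list[Text]:
        """Background texts (Bible, prayers) cited by a given text."""
        source = self.texts[text_id]
        return [
            self.texts[cited]
            for cited in source.cites
            if cited in self.texts and self.texts[cited].stratum != MAIN
        ]

    def cited_in(self, background_id: str) -> list[Text]:
        """Main-section texts that cite a given background text."""
        return [
            t
            for t in self.texts.values()
            if t.stratum == MAIN and background_id in t.cites
        ]


# Invented placeholder entries, for illustration only.
corpus = StratifiedCorpus()
corpus.add(Text("bible_passage_1", BIBLE, "bible"))
corpus.add(Text("prayer_1", PRAYER, "prayer"))
corpus.add(Text("sermon_1", MAIN, "sermon", cites=["bible_passage_1", "prayer_1"]))
corpus.add(Text("treatise_1", MAIN, "treatise", cites=["bible_passage_1"]))

print([t.text_id for t in corpus.background_for("sermon_1")])
print([t.text_id for t in corpus.cited_in("bible_passage_1")])
```

In a model of this kind the background texts carry no outgoing links of their own; they serve purely as points of reference, which reflects their fixed and superordinate status.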

Is the concept of a stratified corpus relevant for other domains beyond the world of religious discourse? The concept can be applied wherever certain text types enjoy high prestige and exert great influence, are inflexible in that they follow fixed conventions of text structure, and tend to be cited in other texts stemming from the same domain. Here, possible candidates might be text types in the domain of administration and law (for example, statutes).

3.4 Representativeness and "receiver genres" in diachronic corpora

In the last part of this section I return to the issue of representativeness in diachronic corpora and look at genres from the perspectives of their producers and receivers. The most important question in dealing with problems of representativeness seems to be the question "representative of what?" (Kennedy 1998:62). The representativeness of a given corpus has to be assessed against the background of a particular reference point, which determines a reasonable "degree" or "approximation" of relevance. In answering the question "representative of what?" for diachronic corpora, it may be instructive to recall some well-known characteristic features of the linguistic situation in England before ca 1600 (or maybe even later).

During the Old English and Middle English periods, and to a significant extent also during the Early Modern period, we find a predominance of oral communication. This included not only everyday interaction and typical "oral" text types but also listening to written texts. Since written texts were more often read out than read silently, they were to a much larger extent part of spoken interaction. On the other hand, the production of written texts was restricted to few people and we find a typical male dominance in literacy. [12] Against this background, we may again ask the question "representative of what?".

Can a diachronic corpus be representative of the language of spoken interaction? This will hardly be possible because there are no direct records. Despite the systematic attempts at an approximation it seems extremely difficult to close these gaps in such a way that the available data could be called representative. Should a diachronic corpus be representative of the written text types? As was said before, this is the language of a small section of society, mostly the language of a male elite. Actually, it seems as if this "representativeness" is not really representative. Can a diachronic corpus be representative of the received genres of a past period, that is, representative of the texts as they were read or listened to? It seems to me that an answer to this question may well imply a more promising approach. As was said before, the typically written genres were included in spoken interaction and thus enjoyed, in addition to being read silently, a wide currency among people.

Although most of the above considerations appear to be fairly obvious and well-known, a systematic distinction between genres as received phenomena and genres as produced phenomena has not received the attention it deserves in diachronic corpus design. I suggest for diachronic corpora, wherever possible, a systematic distinction between what I would like to call typical "producer genres" and typical "receiver genres".

Basically, producer genre means that the most attractive side of the genre from a corpus-linguistic perspective is the producer side. Typical producer genres contain relatively short, self-contained texts, whose addressors are in many cases identifiable. Thus producer genres yield text populations with many identifiable language users, enabling linguistic analysis to describe language use as attested by many different producers. A prototypical producer genre is formed by private letters. The term receiver genre points to the fact that the most attractive side of the genre from a corpus-linguistic perspective is the receiver side. Typical receiver genres contain anonymous texts or texts connected with one or very few authors. Receiver genres usually have a wide circulation. Thus, linguistic analysis can describe language use which was (probably) picked up by many different language users. Typical receiver genres are prayers, the Bible or proclamations. Clearly, it does not come as a surprise that many if not most religious text types are situated on the receiver side. The privileged status of the Bible and Eucharist prayers may also be traced to their being typical receiver genres.

What are the consequences of a systematic consideration of the producer and receiver side of genres for the compilation and design of diachronic corpora? I will just mention a few points. First of all, one should include (if possible) systematic information about the currency and circulation of the texts, preferably as one coding parameter. This implies that only the "best examples" of a receiver genre, that is, the most popular texts, should be included. As a second step, one could work out producer and/or receiver profiles for diachronic corpora. For example, the Corpus of Early English Correspondence is a typical producer corpus. A corpus of religious prose would to a large extent be a receiver corpus but would also include producer items (e.g. sermons, when seen from the perspective of their respective authors). Thirdly, one could start looking for possible complementary producer and receiver genres in one corpus and link them, e.g. sermons and treatises vs. the Bible and prayers. The Bible and prayers were probably receiver genres, since these texts were mainly fixed and handed down from generation to generation. Sermons and treatises, on the other hand, may in part, that is, from the perspective of the "professionals" who write them, be seen as producer genres. The complementary aspect derives from the fact that the same group of people (the "professionals") will have acted as receivers (the Bible and prayers) and as producers (sermons, treatises), with a possible transfer of language use. But, whatever the possible distributions of producer and receiver genres, it seems to me that typical receiver genres and received language may form the majority of items included in diachronic corpora.
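The first two points can be made more concrete with a short sketch. It is only one possible operationalisation under stated assumptions: the coding parameters `orientation` and `circulation`, the three-point circulation scale, the word-based profile and all sample entries are hypothetical and are not taken from any existing corpus.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    text_id: str
    genre: str
    orientation: str  # hypothetical parameter: "producer" or "receiver"
    circulation: int  # hypothetical ordinal scale: 1 = narrow, 3 = wide
    words: int


def profile(entries):
    """Share of running words contributed by producer and receiver items."""
    totals = Counter()
    for entry in entries:
        totals[entry.orientation] += entry.words
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {orientation: n / grand_total for orientation, n in totals.items()}


def best_examples(entries, genre, min_circulation=3):
    """Receiver texts of one genre that meet the circulation threshold."""
    return [
        entry
        for entry in entries
        if entry.genre == genre
        and entry.orientation == "receiver"
        and entry.circulation >= min_circulation
    ]


# Invented placeholder entries, for illustration only.
sample = [
    Entry("letter_1", "private letter", "producer", 1, 500),
    Entry("sermon_1", "sermon", "producer", 2, 1500),
    Entry("prayer_1", "prayer", "receiver", 3, 1000),
    Entry("prayer_2", "prayer", "receiver", 1, 1000),
]

print(profile(sample))
print([e.text_id for e in best_examples(sample, "prayer")])
```

A profile based on running words is merely one choice; counting texts or identifiable authors instead would weight producer genres such as private letters more heavily.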

There is an additional consequence of the distinction between producer and receiver genres, which may challenge the prominent position of genre in the design and compilation of diachronic corpora. If texts as they were read or listened to are one important focus of diachronic corpora, then they should also include the standardised or prefabricated units in which written genres were received, that is, popular text collections and commonplace books. Such collections combine several advantages with regard to the analysis of received genres. They often give a clear indication of the currency and reputation of the included texts; in most cases they even form "recommended reading". They also show recurring patterns and typical combinations of extracts from different text types. Eamon Duffy's research on 15th-century devotional collections (Duffy 1992:68-77) offers impressive evidence of the fact that such collections were highly stereotyped and contained extracts of genres in a strikingly similar order. If such collections reflect typical ways of how religious genres were received, they should form part of a corpus of religious prose.

4. Conclusion

Without doubt, one of the most attractive aims of English diachronic corpus linguistics is the creation of a large mega-corpus covering the whole history of the English language (see Rissanen 2000:13). Although this is a very respectable goal, we may still be far away from it. In my view, diachronic corpus work during the next few years should focus on the creation of new fat specialised corpora in different domains and periods. Once their internal structure and representativeness have been further improved, one may find means and parameters for linking them across genres and centuries, thus approaching a large corpus covering the whole history of the English language. But for the time being, the implementation of effective links between existing and projected corpora will be among the greatest challenges of diachronic corpus design.


[1] Kennedy gives a rather short account (1998:38-40); so does Meyer (2002:37-38; 46).

[2] Descriptions of more diachronic corpora may be found in Kytö, Rissanen and Wright (1994), Kytö and Rissanen (1997) and Rissanen (2000). See also the historical corpora listed on

[3] On the Helsinki Corpus see Nevalainen and Raumolin-Brunberg (1989), Rissanen, Kytö and Palander-Collin (1993), Rissanen (1994), Kytö (1996), Rissanen, Kytö and Heikkonen (1997a, 1997b).

[4] For example, there are large gaps in the text type "fiction" between Chaucer's Canterbury Tales (c 1395) and Caxton's Reynard the Fox (1481) and between Armin's Nest of Ninnies (1619) and Aphra Behn's Oroonoko (1688).

[5] On the Corpus of Early English Correspondence see Nevalainen and Raumolin-Brunberg (1996, 2003) and Nurmi (1999). At present, an Extension and a Supplement of the CEEC are being prepared. The Extension will cover the years 1653 to 1800, while the Supplement will add some 860 letters for the 15th, 16th and 17th centuries.

[6] Among the parsed versions of (parts of) the Helsinki Corpus there are The Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English, The York-Helsinki Parsed Corpus of Old English Poetry, The York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE), the Penn-Helsinki Parsed Corpus of Middle English (PPCME2) (see Kroch and Taylor 2000) and the Penn-Helsinki Parsed Corpus of Early Modern English.

[7] On ARCHER see Biber et al. (1994).

[8] Unfortunately, there is not enough space in this short overview to prove this claim. Suffice it to mention the pervasive power and prestige of the church throughout the Middle Ages and the paramount importance of religious writings in Tudor and Stuart England. In addition, scholars have repeatedly pointed to the great circulation of religious books and manuscripts during this period (see, for example, Deanesly 1920; Moran 1985:185-220; Bennett 1969; King 2000). Although the influence of religious prose declined after 1800, the major religious text types can still be found in Present-Day English. A genre like sermons or prayers could be traced from Old English till today.

[9] Some of these general functions can be found in standard accounts of text linguistics and text types. See, for example, Werlich (1983).

[10] For example, in homilies and other tracts on the core passages of the Bible (see Swanson 1995:16-19, 26-28 and Duffy 1992:53-87).

[11] Liturgical prayers are usually quite fixed as well, since they can only be changed by the appropriate ecclesiastical authority.

[12] On the degree of orality and literacy in the respective periods see, for example, Clanchy (1993), Cressy (1980) and Schofield (1968).


ARCHER = A Representative Corpus of Historical English Registers

Bookmarks for Corpus-based Linguists by David Lee

The Brooklyn-Geneva-Amsterdam-Helsinki Parsed Corpus of Old English

Brown Corpus

Corpus of Early English Correspondence

Corpus of English Dialogues

Corpus of English Religious Prose

Corpus of Middle English Medical Texts

Dictionary of Old English Corpus in Electronic Form. Centre for Medieval Studies. University of Toronto.

Helsinki Corpus = The Helsinki Corpus of English Texts. 1991. Helsinki: Department of English.

The Helsinki Corpus of Older Scots

ICAMET = Innsbruck Computer Archive of Machine-Readable English Texts

Lampeter Corpus of Early Modern English Tracts

MED = Middle English Dictionary, electronic version. The Middle English Compendium. 1999, ed. by F. McSparran et al. Ann Arbor, Michigan: University of Michigan Press.

OED = Oxford English Dictionary Second Edition on Compact Disc, Version 1.13. 1994. Oxford: Oxford University Press.

PPCEME = Penn-Helsinki Parsed Corpus of Early Modern English

PPCME2 = Penn-Helsinki Parsed Corpus of Middle English

YCOE = The York-Toronto-Helsinki Parsed Corpus of Old English Prose

The York-Helsinki Parsed Corpus of Old English Poetry

ZEN = Zurich English Newspaper Corpus


Bennett, H.S. 1969. English Books & Readers 1475 to 1557. (2nd ed.) Cambridge: Cambridge University Press.

Biber, Douglas, Edward Finegan & Dwight Atkinson. 1994. "ARCHER and its challenges: Compiling and exploring a representative corpus of historical English registers". Creating and Using English Language Corpora. Papers from the Fourteenth International Conference on English Language Research on Computerized Corpora, Zürich 1993, ed. by Udo Fries, Gunnel Tottie & P. Schneider, 1-13. Amsterdam: Rodopi.

Biber, Douglas, Susan Conrad & Randi Reppen. 1998. Corpus Linguistics. Investigating Language Structure and Use. Cambridge: Cambridge University Press.

Clanchy, M.T. 1993. From Memory to Written Record. England 1066-1307. Oxford: Blackwell.

Cressy, David. 1980. Literacy and the Social Order: Reading and Writing in Tudor and Stuart England. Cambridge: Cambridge University Press.

Crystal, David & Derek Davy. 1969. Investigating English Style. Harlow: Longman.

Culpeper, Jonathan & Merja Kytö. 1997. "Towards a Corpus of Dialogues, 1550-1750". Language in Time and Space, ed. by Heinrich Ramisch & Kenneth Wynne, 60-73. Stuttgart: Steiner.

Culpeper, Jonathan & Merja Kytö. 2000. "Data in historical pragmatics: Spoken discourse (re)cast as writing". Journal of Historical Pragmatics 1(2): 175-199.

Deanesly, Margaret. 1920. "Vernacular books in England in the fourteenth and fifteenth centuries". The Modern Language Review 15: 349-358.

De Smet, Hendrik. 2005. "A corpus of Late Modern English texts". ICAME Journal 29: 69-82.

Duffy, Eamon. 1992. The Stripping of the Altars. Traditional Religion in England 1400-1580. New Haven: Yale University Press.

Fries, Udo & Peter Schneider. 2000. "ZEN: Preparing the Zurich English Newspaper Corpus". English Media Texts - Past and Present. Language and Textual Structure, ed. by Friedrich Ungerer, 3-24. Amsterdam and Philadelphia: John Benjamins.

Healey, Antoinette di Paolo. 1999. "The Dictionary of Old English Corpus on the World-Wide Web". Medieval English Studies Newsletter 40: 2-10.

Hoffmann, Sebastian. 2004. "Using the OED quotations data base as a corpus - a linguistic appraisal". ICAME Journal 28: 17-30.

Hunston, Susan. 2002. Corpora in Applied Linguistics. Cambridge: Cambridge University Press.

Kennedy, Graeme. 1998. An Introduction to Corpus Linguistics. London: Longman.

King, John N. 2000. "Religious writing". English Literature 1500-1600, ed. by Arthur F. Kinney, 104-131. Cambridge: Cambridge University Press.

Kohnen, Thomas. 2004a. Text - Textsorte - Sprachgeschichte: Englische Partizipial- und Gerundialkonstruktionen 1100 bis 1700. Tübingen: Niemeyer.

Kohnen, Thomas. 2004b. "'Let mee bee so bold to request you to tell mee': Constructions with let me and the history of English directives". Journal of Historical Pragmatics 5(1): 159-173.

Kroch, Anthony S. & Ann Taylor. 2000. The Penn-Helsinki Parsed Corpus of Middle English PPCME2.

Kytö, Merja. 1996. Manual to the Diachronic Part of the Helsinki Corpus of English Texts: Coding Conventions and Lists of Source Texts. (3rd ed.) Helsinki: Department of English, University of Helsinki.

Kytö, Merja & Matti Rissanen. 1992. "A language in transition: The Helsinki Corpus of English Texts". ICAME Journal 16: 7-27.

Kytö, Merja, Matti Rissanen & Susan Wright, eds. 1994. Corpora across the Centuries. Proceedings of the First International Colloquium on English Diachronic Corpora. Amsterdam: Rodopi.

Kytö, Merja & Matti Rissanen. 1997. "Introduction: Language analysis and diachronic corpora". Tracing the Trail of Time. Proceedings from the Second Diachronic Corpora Workshop, ed. by Raymond Hickey, Merja Kytö, Ian Lancashire & Matti Rissanen, 9-22. Amsterdam: Rodopi.

Kytö, Merja, Mats Rydén & Erik Smitterberg, eds. 2006. Nineteenth-Century English. Cambridge: Cambridge University Press.

Markus, Manfred. 1991. "Vorüberlegungen zur Herstellung einer (mittelenglischen) Volltextdatenbank". Arbeiten aus Anglistik und Amerikanistik 16(2): 159-173.

Markus, Manfred. 1994. "The concept of ICAMET Innsbruck Computer Archive of Middle English Texts". In Kytö, Rissanen & Wright (eds.), 41-52.

Markus, Manfred. 1999. Manual of ICAMET (Innsbruck Computer Archive of Machine-Readable English Texts). Innsbruck: Leopold-Franzens-Universität.

McEnery, Tony, Richard Xiao & Yukio Tono. 2006. Corpus-Based Language Studies. An Advanced Resource Book. London: Routledge.

Meurman-Solin, Anneli. 1993. Variation and Change in Early Scottish Prose. Studies Based on the Helsinki Corpus of Older Scots. Helsinki: Suomalainen Tiedeakatemia.

Meurman-Solin, Anneli. 1995. "A new tool: The Helsinki Corpus of Older Scots (1450-1700)". ICAME Journal 19: 49-62.

Meyer, Charles F. 2002. English Corpus Linguistics. Cambridge: Cambridge University Press.

Moran, Jo Ann Hoeppner. 1985. The Growth of English Schooling 1340-1548. Learning, Literacy, and Laicization in Pre-Reformation York Diocese. Princeton: Princeton University Press.

Mukherjee, Joybrato. 2004. "The state of the art in corpus linguistics: three book-length perspectives". English Language and Linguistics 8(1): 103-119.

Nevalainen, Terttu & Helena Raumolin-Brunberg. 1989. "A corpus of Early Modern Standard English in a socio-historical perspective". Neuphilologische Mitteilungen 90: 67-110.

Nevalainen, Terttu & Helena Raumolin-Brunberg, eds. 1996. Sociolinguistics and Language History. Studies Based on the Corpus of Early English Correspondence. Amsterdam: Rodopi.

Nevalainen, Terttu & Helena Raumolin-Brunberg. 2003. Historical Sociolinguistics: Language Change in Tudor and Stuart England. London: Pearson Education.

Nurmi, Arja. 1999. "The Corpus of Early English Correspondence Sampler (CEECS)". ICAME Journal 23: 53-64.

Rissanen, Matti. 1992. "The diachronic corpus as a window to the history of English". Directions in Corpus Linguistics. Proceedings of Nobel Symposium 82, Stockholm, 4-8 August 1991, ed. by Jan Svartvik, 185-205. Berlin and New York: Mouton de Gruyter.

Rissanen, Matti. 1994. "The Helsinki Corpus of English Texts". In Kytö, Rissanen & Wright (eds.), 73-79.

Rissanen, Matti. 2000. "The world of English historical corpora: From Cædmon to computer age". Journal of English Linguistics 28: 7-20.

Rissanen, Matti & Merja Kytö. 1993. "General introduction". In Rissanen, Kytö & Palander-Collin (eds.), 1-17.

Rissanen, Matti, Merja Kytö & Minna Palander-Collin, eds. 1993. Early English in the Computer Age: Explorations through the Helsinki Corpus. Berlin: Mouton de Gruyter.

Rissanen, Matti, Merja Kytö & Kirsi Heikkonen, eds. 1997a. English in Transition. Corpus-based Studies in Linguistic Variation and Genre Style. Berlin: Mouton de Gruyter.

Rissanen, Matti, Merja Kytö & Kirsi Heikkonen, eds. 1997b. Grammaticalization at Work. Studies of Long-term Developments in English. Berlin: Mouton de Gruyter.

Schmied, Josef. 1994. "The Lampeter Corpus of Early Modern English Tracts". In Kytö, Rissanen & Wright (eds.), 81-89.

Schmied, Josef & Claudia Claridge. 1997. "Classifying text- or genre-variation in the Lampeter Corpus of Early Modern English texts". Tracing the Trail of Time. Proceedings from the Second Diachronic Corpora Workshop, ed. by Raymond Hickey, Merja Kytö, Ian Lancashire & Matti Rissanen, 119-135. Amsterdam: Rodopi.

Schofield, Roger S. 1968. "The measurement of literacy in pre-industrial England". Literacy in Traditional Societies, ed. by J. Goody, 311-325. Cambridge: Cambridge University Press.

Siemund, Rainer & Claudia Claridge. 1997. "The Lampeter Corpus of Early Modern English Tracts". ICAME Journal 21: 61-70.

Swanson, Robert N. 1995. Religion and Devotion in Europe, c.1215-c.1515. Cambridge: Cambridge University Press.

Taavitsainen, Irma. 2002. "Historical discourse analysis: Scientific language and changing thought-styles". Sounds, Words, Texts, Change: Selected Papers from the Eleventh International Conference on English Historical Linguistics, ed. by Teresa Fanego, Belén Méndez-Naya & Elena Seoane, 201-226. Amsterdam and Philadelphia: John Benjamins.

Taavitsainen, Irma. 2005a. "Genres and the appropriation of science: Loci communes in English literature in late medieval and early modern periods". Opening Windows on Texts and Discourses of the Past, ed. by Janne Skaffari, Matti Peikola, Ruth Carroll, Risto Hiltunen & Brita Wårvik, 179-196. Amsterdam and Philadelphia: John Benjamins.

Taavitsainen, Irma. 2005b. "Standardisation, house styles, and the scope of variation in Middle English scientific writing". Rethinking Middle English: Linguistic and Literary Approaches, ed. by Nikolaus Ritt & Herbert Schendl, 89-109. Frankfurt am Main: Peter Lang.

Taavitsainen, Irma & Päivi Pahta, eds. 2004. Medical and Scientific Writing in Late Medieval English. Cambridge: Cambridge University Press.

Taavitsainen, Irma, Päivi Pahta & Martti Mäkinen. 2005. Middle English Medical Texts. Amsterdam and Philadelphia: John Benjamins.

Werlich, Egon. 1983. A Text Grammar of English. (2nd ed.) Heidelberg: Quelle & Meyer.