A toolkit for constructing corpus networks [1]

Thomas Kohnen
University of Cologne


The aim of this paper is to look at the advantages and challenges of linking (diachronic) corpora and, thus, of constructing corpus networks. It indicates parameters which may serve as tools for determining whether corpora or sections of corpora can be compared and fruitfully linked. These parameters will be presented in the form of a toolkit for annotating (sections of) existing corpora. The parameters relate mostly to functional text structure (text functions, interactive format and publication format) and the position of texts within a genre or domain structure (hierarchies, sets and chains of genres). The article also looks at the possible interdependence of the different parameters and the various kinds of similarity they ensure.

1. Introduction

Since the publication of the Helsinki Corpus, several other diachronic corpora have been compiled (or are still in the process of compilation), many of them genre- or domain-based (for example, the Corpus of Early English Correspondence, Middle English Medical Texts, the Zurich English Newspaper Corpus, A Corpus of English Dialogues). Whereas some of these corpus projects aim to provide a larger and more comprehensive documentation for a specific genre or domain where the Helsinki Corpus contains few data or none at all (for example, Middle English Medical Texts or the Zurich English Newspaper Corpus), others are designed as a basis for the development of a new research agenda (the Corpus of Early English Correspondence for the new field of historical sociolinguistics; see Nevalainen and Raumolin-Brunberg 2003).

The availability of many individual diachronic corpora, stemming from quite distinct, often unrelated domains of language use, raises the question of how these could be combined in order to build up a more powerful data base and facilitate more reliable results. However, simply (and indiscriminately) "adding up" diachronic corpora will not necessarily help many researchers. Those researchers, for example, who are interested in the interactive format of texts and would like to find second-person pronouns directly appealing to the addressee of the text would have to browse millions of words and manually check thousands of items in order to get two or three hundred relevant tokens. Instead, they would probably prefer a smaller but more relevant data set to start with. Such a data set could be obtained by selecting and linking suitable parts of existing corpora which meet the requirements of the study more exactly (in this case, the specific interactive format of a text). But how do we know that certain corpora or sections of them are actually comparable and can thus be combined to form a coherent data base for a given research task? What are the parameters that might provide the necessary metadata for corpus users to decide whether sections of corpora may be linked or not?

The aim of this paper is to look at the advantages and challenges of comparing and linking corpora. In particular, it intends to specify parameters which may serve as tools for determining whether corpora or sections of corpora can be compared and which parts may be fruitfully linked. I will present these parameters in the form of a toolkit for annotating (sections of) existing corpora and thereby deriving relevant metadata for constructing corpus networks of different size and dimensions. The parameters included in this toolkit relate mostly to text functions and the position of texts within a genre or domain structure. Thus, they are quite different from existing general coding guidelines that primarily aim at formal properties of text structure. [2]

My paper falls into four parts. After a short survey of previous research I will focus on two kinds of parameters: those relating to functional text structure (with text functions, interactive format and publication format as major categories) and those relating to the domain structure or network structure (with hierarchies, sets and chains of genres as major categories). In the concluding section I will briefly address the possible interdependence of the different parameters and the various kinds of similarity they ensure.

2. Previous research

As far as I know, not much research has been devoted to the question of how far different corpora, in particular diachronic corpora or parts of them, can be combined to form larger data bases or corpus networks. However, the desire to make corpora transparent and to make explicit many features that text samples do or do not have in common can already be found in early historical corpora, for example, the Helsinki Corpus. Here, the compilers included basic relevant information for comparing or linking text samples in the text headers of all files (the so-called textual parameters; see Kytö 1996, Chapter 3.3.4.). These parameters give basic information about the text sample (for example, about date, dialect, genre, prototypical text category, formality and interactive quality) as well as essential facts about the author. However, the information given here is in most cases fairly rudimentary and some broad descriptions (for example, the prototypical text category) may sometimes even be misleading (on this see below).

In a recent project, the European Database of Descriptors of English Electronic Texts (EUDDEET; Diller, De Smet and Tyrkkö 2010), these text parameters have been used as a basis for a stable set of categories for describing historical texts that are freely available on the web. In this approach the primary aim is not linking existing corpora but helping researchers to create their own corpora, once the texts are available with the relevant descriptions / descriptors. The set of descriptors listed in the 2010 publication relies heavily on the textual parameters used in the Helsinki Corpus. It comprises: author’s name, prototypical text category, genre and subgenre (here the Helsinki Corpus had 'text type'), verse/prose, written/spoken, author’s sex, author’s birth, date first published, author’s death, with the addition of the website where the text is found and some individual notes.

Without doubt, these descriptors can reveal important similarities and differences between texts or text samples, but at the same time some distinctions and links between texts in corpora will go unnoticed, especially those which are based on the text-functional and domain-based parameters. These parameters seem to influence the linguistic and functional profile of texts quite considerably and thus determine the extent to which they are similar and can be fruitfully linked.

3. Parameters of (functional) text structure

3.1 General text functions

What I call general text functions can be defined as functions of texts which recur inside texts and which may be typically associated with sections of texts (the basic notion goes back to Werlich 1983; for a more detailed account see Kohnen 2007, 2010 and Rütten 2011). For example, in texts which belong to the sphere of religious instruction, we typically find text sections devoted to exhortation, exposition, exegesis, narration, and sometimes argumentation. Thus, a sermon may comprise portions of text devoted to telling people what to do or not to do (exhortation), to telling stories from the Bible and interpreting them (narration and exegesis) and to giving more systematic accounts of the Christian doctrine (exposition).

These recurring text functions may be domain-specific or not. For example, the particular function of exegesis and the special combination of narration, exegesis and exposition may be typical of the religious domain. On the other hand, most of these functions recur in all texts in all domains and thus have a nearly universal nature.

The important point about general text functions in the context of linking text files of different corpora is that the predominant text function of a text or text section often determines the frequency and distribution of linguistic features in this text. Thus, a text with predominantly narrative sections will have different features than a purely expository text. This correlation between textual functions and the distribution of linguistic features has been shown on a broad statistical basis in Biber’s seminal work (Biber 1988), in particular with regard to his notion of textual dimensions. [3]

While there is a clear correlation between general text function and linguistic form (or, in Biber's terms, between textual dimension and the distribution of linguistic features), there is hardly any clear association between text function and genre. Texts belonging to the same genre (or the same text type in the terms of the Helsinki Corpus) may be quite idiosyncratic with regard to the distribution of general text functions, and text functions may co-occur in basically unforeseeable ways, with changing combinations and proportions. Thus, in the context of the Helsinki Corpus, texts that belong to the same text type or prototypical text type (for example, religious instruction or non-imaginative narration) may show major differences in terms of the occurrence and proportions of general text functions. For example, texts which belong to the sphere of religious instruction may be predominantly exhortative and narrative (for example, sermons) or predominantly expository and (in addition) exhortative (for example, treatises; on this see also Rütten 2011). On the other hand, texts which are associated with different prototypical text types may be quite similar in terms of general text functions. For example, early petitions as well as sermons and fiction, that is, texts stemming from completely different prototypical text types, may all contain long narrative sections, that is, they share one principal text function.

Example (1) below illustrates the predominant exhortative and narrative functions (with a short section of exposition) in an excerpt from an Early Modern sermon; example (2) contains prevalent expository sections, interrupted by small exhortative parts, in a Middle English religious treatise; example (3) shows a long narrative section in a late Middle English petition. [4]

(1) <exhortation> Lett vs therefore Christen people for whome Chryste hathe thus shedde his bloode, gyue tha~kes for his passion, lett vs ioye in his resurreccio~, lett vs laude hym for his ascencio~, for all y=t=  hoole catholique chirche thrughoute the worlde dothe reioyce therein: besechinge him that dydde this moche for vs, that he wyll of his mercyfull godenes brynge and make vs tascend in to thys hys glorye. Lett vs Chrysten people reioyce y=t= cryste was borne, to teache vs, that he died, to heale vs. For his crosse was deathe to hym, & to vs lyffe. [Continues in Appendix]
(2) <exposition> Whanne we þoru3 Goddis grace þese lettyngis haue fordon and oure hertis stablisschid, þan may we hope þat vs schal come þat we in preyer biseche. In hope þus vs setteþ oure Lord whan he lerneþ vs to calle <p11> hym oure fadir þat is in heuenes, ffor in hym men owen to haue certeyn hope þat may 7 wole alle goodis 3yue þat oure soule [3erneþ], þe which is vndirstonden þoru3 þis word: Pater noster, þat is: oure fadir. And þe power þoru3 þis word: qui es in celis: þat is in heuenes. In as myche as God techiþ vs to calle hym oure fadir, in þat he makiþ vs to vndirstonde þat he loueþ vs as his dere childre and þat he wole 3yue vs of his goodis aftir we haue need. </exposition> [Continues in Appendix]

(3) <1> A Roy nostre souerain seigneur
Besechen humbly youre Comunes of this present <2> parliament. that
<narration> where one Iohn Carpenter of Brydham in the Shire of Sussex husbund-man the vii daye of <3> Fevever the yere of youre noble reigne the viiite saying to Isabell his wijff that was of the Age. of xvje. <4> yere and had be maried to him but xv dayes. that they wolde go to gedre on Pilgremage and made to arraye hir in <5> hir best arraie and toke hir with hym fro the said Toun of Brydham to the Toun of Stoghton in the <6> said Shire. And there in a woode he smote the said Isabell his wijff on the hede that the brayne wende oute <7> and with his knyff yaf hire many other dedly woundes. [Continues in Appendix]

So it is likely that the indication of the basic genre or prototypical text category does not give reliable information about the prevailing general text functions and, consequently, about the linguistic forms that are to be expected in the texts (for example, one would not expect so many past tense forms typical of narrative sections in a statutory text).

Thus, the specification of the predominant general text functions of the text samples contained in a corpus is an important prerequisite for establishing reliable links between corpora and combining sections which should be comparable. However, it seems that in most diachronic corpora these specifications are lacking and researchers will have to approach the text samples by reading them and testing the principal general text functions.
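Once sections have been tagged in the manner of examples (1) and (2), the predominant general text function of a sample can be estimated mechanically. The following Python fragment is only a minimal sketch under stated assumptions: the tag inventory is taken from the function labels used in this section, every tagged section is assumed to be closed within the sample, and word count is assumed to be an adequate measure of proportion. The function names are hypothetical and do not belong to any existing corpus tool.

```python
import re
from collections import Counter

# Assumed tag inventory: the text-function labels named in Section 3.1.
FUNCTIONS = ("exhortation", "exposition", "exegesis", "narration", "argumentation")

# Matches <label> ... </label> for any of the assumed labels.
SECTION = re.compile(r"<(%s)>(.*?)</\1>" % "|".join(FUNCTIONS), re.DOTALL)

# Other markup inside a section (for example page markers such as <p11>).
INNER_MARKUP = re.compile(r"<[^>]+>")


def function_profile(annotated_text):
    """Return the share of words per text function in an annotated sample."""
    counts = Counter()
    for label, body in SECTION.findall(annotated_text):
        words = INNER_MARKUP.sub(" ", body).split()
        counts[label] += len(words)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


def predominant_function(annotated_text):
    """Return the label with the largest share of words, or None."""
    profile = function_profile(annotated_text)
    return max(profile, key=profile.get) if profile else None


# Shortened, invented sample that merely imitates the tagging of the examples.
sample = (
    "<exhortation>Lett vs therefore gyue thankes</exhortation> "
    "<narration>And there in a woode he smote the said Isabell "
    "his wijff on the hede</narration>"
)
profile = function_profile(sample)
main_function = predominant_function(sample)
```

A profile of this kind could be stored with each text sample, so that corpus users can compare proportions of text functions rather than rely on the genre label alone.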

3.2 Interactive format

Another important functional parameter relevant for linking corpora is the interactive format of texts or text samples. Central dimensions of the interactive format of texts are whether they are presented in the form of a dialogue or a monologue, and – if it is a monologue – whether they have a strong orientation towards the reader or not. Thus, a handbook on witches may take the form of a fictional dialogue or simply expound relevant facts. A religious treatise may be written in the form of a letter and directly appeal to the recipient or the author may only state and explain facts without any explicit address to the reader.

It goes without saying that the interactive format of a text has important consequences for the frequency and distribution of several linguistic features (for example, personal pronouns, address terms, deictic elements, first- and second-person verb forms, interrogative sentences, elliptical clauses etc.) and for whether these features refer to a fictional context or to a context shared by the addressor and the reader of the text.

The excerpt from a religious treatise included in example (4) below is cast in the form of a fictional dialogue, with many second-person pronouns that aim at the interactional partner in the fictional dialogue.


(4) Of the Conversion of a Sinner; What it is?
Speakers. Paul, A Teacher
                   Saul, A Learner,
Paul. Well Neighbour; Have you examined your self by the word of God, since I saw you, as I directed you?
Saul. I have done what I can in it.
P. And what do you think now of your case upon tryal?
S. I think it is much worse than I had hoped it was; and as bad as you feared: …

(Richard Baxter, The poor man's family book, 1674; COERP)

Thus, the interactive format is an important parameter to be included in the metadata for linking corpora, in particular, since fictional dialogues or letter formats appear in historical texts quite unexpectedly. Apart from the above example, fictional dialogue is found in Middle English and Early Modern English handbooks and secular treatises. The letter format occurs in many Middle English treatises as well, with a strong reader orientation. The same applies to prefatory material attached to handbooks, catechisms and collections of treatises. Here, the relevance of the interactive format can be seen not only in the frequency of address terms and deictic elements in the texts but also in their differing reference. Whereas in letters and prefaces they will in most cases refer to the recipient and his / her world, they will remain "within the text" in most fictional dialogues.

3.3 Compilation / publication format

Another relevant parameter which belongs to the functional text structure involves the compilation or publication format of the text or text sample. This parameter determines the "neighbourhood" or "background" of a text, whether it is part of a larger manuscript containing other texts, whether it is part of a so-called commonplace book, whether it was published as a pamphlet, whether it can be found in a newspaper or journal, whether it is part of a published collection of texts or whether it is simply part of one book.

The compilation or publication format of a text is quite relevant when text samples of different corpora are linked, since it may affect the structure and function of texts. For example, commonplace books (that is, collections of texts which were compiled for future reference or further use, containing a wide range of different texts and genres) were designed as multifunctional reservoirs which could be reused in multi-layered communication practices involving role shifts in text communication and changes of text function (see Kohnen 2011). Thus, a law text or a piece of religious instruction contained in a commonplace book could have differing addressors and addressees or serve different text functions, depending on how they were re-used (for example, the compiler could be the recipient or the addressor of a text of religious instruction). The publication type "pamphlet", serving as a platform for publishing a wide variety of different genres, had an impact on the form and function of the respective genre, for instance, by making petitions or letters more interactive and less formulaic (see Claridge 2000 and Groeger 2010). In a similar way it is important to know whether a text sample stems from a newspaper and to which of the several sub-genres it belongs (for example, news section, obituary, advertisements etc.).

4. Parameters of domain structure

Now I turn to parameters that relate to the domain structure underlying texts and genres. The concept of a domain structure or network structure of genres goes back to the investigation of research genres by John Swales (2004) and the work on genre dynamics in historical medical texts by Irma Taavitsainen (2009, 2010), but also to research on religious genres in the context of the compilation of the Corpus of English Religious Prose. It includes the idea that the genres of a particular domain (for example, religious discourse, discourse of science, discourse of mass media) form a structured network that can be captured in a systematic fashion. [5] Main elements of the domain structure are hierarchies (different basic constellations of discourse participants involved in the communication; Kohnen 2010), sets (groupings of genres according to their relative position in the discourse community) and chains (chronological or logical sequences of genres within a set; Swales 2004). Several large corpus projects are in fact domain-based (for example, the Corpus of Early English Medical Writing and the Corpus of English Religious Prose).

4.1 Hierarchies

In the present context, hierarchies are groups of genres ordered according to different basic constellations in which the discourse participants communicate with each other (for example, a regulation issued by a king for his subjects involves a different constellation or hierarchy than personal letters exchanged by fellow-students).

Among the hierarchies, I would like to distinguish a first-order, a second-order and a third-order sphere. The first-order sphere contains texts that are issued by a superior, binding body or authority and are directed at all the members of the discourse community. Typical examples are laws in the administrative domain or God’s word as seen in the canon of the Biblical writings in the religious domain. The second-order sphere contains texts which work the other way round. Here the members of a discourse community address a superior authority or institution. Typical examples are petitions to higher institutions in the administrative domain and different forms of prayer in the religious domain. The third-order sphere contains texts with which members of a discourse community, who do not form a superior body or institution, communicate with each other. Typical examples are private letters, sermons, handbooks, etc.

Let us take a closer look at the first-order sphere, with a superior, binding body or authority addressing all the members of the discourse community. In the religious domain we find here the Bible (conceived as God’s word); in the administrative domain there are statutes and laws (issued by the legislative, that is law-making, authority directed at the community of subjects or citizens); in the domain of science and education there are, for example, fundamental treatises of established medical authors or essential text books which in their respective periods are held to contain incontestable truths or – later in the history of science – common views for the time being established in the discourse community. All these texts and genres have in common that they apply to the whole of the discourse community.

Typical genres which belong to a first-order hierarchy share several features which are relevant when linking or comparing text samples from different corpora. Such texts usually enjoy a high prestige and serve a central function in their respective discourse domains. They are often used as a fundamental point of reference and a common basis for justification (for example, the texts of the Biblical tradition or laws). Also, these texts are often found cited in other texts (of the same and other domains). Lastly, they are not liable to change (unless, in the case of the Bible, a new translation is made, or, in the case of statutes, a new law is passed). When comparing and combining sections from different corpora, researchers need to be able to identify this central, privileged position of texts from the first-order sphere.

It is important to see that in most domains of language use membership of the first-order sphere does not seem to be fixed, but may be blurred and change. For example, in the domain of science established authors and texts may lose their central position in the discourse community and be replaced by others. Taavitsainen (2010) describes such a shift in the domain of science after the foundation of the Royal Society in 1660: the shift was rather slow, with new prestigious genres evolving (experimental reports, book reviews) and new stylistic guidelines being set up for other genres. In addition, the relevant texts in the first-order sphere in earlier periods of the history of English may not necessarily be in English, but mostly in Latin (sometimes in Greek or French). For example, in the religious domain the Latin Vulgate version of the Bible was predominant (and officially prescribed) until the first decades of the 16th century and in the domain of medicine most accepted texts were Latin right until the 16th century (see Taavitsainen 2010).

Now to the second-order sphere, with members of a discourse community addressing a superior authority or institution. Typical examples in the religious domain are the various forms of prayer, both liturgical and private. Here members of the Christian community address the superior, transcendental authority of God. In the administrative domain we are dealing with genres of the second-order sphere when subjects or citizens address superior legal institutions (for example, using petitions or official letters, possibly also in wills when appealing to a higher authority). In the domain of science we may include letters sent by private researchers to a super-ordinate scientific body or authority (for example, the Royal Society before publication in the transactions).

The common feature of texts belonging to the second-order sphere is that they appeal to a higher authority, which may result in similar text structures and several common linguistic features (for example, a text section devoted to a petition, address terms, forms of directives, forms of positive and negative politeness etc.). The similarities (but also differences) may be seen when comparing a petition with a private prayer (see Kohnen 2008).

The third-order sphere, with members of the discourse community addressing other members of the discourse community, seems to comprise a large variety of genres in all domains. For example, in the religious domain there are the genres which belong to religious instruction or theological discussion; in the administrative domain we can include agreements between different parties or handbooks on legal advice; we also find handbooks in various other domains and on different subjects; different kinds of (private) letters issued either from a lower or a higher social rank also belong to this sphere. The important feature shared by all these genres is that members of a discourse community who do not form a superior institution communicate with each other.

In some cases it seems difficult to decide whether members of a discourse community act as part of an established institution or not. For example, in the religious sphere the writings of the Church Fathers and certain other official doctrinal documents (for example, in the Catholic Church issued by the Pope or a Council) would hardly belong to the third sphere. In the domains of science and administration the assignment of particular texts (or authors) may be quite difficult (see also Taavitsainen 2010). By contrast, one of the special characteristics of the private domain seems to be that it only comprises genres of the third sphere.

In addition to the large variety of texts and genres found in the third sphere, it is a special characteristic of this sphere that its texts do not seem to share particular textual or linguistic features. Thus, it may be that the category is too broad and other domain-specific features are needed in order to establish finer links. For example, in the third sphere of the religious domain there is the additional distinction between religious instruction and theological discussion. It remains to be seen whether parallel distinctions can be drawn in other domains and what further divisions can be introduced.

4.2 Sets

Genre sets are groupings of genres which reflect their relative (that is, more central or more peripheral) position in the discourse community. The basic distinction is between core, minor and associated genres. Core genres are genuinely rooted in the domain (for example, prayers and catechisms in the religious domain, laws in the administrative domain). Minor genres are similarly rooted in their respective domain, but they do not apply to all members of the discourse community since they are used only in special institutions or by specially qualified people (for example, monastic rules and liturgical prayer in the religious domain; law commentaries or summonses in the administrative domain). Associated genres originally exist outside their respective domain and are not genuinely part of it but become associated with it at some point in time due to cultural or technological developments (for example, pamphlets and prefaces as a result of the printing press). Set membership is a highly relevant parameter when specifying the metadata for linking sections of different corpora. It is, of course, important to know whether a given text stems from a core, a minor or an associated genre in its respective domain, since this has major consequences for its spread and its audience, but also for the linguistic features found in it.

A particularly interesting set, which may show comparable properties across different domains, is the group of the associated genres. Prototypical associated genres are prefaces and pamphlets. Both may be found in various domains. For example, the religious domain includes prefaces to collections of treatises or sermons, to catechisms and religious biographies; similarly, the domain of science offers prefaces to treatises and handbooks; in the administrative domain we find prefaces to collections of statutes and commentaries. On the other hand, pamphlets form a platform for publishing letters, sermons, petitions, treatises and many other genres from various domains (see also the different subdivisions of the Lampeter Corpus in terms of the domains religion, politics, economics / trade, science and law; see Schmied and Claridge 1997).

Preliminary research with the data of the Corpus of English Religious Prose has shown that associated religious genres differ from core genres in that the distribution and frequency of typical religious features is closer to that of secular genres (Kohnen forthcoming). Against this background it seems likely that associated genres of different domains show certain common features and it may be rewarding to compare them across different areas of language use. At any rate, it should not go unnoticed that a particular associated genre may have equivalents in completely unrelated fields.

4.3 Chains

Another relevant parameter which should be considered when linking corpora is the genre chain. The term 'genre chain' was introduced by John Swales (2004) in order to capture the chronological and logical sequences of genres in scientific writing. For example, a review both chronologically and logically follows the work which is reviewed. In the present context, one significant feature seems to be whether a genre starts a chain or constitutes the endpoint of a chain. Generally, the chain-initiating status of a genre is often associated with a specific interactive format, namely the dialogue and letter format, especially in Middle English and Early Modern English. For example, in the religious domain, catechisms, which have to be seen within an initial chain relation to the other (more "advanced") core genres of religious instruction, typically show a dialogue format. In a similar way, we find dialogues in chain-initial genres in several other domains, for example, in accessible medical treatises (see Taavitsainen 1999) and in straightforward handbooks on a variety of topics (see, for example, several handbook excerpts in the Early Modern section of the Helsinki Corpus). Similarly, letter formats in non-correspondence texts often form starting points of genre chains in that they give first and straightforward instruction (a good example is William Cobbett's English Grammar in the form of letters addressed to his son; Cobbett 1983).

Thus, it seems likely that the chain position of a genre has important consequences for the textual and linguistic structure of its texts. It should be included in the parameters which are relevant for linking and comparing corpora.
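The chain position of a genre could be recorded in the metadata by representing each chain as an ordered sequence and deriving the position from it. The fragment below is a sketch under stated assumptions: the two chains are the illustrations given in Table 1, the chain names ("religious instruction", "medicine") are labels introduced here purely for illustration, and the three-way distinction between initial, medial and final is one possible simplification.

```python
# Assumed chains, taken from the illustrations in Table 1; the dictionary
# keys are hypothetical labels, not terms used in any existing corpus.
CHAINS = {
    "religious instruction": ["catechism", "sermon", "religious treatise"],
    "medicine": ["medical dialogue", "medical treatise"],
}


def chain_position(genre):
    """Return (chain name, position) for a genre, or None if it is in no chain."""
    for name, chain in CHAINS.items():
        if genre in chain:
            index = chain.index(genre)
            if index == 0:
                return name, "initial"
            if index == len(chain) - 1:
                return name, "final"
            return name, "medial"
    return None


catechism_position = chain_position("catechism")
treatise_position = chain_position("medical treatise")
```

Genres for which no chain has been established simply return no position, which keeps the parameter optional in the annotation.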

5. Conclusions

The aim of this paper was to explore important elements of a toolkit which may provide faithful standards of comparison when linking domain-based corpora (or parts of them) and thus form a basis for constructing corpus networks. My emphasis was on two kinds of parameters, text-functional and domain-based parameters (see Table 1 below). Text-functional parameters relate primarily to the structure of the text or text sample in a corpus, whereas domain-based parameters determine the relative position of the respective genre / text within the discourse community.

text-functional parameters                domain-based parameters

general text functions                    hierarchies
  (e.g. narration, exposition,              first-order sphere
  argumentation etc.)                       second-order sphere
                                            third-order sphere

interactive format                        sets
  dialogue                                  core genres
  monologue                                 minor genres
    reader orientation                      associated genres
    no reader orientation

compilation / publication format          chains
  manuscript collection                     catechism -> sermon -> religious treatise
  commonplace book                          medical dialogue -> medical treatise

Table 1. Text-functional and domain-based parameters
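The parameters summarised in Table 1, together with the hierarchies of Section 4.1, could be stored as one metadata record per corpus section and then used to select comparable sections from different corpora. The following Python fragment is a minimal sketch, not a proposed standard: the record layout, the field names, the corpus names ("Corpus A" etc.) and all values in the sample records are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SectionMetadata:
    """Hypothetical metadata record for one section of a corpus."""
    corpus: str
    section_id: str
    # Text-functional parameters
    text_functions: Tuple[str, ...]      # predominant function first
    interactive_format: str              # "dialogue" or "monologue"
    reader_orientation: Optional[bool]   # only meaningful for monologues
    publication_format: str
    # Domain-based parameters
    sphere: int                          # 1, 2 or 3 (order of the hierarchy)
    genre_set: str                       # "core", "minor" or "associated"
    chain_position: Optional[str]        # e.g. "initial", or None if unknown


def select(sections, **criteria):
    """Return the sections whose metadata match every given criterion."""
    return [
        s for s in sections
        if all(getattr(s, name) == value for name, value in criteria.items())
    ]


# Invented sample records; none of these values describes a real corpus file.
sections = [
    SectionMetadata("Corpus A", "treatise-in-letter-form",
                    ("exposition", "exhortation"), "monologue", True,
                    "manuscript collection", 3, "core", None),
    SectionMetadata("Corpus B", "preface-to-catechism",
                    ("exhortation",), "monologue", True,
                    "book", 3, "associated", None),
    SectionMetadata("Corpus B", "catechism",
                    ("exposition",), "dialogue", None,
                    "book", 3, "core", "initial"),
    SectionMetadata("Corpus C", "statute",
                    ("exposition",), "monologue", False,
                    "published collection", 1, "core", None),
]

# Sections with address terms likely to refer to the actual addressee.
reader_oriented = select(sections, interactive_format="monologue",
                         reader_orientation=True)
linked_ids = [s.section_id for s in reader_oriented]
```

In this sketch the selection cuts across corpus boundaries: the two matching sections come from different (invented) corpora, which is the kind of link the toolkit is meant to support.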

It is important to see that the two kinds of parameters open up quite diverse perspectives on texts and text samples in a corpus and thus may imply quite different kinds of similarity. Text-functional parameters ensure above all similarity of linguistic structures, whereas network parameters mostly ensure similarly situated genres (which then, of course, may share certain linguistic features). For example, a similar interactive format suggests a comparable frequency and distribution of interactional linguistic features (personal pronouns, deictic elements, address terms etc.). Membership of a first-order sphere implies high prestige and great intertextuality of the text in comparison to the other texts / genres in the discourse domain.

Texts which belong to the first-order sphere will not necessarily be similar in terms of linguistic form. Rather, they will represent the idiosyncratic linguistic features of the respective domain in the purest form (for example, the typical administrative style in statutes or the typical religious register in Bible translations). Since the differing idiosyncratic features of a domain-based style may be quite incommensurable and also resistant to change, first-order genres may not necessarily be comparable in these linguistic respects at all but form interesting contrasts. On the other hand, associated genres of the third-order sphere will not represent the typical linguistic features of their domain to such an extent and may include more "common" linguistic features that are less domain-specific.

Also, the network parameters typically interact with each other. Hierarchies may imply a specific set membership and chain position and vice versa. For example, if a text or genre belongs to a first-order sphere (laws, the Bible or fundamental academic textbooks), it will typically form a core genre in the domain and often be non-initial. By contrast, associated genres typically belong to the third-order sphere.

These intricate relationships and contrasts nicely illustrate that the parameters included in this toolkit may in fact cover quite diverse research interests and may offer wide possibilities for contrastive and comparative analyses. On the one hand, the toolkit allows for the construction of rather specific and restrictive compilations (for example, collecting texts with address terms only referring to the addressee of the texts); on the other hand, it might serve as the basis for constructing comprehensive corpus networks that facilitate broad comparisons across domains.

One obvious requirement which the model envisaged in this article should meet in the future is application in more domains. So far, illustrations have been given for the religious domain, the administrative domain and the domain of science. One particular problem seems to be that in some domains (for example, the private domain) there are no first- and second-order spheres. In other domains (mass media, education, economy) more detailed research is necessary to trace the hierarchies in the multi-layered relationships between institutions, text traditions and language users.

A concrete first step for testing the practicability of the suggested toolkit and its different parameters might be a systematic comparison of the Early Modern part of the Corpus of English Religious Prose (COERP) and the Early Modern English Medical Texts (EMEMT) corpus. Both COERP and EMEMT are domain-based corpora, so the feasibility of both the text-functional and the domain-based parameters in establishing systematic and fruitful links between texts from different domains can be assessed here. A complementary test could be conducted with the various pamphlets contained in the Lampeter Corpus. Here we are dealing with one associated genre with texts from different source domains. These and other practical analyses could help to create the first outlines of a diachronic corpus network but would also refine and complete the parameters needed for constructing it.


Corpus of Early English Correspondence, http://www.helsinki.fi/varieng/domains/CEEC.html

Corpus of English Dialogues, http://www.engelska.uu.se/forskning/engelska-spraket/elektroniska-resurser/a-corpus

Corpus of English Religious Prose, http://anglistik1.phil-fak.uni-koeln.de/7042.html

Corpus of Middle English Medical Texts, http://www.helsinki.fi/varieng/CoRD/corpora/CEEM/

Helsinki Corpus = The Helsinki Corpus of English Texts. 1991. Helsinki: Department of English. http://www.helsinki.fi/varieng/CoRD/corpora/HelsinkiCorpus/

Lampeter Corpus of Early Modern English Tracts, http://clu.uni.no/icame/manuals/LAMPETER/LAMPHOME.HTM

ZEN = Zurich English Newspaper Corpus, http://es-zen.unizh.ch


[1] High-resolution images of sacred texts can be seen at the British Library Online Gallery, for example the Psalter of Henry VIII.

[2] See, for example, the P5 Guidelines of the Text Encoding Initiative. http://www.tei-c.org/release/doc/tei-p5-doc/en/html/index.html

[3] That is, the functionally interpreted co-occurrence patterns of selected morpho-syntactic features. Note, however, that the text functions presented here are not defined by linguistic features but based on the account given in Werlich (1983). Also, the number and range of Biber's textual dimensions are quite different from the general text functions presented here.

[4] In the following extracts, exhortation is highlighted in red, exposition is highlighted in green, and narration is highlighted in yellow.

[5] The notion of domain is also used for classifying texts in the British National Corpus. However, there it is used as a text categorisation for written texts only, with the basic subdivisions of "imaginative" and "informative" writings (with several subfields) (see http://www.natcorp.ox.ac.uk/corpus/creating.xml). The term as it is used in this study is much broader, including also the socially defined institutions and frameworks for the formulation and dissemination of texts.


Biber, Douglas. 1988. Variation Across Speech and Writing. Cambridge: Cambridge University Press.

Claridge, Claudia. 2000. “Pamphlets and Early Newspapers: Political Interaction vs. News Reporting”. English Media Texts – Past and Present. Language and Textual Structure. Pragmatics and Beyond New Series 80, ed. by Friedrich Ungerer, 25–43. Amsterdam: Benjamins.

Cobbett, William. 1983. A Grammar of the English Language. Amsterdam: Rodopi.

Diller, Hans-Jürgen, Hendrik De Smet & Jukka Tyrkkö. 2010. “A European database of descriptors of English electronic texts.” The European English Messenger 19(2): 29–35.

Groeger, Dorothee. 2010. The Pamphlet as a Form of Publication. A Corpus-based study of Early Modern Religious Pamphlets. Aachen: Shaker.

Kohnen, Thomas. 2007. “From Helsinki through the centuries: The design and development of English diachronic corpora.” Towards Multimedia in Corpus Studies, ed. by Päivi Pahta, Irma Taavitsainen, Terttu Nevalainen & Jukka Tyrkkö. Helsinki: Research Unit for Variation, Contacts and Change in English. (Studies in Language Variation, Contacts and Change in English, Vol. 2). http://www.helsinki.fi/varieng/series/volumes/02/kohnen/

Kohnen, Thomas. 2008. “Tracing directives through text and time. Towards a methodology of a corpus-based diachronic speech-act analysis.” Speech Acts in the History of English, ed. by Andreas H. Jucker & Irma Taavitsainen, 295–310. Amsterdam: John Benjamins.

Kohnen, Thomas. 2010. “Religious Discourse”. Historical Pragmatics, ed. by Andreas H. Jucker & Irma Taavitsainen, 523–547. Berlin: Mouton.

Kohnen, Thomas. 2011. “Commonplace-book communication. Role shifts and text functions in Robert Reynes’s notes contained in MS Tanner 407.” Communicating Early English Manuscripts, ed. by Päivi Pahta & Andreas H. Jucker, 13–24. Cambridge: Cambridge University Press.

Kohnen, Thomas. forthcoming. “Religious discourse and the history of English”. Proceedings from the IAUPE Conference on Malta.

Kytö, Merja, comp. 1996. Manual to the Diachronic Part of the Helsinki Corpus of English Texts: Coding Conventions and Lists of Source Texts. 3rd ed. Helsinki: Department of English, University of Helsinki.

Nevalainen, Terttu & Helena Raumolin-Brunberg. 2003. Historical Sociolinguistics: Language Change in Tudor and Stuart England. London: Longman.

Rütten, Tanja. 2011. How to Do Things with Texts. Patterns of Instruction in Religious Discourse 1350–1700. Bern: Peter Lang.

Schmied, Josef & Claudia Claridge. 1997. “Classifying Text- or Genre-Variation in the Lampeter Corpus of Early Modern English Texts”. Tracing the Trail of Time. Proceedings from the Second Diachronic Corpora Workshop, ed. by Raymond Hickey, Merja Kytö, Ian Lancashire & Matti Rissanen, 119–135. Amsterdam: Rodopi.

Swales, John. 1990. Genre Analysis. English in Academic and Research Settings. Cambridge: Cambridge University Press.

Swales, John M. 2004. Research Genres. Explorations and Applications. Cambridge: Cambridge University Press.

Taavitsainen, Irma. 1999. “Dialogues in Late Medieval and Early Modern English medical writing”. Historical Dialogue Analysis, ed. by Andreas H. Jucker, Gerd Fritz, & Franz Lebsanft, 243–268. Amsterdam/Philadelphia: Benjamins.

Taavitsainen, Irma. 2009. “The pragmatics of knowledge and meaning”. Corpora: Pragmatics and Discourse. Papers from the 29th International Conference on English Language Research on Computerized Corpora, ed. by Andreas H. Jucker, Daniel Schreier & Marianne Hundt, 37–62. Amsterdam: Rodopi.

Taavitsainen, Irma. 2010. “Discourse and genre dynamics in Early Modern English medical writing”. Early Modern English Medical Texts. Corpus Description and Studies, ed. by Irma Taavitsainen & Päivi Pahta, 29–53. Amsterdam & Philadelphia: Benjamins.

Werlich, Egon. 1983. A Text Grammar of English. (2nd ed.) Heidelberg: Quelle & Meyer.


(1) <exhortation> Lett vs therefore Christen people for whome Chryste hathe thus shedde his bloode, gyue tha~kes for his passion, lett vs ioye in his resurreccio~, lett vs laude hym for his ascencio~, for all y=t=  hoole catholique chirche thrughoute the worlde dothe reioyce therein: besechinge him that dydde this moche for vs, that he wyll of his mercyfull godenes brynge and make vs tascend in to thys hys glorye. Lett vs Chrysten people reioyce y=t= cryste was borne, to teache vs, that he died, to heale vs. For his crosse was deathe to hym, & to vs lyffe. His natiuitie was ioye to all the worlde. His lyffes doctrine was to vs lyght & co~forte: his passion & deathe was to vs lyffe and solace: his resurreccion was to vs ioye and glorye. For after he was deade & wente to the helles, he reuerted ayen to the worlde and tryumphed. For he appered to sainte thomas & saide, putt thy fingre in to the wounds of my syde, touche my wounds, see howe that oute of them ranne bloode. Loke thomas vpon y=t= pryce of y=t= worlde, looke vpo~ y=t= sygnes of the nayles, for in those wou~des shalte y=t= fynde remeadyes & healthe for all y=t= sores of thy soule.

And to you y=t= Jewes, scrybes, and pharysees: beholde you this sepulcre, & recognice your sacrilege co~mitted annenste your lorde God beholde


you the grette crosse, beholde you y=t= roughe nayles, beholde the sharpe spere. Beholde the soore greuoouse woundyd bodye, whyche ye prykled, whyche ye scourged, whiche ye wounded & enuiously encreated. Beholde you y=t= sepulcre in whyche he was buryed. It is voyde, the bodye is rysen and ascended to the fadre, to your confusion & dampnacion. For Chryste reseruethe fixuras clauoru~m the markes and scarres of the nayles & wou~des, to shewe vnto you att that greate day, your sacrilegiouse tyra~nye whiche ye haue done vnto y=t= bodye, y=t= ye maye ther see your deades, that ye maye ther see your myscheuouse actes, your abhominable doyngs, to your own confusion, shame, and dampnacion. O pharyses, scrybes, and Jewes: consydre your malyce: consydre your malignitye commytted annenste your maker chryste. Reuoke your malice, putt awaye your enuiouse hartes, do ye penau~ce, wepe and wayle your synnes, crye to this crucyfied chryste that he maye remytte and forgyue

you your offenses.

And you chryste~ people, come ye nere, Joye ye and co~forte your selues in this chryste & god, in this sauiour of the worlde. Studye you to lyue in hym, to lyue in a sobrenes, in a clennes & chastitye, to lyue chrystianely, godly and vertuously. </exhortation>

<exposition> Chrysten ma~ thou haste harde nowe, howe many wayes criste hathe opened & shewed him selfe


to the worlde, and proued hym selfe veraye god. And namely in his passion & deathe, in his desce~cion to hell, in his resurreccion: and principally whenne he ascended vnto the heuens bodely, & ther sytteth in dextera patris on the ryghte hande of the fadre. </exposition>

<narration> And afterwarde accordinge to his promyse se~te downe the holye gooste in symilitude of fyerye tongues vpon the Pentecoste daye emo~ges his disciples: whyche gaue them veraye knowledge of chryste to be god. They wente thenne streight forthe abroode & preached chryste, and taughte hym to the worlde. At whose words the people came in, they ranne to penaunce, they toke baptysme, they beganne to smake of god, they lame~ted their synnes, they cryed for mercye, they bega~ne a Chrystian lyff. And to declare Chryste abrode to all the worlde, the disciples deuided the~ selues into sundrye partes and countreyes: Peter into Jerusale~, Antioche, and Rome. Paule into Rome, Damaske, Atheyns, Galathy, Corynthy, Sythia & Tracy. Methewe into Ethyope, Thomas into India Inde. Bartholomewe into the other India. Andrewe into Achaia & Jerusale~, John into Asia. Bothe the James, in terra~ Jude into the lande of Jude. And the resudue of the~ after ther lymytacion & lotte: some into oon cou~trey, and some into an other: so to edyfye chryste his faythe, so to buyld his chirche, so to shewe


god to the worlde. And annon Chryste grewe in knowledge to the worlde, annon the people beganne to knowe hym: to smake and sauour of hym; butt not in all parts. For after that Paule was conuerted & wente abrode preachinge christe, he came into the cyte of Athenes. And as his custome was, euer whenne he came into Citye or towne, he wolde fyrste vysyte the te~ple or chirche, and so dydd ther, & walked aboute their temple, and behylde the maner of their sacrifyce & culture of their godds: where he sawe many ymagies & aulters. And vpon euery aulter was a title sett vppe to shewe in whose honour itt was dedicated. Some aulter was dedicate as appered by the scripture to the godds of Asia, some to the goddis of Europe, some of Aphryca. Some to Jouis, some to Mercury, to y=t= Sonne, to Mars, to the Mone, & to suche other. And ouer oon aulter was wryten for the tytle to whome that aultre was dedicate, thes words, Ignoto deo. This aulter is consecrate in y=t= honour of the vnknowen god, Ignoto deo, to y=t= vnknowen god. And Paule seynge this, cryed to the people, O uiri Athenie~ses, per omnia uideo uos superstitiosos. O you people of Athaynes, I see you almoste in all thynges concerninge your rytes & customes in your temple, some to oon god, some to an other. Vnto those that ar no goddes but creatures, symulacres


& deade thinges. And oon of your aulters is made & sactifyed Ignoto deo, to an vnknowe~ god. Quem ergo ignorantes colitis, hunc ego annuncio uobis. I am no p~acher nor teacher of newe goddis. I do not fayne any newe goods, but I do shewe you the olde god, the god of Gods, the God euerlastynge, the god that is withoute begynnynge, and shalbe wythe oute endynge, the onely god of heuen and earthe, whyche ye yet knowe not, & yet ye worshipe him in your te~ple at oon of your aulters: where is wryten Ignoto deo, Dedicat to an vnknowen god. Et quem ignorantes colits, hu~c ego annuntio uobis. I preache this god vnto you that ye do worshipe and knowe not. Ye worshipe a god att this aulter, and knowe nott whome ye worshype, nor what God, I preache hym vnto you. This is he that madde heuen and earthe & all that is therin. This is the lorde of all. Of him all thynges hathe theyr beynges. This is hee y=t= maade the fyrste man of nawght, & of that oon man, all man kynde. This is he that toke y=t= same nature oon hym, so to be knowen to the worlde. And that is itt that the phylosopher saythe, & rehersid by the appostul, Ipsius enim & genus sumus. He toke our nature and became man. In ipso viuim9, mouemur & sumus. In hym & by hym we lyue & haue our beynge, our monynge, our lyff, & all that we haue. This is he y=t= ye haue putt to deathe & passion. This is he y=t= suffrede for you, y=t= dydd


oon the crosse. He is the price of the world, he hathe washed you in his bloode. He was the sacryfyce for y=t= hool worlde, he made thattonme~te betwene god and man. He is the mediator that procurethe remyssion of synne. And he shall come ayen and iudge y=t= worlde. Hunc ego annu~tio uobis. This god I shewe & preche vnto you and to all y=t= worlde. This is he that was vnknowen, that nowe is knowen to all y=t= worlde to be God. And thoughe the Jewes and pagans, thoughe the Sarysons, infydelles, & Turkes, wyll nott yett knowe hym, worshype hym, nowe take hym for theyr god: att the daye of iudgemente whenne he shall come ayen, the~ne shall they see him, the~ shall they feale hym & knowe him. The~ shall thei

knowe y=t= hygh myght & power of this lord & god. Then shall they knowe y=t= he is veraye god. The~ shal he shewe hi~ sylf to y=t= world as he is, god & ma~. Ecce ueniet cu~ nubib9 & uedebit eu~ oi~s oculus & qui cum pupugerunt. Et plangent se super eum omnes tribus terre, etaim Amen. He shall come in y=t= clowdes and (as Mathewe dothe saye) he shall come in a grette maiestye, and euery eye, euery person, ye all ma~ kynde shall thenne see hym, ye and they that hathe pryked and prouoked hym to wrathe and to displeasur and they that hathe crudified & wou~dyd hym. All shall see & knowe hym the~ne. </narration>

(John Longland. A sermo~d [sic] spoken before the kynge his maiestie at Grenwiche, vppon good fryday: the yere of our Lord. M.CCCCCxxxvi. London: s.n. 1536. STC 16795; COERP)


(2) <exposition> Whanne we þoru3 Goddis grace þese lettyngis haue fordon and oure hertis stablisschid, þan may we hope þat vs schal come þat we in preyer biseche. In hope þus vs setteþ oure Lord whan he lerneþ vs to calle <p11> hym oure fadir þat is in heuenes, ffor in hym men owen to haue certeyn hope þat may 7 wole alle goodis 3yue þat oure soule [3erneþ], þe which is vndirstonden þoru3 þis word: Pater noster, þat is: oure fadir. And þe power þoru3 þis word: qui es in celis: þat is in heuenes. In as myche as God techiþ vs to calle hym oure fadir, in þat he makiþ vs to vndirstonde þat he loueþ vs as his dere childre and þat he wole 3yue vs of his goodis aftir we haue need. </exposition> <exhortation> And wite þou wel forsoþe þat, þou3 alle þe loues þat euere were, or þat euere hadde fadir or modir to here childer, were festened in oo loue, 3it ne my3t it ri3tly by a þousande parte reche to þe loue þat God haþ schewyd to vs. </exhortation> <exposition> And þat we may vndirstonde þoru3 þe grace of God, if we wil see on what maner he is oure fadir 7 what he haþ don for vs.

At þe first bigynnynge, whan God made alle creatures of nou3t, we ne may fynde where he made any creature to his liknes but man oone. Fforþi is he God 7 maker to alle creatures 7 alle þingis þat aren in þis world 7 no3t here fadir called but here maker. But to vs for his mykil mercy he is oure God, oure maker 7 oure fadir, ffor oure soules he made to his owne liknes, þat is to þe liknes of þe fadir 7 of þe sone 7 of þe Holy Goost, þat is oo soiþfast God 7 þre persones. And alle þingis of þis world he haþ maad vs to serue, why þat we serue hym 7 loue hym as kynde childre owiþ to do. Ffor as sone as we leue þe loue of hym for þe loue of þe fleisch or of any oþer erþely þing, we leese þe lordisdome of þis worlde 7 bicomen þrallis to so vyle þingis, þere we were so fre as þe kyngis sones of heuene 7 lordis of alle þe world. Allas, wickid chaffare is þis; who so vndirstondiþ þe baleful lere þat þerof ariseþ [he suld ful sore hym dred].

Dere sistir, ne was it a token of greet loue whan God, þat [is] wiþoute bi_gynnyng 7 is in oon wiþouten chaungynge 7 schal be wiþouten endynge, þat is so my3tful 7 so wyse 7 so good þat no tunge may telle ne herte þenke 7 in whom is lijf 7 ioye endeles, deyned hym to make vs to his owne liknes, whanne he my3te haue leten vs ben a litil erþe, of þe which he made vs? Or my3t haue made vs haue ben, hadde it ben his wille, a toode or a neddir or sum oþer forschapen beeste, so þat we schuldyn haue dy3ed togydere body 7 soule. And dyd ouer oure desert 7 made vs men 7 3af vs soule to his holy liknes for to be as corowned kyngis in his endeles blys. It is noon so harde herte þat it ne au3t to melten altogedir in loue to God, if it wolde þenke hertely of his grete grace 7 loue þat oure Lord haþ schewyd to hym bifore alle oþer creatures. And 3it he dide wel more þoru3 his mykil mercy, whan we þoru3 oure waried synnes departiden oureself fro hym 7 bicome þrallis to þe loþely feendis of helle. þanne he, for þe mykil reuþe þat he hadde of vs, sent his <p12> derworþe sone, þat is oo God wiþ hym, to take fleisch 7 blood in þat blessid mayden Mary wiþouten tak of synne. Of hire he took liknes of þralle to suffre in þat liknes pouert, mysese 7 pyne, as he synful were, he þat neuere synne wrou3t. And þoled at þe ende so schameful 7 peyneful deeþ þat no tunge may telle ne herte þenke. And why? Certis oonly vs synful 7 gilti of his deeþ for to reuen out of þe deuelys prisoun 7 brynge vs a3eyn to his blis. þere he wole vs corowne wiþ þe coroun of eendeles ioye, if we vs kepe wel fro synne 7 do his wille, þat for vs dyed on þe rode tree.

Now hauest þou herde two þingis in þe whiche God haþ schewid þat he is oure fadir 7 so tendirly loueþ vs as his dere childre. þe first is þat he made vs to his liknes. Þat oþer is þat he bou3t vs wiþ his deeþ. Ffor þe first is man holden to serue hym [7] to loue [hym] wiþ al his my3t. What schal we þanne do for þat oþer? Ffor if I be bounden wiþ dette for to loue God 7 serue hym wiþ al my soule, wiþ al myn herte, euere 7 ay wiþouten ende, for þat he made me 7 3aue me soule to his owne liknes, as it was comaundide in þe olde lawe, ere God took oure kynde 7 bicome man, what may I now do to hym, þat for me, synful 7 his enemy, lowed hym so mykel þat he bicome man 7 3af hymself al to me, whan he wolde for me, vnworþi wrecche, for his mykil mercy þole so woful pyne 7 so schameful deeþ? I ne wot what I may here say, ffor, þou3 I my3te lyue a þousande 3eeris 7 my3te eche day suffre as bittir peyne as he suffrid for me, it were not to þat loue þat he haþ schewide to me. Whan he þat is soþfast God 7 Goddis sone 3af hymself for me, how may we þanne 7 on what wyse quyte hym þis riche 3ifte, þat he to vs, vnworþi wrecchis, so frely 7 so kyndely, to vs vnkynde so largely, to vs vnserued so riche tresour wolde 3yue? So, weileaway, bi manye may he þinke his traueil lost 7 birewen þe while. Lo what is to don in aquitaunce of þis dette þoru3 Goddis grace to oure derworþe Lord. Ne askeþ he not ellis  of vs but þat we lowen vs to hym 7 mekely knowe oure feblenes 7 oure wrecchidnes 7 þat we vndirstonden þat we noþing haue of oureself but oonly synne, but good, if þere any be, it is of God 7 not of vs. </exposition> <exhortation> Defoule we oure fleisch 7 pyne [we it] wiþ pe_naunce, aftir þat it may þolen, as it is wel worþi 7 wreke [we] oure Lordes good dede. And wiþ tendre teeris crye we mercy to hym, þat he saue vs þoru3 his hooly name and gyue vs wherof we may hym paye, ffor of oure_self ne haue we wherof ne wherwiþ.
</exhortation>

(Þe Pater Noster of Richard Ermyte from Westminster School Library MS, ed. F.G.A.M. Aarts. The Hague: Martinus Nijhoff, 1967, pp. 3–56.)


(3) <1> A Roy nostre souerain seigneur

Besechen humbly youre Comunes of this present <2> parliament. that

<narration> where one Iohn Carpenter of Brydham in the Shire of Sussex husbund-man the vii daye of <3> Fevever the yere of youre noble reigne the viiite saying to Isabell his wijff that was of the Age. of xvje. <4> yere and had be maried to him but xv dayes. that they wolde go to gedre on Pilgremage and made to arraye hir in <5> hir best arraie and toke hir with hym fro the said Toun of Brydham to the Toun of Stoghton in the <6> said Shire. And there in a woode he smote the said Isabell his wijff on the hede that the brayne wende oute <7> and with his knyff yaf hire many other dedly woundes. And streped hir naked out of hir clothes <8> and toke his knyff and slitte hir bely fro the breste doun & toke hir bowels oute of hir body and <9> loked if she were with Childe And thus the said Iohn mourdered horrebely his wijff. of the which horryble <10> mourdure the thoursday next after the Fest of Seint Ambrose the Bishop the yere of your Reigne bi for said the <11> said Iohn was endyteth bi for Sire Iohn Bohun. knyght henri husee knyght and william <12> Sydney youre Commissioners of youre pees withinne the Shire forsaid and proces made oute <13> vpon the same endytement according to youre lawe. til the same Iohn Carpenter was outelawed of the said <14> mourdure and nowe graciously for the same cause Arest. and in youre Prisone called the kynges benche. <15> </narration> Please hit to your hie Rightwysnesse to considre …

(Petition concerning the murder of Isabell by her husband John Carpenter of Sussex (right side torn), Chancery hand, date 1433; John H Fisher et al. An Anthology of Chancery English, Knoxville 1984, 235–236)
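
The text-function markup used in extracts (1) to (3) lends itself to simple counting, for instance to estimate how much of a sample is taken up by each text function. The following is a minimal sketch in Python, assuming that segments are marked with paired tags such as <narration> ... </narration> as in the extracts above; the function name `words_per_function` and the whitespace-based word count are illustrative assumptions, and segments without a closing tag are simply ignored.

```python
import re
from collections import Counter

# One tagged segment, e.g. <narration> ... </narration>.
SEGMENT = re.compile(r"<(exhortation|exposition|narration)>(.*?)</\1>", re.DOTALL)


def words_per_function(text):
    """Count whitespace-separated tokens inside each tagged text function."""
    counts = Counter()
    for function, body in SEGMENT.findall(text):
        # Remove other angle-bracket markers (e.g. line numbers such as <3>).
        body = re.sub(r"<[^>]*>", " ", body)
        counts[function] += len(body.split())
    return counts


# Two short snippets taken from extracts (1) and (3).
sample = (
    "<exposition> Chrysten ma~ thou haste harde nowe </exposition> "
    "<narration> the vii daye of <3> Fevever the yere </narration>"
)

counts = words_per_function(sample)
total = sum(counts.values())
for function, n in sorted(counts.items()):
    print(function, n, round(n / total, 2))
```

Tokens are counted as they appear in the transcription, so abbreviation marks such as ~ or y=t= remain part of the word they belong to.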