A toolkit for constructing corpus networks [1]

Thomas Kohnen
University of Cologne


The aim of this paper is to look at the advantages and challenges of linking (diachronic) corpora and, thus, of constructing corpus networks. It indicates parameters which may serve as tools for determining whether corpora or sections of corpora can be compared and fruitfully linked. These parameters will be presented in the form of a toolkit for annotating (sections of) existing corpora. The parameters relate mostly to functional text structure (text functions, interactive format and publication format) and the position of texts within a genre or domain structure (hierarchies, sets and chains of genres). The article also looks at the possible interdependence of the different parameters and the various kinds of similarity they ensure.

1. Introduction

After the publication of the Helsinki Corpus several other diachronic corpora have been compiled (or are still in the process of compilation), many of them genre- or domain-based (for example, the Corpus of Early English Correspondence, Middle English Medical Texts, the Zurich English Newspaper Corpus, A Corpus of English Dialogues). Whereas some of these corpus projects aim to provide a larger and more comprehensive documentation for a specific genre or domain where the Helsinki Corpus contains little or no data (for example, Middle English Medical Texts or the Zurich English Newspaper Corpus), others are designed as a basis for the development of a new research agenda (the Corpus of Early English Correspondence for the new field of historical sociolinguistics; see Nevalainen and Raumolin-Brunberg 2003).

The availability of many individual diachronic corpora, stemming from quite distinct, often unrelated domains of language use, raises the question of how these could be combined in order to build up a more powerful data base and facilitate more reliable results. However, simply (and indiscriminately) "adding up" diachronic corpora will not necessarily help many researchers. Those researchers, for example, who are interested in the interactive format of texts and would like to find second-person pronouns directly appealing to the addressee of the text would have to browse millions of words and manually check thousands of items in order to get two or three hundred relevant tokens. Instead, they would probably prefer a smaller but more relevant data set to start with. This could be done by selecting and linking suitable parts of existing corpora which may meet more exactly the requirements of the study (in this case, the specific interactive format of a text). But how do we know that certain corpora or sections of them are actually comparable and can thus be combined to form a coherent data base for a given research task? What are the parameters that might provide the necessary metadata for corpus users to decide whether sections of corpora may be linked or not?

The aim of this paper is to look at the advantages and challenges of comparing and linking corpora. In particular, it intends to specify parameters which may serve as tools for determining whether corpora or sections of corpora can be compared and which parts may be fruitfully linked. I will present these parameters in the form of a toolkit for annotating (sections of) existing corpora and thereby deriving relevant metadata for constructing corpus networks of different size and dimensions. The parameters included in this toolkit relate mostly to text functions and the position of texts within a genre or domain structure. Thus, they are quite different from existing general coding guidelines that primarily aim at formal properties of text structure. [2]

My paper falls into four parts. After a short survey of previous research I will focus on two kinds of parameters: those relating to functional text structure (with text functions, interactive format and publication format as major categories) and those relating to the domain structure or network structure (with hierarchies, sets and chains of genres as major categories). In the conclusion section I will briefly address the possible interdependence of the different parameters and the various kinds of similarity they ensure.

2. Previous research

As far as I know, not much research has been devoted to the question of how far different corpora, in particular diachronic corpora or parts of them, can be combined to form larger data bases or corpus networks. However, the desire to make corpora transparent and to make explicit many features that text samples do or do not have in common can already be found in early historical corpora, for example, the Helsinki Corpus. Here, the compilers included basic relevant information for comparing or linking text samples in the text headers of all files (the so-called textual parameters; see Kytö 1996, Chapter 3.3.4). These parameters give basic information about the text sample (for example, about date, dialect, genre, prototypical text category, formality and interactive quality) as well as essential facts about the author. However, the information given here is in most cases fairly rudimentary and some broad descriptions (for example, the prototypical text category) may sometimes even be misleading (on this see below).

In a recent project, the European Database of Descriptors of English Electronic Texts (EUDDEET; Diller, De Smet and Tyrkkö 2010), these text parameters have been used as a basis for a stable set of categories for describing historical texts that are freely available on the web. In this approach the primary aim is not linking existing corpora but helping researchers to create their own corpora, once the texts are available with the relevant descriptions / descriptors. The set of descriptors listed in the 2010 publication relies heavily on the textual parameters used in the Helsinki Corpus. It comprises: author’s name, prototypical text category, genre and subgenre (here the Helsinki Corpus had 'text type'), verse/prose, written/spoken, author’s sex, author’s birth, date first published, author’s death, with the addition of the website where the text is found and some individual notes.

Without doubt, these descriptors can reveal important similarities and differences between texts or text samples, but at the same time some distinctions and links between texts in corpora will go unnoticed, especially those which are based on the text-functional and domain-based parameters. These parameters seem to influence the linguistic and functional profile of texts quite considerably and thus determine the extent to which they are similar and can be fruitfully linked.

3. Parameters of (functional) text structure

3.1 General text functions

What I call general text functions can be defined as functions of texts which recur inside texts and which may be typically associated with sections of texts (the basic notion goes back to Werlich 1983; for a more detailed account see Kohnen 2007, 2010 and Rütten 2011). For example, in texts which belong to the sphere of religious instruction, we typically find text sections devoted to exhortation, exposition, exegesis, narration, and sometimes argumentation. Thus, a sermon may comprise portions of text devoted to telling people what to do or not to do (exhortation), to telling stories from the Bible and interpreting them (narration and exegesis) and to giving more systematic accounts of the Christian doctrine (exposition).

These recurring text functions may be domain-specific or not. For example, the particular function of exegesis and the special combination of narration, exegesis and exposition may be typical of the religious domain. On the other hand, most of these functions recur in all texts in all domains and thus have a nearly universal nature.

The important point about general text functions in the context of linking text files of different corpora is that the predominant text function of a text or text section often determines the frequency and distribution of linguistic features in this text. Thus, a text with predominantly narrative sections will have different features than a purely expository text. This correlation between textual functions and the distribution of linguistic features has been shown on a broad statistical basis in Biber’s seminal work (Biber 1988), in particular with regard to his notion of textual dimensions. [3]

While there is a clear correlation between general text function and linguistic form (or, in Biber's terms, between textual dimension and the distribution of linguistic features), there is hardly any clear association between text function and genre. Texts belonging to the same genre (or the same text type in the terms of the Helsinki Corpus) may be quite idiosyncratic with regard to the distribution of general text functions, and text functions may co-occur in basically unforeseeable ways, with changing combinations and proportions. Thus, in the context of the Helsinki Corpus, texts that belong to the same text type or prototypical text type (for example, religious instruction or non-imaginative narration) may show major differences in terms of the occurrence and proportions of general text functions. For example, texts which belong to the sphere of religious instruction may be predominantly exhortative and narrative (for example, sermons) or predominantly expository and (in addition) exhortative (for example, treatises; on this see also Rütten 2011). On the other hand, texts which are associated with different prototypical text types may be quite similar in terms of general text functions. For example, early petitions as well as sermons and fiction, that is, texts stemming from completely different prototypical text types, may all contain long narrative sections, that is, they share one principal text function.

Example (1) below illustrates the predominant exhortative and narrative functions (with a short section of exposition) in an excerpt from an Early Modern sermon; example (2) contains prevalent expository sections, interrupted by small exhortative parts, in a Middle English religious treatise; example (3) shows a long narrative section in a late Middle English petition. [4]

(1) <exhortation> Lett vs therefore Christen people for whome Chryste hathe thus shedde his bloode, gyue tha~kes for his passion, lett vs ioye in his resurreccio~, lett vs laude hym for his ascencio~, for all y=t=  hoole catholique chirche thrughoute the worlde dothe reioyce therein: besechinge him that dydde this moche for vs, that he wyll of his mercyfull godenes brynge and make vs tascend in to thys hys glorye. Lett vs Chrysten people reioyce y=t= cryste was borne, to teache vs, that he died, to heale vs. For his crosse was deathe to hym, & to vs lyffe. [Continues in Appendix]
(2) <exposition> Whanne we þoru3 Goddis grace þese lettyngis haue fordon and oure hertis stablisschid, þan may we hope þat vs schal come þat we in preyer biseche. In hope þus vs setteþ oure Lord whan he lerneþ vs to calle <p11> hym oure fadir þat is in heuenes, ffor in hym men owen to haue certeyn hope þat may 7 wole alle goodis 3yue þat oure soule [3erneþ], þe which is vndirstonden þoru3 þis word: Pater noster, þat is: oure fadir. And þe power þoru3 þis word: qui es in celis: þat is in heuenes. In as myche as God techiþ vs to calle hym oure fadir, in þat he makiþ vs to vndirstonde þat he loueþ vs as his dere childre and þat he wole 3yue vs of his goodis aftir we haue need. </exposition> [Continues in Appendix]

(3) <1> A Roy nostre souerain seigneur
Besechen humbly youre Comunes of this present <2> parliament. that
<narration> where one Iohn Carpenter of Brydham in the Shire of Sussex husbund-man the vii daye of <3> Fevever the yere of youre noble reigne the viiite saying to Isabell his wijff that was of the Age. of xvje. <4> yere and had be maried to him but xv dayes. that they wolde go to gedre on Pilgremage and made to arraye hir in <5> hir best arraie and toke hir with hym fro the said Toun of Brydham to the Toun of Stoghton in the <6> said Shire. And there in a woode he smote the said Isabell his wijff on the hede that the brayne wende oute <7> and with his knyff yaf hire many other dedly woundes. [Continues in Appendix]

So it is likely that the indication of the basic genre or prototypical text category does not give reliable information about the prevailing general text functions and, consequently, about the linguistic forms that are to be expected in the texts (for example, one would not expect so many past tense forms typical of narrative sections in a statutory text).

Thus, the specification of the predominant general text functions of the text samples contained in a corpus is an important prerequisite for establishing reliable links between corpora and combining sections which should be comparable. However, it seems that in most diachronic corpora these specifications are lacking and researchers will have to approach the text samples by reading them and testing the principal general text functions.
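One way of recording the predominant general text functions of a sample is to annotate its sections and to compute the share of words devoted to each function. The following is only a minimal sketch under assumptions of my own: the names `Section`, `function_profile` and `predominant_functions`, the threshold of 30 per cent and the word counts are hypothetical and are not taken from any existing corpus.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Section:
    function: str  # general text function, e.g. "exhortation", "narration"
    words: int     # length of the annotated section in words


def function_profile(sections):
    """Return the share of words devoted to each general text function."""
    totals = Counter()
    for section in sections:
        totals[section.function] += section.words
    total_words = sum(totals.values())
    if total_words == 0:
        return {}
    return {function: n / total_words for function, n in totals.items()}


def predominant_functions(sections, threshold=0.3):
    """List the functions whose share reaches the (assumed) threshold."""
    profile = function_profile(sections)
    return sorted(f for f, share in profile.items() if share >= threshold)


# Invented word counts, for illustration only.
sermon = [
    Section("exhortation", 400),
    Section("narration", 350),
    Section("exposition", 100),
]
treatise = [
    Section("exposition", 700),
    Section("exhortation", 150),
]

print(predominant_functions(sermon))    # ['exhortation', 'narration']
print(predominant_functions(treatise))  # ['exposition']
```

A profile of this kind would allow two samples of the same prototypical text category to be kept apart, and two samples of different categories to be linked, on the basis of their text functions alone.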

3.2 Interactive format

Another important functional parameter relevant for linking corpora is the interactive format of texts or text samples. Central dimensions of the interactive format of texts are whether they are presented in the form of a dialogue or a monologue, and – if it is a monologue – whether they have a strong orientation towards the reader or not. Thus, a handbook on witches may take the form of a fictional dialogue or simply expound relevant facts. A religious treatise may be written in the form of a letter and directly appeal to the recipient or the author may only state and explain facts without any explicit address to the reader.

It goes without saying that the interactive format of a text has important consequences for the frequency and distribution of several linguistic features (for example, personal pronouns, address terms, deictic elements, first- and second-person verb forms, interrogative sentences, elliptical clauses etc.) and whether they refer to a fictional context or a context shared by the addressor and the reader of the text.

The excerpt from a religious treatise included in example (4) below is cast in the form of a fictional dialogue, with many second-person pronouns that aim at the interactional partner in the fictional dialogue.


(4) Of the Conversion of a Sinner; What it is?
Speakers. Paul, A Teacher
                   Saul, A Learner,
Paul. Well Neighbour; Have you examined your self by the word of God, since I saw you, as I directed you?
Saul. I have done what I can in it.
P. And what do you think now of your case upon tryal?
S. I think it is much worse than I had hoped it was; and as bad as you feared: …

(Richard Baxter, The poor man's family book, 1674; COERP)

Thus, the interactive format is an important parameter to be included in the metadata for linking corpora, in particular, since fictional dialogues or letter formats appear in historical texts quite unexpectedly. Apart from the above example, fictional dialogue is found in Middle English and Early Modern English handbooks and secular treatises. The letter format occurs in many Middle English treatises as well, with a strong reader orientation. The same applies to prefatory material attached to handbooks, catechisms and collections of treatises. Here, the relevance of the interactive format can be seen not only in the frequency of address terms and deictic elements in the texts but also in their differing reference. Whereas in letters and prefaces they will in most cases refer to the recipient and his / her world, they will remain "within the text" in most fictional dialogues.
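The distinctions drawn in this section could be recorded in the metadata roughly as follows. This is a sketch under my own assumptions: the class `InteractiveFormat`, its three fields and the sample labels are hypothetical, and the values assigned to the samples are illustrative rather than the result of an actual annotation.

```python
from dataclasses import dataclass
from enum import Enum


class Form(Enum):
    DIALOGUE = "dialogue"
    MONOLOGUE = "monologue"


@dataclass(frozen=True)
class InteractiveFormat:
    form: Form
    reader_oriented: bool  # direct appeal to the recipient of the text
    fictional: bool        # address terms remain "within the text"


def addresses_actual_reader(fmt):
    """True if address terms are likely to refer to the recipient and his / her world."""
    return fmt.reader_oriented and not fmt.fictional


# Illustrative annotations (hypothetical labels and values).
samples = {
    "treatise as fictional dialogue": InteractiveFormat(Form.DIALOGUE, False, True),
    "treatise in letter form": InteractiveFormat(Form.MONOLOGUE, True, False),
    "expository treatise": InteractiveFormat(Form.MONOLOGUE, False, False),
}

selected = [label for label, fmt in samples.items() if addresses_actual_reader(fmt)]
print(selected)  # ['treatise in letter form']
```

Separating the form (dialogue or monologue) from the reference of the address terms reflects the point made above: a fictional dialogue and a letter may both be rich in second-person pronouns, but only the latter is relevant for a study of direct appeals to the addressee.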

3.3 Compilation / publication format

Another relevant parameter which belongs to the functional text structure involves the compilation or publication format of the text or text sample. This parameter determines the "neighbourhood" or "background" of a text, whether it is part of a larger manuscript containing other texts, whether it is part of a so-called commonplace book, whether it was published as a pamphlet, whether it can be found in a newspaper or journal, whether it is part of a published collection of texts or whether it is simply part of one book.

The compilation or publication format of a text is quite relevant when text samples of different corpora are linked, since it may affect the structure and function of texts. For example, commonplace books (that is, collections of texts which were compiled for future reference or further use, containing a wide range of different texts and genres) were designed as multifunctional reservoirs which could be reused in multi-layered communication practices involving role shifts in text communication and changes of text function (see Kohnen 2011). Thus, a law text or a piece of religious instruction contained in a commonplace book could have differing addressors and addressees or serve different text functions, depending on how they were re-used (for example, the compiler could be the recipient or the addressor of a text of religious instruction). The publication type "pamphlet", serving as a platform for publishing a wide variety of different genres, had an impact on the form and function of the respective genre, for instance, by making petitions or letters more interactive and less formulaic (see Claridge 2000 and Groeger 2010). In a similar way it is important to know whether a text sample stems from a newspaper and to which of the several sub-genres it belongs (for example, news section, obituary, advertisements etc.).

4. Parameters of domain structure

Now I turn to parameters that relate to the domain structure underlying texts and genres. The concept of a domain structure or network structure of genres goes back to the investigation of research genres by John Swales (2004) and the work on genre dynamics in historical medical texts by Irma Taavitsainen (2009, 2010), but also to research on religious genres in the context of the compilation of the Corpus of English Religious Prose. It includes the idea that the genres of a particular domain (for example, religious discourse, discourse of science, discourse of mass media) form a structured network that can be captured in a systematic fashion. [5] Main elements of the domain structure are hierarchies (different basic constellations of discourse participants involved in the communication; Kohnen 2010), sets (groupings of genres according to their relative position in the discourse community) and chains (chronological or logical sequences of genres within a set; Swales 2004). Several large corpus projects are in fact domain-based (for example, the Corpus of Early English Medical Writing and the Corpus of English Religious Prose).

4.1 Hierarchies

In the present context, hierarchies are groups of genres ordered according to different basic constellations in which the discourse participants communicate with each other (for example, a regulation issued by a king for his subjects involves a different constellation or hierarchy than personal letters exchanged by fellow-students).

Among the hierarchies, I would like to distinguish a first-order, a second-order and a third-order sphere. The first-order sphere contains texts that are issued by a superior, binding body or authority and are directed at all the members of the discourse community. Typical examples are laws in the administrative domain or God’s word as seen in the canon of the Biblical writings in the religious domain. The second-order sphere contains texts which work the other way round. Here the members of a discourse community address a superior authority or institution. Typical examples are petitions to higher institutions in the administrative domain and different forms of prayer in the religious domain. The third-order sphere contains texts with which members of a discourse community, who do not form a superior body or institution, communicate with each other. Typical examples are private letters, sermons, handbooks, etc.
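The three spheres can be read as a classification by the constellation of addressor and addressee. The sketch below is a deliberate simplification under my own assumptions: the function `assign_sphere` and its two Boolean arguments are hypothetical, and the requirement that first-order texts are directed at all members of the discourse community is not modelled.

```python
from enum import Enum
from typing import Optional


class Sphere(Enum):
    FIRST_ORDER = 1   # superior authority addresses the discourse community
    SECOND_ORDER = 2  # members address a superior authority
    THIRD_ORDER = 3   # members without superior status address each other


def assign_sphere(addressor_is_authority: bool,
                  addressee_is_authority: bool) -> Optional[Sphere]:
    """Map a constellation of discourse participants to a sphere."""
    if addressor_is_authority and not addressee_is_authority:
        return Sphere.FIRST_ORDER
    if addressee_is_authority and not addressor_is_authority:
        return Sphere.SECOND_ORDER
    if not addressor_is_authority and not addressee_is_authority:
        return Sphere.THIRD_ORDER
    # Authority addressing authority is not covered by the three spheres
    # as defined here, so no sphere is assigned.
    return None


examples = {
    "statute": assign_sphere(True, False),
    "petition": assign_sphere(False, True),
    "private letter": assign_sphere(False, False),
}
for genre, sphere in examples.items():
    print(genre, sphere.name)
```

Whether a given addressor counts as a superior authority is a matter of philological judgement; the sketch only shows how such a decision, once made, translates into a sphere.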

Let us take a closer look at the first-order sphere, with a superior, binding body or authority addressing all the members of the discourse community. In the religious domain we find here the Bible (conceived as God’s word); in the administrative domain there are statutes and laws (issued by the legislative, that is law-making, authority directed at the community of subjects or citizens); in the domain of science and education there are, for example, fundamental treatises of established medical authors or essential text books which in their respective periods are held to contain incontestable truths or – later in the history of science – common views for the time being established in the discourse community. All these texts and genres have in common that they apply to the whole of the discourse community.

Typical genres which belong to a first-order hierarchy share several features which are relevant when linking or comparing text samples from different corpora. Such texts usually enjoy a high prestige and serve a central function in their respective discourse domains. They are often used as a fundamental point of reference and a common basis for justification (for example, the texts of the Biblical tradition or laws). Also, these texts are often found cited in other texts (of the same and other domains). Lastly, they are not liable to change (unless, in the case of the Bible, a new translation is made, or, in the case of statutes, a new law is passed). When comparing and combining sections from different corpora, researchers need to be able to identify this central, privileged position of texts from the first-order sphere.

It is important to see that in most domains of language use membership of the first-order sphere does not seem to be fixed, but may be blurred and change. For example, in the domain of science established authors and texts may lose their central position in the discourse community and be replaced by others. Taavitsainen (2010) describes such a shift, albeit a rather slow one, in the domain of science after the foundation of the Royal Society in 1660, with new prestigious genres evolving (experimental reports, book reviews) and new stylistic guidelines being set up for other genres. In addition, the relevant texts in the first-order sphere in earlier periods of the history of English may not necessarily be in English, but mostly in Latin (sometimes in Greek or French). For example, in the religious domain the Latin Vulgate version of the Bible was predominant (and officially prescribed) until the first decades of the 16th century and in the domain of medicine most accepted texts were in Latin right until the 16th century (see Taavitsainen 2010).

Now to the second-order sphere, with members of a discourse community addressing a superior authority or institution. Typical examples in the religious domain are the various forms of prayer, both liturgical and private. Here members of the Christian community address the superior, transcendental authority of God. In the administrative domain we are dealing with genres of the second-order sphere when subjects or citizens address superior legal institutions (for example, using petitions or official letters, possibly also in wills when appealing to a higher authority). In the domain of science we may include letters sent by private researchers to a super-ordinate scientific body or authority (for example, the Royal Society before publication in the transactions).

The common feature of texts belonging to the second-order sphere is that they appeal to a higher authority, which may result in similar text structures and several common linguistic features (for example, a text section devoted to a petition, address terms, forms of directives, forms of positive and negative politeness etc.). The similarities (but also differences) may be seen when comparing a petition with a private prayer (see Kohnen 2008).

The third-order sphere, with members of the discourse community addressing other members of the discourse community, seems to comprise a large variety of genres in all domains. For example, in the religious domain there are the genres which belong to religious instruction or theological discussion; in the administrative domain we can include agreements between different parties or handbooks on legal advice; we also find handbooks in various other domains and on different subjects; different kinds of (private) letters issued either from a lower or a higher social rank also belong to this sphere. The important feature shared by all these genres is that members of a discourse community who do not form a superior institution communicate with each other.

In some cases it seems difficult to decide whether members of a discourse community act as part of an established institution or not. For example, in the religious sphere the writings of the Church Fathers and certain other official doctrinal documents (for example, in the Catholic Church issued by the Pope or a Council) would hardly belong to the third sphere. In the domains of science and administration the assignment of particular texts (or authors) may be quite difficult (see also Taavitsainen 2010). By contrast, one of the special characteristics of the private domain seems to be that it only comprises genres of the third sphere.

In addition to the large variety of texts and genres found in the third sphere, it is a special characteristic of this sphere that its texts do not seem to share particular textual or linguistic features. Thus, it may be that the category is too broad and other domain-specific features are needed in order to establish finer links. For example, in the third sphere of the religious domain there is the additional distinction between religious instruction and theological discussion. It remains to be seen whether parallel distinctions can be drawn in other domains and what further divisions can be introduced.

4.2 Sets

Genre sets are groupings of genres which reflect their relative (that is, more central or more peripheral) position in the discourse community. The basic distinction is between core, minor and associated genres. Core genres are genuinely rooted in the domain (for example, prayers and catechism in the religious domain, laws in the administrative domain). Minor genres are similarly rooted in their respective domain, but they do not apply to all members of the discourse community since they are used only in special institutions or by specially qualified people (e.g. monastic rules and liturgical prayer in the religious domain; law commentaries or summons in the administrative domain). Associated genres originally exist outside their respective domain and are not genuinely part of it but become associated with it at some point in time due to cultural or technological developments (for example, pamphlets and prefaces as a result of the printing press). Set membership is a highly relevant parameter when specifying the metadata for linking sections of different corpora. It is, of course, important to know whether a given text stems from a core, a minor or an associated genre in its respective domain since this has major consequences for its spread and its audience, but also for the linguistic features found in it.

A particularly interesting set, which may show comparable properties across different domains, is the group of the associated genres. Prototypical associated genres are prefaces and pamphlets. Both may be found in various domains. For example, the religious domain includes prefaces to collections of treatises or sermons, to catechisms and religious biographies; similarly, the domain of science offers prefaces to treatises and handbooks; in the administrative domain we find prefaces to collections of statutes and commentaries. On the other hand, pamphlets form a platform for publishing letters, sermons, petitions, treatises and many other genres from various domains (see also the different subdivisions of the Lampeter Corpus in terms of the domains religion, politics, economics / trade, science and law; see Schmied and Claridge 1997).

Preliminary research with the data of the Corpus of English Religious Prose has shown that associated religious genres differ from core genres in that the distribution and frequency of typical religious features is closer to that of secular genres (Kohnen forthcoming). Against this background it seems likely that associated genres of different domains show certain common features and it may be rewarding to compare them across different areas of language use. At any rate, it should not go unnoticed that a particular associated genre may have equivalents in completely unrelated fields.

4.3 Chains

Another relevant parameter which should be considered when linking corpora is the genre chain. The term 'genre chain' was introduced by John Swales (2004) in order to capture the chronological and logical sequences of genres in scientific writing. For example, a review both chronologically and logically follows the work which is reviewed. In the present context, one significant feature seems to be whether a genre starts a chain or constitutes the endpoint of a chain. Generally, the chain-initiating status of a genre is often associated with a specific interactive format, namely the dialogue and letter format, especially in Middle English and Early Modern English. For example, in the religious domain, catechisms, which have to be seen within an initial chain relation to the other (more "advanced") core genres of religious instruction, typically show a dialogue format. In a similar way, we find dialogues in chain-initial genres in several other domains, for example, in accessible medical treatises (see Taavitsainen 1999) and in straightforward handbooks on a variety of topics (see, for example, several handbook excerpts in the Early Modern section of the Helsinki Corpus). Similarly, letter formats in non-correspondence texts often form starting points of genre chains in that they give first and straightforward instruction (a good example is William Cobbett's English Grammar in the form of letters addressed to his son; Cobbett 1983).

Thus, it seems likely that the chain position of a genre has important consequences for the textual and linguistic structure of its texts. It should be included in the parameters which are relevant for linking and comparing corpora.
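A genre chain can be represented as an ordered sequence from which the position of a genre is read off. The sketch below uses the two chains listed in Table 1; the function `chain_position` and the labels "initial", "medial" and "final" are my own hypothetical choices.

```python
# The two chains given as examples in Table 1.
CHAINS = [
    ["catechism", "sermon", "religious treatise"],
    ["medical dialogue", "medical treatise"],
]


def chain_position(genre, chains=CHAINS):
    """Return 'initial', 'medial' or 'final', or None if the genre is in no chain."""
    for chain in chains:
        if genre in chain:
            index = chain.index(genre)
            if index == 0:
                return "initial"
            if index == len(chain) - 1:
                return "final"
            return "medial"
    return None


print(chain_position("catechism"))         # initial
print(chain_position("sermon"))            # medial
print(chain_position("medical treatise"))  # final
print(chain_position("petition"))          # None
```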

5. Conclusions

The aim of this paper was to explore important elements of a toolkit which may provide faithful standards of comparison when linking domain-based corpora (or parts of them) and thus form a basis for constructing corpus networks. My emphasis was on two kinds of parameters, text-functional and domain-based parameters (see Table 1 below). Text-functional parameters relate primarily to the structure of the text or text sample in a corpus, whereas domain-based parameters determine the relative position of the respective genre / text within the discourse community.

text-functional parameters                  domain-based parameters

general text functions
  (e.g. narration, exposition, argumentation etc.)

interactive format
  reader orientation                        core genres
  no reader orientation                     minor genres
                                            associated genres

compilation / publication format
  manuscript collection                     catechism -> sermon -> religious treatise
  commonplace book                          medical dialogue -> medical treatise

Table 1. Text-functional and domain-based parameters
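To indicate how the parameters might be combined in practice, the following sketch stores them in one metadata record per corpus section and links sections from different corpora that agree on a chosen subset of parameters. All names (`SectionMeta`, `links`, the corpus labels "Corpus A" and "Corpus B", the text identifiers) and all parameter values are hypothetical and invented for illustration; they do not describe any existing corpus.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class SectionMeta:
    corpus: str
    text_id: str
    text_function: str       # predominant general text function
    interactive_format: str  # e.g. "dialogue", "monologue, reader-oriented"
    publication_format: str  # e.g. "commonplace book", "pamphlet"
    sphere: int              # 1, 2 or 3 (first-, second-, third-order sphere)
    genre_set: str           # "core", "minor" or "associated"
    chain_initial: bool


def links(sections, on):
    """Pairs of sections from different corpora that agree on all parameters in `on`."""
    result = []
    for a, b in combinations(sections, 2):
        if a.corpus != b.corpus and all(getattr(a, p) == getattr(b, p) for p in on):
            result.append((a.text_id, b.text_id))
    return result


# Invented records, for illustration only.
sections = [
    SectionMeta("Corpus A", "sermon-01", "narration",
                "monologue, reader-oriented", "book", 3, "core", False),
    SectionMeta("Corpus B", "petition-07", "narration",
                "monologue, reader-oriented", "manuscript collection", 2, "core", False),
    SectionMeta("Corpus B", "treatise-03", "exposition",
                "monologue, no reader orientation", "book", 3, "core", False),
]

by_function = links(sections, on=["text_function"])
by_function_and_sphere = links(sections, on=["text_function", "sphere"])
print(by_function)             # [('sermon-01', 'petition-07')]
print(by_function_and_sphere)  # []
```

The choice of parameters passed to `links` corresponds to the research question: a restrictive selection yields a small but homogeneous compilation, whereas a single parameter yields broader links across domains.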

It is important to see that the two kinds of parameters open up quite diverse perspectives on texts and text samples in a corpus and thus may imply quite different kinds of similarity. Text-functional parameters ensure above all similarity of linguistic structures, whereas network parameters mostly ensure similarly situated genres (which then, of course, may share certain linguistic features). For example, a similar interactive format suggests a comparable frequency and distribution of interactional linguistic features (personal pronouns, deictic elements, address terms etc.). Membership of a first-order sphere implies high prestige and great intertextuality of the text in comparison to the other texts / genres in the discourse domain.

Texts which belong to the first-order sphere will not necessarily be similar in terms of linguistic form. Rather, they will represent the idiosyncratic linguistic features of the respective domain in the purest form (for example, the typical administrative style in statutes or the typical religious register in Bible translations). Since the differing idiosyncratic features of a domain-based style may be quite incommensurable and also resistant to change, first-order genres may not be comparable in these linguistic respects at all but may instead form interesting contrasts. On the other hand, associated genres of the third-order sphere will not represent the typical linguistic features of their domain to such an extent and may include more "common" linguistic features that are less domain-specific.

Also, the network parameters typically interact with each other. Hierarchies may imply a specific set membership and chain position and vice versa. For example, if a text or genre belongs to a first-order sphere (laws, the Bible or fundamental academic textbooks), it will typically form a core genre in the domain and often be non-initial. By contrast, associated genres typically belong to the third-order sphere.
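
Because these interactions are tendencies rather than strict rules, an annotation tool might merely flag unusual combinations for manual review. The following Python sketch shows one way this could be done. The dictionary keys ("sphere", "set", "chain_initial"), the function name atypical_combinations and the sample annotations are assumptions made for illustration only.

```python
def atypical_combinations(genre):
    """Return notes on where a genre annotation departs from the typical
    interaction of hierarchy, set membership and chain position.

    `genre` is a dict with the hypothetical keys
      'sphere'        : 1, 2 or 3 (first-, second- or third-order sphere)
      'set'           : 'core', 'minor' or 'associated'
      'chain_initial' : True if the genre opens its genre chain
    """
    notes = []
    if genre["sphere"] == 1:
        # First-order genres typically form core genres ...
        if genre["set"] != "core":
            notes.append("first-order genre that is not a core genre")
        # ... and are often non-initial in their chain.
        if genre["chain_initial"]:
            notes.append("first-order genre in chain-initial position")
    # Associated genres typically belong to the third-order sphere.
    if genre["set"] == "associated" and genre["sphere"] != 3:
        notes.append("associated genre outside the third-order sphere")
    return notes


# Invented annotations for illustration only.
typical = {"name": "statute", "sphere": 1, "set": "core", "chain_initial": False}
unusual = {"name": "hypothetical genre", "sphere": 2, "set": "associated",
           "chain_initial": True}

for g in (typical, unusual):
    print(g["name"], atypical_combinations(g))
```

The notes are meant as prompts for closer inspection, not as errors, since atypical combinations may be perfectly legitimate in a given domain.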

These intricate relationships and contrasts nicely illustrate that the parameters included in this toolkit may in fact cover quite diverse research interests and may offer wide possibilities of contrastive and comparative analyses. On the one hand, the toolkit allows for the construction of rather specific and restrictive compilations (for example, collecting texts with address terms only referring to the addressee of the texts); on the other hand, it might serve as the basis for constructing comprehensive corpus networks that facilitate broad comparisons across domains.

One obvious requirement which the model envisaged in this article should meet in the future is its application to further domains. So far, illustrations have been given for the religious domain, the administrative domain and the domain of science. One particular problem seems to be that in some domains (for example, the private domain) there are no first- and second-order spheres. In other domains (mass media, education, economy) more detailed research is necessary to trace the hierarchies in the multi-layered relationships between institutions, text traditions and language users.

A concrete first step for testing the practicability of the suggested toolkit and its different parameters might be a systematic comparison of the Early Modern part of the Corpus of English Religious Prose (COERP) and the Early Modern English Medical Texts (EMEMT) corpus. Both COERP and EMEMT are domain-based corpora, so the feasibility of both the text-functional and the domain-based parameters in establishing systematic and fruitful links between texts from different domains can be assessed here. A complementary test could be conducted with the various pamphlets contained in the Lampeter Corpus. Here we are dealing with a single associated genre whose texts stem from different source domains. These and other practical analyses could help to create the first outlines of a diachronic corpus network and could also help to refine and complete the parameters needed for constructing it.


[1] High-resolution images of sacred texts can be seen at the British Library Online Gallery, for example the Psalter of Henry VIII.

[2] See, for example, the P5 Guidelines of the Text Encoding Initiative. http://www.tei-c.org/release/doc/tei-p5-doc/en/html/index.html

[3] That is, the functionally interpreted co-occurrence patterns of selected morpho-syntactic features. Note, however, that the text functions presented here are not defined by linguistic features but based on the account given in Werlich (1983). Also, the number and range of Biber's textual dimensions are quite different from the general text functions presented here.

[4] In the following extracts, exhortation is highlighted in red, exposition is highlighted in green, and narration is highlighted in yellow.

[5] The notion of domain is also used for classifying texts in the British National Corpus. However, there it is used as a text categorisation for written texts only, with the basic subdivisions of "imaginative" and "informative" writings (with several subfields) (see http://www.natcorp.ox.ac.uk/corpus/creating.xml). The term as it is used in this study is much broader, including also the socially defined institutions and frameworks for the formulation and dissemination of texts.


