What kind of corpus annotation is needed in sociopragmatic research?

Minna Palander-Collin, Helsinki Collegium for Advanced Studies, University of Helsinki

1. Introduction

In this paper I shall discuss problems and questions concerning the use of corpora for the purpose of sociopragmatic research. My own work draws on the fields of historical linguistics, corpus linguistics, sociolinguistics and pragmatics. In pragmatics in particular, the existing corpora and corpus tools, and the corpus-linguistic approach in general, cause some anxieties. It is not inherently impossible to combine corpora and pragmatics, of course, but until fairly recently, corpora and corpus tools were not usually designed for pragmatic research or by pragmaticians. The field of historical pragmatics in particular has attracted researchers with corpus expertise, and there are now pragmatically-oriented corpora like the Corpus of English Dialogues 1560-1760 and its annotated subsection for the years 1640-1760, the Sociopragmatic Corpus. These corpora are expressly intended for variationist studies and historical pragmatics (see Kytö & Walker 2006 for the CED, and e.g. Archer & Culpeper 2003 for the SC). Moreover, researchers seem to be increasingly interested in cross-fertilizations between different disciplines, such as quantitative corpus linguistics, sociolinguistics and qualitative discourse-pragmatics. [1]

Corpora are always collected for a particular purpose or with a particular framework in mind. Even general-purpose corpora are based on the compilers' preconceptions about language. Therefore, it is necessary for these preconceptions to be spelled out, and ideally a corpus will also be flexible enough to respond to other approaches. An understanding of the relevance of a given corpus and of the compilation principles and research goals that have influenced its structure is a prerequisite for its use and for the development of further corpus-related resources such as sociopragmatic annotation schemes.

My discussion here is based on my experience as a corpus compiler and user of two sociolinguistic corpora: the Corpus of Early English Correspondence 1410?-1681 (CEEC) and the Corpus of Early English Correspondence Extension 1681-1800 (CEECE). [2] These corpora were originally designed for sociolinguistic studies of morphological variation and change (Nevalainen & Raumolin-Brunberg 1996: 39) and have so far been published as the Corpus of Early English Correspondence Sampler (CEECS) [3] and the Parsed Corpus of Early English Correspondence (PCEEC). [4] The corpora are well suited to pragmatic studies, but the underlying compilation principles and coding conventions are sociolinguistic. This means, for example, that the type of contextualization needed for sociopragmatic research is not always easy or even possible.

The type of sociopragmatic research I am referring to aims to account for both macro-level sociological factors and micro-level situational factors, explaining why people in a given situation use language in the way they do. This is essentially an integrationist approach to social theory and sociolinguistics that recognizes both the existence of macro-social structures and the importance of individual communicative events in creating social meaning (cf. Coupland 2001: 15-17). Language and social structure are in a dialectical relationship, such that language both constructs and reflects social structure, social identities and relationships; in this sense, language is a social practice rather than a purely individual activity (Fairclough 1992: 63-64). Ideally, sociopragmatic research produces results that increase our understanding of the role of language in society and can be generalized to wider contexts. The achievement of this aim requires plentiful corpus data, as well as some form of sociopragmatic annotation that contributes to the efficient use of the corpus.

I shall discuss the need to develop a sociopragmatic annotation scheme in relation to two on-going research projects that employ early English letters from the CEEC and CEECE as research material. These projects will illustrate the fact that, although corpora are useful, they have limitations that may restrict the range of research questions linguists will be able to ask, while other questions that lend themselves less to a computer-based approach might be dismissed as marginal (Meurman-Solin 2001 is particularly concerned with this bias). The more established corpus linguistics becomes as a research paradigm, the more likely unquestioned orthodoxies are to develop.

The first project focuses on describing the types of linguistic style that constitute and express social identities, roles and hierarchies between correspondents (Palander-Collin 2002, 2006, forthcoming). It focuses on language as an act of identity — in other words, the relational and interpersonal functions of language and the role of these in language variation and change. In studying language as social interaction and a means of signalling and building identities through the medium of correspondence, it is essential to be able to establish as much background and contextual information about the letters and their writers and recipients as possible. Moreover, operationalizing the identity function as a set of computer-searchable linguistic items is a challenge, as identity positions might be expressed by widely varying and open-ended linguistic means like topic choice, humour, address formulae, or pronouns.

Focusing on reporting and its functions in eighteenth-century correspondence, the second project could be characterized as form-to-function mapping (Palander-Collin & Nevala 2006, Nevala & Palander-Collin forthcoming; for form-to-function and function-to-form mapping, see Jacobs & Jucker 1995). In this project, we have been concerned with defining reporting and identifying reporting constructions, but our ultimate aim is to understand why and when reporting is used and what kind of interpersonal functions it serves. Our results show that the relationship between letter writers and their recipients is an important factor conditioning the use of reporting. Our biggest problem with regard to the use of corpora concerns finding relevant examples of reporting, as we have not been able to rely on automatic searches, but have had to manually identify all the relevant occurrences of reporting in the material.

I have now sketched the nature of the issues I wish to discuss. More precisely, these include the following: 1) how to contextualize language for the purposes of sociopragmatic research, 2) how to operationalize the identity and interpersonal functions of language as computer-searchable items, and 3) how to search the corpus for a diverse or fuzzy linguistic category. With this discussion, I wish to explore how existing corpora match the needs of sociopragmatic research, how they can be utilized at the present time, and how they could perhaps be improved with sociopragmatic annotation. Easily retrievable contextual information is one type of information that may be defined and categorized for a certain type of material in such a way as to serve a wide variety of sociopragmatic research questions. Establishing general-purpose linguistic categories to be annotated for a broad range of sociopragmatic research questions seems more problematic. Thus, the type of annotation scheme I am inclined to promote here is essentially 'problem-oriented', in the sense that it is relevant to specific research questions (cf. McEnery et al. 2006: 43). In the long run, such annotation schemes focused on particular research questions could contribute to our understanding of what the general-purpose categories of a 'pragmatic grammar' might be.

2. Contextualizing language

The relevance and definition of context varies in different linguistic paradigms, as shown by Archer & Culpeper (2003) in their discussion of research traditions in corpus linguistics, historical linguistics, sociolinguistics and pragmatics. Context is also regarded as a difficult notion to define, as contextual information is always identified in relation to "something else", which is the main focus of research (Schiffrin 1994: 362).

Even those disciplines that regard language and context as inseparable often emphasise different aspects of context. In variationist, correlational sociolinguistics, such social categories as gender, class and ethnicity tend to be regarded as discrete, stable and primary contextual factors affecting linguistic variation, whereas some strands of conversation analysis do not allow any prior categorization, requiring interpretations to be based on insiders' understanding of what makes talk comprehensible for them at that moment (e.g. Coupland 2001). Schiffrin (1994: ch. 10) identifies 'context as knowledge' and 'context as situation' as two different understandings of what context is, although both views can be simultaneously evident to varying degrees in many approaches. When defined as knowledge, context is viewed essentially in terms of knowledge that the interactants can be assumed to have, such as awareness of social institutions and of the general wants and needs of others. Context as situation, meanwhile, signifies knowledge of the "here and now". Context as knowledge of cultural norms and conventions and context as situation can hardly be separated in sociopragmatic research, as language use is analysed and interpreted in relation to who the interactants are, what kind of an interaction they are involved in, and why they are communicating. The definition of appropriate and expected participant roles and linguistic behaviour is based on the interactants' cultural and situational understandings.

Many discourse studies have identified the following situational parameters as important components of speech situations that can be used to define registers, for example (Biber & Conrad 2001: 175):

  • the participants, their relationships and their attitudes toward the communication
  • the setting, including factors such as the extent to which time and place are shared by the participants and the level of formality
  • the channel of communication
  • the production and processing time (e.g. amount of time available)
  • the purpose of the communication
  • the topic or subject matter

Fairclough's (1992) three-dimensional framework of discourse analysis provides a tool for contextualization for sociopragmatic research, and it can also be used to view the situational factors listed above in a wider sociological framework, rather than as items that are somehow separate, discrete and unmotivated. [5] Fairclough (1992: 62-100) maintains that 'discourse' (spoken or written language use) should be analysed in context as text, discursive practice and social practice, as these facets are interconnected and affect each other in such a way that texts/language both constitute and reflect discursive and social practices.

Discursive practice concerns the production, distribution and consumption of texts, while social practice concerns the ideologies and hegemonies that have a material existence in the practices of institutions and the constitution of subjects. Ideologies are essentially significations/constructions of reality, including the physical world, social relations and social identities (Fairclough 1992: 78-96). Participants in a communicative situation produce language that is possible and appropriate for them in a particular society in the context of its discursive and social practices.

The discursive practices relevant in the study of early English letters include issues relating to both the original letter-writing situation and the circulation and preservation of the letters up to the present. CEEC and CEECE contain personal correspondence published as scholarly editions. The letters were mostly written by individual writers to individual recipients, but it was also common to circulate personal letters and read them aloud to others. In this way, there may have been 'overhearers' whose needs the writer had to keep in mind. Moreover, the modern editor is a mediating author making sense of old texts to modern readers. In the process, the editor may change the original letters in various ways, such as by modernizing the spelling, adding punctuation or cutting sensitive or uninteresting material to make the text suitable for the readers. Meurman-Solin (2001) is particularly concerned with the authenticity of the texts being digitized, and her solution has been to check the texts against the manuscripts when collecting the Corpus of Scottish Correspondence (see Meurman-Solin in this volume).

When producing letters, early modern writers also followed discursive conventions of the genre, such as recognizing the intended recipient at the beginning of the letter and reflecting the writer-addressee relationship in the address formula (see e.g. Nevala 2004: 33-53). It should be noted, however, that the authenticity of the letters may also be doubtful, as literacy was low, particularly at the beginning of the period covered in the CEEC. Therefore, personal letters might be written by scribes rather than the actual author. For the same reason, many letters were never written at all, and the voices of the lowest ranks and women are scarce. Moreover, letters were preserved in various ways, such as in letterbook copies by the original author or someone else, or were simply destroyed. The ideologies of more recent times also affect the selection of letters available as editions — it is easy to find editions of eighteenth-century letters by literary figures of the period, for example, whereas editions containing letters by ordinary people are still rare.

In terms of discursive practice, letters also function as written communication and authentic interaction between writers and recipients. As such, they can be expected to contain interactive elements that construct an appropriate authorial self and negotiate accepted participant relationships (cf. Hyland 2005a, 2005b). Therefore, they provide an important source for studies of earlier patterns of interaction, particularly as more contextual information is usually available than for many other written genres.

Social practices of earlier periods often escape direct observation, but fortunately there is ample written evidence of early modern England to be interpreted by (social) historians. Wrightson (2003: 25), for example, writes about early modern England that 'the most fundamental structural characteristic of English society was its high degree of stratification, its distinctive and all-pervasive system of social inequality'. This can be seen most prominently in the hierarchy of social ranks, described by such commentators as William Harrison in 1577 as consisting of four degrees of people: gentlemen (nobility, knights, esquires, mere gentlemen), citizens and burgesses, yeomen of the countryside, and poor people such as day labourers, poor husbandmen, artificers, and servants, who had 'neither voice nor authoritie in the common wealthe, but are to be ruled and not to rule other' (Wrightson 2003: 27).

Social inequality and hierarchy were also evident in the private family sphere, where the roles and standing of household members were defined according to their gender, age and 'place', the husband or the master being the household head and the wife his 'associate', 'yokefellow', 'subordinate' or 'deputie'; together, they would have authority over children, servants and apprentices (Wrightson 2002: 42). It is, of course, difficult to know to what extent and in which social conditions these ideas reflected reality, or what kind of individual variations were permitted. Diachronically, social as well as discursive practices change, and the relevance and form of hierarchy, for example, together with its linguistic coding, have changed over the centuries.

Focusing on the single genre of early modern English personal letters allows us to overlook some situational factors that stay constant. The channel of communication (written), the setting (time and place are not shared), and the production and processing time are more or less similar in all personal letters, although some letters were written in more urgent circumstances than others, which may even be indicated by the letter writer with the phrase 'in haste'. The participants, their relationships and attitudes, and the purpose and topic of communication are then the central situational variables in personal letters.

On the basis of sociohistorical and sociolinguistic research, important participant parameters enabling contextualisation include at least the classic sociolinguistic and synchronically stable social variables of sex, age, domicile, social rank, education and year of writing, which characterize the informants in all situations. Of course, it is impossible to know with any certainty how individuals saw their position and identity, how important they considered such factors as their sex or age in any given interaction. Theories of identity typically distinguish between personal (individual) and social (group or collective) identities, but there is no clear agreement as to what beliefs a person's self-concept comprises, and these may vary from personality traits, abilities, physical features or behavioural characteristics to ideologies, social roles, language affiliations and group memberships (Spencer-Oatey 2007: 640-641). Through the use of the contextualization I am describing, we can hopefully establish some aspects of the informants' social identities.

In addition to classic sociolinguistic variables, individuals have various roles that may change from one situation and recipient to another, or even within a situation. These may include family roles such as 'younger brother' writing to 'elder brother' or 'mother' to 'daughter', professional roles such as 'administrator' to 'administrator' or 'master' to 'servant', or more vague and informal social roles such as 'friends'. The roles that can typically be found in the letters of the CEEC and CEECE are somewhat different from those in the Sociopragmatic Corpus, which also contains dramatic roles such as 'villain', 'fool' or 'seducer' (Archer & Culpeper 2003: 50). In general, drama and trial proceedings such as those included in the Sociopragmatic Corpus are text types that perhaps assign roles to the participants more conspicuously than personal letters — in cast lists or stage directions in plays, for example, or as institutional requirements and formal conventions in court records.

In personal letters, the relationships between the correspondents have to be established on the basis of internal evidence (e.g. address forms), information provided by the editor or other secondary sources (e.g. Oxford Dictionary of National Biography) and findings from research on early modern society (e.g. social ranks, family relations, gender roles). This process does not always allow specific characterizations of the correspondents, even such vague ones as 'friends'; instead, more general and in this case also more reliable classifications like 'noblewoman' to 'noblewoman' have to be used. In particular, personal information about people of lower ranks is scarce.

It may also be somewhat difficult in many cases to define the purpose and topic of writing, particularly if entire correspondences are not available. Even if sequences of letters have been preserved (and included in the corpus), it may still be difficult to see the situation behind the letter, as discursive and social practices may have changed or the letters may be highly contextualized. Moreover, letters may contain "language games" or private jokes, and writers may lie or pretend to be someone else. Personal letters in early modern England typically serve multiple purposes and deal with several topics at the same time; for example, a family letter may contain business in addition to greetings and family news.

As sociolinguistic corpora, the CEEC and CEECE were primarily compiled to be representative of letter writers of various social ranks as well as of both men and women; the geographical and temporal balance of the writers was also important. The social ranks were operationalized as nobility, (upper and lower) gentry, clergy, professionals and "other" informants, including such categories as servants, yeomen and paupers. Moreover, careful attention was paid to the quality of editions and letter authenticity, and unreliable sources were discarded.

However, no particular attention was paid to the recipients or the continuity of correspondences, which are important in contextualising individual letters and answering pragmatic questions. For correlational sociolinguistic questions concerning the social embedding of morphological and syntactic changes, the writer-recipient relationship is not as important as the writer's domicile, gender or rank (Nevalainen & Raumolin-Brunberg 2003). Consequently, the compilation process was writer-centred. This is reflected in the letter identification section in the original versions of the corpora (CEEC, CEECE and CEECS). This section, which comes before each letter, offers information about the authenticity of the letter, the year of writing, the relationship between writer and recipient, the identity of the writer, and the page number of the letter in the edition. An example such as <Q A 1570S T NABACON I, 11> categorises the recipient as non-family recipient/acquaintance (T); other possible recipient categories are nuclear family (FN), family other (FO), family servant (FS), or close friend (TC). The letter identification section does not identify the recipient by name, nor does it provide information about the writer's social rank, age or gender.
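The letter identification line quoted above can be read mechanically. The following is a minimal sketch in Python, assuming that the fields always appear in the order given in the description (authenticity, year, relationship, writer, page) and that the leading Q is a fixed marker; the names parse_header and LetterHeader are illustrative and are not part of any CEEC tool.

```python
import re
from dataclasses import dataclass

# Recipient categories as listed in the text.
RELATIONSHIP_CODES = {
    "FN": "nuclear family",
    "FO": "family other",
    "FS": "family servant",
    "TC": "close friend",
    "T": "non-family recipient/acquaintance",
}


@dataclass
class LetterHeader:
    authenticity: str
    year: str
    relationship: str
    writer: str
    page: str


# Assumed layout: <Q authenticity year relationship writer page>
HEADER_RE = re.compile(
    r"<Q\s+(?P<authenticity>\S+)\s+(?P<year>\S+)\s+"
    r"(?P<relationship>FN|FO|FS|TC|T)\s+(?P<writer>\S+)\s+(?P<page>[^>]+)>"
)


def parse_header(line: str) -> LetterHeader:
    """Split one letter identification line into its named fields."""
    match = HEADER_RE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"not a letter identification line: {line!r}")
    return LetterHeader(**{k: v.strip() for k, v in match.groupdict().items()})


header = parse_header("<Q A 1570S T NABACON I, 11>")
print(header.writer, "->", RELATIONSHIP_CODES[header.relationship])
```

The sketch makes the limitation visible: the line yields a writer code and a broad recipient category, but nothing from which the recipient's name or the writer's rank, age or gender could be recovered.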

Since early modern England was a hierarchical society, it was important in both the studies described here to establish more specific hierarchical relationships between the correspondents. Thus, letters to family members and acquaintances can further be divided into letters to inferiors, equals and superiors. A more precise characterization of the relationship may also be important. For instance, depending on the recipient, Hester Piozzi's roles may include 'mother to daughter' (family role), 'author to publisher' (professional role), or 'female friend to female/male friend' (social role). In reality, roles may have been mixed. An annotation scheme including this type of role information would be helpful in answering research questions focusing on changes in social and discursive practices and accompanying linguistic changes. A possible role-related question could then be 'how were the roles of mother and daughter created in the language of sixteenth-century letters in comparison to eighteenth-century letters?'
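One way of recording this kind of role information is sketched below in Python. It is only an illustration: the field names, the letter identifiers, the years and the power values are all hypothetical, and the three role kinds simply follow the family, professional and social roles distinguished above. Each letter may carry several roles, since roles may have been mixed.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Role:
    kind: str            # "family", "professional" or "social"
    writer_role: str     # e.g. "mother"
    recipient_role: str  # e.g. "daughter"


@dataclass(frozen=True)
class LetterAnnotation:
    letter_id: str            # hypothetical identifier
    year: int
    power: str                # "to inferior", "to equal" or "to superior"
    roles: Tuple[Role, ...]   # more than one role allowed per letter


def select_by_role(
    letters: List[LetterAnnotation], writer_role: str, recipient_role: str
) -> List[LetterAnnotation]:
    """Return the letters in which the given role pair is annotated."""
    return [
        letter
        for letter in letters
        if any(
            r.writer_role == writer_role and r.recipient_role == recipient_role
            for r in letter.roles
        )
    ]


# Invented sample records, for illustration only.
sample = [
    LetterAnnotation("letter_a", 1565, "to inferior",
                     (Role("family", "mother", "daughter"),)),
    LetterAnnotation("letter_b", 1790, "to inferior",
                     (Role("family", "mother", "daughter"),
                      Role("social", "friend", "friend"))),
    LetterAnnotation("letter_c", 1795, "to equal",
                     (Role("professional", "author", "publisher"),)),
]

mother_daughter = select_by_role(sample, "mother", "daughter")
print([letter.letter_id for letter in mother_daughter])
```

With annotations of this kind, the mother-daughter question posed above reduces to selecting the relevant letters and grouping them by century before any linguistic analysis begins.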

As research on the CEEC and CEECE has extended outside the original scope of correlational sociolinguistics and morpho-syntactic studies of language change (e.g. Nevala 2004), new requirements for the corpora have been identified. In this situation, it is important that the original corpus can be modified using such tools as the addition of new codes, and that corpus users can effectively pick the texts they need and select new texts. In practice, this usually means more work and additional costs in the form of coding, scanning and proofreading. In addition to providing part-of-speech tagging and syntactic annotation, the latest published version of the CEEC, the Parsed Corpus of Early English Correspondence (PCEEC), provides searchable sociolinguistic information about writer and recipient for each token. This includes such parameters as the age, gender and family roles of the correspondents. Fortunately, plenty of letter- and writer-specific information was coded in a separate database while compiling the CEEC and CEECE, and this can now be used as a basis for further corpus developments. [6] In practice, such developments are not easy to carry out, as suitable ready-made computer programs for combining information from databases and corpora do not exist, but must be created for individual needs.
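The kind of purpose-built program referred to here can be quite small. The sketch below, in Python with the standard sqlite3 module, joins writer information held in a database to letters held as corpus records. The table layout, column names, writer codes and values are invented for the illustration and do not describe the actual CEEC database.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for a writer database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE writers ("
        "writer_id TEXT PRIMARY KEY, gender TEXT, rank TEXT, birth_year INTEGER)"
    )
    conn.executemany(
        "INSERT INTO writers VALUES (?, ?, ?, ?)",
        [
            ("WRITER1", "male", "gentry", 1550),      # invented values
            ("WRITER2", "female", "nobility", 1700),  # invented values
        ],
    )
    return conn


def enrich(letter: dict, conn: sqlite3.Connection) -> dict:
    """Attach the writer's gender, rank and age at the time of writing to a letter."""
    row = conn.execute(
        "SELECT gender, rank, birth_year FROM writers WHERE writer_id = ?",
        (letter["writer"],),
    ).fetchone()
    if row is None:
        # Writer not in the database: keep the letter, mark the gaps explicitly.
        return {**letter, "gender": None, "rank": None, "age": None}
    gender, rank, birth_year = row
    return {**letter, "gender": gender, "rank": rank,
            "age": letter["year"] - birth_year}


conn = build_demo_db()
letter = {"writer": "WRITER1", "year": 1580, "text": "..."}
print(enrich(letter, conn))
```

Keeping the metadata in the database and attaching it at query time, rather than copying it into every corpus file, means that corrections to a writer's record need to be made in one place only.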

In some cases, contextual information may be difficult or impossible to obtain for earlier periods, or some of the relevant discursive and social factors such as the question of overhearers may be difficult to account for systematically. Trying to take all factors into account may also lead to a situation where the material grows so sparse that it is no longer representative. For instance, in Nurmi & Palander-Collin (forthcoming), we managed to sample an eighteenth-century subcorpus of CEECE where the recipient type was balanced so that the corpus contained an almost equal amount of material addressed to family and non-family recipients within each category of writers (male and female and gentry and non-gentry). However, it was impossible to control the addressee variables any further, as there was not enough material to start with. Consequently, the variables of social equality between the correspondents and the gender of the recipient were not equally represented, although these proved to be significant, at least in the case of some linguistic forms.
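The sparsity problem is largely a matter of arithmetic: each added variable multiplies the number of cells that need to be filled with material. The Python sketch below counts the cells for the design described above and for the same design extended with social equality and recipient gender. The three-way split of the equality variable (inferior, equal, superior) and all variable names are assumptions made for the illustration.

```python
from itertools import product
from math import prod

# Variable levels; the names and the three-way power split are assumptions.
VARIABLES = {
    "writer_gender": ["male", "female"],
    "writer_rank": ["gentry", "non-gentry"],
    "recipient_type": ["family", "non-family"],
    "power": ["inferior", "equal", "superior"],
    "recipient_gender": ["male", "female"],
}

BALANCED = ["writer_gender", "writer_rank", "recipient_type"]
EXTENDED = BALANCED + ["power", "recipient_gender"]


def cell_count(names):
    """Number of cells in the cross-classification of the named variables."""
    return prod(len(VARIABLES[name]) for name in names)


def empty_cells(letters, names):
    """List the combinations of values for which the sample has no letter."""
    seen = {tuple(letter[name] for name in names) for letter in letters}
    return [cell for cell in product(*(VARIABLES[name] for name in names))
            if cell not in seen]


print(cell_count(BALANCED), cell_count(EXTENDED))
```

Under these assumptions the balanced design has 8 cells and the extended design 48, so the same amount of material would have to be spread six times more thinly.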

Finally, some contextual factors, such as the gender and age of the informant (provided the year of birth is known and the letter is dated and signed), or family roles in most cases, can be objectively defined. Many others, such as the social status of the informants, their mutual roles in the situation or the topic and purpose of writing, are open to varying degrees of interpretation.

3. Corpus searches

The available corpus search programs are typically character based, offering search options for words or parts of words. This is often enough for many lexical, semantic, morphological and even grammatical studies. This may also be adequate for pragmatic studies if the linguistic function or phenomenon studied can be connected to certain words or, with a tagged/parsed corpus, to word classes or syntactic structures, but this is not always the case. Some language phenomena may be impossible to locate explicitly and comprehensively, such as the identity and relational functions, and linguistic categories may even be diverse and fuzzy, as in the case of reporting.

3.1 Operationalizing identity and relational functions

When research questions concern linguistic functions such as how identities are linguistically constructed/expressed, the units of analysis are not necessarily identifiable as clause- or phrase-level features; for example, they may be sequences in the interaction such as meeting openings that are identified as likely places for identity work between the interactants to take place, or identity construction may be studied through discursive strategies like humour (Schnurr et al. 2007). Such features are impossible or very difficult to identify with character-based corpus tools without first annotating them in the corpus.

I am particularly concerned with identities, social roles and hierarchies in interaction in letters, but how could these identity and relational functions of language firstly be operationalized and secondly be searched for in a large corpus (Palander-Collin 2002, 2006, forthcoming)? We can partly work from the present-day angle, investigating features that have been found to be significant in varying types of communication in varying social settings today. The problem here is that we cannot rely on our own first-hand experience and intuition like scholars interpreting language practices in contemporary societies they are living in. Another problem is that many present-day pragmatic studies focus on conversational situations, where such linguistic features as turn-taking, topic changes or the amount of time used by each speaker have been found to be relevant. These features cannot be studied in most historical contexts, since historical evidence is written and very often far from conversational. Communicative situations cannot be observed as they unfold, nor can interactants be interviewed to discover their attitudes to or understanding of the situation.

I have chosen to focus on first- and second-person pronouns, as these have been identified as involvement features, and personal letters in general are characterised as highly involved (Biber 1988, 1995; Biber & Finegan 1989, 1997). The frequency of first- and second-person pronouns has also been shown to be higher in Early Modern English plays, which depict interaction between characters, than in pamphlets of the same period (Suhr 2002: 6-8). Nurmi & Palander-Collin (forthcoming) use the Keyword tool of WordSmith to demonstrate that first- and second-person pronouns are significant words in eighteenth-century letters as compared to newspaper texts (the Zurich English Newspaper Corpus) or even dialogue (A Corpus of English Dialogues 1560-1760). Moreover, first- and second-person pronouns are important features in establishing genre conventions, and they also reveal differences between literary and non-literary text types, such as reciprocal interaction between the self and other in fiction, as opposed to the egocentric 'I' of autobiographies, diaries and travelogues (Taavitsainen 1997: 24, 257). In Present-day English, 'I' occurs more frequently in speech than in writing (Wales 1996: 68). Finally, first- and second-person pronouns are computer-searchable, and their frequencies in large amounts of data can be analysed. As such, they provide baseline evidence for micro-level interpretations.
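A keyword comparison of this kind rests on a statistic that measures how far a word's frequency in one corpus departs from its frequency in a reference corpus. The Python sketch below computes the log-likelihood statistic, which is one measure commonly used for this purpose; it is offered as an illustration of the principle and is not a description of the settings used in the study cited. The function and parameter names are illustrative.

```python
import math


def log_likelihood(freq_study: int, freq_ref: int,
                   size_study: int, size_ref: int) -> float:
    """Log-likelihood keyness of one word, given its frequency in a study
    corpus and a reference corpus and the total size of each corpus."""
    total_freq = freq_study + freq_ref
    total_size = size_study + size_ref
    # Frequencies expected if the word were equally common in both corpora.
    expected_study = size_study * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    ll = 0.0
    if freq_study > 0:
        ll += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll


# Invented counts: a word occurring 20 times in 1,000 words of one corpus
# and 10 times in 1,000 words of another.
print(round(log_likelihood(20, 10, 1000, 1000), 3))
```

A value of zero means that the word is equally frequent in both corpora; the larger the value, the stronger the evidence that the difference in frequency is not due to chance.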

In historical linguistics, the focus has often been on linguistic forms that have undergone morpho-syntactic changes, or have suddenly entered or disappeared from the language system. In the history of English, second-person pronouns have attracted particular attention, since they have undergone a major system change from a T/V system (YOU vs. THOU) to the sole second-person pronoun YOU, and this change has been studied in a pragmatic framework (see e.g. many articles in Taavitsainen & Jucker 2003). Forms that have not undergone such changes (like first-person singular pronouns) have not attracted as much attention, but, of course, there may be both synchronic and diachronic pragmatic variation in the use of morpho-syntactically stable items.

First- and second-person pronouns are not the only factors related to identity in letters, and I plan eventually to investigate more global writing styles or patterns of interaction characteristic of certain people and situations. Examples (1) and (2) show how the late-sixteenth-century informant Nathaniel Bacon writes to his social superior on administrative matters and to his social inferior on farming matters. First- and second-person pronouns can easily be detected, and I have used quantitative corpus methods to show that there are significant differences in the frequencies of these items in Bacon's letters, differences which reflect social hierarchy and intimacy. These pronouns are most frequently used in Bacon's letters to inferiors and (equal) family members, and least frequently in his letters to superiors, including his own father; the superiors are mostly addressed with the deferential phrase 'Your Lordship/Ladyship' (Palander-Collin 2006, forthcoming).

(1) verie good Lord, the occasion of my writinge nowe and sendinge to your Lordship is because... I ame well assured that your Lordship will not finde fault with any thinge done touchinge this cause when you shalbe let to understand the truthe of the procedinge therein. Thus beseching God to blesse your Lordship with an increase of his holie spirit to his glory and your great comfort, I humblie take my leave... Your Lordships to commande.
(Nathaniel Bacon to Lord North, Steward of the Duchy of Lancaster in Norfolk, Suffolk and Cambridge, 1582, p. II, 227)

(2) Aldred, yf the rentes of West Somerton be ready & gathered, I wold have them delivered to Momforth, the bearer hearof. Toutching your rent corne I am content he shall bargaine with yow either for the whole or for your part....Thus far yow hartely well... Your very friend Nathaniel Bacon
(Nathaniel Bacon to Goodman Aldred, 1570s, p. I, 93)
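
The kind of frequency comparison described above can be illustrated with a minimal Python sketch. It is not the procedure used in the studies cited: the spelling lists, the function name `pronoun_rates` and the decision to count Your Lordship/Ladyship separately are all assumptions made for illustration, and early modern spelling variation would call for much fuller lists.

```python
import re
from collections import Counter

# Illustrative, incomplete spelling lists (an assumption, not the search
# terms of the cited studies); early modern variants would need extending.
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}
SECOND_PERSON = {"you", "yow", "your", "yours", "thou", "thee", "thy", "thine"}

# Deferential address phrase, counted separately from plain pronouns.
DEFERENTIAL = re.compile(r"\byour\s+(?:lordship|ladyship)s?\b")


def pronoun_rates(text, per=1000):
    """Return raw counts and rates per `per` words for one letter or sample."""
    lowered = text.lower()
    tokens = re.findall(r"[a-z]+", lowered)
    counts = Counter(tokens)
    n = len(tokens)
    first = sum(counts[w] for w in FIRST_PERSON)
    second = sum(counts[w] for w in SECOND_PERSON)
    return {
        "tokens": n,
        "first": first,
        "second": second,  # includes "your" inside deferential phrases
        "deferential": len(DEFERENTIAL.findall(lowered)),
        "first_rate": per * first / n if n else 0.0,
        "second_rate": per * second / n if n else 0.0,
    }


# A fragment of example (2); far too short to yield meaningful rates.
fragment = ("Toutching your rent corne I am content he shall bargaine "
            "with yow either for the whole or for your part")
rates = pronoun_rates(fragment)
```

In an actual study the function would be applied to whole letters grouped by recipient type (superior, inferior, family member), and the resulting rates would then be tested for significance; a single fragment only shows the mechanics.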

In our study of eighteenth-century letters (Nurmi & Palander-Collin forthcoming) we demonstrate the existence of very similar frequency patterns for first- and second-person pronouns. In spite of the striking similarities between the sixteenth and eighteenth centuries, the overall patterns of interaction have most likely changed to some extent during this period, as commentators such as Burke (2000: 45-46) claim that the humiliative style still common in Bacon's time was later criticised, and consequently declined, together with elaborate forms of address and formality. An analysis of cotext and co-occurrence patterns is needed to reveal whether such linguistic changes occurred.

First- and second-person pronouns are clearly important loci for the coding of social relationships, but more work is needed to distil the components of Bacon's 'humble' style to superiors and 'firm and friendly' style to inferiors. One possible direction would be to manually analyse the cotext of first- and second-person pronouns. This would involve more qualitative work and close readings of the context; for example, questions could include what kind of communicative tasks these pronouns occur in, which semantic classes of verbs they take, or whether they occur predominantly in formulae or not.

Another intriguing direction would be to use computational corpus methods to identify linguistic co-occurrence patterns in a parsed corpus like PCEEC. Biber & Jones (2005) introduce a computational approach that combines corpus linguistics and discourse-analytic perspectives to analyse discourse patterns in a corpus of biology research articles. They apply and develop the methods of multi-dimensional analysis and cluster analysis to identify different vocabulary-based discourse units in the corpus and to analyse the linguistic characteristics of these units. The discourse-unit identification was automated and based on the similarity of vocabulary within a section of language. The lexical and grammatical features included in the analysis were also identified by a tagger. In this case, the analysis produced six different discourse unit types that were maximally similar in their linguistic characteristics. For instance, discourse-unit type 1 was interpreted as 'evaluation of implications and explanations (within the context of current knowledge)' and had particularly high scores on dimension 1 ('evaluation of possible explanations'), which included linguistic features like predicative adjectives, the main verb be, adjective + to-clauses, adjective + that-clauses, adverbs, and prediction modals (Biber & Jones 2005: 168). In the example given, discourse-unit type 1 was found as the tenth and last discourse unit of a biology research article, in its discussion section (Biber & Jones 2005: 171).
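
The general idea of locating discourse-unit boundaries through vocabulary similarity can be sketched as follows. This is a deliberately simplified illustration and not Biber & Jones's actual procedure: the fixed block size, the cosine measure, the threshold and all function names are assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-frequency Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def adjacent_similarities(tokens, block=50):
    """Compare the vocabulary of each fixed-size block with the next one."""
    blocks = [Counter(tokens[i:i + block])
              for i in range(0, len(tokens), block)]
    return [cosine(x, y) for x, y in zip(blocks, blocks[1:])]


def boundaries(similarities, threshold=0.1):
    """Indices of blocks that open a new unit (similarity below threshold)."""
    return [i + 1 for i, s in enumerate(similarities) if s < threshold]
```

Low similarity between neighbouring blocks suggests a shift in vocabulary and hence a candidate boundary. The linguistic characterization of the resulting units, which is the core of the multi-dimensional and cluster analysis, is not attempted here.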

3.2 Diverse and fuzzy linguistic categories

In our studies on reporting in eighteenth-century correspondence (Palander-Collin & Nevala 2006, Nevala & Palander-Collin forthcoming) we started off with the grammatical category of reporting, usually identified in grammar books as either direct or indirect on the basis of structure. He said, "I'm exhausted" is direct reporting, where the original speaker's words are repeated verbatim, whereas he said he was exhausted is the indirect version, where the report undergoes mechanical deictic and tense changes. However, when reading Hester Piozzi's letters, we soon noticed that reality was not quite so simple. Examples (3a)-(3f) illustrate the scope of variation. The italics indicate reporting frames in the examples.


  (3a) Count Grimani says, "I never laughed so much in England as I have done since I came into Wales."
  (3b) He says however he never laughed so much in England as he has done since he came into Wales. (Hester Piozzi to Hester Maria Thrale, 19.12.1794, p. II, 218)
  (3c) He insists on never laughing so much in England as he has done since he came into Wales.
  (3d) He speaks of his marvellous time in Wales.
  (3e) his/the report that he never laughed so much in England as he has done since he came into Wales; his/the report of his marvellous time in Wales
  (3f) I hear/I have been told/it is said he never laughed so much in England as he has done since he came into Wales.

We also discovered that, although say and tell are indeed the most frequent reporting verbs in Piozzi's letters, if we had searched for them alone we would have found barely half of the examples we ultimately included in our analysis. All in all, there were over 50 different reporting verbs, not to mention nominal frames such as that in example (3e). Looking for all these verbs in the corpus would mean reading through thousands of irrelevant examples. In this situation, we felt it best to read the entire text and pick out the relevant examples manually.
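
The limits of a verb-based search can be shown with a small Python sketch that runs a search for forms of say and tell over examples (3a)-(3f). The list of word forms and the variable names are assumptions made for illustration; the search is not the one used in our study.

```python
import re

# Modern word forms of SAY and TELL only (an illustrative assumption);
# historical spellings would need to be added for real corpus work.
SAY_TELL = re.compile(
    r"\b(?:say|says|said|saying|tell|tells|told|telling)\b",
    re.IGNORECASE,
)

examples = {
    "3a": 'Count Grimani says, "I never laughed so much in England '
          'as I have done since I came into Wales."',
    "3b": "He says however he never laughed so much in England "
          "as he has done since he came into Wales.",
    "3c": "He insists on never laughing so much in England "
          "as he has done since he came into Wales.",
    "3d": "He speaks of his marvellous time in Wales.",
    "3e": "his/the report that he never laughed so much in England "
          "as he has done since he came into Wales",
    "3f": "I hear/I have been told/it is said he never laughed so much "
          "in England as he has done since he came into Wales.",
}

# Examples retrieved by the search, and examples it fails to retrieve.
found = sorted(k for k, text in examples.items() if SAY_TELL.search(text))
missed = sorted(set(examples) - set(found))
```

Here the search retrieves (3a), (3b) and (3f) but misses the frames with insist, speak and the nominal report. Extending the pattern to dozens of further verbs would retrieve more reports, but at the cost of many irrelevant hits.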

Semino & Short (2004) carried out a corpus-based study on reporting using the Lancaster Speech, Thought and Writing Presentation Corpus, where reporting constructions are annotated, but they, too, found reporting constructions and contexts so varied that annotation had to be carried out by human analysts. In our case, we could apply "problem-oriented" annotation to our corpus instead of marking down information about the examples in Excel files, but this would not immediately enhance our research or make it any easier or quicker (cf. McEnery et al. 2006: 43). However, problem-oriented annotation might have encouraged us to return to the examples after more research, and in this case new insights could be added to the annotations. In practice, research projects often progress with a great deal of circularity, and this could perhaps be made more evident and transparent with an annotation scheme. Thus, annotation would contribute to the replicability of research and gradually also to the construction of pragmalinguistic categories to be coded for more general sociopragmatic purposes.
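
What a problem-oriented annotation record might look like, as opposed to a row in a spreadsheet, can be sketched in Python. The record type, its field names and the category labels are all hypothetical; they do not reproduce any existing annotation scheme, and the analytical note in the usage lines is invented purely to show how later insights could be appended.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ReportAnnotation:
    """One annotated reporting example (all field names are hypothetical)."""
    example_id: str
    text: str
    frame: str          # the reporting frame, e.g. "He says"
    frame_type: str     # e.g. "verbal" or "nominal"
    report_type: str    # e.g. "direct" or "indirect"
    notes: List[str] = field(default_factory=list)  # later insights

    def add_note(self, note):
        """Append a new observation without overwriting earlier ones."""
        self.notes.append(note)


record = ReportAnnotation(
    example_id="3b",
    text="He says however he never laughed so much in England "
         "as he has done since he came into Wales.",
    frame="He says",
    frame_type="verbal",
    report_type="indirect",
)
record.add_note("Invented placeholder note: revisit after reading more letters.")

# Serialize to JSON so the annotation can be stored alongside the corpus text.
serialized = json.dumps(asdict(record), indent=2)
restored = ReportAnnotation(**json.loads(serialized))
```

Keeping the notes as a growing list rather than a single overwritten cell is one simple way of making the circular progress of the analysis visible.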

4. Conclusion

Corpus linguistics seems to be one of the major methods and research trends in English linguistics today. This is not surprising, as corpora provide plentiful data as well as relatively easy access to them. Moreover, corpus methods can be applied to a variety of empirical research questions, and, of course, corpus studies have immensely increased our understanding of how language is used. Corpus linguistics is a powerful tool that enables us to detect linguistic tendencies and make generalizations about language variation and change for theoretical as well as practical purposes, such as EFL or ESL teaching and learning. The downside of corpus linguistics is that the options currently available, in the form of existing corpora and corpus tools, may steer research towards a computer-friendly mould that leaves other questions on the margins of linguistics. To avoid this, it is important to discuss the kinds of requirements for corpora and corpus tools that arise from a wide array of research questions.

Corpora reflect their compilers' research goals and understanding of what language is. There is no general-purpose corpus fit for every research question, so corpus compilers can benefit users and ensure the high quality of research by being explicit about their compilation principles. Here, I have discussed the use of corpora in sociopragmatic research. For this approach, it is important to be able to define the communicative context, as the central question is why people use language in a particular way in a given situation. A sociopragmatic annotation scheme could be helpful for contextualization, although the specific contextual information to be annotated in the corpus would also depend on the nature of the material. In this case, early English letters were contextualized with the help of Fairclough's (1992) three-dimensional model of text, discursive practice and social practice.

Sociopragmatic research questions are not always easily turned into computer-searchable items. This may be because a question primarily concerns a linguistic function, as with the identity and relational functions of language discussed here. Some aspects of identity work, like humour, are impossible to search for unless they have first been coded in the corpus. Even so, we may not have a general understanding of what humour is or was in earlier historical contexts. Some aspects of identity may be operationalized, for example as first- and second-person pronouns, and searched for in a corpus. Of course, we do not have a comprehensive list of 'expressions of identity', nor is it possible to have such a list, and relevant features have to be identified from the texts. A helpful computational method which makes use of available annotation could perhaps be provided by using parsed corpora for the analysis of co-occurrence patterns. In the case of early English letters, such a mapping might reveal which grammatico-lexical features occur in the various styles that a sixteenth-century gentleman had in his linguistic repertoire.

Even form-to-function mapping may prove surprisingly difficult, as, in spite of neat grammatical descriptions provided by grammar books, the form may in fact be varied and fuzzy, as in the case of reporting. In such a case, the corpus could be annotated for the purposes of a particular research question and for future use in developing more general pragmalinguistic annotations, but the benefits of annotation for the researcher would not be immediately obvious. Both corpus annotation and developing computational corpus methods are always time-consuming and costly exercises, and in many cases also require technical and/or statistical expertise which is not included in typical linguistic training. In this sense, corpus development is a team effort par excellence, calling for people with various areas of expertise.


[1] See, for example, the research statement of the Research Unit for Variation, Contacts and Change in English (VARIENG).

[2] For more details about the CEEC, see also Nevalainen & Raumolin-Brunberg 1996 and Raumolin-Brunberg & Nevalainen 2007.

[3] The CEECS can be obtained through the Oxford Text Archive.

[4] The PCEEC can be obtained through the Oxford Text Archive.

[5] See also Wood 2004, for an application of the critical discourse analysis approach to the letters of Margaret Paston.

[6] Sender database parameters include the following (Nevalainen & Raumolin-Brunberg 1996: 50): last name, first name, title, year of birth, year of death, first letter, last letter, sex, rank, father's rank, social mobility, place of birth, main domicile, migrant status, education, religion, number of letters, number of recipients, type of recipient, number of words, letter contents, letter quality, collection, career, migration history, extra information.


