Studies in Variation, Contacts and Change in English

Volume 18 – Exploring Recent Diachrony: Corpus Studies of Lexicogrammar and Language Practices in Late Modern English

Article Contents

1. Introduction

2. Data and methodology

2.1 Corpus of Late Modern English Texts 3.0

2.2 Multilingual practices in the history of written English

2.3 The quantification of switched passages and some basic descriptive data

2.4 Decision tree partitioning

3. Social, textual and switching parameters in detail

3.1 Social parameters

3.2 Textual parameters

3.3 Parameters of switching

3.4 Overview of all parameters studied

4. Evidence from decision trees

5. Discussion

6. Conclusion





Analysing multilingual practices in Late Modern English: Parameter selection and recursive partitioning in focus

Jukka Tyrkkö, Linnaeus University
Arja Nurmi, University of Tampere


Over the last ten years, historical linguists have paid increasing attention to multilingualism and multilingual practices, defined as the alternating use of resources from two or more languages by a writer within a single text. Most studies in this emerging field have been carried out using relatively small datasets, such as individual genres or texts written by a single person, which allow only limited opportunities for the discovery of broader tendencies. Furthermore, previous studies have typically examined multilingual practices in light of individual factors without attempting to identify significant predictors in a multifactorial setting. With the 34-million-word multigenre Corpus of Late Modern English Texts 3.0 as our primary data we explore the sociolinguistic and textual factors that predicted the use of foreign languages in English texts during the Late Modern period. To identify the most significant variables, we use recursive partitioning, a statistical method in which significance-based splitting is carried out in a stepwise fashion on predictor variables. The decision trees produced by recursive partitioning are easily interpreted and they allow rule-based predictions about the likely frequency of the linguistic feature being investigated.

1. Introduction [1]

Multilingual practices are increasingly the object of study in written language and in the historical stages of languages, thus expanding the viewpoint from the traditional present-day spoken language. Researchers are beginning to abandon the monolingual ideal of texts and are increasingly recognising the prevalence of multilingualism in all societies and in all historical periods. While the multilingual practices evident in spoken language have commonly been discussed in terms of code-switching or code-mixing, the terminology for describing the phenomenon in written language is not entirely established. We use the term multilingual practices here, since we wish to highlight the difference between the occurrences of potentially foreign-language elements in historical written texts and the study of present-day spoken phenomena, where a greater variety of clues (e.g. prosody) is available for identifying the languages a speaker is using in any given utterance. In written language, the criteria for distinguishing borrowing and code-switching are much more limited, and in the case of diachronic study, the process of borrowing may be ongoing in the data.

This paper explores multilingual practices in Late Modern English using the Corpus of Late Modern English Texts 3.0 (CLMET3.0) as primary data. We trace the links between social and textual variables and the use of foreign languages in published English texts from the eighteenth and nineteenth centuries, continuing the work started in Nurmi et al. (forthcoming). Our research hypotheses are that (1) the use of foreign languages in English written texts is related to social and textual variables and that (2) recursive partitioning is a useful method for discovering the relative importance of these variables as predictors of language switching. Previous studies of multilingual practices in historical texts – or in writing more generally – have looked at smallish datasets, and only rarely used structured corpora of any kind (see e.g. Pahta & Nurmi 2006, discussing multilingualism evident in the Helsinki Corpus). In addition to investigating a large corpus, we are applying more sophisticated statistical methods than has been customary. The authors, texts and foreign-language passages have received an extensive parameter coding, and through recursive partitioning the most significant variables linked to multilingual practices are identified.

We start by presenting the relevant background in section 2, including an introduction to the CLMET3.0 (section 2.1), a discussion of previous work on multilingual practices in the history of written English (2.2), the quantification of switched passages (2.3) and an introduction to decision tree partitioning (2.4). We go on to introduce the variables used in our analysis of the data: social (section 3.1), textual (3.2) and those related to the types of foreign-language passages (3.3); a summary of the variables is found in 3.4. In section 4, we present the evidence from decision trees, go on to discuss the findings in section 5 and evaluate the procedure in section 6.

2. Data and methodology

2.1 Corpus of Late Modern English Texts 3.0

The Corpus of Late Modern English Texts 3.0 (hereafter CLMET3.0) was compiled by Hendrik De Smet, Hans-Jürgen Diller and Jukka Tyrkkö and released in 2013. [2] The corpus consists of 333 texts, roughly 34 million words, printed between 1710 and 1920. Each text is included in full length. The texts were primarily harvested from Project Gutenberg, with a smaller number of texts collected from the Oxford Text Archive and the Victorian Women Writers Project, primarily to improve the genre and gender balance. The corpus is a principled collection of texts with the aim of providing scholars working on Late Modern English with a representative and reasonably large dataset while still allowing for text-specific attention when needed. The texts are primarily printed works, usually but not exclusively written by well-known authors. There are also a small number of contemporaneously edited collections of letters and newspaper articles. The corpus comes with a genre classification providing six genre labels: narrative fiction, narrative non-fiction, drama, letters, treatises and other. As part of the present project the genre labels were revisited and revised, as described in section 3.2.

2.2 Multilingual practices in the history of written English

This study continues our work in describing the multilingual practices evident in Late Modern English writing, and linking the appearance of foreign languages to sociolinguistic and textual variables. A monofactorial overview of the data was presented in Nurmi et al. (forthcoming), and this paper presents a more multifaceted discussion of the interplay of the variables analysed. The analysis and the variables chosen build on the only consistent body of corpus-based, quantitative work in the field, that of Nurmi & Pahta (2004, 2010, 2011, 2013) and Pahta & Nurmi (2006, 2009, 2011). The choice of sociolinguistic and textual variables for this study has been informed by those studies. The history of multilingual practices has been studied by many other writers; for a selection, see e.g. Schendl & Wright (2011). Most of that work has been of a more philological or qualitative nature, charting specific texts, text types or archives. For a further discussion of these topics, see also Nurmi et al. (forthcoming) and Tyrkkö et al. (forthcoming).

Our choice to call the phenomenon studied here “multilingual practices” is based on several criteria. Firstly, the definitions of code-switching typically focus on criteria developed for studying the linguistic behaviour of balanced bilinguals as members of a linguistic minority group in present-day spoken contexts (see e.g. Gumperz 1982, Franceschini 1998: 53, Auer 1998: 3). As our material is written and historical, and the foreign-language elements in it are evidence of language having been learnt in the course of education rather than acquisition at home or in the spoken environment at large, there is only a certain amount of overlap in the phenomena described. Furthermore, in a diachronic study the question of borrowing and switching is even more blurred than it is in present-day spoken language. A foreign-language element may be available for erudite writers and their readership for a long time, while remaining inaccessible to the less educated strata of the population. This state of affairs may also change over time, as some phenomena and the terms and expressions associated with them occasionally become more prevalent in general parlance. The questions of frequency of usage, topic domains and audiences have a notable role in any attempt to separate borrowing and code-switching in the kind of data we are dealing with, and for the purposes of this study, attempting to make the difference is not relevant. Having said that, we do refer to the foreign-language passages in our data as switched passages and the phenomenon as code-switching for the sake of expediency.

2.3 The quantification of switched passages and some basic descriptive data

The non-English content in CLMET3.0 was identified using a semi-automatic procedure discussed in detail in Tyrkkö et al. (forthcoming). In brief, potential words in languages other than English were first identified using the tagging tool Multilingualiser, after which sequences of two or more such words were subjected to manual analysis. [3] The tool identifies potentially foreign-language items by means of dictionary look-up and user-adjustable weighting of their collocates. In practice, the tool assigns a high ‘foreignness’ value to items that are not found in the English lexicon and a lower value to items that have homonyms in English. Multiple foreign items appearing in sequence are assigned a higher ‘foreignness’ value and items in-between foreign items get a lower value. Chunks that reach or surpass the user-assigned threshold are tagged as foreign and can be later evaluated manually. The version of the tool used in the discovery process included Latin, French, Italian and German dictionaries. The tool also includes a simple character n-gram process for identifying word-initial and word-final bigrams that are unusual or unknown in the English lexis. The items included in the present study were all manually verified by the project team. Greek items, which are often written in the Greek alphabet in the source texts and rendered as OCR errors in the Gutenberg source texts of the CLMET corpus, were manually verified throughout the corpus by project member Jukka Tuominen (see Tyrkkö et al. forthcoming for a discussion). Notably, because the method is based on dictionary look-up, there is a 100% likelihood that all reasonably frequent words in the four languages are discovered, and since nearly all longer chunks of foreign items will include one or more high-frequency items such as function words, the accuracy of the method is high (estimated at 97% for chunks of two words or longer). Although it is possible that a few very short items may have been missed in a corpus of 34 million words, their overall number is in our estimation negligible.
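The look-up-and-threshold idea described above can be illustrated with a minimal Python sketch. It is not the Multilingualiser implementation: the toy lexicons, the scores (1.0, 0.5, 0.0), the threshold and the function names are all assumptions made for illustration, and the collocate weighting and character n-gram steps are left out. The test sentence is borrowed from the article's example (1).

```python
from typing import List, Tuple

# Hypothetical toy lexicons; a real tool would load full dictionaries.
ENGLISH = {"he", "knew", "some", "of", "the", "shops", "and", "in", "rue"}
FOREIGN = {"ateliers", "rue", "de", "la", "paix", "entre", "nous"}


def foreignness(word: str) -> float:
    """Assumed scoring: high if absent from English, lower for homographs."""
    w = word.lower()
    if w in FOREIGN and w not in ENGLISH:
        return 1.0
    if w in FOREIGN and w in ENGLISH:
        return 0.5  # e.g. 'rue', which is also an English word
    return 0.0


def find_chunks(tokens: List[str], threshold: float = 0.75,
                min_len: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs for runs of at least min_len
    consecutive tokens whose score reaches the threshold."""
    chunks = []
    start = None
    # The empty sentinel token closes a run that reaches the end of the text.
    for i, tok in enumerate(list(tokens) + [""]):
        if foreignness(tok) >= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                chunks.append((start, i))
            start = None
    return chunks


tokens = "He knew some of the shops and ateliers in the Rue de la Paix".split()
print(find_chunks(tokens))  # [(11, 14)], i.e. 'de la Paix'
```

In this sketch the single word ateliers is scored as foreign but dropped by the two-word minimum, and Rue falls below the threshold because of its English homograph, which mirrors the behaviour reported for the real tool.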

The decision was made early on to exclude single foreign words from the analysis, on the grounds that distinguishing between single foreign and English words would be both extremely time-consuming and difficult to carry out consistently, given the long timeframe of the corpus. Following the initial discovery, the chunks of two or more words were manually evaluated by members of the research team: the language and length (in words) of each chunk were verified, borderline cases were retained or discarded based on the decision of at least two project members, and the switch type of each chunk (Conventionalised, Prefabricated, Free) was determined (see section 3.3). In the manual study of chunks an inclusive approach was taken: only clearly inappropriate instances, such as proper nouns or cases where Multilingualiser had mistakenly identified an English word as foreign, were excluded from the study. In example (1), the word ateliers has been identified by Multilingualiser as French, which it is. As mentioned above, separating many foreign-origin loans from single-word switches is difficult, and so we chose not to include single-word switches such as this. In the second chunk coded as French, we see the string of words de la Paix; Multilingualiser has not recognised Rue as French, possibly because there is an English homograph (rue the herb). This instance represents a proper name, which we chose to exclude from our data, since in many cases there is no English equivalent that could replace the name. For more discussion on the manual coding process, see Tyrkkö et al. (forthcoming).

(1) He knew some of the shops and ateliers_CSFr in the Rue de_CSFr la_CSFr Paix (Bennett, The Old Wives’ Tale)

Arguably, many cases that are included in the data on the basis of our inclusive policy, such as a priori, can be regarded as part of the English language by the time they appear in the corpus data. They would, however, still be frequently italicised to indicate foreign origin, and they would only be familiar to an educated elite.

Altogether 4,926 chunks were identified, the vast majority of which were in French (1,932 chunks) and Latin (2,152) (for a discussion of precision and recall, see Tyrkkö et al. forthcoming). As mentioned above, the manual analysis was inclusive, and so all ambiguous cases were included at this stage of the analysis. Other somewhat frequent languages were German (110), Greek (270) and Italian (358), in addition to which a further fifteen languages were in evidence in very small amounts (see Appendix). Figure 1 gives the number of chunks for each of the five major foreign languages as well as the proportion of the three switch types (for the latter, see section 3.3).

Figure 1. Major languages other than English in CLMET3.0 (log2 scale).

For the purposes of quantification, the basic unit of analysis was defined as a switched passage, that is, any instance in the text where the author switches from English to another language for two or more words. The vast majority of passages consist of a few words, and lengthy passages are quite exceptional (see Figure 2): for example, Laurence Sterne in Tristram Shandy switches to Latin for two very long stretches (531 words and 616 words) as well as to French (519 words). The median length of a switched passage was 4 items. In addition to switches from English to a foreign language, there were a few instances of switches between foreign languages, e.g. from French to Latin: these were counted as separate passages. In the present study, these 4,926 switched passages are the only unit of interest and thus the length of the sequence being switched to, as well as the overall volume of foreign content in the text, were left out of the present analysis.

Figure 2. Distribution of lengths of switched passages.

The frequency of switching was quantified using the established corpus linguistic method of standardising to a base of 1,000 words. Although this may be seen as less than ideal when it comes to the longer passages, it is worth noting that in each case we are dealing exclusively with full books, and that the overall frequency of foreign items is always very low compared to the full word count of the book. The primary objective of the present study was to identify correlations between the frequency with which authors resort to foreign content and the sociolinguistic and textual parameters of the texts. Earlier monofactorial analysis of the dataset suggested that non-fiction texts and academic authors favour multilingual practices, with university education as the single most predictive sociolinguistic variable (Nurmi et al. forthcoming).
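As a worked illustration of the standardisation step, the following Python sketch converts a raw count of switched passages into a frequency per 1,000 words. The function name and the example figures are hypothetical and are not taken from the corpus.

```python
def per_thousand(passages: int, words: int) -> float:
    """Standardised frequency: switched passages per 1,000 words of text."""
    if words <= 0:
        raise ValueError("word count must be positive")
    return passages * 1000 / words


# Invented example: 12 switched passages in a book of 80,000 words.
print(per_thousand(12, 80_000))  # 0.15
```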

2.4 Decision tree partitioning

Recursive partitioning, also known as decision tree partitioning, is a multivariate statistical method that can be used as an alternative to regression analysis (see, e.g., Strobl et al. 2009). The method is closely related to classification and regression modelling tasks in machine learning, with relevant literature going back to the 1980s. [4] The analysis in the present study was carried out using the statistical package JMP12; in the R environment, the rpart library can be used. Recursive partitioning can also be carried out in SPSS and other statistical software. Because the method is not widely used in corpus linguistics, we shall provide a concise description before proceeding further.[5]

The general rationale behind partitioning methods is to divide a large dataset into smaller sub-groups which are both internally similar and reasonably different from other sub-groups. Typically, the procedure involves a response or outcome variable and a number of factors or predictor variables, and the researcher wants to understand how changes in the predictor variables affect the response variable. A typical partitioning task might include a number of lifestyle variables and an outcome variable indicating the predicted age of death. Using a partitioning model, we can weigh the relative impact of the lifestyle variables and understand their cumulative effects.

The process of building decision trees with the recursive partitioning method is quite straightforward. Starting with the entire dataset, statistical tests for significance are carried out over all the variables and their levels to find the one that gives the best split into two groups (Quinlan 1993). The best split indicates the best predictor of a difference: the first split in a tree identifies the strongest predictor, the second split identifies the next strongest predictor, and so on. Thus, given a set of predictor variables and an outcome variable, the model allows us to identify at each node of the decision tree which predictor we should select next.
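The search for the best split at a single node can be sketched as follows. This is a simplified stand-in rather than the JMP procedure: instead of a significance test it ranks candidate splits by the reduction in the sum of squared errors of the response, and it only considers splits that separate one level of a nominal predictor from all the others. The variable names and the four rows of data are invented for illustration.

```python
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(rows, response, predictors):
    """Return (predictor, level, gain) for the one-level-versus-rest split
    that most reduces the sum of squared errors of the response."""
    total = sse([r[response] for r in rows])
    best = None
    for p in predictors:
        for level in sorted({r[p] for r in rows}):
            left = [r[response] for r in rows if r[p] == level]
            right = [r[response] for r in rows if r[p] != level]
            if not left or not right:
                continue  # a split must produce two non-empty groups
            gain = total - sse(left) - sse(right)
            if best is None or gain > best[2]:
                best = (p, level, gain)
    return best


# Invented toy data: switches per 1,000 words for four imaginary texts.
ROWS = [
    {"university": "yes", "supergenre": "Non-fiction", "freq": 0.9},
    {"university": "yes", "supergenre": "Fiction", "freq": 0.8},
    {"university": "no", "supergenre": "Non-fiction", "freq": 0.2},
    {"university": "no", "supergenre": "Fiction", "freq": 0.1},
]

predictor, level, gain = best_split(ROWS, "freq", ["university", "supergenre"])
print(predictor)  # university
```

Applying the same function again to each of the two resulting groups, with the remaining predictors, would yield the second level of the tree.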

The procedure is carried out in a stepwise fashion with the newly created subsections of the dataset, each time splitting the resulting sections into smaller and smaller groups using the variables that remain. [6] In JMP12, the best split is based on statistical significance using the statistic LogWorth, defined as −log10(p-value), where a LogWorth of 0 corresponds to a p-value of 1 and a LogWorth of 2 corresponds to p = 0.01 (see Figure 3). A LogWorth of 1.30 corresponds to the conventional p = 0.05. The p-value used in calculating LogWorth is an adjusted value that takes into account the fact that the variables may have a varying number of levels and the multiple comparison problem inherent to the model. [7] An unadjusted p-value would favour variables with multiple levels and the conventional Bonferroni correction would favour variables with few levels. [8]

Figure 3. Correspondence of LogWorth and p-value.
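The correspondence between LogWorth and p-value can be checked with a few lines of Python. The sketch only applies the definition LogWorth = −log10(p); the adjustment that JMP applies to the p-value beforehand is not reproduced here, and the function names are our own.

```python
import math


def logworth(p: float) -> float:
    """LogWorth of a p-value, defined as -log10(p)."""
    if not 0 < p <= 1:
        raise ValueError("p must lie in the interval (0, 1]")
    return -math.log10(p)


def p_from_logworth(lw: float) -> float:
    """Inverse transformation: the p-value corresponding to a LogWorth."""
    return 10 ** (-lw)


for p in (1.0, 0.05, 0.01):
    print(p, round(logworth(p), 2))
```

Rounded to two decimals, the three p-values give LogWorth values of 0, 1.30 and 2 respectively.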

The main benefit of recursive partitioning over regression analysis is that the model can be presented as a relatively easily understood series of decisions with real-world frequency predictions. The order in which the variables are selected for splitting gives the researcher a simple flowchart-like representation of the relative importance of the variables within the model, along with predictions concerning the value of the outcome variable at each node. Although recursive partitioning has not been widely used in corpus linguistics, in our opinion the method could be particularly useful in linguistics, where the majority of scholars are only marginally familiar with more complex statistical models and may therefore have difficulty in interpreting results reported in terms of coefficients of determination, likelihood scores and odds ratios. Moreover, as Lantz (2013: 203) notes, “[r]egression modelling […] makes assumptions about how numeric data is distributed that are often violated in real-world data. This is not the case for trees”. Significantly, issues of distribution are particularly relevant to the analysis of linguistic data, which very rarely follow Gaussian distributions and frequently feature prominent outliers.

One of the well-known issues with recursive partitioning is over-fitting, that is, a situation where the procedure is carried beyond the point at which the statistical model still describes relevant data, so that it instead fits a solution to what is in fact nothing more than statistical noise. In recursive partitioning, this phenomenon typically occurs when the splitting is carried out past the point of statistically meaningful differences. Consequently, although the splitting of branches can be carried out until every parameter is exhausted and the final leaves give specific predictions for each possible combination of parameter values, it is best to set a cut-off point for the splitting. This can be done either by assigning a minimum number of observations for the branches, below which splitting is no longer carried out, or by setting a significance threshold, as we have done below.
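The two cut-off criteria can be expressed as a simple stopping rule. The Python sketch below combines both for illustration, although either may be used on its own; the function name and the default values are assumptions (a minimum of 5 observations per branch is arbitrary, and 1.30 is the LogWorth corresponding to p = 0.05).

```python
def should_split(n_left: int, n_right: int, logworth: float,
                 min_size: int = 5, min_logworth: float = 1.30) -> bool:
    """Accept a candidate split only if both branches are large enough
    and the split reaches the significance threshold."""
    large_enough = min(n_left, n_right) >= min_size
    significant = logworth >= min_logworth
    return large_enough and significant


print(should_split(10, 8, 2.0))   # True
print(should_split(3, 20, 5.0))   # False: one branch is too small
print(should_split(10, 10, 1.0))  # False: below the significance threshold
```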

In the present study, our dependent variable was the standardised frequency of switches to foreign languages. Depending on the question, this could be either the collective frequency of all types of switched passages in the text, or the frequency of switching to a specific language, of a particular type of switch, or some combination thereof. The independent or predictor factors were the social and textual parameters discussed in sections 3.1 and 3.2 below. For the purposes of the recursive partitioning, the factors were nominal variables with a varying number of levels, ranging from 2 to 5.

3. Social, textual and switching parameters in detail

In order to arrive at parameters that could be used in our analysis, a great deal of background information concerning the authors and texts of CLMET3.0 was collected into databases. This information was then organised into parameters, some of which were based on earlier research or on the structure of the corpus itself, while others were more experimental, aiming to gauge, for example, the readership of a text. Some parameters, such as the year of writing or the birth year of the author were numeric, others, such as the author’s university education or whether the book is still in print were yes/no variables, i.e. had two levels, and yet others were given a specified range of values (male/female/unknown or the three Supergenres Drama, Fiction, Non-fiction). Finally, all foreign-language passages were analysed and placed into specific categories, so that they, too, could form a building block in the decision tree model.

3.1 Social parameters

The social parameters used in our study were selected partly from the staples of sociolinguistic variables such as age, gender and social class. The other source for social parameters was earlier work on multilingual practices in the history of English (see e.g. Nurmi & Pahta 2010, 2011; Pahta & Nurmi 2009, 2011). Particularly the role of education — and the unequal availability of it — as well as the relationship between writer and reader and the writer’s adopted role in producing a particular type of text have proved to be significant. Many different social parameters were tested as part of the study, but only some proved significant. Our full initial set of parameters included genre, supergenre, occupation, university education, grammar-school education, other formal education, private tuition, age, gender, social status, and significant length of time spent abroad.

Of the traditional sociolinguistic variables, gender did not prove fruitful (72 of the 343 authors were women). We believe that this can be explained at least partly by the nature of published texts, that is, that the texts had undergone editorial processes. [9] Similarly, social class was difficult to apply in any meaningful way, as most of the writers could be regarded as members of the middle class, and finer differentiation among the group would have required extensive biographical analysis beyond the data available in the Oxford Dictionary of National Biography (2004). Age, noted in the database as year of birth, did not prove significant. Another social variable included was place of birth (coarse-grained to London/South of England/North of England/Ireland/Scotland/Wales/Abroad), again with no significance. Geographical mobility was tracked in broad yes/no variables, for mobility in Britain (131/343), and for visits to French- and Italian-speaking countries (120/343 and 72/343 respectively), the other variables being visits elsewhere in Europe (115/343), in other English-speaking regions (61/343) and the catch-all category World (56/343), which includes R.L. Stevenson’s visits to the South Seas as well as other people’s travels in Asia, Africa and South America. It should be noted that while some writers travelled extensively, others never left their native land (143/343). The monofactorial analysis (Nurmi et al. forthcoming) showed a connection between visiting Italy and switching into Latin, but no other notable patterns with regard to this variable emerged.

In order to group our writers in some meaningful way, we decided to rely on five broad categories of occupation. In the category Academic (30/343) are people who worked at universities or were engaged in research, such as Charles Babbage and Lewis Carroll. The category Cultured (62/343) contains people who made their living in the arts: actors, dancers, composers, painters and editors. Examples of this group include David Garrick and Horace Walpole. Many of them supplemented their income by writing plays, essays, critiques and the like, but writing was not their only occupation and some of them had no need of an occupation, having independent means. Professional Writers (73/343) include people who made their living by writing. Typically this category includes successful novelists like Fanny Burney, but also journalists such as Rudyard Kipling during his early career. The catch-all category Other (174/343) has all the people who do not fit into the other three. This is a mixed bag, including e.g. civil servants and politicians as well as explorers like James Cook, but also many of our female writers, who did not find it possible to pursue a career during the eighteenth and nineteenth centuries. In addition, there are four works which do not have an identifiable writer: Gilbert Langley, the pseudonymous writer of an “autobiography”, and three files containing texts from journals. We further split the category Other into two, Other-Professional (48) and Other-Miscellaneous (132). The first group included a bookseller, publishers, civil servants, members of the clergy, medical doctors, governesses, museum clerks, teachers, etc., while the second group included businessmen, farmers, lay preachers and even a tennis player. Further well-argued divisions do not appear to be possible.

Education in the eighteenth and nineteenth centuries was not equally available to all. While poorer folk barely gained their literacy, if that, the more well-to-do were expected to master a number of foreign languages, most commonly French and Latin, but also others, depending on their social status and occupation. In our current study, education was measured in five yes/no parameters, of which university education (139/343) and education abroad (35/343) proved significant (discussed in detail in Nurmi et al. forthcoming). Our education abroad category includes any level of education from home schooling in a French family to attending a university abroad. The other parameters for education (grammar school, other formal education, private tuition) would have resulted in some foreign-language skills for our writers, but these skills did not materialise in our data in the form of multilingual practices, or at least they could not be linked to a pattern of foreign-language usage (Nurmi et al. forthcoming). In addition to schooling, we also attempted to classify the foreign-language skills of our writers by assigning a set of yes/no parameters with regard to individual languages. These languages were again selected based on earlier research, namely Latin, Greek, French, Italian and German. For these, both schooling and other biographical information were included. So, for example, we assumed that anyone attending either a grammar school or a university would have at least a passing familiarity with Latin. There are also anecdotes of some writers’ language proficiency or their studying languages independently, such as the case of Fanny Burney, who famously taught herself French as a teenager (cf. Pahta & Nurmi 2009: 34). Again, known language proficiency did not correlate with the observed multilingual practices in the data (Nurmi et al. forthcoming).

3.2 Textual parameters

The textual parameters build on the genre labels assigned to the corpus files by the compilers as well as on our attempts to gauge the readership of various texts. The genre classification was modified to some extent, leading to ten genres: Biography, Drama, Essay, Fiction, History, Instruction, Journal (as in periodicals such as Punch), Letters, Travel and Treatise. Because the numbers of items in each genre were relatively low, these were further grouped into three Supergenres based on text typological differences: Drama, Fiction and Non-fiction (the last encompassing eight of the genres) (see Figure 4); the Supergenres comprised 74, 131 and 124 texts, respectively.

Figure 4. Genres and Supergenres in CLMET3.0.
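The grouping of the ten genres into three Supergenres amounts to a simple look-up table, sketched below in Python. The genre and Supergenre labels follow the text; the dictionary and function names are our own and do not reflect how the parameter was stored in the project database.

```python
# Ten genre labels mapped to the three Supergenres described in the text.
SUPERGENRE = {
    "Drama": "Drama",
    "Fiction": "Fiction",
    "Biography": "Non-fiction",
    "Essay": "Non-fiction",
    "History": "Non-fiction",
    "Instruction": "Non-fiction",
    "Journal": "Non-fiction",
    "Letters": "Non-fiction",
    "Travel": "Non-fiction",
    "Treatise": "Non-fiction",
}


def supergenre(genre: str) -> str:
    """Return the Supergenre for a genre label (KeyError if unknown)."""
    return SUPERGENRE[genre]


print(supergenre("Travel"))  # Non-fiction
```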

As to the intended readership, this was estimated through three main parameters: the number of editions in the first ten years after publication, the format of the book (quarto, octavo etc.) and children as a specific type of audience. The first two attempt to measure the accessibility of the book in different ways. The number of editions gives a rough estimate of the popularity of the volume, meaning that the more it was reprinted, the more people bought and read it. Similarly, the format of the book gives a good indication of the price range of the volume: the larger the book, the higher the price, and vice versa. We established this by collecting data not only on the size of the book but also on the price. Once it became evident that the two were in fact interdependent, we decided to focus on the more easily found volume size. Both the number of editions and the size of the book were coded using online legal deposit library catalogues (the British Library, and Oxford and Cambridge university libraries), as well as the English Short Title Catalogue for volumes published prior to 1800. Another textual parameter that made the final cut was the author’s anonymity at first publication. This was mainly included since it was reasonably common in the data: 99 of the texts were published anonymously.

Other textual parameters considered but rejected from further analysis included the place of publication (major vs. minor cities as a further indication of readership), the existence of American editions within the first ten years of publication and whether the book is still in print (defined as being in print in the twenty-first century). The place of publication showed too little variation to be considered as a viable parameter, and the American editions would have required further study as to any changes that might appear in them. The third criterion, whether the book is still in print, also proved impracticable. Since the bulk of the corpus uses texts from Project Gutenberg, the texts have been relevant for current readers, and are overwhelmingly still in print.

Finally, there was some attempt to create parameters based on the content of the text. Mainly, we were interested to find out if a text taking place in a foreign context would trigger the appearance of foreign-language passages. This was coded with three possible values: yes, no and not applicable. The first two refer to texts which have a location: drama, fiction, travelogues, letters etc. The not applicable category applies to academic treatises, but also to non-existent fictive surroundings, such as Alice’s Wonderland. This parameter was based on only a cursory browsing of a text rather than a close reading, although many texts were, of course, familiar in themselves already. In the monofactorial analysis (Nurmi et al. forthcoming) it became apparent that a novel taking place in France would trigger switching into French, and a travel text describing distant places would include switches into relevant languages.

3.3 Parameters of switching

The foreign-language passages of two or more words in our data, identified using the Multilingualiser corpus tool and further analysed by us, were classified to provide a further range of parameters. Each passage was coded for language used, length in orthographical words and type of switch. The identification of the rarely occurring languages (see Appendix) is somewhat tentative, but the bulk of the data is reliably classified. Our use of the term switch here refers to the use of linguistic elements which are identifiably of foreign origin. Our interest is in a wide range of multilingual practices, and, as mentioned above, we have included many expressions that occupy the grey area between English and other languages, particularly in the type Conventionalised switches.

Based on earlier research (see e.g. Nurmi & Pahta 2010, 2011, 2013; Pahta & Nurmi 2009, 2011), we identify three switch types: Conventionalised, Prefabricated and Free switches (see Figure 1 and the Appendix for most frequent languages and switch types in them). The Conventionalised switches are typically 2–3 words long, and often exist in the fuzzy area between borrowing and switching. They may appear in dictionaries and may well be part of the professional lexicon of a particular field. Expressions such as terra firma (2), entre nous (3) and gruss’ dich (as it appears in the corpus) are easily used and understood in context with very little or no skill in the language in question, but retain nevertheless some element of foreignness, evidenced by the fact that they are frequently italicised in the source text (Nurmi in prep.). Some are simple greetings of the type bonjour monsieur, which are taught as the first step of learning a foreign language, while others, such as ipso facto, belong in more academic writing.

(2) his scientific observations may not have been so complete as they would have been on terra firma. (Bacon, The Dominion of the Air)
(3) In parting with Walter, Courtland shook his head, and observed: - ‘Entre nous, Sir, I fear this may be a wildgoose chase.’ (Bulwer-Lytton, Eugene Aram)

The second type of switching identified in our data is the Prefabricated switch. These are phrases and longer passages the author has not formulated him/herself but is reporting from another source. Typical examples include proverbs and quotations from classical sources (Horace and Vergil being two favoured authors in our data), as in (4). Also included in this category are words the author has heard and is reporting in writing, as in (5). Both producing and understanding this type of switching are more demanding than in the case of Conventionalised switches. Still, the writer is aided by not being required to form the expressions, only to reproduce them, and the reader has probably encountered many familiar quotations and maxims in other texts as well, and is thus able to understand the content without being able to analyse the language used in more detail. Obviously reported speech (as in 5) requires more skill in the other language than looking up a quotation and reproducing it in writing, but compared to the more formulaic Conventionalised expressions on the one hand and the independently produced Free switching on the other, this type of switching occupies a middle ground as to the linguistic skills required to produce it.

(4) All which, from the words, De gustibus non est disputandum, and whatever else… (Sterne, The Life and Opinions of Tristram Shandy)
(5) D’Aubreu, the pert Spanish minister, said the other day at court to poor Alt, the Hessian, ‘Monsieur, je vous félicite, Munster est pris.’ (Walpole, Letters)

The third type of switching in our data is Free switching. This resembles most closely the phenomenon observed in the speech of bi- or multilingual individuals. Free switching typically appears in novels and drama, and illustrates the writer’s linguistic skills, at the same time describing the characters of the work (6). This is the least common type of switching in our data, which may be related to the nature of written, published texts. Writers tend to be aware of their readers’ linguistic skills (Nurmi & Pahta 2010; Pahta & Nurmi 2009), often tailoring their foreign-language expressions to a particular audience. When the audience is broad, or largely unknown, it makes sense to avoid too much Free switching, as this may well be the type requiring not only the best language skills in writing, but also the most fluency in reading.

(6) He said he was an Italian who had the profoundest admiration for England. I said at once – ‘Lei non può amare l’Inghilterra più che io amo ed ammiro l’Italia.’ The Manning-Parry barrister looked up with an air of slightly offended surprise. (Butler, Note-Books)

3.4 Overview of all parameters studied

Table 1 presents an overview of the social, textual and switching parameters studied, as well as an indication of whether previous monofactorial analysis (Nurmi et al. forthcoming) showed the parameters to be significant in terms of multilingual practices in our data. More detail on each parameter is found in sections 3.1–3.3.

Parameter type  Parameter                       Significant (yes/no)
Social          Age                             no
Social          Education, abroad               yes
Social          Education, grammar school       no
Social          Education, other formal         no
Social          Education, private tuition      no
Social          Education, university           yes
Social          Gender                          no
Social          Geographical mobility           yes
Social          Known language proficiency      no
Social          Occupation                      yes
Social          Place of birth                  no
Social          Social class                    no
Textual         Authorship, anonymous           no
Textual         Genre                           no
Textual         Readership, children            no
Textual         Readership, format              no
Textual         Readership, number of editions  no
Textual         Supergenre                      yes
Textual         Text location                   yes
Switching       Conventionalised                yes
Switching       Prefabricated                   yes
Switching       Free                            yes

Table 1. Overview of social, textual and switching parameters.

4. Evidence from decision trees

Although a decision tree can be built using any number of predictor variables, or factors, our relatively small corpus (in terms of the number of texts) does not allow for statistically meaningful analysis of a large set of variables: there are simply not enough items that would match any one multivariate combination. Thus, the beginning of the analysis required the identification of a small number of the most meaningful predictors. The factors were identified heuristically by running models with various combinations of the variables available. During this pruning process, as mentioned in section 3.1, variables such as the author’s gender, age and educational background other than university attendance were found to fall consistently below the significance threshold. This does not mean that these variables are of no interest, but they do not appear to serve as significant predictors when it comes to the use of foreign words. In the end, three factors were included in the partitioning: Supergenre (3 levels: Drama, Fiction, Non-fiction), University education (2 levels: Yes, No) and Occupation (5 levels: Academic, Cultured, Professional Writer, Other-Professional and Other-Miscellaneous). The small number of periodicals were removed from the dataset because they represent a highly mixed genre (fiction and non-fiction, including very short and fragmented texts from a wide variety of writers), leaving 329 items of the original 333 texts in CLMET3.0.

Figure 5 gives the first two splits of the dataset. Each box gives the count of items in that group, the mean frequency of the response variable (switches to a foreign language per 1,000 words), its standard deviation, and the LogWorth score of the next split. It is worth reiterating here that the process of splitting is entirely data driven, and that the choice to split the data is unsupervised. Because each split can only produce two new branches, when the factor in question has more than two levels left, one or both of the new branches will include more than one level.
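The constraint that every split is binary can be made concrete with a short sketch. The Python code below is purely illustrative and is not part of the software used in this study; the function name `binary_partitions` is our own. It enumerates the ways in which the levels of a categorical factor can be divided into two non-empty groups, which are the candidate splits a partitioning algorithm would then have to score.

```python
from itertools import combinations


def binary_partitions(levels):
    """Yield every division of the levels into two non-empty groups.

    Illustrative helper (hypothetical name): it only lists candidate
    splits and does not score them.
    """
    levels = list(levels)
    first, rest = levels[0], levels[1:]
    # Keeping the first level in the left group avoids listing each
    # division twice as a mirror image of itself.
    for size in range(len(rest) + 1):
        for extra in combinations(rest, size):
            left = (first, *extra)
            right = tuple(level for level in rest if level not in extra)
            if right:  # skip the case where no level is left for the right branch
                yield left, right


supergenre = ["Drama", "Fiction", "Non-fiction"]
occupation = ["Academic", "Cultured", "Professional Writer",
              "Other-Professional", "Other-Miscellaneous"]

for left, right in binary_partitions(supergenre):
    print(left, "|", right)

print(len(list(binary_partitions(occupation))))  # 15 candidate splits
```

A factor with k levels has 2^(k-1) - 1 such divisions, so the three Supergenres allow three candidate splits and the five Occupation levels allow fifteen.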

Reading from the top, the first split divides the corpus into two branches, placing Fiction and Drama texts into one group (left-side branch), and Non-fiction texts in the other (right-side branch). This means that of the three predictor variables included in the model, Supergenre is the best predictor of overall foreign content. The Fiction and Drama group gives a mean frequency of 0.10/1,000 words, while the Non-fiction group gives 0.19/1,000 words. The split is highly significant at a LogWorth score of 2.85. The second split takes place in the left-hand branch, splitting the Drama and Fiction branch based on university education: authors with no university education use switches at a frequency of 0.07/1,000 words, while those with university education do so at a frequency of 0.14/1,000 words. However, we note that the LogWorth score falls to 1.11, or an adjusted p-value of 0.06. The third split again takes place in the left-hand branch, splitting the non-university educated group between Cultured authors (on the right) and all other authors. The LogWorth score is now 0.91, equivalent to a high p-value of 0.12. Note that the second and third splits are the best splits available for the tree at that point. The lack of further splits in the right-hand branch means that the internal differences within the left-hand branch were greater than those produced by any possible split within the Non-fiction texts.
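The first two splits can also be read as a small set of rules. The following Python sketch is only an illustration of such a rule-based reading; the function name `predicted_frequency` is hypothetical, and the values are the group means reported above for Figure 5. The third split is left out because it is not statistically significant.

```python
def predicted_frequency(supergenre, university_educated):
    """Return the mean number of switches per 1,000 words for the group
    a text falls into, using the group means reported for Figure 5.

    Hypothetical helper for illustration only.
    """
    if supergenre == "Non-fiction":
        # The right-hand branch is not split any further.
        return 0.19
    if supergenre in ("Drama", "Fiction"):
        # The left-hand branch is split by university education.
        return 0.14 if university_educated else 0.07
    raise ValueError(f"unknown Supergenre: {supergenre!r}")


print(predicted_frequency("Fiction", university_educated=False))  # 0.07
print(predicted_frequency("Non-fiction", university_educated=True))  # 0.19
```

Note that the university parameter has no effect on the prediction for Non-fiction, which mirrors the absence of further splits in the right-hand branch.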

How do we interpret this result? The highly significant split between the Supergenres indicates that Non-fiction differs considerably from Drama and Fiction by featuring significantly more switches. The second split, which is a borderline case when it comes to statistical significance, shows that the author’s education, specifically university education, is the strongest predictor when it comes to Drama and Fiction. In other words, the author’s education matters more than whether they are writing Drama or Fiction, or what their specific occupation was. Beyond the second split, the differences between the predictors are not statistically significant. Although the third split may be interpreted as a suggestion of a possible trend, it is important to bear in mind that it should not be taken as reliable, and we include it here merely to demonstrate how splitting may be continued beyond the statistically significant level. In the figures that follow, we show statistically significant splits, together with splits whose LogWorth scores fall between 1.15 and 1.3 (p-values from 0.07 to 0.05), which we consider sufficiently indicative of a trend.
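The relationship between LogWorth scores and p-values can be checked with a few lines of code. The sketch below assumes the usual definition of LogWorth as the negative base-10 logarithm of the (adjusted) p-value; the function names are our own and the code is not part of the statistical software used for the analysis.

```python
import math


def logworth(p_value):
    """LogWorth of a p-value, assuming LogWorth = -log10(p)."""
    return -math.log10(p_value)


def p_from_logworth(score):
    """Inverse conversion: the p-value that corresponds to a LogWorth score."""
    return 10 ** -score


print(round(logworth(0.05), 2))         # 1.3, the conventional significance threshold
print(round(p_from_logworth(1.15), 3))  # 0.071, the lower bound used for trend-indicating splits
print(round(p_from_logworth(0.91), 3))  # 0.123, the third split in Figure 5
```

On this scale a higher score means stronger evidence: a LogWorth of 2 corresponds to p = 0.01 and a LogWorth of 3 to p = 0.001.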

Figure 5. Decision tree for all switched passages (N=329 items).

As noted, Figure 5 shows the partitioning results for the overall switches to foreign languages. Additional decision trees can be produced for specific response variables, such as specific languages or switch types, or their combinations. Figure 6 shows the decision tree for switches to Latin (see example 2 above), perhaps one of the most recognisable types of multilingual practices in educated written language.

Figure 6. Decision tree for all switches to Latin.

Figure 6 shows that although the first split is again by Supergenre (LogWorth 7.32), the second split is different when it comes to the two branches. Drama and Fiction are split according to the author’s occupation (LogWorth 1.6, p-value 0.02), while Non-fiction texts split into two according to university education (LogWorth 1.2, p-value 0.06). Predictably, the data shows that the highest frequency of Latin switches is found in Non-fiction books written by University-educated authors (standardised frequency 0.13/1,000) while the lowest is found in Drama and Fiction written by authors whose occupation is something other than Cultured. Once again we find that further splits are not statistically significant. When the process was repeated for French, Italian and German, no significant results appeared.

Greek, on the other hand, gave significant results. Like Latin, Greek shows an initial split between Drama and Fiction, and Non-fiction (LogWorth 1.79; Figure 7). While the second split is not statistically significant (LogWorth 1.03, p-value 0.09), we wanted to include the figure as an example of a case where a further split once again shows a significant difference (LogWorth 1.76, p-value 0.017). Although the low LogWorth of the second split means that the results should be taken as nothing more than an indication of a possible trend, the result of the partitioning agrees with conventional logic. In fact, looking at the standardised frequencies, we see that the Drama and Fiction by University-educated Cultured authors show a higher frequency of switching than we find in Non-fiction texts.

Figure 7. Decision tree for all switched passages to Greek.

We can use the same method when looking at other response variables. For example, Figure 8 gives the partitioning tree for Conventionalised switches (2,234 hits). Only the first split, by Occupation, is statistically significant (LogWorth 2.54). Academic and Cultured authors use Conventionalised switches roughly twice as often as Professional Writers, Other-Professional and Other-Miscellaneous authors.

Figure 8. Decision tree for Conventionalised switches regardless of language.

When we turn to the 3,199 Prefabricated switches (Figure 9), typically quotes and proverbs, we see that Supergenre is the strongest predictor: Prefabricated switches are nearly five times more common in Non-fiction than in Drama and Fiction. The very high LogWorth score confirms that this is a highly significant difference. Within Drama and Fiction, the author’s occupation is the next best predictor, with Cultured authors using Prefabricated switches nearly four times as often as the other groups.

Figure 9. Decision tree for Prefabricated switches regardless of language.

The least numerous type of switching is Free switching. The 832 hits do not give statistically significant differences.

5. Discussion

The analysis indicates that multilingual practices in Late Modern English printed books, written by English-speaking authors for English-speaking audiences, are a highly varied and complex phenomenon. Using the same three predictive factors, each language and different type of switched passage produces a somewhat distinctive decision tree. This is a significant finding for a number of reasons.

Firstly, the evidence suggests that the high or low frequency of switched passages cannot be predicted on the basis of any single variable. We cannot draw the simple conclusion that authors of a particular occupation, educational background, age or gender would have been consistently more or less likely to use foreign languages. What we can say is that multilingual practices were generally more common in Non-fiction than in Fiction or Drama, that Cultured authors usually stand out as the group most likely to have mastered foreign languages, and that University-educated authors are somewhat more likely to use foreign languages, especially the classical languages. These results are not notably different from those obtained through monofactorial analysis in Nurmi et al. (forthcoming).

The findings presented can be considered from the perspectives of writer resources, audience design and identity work. University education gave writers linguistic resources, most notably Latin, but also very often familiarity with contemporary European languages, particularly the lingua franca of the European upper classes, French. People with a university education could end up in all four of our occupational groups. Similarly, Cultured authors would have come into contact with Europe-wide cultural phenomena, and gained some competence in the most important languages in their field. French was the language of the high culture, but in music Italian was significant from the eighteenth century onwards, and even a cursory familiarity with Classical sources would have provided some familiarity with the shared European legacy of Greek and Roman authors.

Audience design has been established as one of the important influences in multilingual practices (see e.g. Nurmi & Pahta 2010). Writers are to some extent aware of their audience, and suit their writing to the intended readers. This may to a certain degree explain the difference between Non-fiction and Fiction and Drama. Fiction and Drama are more likely aimed at a very wide readership, including audience members with only a basic education. This holds even more strongly for the nineteenth century, when popular fiction became increasingly common. While Non-fiction in the corpus represents a wide variety of topics, much of it is aimed at a somewhat expert audience, or describes foreign countries near or far. In the case of Cultured authors, their audience expectations may have been different from those of other writers. Unlike Professional Writers, who typically wrote for a large audience, Cultured authors may well have expected their readers to resemble the authors in their linguistic profile. While the model used does take exceptional authors into account, it should be noted that some of the largest corpus files were produced by three Cultured authors, Edward Gibbon, Samuel Richardson and Horace Walpole. Richardson’s works are Fiction, but the other two are found in Non-fiction. Gibbon’s Decline and Fall of the Roman Empire in particular is riddled with foreign-language passages in a wide variety of languages, and Walpole’s letters are likewise rich in French and Latin elements.

As to identity work, the connection to university education seems fairly obvious. Educated people may well have expressed their identity as members of the educated elite (consciously or not) by inserting Latin passages in their writings. Similarly Cultured writers may have valued their own familiarity with foreign languages associated with cultural endeavours, and their ease of dropping French and Latin phrases into texts may have been a vital element of their identity as writers, a way of showing their own place in the elite of the Cultured, even if not in the financial or even in all cases the educated elite of the time.

6. Conclusion

The results of our study show that for most of the social and textual variables analysed a connection to multilingual practices could not be established. The role of the three Supergenres could be noted, and Non-fiction seems to include more switched passages than Fiction or Drama. This could be linked to audience design, as the readers of fiction can be seen as a more heterogeneous group with regard to their language skills than those of the non-fiction texts included in our data. In line with earlier findings (e.g. Nurmi & Pahta 2011, Pahta & Nurmi 2009), university education proved significant, particularly in connection with the use of Latin. This reflects the unequal access to education during the eighteenth and nineteenth centuries, when knowledge of Latin in particular was limited to the highly educated. Finally, the ways in which the group of writers we named Cultured display their knowledge of foreign languages in their texts may well be related to their identity work as members of the cultural elite of the time. Variables such as the author’s gender and age, on the other hand, did not prove significant in our study.

The results obtained through the application of decision trees to our data do not differ greatly from those gained through monofactorial analysis in Nurmi et al. (forthcoming). It would seem that for this type of data, either method can be regarded as reliable. There are, however, benefits in testing data with more than one approach. The methodological objective of this article was to argue that decision trees, or recursive partitioning models, are a valuable but under-used method for analysing multivariate linguistic data, but our results also indicate that the methods would most likely be more useful with larger datasets than what we had at our disposal here. While we do not suggest that recursive partitioning is a replacement for regression or mixed-effect modelling in all situations, we do argue that decision trees can offer a more intuitive and accessible way to understand complex data.


[1] The research discussed in this paper is a part of the Multilingual Practices in the History of Written English project (258434) funded by the Academy of Finland. The research reported here benefited from the work of junior researchers Anna Petäjäniemi, Jukka Tuominen and Veera Saarimäki.

[2] A CQP-ready version of CLMET3.0 was released in October 2015 by Hendrik De Smet, Susanne Flach and Jukka Tyrkkö. The new version of the corpus, CLMET3.1, also comes with a new cleaned-up version of part-of-speech tagging. Like CLMET3.0, the new corpus is freely available from Hendrik De Smet at https://perswww.kuleuven.be/~u0044428/.

[3] Multilingualiser was developed by Tyrkkö during the present research project. The software will be made freely available when the final testing phase is over. Multilingualiser will run on OS X, PC and Linux.

[4] The Classification and Regression Tree (CART) algorithm was introduced in Breiman et al. (1983). See also Neville (1999) and Lantz (2013).

[5] See also Cuyckens & d’Hoedt (2015).

[6] In earlier sociolinguistic studies, stepwise regressions were commonly carried out using the Varbrul method.

[7] With categorical variables, each of the possible values is called a level. Thus, for example, in our analysis the possible levels of the factor Occupation are Academic, Professional Writer, Cultured, Other-Professional and Other-Miscellaneous. The multiple comparison problem is a well-known issue in statistics that comes about when multiple factors and levels are considered at the same time. The more factors one includes in the model, the more likely it is that significant differences are observed purely by chance. There are various statistical methods to control the problem.

[8] The adjusted p-value was formulated empirically using a Monte Carlo calibration. The adjusted p-values produce better null-case distributions than the traditional unadjusted or Bonferroni-adjusted methods. For full details, see Sall (2002).

[9] The social parameters database has 343 lines. Basically, this means one line for each volume, but in case of multiple authors, each author was given their own line. There are several cases where more than one work from a single author has been included in the corpus, which means that one author may have more than one line. So, for example, the eight different lines for members of the aristocracy in the social class column refer to five men. Each line reflects the biographical data appropriate for the time of publishing that volume, and so the particulars of foreign travel, for example, might be different at different stages of a person’s life.


Cambridge University Library Catalogue. s.a. Available online at http://www.lib.cam.ac.uk/camlibraries/catalogues.html

CLMET3.0 = Corpus of Late Modern English Texts 3.0. Compiled by Hendrik De Smet, Hans-Jürgen Diller & Jukka Tyrkkö. See https://perswww.kuleuven.be/~u0044428/

English Short Title Catalogue. British Library. Available online at http://estc.bl.uk

Explore the British Library / Main Catalogue. s.a. Available online at http://explore.bl.uk/

Oxford Text Archive. Available online at https://ota.ox.ac.uk/

Oxford University Library Catalogue. s.a. Available online at http://solo.bodleian.ox.ac.uk/

Project Gutenberg. 2004–2017. Online eBook depository. Available online at http://www.gutenberg.org

Victorian Women Writers Project. 1995–2017. Available online at http://webapp1.dlib.indiana.edu/vwwp/welcome.do


Auer, Peter. 1998. “Bilingual conversation revisited”. Code-switching in Conversation. Language, Interaction and Identity, ed. by Peter Auer, 1–24. London: Routledge.

Breiman, L., J.H. Friedman, R.A. Olshen & C.J. Stone. 1983. Classification and Regression Trees. Belmont, CA: Wadsworth.

Cuyckens, Hubert & Frauke D’Hoedt. 2015. “Variability in clausal verb complementation: The case of admit”. Perspectives on Complementation. Structure, Variation and Boundaries, ed. by Mikko Höglund, Paul Rickman, Juhani Rudanko & Jukka Havu, 77–100. London: Palgrave Macmillan.

Franceschini, Rita. 1998. “Code-switching and the notion of code in linguistics”. Code-switching in Conversation. Language, Interaction and Identity, ed. by Peter Auer, 51–72. London: Routledge.

Gumperz, John J. 1982. Discourse Strategies. Cambridge: Cambridge University Press.

Lantz, Brett. 2013. Machine Learning with R. Birmingham: Packt Publishing.

Matthew, H.C.G. & Brian Harrison, eds. 2004. Oxford Dictionary of National Biography. Oxford: Oxford University Press.

Neville, Padraig. 1999. Decision Trees for Predictive Modelling. SAS Institute, Inc.

Nurmi, Arja. In preparation. “Multilingual practices in early modern letter writing manuals”.

Nurmi, Arja & Päivi Pahta. 2004. “Social stratification and patterns of code-switching in early English letters”. Multilingua 23: 417–456.

Nurmi, Arja & Päivi Pahta. 2010. “Preacher, scholar, brother and friend: Code-switching and social roles in the writings of Thomas Twining”. Social Roles and Language Practices in Late Modern English, ed. by Päivi Pahta et al., 135–162. Amsterdam & Philadelphia: John Benjamins.

Nurmi, Arja & Päivi Pahta. 2011. “Multilingual practices in women’s English correspondence 1400–1800”. Language Mixing and Code-Switching in Writing: Approaches to Mixed-Language Written Discourse (Routledge Critical Studies in Multilingualism 3), ed. by Mark Sebba, Shahrzad Mahootian & Carla Jonsson, 44–67. New York & London: Routledge.

Nurmi, Arja & Päivi Pahta. 2013. “Multilingual practices in the language of the law: Evidence from the Lampeter corpus”. Ex Philologia Lux: Essays in Honour of Leena Kahlas-Tarkka (Mémoires de la Société Néophilologique de Helsinki XC), ed. by Jukka Tyrkkö, Olga Timofeeva & Maria Salenius, 187–205. Helsinki: Société Néophilologique.

Nurmi, Arja, Jukka Tyrkkö, Anna Petäjäniemi & Päivi Pahta. Forthcoming. “The social embedding of multilingual practices in Late Modern English”. Multilingual Practices in Language History (Language Contact and Bilingualism 15), ed. by Päivi Pahta, Janne Skaffari & Laura Wright. Berlin: Mouton de Gruyter.

Pahta, Päivi & Arja Nurmi. 2006. “Code-switching in the Helsinki Corpus: A thousand years of multilingual practices”. Medieval English and its Heritage, ed. by Nikolaus Ritt et al., 203–220. Frankfurt: Peter Lang.

Pahta, Päivi & Arja Nurmi. 2009. “Negotiating interpersonal identities in writing: Code-switching practices in Charles Burney’s correspondence”. The Language of Daily Life in England (1400–1800), ed. by Arja Nurmi et al., 27–52. Amsterdam & Philadelphia: John Benjamins.

Pahta, Päivi & Arja Nurmi. 2011. “Multilingual discourse in the domain of religion in medieval and early modern England: A corpus approach to research on historical code-switching”. Code-switching in Early English (Topics in English Linguistics 76), ed. by Herbert Schendl & Laura Wright, 219–251. Berlin & Boston: Mouton de Gruyter.

Quinlan, J.R. 1993. C4.5: Programs for Machine Learning. San Mateo: Morgan Kaufmann.

Sall, John. 2002. “Monte Carlo calibration of distributions of partition statistics”. SAS Institute. http://www.jmp.com/content/dam/jmp/documents/en/white-papers/montecarlocal.pdf

Schendl, Herbert & Laura Wright, eds. 2011. Code-switching in Early English (Topics in English Linguistics 76). Berlin & Boston: Mouton de Gruyter.

Strobl, Carolin, James Malley & Gerhard Tutz. 2009. “An introduction to recursive partitioning: Rationale, application and characteristics of classification and regression trees, bagging and random forests”. Psychological Methods 14(4): 323–348.

Tyrkkö, Jukka, Arja Nurmi & Jukka Tuominen. Forthcoming. “Semi-automatic discovery of code-switching from English historical corpora: Methods and challenges”. Challenging the Myth of Monolingual Corpora (Language and Computers: Studies in Digital Linguistics), ed. by Arja Nurmi, Tanja Rütten & Päivi Pahta. Amsterdam: Brill.


Language chunks of two or more words in CLMET3.0

Language        Conventionalised  Free  Prefabricated  TOTAL
Arabic                         1     3             14     18
Buginese(?)                    -     -              1      1
Dutch                          -     -              3      3
Egyptian(?)                    -     -              2      2
French                       875   466            591   1932
Gaelic                         -     -              1      1
German                        16    17             77    110
Greek                          3    12            255    270
Hawaiian(?)                    -     -              1      1
Hindi                          -     1              -      1
Indonesian                     -     -              6      6
Irish                          -     6              -      6
Italian                       37    32            289    358
Latin                        683    71           1398   2152
Lenape(?)                      -     -              1      1
Malay                          -     -              1      1
Malay(?)                       -     -              1      1
Portuguese                     2     -             20     22
Romani / Caló                  1     1              6      8
Samoan                         -     -              1      1
Sinhalese                      -     -              1      1
Spanish                        3     1             21     25
Spanish(?)                     -     -              1      1
Total                       1621   610           2691   4922

University of Helsinki