Note that the figures also include the age of persons only mentioned during, but not necessarily present at, the trials. Nevertheless, the above gives a good impression of the situation: only from the 1790s onwards is age more systematically mentioned in the original Proceedings (and therefore tagged in the electronic version), and only from then on is the age structure of society more accurately reflected in the markup. The aim is to provide more age information for the time predating the 1790s during the annotation of the OBC.
The gender structure of the speakers in the Proceedings is indicated in Table 2 and Figure 2. Since speech passages and speakers are not identified in the original electronic version, an indirect approach via the sex of the defendants was taken here.
On average, 72.6% of the defendants are men, and it is expected that the final version of the OBC will contain roughly the same percentage of male speakers. It might be objected that a corpus for sociolinguistic research should aim for a balanced representation of the genders, but this is impossible with the Proceedings since the trials always involve males (judges, prosecutors, lawyers, etc. were exclusively male) but not necessarily females. In addition, the very high number of speakers ensures that even in decades where the percentage of women is low, as in the 1820s and 1830s, their absolute number is still in the thousands.
The Old Bailey Proceedings Online lists over 4,000 occupations and status labels of the participants in the trials at the Old Bailey, from accoutrement-maker to yeoman. These labels are given in their original spelling and are sometimes more detailed than necessary (for example, we find servants to blacksmiths, to gentlemen, to goldsmiths, to leather-sellers, to midwifes, to poulterers, to public houses, to washerwomen, etc.). After standardization of these labels, the actual number of occupations will be much lower and easier to handle for the end user.
The above-mentioned constraints limit the usefulness of the Old Bailey Proceedings Online for linguistic purposes and were one of the motivations for turning the Proceedings into a linguistic corpus. Since the Proceedings cannot be downloaded from the website in their entirety and individual trials are only displayed as raw text, without tags, a copy of the XML-tagged version was obtained from Tim Hitchcock and Robert Shoemaker. This version is currently being annotated by my team and me. The major task is to identify speech passages and link them to sociobiographical speaker parameters such as sex, age, or profession.
This section starts with a description of how spoken language is distributed in the Proceedings. This will be followed by two subsections assessing the reliability of these trial accounts as a linguistic source, first by considering external factors surrounding the genesis of these texts and by comparing a trial from the Proceedings to an alternative account (3.2), and second, by testing their internal linguistic consistency through a quantitative analysis of negative contraction (3.3).
Figure 3 shows the number of 1st and 2nd person singular and plural pronouns as a rough measure of the amount of direct speech reported in the first six decades of publication of the Proceedings. Figure 4 relates the number of pronouns to the total number of words by indicating mean frequencies of pronouns. The reason for relying on this indirect approach via pronouns is that the formal text-structuring conventions for marking direct speech varied a lot in the early years, which makes automatic tagging (see 4.2) almost impossible. The pronoun forms counted are I, my, mine, me, myself, you, your, yours, yourself, yourselves, thou, thy, thine, thee, thyself, we, ours, us and ourselves. Our was excluded because it frequently occurs in 'our Lord the King'. As there are a number of alternative versions of the Proceedings in the early years, only the longer version was included in the count.
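The pronoun-based proxy can be sketched in a few lines of Python. This is only a minimal illustration, not the procedure actually used for Figures 3 and 4: the tokenizer, the function name `pronoun_stats` and the normalisation per 1,000 words are assumptions made for the example, while the pronoun list is the one given above.

```python
import re

# Pronoun forms listed in the text; "our" is left out because of
# the formula "our Lord the King".
PRONOUNS = {
    "i", "my", "mine", "me", "myself",
    "you", "your", "yours", "yourself", "yourselves",
    "thou", "thy", "thine", "thee", "thyself",
    "we", "ours", "us", "ourselves",
}

def pronoun_stats(text):
    """Return (pronoun count, word count, pronouns per 1,000 words)."""
    # Crude, assumed tokenizer: every run of letters counts as one word,
    # so clitic forms such as "I'll" are split into "I" and "ll".
    tokens = re.findall(r"[A-Za-z]+", text)
    hits = sum(1 for tok in tokens if tok.lower() in PRONOUNS)
    per_thousand = 1000 * hits / len(tokens) if tokens else 0.0
    return hits, len(tokens), per_thousand

# Wording taken from the excerpt of 16781211-23 quoted in this section.
sample = ("he heard the Boy say, I will run my Spit in some of your guts; "
          "and heard some body cry out, I am killed")
print(pronoun_stats(sample))
```

A count of this kind cannot tell courtroom speech from speech embedded in third-person narration, which is why the resulting figures are only a rough measure.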
The figures show that direct speech became more common only in the 1720s, although there is some measure of spoken language even in earlier trial accounts, particularly in the 1674-1679 and 1692-1695 periods. The comparatively high amount of direct speech in 1678, 1692 and 1706 is due to individual Proceedings, 16781211, 16920406 and 17061206, which report considerably more spoken language than the other proceedings in those years. A closer look at these pre-1734 Proceedings reveals that a good part of the direct speech was not originally uttered in court but is actually embedded in 3rd person narration. That is, the spoken language reported in these early accounts is not that of plaintiffs, defendants or other participants in the lawsuit but that of a third party, as illustrated by the following excerpt:
Watson for himself said, That being ordered by the Plaintiff to Arrest Dorothy Midgley, when he came to the door, he heard the Boy say, I will run my Spit in some of your guts; but putting him aside, he Arrested his Prisoner, and heard some body cry out, I am killed; upon which he run to him … (16781211-23, my emphasis)
These spoken passages are generally short and there is little information on the sociobiography of the speakers. However, the major limitation of their usefulness is the fact that there is a considerable time lapse (weeks or even months) between the original speech event and its recording. The reliability of the data is further diminished due to the intermediary role of the person reporting the utterance in question, who is the immediate source for what the scribe takes down.
Figure 5 gives the total number of words per decade, as well as the proportion of direct speech from 1734 onwards:
From the 1730s onward, a relatively high proportion (almost 85%) of the Proceedings is made up of spoken language. The Proceedings therefore constitute a rich source of data for the study of speech in the 18th and 19th centuries.
It has been argued that from a historian's point of view the material reported in the Proceedings is rather accurate:
Although initially aimed at a popular rather than a legal audience the material reported was neither invented nor significantly distorted. The Old Bailey Courthouse was a public place, with numerous spectators, and the reputation of the Proceedings would have quickly suffered if the accounts had been unreliable. Their authenticity was one of their strongest selling points, and a comparison of the text with other manuscript and published accounts of the same trials confirms that they accurately report what was said in court. (Hitchcock & Shoemaker 2007b)
But Hitchcock & Shoemaker go on to caution that a comparison with alternative accounts of the same trials shows that the Proceedings are not complete (though often the most comprehensive account) and that even the most detailed later Proceedings are only partial transcripts of what was said: 'At the very least, in an attempt to save space, minor details and repetitions, perceived as unimportant, were frequently left out of recorded testimony' (Hitchcock & Shoemaker 2007b). In spite of this, and in the absence of better data, the records of the trials at the Old Bailey are arguably as near as we can get to the spoken word of the 18th and early 19th centuries.
As shown above, about 85% of the text in the Proceedings from the 1730s onwards is direct speech. For a linguist trying to reconstruct the speech of the period, an important development in the Proceedings is the switch from third-person to first-person accounts in the 1710s. The early Proceedings tended to give more or less judgmental — and sometimes sensationalist — accounts of the 'most notable trials', as in the trial of Elizabeth Scot for theft on 16 January 1682:
The detail reported for individual trials increased considerably in the 18th century, when scribes reported witness testimonies, statements and arguments of the prosecution and the defence, cross-examinations, etc. Compare Extract 1 with the following extract from the trial of Elizabeth Whitney on 27 February 1740, which includes monologues as well as shorter question-answer exchanges and amounts to over 1,500 words:
Historical reliability as described by Hitchcock & Shoemaker 2007b is not the same as linguistic reliability. The omission or misrepresentation of factual detail in a historical document does not necessarily mean that the spoken language reported in that same document is unreliable. As an example, I will consider the recording of non-standard features, which could be taken as an indication of linguistic faithfulness. The Proceedings are generally written in standard orthography, but sometimes we find non-standard pronunciation (and morpho-syntax) in individual speakers, such as in the following deposition by an Irishman:
James Fitzgerald depos'd to this Effect: On the 25th of February last, about 11 at Night, O' my Shoul, I wash got pretty drunk, and wash going very shoberly along the Old-Baily, and there I met the Preeshoner upon the Bar, as she wash going before me. I wash after asking her which Way she wash walking: And she made a Laugh upon my Faush, and told me to Newtoner's-Lane. […] (17250407-66)
Non-standard phonological and morpho-syntactic detail of this kind is often found in the speech of Irishmen and other foreigners. A certain degree of stereotyping for comic effect on the part of the scribe cannot completely be ruled out, especially if the speaker in question dominates the trial in terms of length of utterance, as in this case. Incidentally, the publisher 'earned a censure from the City authorities for the 'lewd and indecent manner' in which the trial was reported' (Hitchcock & Shoemaker 2007a, Shoemaker forthcoming), which is an indication of the control that the City exerted not only on what was reported but also on the language in which it was reported (see also 3.3.1).
Sometimes, however, non-standard passages are embedded in otherwise completely serious discourse, with no indication of any comic intentions, as in the testimony of Osborn Jones, possibly a Welshman:
I came home to Tinner ant wass coing into my own Room, put the Prissoner's Wife callt to me and sait Here iss your coot Oman. So I hust her a pit, and ask her why a Tiffel she coudn't keep in her own Hapitation when I wanted my Tinner. So the Prissoners Wife prought out a pag with a crate teal of coolt in it. There was a crate many Pieces, a crate teal pigger as Guineas. […] (17350522-1)
The recording of non-standard features seems to be rather unbiased here and the non-standard spelling faithfully indicates a typical feature of Welsh English, the strong aspiration of voiced plosives (therefore perceived as voiceless by speakers of English English, see e.g. Thomas 1994: 122-123).
Nevertheless, even if the Proceedings were a 100% accurate record of the historical facts (which they are not), this would not automatically mean that the direct speech passages are a completely faithful picture of what was said in court. Written representations of spoken language can be several steps removed from the actual speech act and it is the task of the linguist to reconstruct the original speech event on the basis of the written text. This is what Schneider (2002: 68) calls the Principle of Filter Removal:
a written record of a speech event stands like a filter between the words as spoken and the analyst. As the linguist is interested in the speech event itself (and, ultimately, the principles of language variation and change behind it), a primary task will be to 'remove the filter' as far as possible, i.e. to assess the nature of the recording process in all possible and relevant ways and to evaluate and take into account its likely impact on the relationship between the speech event and the record, to reconstruct the speech event itself, as accurately as possible.
After a categorization of text types and their proximity to speech, Schneider (2002: 73) goes on to say that '[d]irect transcripts are clearly the most reliable and potentially the most interesting among all these text types' and names trial proceedings as characteristic examples of this category. Still, it is clear that as written records, even trial proceedings cannot be a completely faithful representation of the speech event and have to be handled with care. In addition to a consideration of the recording conditions, Schneider (2002: 86) lists internal consistency and external fit as important criteria for assessing the validity of written texts representing spoken language: internal consistency refers to the consistent portrayal of variable features across large corpora, ideally deriving from several sources (e.g. different authors), while external fit measures the degree to which results of analyses based on a specific corpus agree with findings of other studies. Culpeper & Kytö (2000) compare four 17th-century speech-related text-types (witness depositions, trial proceedings, prose fiction, and comedies) with the aim of establishing how true they are to the original speech event. Based on the criteria of lexical repetitions, turn-taking features, and single-word interactive features (e.g. demonstrative pronouns), they conclude that 'there is a strong case for drama, but that there is also a case for trial proceedings' (2000: 195).
Kytö & Walker (2003) assess the faithfulness of trial proceedings and witness depositions in representing authentic speech. Although both are purportedly verbatim texts (or at least conventionally assumed to be such), 'one could not expect the same standard of accuracy in quoting spoken interaction as one would when quoting a written text' (224). In addition, they caution that '[e]ven with the most faithful of records, it is to be expected that certain typical features of speech such as false starts, pauses, slips of the tongue, and the like would be filtered out …' (225). This is certainly true for the Old Bailey Proceedings, which for the most part lack some of the non-fluency characteristics of unscripted spoken language, such as hesitations (uhm, er, etc.), unfinished sentences, repetitions, etc. Using sources like trial transcripts one has to bear in mind that the primary aim of the scribe was not to record linguistic detail but the substance of the trial.
Just as Schneider (2002), Kytö & Walker (2003: 228) acknowledge that 'written records of a speech event are susceptible to interference — whether conscious or inadvertent — throughout the production process'. With this in mind, I will now attempt to assess the faithfulness of the spoken language in the Proceedings. Following the agenda set up by Schneider (2002: 86), I will do this by discussing the recording conditions, external fit, and internal consistency.
From the original speech event during a trial at the Old Bailey to the printed Proceedings, we can distinguish at least five consecutive stages (where t = time):
Each of these stages could potentially have altered the linguistic material of the utterance. At present it is still unclear whether the Proceedings actually went through t3 and t4 — it is imaginable, though rather unlikely, that typesetters worked directly from the shorthand manuscript. Be that as it may, we have to remove several layers of filters, imposed by the scribes (first while taking the shorthand notes in court and later when expanding them for the publisher), by the proofreaders as well as printers, by the typesetters and by the publishers (who, in addition to their own idiosyncrasies, might impose a house style).
The accounts were published just a couple of weeks after the trials. For example, the Proceedings of the sessions on 11 and 12 December 1678 were licensed for publication a mere week later, on 18 December. This practice of rapid publication continued with the much longer later Proceedings (cf. e.g. 18281204, which were published before the end of the year). In fact, once the Proceedings came to be regarded as an official record, the city took an interest in ensuring speedy publication, as can be seen on the title page of the Proceedings published in December 1775:
At a Common Council holden in the Chamber of the Guildhall of the City of London on Friday the 17th of November 1775, A MOTION was made and QUESTION put, That the whole Proceedings on the King's Commission of the Peace, Oyer and Terminer, and Gaol Delivery for the City of London, and also the Gaol Delivery for the County of Middlesex, held at Justice Hall in the Old Bailey, be regularly, as soon as possible after every Session, published by the Recorder, and authenticated with his Name: The same was resolved in the Affirmative. (17751206, my emphasis)
Schneider (2002: 72) mentions 'the temporal distance between the speech event itself and the time of recording' as one parameter influencing the accuracy of the written record as a representation of the original utterance: the longer the interval between the two, the higher the risk of misremembrance. In the case of the Proceedings, t1 and t2 are near simultaneous (the scribe took notes during the utterance) and t3-t5 followed shortly after, i.e. the time factor does not pose much of a problem here.
What seems potentially more problematic is the recording technique used at t2: not all techniques (mechanical recording, shorthand, longhand etc.) are equally suitable to record linguistic detail. Kytö & Walker (2003: 228) mention that one of the factors influencing the reliability of a written record in terms of its faithfulness to the speech event is the script (notes or shorthand) used by the scribe. The (somewhat idealizing) implication is that shorthand is more reliable than notes because the latter are by nature sketchy and would have to be expanded later, relying more or less heavily on the memory of the scribe, while the former records the totality of the event in situ.
From at least 1749 onwards, but probably from the very beginning in the 1670s, the proceedings at the Old Bailey were recorded in shorthand. A thorough analysis of 18th-century shorthand practices and their influence on the linguistic reliability of the Old Bailey Proceedings would go beyond the scope of this paper, but a brief overview of the possibilities and limitations of stenography with regard to the faithful representation of the original speech event will show the important consequences that the script has for the preservation of linguistic detail.
One of the more influential and popular 18th-century shorthand systems was developed by Thomas Gurney, the scribe who took down the Proceedings from at least 1749 to his death in 1770. Gurney's Brachygraphy or short-writing first appeared in 1752, ran through twelve editions in the 18th century and was reprinted several times in the 19th century. If we assume that in recording the trials Thomas Gurney, and later his son Joseph, who succeeded his father in 1770, used a shorthand system identical or similar to that described in Brachygraphy, then a closer inspection of this system may reveal important clues as to the linguistic reliability of the Proceedings. I will start with a brief characterization of the script and then focus on the implications for the rendering of spoken language in the Proceedings.
Gurney's (1752: 3) avowed objective was to enable the shorthand writer 'to take a Speech, or Sermon verbatim, as a Person talks in common'. His script consists of an 'alphabet' of invented symbols for consonants and vowels but has some characteristics of a consonantal writing system in that vowels can be left out. For example, he transcribes <lmntsn> 'lamentation' or <msngr> 'messenger' (p. 11). In spite of this, vowels are often indicated by diacritics (through dots or the vertical position of the following consonant). High-frequency words 'such as Prepositions, & terminations' are represented by 'arbitrary Characters' (3). These logographic elements are mostly derived from symbols of the basic alphabet and extended iconically (e.g. 'little' vs. 'large', both from the symbol for l, p. 12) and can represent several words (for instance, a dot (.) can mean 'they, thee, the, thy, of', p. 14). In principle, however, Gurney's shorthand follows conventional orthography, as illustrated by his transcription of loan, which indicates both <o> and <a> (p. 13):
However, the script also has some phonological traits in that e.g. <gh> in brought is omitted (p. 13):
In a similar manner, law is rendered as ˙ (l-a, p. 27).
The following example illustrates the phonological principle in that the <e> in single and line is omitted. It also contains orthographic elements in that (1) – stands for <in> and thus represents both [In] (in single) and [aIn] (in line) and (2) the velar nasal in single is expressed by two symbols, – and . 
I will now consider some implications of Gurney's stenographic script for the faithfulness of his trial transcripts. As mentioned before, this is meant as a first approach to the problem, not as an exhaustive analysis.
The symbols introduced in Gurney's chapter on 'Persons Moods & Tenses' (19-22) do not distinguish between inflected and uninflected auxiliaries (one and the same symbol stands for 'may' or 'mayst', another for 'can' or 'canst', another for 'should' or 'shouldst', etc., p. 18-20). One could argue that this renders the Proceedings less reliable as far as the inflection of verbs in the 2sg present tense is concerned, but apart from the fact that the context would disambiguate the possible readings (you may but thou mayst), we know that by 1700 thou and the appropriate -est inflection had undergone functional contraction and were largely restricted to dialects, biblical and archaic language, and the speech of Quakers (Görlach 1991: 85, 88). However, even if 2sg verbal inflection is only marginally relevant in the 18th century, the foregoing is an indication that even shorthand could not have been absolutely accurate in the recording of details like inflections. A further example is provided by the symbol for the indefinite article, a dot placed on the top left of the noun phrase, which stands for the two allomorphs a and an (p. 16). A study on variation in this area (e.g. a ~ an before nouns starting h-) has to proceed very carefully indeed since the shorthand manuscript would not have distinguished between a and an, using a simple dot in both cases. Only when expanding his shorthand notes for the typesetter would the scribe choose a particular allomorph. That is, the form of the indefinite article that we find on the printed page of the Proceedings depended in large part on the scribe's memory. With high-frequency items like inflections or articles it is very unlikely that the scribe would have remembered the exact variant used in every single instance, even after only a couple of hours.
This is a rather sobering finding, given that shorthand-based recordings of spoken language have so far been accepted as relatively faithful in the literature (see above). Nevertheless, Gurney's brachygraphy does record other details, including features of spoken language like auxiliary contractions: 'you will' (you w-il) vs. 'you'll' (you-l, p. 27). But even here there are difficulties, e.g. when we turn to proclitic 't (< it). Gurney's shorthand representation of 'twill (< it will) is ambiguous since the symbol for t is also used as an abbreviation for it (p. 11): Gurney transcribes orthographic 'twill ׀ (p. 26-27), with a space between t/it and will (similarly 'tmay ׀ ˙). Because of this space there is no way of knowing, on the basis of the shorthand manuscript alone, whether ׀ represents the full form it or proclitic 't.
The foregoing remarks demonstrate the need for a study establishing whether results based on an analysis of features that can be unambiguously encoded in the shorthand script (e.g. contraction of will/shall) are more reliable than results based on features where the shorthand system is ambiguous (e.g. it ~ 't). In any case, linguists analyzing spoken language recorded in stenographic writing would do well to familiarize themselves with shorthand practices of the period to assess the reliability of their material.
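The loss of information described here amounts to a many-to-one mapping from words to signs, which a toy example may make concrete. The sketch below is purely illustrative: the labels `DOT`, `MAY`, `CAN` and `SHOULD` are placeholder names invented for the example and are not Gurney's actual strokes, and the function `expansions` is hypothetical.

```python
# Placeholder labels stand in for Gurney's strokes; they are NOT his symbols.
SHORTHAND = {
    "a": "DOT", "an": "DOT",                  # one dot for both allomorphs (p. 16)
    "may": "MAY", "mayst": "MAY",             # inflected and uninflected forms
    "can": "CAN", "canst": "CAN",             # share a single sign
    "should": "SHOULD", "shouldst": "SHOULD",
}

def expansions(symbol):
    """All longhand words a scribe could choose when expanding one sign."""
    return sorted(word for word, sign in SHORTHAND.items() if sign == symbol)

# The sign written for "an" can be expanded to either allomorph.
print(expansions(SHORTHAND["an"]))
```

Whenever `expansions` returns more than one word, the printed form reflects a choice made at the expansion stage rather than something fixed in the shorthand notes.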
Unfortunately, the examples in Gurney's manual do not contain instances of negative contractions like can't or doesn't, the feature analyzed in the following sections. Gurney's chapter on 'The Negative not ¬ ' (p. 23) only transcribes the expanded forms would not, cannot, shall not, must not, might not, may not, ought not, and was not. Therefore, although it is theoretically possible to represent negative contractions in Gurney's system, it is impossible to say whether Gurney would have differentiated between cannot and can't, etc. This can only be checked by comparing an original shorthand manuscript with the printed version of the Proceedings, but so far, such manuscripts have not come to light (Tim Hitchcock, p.c. 2007-02-20).
What we do have, however, is interesting evidence concerning the working methods of the scribes in the trial accounts themselves. In the second trial of Elizabeth Canning, Thomas Gurney was asked to report to the court what he had taken down during the first trial.  The attorney then proceeded:
Note Gurney's use of 'substance' rather than something like 'her very words' (cf. also 17560528-45). Later in the trial, Gurney was asked to compare the testimony of Elizabeth Canning’s mother with what she had said in the first trial:
Again, Gurney made it clear that while he strove to be faithful to the spoken word, this was not always possible or even desirable (on the role of the scribe, cf. also Shoemaker forthcoming). In a 1758 trial, he was asked to recount the statement of a foreigner, which he did in Standard English, adding that 'I took that to be his meaning which I have printed, he speaking as most of the foreign Jews do, a sort of broken English', making it clear that there was a linguistic difference between the actual speech act and its representation in the Proceedings (17580113-30).
The left column of the text in Appendix A shows an extract from the trial of John Ayliffe, 17591024-27, and an alternative account of the same trial, entitled The tryal at large of John Ayliffe (henceforth referred to as Tryal) in the right column. One can see at first glance that the account in the Proceedings is considerably shorter than the alternative Tryal (718 as opposed to 1,290 words). It is interesting that both versions were 'Printed, and sold by M. Cooper at the Globe in Pater-noster-Row', so one would suppose that Cooper would either have produced a longer and a condensed version from the same manuscript or produced the longer Tryal first and abridged it for inclusion in the Old Bailey Proceedings. However, things are not that straightforward. There is some overlap between the two versions, as in e.g. lines 3, 30, 52:
Sometimes the Proceedings simply omit some text of the longer version, either complete speech acts as in lines 21 or 42:
or parts of a speech act, e.g. line 48:
But there are also more serious differences between the two versions:
The last point in particular casts doubt on Kytö & Walker's (2003: 234) statement that '[w]hat a 'faithful' or 'verbatim' record is generally expected to convey, to a large extent, is the lexical items and grammatical structures'. The differences between the two versions suggest that they come from two different scribes rather than being an abridged and an expanded version based on the same manuscript. What this shows us yet again is that the Proceedings (just like other early trial accounts) cannot naïvely be taken to contain truly verbatim accounts of the trials at the Old Bailey, even though they were taken down in shorthand. At the same time, however, they are not automatically less reliable than other accounts.
3.3 Testing the reliability of the Old Bailey Corpus: a quantitative analysis of negative contraction
In the following I will test the reliability of the Proceedings as a source of spoken language on the basis of the variation between contracted and uncontracted forms of negated auxiliaries, such as do not vs. don't. The choice of negative contraction as a diagnostic feature for the linguistic reliability of a text representing speech is motivated by the fact that negative contraction is an established characteristic of present-day spoken language (Greenbaum & Nelson 2002: 211; Mazzon 2004: 105), including legal English. For example, in the 13 courtroom texts of spoken English (127,474 words) included in the web-version of the British National Corpus, contracted forms account for 72.4% of all negated auxiliaries (807/1,115), while in the corresponding written category (non-academic political, legal and educational texts; 4,477,831 words in 93 texts) they make up just 15.3% (1,808/11,852), and many of these actually occur in quotes of spoken language. Given the tendency of contracted forms to predominate in today's spoken English, we proceed on the hypothesis that negative contraction is more frequent in the spoken passages than in the prose text of the Proceedings.
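The two BNC percentages follow directly from the counts quoted above, as the following short check shows. The function name `contraction_rate` is introduced here for illustration only; the four counts are those given in the text.

```python
def contraction_rate(contracted, negated_total):
    """Share of contracted forms among all negated auxiliaries, in percent."""
    return 100 * contracted / negated_total

# Counts quoted in the text for the web version of the BNC.
spoken = contraction_rate(807, 1115)      # 13 courtroom texts (speech)
written = contraction_rate(1808, 11852)   # political, legal and educational writing

print(f"spoken: {spoken:.1f}%  written: {written:.1f}%")
# prints "spoken: 72.4%  written: 15.3%"
```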
Tables 3 to 6 show the distribution by decade of contracted and uncontracted negation involving auxiliaries in the prose and speech passages of the Old Bailey Corpus (OBC) from 1732 to 1834.  The tables subsume orthographical variants under the forms indicated in the first column. Thus haven't includes haven't, ha'n't, han't, and so on. Tables 3 and 4 are based on the speech passages, Table 5 on the prose passages, and Table 6 presents a summary:
In the spoken passages from 1732 to 1834, there are over 20,000 instances of contracted auxiliaries; in other words, 6.4% of all negated auxiliaries show contracted n't in the speech passages. Most of the tokens are accounted for by don't, can't, and won't (there are only 18 tokens of aren't, 22 of haven't, and 109 of shan't). By contrast, in the prose passages of the same period there are only five contracted forms in total, less than 0.1% of all negated auxiliaries.
The first solid conclusion we can draw from this is that there is a significant difference in the distribution of contracted and uncontracted negative auxiliaries in the OBC, with the former being almost exclusively confined to spoken language. The corpus therefore reflects the characteristics of spoken language, but it remains to be shown to what extent this distributional pattern mirrors the actual spoken language of the period. In comparison to the BNC's 72.4% contraction rate in spoken legal English, the Old Bailey's 6.4% seems rather low. Several factors can account for this discrepancy:
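How counts of this kind might be obtained from running text can be sketched as follows. This is a minimal illustration under stated assumptions, not the extraction procedure used for the OBC: the regular expressions, the auxiliary list `AUX` and the function names are hypothetical, and of the spelling variants only those of haven't are taken from the text (sha'n't is an assumed extra variant).

```python
import re
from collections import Counter

# Spelling variants subsumed under one form. The haven't variants are named
# in the text; "sha'n't" is an assumed variant added for illustration.
VARIANTS = {
    "don't": "don't", "can't": "can't", "won't": "won't", "aren't": "aren't",
    "haven't": "haven't", "ha'n't": "haven't", "han't": "haven't",
    "shan't": "shan't", "sha'n't": "shan't",
}

# Assumed, non-exhaustive list of auxiliaries that can precede "not".
AUX = ("do|does|did|can|could|will|would|shall|should|have|has|had|"
       "am|is|are|was|were|may|might|must")
UNCONTRACTED = re.compile(rf"\b(?:cannot|(?:{AUX})\s+not)\b", re.IGNORECASE)
CLITIC_WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)+")

def negation_counts(text):
    """Return (Counter of normalised contracted forms, number of 'AUX not')."""
    contracted = Counter()
    for word in CLITIC_WORD.findall(text):
        form = VARIANTS.get(word.lower())
        if form is not None:
            contracted[form] += 1
    return contracted, len(UNCONTRACTED.findall(text))

def contraction_share(text):
    """Contracted forms as a percentage of all negated auxiliaries found."""
    contracted, uncontracted = negation_counts(text)
    n_contracted = sum(contracted.values())
    total = n_contracted + uncontracted
    return 100 * n_contracted / total if total else 0.0

# Invented sample sentence, not a quotation from the Proceedings.
sample = "I don't know him. I ha'n't seen her, and I cannot tell. I did not go."
print(negation_counts(sample), contraction_share(sample))
```

Note that the uncontracted pattern deliberately includes past forms such as did not, since the overall percentage is calculated against all negated auxiliaries.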
To test whether the seemingly low ratio of negative contraction in the OBC results from its possibly unfaithful representation of the characteristics of spoken language, I will first compare the picture we find in the Proceedings with what we know about the general development of negative contraction in the history of English. Contraction of not emerged in the 17th century, maybe as early as 1600 in speech and in the second half of the century in writing (Barber 1997: 180; Brainerd 1989; Strang 1970: 151; Warner 1993: 208). Lass (1999: 180) notes that '[c]litic spellings are uncommon until the 1660s; they are frequent in Restoration comedy, and by the early eighteenth century seem to be the norm in speech'. It is not clear on what evidence Lass bases this last claim, but given that negative contraction got more common in writing only towards the end of the 17th century, the situation presented by the Proceedings does not seem too far off the mark.
A clearer picture is afforded by comparing negative contraction in the OBC with that in another corpus of spoken English, the Corpus of English Dialogues 1560-1760 (CED). The CED includes five genres: trials, witness depositions, drama comedy, didactic works, and prose fiction (for a full documentation of the CED see Kytö & Walker 2006). Table 7 shows negative contraction in the CED trial texts from 1560 to 1760. For comparison, the table also includes n't-forms in the OBC for the period of overlap with the CED, 1732-1759.
The CED corroborates the claim in the literature that negative contraction started in the first half of the 17th century but became frequent only in the last decades of that century. Comparing the last sub-period of the CED (1720-1760) with the first sub-period of the OBC considered here (1732-1759), a Chi-square test shows a significant difference only for can't/cannot (p ≤ 0.01). The differences involving the other auxiliaries are not significant (don't/do not: p ≤ 0.2, shan't/shall not: p ≤ 1, won't/will not: p ≤ 1). In other words, for the period of overlap, the rate of negative contraction is rather similar in the two corpora. Even for the significant difference with regard to can't, there is only a 12.7% gap between the CED and the OBC. There are two further parallels: (1) both corpora show an absence of negative contraction with past forms of auxiliaries,  and (2) for the period of overlap, the OBC shows negative contraction with the same auxiliaries that attract n't in the CED.  Thus, at least as far as the period of overlap is concerned, the distribution of contracted vs. uncontracted negatives in the speech passages of the OBC is similar to that of a sampled corpus, the CED. The OBC can therefore be taken to be just as representative of spoken language as other trial texts.
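The kind of pairwise comparison reported above can be illustrated with a minimal Python sketch. It assumes a plain Pearson Chi-square test on a 2×2 table (corpus × contracted/uncontracted) without continuity correction, and it reads off the significance band from the standard critical values for one degree of freedom. Whether the tests in this article were computed in exactly this way is an assumption, and the counts in the example are hypothetical, not the figures of Table 7.

```python
# Standard critical values of the chi-square distribution for one degree
# of freedom (upper-tail probability, critical value), strictest first.
CRITICAL_DF1 = [
    (0.001, 10.828),
    (0.01, 6.635),
    (0.025, 5.024),
    (0.05, 3.841),
    (0.2, 1.642),
]


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the table
    [[a, b], [c, d]], e.g. rows = corpora, columns = contracted/uncontracted."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one token")
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def p_band(chi2):
    """Return the strictest tabulated level that the statistic reaches,
    or 1.0 if it does not even reach the 0.2 level."""
    for level, critical in CRITICAL_DF1:
        if chi2 >= critical:
            return level
    return 1.0


# Hypothetical counts for illustration only (NOT taken from Table 7):
# corpus A: 30 can't vs. 170 cannot; corpus B: 60 can't vs. 140 cannot.
chi2 = chi_square_2x2(30, 170, 60, 140)
print(round(chi2, 2), "p <=", p_band(chi2))
```

Reporting bands such as p ≤ 0.2 or p ≤ 1 rather than exact probabilities corresponds to looking the statistic up in a printed table, which is what `p_band` imitates.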
The comparatively low rate of 6.4% negative contraction in the OBC is due to several factors: first of all, negative contraction is only attested with non-past auxiliary forms, while the percentage was calculated on the basis of non-past and past expanded forms. Secondly, in the OBC, n't attaches only to six auxiliaries and not to the others (n't has a larger range today). If one factors out past forms and those auxiliaries that do not attract enclitic negatives, then the picture looks more familiar from the perspective of present-day English (though still not identical, which one would not expect it to be, given the time difference of ±200 years). Note that a strictly formal approach has been taken in Table 8: starting from negative contraction, only those uncontracted auxiliary forms are included that match the contracted counterpart.
Therefore, since be shows negative contraction only in the form aren't (there are no tokens of isn't), the figures for uncontracted 'are not', for example, exclude 'am not', 'is not', 'was not', and 'were not'. Similarly, 'do not' excludes 'does not' and 'did not', and 'have not' excludes 'has not' and 'had not':
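The strictly formal matching described here can be sketched in a few lines of Python. The sketch pairs each of the six contracted forms with its formally matching uncontracted counterpart only, counts cannot (in one or two words) as uncontracted, and returns a percentage per auxiliary. The function names, the use of regular expressions, and the restriction to straight apostrophes are assumptions made for illustration; the sample sentence in the usage note is invented.

```python
import re

# Contracted form -> regex for its formally matching uncontracted
# counterpart. 'cannot' is counted as uncontracted, as in this article.
PAIRS = {
    "can't": r"can\s*not",
    "don't": r"do\s+not",
    "ha'n't": r"have\s+not",
    "shan't": r"shall\s+not",
    "won't": r"will\s+not",
    "aren't": r"are\s+not",
}


def count(pattern, text):
    """Count case-insensitive matches that are not part of a longer word."""
    bounded = r"(?<![\w'])" + pattern + r"(?![\w'])"
    return len(re.findall(bounded, text, re.IGNORECASE))


def contraction_rates(text):
    """Map each contracted form to (contracted, uncontracted, percentage);
    the percentage is None when neither variant occurs."""
    rates = {}
    for contracted, expanded in PAIRS.items():
        c = count(re.escape(contracted), text)
        u = count(expanded, text)
        total = c + u
        rates[contracted] = (c, u, 100.0 * c / total if total else None)
    return rates
```

Applied to an invented passage such as "I can't say. I cannot tell. I do not know him. He does not know. I don't know.", the function counts one contracted and one uncontracted token each for can and do, while 'does not' is left out because it has no contracted counterpart in the data.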
Figure 6 is based on Table 8 and shows the percentage of the contracted negatives can't, don't, ha'n't, shan't, won't, and aren't vs. their uncontracted counterparts in the speech passages. Further studies will have to show whether other corpora show a similar kind of behaviour with regard to negative contraction in the 18th century.
The decline in negative contraction may also be a function of the increasing control that the City authorities exerted over the Proceedings in the course of the 18th century. Robert Shoemaker (p.c. 2007-05-20) suggests that 'the character and audience of the Proceedings changed significantly between 1720 and 1778, when they entered the period of close City control. As they became longer and more expensive during this period, the language became more respectable'.
Figure 6 shows a strong fluctuation of contraction rates, especially as far as don't is concerned. To a certain degree this is the result of the small intervals chosen here (decades), but there may also be other reasons: as mentioned before (Section 4), there are several layers of filters that stand between the speech event at the Old Bailey and the linguist trying to reconstruct the spoken language of the period. These are the filters imposed by the scribes, by the proofreaders, the typesetters and printers, and by the publishers. It is to the influence of these persons that I will turn in the following. From the late 1730s, the title pages of the Proceedings regularly mention the scribe and/or printer.  Table 9 gives an overview of this information:
Joseph Gurney is the first scribe to be mentioned in the Proceedings, on the title page of 17730908, but we know that his father had been taking shorthand notes of the sessions since at least 1749, when his name first appears in an advertisement at the end of the Proceedings: 'SHORT-HAND Taught in an easy and expeditious Method, by Thomas Gurney, the Writer of these Proceedings' (17490113, my emphasis). Similar advertisements appeared in 129 further issues, so Thomas Gurney was responsible for recording the trials from the mid-18th century onwards. After his death in 1770, his bookbinder son Joseph took over the business of recording and publishing the Proceedings (see the advertisement in 17700711: 'By the late Mr. THOMAS GURNEY, upwards of Twenty Years Writer of these PROCEEDINGS'). For 85 years, therefore, we know who the scribes were and, for an even longer period, who printed the Proceedings. 
With this information we can now proceed to a micro-study of the material in the OBC in order to establish whether there is a significant correlation between the scribes/printers and the linguistic detail captured in the Proceedings. Perhaps the most noticeable fluctuation in the development of negative contraction shown in Figure 6 is the sudden drop of contraction from an average of 29% in the 1770s to a mere 4% in the 1780s, only to rise again to 23% in the 1790s. This could be an indication of an internal inconsistency of the corpus and will therefore be the focus of the first case study. Table 10 splits the Proceedings in the 1780s up by scribe and indicates the respective figures for negative contraction:
Joseph Gurney took down 15 proceedings from January 1780 to November 1781 and 3 proceedings from May 1782 to July 1782. The four sessions from December 1781 to April 1782 were transcribed by William Blanchard. However, E. Hodgson was the scribe who was responsible for the bulk of the Proceedings in the 1780s, 67 in all. For the present purposes we will have to assume that apart from the change in the person of the scribe all other parameters remain equal, i.e. we will idealize and assume that there is no significant language change within one decade and that the sociobiographical composition of the trial participants remained the same throughout. Figures 7 and 8 chart the percentages of negative contraction for can't and don't:
Overall distribution: p ≤ 0.001. The difference B↔GII is not significant (p ≤ 1). The difference GI↔GII is significant (p ≤ 0.01). The differences between all other pairs are highly significant (p ≤ 0.001).
Overall distribution: p ≤ 0.001. The differences between all pairs are highly significant (p ≤ 0.001).
Chi square tests show that, except for Blanchard↔Gurney II in Figure 7, all differences in the rate of negative contraction between the scribes are significant. The first finding is that there is a much lower rate of negative contraction in Hodgson's Proceedings than in those of Gurney and Blanchard in the 1780s. Averaging Gurney and Blanchard and setting them against Hodgson yields the following picture:
It seems clear that the drop in negative contraction in can not and do not in the 1780s is due to Hodgson. This scribal effect is much more pronounced in do not contraction since the difference between Gurney/Blanchard and Hodgson is more than twice as high (23.7%) as in can not (11.1%). Note also with regard to the latter that there is a tendency for a lower significance in the three Gurney↔Blanchard pairs, with B↔GII not significant and GI↔GII significant 'only' at the 0.01 level. There is therefore some, albeit small and rather variable, measure of agreement between Gurney and Blanchard insofar as negative contraction with can is concerned, but no agreement between the two on the one side and Hodgson on the other, which could be an indication of Hodgson's lower faithfulness with regard to the recording of instances of can't.
Unfortunately, the Proceedings do not indicate the printers in the period considered here, so it is impossible to test whether they may also have played a role in the differences. What is needed, then, are short periods in the Proceedings in which the printer stays the same but the scribes change and, conversely, a period where the scribe is the same but the printers change. As to the first case, W. Wilson printed the 17951202-18051030 Proceedings, recorded by three different scribes in succession, Marsom & Ramsey, William Ramsey, and Ramsey & Blanchard. The other case is afforded by Thomas Gurney, who recorded, among others, the ten years between 17511204 and 17611021, with M. Cooper, J. Robinson, G. Kearsley, and J. Scott as printers.
I will start with a study of a 'same printer/different scribes' period. Figures 9-11 illustrate this case for 17951202-18051030 with regard to negative contraction of can not, do not, and will not:
Overall distribution: p ≤ 0.01. The differences M&R↔R&B and WR↔R&B are not significant (p ≤ 0.2). The difference M&R↔WR is significant (p ≤ 0.01).
Overall distribution: p ≤ 0.001. The differences between all pairs are highly significant (p ≤ 0.001).
Overall distribution: p ≤ 0.05. The differences M&R↔WR (p ≤ 0.2) and M&R↔R&B (p ≤ 1) are not significant. The difference WR↔R&B is significant (p ≤ 0.025).
All three scribes show a very low percentage of can't, a trend which had already begun with E. Hodgson in the 1780s. Chi square shows a significant difference only in M&R↔WR, but this difference is very low (1.0%), meaning that the scribes represented negative contraction of can not in a similar way. The scribes also agree on the percentage of won't, around 10%, the only significant difference being WR↔R&B (at the 0.025 level). Interestingly, don't offers a different picture, with (1) a generally higher contraction rate than can not and will not, and (2) a pronounced and highly significant difference between the three sub-periods/scribes. One way to interpret this is that scribes can be more faithful in the representation of some linguistic features (percentages of can't and won't in this case) than of others (don't). Presumably, several reasons play a role in this 'differential faithfulness', including linguistic and social salience of the variants in question. Unfortunately, there is the complicating factor that Ramsey had a hand in all three sub-periods, and he may well have been the dominating factor in the teams with Marsom and with Blanchard.
The last case to be considered is that of 'different printers/same scribe'. For this, I will consider the 17511204-17611021 period, transcribed by Thomas Gurney and printed in turn by M. Cooper, J. Robinson, G. Kearsley, and J. Scott:
Overall distribution: p ≤ 0.001. The difference CII↔K is significant (p ≤ 0.025). The differences between all other pairs are highly significant (p ≤ 0.001).
Overall distribution: p ≤ 0.001. The differences CI↔R (p ≤ 1), CII↔K (p ≤ 1), and CII↔S (p ≤ 0.2) are not significant. The difference K↔S is significant (p ≤ 0.05). The differences between all other pairs are highly significant (p ≤ 0.001).
Figures 12 and 13 are interesting in that they show a difference between a relatively high rate of negative contraction with printers CI and R, but a comparatively low rate with CII, K, and S. In other words, with Thomas Gurney as the scribe throughout, the differences in contraction rates must be due to the printers. Note also that in both figures there is a significant difference between the two sub-periods printed by Cooper. This demonstrates that the Proceedings show variation in the representation of a linguistic variable even though scribe and printer are the same.
In sum, the variation in negative contraction presented by the three test cases considered above can perhaps best be captured with what I have called 'differential faithfulness'. On the intra-scribal level this means that individual scribes and printers can be more faithful with regard to the representation of some linguistic variants (maybe because of the variants' greater social or linguistic salience or indexical function) than with regard to others. On the inter-scribal level there may be agreement between different scribes/printers only on certain variables and not on others, as the Marsom/Ramsey/Blanchard test case has shown (agreement on a low contraction rate for can not, but disagreement on the rate for do not). An idealizing cross-figure comparison of negative contraction as represented by different scribes is given in Table 11, where a − stands for a very low rate of contraction and a + for a comparatively high rate:
In a comparatively short period of time (25 years) we find considerable variation between three groups of scribes: Gurney/Blanchard show a relatively high percentage of negative contraction, Hodgson a very low rate, and Marsom/Ramsey/Blanchard occupy an intermediary position with little can't contraction but a rather high percentage of contraction in don't.
4. Creating the (linguistic) Old Bailey Corpus 
The Old Bailey Corpus will be searchable online with unrestricted access. The reason for not freely disseminating the corpus itself is that the copyright in the electronic text version is owned by the University of Hertfordshire and the University of Sheffield. The following is a graphic representation of how the online search function will operate:
The digitized transcripts of the Old Bailey Proceedings obtained from Robert Shoemaker and Tim Hitchcock are already heavily annotated, as illustrated by the following excerpt, from the beginning of a trial in 1733:
Spoken language is not tagged in this version, but there are a number of sociolinguistically useful tags, most importantly
Speaker origin is probably the most unreliable parameter since it has to be established indirectly, through the crime location, <crimeloc>. The vast majority of speakers in the Proceedings resided in or around London anyway, the jurisdiction of the Old Bailey, and there is often no way of telling whether the crime location corresponds to the speaker's area of residence. Also, London's population was characterized by immigration both from the British Isles and elsewhere in the world. Most immigrants arrived as young adults and may have lost some of the characteristics of their original varieties after a few years of residence in the capital. However, since the origin of the speakers within London or Middlesex may be interesting for micro-sociolinguistic studies of the geographical diversity of London English in the 18th and 19th centuries, it was decided to keep this parameter.
Identifying and tagging direct speech was the first step in turning the Proceedings into the OBC. The objective was to extract the spoken passages from the Proceedings and assign them to individual speakers (i.e. compiling the information stored in MySQL Database 2; see Figure 14).
Because of the size of the material (52 million words) and limited resources, manual identification of spoken language in the Proceedings was out of the question. In search of alternatives, I first considered developing a software program that identifies and tags speech passages on the basis of morpho-syntactic features common in spoken language, such as first and second person pronouns. It soon turned out that such an approach would have been far too complex, error-prone and time-consuming. Instead, it was decided to base the process on formal rather than linguistic patterns: we created a Perl script that tagged spoken language on the basis of keywords and patterns in the xml-structure/layout of the Proceedings. A complete algorithm of the tasks performed by this tagger can be found in Appendix B, but the following paragraphs will give a general impression of the approach taken here, illustrated by selected examples.
A promising procedure to identify spoken language is to look for metalinguistic information that is present in any printed text (like new paragraphs to indicate speaker changes). Obviously, the first strategy that came to mind was to tag everything between inverted commas as speech, but it turned out that inverted commas are extremely rare in the Proceedings. Compare the plain-text excerpt from 17330510-1, the trial already cited above:
There are no inverted commas here, but one regularity in this text (and elsewhere in the Proceedings) is that every speech act occupies one paragraph. In other words, a speaker's utterance ends at a paragraph break, and </speech> tags can accordingly be inserted in this position, as shown in the excerpt above.
The situation is more complicated when we look for the start of the speech act, because it does not necessarily coincide with the beginning of a paragraph. Compare the second paragraph, where John Underwood's statement starts with the third word, the first two ('John Underwood.') simply identifying him as the speaker. That is, the <speech> tag has to be inserted after the speaker name. Again, paragraphs starting with the speaker's name are a fairly common pattern in the Proceedings and were used for tagging purposes. For the purposes of the Perl script, a name was defined simply as a string of letters. The script assumes that paragraphs can start with either one name followed by a full stop ('Smith. I was walking …') or two names followed by a full stop ('John Smith. I was walking …'). While this yields correct results in a good number of cases, there are also a variety of exceptions that have to be taken into account. For instance, paragraph 3 in the excerpt above begins 'Mrs. Underwood.' The standard routine identifies 'Mrs.' as a name followed by a full stop. However, <speech> should be inserted after 'Underwood.', not after 'Mrs.' To avoid misplacement of the tag in such cases, the script checks a list of exceptions to the standard routine before the tag is inserted. This list includes strings at the beginning of paragraphs such as 'Defendant.' or 'Prisoner.' The Perl script also makes use of tags in the electronic version of the Proceedings prepared by Tim Hitchcock and Robert Shoemaker. In this version, many names are tagged as given names and surnames, so the presence of <given> and <surname> at the beginning of a paragraph is of help in placing the <speech> tag.
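The paragraph-based routine just described can be illustrated with a minimal sketch. It is written in Python rather than Perl and is not the project's actual script: the list of titles, the function name, and the decision to leave a paragraph untouched when no speaker label is found are assumptions made for illustration. A name is defined, as above, simply as a string of letters.

```python
import re

# Assumption: a small, hypothetical list of titles that end in a full stop
# without ending the speaker label (the article only mentions 'Mrs.').
TITLES = ("Mrs.", "Mr.", "Dr.")

# One or two "names" (plain strings of letters) followed by a full stop.
NAME_LABEL = re.compile(r"[A-Za-z]+(?: [A-Za-z]+)?\.")


def tag_paragraph(paragraph):
    """Insert <speech> after a paragraph-initial speaker label and </speech>
    at the paragraph break; return the paragraph unchanged otherwise."""
    text = paragraph.strip()
    offset = 0
    # Skip a leading title so the tag lands after 'Underwood.', not 'Mrs.'
    for title in TITLES:
        if text.startswith(title + " "):
            offset = len(title) + 1
            break
    match = NAME_LABEL.match(text, offset)
    if match is None:
        return paragraph
    label = text[:match.end()]
    utterance = text[match.end():].strip()
    if not utterance:
        return paragraph
    return f"{label} <speech>{utterance}</speech>"


print(tag_paragraph("John Smith. I was walking home."))
print(tag_paragraph("Mrs. Underwood. I saw him at the door."))
```

Because a name is nothing more than a string of letters, a narrative paragraph opening with a one- or two-word sentence would also be tagged, which is one reason why an exception list is needed in practice.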
Question-answer sequences constitute a frequent exception to the general rule that one utterance occupies one paragraph: in the Proceedings, both turns of the adjacency pair are often found in the same paragraph. The approach described so far cannot identify the beginnings and ends of two utterances in the same paragraph, but question-answer sequences also show regular patterns that can be of help in doing this. In the original Proceedings, most of these sequences are marked, with slight variations, 'Q. – A.' and the Perl script uses this metalinguistic information to insert the <speech> tags at the appropriate places:
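A hedged Python sketch of this step follows. It assumes that each turn is introduced by 'Q.' or 'A.', optionally preceded by a dash, and it wraps the text up to the next marker (or the paragraph break) in <speech> tags while leaving the markers themselves in place. The marker pattern and the function name are assumptions, the sample sentence is invented, and the further variants found in the Proceedings are not covered.

```python
import re

# Assumption: turns open with 'Q.' or 'A.', optionally preceded by a
# hyphen, en dash (\u2013) or em dash (\u2014).
MARKER = re.compile(r"(?:[-\u2013\u2014]\s*)?\b([QA])\.\s+")


def tag_question_answer(paragraph):
    """Wrap every question and answer within one paragraph in <speech>
    tags; paragraphs without Q/A markers are returned unchanged."""
    matches = list(MARKER.finditer(paragraph))
    if not matches:
        return paragraph
    # Everything up to and including the first marker is kept as it is.
    out = [paragraph[:matches[0].end()]]
    for i, m in enumerate(matches):
        nxt = matches[i + 1] if i + 1 < len(matches) else None
        end = nxt.start() if nxt else len(paragraph)
        turn = paragraph[m.end():end]
        body = turn.rstrip()
        out.append("<speech>" + body + "</speech>")
        if nxt:
            # Keep the original spacing and marker between the two turns.
            out.append(turn[len(body):] + paragraph[nxt.start():nxt.end()])
    return "".join(out)


print(tag_question_answer("Q. Did you see the prisoner? - A. Yes, I did."))
```

A pattern this simple would also fire on an initial such as 'A. Smith' inside an utterance, so a production version would need additional checks.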
The main task in creating the OBC is to gather sociobiographical speaker data and to link these with the speech sections in the Proceedings, as described in Figure 14. Again, because of the large size of the corpus a completely manual annotation was impracticable. Instead, an annotation tool was developed that automatizes this process as far as possible.  With some adaptations this tool will also be useful for similar annotation purposes in other corpora. Figure 15 shows a screenshot of the Old Bailey Tool, highlighting its main components and functions.
The text window and the tag assistant are the main components of the Old Bailey Tool. First, an xml-file is loaded into the text window. For easier reading, tags can be faded to grey or highlighted in various colours/styles, as shown in the screenshot. The next step is to have the speaker-ID generator in the tag assistant extract all tagged names from the xml-file and assign a unique speaker-ID to them. An alphabetical list of names is shown in the bottom left window of the tag assistant ("Names: alphabetical"), with the speaker-IDs next to them. For example, in the screenshot, Francis Perry has the ID 69. Clicking the genderizer button then automatically assigns the sex to the speakers by consulting a list of about 7,300 male and female first names and their orthographical variants. This captures more than 95% of the names in the Proceedings; the rest can be added manually.
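The genderizer step amounts to a lookup of the first name in a name list. The following Python sketch shows the idea with a hypothetical miniature list; the names, the variant spellings, the 'm'/'f' codes, and the function name are all assumptions for illustration and are not taken from the tool's actual list of about 7,300 entries.

```python
# Hypothetical miniature of the first-name list, including a few invented
# spelling variants; values are 'm' (male) or 'f' (female).
FIRST_NAMES = {
    "john": "m", "jno": "m", "francis": "m", "henry": "m",
    "mary": "f", "elizabeth": "f", "eliz": "f",
}


def genderize(full_name):
    """Return 'm' or 'f' for a known first name, or None if the name is
    not in the list and has to be annotated manually."""
    parts = full_name.replace(".", " ").split()
    if not parts:
        return None
    return FIRST_NAMES.get(parts[0].lower())


print(genderize("Francis Perry"), genderize("Eliz. Smith"), genderize("Zebulon Pike"))
```

Names for which the lookup returns None correspond to the remainder that is added by hand.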
The annotation process starts after these preparatory steps. The buttons "next (down)" and "next (up)" will let the user jump from one <speech> tag to the next in the xml-file. In the screenshot, the current position is the <speech> tag in red ("Prisoner's Defence"). The tag assistant automatically checks for names around the current position and shows them in the "Names found near current tag" window. The reason for this is that there is a high likelihood that the speaker at the current position is identical with one of these persons, which saves the time of scrolling up and down in the alphabetical list. By double-clicking on the appropriate name, either in the alphabetical list or in the "names near current tag" list, a unique speaker-ID will be inserted into the <speech> tag, consisting of the file name (t17750426 in this case) followed by an underscore and the ID shown next to the speaker in the alphabetical name list (_0069 in the case of Francis Perry). As one goes along, the "Recent selections" name list will display the names of speakers whose ID was inserted previously, again because the likelihood is high that the current speaker is again one of these.
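The composition of the speaker-ID can be sketched as follows in Python. The four-digit zero-padding is inferred from the example '_0069'; numbering the names in alphabetical order is an assumption, since the text does not state how the generator assigns its numbers, and both function names are hypothetical.

```python
def assign_speaker_ids(names):
    """Give every distinct name a running number (assumed here to follow
    alphabetical order), starting at 1."""
    return {name: i for i, name in enumerate(sorted(set(names)), start=1)}


def speaker_tag_id(file_name, number):
    """Build the identifier inserted into a <speech> tag: file name,
    underscore, and the speaker number zero-padded to four digits."""
    return f"{file_name}_{number:04d}"


# The example from the text: Francis Perry has the ID 69 in file t17750426.
print(speaker_tag_id("t17750426", 69))
```

Since the file name is part of the identifier, the same running number can be reused in different files without clashes.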
The context often contains sociobiographical information on the speakers, for example witnesses frequently begin their statements with 'I am a (profession label)', as Henry Dixon in the screenshot 'I am a pawnbroker'. While inserting speaker-IDs into the xml-file, this information has to be gathered and inserted in the age, profession, and location fields next to the alphabetical list. Speaker-IDs and the sociobiographical details are stored in a database which can be exported for further processing. In addition to this information, the database will contain the names of the scribe and printer of the respective Proceedings, to help corpus users assess the validity of their findings, for instance as demonstrated in Section 3.3.2.
The Proceedings of the Old Bailey constitute a large body of texts, whose speech passages are arguably as near as we can get to the everyday language predating the invention of audio recording technology.
This article started with an overview of the historical background and structure of the OBC, a 50+ million word linguistic corpus based on the Proceedings. Direct speech becomes more common in the 1720s, from which point on almost 85% of the text is spoken language. Age is regularly mentioned only from the 1790s, but it is hoped that more information for earlier years can be added during annotation of the corpus. More than 70% of the speakers are men, but this imbalance is remedied by the size of the OBC, which ensures that even with a low percentage, the absolute number of women is still high enough for historical sociolinguistic studies. In addition, the large variety of occupation and status labels of the participants in the trials will be useful in forming social classes.
Section 3 dealt with assessing the reliability of the OBC as a source of spoken English in the 18th and 19th centuries. It looked at the filters imposed in the different stages of the genesis of the Proceedings and investigated external fit and internal consistency. With regard to the filter effect, the simultaneity of the speech event and its recording as well as rapid publication after the sessions at the Old Bailey are arguments in favour of a rather accurate portrayal of spoken language in the Proceedings. On the other hand, the investigation of Gurney's shorthand system showed that even a supposedly verbatim mode of recording did not in all cases result in an absolutely faithful representation of the speech event.
The external fit of the Proceedings was examined by comparing a sample trial with an alternative account of the same court case. Although there is some verbal overlap between the two versions, there are also substantial differences, including omissions as well as verbal, morphological, and syntactic divergences. While this in itself does not necessarily discredit the Proceedings as a source of authentic spoken language (it could well be that the alternative account is the less reliable one), it shows that trial accounts cannot simply be taken at face value but have to be evaluated carefully. In a second step to assess the external fit, negative contraction was chosen as a diagnostic linguistic feature of spoken language. An important finding was that negative contraction is (almost) exclusively found in the speech passages of the Proceedings, demonstrating that the scribes did systematically differentiate between speech and prose, which lends some credibility to their portrayal of spoken language. A comparison of the OBC findings with the CED showed further that both corpora agree in the auxiliaries that n't cliticizes to and show comparable (though not identical) rates of contraction for these.
Internal consistency was tested by micro-studies of negative contraction in three short sub-periods of the Proceedings — 1751-1761, 1780-1782, and 1795-1805 — in which either the scribe or the printer varied. The result was that scribes can differ from each other in the rate of negative contraction either across the board or with regard to individual auxiliaries. This differential faithfulness was seen as an indication that scribes can be more accurate in the representation of some linguistic features than of others. The third micro-study showed that variation in the rate of negative contraction can also be due to the influence of the printer. The conclusion to be drawn from all this is that the representation of linguistic features in the OBC, as in other trial collections, can be distorted by scribal and/or printer interference. Corpora including trial proceedings and studies based on such corpora have to take account of the fact that what looks like language variation and change may in fact be due to the influence of scribes and printers.
In compiling and annotating the OBC, the major task was to identify speech passages and link them to sociobiographical speaker parameters such as sex, age, or profession. Digitized and xml-encoded transcripts of the Proceedings were kindly provided by Robert Shoemaker and Tim Hitchcock, and identification of spoken language was achieved with the help of a Perl script that tagged these passages on the basis of keywords and patterns in the xml-structure/layout of the Proceedings. The Old Bailey Tagger was developed for speaker annotation. This is a tool that automatizes speaker identification and the collection of sociobiographical data. With some adaptations this tool will also be useful for similar annotation purposes in other corpora.
In spite of these caveats, trial proceedings are still among the few and best sources we have of spoken language before the advent of mechanical recording. Some studies suggest that comedy drama presents an even more faithful picture, but trial accounts have the advantage that they are based on a real, not an imagined, speech event. Even if they are not completely true to that speech event, they are at least guided by it, whereas dialogue in drama is for the most part simply invented.
 For detailed background information on the Old Bailey and the publication history of the Proceedings consult the excellent Old Bailey Proceedings Online, from where the information presented in this section has been taken.
 Eight-digit references are to the Proceedings reference number as used in the Old Bailey Proceedings Online. The first four digits indicate the year of the trial, followed by two digits each for the month and day. A hyphen followed by a number indicates the particular trial in a particular issue in the Proceedings.
 Digitization of the 1834-1913 Proceedings of the Central Criminal Court is under way. They will be launched in the spring of 2008. My plan is to integrate this material in the Old Bailey Corpus, which will then span almost 200 years of spoken Modern English.
 In the Corpus of English Dialogues 1560-1760, negative contraction in don't (1600-1639), shan't (1640-1679), and won't (1640-1679) is attested one subperiod earlier and with a generally higher relative frequency in the genre comedy than in the genre trial. Comedy also includes (rather infrequent) contractions in mayn't and mustn't, which are totally absent from CED (and OBC) trials.
 Note that <ll> (from the last sound in single plus the first sound in line) is indicated by boldening , the symbol for 'l'. This could be interpreted as both phonological — if we assume that assimilation and elision processes like /ll/ > [l] are not accounted for in the script but instead citation forms of the words are transcribed — and orthographic — if we assume that the primary guide is the conventional orthography and that double-l results from contraction of the last and first letter across a word boundary.
 I am grateful to Robert Shoemaker for bringing these to my attention. At least from the middle of the 18th century on, the court officially relied on scribes to report minutes of previous sessions. Thus, Joseph Gurney was called to testify in 17710220-82, 17710410-64, 17711023-94, 17730113-78, 17780715-92, 17780715-93, and 17820515-60.
 By far the largest number are spelt in one word, cannot (there are only 62 tokens of can not in the OBC 1732-1834 period). Brainerd (1989: 180-181) treats cannot as an early form of contraction analogous to shall not > shannot and will not > wonnot. However, since the spelling of cannot does not suggest any phonological change, it will here be treated as an uncontracted form.
 Note that although the CED also contains some accounts of trials at the Old Bailey, these are alternative accounts and not identical with those in the Proceedings. The CED trials can therefore be compared with the OBC without danger of circularity.
 Interestingly, a quotation search in the Oxford English Dictionary on CD ROM shows the following first attestations of negative contraction with past forms of auxiliaries: couldn't 1800, didn't 1705, hadn't 1775, mightn't 1865, shouldn't 1628 (!), wasn't 1797, weren't 1845, and wouldn't 1794.
 More information on who transcribed, printed and published the Proceedings can be gleaned from a close inspection of the main text and the advertisements, but for the present purpose Table 9 is sufficient.
 There are some gaps for the printers, though. The gap from 17771203 to 17921031 is due to the fact that the Proceedings were 'printed for' the respective scribes in this period. The actual printers are not identified.
 I would like to thank Eva Kapp, Manuel Müller, Magnus Nissel, Andreas Reuter, Ulrike Schneider, Tracy Sutphin, and Alexandra Tran, my student helpers at Giessen University, as well as my research assistants Thorsten Brato and Svetla Rogatcheva for tagging and annotating the OBC.
BNC = British National Corpus. CQP-edition (Version 3.0), developed by Sebastian Hoffmann (University of Zurich) and Stefan Evert (University of Osnabrück), http://www.natcorp.ox.ac.uk/
CED = Corpus of English Dialogues 1560-1760, compiled under the Supervision of Merja Kytö (Uppsala University) and Jonathan Culpeper (Lancaster University), www.engelska.uu.se/Research/English_Language/Research_Areas/Electronic_Resource_Projects/A_Corpus_of_English_Dialogues/
OBC = Old Bailey Corpus (OBC), http://www.oldbaileyonline.org/
Hitchcock, Tim & Robert Shoemaker. 2007a. "Publishing history of the Proceedings from their inception to 1834". Old Bailey Proceedings Online. http://www.hrionline.ac.uk/oldbailey/proceedings/publishinghistory.html, accessed 4 April 2007.
Hitchcock, Tim & Robert Shoemaker. 2007b. "The value of the Proceedings as a historical source". Old Bailey Proceedings Online. http://www.hrionline.ac.uk/oldbailey/proceedings/value.html, accessed 15 January 2007.
Schneider, Edgar W. 2002. "Investigating variation and change in written documents". The Handbook of Language Variation and Change, ed. by J. K. Chambers, Peter Trudgill & Natalie Schilling-Estes, 67-96. Oxford: Blackwell.
Thomas, Alan. 1994. "English in Wales". The Cambridge History of the English Language, vol. V: English in Britain and Overseas: Origins and Development, ed. by Robert Burchfield, 94-147. Cambridge: Cambridge University Press.
Extract from the trial of John Ayliffe, Proceedings 17591024, and an alternative account of the same trial, The tryal at large of John Ayliffe (Tryal).
The Perl script developed for identifying and tagging spoken language in the Proceedings treats the files as ordinary text files, not as xml-hierarchies, because it makes creating regular expressions easier. Anything tagged as <front>, <back>, <summary>, or <advert> is disregarded in the tagging process since these parts of the Proceedings do not contain speech. The following is a rough sketch of the patterns analyzed by the tagging software. A more detailed algorithm is given in the next section.
Perl Speech Tagger algorithm