Wright's English Dialect Dictionary computerised: towards a new source of information

Manfred Markus, University of Innsbruck


The computerised versions of the OED and of many recent learners' dictionaries have demonstrated the great advantages of electronic versions of big dictionaries over their alternatives in book form. The present paper investigates the possibilities and limits of computerised versions of regionalectal data as collected in the second half of the 19th century up to Joseph Wright's famous English Dialect Dictionary (1898-1905). Inspired by projects such as Ian Lancashire's compilation of dictionaries from the Early Modern English period, the present project, called SPEED ('Spoken English in Early Dialects') and supported by a grant of the Austrian Research Fund, will, in its first phase, provide a digitised databank version of Wright's unfairly neglected dictionary and encourage exploitation of the data, which will then be available for dialectology, historical linguistics of spoken English and historical lexicology/phraseology. The present paper is mainly a broad description of the Dictionary's structure, i.e. it is concerned with the eight main parameters of its entries, the problems involved in these parameters or fields when they are transferred into a database, and the large amount of information stored in them.

1. Introduction

1.1 Dictionaries and glossaries of the English Dialect Society

The English Dialect Society, initiated by Walter William Skeat, has provided publications on a great many British dialects of the nineteenth century, mainly in the form of glossaries of modest scholarly ambition. These dozens of glossaries, which nowadays are mostly available in reprints, [1] would themselves be worthy of examination, since they are full of information, so far largely neglected, on Late Modern English dialect words, in particular on the idiomatic usage of lexis, on phrases, dicta and sayings, as well as on local cultural habits. While this material would be challenging within cultural history, the most famous of the dialect dictionaries of the late 19th century, Joseph Wright's English Dialect Dictionary, must be taken seriously as an outstanding contribution to historical linguistics proper. Published in six volumes from 1898 to 1905, this imposing work of 5,000 pages is not only the best of 19th century dialect dictionaries, but also a remarkable achievement in its own right, useful for historical dialect geography as well as the study of Late Modern English idioms and spoken features.

Accordingly, this paper focusses only on Wright's Dictionary, with an emphasis on its corpus linguistic potential and the problems involved. The Dictionary is, in fact, the subject of a new Innsbruck project called SPEED (for 'Spoken English in Early Dialects'), funded by the Austrian Research Fund ('Fonds für Wissenschaft und Forschung'). Our project was inspired and encouraged by dictionaries already available on the Internet, for example within a project by Ian Lancashire (Toronto University), who has compiled various dictionaries of the Early Modern English period (cf. Lancashire/Patterson 1997). In our own case, the 5,000 pages of the six volumes have already been digitised, and at present (November 2006) are available in a provisional OCR version. Since some of the six members of the project team, including myself, have already been concerned with the complex structure of the Dictionary's entries, I would like to present the results of our first tentative analysis of this structure, highlighting Wright's labelling technique and the dialectal, i.e. regional, identification of entries.

1.2 Wright as a person and scholar

A few initial words should, however, be said about Joseph Wright as a person and scholar. [2] His academic career was most remarkable: born near Bradford in 1855, he grew up in poverty and under deprived family conditions, forced from the age of five to contribute to the domestic budget first as a donkey-boy, then as a child worker in a weaving mill. [3] In accordance with the 1870 Education Act, which provided compulsory schooling in Britain, and also with the help of both the Bible and Bunyan's Pilgrim's Progress, Wright learnt to read and write only when he was fifteen. His intellectual interest and ambition and his ensuing later career as a scholar would be worthy of a movie script; once literate, he developed a keen interest in French, German, Latin, mathematics, Indo-European languages, shorthand, chemistry and many other subjects. As a student, he was enrolled at the Yorkshire College of Science (later to be Leeds University), but also at the universities of Leipzig and, particularly, Heidelberg, where he later also functioned as an editor of Julius Groos Publishers. In the 1880s he began publishing work of his own with a Middle High German Primer (Clarendon Press, 1888). In Germany, Wright not only met supportive scholars, such as Osthoff and Brugmann, but also learnt to take a particular interest in dialects.

Wright's career as an academic teacher is an out and out success story, which, after casual teaching jobs at Oxford and Cambridge, culminated in 1901 in the position of Professor of Comparative Philology at Oxford. In 1898, he had started to publish the English Dialect Dictionary.

1.3 The English Dialect Dictionary (EDD)

The publication of the six volumes, five volumes for the Dictionary proper and one supplementary volume, took seven years, from 1898 to 1905. But Wright's task of compiling material had naturally started much earlier. In fact, he had made plans to edit the dictionary, in response to Skeat's offer, as early as 1887 (Holder 2004:248). Wright made use of the large number of written sources available on English dialects at the time, in particular, those of the English Dialect Society, mentioned above, but also of John Jamieson's Etymological Dictionary of the Scottish Language (1st edition 1808). Moreover, he and his team evaluated the thousands of paper slips sent in by "country gentlemen, clergy, mill-workers, farmers, students, enthusiasts of all sorts, both scholars and homely folk" (Holder 2004:255). Wright successfully played the role of lexicographer as well as that of editor, manager and businessman.

In the front matter of his work, Wright meticulously lists all the voluntary readers, compilers of unprinted glossaries, correspondents and editors of earlier glossaries. On that basis, the dictionary claims on its title page to provide the "COMPLETE VOCABULARY OF ALL DIALECT WORDS STILL IN USE, OR KNOWN TO HAVE BEEN IN USE DURING THE LAST TWO HUNDRED YEARS".

While this statement - after the evidence of lack of perfection in the case of the OED - should be taken with a pinch of salt, the importance of the EDD can hardly be overestimated. It was, as Wright himself proudly says in the Preface (V), "the largest and most comprehensive Dialect Dictionary ever published in any country", and to my knowledge that claim is still justified.

Apart from its remarkable size, which is in itself a good reason for digitisation, the Dictionary's value consists in various other points, three of which are:

  1. its historical range of 200 years, namely back to 1700, i.e. the beginning of what we now call Late Modern English;
  2. its admirably precise and scholarly method of linguistic description, from the phonetics to the citations;
  3. in line with the policy of the English Dialect Society, its concern with details of cultural history, namely "superstitions and practices in relation to religion, death, witchcraft, apprenticeship, courtship and the like" (Holder 2004:258).

2. Why an electronic version of the EDD?

There is no need nowadays to give evidence of the general advantages of a computerised version of a dictionary, but three specific reasons may be mentioned from a philological point of view. They concern historical English dialectology, studies of spoken English and historical lexicology.

2.1 Historical dialectology

There is still an enormous deficit in historical English dialectology, particularly in the area of word geography. The Cambridge History of the English Language (Romaine 1998), in its fourth volume covering the time from 1776 to 1997, says more against English dialects than in their favour (see Introduction by Suzanne Romaine and the first chapter by John Algeo on vocabulary). Other handbooks convey the same impression (e.g., Walters 1988). Kaiser's statement of 1937 (Vorwort, first page) that "up to now word geography has been a stepchild of English philology", cited by Hoad (1994:197), is still valid in 2006.

There are three major reasons for this neglect:

  1. the eighteenth-century preference for norms and, thus, for the dominant British southern standard (cf. Romaine 1998:7) had its supporters in the earlier nineteenth as well as the twentieth century;
  2. the twentieth century saw the scholars' main interest turn to the systems of languages rather than to their varieties - de Saussure's Cours de Linguistique Générale was published just a decade after Wright's English Dialect Dictionary;
  3. the increase in urbanisation, in line with industrialisation, from about 1840 on has motivated many recent scholars, e.g. Suzanne Romaine (1998:14), to favour sociolectal parameters at the cost of regionalectal ones.

Whatever the reasons, Wright's achievement has often been treated unfairly: it has either been ignored or else downgraded for exuding too much romanticism, patriotism [4] and - as regards the linguistic method - positivism or traditionalism. However, as McArthur (1992:300) said, "without the mass of data which traditional dialectologists have furnished, theoretical systems could not have been either proposed or refined". Moreover, the 'item-centered' method of traditional dialectology [5] loses some of its horror if the detailed items are made transparent by machine-readability and visualised on the monitor rather than in black and white on paper.

2.2 Historical Spoken English

The second asset of Wright's Dictionary is its contribution to the history of spoken Late Modern English. Although the International Phonetic Alphabet, first officially released in 1888, was not yet in common use, Wright was very consistent in applying a similar system of transcription, thus giving valid phonetic information. But this is not the only aspect of what the EDD provides. The phonetics of dialect words allows us, of course, to see all kinds of phonological processes at work: assimilation and dissimilation, syncope and aphaeresis, cluster reduction and epenthesis. Moreover, the inclusion of phrases and idioms promises to provide new information on collocational sound patterns, colloquialisms, patterns of repetition and deviation as typical of the spoken language.

The research on spoken historical English is naturally scanty. While we all know that the spelling and pronunciation of English words have increasingly diverged since Caxton's introduction of printing, historical English linguistics has too one-sidedly relied on the written language. Books like Bøgholm's English Speech from an Historical Point of View (1939) have remained an exception, [6] and detailed special investigations into spoken Late Modern English are practically non-existent. [7] In view of this state of the art, a new and rich source of features of the spoken language of the 18th and 19th centuries should be welcome.

2.3 Historical linguistics, in particular, lexicology

Wright's EDD could also help to remedy the relatively deficient situation of research on other aspects of Late Modern English, from morphology to pragmatics, but above all to lexicology. If our linguistic interest is not restricted to what used to be called the "standard" of English, a great many subsystems of English between 1700 and 1900 - beyond the basics that scholars such as Görlach (cf. 2001) and Bailey (cf. 1996) have provided - are worth discovering. This is obvious in view of the large amount of material that the EDD offers: more than 60,000 lemmata. Moreover, the value of the dictionary is increased by the high quality of the entries. The EDD was compiled in the same spirit as the OED, which had been started a few decades earlier, and the complexity and substantiality of the entries are equally impressive.

It is, therefore, high time for historical corpus linguists to exploit Wright's magnum opus. But how and to what extent can this be done?

3. Trying out the electronic medium on Wright's EDD

These are two (fairly short) entries of the EDD selected at random (asteep and asteer):

Figure 1. Two entries from the EDD.

3.1 Survey of the structure of the EDD entries

These two entries demonstrate nicely the three main paragraphs that most of the entries consist of. After the capitalised lemma, we first get the main information: word class, dialect area, usage (in the form of labels, such as obsol. for "obsolete" in asteer), and the phonetic transcription. In shorter entries of monosemous words, the meaning simply follows in the paragraph started by the headword; meaning is often given in the form of a phrasal example. After that there is the block of citations with the sources, and then some additional information ("comments"), marked by brackets.

Since the brackets of the phonetic transcription are a reliable formal indicator, we have defined the first five parameters of the entries, up to the transcription, as the "head" and the remaining three as the "body" of the entries. Schematically:


  Head:
  1. lemma, or headword
  2. part of speech, such as v. (for verb)
  3. usage label, such as obs. (for obsolete)
  4. dialect counties and regions
  5. phonetic transcription (not the IPA, but similar to it)

  Body:
  6. meaning(s)
  7. citations with their sources
  8. comments or cross references

Apart from the fact that some of these fields, in particular that of comment (8), are sometimes empty and also that the unpredictable number of listed meanings disturbs the picture of easily calculable systematicity, we are hopeful that the eight fields can be electronically identified. In this optimism we are supported by the fact that the fields are marked either lexically or by format. For example, there is a large but limited number of usage labels, normally in abbreviation, such as obs., or coll. (for colloquial), and the lemmas are always printed in normal capital letters, whereas the citations are in a separate block of smaller fonts.
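
The eight-field model can be pictured as a simple record. The following is only a minimal sketch under stated assumptions, not the project's actual database design: the class name `EddEntry`, the function `split_head_body` and the sample string are hypothetical, and we assume here that the transcription is enclosed in square brackets and is the first bracketed item of an entry. Real OCR text will need far more robust handling.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class EddEntry:
    """Hypothetical record for one entry: fields 1-5 = head, 6-8 = body."""
    lemma: str                                              # 1 headword
    part_of_speech: Optional[str] = None                    # 2 e.g. "v."
    usage_label: Optional[str] = None                       # 3 e.g. "obs."
    dialect_areas: List[str] = field(default_factory=list)  # 4 counties/regions
    phonetic: Optional[str] = None                          # 5 transcription
    meanings: List[str] = field(default_factory=list)       # 6 one or more senses
    citations: List[str] = field(default_factory=list)      # 7 with sources
    comment: Optional[str] = None                           # 8 often empty


def split_head_body(raw: str) -> Tuple[str, str]:
    """Split an entry after the first [...] pair (assumed to be the transcription).

    If no complete bracket pair is found, the whole string is treated as head.
    """
    start = raw.find("[")
    end = raw.find("]", start + 1) if start != -1 else -1
    if end == -1:
        return raw.strip(), ""
    return raw[: end + 1].strip(), raw[end + 1:].strip()


# Invented sample string, not a quotation from the Dictionary.
head, body = split_head_body("LEMMA, v. obs. Yks. [lemə] 1. Invented gloss.")
print(head)
print(body)
```

Keeping the optional fields empty by default reflects the observation that the comment field is sometimes missing and that the number of meanings is unpredictable.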

The retrieval routines will, we hope, be such that the electronic version of the Dictionary, as in the case of the OED, will allow many more types of questions than the user of the paper version could ever dare to ask. For example, the headwords will be equipped with a morphological boundary marker so that queries for morphemes (such as the dialectally productive suffix -hood, as in barley-hood, billyhood, [8] French-hood) will later be possible.
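
How a boundary marker makes such morpheme queries possible can be sketched as follows. This is an illustration only: the marker character `+`, the function names and the two distractor items (`hood+wink`, `hood`) are assumptions made for the example, not features of the actual SPEED implementation.

```python
from typing import Iterable, List

BOUNDARY = "+"  # hypothetical morphological boundary marker


def has_suffix(marked_headword: str, suffix: str) -> bool:
    """True if the last morpheme of a boundary-marked headword equals the suffix."""
    morphemes = marked_headword.lower().split(BOUNDARY)
    return len(morphemes) > 1 and morphemes[-1] == suffix.lower().lstrip("-")


def query_suffix(headwords: Iterable[str], suffix: str) -> List[str]:
    """Return all boundary-marked headwords that end in the given suffix."""
    return [h for h in headwords if has_suffix(h, suffix)]


# The first three items are the examples named above, with the marker added;
# the last two are invented distractors.
marked = ["barley+hood", "billy+hood", "French+hood", "hood+wink", "hood"]
print(query_suffix(marked, "-hood"))
```

Because the query compares whole morphemes rather than raw character strings, a word that merely begins with hood, or the simplex hood itself, is not returned.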

3.2 Analysis of labels: parts of speech

The label provided immediately after the headword frequently refers to the word class (sb., v. etc.), but it soon became clear that other grammatical functions are often involved, which is why we have called this parameter "parts of speech". There is no need to discuss all the 106 different markers that could be traced in this position. A few examples of the various grammatical domains will do to demonstrate the fact that Wright did not keep them apart (cf. Tables 1 to 3).

Table 1. Syntactic function of parts of speech labels (selection).

Wright's strings | implicit key words | attributable domain
also in comb. | |
also in phr. | |
also used advb. | |
 | combination (= compound or phrase) | syntax (qua phrase)
improperly used as inf. | |

Table 2. Morphological function of parts of speech labels (selection).

Wright's strings | implicit key words | attributable domain
also in comp. | |
also used as pl. | |
also used as sg. | |

Table 3. Phonological function of part of speech labels.

Wright's strings | implicit key words | attributable domain
used before a consonant | |
used before a vowel | |
used before vowels and voiced consonants | pre-vocalic/pre-consonantal (voiced) |

Tables 1 to 3 illustrate our aim to grasp the information contained in the EDD not only in terms of strings, i.e. mechanically, but also in terms of hierarchically structured functional categories based on linguistic interpretation. Whenever markers affect different grammatical domains, as in the case of the examples in Table 4, this is taken into account in our correlation of these markers with the domains:

Table 4. Mixed function of parts of speech labels.

Wright's strings | implicit key words | attributable domain
v. w. irr. | verb weak irregular |
vbl. sb. | verbal substantive/gerund |

While Wright obviously had formal grammatical features in mind here, the next field or slot ("labels") provides information on a word's semantics or pragmatics. Here is a small selection of the two types:

Table 5. Semantic function of labels.

Wright's strings | implicit key words | attributable domain
(freq.) used as a nickname | |
also used fig. | |
used as a decoy | |
used as a meaningless exclamation | meaningless exclamation |
used as a mild implication | mild implication |
used as a quasi-oath or exclamation | |
used as a term of endearment, sympathy or compassion | endearment/sympathy/compassion |

Table 6. Pragmatic function of labels.

Wright's strings | implicit key words | attributable domain
also used as a familiar term of address | |
also used as a fencing term | |
also used as a term of contempt | |
also used as a term of endearment to (x) | endearment; x = (infant, children) |
also used as an epithet of contempt | |
and in gen. colloq. use | |
in colloq. use | |

While labels like these, some 210 altogether, are sometimes difficult to attribute to either semantics or pragmatics, so that manual revision will be necessary, the segmentation and classification will later allow the tracing of semantic word fields as well as pragmatic patterns. In view of the limited work that has been done in Late Modern English pragmatics, the 150-odd labels of pragmatic relevance found in the EDD so far seem very promising.
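
The correlation of Wright's strings with key words and functional domains can be pictured as a small lookup table. The sketch below is a hedged illustration, not the project's data model: the names `LabelInfo`, `LABELS`, `labels_in_domain` and `needs_manual_review` are hypothetical, the domains are simply taken from the titles of the tables in which a string appears, and the two domains given for vbl. sb. are an assumption made only to show how a label of mixed function might be flagged.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class LabelInfo:
    """Interpretation attached to one of Wright's label strings."""
    key_words: str            # normalised paraphrase; empty if not yet assigned
    domains: Tuple[str, ...]  # one or more functional domains


LABELS: Dict[str, LabelInfo] = {
    "also in phr.": LabelInfo("", ("syntax",)),
    "also used as pl.": LabelInfo("", ("morphology",)),
    "used before a vowel": LabelInfo("", ("phonology",)),
    "used before vowels and voiced consonants": LabelInfo(
        "pre-vocalic/pre-consonantal (voiced)", ("phonology",)),
    "used as a mild implication": LabelInfo("mild implication", ("semantics",)),
    "also used as a term of contempt": LabelInfo("", ("pragmatics",)),
    # Assumption for illustration only: a mixed-function label with two domains.
    "vbl. sb.": LabelInfo("verbal substantive/gerund", ("morphology", "syntax")),
}


def labels_in_domain(domain: str) -> List[str]:
    """All label strings attributed to the given domain, in alphabetical order."""
    return sorted(s for s, info in LABELS.items() if domain in info.domains)


def needs_manual_review() -> List[str]:
    """Labels attributed to more than one domain, queued for manual revision."""
    return sorted(s for s, info in LABELS.items() if len(info.domains) > 1)


print(labels_in_domain("phonology"))
print(needs_manual_review())
```

Storing the domains as a tuple rather than a single value keeps room for labels that resist a clear-cut attribution, which is where manual revision comes in.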

3.3 Analysing counties and regions

Since Wright's EDD has focussed on dialect geography, the retrieval of dialectal attribution will no doubt be the dominant routine of later work with the Dictionary. The information provided by Wright within this domain comes in three different shapes: first, dialects are referred to in terms of counties, mostly in the three-letter form of what is now generally called the Chapman County Code. [9] The many written sources which Wright used as a basis for his Dictionary (i.e. the glossaries mentioned in the introduction above) are also referred to by these codes. Secondly, reference is often made, again in the form of abbreviations, to larger areas, such as England, Scotland, Wales, Ireland, the North (of England), Scotland's South, etc., as well as to selected overseas regions of the then colonies or ex-colonies, such as America, Canada and N.S.W. (= New South Wales) or the whole of Australia. And thirdly, a considerable amount of information on the dialectal distribution of words is provided in uncoded, sometimes fuzzy terms, such as shown in Table 7:

Table 7. 'Translation' of fuzzy dialect data (selection).

(x) & (y) counties | (n., e., s., w., sw., se., ne., nw.) & (n., e., s., w., sw., se., ne., nw.) counties
(x) also (y) | region x/region y
different counties | different counties
in gen. use throughout dial. exc. in (x) | in general use throughout dialects except in (region)

Such circumlocutionary information, lest it be classified as totally useless and irrelevant, will be subject to a clarifying re-phrasing along the lines of the other, more precise and coded types of dialectal reference. Accordingly, we have 'translated' Wright's phrase "many English counties" into the Chapman tag Eng plus a tag for partial (Eng_part). The logic behind this notation is that there may be references to an entire region or only a part of it. Moreover, the explicit exclusion of certain areas ("in Scotland, but not in the Highlands") inclined us to use a third option in addition to total and partial, namely not. The team members working on the Innsbruck project have not yet fully solved the problems just referred to, but we are confident that a large amount of dialectal information can be elicited from re-definitions of any fuzzy reference in line with given or newly defined codes.
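
The three options total, partial and not can be sketched as a small tagging scheme. This is a tentative illustration only: apart from the tag Eng_part mentioned above, all names in the code (`Scope`, `AreaRef`, `FUZZY_PHRASES`, `translate`, the codes `Sc` and `Highlands`, and the suffix `_not`) are hypothetical and merely show the principle.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Scope(Enum):
    TOTAL = "total"   # the whole area is meant
    PARTIAL = "part"  # only an unspecified part of the area
    NOT = "not"       # the area is explicitly excluded


@dataclass(frozen=True)
class AreaRef:
    code: str
    scope: Scope = Scope.TOTAL

    def tag(self) -> str:
        """Plain code for total reference, code plus suffix otherwise."""
        if self.scope is Scope.TOTAL:
            return self.code
        return f"{self.code}_{self.scope.value}"


# Hand-made 'translations' of fuzzy phrases into coded references.
FUZZY_PHRASES: Dict[str, List[AreaRef]] = {
    "many English counties": [AreaRef("Eng", Scope.PARTIAL)],
    "in Scotland, but not in the Highlands": [
        AreaRef("Sc"),                    # hypothetical code
        AreaRef("Highlands", Scope.NOT),  # hypothetical code
    ],
}


def translate(phrase: str) -> List[str]:
    """Return the coded tags for a known fuzzy phrase, or an empty list."""
    return [ref.tag() for ref in FUZZY_PHRASES.get(phrase, [])]


print(translate("many English counties"))
print(translate("in Scotland, but not in the Highlands"))
```

An unknown phrase yields an empty list, so that untranslated material can be collected for manual treatment instead of being silently miscoded.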

Newly-defined codes - beyond the three-letter codes that Wright has used already - would be needed in view of the occasional reference to larger areas of a local, national or continental dimension (Australia). This is the second problem that Wright's dialectal references confront us with.

The first step towards solving this problem is terminological clarity. Yorkshire, Essex and Stirling are counties; the 'North' and 'South-West' (of England), but also Australia and the USA, are what we have defined as "regions". [10]

The second step is to keep historical and present-day names clearly apart. As regards the counties, since we are only concerned with the historical British and Irish counties and their names as they existed before the reforms from the 1960s to 1990s, we will use a self-made historical map of English counties, such as Map 1:

Map 1. Historical counties and dialect regions of England, Scotland and Wales.

A comparison of this map with maps of the present administrative partitions has shown considerable discrepancy. For example, what used to be Yorkshire now consists of, or comprises parts of, The East Riding (U), The City of Kingston-upon-Hull (U), North Lincolnshire (U), North-East Lincolnshire (U), North Yorkshire, York (U), ten metropolitan districts of Greater Manchester, four "unitary authorities" of Humberside (among these the East Riding of Yorkshire), etc. - 37 units altogether. What makes things more complicated from a historical point of view is the fact that some of these names refer to so-called county boroughs, others to "metropolitan districts" or "metropolitan counties" (M); Greater London falls into a large number of so-called London Boroughs, and many towns and cities are "unitary authorities". [11] To avoid confusion, we will strictly avoid these modern administrative units of Britain.

A clearly historical approach also seems advisable in view of the fact that dialects and their attribution to counties and regions have changed considerably over the last century, particularly in favour of sociolinguistic parameters - so much so that some scholars tend to question the value of traditional dialect atlases altogether. [12] And many dialectologists have supported the notion of transitional areas between dialects, for example, Francis (1983:5-7, 150-158).

However, works on dialect geography such as the Linguistic Atlas of England (Orton, Sanderson and Widdowson 1978) were written nevertheless, and all more recent dialectologists have had to come to terms with the problem of fuzzy edges in dialect attribution. Faced with this problem in the SPEED project as well, I see two possible ways out:

  1. If there is no clear correlation between a feature or word and a specific English county, we will take refuge in the tag "partial", as mentioned above.
  2. In order to correlate the detailed information given on the 39 (historical) English counties - to mention only these - with information given on neighbouring counties, we follow Wright, both in his EDD and in his English Dialect Grammar (1905), in working with "regions", i.e. areas of medium size between the counties and the nations. In line with Wright's and his time's standard terminology (Ellis 1912), we will distinguish eight English, seven Scottish, two Welsh and three Irish regions, plus Northern Ireland (as one region) and English-speaking overseas regions.

At the moment we are working on the implementation of a retrieval routine based on the correlation of regions and their counties. The computer will thus 'know' that Cumberland is in the North of England and Angus in the East of Scotland. A query for, say, Northern words or features will activate all information on northern counties available in the EDD.
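
The principle of such a routine can be sketched in a few lines. The example is a simplified assumption, not the routine under development: the names `REGION_COUNTIES`, `expand_area` and `query_area` are hypothetical, the list of northern counties is incomplete and given for illustration only, and the headwords are placeholders rather than real entries.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Illustrative and incomplete: a region and some of the counties it comprises.
REGION_COUNTIES: Dict[str, Set[str]] = {
    "North of England": {"Cumberland", "Northumberland", "Westmorland"},
}


def expand_area(area: str) -> Set[str]:
    """A region stands for itself plus its counties; a county only for itself."""
    return REGION_COUNTIES.get(area, set()) | {area}


def query_area(entries: Iterable[Tuple[str, Set[str]]], area: str) -> List[str]:
    """Return the headwords whose dialect areas overlap with the queried area."""
    wanted = expand_area(area)
    return [lemma for lemma, areas in entries if areas & wanted]


# Placeholder headwords with their attributed areas.
entries = [
    ("WORD-A", {"Cumberland"}),
    ("WORD-B", {"Essex"}),
    ("WORD-C", {"North of England"}),
]
print(query_area(entries, "North of England"))
```

A query for the region thus retrieves both the entries attributed to the region as a whole and those attributed to any of its counties, whereas a query for a single county remains narrow.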

4. Conclusion: possibilities almost unlimited - invitation to use the EDD

Aware that Wright's 5,000-page Dictionary has gaps and - unavoidably - mistakes, the team members of the project SPEED are hopeful that the electronic version will allow welcome access to an enormous amount of dialectal historical material. We have outsourced the implementation of the query routines to programmers, who are already working with us on the mode of combined retrievals of morphemes, parts of speech, usage labels, and other features. Given the poor state of diachronic Late Modern English dialectology, the English Dialect Dictionary, once available on the Web, will represent the state of the art for this branch of historical English linguistics. My Innsbruck team and I hope that the international community of researchers will accept the new platform as their own and start thinking about the many philological questions that can be raised once the platform can be accessed. Access can be expected in the first half of 2008. As I prepare this paper for publication (November 2006), a beta version of the EDD, used for correction, is already available.


[1] Cf. Barnes 1886, Dickinson and Prevost 1879, and Long 1886.

[2] An interesting source on Wright's life and career is the biography written after his death by his widow, Elizabeth Mary Wright (1932).

[3] For a detailed description of Wright's childhood, see Holder 2004:229-241.

[4] Cf. Romaine 1998:16, 48-50, about the Romantic and nationalist motivations.

[5] Cf. Francis 1983:150-158, for a detailed theoretical discussion; also Viereck and Ramisch 1997.

[6] The book deals with the whole history of the English language, so there is not much space for Late Modern English.

[7] This is evident from a check of relevant keywords in the MLA Bibliography, which covers the time span from 1961 to 2005.

[8] Barley-hood 'a fit of ... drunken, angry passion'; billyhood 'brotherhood'.

[9] This standard of abbreviating the names of the pre-1974 British counties was compiled by Dr. Colin Chapman in the early 1970s (cf. Chapman 1993). It is surprising that Wright anticipated this coding system.

[10] We first thought of keeping apart 'regions', e.g. North, and 'nations', such as the USA, but since reference to countries or nations as a whole is extremely rare, we preferred to merge the two levels after all.

[11] For such details cf. the article "Administrative Areas of England". 25.4.2006. http://www.genuki.org.uk/big/Regions/England

[12] Trudgill has summarized these tendencies of the last century, calling the concept of 'pure' homogeneous dialect "largely a myth": "all language is subject to stylistic and social differentiation" (Trudgill 1983:37).


