Editing conventions

Parameter coding

Each interview is provided with five parameter codes adapted from the Helsinki Corpus of English Texts (HC). The parameters give the following information:

  • B: the HD file code (<B DICAM13>)
  • N: the county, village and informant information (<N CAM_LANDBEACH_SJ>)
  • Y: the age(s) of the informant(s) (<Y 86>)
  • X: the sex of the informant(s) (<X FEMALE>)
  • H: the occupation(s) of the informant(s) (<H HOUSEWIFE>)
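As a rough illustration of how such codes could be read by a program, the following minimal sketch collects them into a dictionary. It assumes that every code has the shape <KEY value>, as in the examples above; the function name parse_parameters and the descriptive field names are hypothetical and are not part of the corpus format.

```python
import re

# Assumed shape of a parameter code: "<KEY value>", e.g. "<Y 86>".
PARAM_RE = re.compile(r"<([BNYXH])\s+([^<>]+?)\s*>")

# Illustrative field names (hypothetical, not defined by the corpus).
PARAM_NAMES = {
    "B": "file_code",
    "N": "informant_info",
    "Y": "age",
    "X": "sex",
    "H": "occupation",
}


def parse_parameters(header: str) -> dict:
    """Collect every <KEY value> code found in the header text."""
    return {PARAM_NAMES[key]: value for key, value in PARAM_RE.findall(header)}


# Sample header built from the example codes listed above.
header = "<B DICAM13>\n<Y 86>\n<X FEMALE>\n<H HOUSEWIFE>"
params = parse_parameters(header)
print(params)
```

The values are kept as strings, since the Y and H codes may describe more than one informant.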

After the main parameter codes the following details can also be seen:

  • the digital archive code (Dig. CAM32A)
  • the length of the tape in minutes (47:37 min)
  • the year or date of the recording with the fieldworker's initials (Rec 1974 by AO)
  • the initials of the transcriber (Transcribed by AO)
  • the total and the stripped word counts (Our WC 6,075, stripped WC 5,796)
  • the number of archive pages (41 archive pages)
  • the date of the final computer manuscript (CMS 9.1.2001)


<Y 86>

[Dig. CAM32A. 47:37 min. Rec 1974 by AO. Transcribed by AO. Our WC 6,075, stripped WC 5,796. 41 archive pages. CMS 9.1.2001]
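The bracketed detail line is regular enough to be read mechanically. The sketch below is only an illustration based on the single example line above: the field names and the function parse_details are hypothetical, and other interviews may word their detail lines differently.

```python
import re

line = ("[Dig. CAM32A. 47:37 min. Rec 1974 by AO. Transcribed by AO. "
        "Our WC 6,075, stripped WC 5,796. 41 archive pages. CMS 9.1.2001]")

# One pattern per detail; field names are illustrative assumptions.
FIELD_PATTERNS = {
    "archive_code": r"Dig\. (\S+?)\.",
    "duration": r"(\d+:\d+) min",
    "recorded": r"Rec (.+?) by [A-Z]+",
    "fieldworker": r"Rec .+? by ([A-Z]+)",
    "transcriber": r"Transcribed by ([A-Z]+)",
    "total_wc": r"Our WC (\d+(?:,\d{3})*)",
    "stripped_wc": r"stripped WC (\d+(?:,\d{3})*)",
    "archive_pages": r"(\d+) archive pages",
    "cms_date": r"CMS (\d+\.\d+\.\d+)",
}


def parse_details(text: str) -> dict:
    """Extract whichever details are present; missing ones are skipped."""
    details = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            details[name] = match.group(1)
    # Convert the counts to integers, dropping the thousands separator.
    for key in ("total_wc", "stripped_wc", "archive_pages"):
        if key in details:
            details[key] = int(details[key].replace(",", ""))
    return details


details = parse_details(line)
print(details)
```

The recording year and the manuscript date are left as strings, so that no particular date format is imposed on them.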


The corpus contains orthographic transcriptions of dialect speech. The most important aspect of creating a uniform and coherent corpus is to ensure that the recorded speech data are presented faithfully and consistently. This means transcribing exactly what the person said, regardless of whether it follows the so-called 'rules' of the standard. Thus all the utterances of the informants in the dialect corpus are transcribed as carefully as possible, including repetition, hesitation, interrupted words and inarticulate sounds, i.e. features that are not present in the written standard language. The transcriptions in WordCruncher are untagged and contain no IPA (International Phonetic Alphabet) symbols.

Then = the ol' Turner started, a big red bus, like the buses we got now. And = course I mean they used to = that was just the jobs = to go on that an' course what we = whatever did we used to pay to go on that for a start, very, very little, well, I mean what we have to pay now. [LAUGHING] Yes. (Editorial conventions are discussed below.)

The principles of compilation and the editorial conventions used in the Helsinki Corpus of British English Dialects (HD) follow those in the Helsinki Corpus of English Texts (HC), differing mainly in the emphasis on spoken material (HD) versus written material (HC).

Since no standard, universally accepted methodology for transcription exists, transcribers must make their own decisions on practical matters. One problem the transcriber faces is the trade-off between detail and readability: the transcription should retain enough information to support linguistic analysis but remain simple enough to read. The transcriber must also decide to what extent to describe discourse features, e.g. overlapping speech, prosody and hedges. An overly detailed description of discourse makes the transcription difficult to use in research other than discourse analysis. In addition, genuine dialect features in the lexicon must be distinguished from so-called "eye-dialect", i.e. words that are spelled so as to look like dialect but whose pronunciation does not actually differ from standard speech (ev'ry, hun'red, gran'mother).

A typical feature of dialect speech is that it proceeds in long sequences of paratactic units, with little connection between pauses and grammatical boundaries, and often without any indication of changes in topic. Transcribers therefore need to define the concept of the sentence in the context of their research. Since spoken language rarely follows the grammatical structuring of sentence and clause elements, some other method must be used to describe the speech unit (e.g. Anna-Liisa Vasko uses the term meaning unit in her research). Nevertheless, orthographic transcriptions of dialect speech conventionally contain sentence-final punctuation marks (full stops, question marks and exclamation marks). These can reflect a variety of cues (e.g. pauses in the speech, falling intonation) in addition to grammatical structure. Thus, although the punctuation in dialect transcription is intended to help the reader, it does not necessarily follow the rules of the standard.

Sample of transcription from the Cambridgeshire subcorpus:
Ditches as well. They used to dug out, they uset' clean all around the ditches out, by hand = spade and shovel.
Yeah. Now they got mechanic diggers now = do all on it. 'At 's done away all that. [LAUGHING] Ah yeah. An' they don' supply half the labour = not half the labour 's supplied now, I known twenty-five worked down here where I been work.
Twenty-five there used to be down there. Now there 's eight = run the lot. That 's includin' stockman an' all. They let you know the difference roun', in farmin' today. Ha ha. Yeah. (West Wickham CC)

When referring to a passage from the corpus, the village and the informant's initials are given in brackets at the end of the passage.
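A passage reference of this kind can be separated from the passage text with a short pattern. The sketch below assumes that the reference is the final parenthesised item, that the initials are written in capitals, and that several informants are joined with a plus sign, as in (Rampton TR+EF); the function name split_reference is hypothetical.

```python
import re

# Assumed shape: "(Village Name INITIALS)" or "(Village AB+CD)" at the end.
REF_RE = re.compile(r"\(([^()]+?)\s+([A-Z]+(?:\+[A-Z]+)*)\)\s*$")


def split_reference(passage: str):
    """Return (text, village, list of initials); no reference gives (passage, None, [])."""
    match = REF_RE.search(passage)
    if match is None:
        return passage, None, []
    text = passage[:match.start()].rstrip()
    return text, match.group(1), match.group(2).split("+")


text, village, initials = split_reference("Ha ha. Yeah. (West Wickham CC)")
print(village, initials)
```

A parenthesised comment that does not end in capital initials is left in the passage text.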

Editorial conventions used in HD are as follows:

  • The people referred to are anonymised by replacing their names with asterisks, one for each syllable of the name. Only exceptionally, for the sake of the clarity of the example and the interpretation, is the full Christian name or surname of the person in question given.
  • The initials of the informant are given before the example if the example is an extract of a dialogue between two informants, e.g. [CM:] What year would that be when I went up = the co-op? - [EM:] When I went in hospital. If the recording consists of a sole informant, his/her speech is not labelled. The informant's speech is written in small letters. If an informant's friend or relative speaks during the interview, their replies are also labelled, e.g. [MRS H:] in the text sample below.
  • If the interviewer is not the fieldworker, his/her questions are capitalised and labelled with his/her initials. The fieldworker's questions are labelled with [Q:].

    [ES:] Slope Alley.

  • = indicates a break in speech that occurs where it is not expected, i.e. not on natural punctuation boundaries: Yes, I were = born here = in this 'ouse. The symbol also replaces punctuation in the case of an unusually long break at punctuation boundaries. The length of the break is not indicated.
  • A break in the recording is indicated with [BREAK].
  • Capital letters in square brackets are used for additional remarks concerning noise, etc., e.g. [COUGH], [DOG BARKING].
  • %---% indicates that the passage is unintelligible. The length of the passage is not indicated. A tentative interpretation may also be given between the percent signs to show how the passage sounds: %have gone%.

The apostrophe is used to indicate sound-dropping in word-initial position ('way for away), and word-final position (tha' for that).
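For users who want plain running text, the editorial markers described above can be removed with a few substitutions. The following is a minimal sketch only: the function strip_markup is hypothetical, it is not the procedure behind the corpus's own stripped word counts (which is not described here), and it leaves the asterisks of anonymised names and the apostrophes untouched.

```python
import re


def strip_markup(text: str, keep_tentative: bool = True) -> str:
    """Remove editorial markers and return plain running text."""
    # Square-bracket items: remarks such as [COUGH] and labels such as [EF:].
    text = re.sub(r"\[[^\]]*\]", " ", text)
    # Unintelligible passages; must be removed before tentative readings.
    text = re.sub(r"%-+%", " ", text)
    # Tentative interpretations: keep the words or drop them entirely.
    text = re.sub(r"%([^%]+)%", r"\1" if keep_tentative else " ", text)
    # Unexpected breaks in speech.
    text = text.replace("=", " ")
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


# Sample composed from the illustrations given in the conventions above.
sample = "Yes, I were = born here = in this 'ouse. [COUGH] %---% %have gone%"
plain = strip_markup(sample)
print(plain)
print(len(plain.split()))
```

Whether tentative interpretations should count as words is a research decision, which is why the sketch exposes it as the keep_tentative option.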

Sample content from the Cambridgeshire subcorpus displaying the variety of editorial conventions used in the corpus:
[EF:] No.
[TR:] No, nor shan't I. I haven't been for several years =
[EF:] %---% I 'm too old to go with 'orses.
[TR:] = I ain't been for several years.
[EF:] [SIMULTANEOUSLY] %---% horsekeeper.
[TR:] Th' last time I were went t' my cousin *'s, somebody else wanted to go = "Well," I said, "you =
[MRS H:] [SIMULTANEOUSLY] Nobody kept horses like that.
[TR:] = you take him an' I 'll stop at home." And when they were comin' home they had to pull in a lay-by, an' as the big boaters went by the water, went right over th' top of their car. I were glad I didn't go.
[EF:] I used to have twelve.
[EF:] Five in the mornin' till six at night = sometimes seven.
[TR:] Ah, I 'm pleased t' see you, **. I I knowed you were about [EF LAUGHING]. Yeah, if ** * 's a-comin', you come last Saturday night, didn't you? (Rampton TR+EF)
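A labelled dialogue such as the one above can be divided into speaker turns programmatically. The sketch below assumes that each turn starts on its own line with a label in capitals followed by a colon, and that [SIMULTANEOUSLY] directly follows the label when speech overlaps; the function parse_turns and the dictionary keys are hypothetical.

```python
import re

# Assumed label shape: "[EF:]" or "[MRS H:]" at the start of a line.
LABEL_RE = re.compile(r"^\[([A-Z][A-Z ]*):\]\s*(.*)$")
OVERLAP = "[SIMULTANEOUSLY]"


def parse_turns(lines):
    """Return one dictionary per labelled line; unlabelled lines are skipped."""
    turns = []
    for line in lines:
        match = LABEL_RE.match(line.strip())
        if match is None:
            continue
        speaker, speech = match.groups()
        overlap = speech.startswith(OVERLAP)
        if overlap:
            speech = speech[len(OVERLAP):].lstrip()
        turns.append({"speaker": speaker, "overlap": overlap, "speech": speech})
    return turns


# Lines taken from the Rampton sample above.
sample = [
    "[EF:] No.",
    "[TR:] No, nor shan't I. I haven't been for several years =",
    "[EF:] [SIMULTANEOUSLY] %---% horsekeeper.",
    "[MRS H:] [SIMULTANEOUSLY] Nobody kept horses like that.",
]
turns = parse_turns(sample)
for turn in turns:
    print(turn["speaker"], turn["overlap"], turn["speech"])
```

Recordings with a sole informant carry no labels, so for those the function returns an empty list and the text would have to be treated as a single speaker's speech.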