Inferring syntactic variation and change from the Newcastle Electronic Corpus of Tyneside English (NECTE) and the Corpus of Sheffield Usage (CSU)

Joan C. Beal, University of Sheffield
Karen P. Corrigan, Newcastle University


Taking the point made by Laurie Bauer (2002: 107-108) that "we do not have public electronic corpora that would allow us to investigate differences in the syntax of Newfoundland and Vancouver Englishes, or of Cornish and Tyneside Dialects", this paper demonstrates ways in which corpora derived from previously-conducted sociolinguistic surveys can be used to make such comparisons. In particular, we report on research which examined data from the NECTE corpus and the Corpus of Sheffield Usage in order to investigate diachronic and/or diatopic variation in relative markers, adverbials and quotatives.

1. Introduction

The research reported in this paper arose as a response to the following statement by Laurie Bauer:

On the whole corpora have been built for national varieties of English rather than for regional dialects within one country. Thus we do not have public electronic corpora that would allow us to investigate differences in the syntax of Newfoundland and Vancouver Englishes, or of Cornish and Tyneside Dialects. (Bauer 2002: 107-108)

Although we do not as yet have corpora that would allow us to compare the specific varieties mentioned by Bauer, there has been considerable progress in the compilation of corpora of non-standard regional and national varieties of English in the first decade of the 21st century, as can be seen from the range of papers in Beal, Corrigan & Moisl (2007). Corpora of regional varieties currently being compiled include the Helsinki Corpus of British English Dialects. National corpora, particularly the sub-components of the International Corpus of English (ICE) project, are designed to be of similar size and to consist of rather specific genres so that data can be systematically compared across corpora. Others, however, consist of various types of legacy data collected over different time periods and compiled using quite idiosyncratic methodologies which have been ‘rescued’ from obsolescent formats and then made available for new research agendas (see Kretzschmar et al. 2006, Beal 2009). One of the problems envisaged by Bauer is that corpora collected for different purposes may not be comparable as far as their data-sets are concerned though they may appear to be so on the surface. In this paper, we consider whether we can therefore successfully use legacy data to infer variation and change.

There is, indeed, considerable disparity in the models and methods used both in the compilation of digital corpora and in their subsequent encoding and analysis. Not surprisingly, this is largely because the underlying theoretical goals and assumptions of the researchers are quite distinctive.

(1) There are marked differences, for instance, in the nature of the data recorded. Thus, the Corpus of Sheffield Usage (CSU), Freiburg English Dialect Corpus (FRED), the Newcastle Electronic Corpus of Tyneside English (NECTE), the Roots of English corpus compiled by Sali Tagliamonte (Tagliamonte Forthcoming) and the data sets compiled by Jenny Cheshire (1982) for her study of variation in Reading are all corpora of spoken English. However, FRED differs from most other corpora in that it was primarily collected from oral history projects rather than by sociolinguistic interview. Even the 1960’s half of NECTE differs somewhat from the 1990’s half in that the latter used a dyadic interview technique during which the interviewer was much more minimally present than in the more classic one-to-one style of the earlier material. The ICE corpora are unique as regards the range of data available for mining. Each ICE corpus includes 300 texts of spoken language and 200 texts of written language (where ‘text’ in each instance is defined as ‘c. 2000 words’), yielding a total in each sub-component of 1 million words. The genres from which these texts originate are highly diverse and include the full gamut from face-to-face interviews and unscripted political interviews through to administrative and regulatory prose.

(2) The corpora also vary in the levels of phonetic, lexical, grammatical and semantic annotation that they encode. NECTE, for instance, has almost no semantic annotation, though this is clearly a crucial level to encode for Cheshire, given her focus on socio-phonetic variation as well as issues of discourse style and pragmatics. The primary concern to date of FRED, the CSU and the Roots of English corpora has been with grammatical variation, so none of these attends to the phonological component in the ways that ICE-Ireland, Cheshire’s corpora and NECTE do. Although Kallen and Kirk, who have compiled the Irish sub-component of ICE, are now working on annotating their transcribed orthographic corpus with pragmatic coding, they are also undertaking a mark-up of certain aspects of its phonology. Their annotation focuses, in particular, on intonation patterns so that while the end-product will resemble aspects of Cheshire’s and the NECTE corpora, ICE-Ireland as a socio-phonetic resource will be incomplete in certain key respects.

(3) There are also differences between corpora with respect to the manner in which information is accessed and retrieved and the way in which it is displayed. Thus, the CSU and FRED, at least at this point, are essentially private corpora, as are the Reading and Hull databases collected in the 1980s and 1990s by Jenny Cheshire, Paul Kerswill and Ann Williams (Cheshire 1982, Kerswill and Williams 2000, 2005). By contrast, the ICE corpora and NECTE were developed specifically to function as public resources.

The points of contrast that we have just highlighted may lead the reader to conclude that we should abandon at the outset any attempt to infer variation and change by using these corpora comparatively. Nevertheless, we would like to demonstrate that, despite the disparities between these corpora, comparative work with them should not be ruled out and can provide important insights, provided we are mindful of the potential limitations of the exercise.

2. The CSU and NECTE: Fieldwork and corpus construction

The next section of our paper will be divided into two parts: (1) will outline the fieldwork and corpus construction methods of the CSU and NECTE and then (2) will demonstrate some of the results we have obtained from pilot studies in which we compared the corpora with each other and with research output from the corpus of York English compiled by Sali Tagliamonte for the Roots of English project and from Jenny Cheshire’s earlier work in Reading. The particular features we will discuss are: the relativization system; dual-form adverbs; and quotatives.

Although studies comparing the Brown and LOB corpora, for instance, have revealed very few significant differences between American and British Englishes, the findings of other research on different databases have demonstrated that the contrasts between varieties of English are not at all trivial. With a view to revealing just how crucial digital corpora of the right kind can be in highlighting these differences, we turn now to examining the fieldwork methods and corpus construction techniques behind CSU and NECTE before presenting some preliminary morphosyntactic analyses of these.

Two major surveys of Tyneside English and one of Sheffield English were carried out in the second half of the 20th century. The Tyneside Linguistic Survey (TLS), for which fieldwork was conducted in the late 1960s, and the Phonological Variation and Change (PVC) project, undertaken in 1994, were carried out in Gateshead and Newcastle respectively. The data from both these surveys has been incorporated into NECTE. TLS informants were selected by means of a random sample stratified by “rateable value per dwelling by polling district” (Pellowe et al. 1972: 24) and free-form interviews were recorded in subjects’ homes by a single interviewer, who had lived in Gateshead all his life and spoke with a local accent. The average length of interview was 30 minutes. Of these interviews, 83 are extant, though 150 were originally planned. The PVC data was collected in the Chapel House and Newbiggin Hall areas in the west of Newcastle. Informants were selected using a social network model (Milroy & Milroy 1985), “divided to sample the community along parameters of age, gender and broadly-defined socio-economic class” (Watt & Milroy 1999: 27). They were recorded in dyadic pairs (friends or relatives), with the fieldworker deliberately keeping out of the conversation and remaining as inconspicuous as possible so as to mitigate the observer’s paradox. Although only 32 speakers make up the PVC corpus, a total of 20 recordings were made, involving 39 speakers (one was recorded twice with different partners) and all of these have been incorporated into NECTE. [1]

The Survey of Sheffield Usage (SSU) was conducted in 1981 under the direction of Graham Nixon, one of the initial team of researchers for the TLS. The sampling methodology and interview structure were, therefore, identical to those of the TLS, but, instead of using one interviewer, the SSU had several students conducting about 10 interviews each. The average length of SSU interviews, like the TLS ones, is 30 minutes. 100 interviews were conducted, and the original tapes are catalogued and stored in the Archives of Cultural Tradition, University of Sheffield. In 2000–2001, 52 of these interviews were digitized (those whose subjects were Sheffield-born) and in 2002, a small grant from the British Academy was secured in order to have these orthographically transcribed. To date, there are 15 complete transcriptions and it is this electronic corpus which forms the database of Sheffield material analysed in this paper.

In order to ensure maximum accessibility, and ultimately transferability of these materials for comparative purposes by ourselves and other researchers, the sound recordings were transcribed according to the ‘tried-and-true’ transcription protocols adopted in other successful digital dialect corpus projects such as those by Poplack and Tagliamonte in the late 1980s and early 1990s. We also incorporated the more recent transcription protocol developed by Tagliamonte in the late 1990s for the York projects (see particularly Tagliamonte 2007). We are very grateful to her for introducing us to this system and allowing us to adapt it for our own purposes; that is, we subjected a number of features specific to the Tyneside and Sheffield dialects to a more detailed protocol in order to reflect local dialect lexemes and morpho-phonology (see Beal et al. 2007).

There are a number of language-external reasons why one might expect differentiation between the dialects of Newcastle and Sheffield. Newcastle is in the northern part of the old Anglo-Saxon kingdom of Northumbria but Sheffield is a border city, marking the southern limit of Northumbria and the border with Mercia. Sheffield and Newcastle are both on the fringes of what was historically the Danelaw, but we might expect to find more evidence of Scandinavian influence in Sheffield than in Newcastle, because Scandinavian settlement, and therefore language contact, was much less extensive in what is now present-day Newcastle. External contact continued to differ between these communities as late as the Middle Ages. Poussa (2002), for instance, provides evidence that Tyneside, as a coastal region, was considerably more exposed to the influence of Dutch and Flemish dialects via trade links than Sheffield will have been on account of its rather different geographical location. Similarly, Newcastle, but not Sheffield, received considerable numbers of immigrant speakers of Celtic Englishes from both Scotland and Ireland in the late nineteenth century (Beal & Corrigan 2009). All of these factors may have important consequences for synchronic distinctions between these vernaculars and other Northern Englishes, like that epitomized in the York corpus collected by Tagliamonte and others, for example, because York was a central city of the Danelaw and, like Sheffield, has experienced relatively little in-migration throughout its history.

2.1 The relativization system

Recent investigations into the relative formation strategies employed in contemporary spoken non-standard British and extra-territorial Englishes suggest that, in many of these varieties, the typical ratio of WH- to TH- and Ø relatives lags behind that of present-day Standard (English) English and that there has been a dramatic increase in the zero variant over time, hypothesized to be the result of two language-internal factors, namely: (1) ongoing grammaticalization and/or (2) grammatical complexity. In addition, it has been suggested that certain relative variants are differentiated within dialects as a result of language-external factors.

Although the sample size used for this study is relatively small, these tendencies are supported by our findings for both Tyneside and Sheffield in Figure 1:

Figure 1. Relative clause marking in NECTE and the CSU.

The later corpora (the CSU of the 1980s and the PVC of the 1990s) show an increase in the use of the zero relative variant by comparison with the 1960s TLS corpus, the figure for the most recent sub-corpus (PVC) being three times greater than that for the TLS some thirty years earlier.

Examples of zero relative markers from the two corpora are:


(1) I’ve a mother’s still living, she’s a widow (SSU 017)

(2) I’ve a sister ’s over there, she loves stotties (TLS G52)

There are language-external differences too. Thus the Newcastle samples have considerably higher proportions of WH-relatives, whilst the CSU figures are dramatically lower. In the PVC corpus, TH- and WH- relatives are evenly distributed, whilst speakers in the CSU subsample employ the former slightly more frequently than they do the latter. The use of what as a relative pronoun (as in My play what I wrote) is almost nonexistent in the Tyneside subcorpora, but accounts for almost as many relative clauses as the ø variant does in the CSU database.

Examples of wh-, that and what relative markers from the two corpora are:


(3) I was having a relationship which was lacking in loads of stuff (PVCT10)

(4) There’ll be a canny few sixth-formers there who’ll be starting the year anyway (PVCH11)

(5) You always get you-know the odd one like who’ll go to t’ wine-bar (SSU038)

(6) They’ve captured a lot of countryside which is brought into Sheffield now (SSU077)

(7) I’ve got two other sisters that are both working (SSU 015)

(8) Bairns don’t play the games what we did. (TLS G59)

(9) It’s double t’ money what you’re getting at home (SSU 014)

Corpus-based studies of relativization in SE and other English dialects (cf. Ball 1996; Beal 1993; Corrigan 2008; Miller 1988, 1993; and Tagliamonte 2002) have demonstrated the importance of distinguishing between different antecedent types, with regard to both the grammatical category and the animacy of the antecedent. Since non-restrictive relative clauses favour WH- even in the otherwise highly non-standard dialects of Tyneside and Sheffield, and which is categorical in sentential relative clauses in both vernaculars, it is only in restrictive relatives that variation between WH-, TH- and Ø may be constrained by the nature of the antecedent. Hence, Figure 1 has been re-analysed as Figures 2 and 3 in which only restrictive relatives are counted and they are distinguished according to sub-type.
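As a rough illustration of the tabulation that a re-analysis of this kind involves, the following Python sketch filters a set of coded relative clauses down to restrictive tokens and reports the percentage of each marker per antecedent type. It is a minimal sketch under stated assumptions: the tokens are invented toy data rather than material from NECTE or the CSU, and the antecedent labels (`animate`, `inanimate`), the coding scheme and the function name `marker_percentages` are hypothetical choices made for the example.

```python
from collections import Counter, defaultdict

# Invented toy tokens for illustration only (NOT data from NECTE or the CSU).
# Each token is coded as (marker, clause type, antecedent type).
TOY_TOKENS = [
    ("WH", "restrictive", "animate"),
    ("TH", "restrictive", "animate"),
    ("TH", "restrictive", "animate"),
    ("zero", "restrictive", "animate"),
    ("TH", "restrictive", "inanimate"),
    ("what", "restrictive", "inanimate"),
    ("zero", "restrictive", "inanimate"),
    ("zero", "restrictive", "inanimate"),
    ("WH", "non-restrictive", "inanimate"),  # excluded from the count
    ("WH", "sentential", "clause"),          # excluded from the count
]


def marker_percentages(tokens):
    """Return {antecedent: {marker: percentage}} for restrictive relatives only."""
    counts = defaultdict(Counter)
    for marker, clause_type, antecedent in tokens:
        if clause_type != "restrictive":
            continue  # only restrictive relatives show WH-/TH-/zero variation
        counts[antecedent][marker] += 1
    result = {}
    for antecedent, marker_counts in counts.items():
        total = sum(marker_counts.values())
        result[antecedent] = {
            marker: round(100 * n / total, 1)
            for marker, n in marker_counts.items()
        }
    return result


percentages = marker_percentages(TOY_TOKENS)
for antecedent, shares in sorted(percentages.items()):
    print(antecedent, shares)
```

Percentages are computed within each antecedent type, so that each type sums to 100% and distributions can be compared across corpora of different sizes.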

Figure 2. Restrictive relative clause markers in the NECTE corpus (%).

Figure 3. Restrictive relative clause markers in the CSU corpus (%).

A comparison of these figures would seem to confirm the point we made earlier, namely, that the WH- strategy is more prevalent overall in Tyneside than it is in Sheffield. The figures also show the interesting distribution of the what relative in these dialects. Of particular note in this context is that, whilst never a majority choice, the what relative marker is found with all antecedent types in the SSU corpus.

Since an important objective of this paper is to assess the comparative method with respect to other dialectal digital corpora, we thought it would be useful to compare our findings in this regard to those reported for the Reading corpus collected by Jenny Cheshire in the early 1980’s. Interestingly, the pattern of what relative usage across all antecedent types – which we just saw in Figure 3 – is similar to that found in Cheshire’s (1982) study given in Figure 4.

Figure 4. Restrictive relative clause markers in the Reading corpus (%).

The level of overall employment of this strategy is higher in Reading than in Sheffield, but the proportion of use between antecedent types is similar. In this respect, then, the Sheffield dialect is patterning more like a southern variety than the northern Tyneside vernacular appears to be. The Survey of English Dialects (SED), based on data collected in the 1950s from informants then aged 60+, shows what appearing in three Eastern areas: the area around London, the Wash, and around the Humber estuary. There is also more recent evidence that what is found in urban dialects neighbouring that of Sheffield. In particular, Petyt’s 1985 study of the dialect of West Yorkshire (Bradford, Halifax and Huddersfield) gives evidence of 8 informants (out of 106) producing examples of what relatives (Petyt 1985: 238). The what relative could have spread to Sheffield and West Yorkshire from either the Humber or the Wash well before the CSU data was collected, or the phenomenon could be more recent and be due to the influence of London. Clearly, this is an area ripe for further investigation, particularly in the light of Tagliamonte’s (Forthcoming) findings with respect to relativization strategies across the dialects investigated in her Roots of English project.

2.2 Dual form adverbs

Given the significant contribution by Tagliamonte and Ito’s recent research to our understanding of the linguistic and extralinguistic factors governing the usage of dual form adverbs in York English, we thought that it would be an interesting exercise to ascertain the extent to which constraints operating on this variable in York, Sheffield and Newcastle are shared. As such, one might use this kind of evidence to either refute or support the view of Trudgill and others (based entirely on phonological criteria) that northern dialects retain features from earlier stages of English more readily than their southern counterparts do (Trudgill 1990: 65-78). We were also interested to know whether the tendencies reported elsewhere and summarized as (i) to (iv) in Table 1 are matched in either Newcastle or Sheffield (or both). By circumscribing the variable context in exactly the same manner as Tagliamonte and Ito (2002) (making sure that the dual form adverbs were semantically-related, for instance), such a comparison might provide preliminary evidence for a British Northern Cities shift on morpho-syntactic grounds which is reminiscent of the chain vowel shifts reported for the US in Labov (1994). On the basis that Tagliamonte and Ito (2002: 256) also argue for a linguistic constraint on adverb formation such that the choice of the zero variant, for example, is thought to be semantically conditioned, this section of our paper will briefly explore the extent to which this is, as one might predict, a pan-northern phenomenon.

Table 1. Reported patterns of variation and change in the system of adverbial formation (Tagliamonte & Ito 2002: 258).

(i) {-Ø} variant is older and was gradually replaced by the {-ly} variant in SE;
(ii) intensifier adverbs (really, awfully, terribly) and manner adverbs (properly, sorely) appear to have differential status across geographical space;
(iii) the variants have become sociolinguistic markers for certain social groups;
(iv) there is a “propensity for {-ly} with abstract meanings and {-Ø} with concrete meaning”.

Let’s have a look now at dual-form usage in York and compare it with our findings for Newcastle and Sheffield:

Figure 5. Distribution of {-Ø} adverbs by age in York English (after Tagliamonte & Ito 2002: 250, reproduced with permission).

Figure 6. Distribution of {-Ø} adverbs by age in Sheffield and Newcastle.

On the basis that, as we can see in Figure 7, Tagliamonte and Ito’s (2002) research revealed that educated women seemed to spearhead the change towards the standard variant, we thought that it might be useful to ascertain – as far as we could, given the much smaller corpora with which we are currently working – whether gender and education were also important factors in the shift towards the really intensifier in both Newcastle and Sheffield.

Figure 7. Distribution of {-Ø} real by age, sex and education in York (after Tagliamonte & Ito 2002: 252, reproduced with permission).

The results for this are given in Figure 8. The patterns for each community seem to be similar and education, in particular, appears to play an important role with dramatically higher levels of real used by young school leavers, irrespective of their gender.

Figure 8. Distribution of {-Ø} real by sex and education in Newcastle and Sheffield.

Examples of real from the two corpora are:


(10) I think they talk real nice (TLSG518)

(11) Anyway they set me on though fortunately we’d got real pally (SSU 059)

Given the distributions for other zero adverbs in York, as illustrated in Figure 9, Tagliamonte and Ito (2002: 258) propose that the variant “bears one of the classic characteristics of a sociolinguistic marker” in that it is used in the “identification of a particular social group” – in this case the less educated men.

Figure 9. Distribution of {-Ø} other adverbs by age, sex and education in York (after Tagliamonte & Ito 2002: 253, reproduced with permission).

Men who have not had post-16 education use the zero adverb at the highest frequencies, whereas York females, by comparison, are moderate users, irrespective of whether or not they had post-16 education.

In fact, all speakers in the community use the zero adverb some of the time and, as Figure 10 demonstrates, this is also the case for both Newcastle and Sheffield.

Figure 10. Distribution of {-Ø} other adverbs by sex and education in Newcastle and Sheffield.

Examples of zero-marked adverbs from the two corpora are:


(12) I think they get off far too light at school (TLSG044)

(13) I pull them up even though we talk different (SSU039)

There are, however, a couple of interesting patterns in this data which do not replicate the distribution of zero adverbs in York. In the first place, the pattern for Sheffield regarding generalized use of zero variants amongst the entire community more clearly matches that of York than the Newcastle data does, since educated females in NECTE show near-categorical use of {-ly} variants. There is also a clear demarcation in both regions between early and late school-leavers: in NECTE, 11% versus 1% amongst females and 12% versus 4% amongst males; in the CSU, 26% versus 7% amongst females and 15% versus 5% amongst males.
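The size of the demarcation between early and late school-leavers can be made explicit with some simple arithmetic on the percentages reported above. The Python sketch below is illustrative only: the figures are those given in the text, but the assignment of the two CSU pairs to females and males respectively is an assumption based on the order in which they are reported, and the names `REPORTED` and `school_leaver_gaps` are hypothetical.

```python
# Percentages of zero-marked adverbs as reported in the text, stored as
# (early school-leavers, late school-leavers).
# ASSUMPTION: the CSU pairs follow the same order as the NECTE pairs
# (females first, then males).
REPORTED = {
    ("NECTE", "female"): (11, 1),
    ("NECTE", "male"): (12, 4),
    ("CSU", "female"): (26, 7),
    ("CSU", "male"): (15, 5),
}


def school_leaver_gaps(reported):
    """Return the early-minus-late gap, in percentage points, for each group."""
    return {group: early - late for group, (early, late) in reported.items()}


gaps = school_leaver_gaps(REPORTED)
# Group whose early school-leavers show the highest rate of zero adverbs.
highest_early = max(REPORTED, key=lambda group: REPORTED[group][0])

for (corpus, sex), gap in gaps.items():
    early, late = REPORTED[(corpus, sex)]
    print(f"{corpus} {sex}: {early}% vs {late}% (gap of {gap} points)")
print("Highest early-leaver rate:", highest_early)
```

Under the assumed ordering, both the widest gap and the highest early-leaver rate belong to the CSU females.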

In this way, the results for Newcastle especially are similar to those of Macaulay (1991) for Ayr in Scotland, which is not unexpected given Newcastle’s geographical location close to the Scottish border and the number of Scottish migrants that it supported in the nineteenth century. Moreover, the distribution of Ø adverbs in Sheffield, but not in Newcastle, appears to indicate that although the zero adverb does act as a sociolinguistic marker in this community, it is particularly associated with uneducated women rather than uneducated men as it was in York.

2.3 Quotatives

One of the most widely-researched innovations in recent English is the introduction of the ‘new quotatives’, forms such as go, be like and be… all (see, for instance, Buchstaller 2006, Butters 1980, Macaulay 2001, Romaine & Lange 1991, Tagliamonte & D’Arcy 2004, Tagliamonte & Hudson 1999). Butters (1980) suggests that, at that time, the use of go to introduce quotations in narratives was an innovation, but by 1991, Romaine and Lange are already reporting on the displacement of go by be like. The latter construction seems to have been introduced first in American English and to have been more frequently used by younger female speakers, but in the course of the 1990s it spread to Canada, the British Isles, Australia and New Zealand.

Since the NECTE corpus consists of recordings of younger and older speakers from 1969, i.e. before quotative go was introduced according to Butters (1980), and from 1994, when be like was a relatively novel construction in British English, we might expect to find evidence of the introduction of these new quotatives in the corpus.

Park (2004) conducted a real-time study of quotatives, comparing the use of say, go, and be like in a subsample of 15 speakers from the TLS sub-corpus and 20 from the PVC sub-corpus. Examples of quotatives are shown in 14–17 below, where 14 and 15 represent the usage of young speakers recorded in 1994, and 16 and 17 that of speakers recorded in 1969.


(14) Everyone thinks it’s dead strange. Sitting at dancing, and everyone’s going “Oh Emma, who are you going on holiday with?” and she’s going “Her boyfriend” and they were like “ friend’s going on holiday with your boyfriend?” (PVC female, middle-class, age 17)

(15) Can you imagine, it’s just like “Wahay!” ... bouncing off your lip and-that.. Well me-- me sister’s kitten did that to me .. she told us to pick it up, .. so I picked up .. and I turned it round, and it went “Whachoo” and it dug its little claws in beside me .. and it like hooked underneath, .. because I had these really stupid marks on me face and I was like-- and I got any amount of stick from the lads .. “Well how do you get that Brian? oh …I said “No, no” (PVC male, working-class, age 17)

(16) And it eh-- it used to shut at ten then and the landlady come in, now there’s eh two or three chaps come in half-past-nine, they wanted to take our seats, said it was-- that it was their seats. I says “how can it be your seats, we’ve been sitting here since six-o-clock”. Even the landlady come in, she says, “Oh yes they sit-- they sit there.” Well I said, “they’re not getting it tonight.” (TLS male, working-class, age 68)

(17) I went intiv a place y- eh yon side of the Redheugh-Bridge. I-- When I went there they were making these liners, and the gaffer says to me, “Can you make them?” Wey I says, “I should think so like, I served me time.” (TLS male, working-class, age 38)

Park found that the only quotative occurring in his subsample of the TLS sub-corpus was say, as demonstrated by examples 16 and 17 above. Even in the PVC subcorpus, only the youngest speakers (born c. 1978) used go and be like as in examples 14 and 15 above and they tended to use go, be like and say variably. Since only the youngest speakers in the PVC subcorpus showed any variation in their use of quotatives, Park was only able to investigate variation according to gender and social class within this age cohort, as shown in Figure 11. The highest percentage use of both go and be like was from young middle-class females, as in example 14.
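A first-pass tally of quotative frames of this kind can be approximated automatically before manual checking. The Python sketch below is a crude heuristic and not Park's procedure: it counts forms of say, go and be like that are immediately followed by an opening quotation mark in excerpts of the transcribed examples quoted above. The regular expression, the function name `tally_quotatives` and the use of straight quotation marks in the sample strings are assumptions made for the illustration, and the heuristic would miss quotatives with intervening material such as "says to me".

```python
import re
from collections import Counter

# Heuristic pattern: a form of say, go or be like, optionally followed by a
# comma, then an opening (straight) quotation mark. This is an assumption-laden
# sketch and no substitute for manual coding.
PATTERN = re.compile(
    r"""\b(?:
        (?P<say>says?|said|saying)
      | (?P<go>go|goes|going|went)
      | (?P<be_like>(?:am|is|are|was|were|'s|'m|'re)\s+like)
    )\s*,?\s*\"
    """,
    re.IGNORECASE | re.VERBOSE,
)


def tally_quotatives(transcript):
    """Count quotative frames in a transcript by type (say / go / be_like)."""
    return Counter(m.lastgroup for m in PATTERN.finditer(transcript))


# Excerpts of the TLS and PVC examples quoted in the text, reproduced as
# transcribed but with straight quotation marks.
SAMPLE_TLS = '''I says "how can it be your seats, we've been sitting here since six-o-clock". Even the landlady come in, she says, "Oh yes they sit-- they sit there." Well I said, "they're not getting it tonight."'''
SAMPLE_PVC = '''everyone's going "Oh Emma, who are you going on holiday with?" and she's going "Her boyfriend" and they were like " friend's going on holiday with your boyfriend?"'''

print("TLS excerpt:", dict(tally_quotatives(SAMPLE_TLS)))
print("PVC excerpt:", dict(tally_quotatives(SAMPLE_PVC)))
```

On these two excerpts the heuristic finds only say in the TLS speaker's narrative and a mixture of go and be like in the PVC speaker's, in line with the distribution Park reports. Any real study would still require each token to be checked by hand.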

Figure 11. Variation in the use of quotatives in the PVC sub-corpus (from Park 2004, adapted with permission).

Park’s study was limited to a very small sample, but it does corroborate the findings of other scholars who have investigated variation and change in the use of quotatives in English. Butters (1980) considered go an innovation and provides a citation from 1975, which was six years after the TLS data was collected, so it is not surprising that this usage is absent from the TLS subcorpus. Park’s study might lead us to believe that go was introduced later in British English than in US English, but this is not the case. In a separate corpus of adolescent Tyneside usage collected by Beal in 1975, there are instances of quotative go in narratives such as example 18, from a thirteen-year-old girl.


(18) We thought oh yes he’s trying to get a lift here and we were gonna give him a lift and he goes “oh these are no good to me just give them away” we said “oh thank you”.

On the other hand, Park’s finding that use of quotative be like was confined to adolescent speakers in 1994 and used most frequently by young middle-class females confirms the trajectory reported by other scholars: be like was introduced to British English in the early 1990s and young, middle-class females were in the vanguard of this change.

3. Conclusion

The three case studies presented thus far demonstrate that, even when corpora have not been constructed using data from the same dates and using exactly the same methods, useful comparisons can be made. The availability of these corpora allows for:

  • 'Real' and 'apparent' time studies of individual varieties (relatives)
  • Comparison of different geographic varieties (relatives, adverbials)
  • Identification of new patterns (quotatives)

As new corpora of regional and national varieties of English continue to be compiled both from legacy materials and as purpose-built corpora, opportunities for comparative studies such as those presented here will increase such that Bauer’s statement quoted at the beginning of this paper will become obsolete.


[1] Since 2007, additional interviews following the methodology of the PVC project but with the addition of word lists and reading passages have been conducted to form NECTE2. The latter is in the process of being amalgamated with NECTE to form the Diachronic Electronic Corpus of Tyneside English.


