Playing tag with category boundaries

David Denison, Linguistics and English Language, University of Manchester

1. Introduction

The background to this paper is a long-standing interest in linguistic change in English, combined more recently with a theoretical concern with the boundaries between morphosyntactic categories (or word classes), to be worked out in Denison (in preparation). Neither of these interests qualifies me as a corpus linguist. Although I have had some experience of corpus construction, the corpora in question were, with one exception, untagged and text-only: I am only a casual, opportunistic user of tagged corpora. And what I find is that the tagging provided by many corpora can be at the same time a boon, an obstacle and an object of interest in its own right.

For this paper I have taken as case studies five sets of data which fall on or close to word class boundaries. All are interesting for historical reasons, and I am going to look at several corpora to see how they deal with such matters. In the event I will confine discussion to corpora of Present-day English (PDE). The corpora I have consulted are the British National Corpus (BNC), the second edition of the American National Corpus (ANC2), two overlapping corpora from the Survey of English Usage, namely the Diachronic Corpus of Present-Day Spoken English (DCPSE) and the British Component of the International Corpus of English (ICE-GB), and occasionally the tagged Lancaster-Oslo-Bergen Corpus (LOB). [1]

2. Case studies

2.1 Case study 1: N ~ A

One kind of category boundary is that between Noun and Adjective. In certain circumstances, the boundary becomes obscure and there is the possibility of transition from N to A. I have argued that this transition is stepwise rather than all-or-nothing (Denison 2001). Well-known examples are fun and key, more recent ones draft, genius and (perhaps) rubbish (though adjectival use of rubbish may be a revival rather than a complete innovation). Here are some recent examples of draft as adjective:

(1) This is really quite draft at the moment. (attested Nigel Vincent, p.c. 22 Feb 2006)

(2) Subject : RE: Very draft mission statement for the GIS SC (http://lists.oasis-open.org/archives/emergency/200507/msg00008.html 5 Jul 2005)

(3) It's extremely draft (I think Tom wanted me to post it as an article). (http://lxer.com/module/forums/t/22188/ 24 Mar 2006)

The circumstances which license such a transition seem to include:

  • lexical gap = absence of an adjective (morphologically related or otherwise) with appropriate semantics
  • N is, or can be, a mass noun or at least can be used without D (an article in draft, a work of genius)
  • N is semantically gradable

What is the problem? I will illustrate it with key, for which I have much more data; for discussion see Denison (2001) and Leech & Li (1995). Historically, key is certainly a noun. Consider key in its figurative sense, often moving towards the sense 'essential or most important (factor)'. I assume that any tagged corpus would mark a usage like (4) as N:

(4) The oceans are the key_N to understanding changes in global climate <ICE-GB:W2B-025 #3:1> [FTF] [2]

The first edition of OED, the New English Dictionary (1928), did not recognise any adjectival usage for key.

On the other hand, if a corpus contained such comparative and superlative usages as those in (5) and (6) (both internet examples), it would have to mark them as A:

(5) I think my key point is going to be this: girls are not wired to do that kind of stuff … And an even keyer point is the definition of 'stuff' (Blogzkrieg, 19 Apr 2005)

(6) Keyest paragraph in the papers … (abc News, 17 May 2005)

ICE-GB doesn't have any comparatives or superlatives of key, but BNC has one, albeit syntactic rather than morphological:

(7)
Meirion Rowlands, one of the Ashleys' most key_AJ0 appointments of this time, was well known as the local prizewinning sheep shearer; (BNC GU9 7)

The word key in (7) is duly tagged as an adjective, AJ0.

The question is, how do we get — I mean, how did the English language get — from the N of (4) to the A of (5)-(7)? There are contexts where the N~A distinction can be neutralised. The clearest case is that of pre-nominal modifier:

(8) Doctors are always key_N power brokers in the NHS <ICE-GB:W2A-014 #78:1> [FTF]

An older corpus like LOB tags pre-modifying key consistently as NN in a key ring but as JJB in a key role (strings explicitly mentioned in the User's Manual). There is an interesting discussion in the Manual of JJB, which is a subcategory of JJ, Adjective:

The tag JJB is used for adjectives which are restricted to attributive position before a noun. […] Unfortunately, the distinction between adjective and noun is not always as clear as in the examples given so far. The typical JJB form neither really satisfies the criteria for adjectives nor for nouns. The adjective tag is assigned by 'default', since attributive position is a fundamental characteristic of adjectives, while it is only one of the subsidiary positions of a noun. The tag JJB is applied (1) provided that the form is excluded from clear nominal positions, and (2) if a corresponding noun exists, provided that the attributive use is clearly predominant and/or clearly differentiated in meaning from the noun. See further the treatment of JJB in the preceding section. (Johansson 1986)

So LOB is recognising that key has developed an adjectival use but is assuming — correctly for LOB itself but wrongly for PDE — that a predicative use is missing. Notice that some of my other examples (fun, luck) retain more noun properties than key.

ANC2 comes with two alternative taggings: Biber and Penn Treebank = Hepple. In ANC2, premodifying key is tagged as JJB when using the Biber tagset but as JJ in the Penn Treebank tags (where JJB is unavailable) — and that goes as much for key rings as for key role, types which don't seem to be distinguished in ANC2.

How do BNC, ICE-GB and DCPSE tag such uses? Here are two DCPSE examples (both, as it happens, from ICE-GB) involving the string key issue(s):

(9)
I've never the thought the_ART key_N issue_N has been how do we fea defeat the poll tax <DCPSE:DI-D14/ICE-GB:S1B-034 #117:1:K> [FTF]

(10)
These become key_ADJ issues_N in which the two groups become further and further divided <DCPSE:DI-E07/ICE-GB:S1B-047 #86:1:B> [FTF]

In these sense units, the key issue in (9) is categorised as ART + N–N, with key and issue "ditto-tagged" in a way that suggests a kind of compound N, while key issues in (10) is two words, ADJ + N. The choice between the analyses seems to be fairly random (9 instances of the former to 14 of the latter in DCPSE), not obviously conditioned either by syntactic context or by judgements of degree of lexicalisation. [3] As for BNC, it has eight tags for the word key preceding a noun:

  • AJ0 (×5892)
  • AJ0-NN1 (×786), the majority of which appear to be adjectives, but many of which are not
  • NN1 (×197)
    • when part of low key and also hot key, key board, key caps, key card, key chain, key ring, key signature and many such combinations involving key in its lock (physical or electronic), keyboard and music senses, though sometimes fortuitously similar pairs were classified as NN1 (key change, key pattern, key system) both when they should have been and when they shouldn't:

(11)
A key_NN1 change_NN1 to previous proposals was the suggestion that the executive council be directly elected by all South African citizens. (BNC HLJ 244)

(12)
The most unusual type of key_NN1 change_NN1 which is a regular ingredient of Campra's tonal architecture is to be found in one context only. (BNC J1A 791)

  • NN1-AJ0 (×38), which include some clear adjectives
  • NP0 (×35) part of proper name
  • VVI (×4) three out of four of which weren't actually verbs

(13)
More than 275,000 recipe leaflets have been produced and mailed to key_VVI catering customers, giving eight new recipe ideas. (BNC ACR 561)

  • NN1-VVB (×1), also actually an adjective
  • UNC (×1) unclassified

For an explanation of the "ambiguity tags", BNCweb Help offers

Ambiguity tag
A set of two part-of-speech tags (joined by a hyphen) attached to a single lexical item, to indicate that the CLAWS tagging program was unable to reliably distinguish between the two possible word classes. In the BNC World Edition, the ordering of the tags is significant: it is the first of the two tags which is estimated by the tagger to be the more likely.

So there is a rather mixed approach to this word in attributive position, even if we ignore the obvious errors. Most taggers recognise that key in prenominal position can be an adjective, though there are five possible adjective-ish analyses: adjective, attributive-only adjective, uncertain noun or adjective (with priority either way), and nominal compound element. Some corpora in addition recognise that it can remain a noun in prenominal position.
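
The BNC counts listed above, together with the rule that the first member of an ambiguity tag is the one the tagger judges more likely, can be made concrete in a minimal Python sketch. The counts are those reported in this section; the function name `split_tag` and the variable names are illustrative only and are not part of any BNC tool.

```python
from collections import Counter

# Tag counts for "key" before a noun in BNC, as reported in the list above.
KEY_TAG_COUNTS = {
    "AJ0": 5892,
    "AJ0-NN1": 786,
    "NN1": 197,
    "NN1-AJ0": 38,
    "NP0": 35,
    "VVI": 4,
    "NN1-VVB": 1,
    "UNC": 1,
}


def split_tag(tag):
    """Return (preferred, alternative) for a BNC-style tag.

    An ambiguity tag such as "AJ0-NN1" yields ("AJ0", "NN1");
    a simple tag such as "AJ0" yields ("AJ0", None).
    """
    first, _, second = tag.partition("-")
    return first, (second or None)


# Tally tokens by the tagger's preferred (first) reading.
preferred = Counter()
for tag, count in KEY_TAG_COUNTS.items():
    first, _ = split_tag(tag)
    preferred[first] += count

total = sum(KEY_TAG_COUNTS.values())
adjective_share = preferred["AJ0"] / total

print(total)             # all prenominal tokens of "key" in the list
print(preferred["AJ0"])  # tokens whose preferred reading is adjective
print(f"{adjective_share:.1%}")
```

On these figures the tagger's first choice is adjective for roughly 96% of the prenominal tokens, which of course says nothing about how many of those choices are right.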

Compare those morphosyntactic treatments with the lexicological approach of a historical dictionary. OED only uses the word class noun against the headword key n. 1, but near the end of the entry it recognises:

Passing into adj. in the sense of 'dominant', 'controlling', 'chief', 'essential'; esp. designating some person or thing that is of crucial importance to others. (OED 2 s.v. key n. 1 V. attrib. and Comb.17. b)

The earliest citation is from 1913, and this sense was only added in the 1933 Supplement. Of course, in semantics it is perfectly normal to recognise gradual change from one sense to another: 'passing into' is a characteristic piece of OED phraseology (used 587 times, and you can find other possible N~A items among them).

A more recent development — that it is recent is clear both from the JJB coding used in LOB and the evidence of OED — is the use of key in the same sense, as a predicative complement without determiner:

(14)
Everything, he sometimes believed, could be resolved in sex: perhaps only briefly and never more than temporarily, but often the brief and temporary were key_AJ0 to any future possibility at all. (BNC FP1 1382)

There are no such examples of key in ICE-GB or DCPSE, but BNC has a number, e.g.

(15)
The agreement of a mutually acceptable reserve price is key_AJ0. (BNC HJ5 1349)

(16)
"But it sometimes helps if you remind key_AJ0 people which side their bread is buttered on, don't you think?" Owen wondered in what sense the District Chief was key_AJ0. (BNC J10 455)

They are tagged as adjectives, AJ0. ANC2 has several, also tagged as adjectives, JJ. Compare luck, which is rightly tagged as NN1 in BNC:

(17)
The clear sky was luck_NN1. (BNC ADY 2876)

Absence of a determiner evidently is not enough to have luck in (17) tagged as an adjective. Examples like the following are clinchers for the adjectival analysis of predicative key, because the word key is itself premodified in a way only possible for adjectives:

(18)
There are a number of reasons why people lose their hair, stress is a very_AV0 key_AJ0 factor. (BNC HVE 174)

(19)
I think this is so_RB key_JJ. I mean, it's what every study has ACTUALLY found (ANC2 PXNatter07-3 1359) [Hepple]

(For younger speakers, though, the so of (19) may not be an infallible test of adjective-hood.)

2.2 Case study 2: A ~ D

In a recent paper (Denison 2006) I have argued that a group of adjectives have been developing determiner uses. I mentioned divers(e), several, certain, various. I will ignore the first of these here. The partitive construction is recognised in the Cambridge Grammar (Huddleston & Pullum 2002: 538-540) as providing evidence of determiner status:

(20)
certain_jj of these countries - most notably Korea, Japan, and Taiwan - relied… (ANC2 ArticleIP_2637 82) [Biber]

certain_JJ of these countries - most notably Korea, Japan, and Taiwan - relied… (ANC2 ArticleIP_2637 82) [Hepple]

Several is almost entirely a quantifier now and is very rare in predicative or postdeterminer roles. Certain retains as its principal role a central adjectival meaning 'sure' and furthermore can be complemented in that usage by a non-partitive of-phrase:

(21)
a man fairly certain_jj of his place in the world (ANC2 NYT20020731.0102 154) [Biber]

a man fairly certain_JJ of his place in the world (ANC2 NYT20020731.0102 154) [Hepple]

That would presumably make it harder for a tagger to distinguish the determiner from the descriptive use. Various is least far along the road to determinerhood, and the partitive pattern only appears to date from the mid-nineteenth century and in the twentieth century to be more American than British.

What do the corpora do? In BNC several is always[?] a determiner, DT0:

(22) several_DT0 books have already appeared. (BNC A04 111)

(23) Several_DT0 of our brightest chefs (BNC A0C 793)

On the other hand, certain and various seem to be adjectives (AJ0) throughout, including in partitive use. In DCPSE several seems to be tagged as pronoun not only in the partitive pattern, when it is a noun phrase head, but even when premodifying, though then its function is as determiner. Certain is treated more subtly: it is an adjective both in clear cases like (24) with the meaning 'sure' and in more quantifier-like uses like (25), but in the partitive (26) it is a pronoun.

(24) I'm quite certain_ADJ that they do <DCPSE:DL-B01/LLC:S-01-01 #408:1:B>

(25) and therefore certain_ADJ people concluded … that <DCPSE:DL-E01/LLC:S-06-01 #240:2:B>

(26) there's no <,> doubt about certain_PRON of his abilities <DCPSE:DL-B36/LLC:S-05-11 #167:1:B>

Various is always ADJ, but there are no various of strings in the corpus. I wonder how it would have been tagged if found.

In ANC2, several is an adjective for the Penn Treebank tagging, JJ, as are various and certain, all of which occur partitively (cf. (20) above):

(27) various_JJ of the actors (ANC2 PXAngel02-8 1373) [Hepple]

The Biber tagging is different for several, where it is generally a post-determiner (ap), apart from one example — showing signs of textual corruption — which was tagged as an adjective with attributive function (jj+atrb):

(28)
Any of the several_jj+atrb luxury hotels tucked just behind the dunes (ANC2 Bahamas-WhereToGo 80) [Biber]

Any of the several_JJ luxury hotels tucked just behind the dunes (ANC2 Bahamas-WhereToGo 80) [Hepple]

Various and certain, however, were adjectives in the Biber tagging too.

In sum, then, we have three words, historically part of the same long-term trend, which receive very different treatments in the various corpora under consideration. Various — latest to start moving towards having some determiner properties — is always an adjective. Certain is an adjective for ANC2 and BNC, while the DCPSE tagging recognises it as a pronoun when in partitive use, therefore determiner-like. Several is an adjective for Penn Treebank tagging, a post-determiner for Biber tagging, a determiner for BNC, and a pronoun for DCPSE.

2.3 Case study 3: V ~ P

It is well known that infinitival perfect have is often reduced phonetically to [əv] or [ə]. That in itself is not a problem of category: it can remain an auxiliary of the perfect. However, three further possibilities cast doubt on its category status:

  • the pronunciation [ɒv], spelling <of> (though only when that spelling isn't merely eye dialect to indicate a non-standard speaker)
  • an additional occurrence, whether phonetically reduced or not, in a position not sanctioned by standard grammar

(29)
If I hadn't have_VB known, I would have been fine (ANC2 ChapmanDebbie 2005) [Hepple]

(30)
"I wish we hadn't have_vb+hv+aux done it…" (ANC2 NYT20020707.0164 193) [Biber]

"I wish we hadn't have_VB done it…" (ANC2 NYT20020707.0164 193) [Hepple]

  • positional behaviour which suggests that the form is behaving as an enclitic on the preceding verb

(31)
stuff i would 've_vb+hv+aux++0 never thought to (ANC2 adv700ju047 3109) [Biber]

stuff i would 've_VB never thought to (ANC2 adv700ju047 3109) [Hepple]

(32)
i would 've_vb+hv+aux++0 never had any problems (ANC2 sw2811-ms98-a-trans 432) [Biber]

i would 've_VB never had any problems (ANC2 sw2811-ms98-a-trans 432) [Hepple]

The examples in (31)-(32) are indicators of enclitic status, if not as strikingly non-standard as some of those found by Boyland (1998). There is a good case for treating 've not as a verb but as an invariant particle within the verbal group indicating non-fulfilment or unreality (Denison 1998: 140-142, 210-212).

ANC2 has a number of examples of modal + of + past participle, (33)-(35), with the past participle sometimes miscoded as a past tense, (36)-(38):

(33)
After killing her, the convict remarks, "She would of_in been a good woman, if it had been somebody there to shoot her every minute of her life." [explicitly quoted from Flannery O'Connor's story “A Good Man Is Hard To Find”] (ANC2 ArticleIP_68014) [Biber]

After killing her, the convict remarks, "She would of_IN been a good woman, if it had been somebody there to shoot her every minute of her life." (ANC2 ArticleIP_68014) [Hepple]

(34)
he he said he didn't think it should of_in gotten all those awards he thought it was too long (ANC2 sw2078-ms98-a-trans) [Biber]

he he said he didn't think it should of_IN gotten all those awards he thought it was too long (ANC2 sw2078-ms98-a-trans) [Hepple]

(35)
well they converted all the road signs to fifty five miles an hour you know they could of_in converted it to metric just about as easy (ANC2 sw2669-ms98-a-trans) [Biber]

well they converted all the road signs to fifty five miles an hour you know they could of_IN converted it to metric just about as easy (ANC2 sw2669-ms98-a-trans) [Hepple]

(36)
they could of_in had a lottery (ANC2 sw2335-ms98-a-trans) [Biber]

they could of_IN had a lottery (ANC2 sw2335-ms98-a-trans) [Hepple]

(37)
i i think i could of_in handled a lot of the work that i've been doing here in college at a much younger age (ANC2 sw4084-ms98-a-trans) [Biber]

i i think i could of_IN handled a lot of the work that i've been doing here in college at a much younger age (ANC2 sw4084-ms98-a-trans) [Hepple]

(38)
if it was some you know coming into the store and stuff then she would of_in had whoever made the purchase would have had sign some kind of um you know document and uh who (ANC2 sw4834-ms98-a-trans) [Biber]

if it was some you know coming into the store and stuff then she would of_IN had whoever made the purchase would have had sign some kind of um you know document and uh who (ANC2 sw4834-ms98-a-trans) [Hepple]

(39)
the solecism like to of, as in The boy like to of_IN killed hisself, is labeled as being limited to a region of Texas. (ANC2 VOL18_4) [Hepple]

(40)
but soon as i came down here the heat was must must 've_vb+hv+aux++0 been too intense for the car so the car kept overheating and turned out it was some kind of a problem with the radiator (ANC2 sw3742-ms98-a-trans) [Biber]

but soon as i came down here the heat was must must 've_VB been too intense for the car so the car kept overheating and turned out it was some kind of a problem with the radiator (ANC2 sw3742-ms98-a-trans) [Hepple]

(41)
How embarrassing could that 've_VB been? (ANC2 PXAngel01-8) [Hepple]

(42)
i think he ought to should 've_vb+hv+aux++0 gone in there and blew them away (ANC2 sw2587-ms98-a-trans) [Biber]

i think he ought to should 've_VB gone in there and blew them away (ANC2 sw2587-ms98-a-trans) [Hepple]

The word of is tagged as a preposition (IN) in all these examples, but <'ve> is tagged as a base form verb (VB), as is the superfluous have in all five examples like (29)-(30). [4] For the reduction to schwa, ANC2 appears to transcribe <a> as part of the preceding word, hence coulda, woulda, shoulda, mighta, musta, oughta, …, which are tagged rather unreliably. Four are tagged correctly as two syntactic words, modal + base form (MD|VB), though the following verb is incorrectly tagged as a past tense:

(43)
well you mighta_MD|VB used_VBD the small angle formula. (ANC2 tut150mu042) [Hepple]

but one is tagged as an adjective (JJ), and fully 85 are tagged as if they were nouns (NN):

(44) I totally shoulda_NN took_VBD the road (ANC2 PXAngel04-4) [Hepple]

In BNC there are many more examples of <of> for have, about 1633 out of 1746 tagged as the infinitive of have (VHI). The remainder are tagged as the preposition of (PRF). The contraction <'ve> is always tagged VHI. Just as with ANC2, the one-word spellings mighta, coulda, oughta, shoulda, woulda, musta tend to get tagged as nouns (NN1). According to information in the web page, the fused form <'d've> is always tagged as VM0 + VHI, that is to say, as modal + have.

In DCPSE, <'ve> is always[?] AUX in relevant structures, the same as the full form <have>. I have not found examples of <a> or <of> for have in DCPSE or ICE-GB.

In short, reduced forms are either tagged as if they were not reduced, or they are completely mistagged. There is no apparent recognition of a category change in progress.
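
A search for the modal + of pattern discussed in this section could be sketched as follows, assuming text in the word_TAG format used in the examples above, with the Penn Treebank (Hepple) tag IN on of. The helpers `parse` and `find_modal_of` are hypothetical, and the list of modals is illustrative rather than exhaustive.

```python
MODALS = {"would", "could", "should", "might", "must"}  # illustrative, not exhaustive


def parse(tagged_text):
    """Split 'word_TAG' tokens into (word, tag) pairs; untagged tokens get tag None."""
    pairs = []
    for token in tagged_text.split():
        word, sep, tag = token.rpartition("_")
        pairs.append((word, tag) if sep else (token, None))
    return pairs


def find_modal_of(tagged_text):
    """Return (modal, next_word) for each modal directly followed by 'of' tagged IN."""
    pairs = parse(tagged_text)
    hits = []
    for i in range(len(pairs) - 2):
        (w1, _), (w2, t2), (w3, _) = pairs[i], pairs[i + 1], pairs[i + 2]
        if w1.lower() in MODALS and w2.lower() == "of" and t2 == "IN":
            hits.append((w1, w3))
    return hits


# Example (36) from ANC2, with only 'of' tagged, as in the citation above.
sample = "they could of_IN had a lottery"
print(find_modal_of(sample))
```

A search of this kind depends on the preposition tag having been assigned to of. For BNC, where most such tokens are tagged as the infinitive of have, the tag test would need to be changed accordingly.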

2.4 Case study 4: N+P ~ D ~ Adv

The problem here is different again, because we have lexicalisation of a phrase, reanalysis of structure, yet normal orthography represents the elements as separate words. The problem concerns sort of, kind of and to some extent type of. I will stick to sort of. In the analysis of Evelien Keizer (2001) there are three main uses of sort of before a nominal expression:

  • the binominal construction:

(45)
Davis was the_AT0 worst_AJS sort_NN1 of_PRF person_NN1 to have at the head of a monopolistic company. (BNC A7L 982)

  • the qualifying construction:

(46)
there was a_AT0 sort_NN1 of_PRF emptiness_NN1 inside her (BNC EWH 1163)

  • the postdeterminer construction:

(47)
Those_DT0 sort_NN1 of_PRF jobs_NN2 just don't exist for people like you and me. (BNC A0F 424)

But sort of may precede non-nominal phrases or nothing at all, in which case it is adverbial:

(48)
If Andy thought a drawing by Paul Klee looked "sort of_AV0 funny_AJ0", he said so. (BNC A04 1580)

(49)
Well, if you were to sort of_AV0 pop_VVI off_AVP, I'd go on being a countess, wouldn't I? (BNC A0D 122)

(50)
but its demountable engine drive system made it workable, sort of_AV0. (BNC AJY 1512)

In an unpublished paper (Denison 2002) I suggested that the adverbial construction inherited its properties from the postdeterminer and the qualifying constructions.

In ANC2, the of of sort of is always tagged as a preposition (IN), [5] while sort is nearly always a noun (NN). However, some 70 examples of sort are tagged otherwise, either as base form verb (VB) or as adverb (RB). It looks as if the tagger has been confused by the structures involved, especially as hardly any of them have a noun or NP after the of. It could be imagined that the adverb tag is deliberate, but against this must go two facts: the preposition tag on of, and the fact that thousands of adverbial sort ofs are tagged with sort as noun. ANC2, then, essentially treats sort and of as separate items in their historical categorisations.

In BNC, there are 22,754 instances of the string sort of. 16,964 are tagged with sort as a noun (NN1) and of as a preposition with its own unique tag (PRF); in one further example, of is unclassified (UNC), the tagger perhaps confused by a dash. However, BNC clearly does recognise lexicalisation, since 5,774 are tagged as a multiword unit, with nothing after sort and the tag for general adverb after of (AV0). The latter tagging seems to be used in a range of contexts, including when sort of is sentence-final or part of a determiner-less predicative phrase or when it precedes an adjective or verb. Just fifteen seem to be mistagged: 12 with sort as a verb (VVI), 1 with the phrase as a multiword unit that is a noun (NN1), and, interestingly, 2 with sort as noun and of as a form of have (VHI):

(51)
Well, you know <pause> I just can't sort_NN1 of_VHI let_VVN it go on. (BNC KCT 9825)

(52)
Erm <pause> because I would_VM0 n't_XX0 sort_NN1 of_VHI finished_VVN till <pause> quarter pa— (BNC KE2 8717)

In (51) the tagger has taken let as a past participle. In the case of (52) the tagger can well be excused: ideally, if this spoken utterance has been transcribed correctly, we would want of here to form simultaneously part of sort of and part of the verbal group.
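
The figures just given for sort of in BNC can be checked against the reported total with a few lines of Python. The sketch below uses only the counts stated in the text; the dictionary labels are my own shorthand, not BNC terminology.

```python
# Counts for the string "sort of" in BNC, as reported in the text above.
TOTAL = 22_754

breakdown = {
    "noun + preposition (NN1 + PRF)": 16_964,
    "noun + unclassified 'of' (UNC)": 1,
    "multiword adverb (AV0)": 5_774,
    "sort as verb": 12,
    "multiword noun (NN1)": 1,
    "noun + form of have": 2,
}

# The six groups should exhaust the reported total.
accounted_for = sum(breakdown.values())
print(accounted_for == TOTAL)

# Share of each tagging among all instances of the string.
for label, count in breakdown.items():
    print(f"{label}: {count / TOTAL:.2%}")
```

The six groups do add up to the reported total, and the multiword adverb tagging accounts for about a quarter of all instances of the string.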

To distinguish between the two main taggings — noun + preposition and multiword — how does the tagger decide? It looks as if it goes for two separate words at least whenever sort of occurs in an NP with a determiner and optionally adjectives to the left. (These are my suppositions only.) Does it get the multiword contexts correctly? When BNC identifies sort of as two words, some examples are clearly ambiguous:

(53)
People love to be awed when they enter a pub by a superior natural force -- a strange sort_NN1 of_PRF higher masochism. (BNC A0B 38)

(54)
Very, very pretty lady, all that sort_NN1 of_PRF beastly foreign stuff. (BNC A0D 194)

(55)
His "peculiar gloating obsequious humour", his "sort_NN1 of_PRF capricious self-satisfaction" lurking in the very midst of "plaintive protestations", are described and pondered. (BNC A18 1463)

But many are not:

(56)
"He gave a sob, and she went rattling down the stairs to her room the way she always did and then I heard that awful sort_NN1 of_PRF slither and Bunty's scream ..." (BNC A0D 1432)

(57)
"Sort_NN1 of_PRF clay-y like." (BNC A0D 2215)

(58)
It's getting sort_NN1 of_PRF light now. (BNC A74 2371)

So BNC essentially distinguishes the binominal from lexicalised forms, if not altogether successfully, apparently lumping together the qualifying, postdeterminer and adverbial constructions (though 18/274 of the latter are tagged as noun + preposition). This is a good start, but even if it had been brought to perfection, it would not have sufficed. To give one example of the problem that would remain, consider sort of thing, of which there are over 2,100 instances in the BNC, proving that this string is at the very least a frequent collocation. Some can just be treated as containing binominal sort of:

(59)
He was wearing some kind of rock "n" roll suit, the sort_NN1 of_PRF thing_NN1 that Jerry Lee Lewis might have worn -- (BNC A6E 78)

But some can't, because sort of thing is itself beginning to be lexicalised:

(60)
I really thought it was going to be sort of "You must not drink ever again" and "You naughty boy, you mustn't do it ever again" sort_NN1 of_PRF thing_NN1. (BNC ALP 210)

(61) I was there, sort_NN1 of_PRF thing_NN1. (BNC ANY 465)

And the same goes for a number of overlapping collocations within the sort of family, including those sort of, those sort, what sort, some sort of, that sort of thing, etc.

What about DCPSE? It certainly recognises the adverbial lexicalisation of sort of, assigning the ditto-tag ADV to both elements, either as intensifying or as general adverb.

(62) I'm just sort_ADV of_ADV looking <DCPSE:DI-A06/ICE-GB:S1A-033 #107:1:A> [FTF]

(Once the ditto-tagging is as adjective, once as preposition, and once as PRON in the phrase a sort of couple.) Separate tagging as noun + preposition can occur widely, including for postdeterminer use:

(63)
Well I'd actually expect that those_PRON sort_N of_PREP courses are very uh heavily subscribed <DCPSE:DI-A06/ICE-GB:S1A-033 #128:1:B> [FTF]

(I should point out that Evelien Keizer and I are unsure about the correct analysis of so-called post-determiner examples.) There is more subtlety in the treatment of sort of thing, but not perhaps consistency:

(64)
I mean they've really become sort_ADV of_ADV about my family sort_ADV of_ADV thing_ADV <DCPSE:DI-A09/ICE-GB:S1A-050 #43:1:B> [FTF]

(65) Is that the Tefl sort_N of_PREP thing_N <DCPSE:DI-A08/ICE-GB:S1A-035 #13:1:B> [FTF]

(66) I realize he's applied before sort_FRM of_FRM thing_FRM <,> <DCPSE:DL-B18/LLC:S-02-06 #7:1:A>

In example (64), sort of is twice ditto-tagged as ADV, and in sort of thing it is said to be followed by another adverb. In (65), the ditto-tagging links the two nouns Tefl and sort, while of is a preposition and thing a noun. In (66) all three members of sort of thing are ditto-tagged as a 'formulaic expression' (FRM) functioning as a discourse marker. Clearly sort of is a nightmare for tagging.

2.5 Case study 5: P ~ infinitive marker ~ M

In ANC2, to as infinitive marker has a tag of its own, TO (both Penn and Biber tagsets), likewise BNC (TO0) and DCPSE (TO): it is in effect what Pullum (1982) calls "syncategorematic". One can't complain about this, as it certainly makes infinitival to easy to find, but it means that Pullum's generalisation about to and the modals being able to precede an ellipsis site is lost, and searches for such contexts need a disjunction. Note that ellipsis after to is first found from the end of the eighteenth century and is rare before the mid-nineteenth (Denison 1998: 201-202). More interesting is the relationship between infinitival to and the form which belongs to certain prepositional verbs with a verbal complement:

(67) He planned to retire then.

(68) He objected to retiring then.

(69) No objection to retirement.

Until recently in late ModE, but not generally in PDE, many verbs such as object in (68) were followed by an infinitive rather than an -ing, so there are interesting things going on here. The to of (68) and/or of (69) may be tagged as a preposition or as the same to as is found with infinitives (thus ANC2, at least after object): either way, a generalisation is being lost.
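
The disjunction needed when searching for ellipsis after to or a modal can be sketched as follows, using the BNC tags TO0 and VM0. This is a rough illustration under stated assumptions: the function `candidate_ellipsis_sites` is hypothetical, the test "immediately followed by punctuation" is only a crude proxy for an ellipsis site, and the sample sentence is invented and hand-tagged, not corpus data.

```python
# Tags that can precede an ellipsis site, following Pullum's generalisation:
# BNC tags the infinitive marker TO0 and modals VM0, so a search needs both.
PRE_ELLIPSIS_TAGS = {"TO0", "VM0"}
PUNCTUATION_TAG = "PUN"  # BNC tag for general punctuation


def candidate_ellipsis_sites(tagged_tokens):
    """Return words tagged TO0 or VM0 that are immediately followed by punctuation.

    This is a crude proxy: it yields candidates for manual checking, not
    confirmed cases of ellipsis.
    """
    hits = []
    for (word, tag), (_, next_tag) in zip(tagged_tokens, tagged_tokens[1:]):
        if tag in PRE_ELLIPSIS_TAGS and next_tag == PUNCTUATION_TAG:
            hits.append(word)
    return hits


# Invented sentence, hand-tagged for illustration only (not corpus data).
invented = [
    ("You", "PNP"), ("can", "VM0"), ("leave", "VVI"), ("if", "CJS"),
    ("you", "PNP"), ("want", "VVB"), ("to", "TO0"), (",", "PUN"),
    ("and", "CJC"), ("I", "PNP"), ("think", "VVB"), ("you", "PNP"),
    ("should", "VM0"), (".", "PUN"),
]
print(candidate_ellipsis_sites(invented))
```

Had to and the modals shared a tag, the set of two tags would reduce to a single value; as things stand, the generalisation has to be rebuilt by hand in every query.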

3. Conclusion

In making this brief survey I have found in all of my five case studies that words which are in process of changing category have proved difficult for (semi-)automatic taggers to handle, though the subtlety with which such items are handled does seem to vary quite considerably. In part this is because the very tagsets at the taggers' disposal are rather varied. Note that all the corpora discussed go way beyond the traditional parts of speech (which can be counted on the fingers of two hands):

The POS tagsets used to annotate large corpora in the past have traditionally been fairly extensive. The pioneering Brown Corpus distinguishes 87 simple tags (Francis 1964), (Francis & Kucera 1968) and allows the formation of compound tags; thus, the contraction I'm is tagged as PPSS+BEM (PPSS for "non-3rd person nominative personal pronoun" and BEM for "am, 'm"). Subsequent projects have tended to elaborate the Brown Corpus tagset. For instance, the Lancaster-Oslo/Bergen (LOB) Corpus uses about 135 tags, the Lancaster UCREL group about 165 tags, and the London-Lund Corpus of Spoken English 197 tags. A useful overview of the relation of these and other tagsets to each other and to the Brown Corpus tagset is given in Appendix B of Garside et al 1987. The rationale behind developing such large, richly articulated tagsets is to approach "the ideal of providing distinct codings for all classes of words having distinct grammatical behaviour" (Garside, Leech & Sampson 1987). (Marcus, Santorini & Marcinkiewicz 1993: 314)

In part the variation among cases and corpora is due to differences in the way indeterminacy is handled:

A final difference between the Penn Treebank tagset and all other tagsets we are aware of concerns the issue of indeterminacy: both POS ambiguity in the text and annotator uncertainty. In many cases, POS ambiguity can be resolved with reference to the linguistic context. So, for instance, in Katherine Hepburn's witty line Grant can be outspoken — but not by anyone I know, the presence of the by-phrase forces us to consider outspoken as the past participle of a transitive derivative of speak — outspeak — rather than as the adjective outspoken. However, even given explicit criteria for assigning POS tags to potentially ambiguous words, it is not always possible to assign a unique tag to a word with confidence. Since a major concern of the Treebank is to avoid requiring annotators to make arbitrary decisions, we allow words to be associated with more than one POS tag. Such multiple tagging indicates either that the word's part of speech simply cannot be decided or that the annotator is unsure which of the alternative tags is the correct one. In principle, annotators can tag a word with any number of tags, but in practice, multiple tags are restricted to a small number of recurring two-tag combinations: JJ|NN (adjective or noun as prenominal modifier), JJ|VBG (adjective or gerund/present participle), JJ|VBN (adjective or past participle), NN|VBG (noun or gerund), and RB|RP (adverb or particle). (Marcus, Santorini & Marcinkiewicz 1993: 316)

My purpose in making this investigation is not in any way to criticise the compilers of the corpora I happened to look at. Far from it: their efforts have enormously enriched the range of linguistic research which can be carried out. I suspect that among the inconsistencies which have emerged, some are an irreducible consequence of a mismatch between language as actually used and a widely-held conception of grammar, namely that every word in every sentence belongs to exactly one category — no more and no less — selected from a small number of parts of speech. This assumption is common both to traditional grammar and to most modern theoretical approaches. Presumably, it is the working assumption behind POS tagging. [6] I suspect strongly, however, that it does not correspond to actual language behaviour and is in consequence unworkable. That last point is, of course, an exaggeration, since categories — even if epiphenomenal — are more often than not quite straightforward to assign, and categorisation can be very useful in practice. Rather, the assumption of universal and unique lexical categoriality is a practical oversimplification of reality which works tolerably well quite a lot of the time — but not in such cases as I have chosen to concentrate on. Note, however, that such cases are not really at the margins of language use: they are a common part of everyday colloquial English.

At this stage in my work I do not have any firm conclusions to draw, so I content myself with some brief observations and questions, taking as my focus the related areas of category change (diachrony) and category indeterminacy (synchrony).

What should ideally be shown in the grammatical mark-up? Even without considering structural parsing, being able to consult both category and function labelling has clear advantages, though admittedly in cases of doubt as to the construction involved, it seems that category and function labels usually go together in practice.

Is it possible to tag (parse?) by construction rather than (just) by word? That is an attractive idea, though the technical challenges might be quite daunting. Even if we confine ourselves to words and lexical items (a level which can extend beyond the word but not in any real sense to constructions), indeterminacy tags could be used more widely than they are.

However, there will always be tension between having too little information and too much. The greater the choice of tags and also the greater the willingness to recognise indeterminacy, the more information can be provided. Whether this is a good thing or not is not so clear. It is in some ways easier to construct searches if the mark-up possibilities are more limited, even if this increases the number of mistagged items. It depends whether it is felt that the point of tagging is purely to aid in the collection of examples — examples which are then examined 'manually' and reclassified before being used for linguistic research — or whether the tags are so 'sticky' that they actually prevent recognition of category change.

How flexible are tagging programs? Consider this. A transition from N to A is widely recognised for key, but not yet for draft. When taggers first come across strings like very draft in the data fed to them, will they correctly tag draft there as an adjective? And — this is crucial — will they then be able to go back and reconsider cases like a draft manifesto, whose status is in my opinion changed by the availability of very draft? In this connection, stand-off tagging is attractive in its non-finality: different taggings can be applied easily to the same textual data.
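The kind of reconsideration envisaged here can be made concrete with a toy sketch in Python. Everything in it is hypothetical: the token list, the layer names, the function retag_after_degree_modifier and the retagging rule itself are assumptions made for illustration, and the code does not reproduce the stand-off format of ANC2 or the behaviour of any existing tagger. The point is only that, with annotation held apart from the text, a second tagging can be derived without overwriting the first.

```python
# The text is stored once; each annotation layer is a separate mapping
# from token position to tag (a simplified stand-in for stand-off mark-up).
tokens = ["a", "draft", "manifesto", "was", "very", "draft"]

# Hypothetical first-pass tagging: every 'draft' is treated as a noun.
layer_a = {0: "DT", 1: "NN", 2: "NN", 3: "VBD", 4: "RB", 5: "NN"}


def retag_after_degree_modifier(tokens, layer, word, degree_words=("very",)):
    """Return a NEW layer; the input layer is left untouched.

    If `word` occurs anywhere directly after a degree modifier, that
    occurrence is retagged JJ, and prenominal occurrences previously
    tagged NN are given the indeterminacy tag JJ|NN.
    """
    evidence = any(
        tokens[i] == word and i > 0 and tokens[i - 1] in degree_words
        for i in range(len(tokens))
    )
    new_layer = dict(layer)
    if not evidence:
        return new_layer
    for i, tok in enumerate(tokens):
        if tok != word:
            continue
        if i > 0 and tokens[i - 1] in degree_words:
            new_layer[i] = "JJ"
        elif layer[i] == "NN" and layer.get(i + 1) == "NN":
            new_layer[i] = "JJ|NN"
    return new_layer


layer_b = retag_after_degree_modifier(tokens, layer_a, "draft")
```

Here layer_b tags the draft of very draft as JJ and the draft of a draft manifesto as JJ|NN, while layer_a still records both as NN. Both analyses remain available over the same text, which is the non-finality that makes stand-off tagging attractive.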

As a final question, could we actually reveal language change by tagging procedures, rather than merely playing catch-up after the event? Perhaps we can, but only if:

  • Diachronic linguists produce a typology of systematic category changes.
  • Synchronic linguists build such possibilities into the lexicons and taggers that they use.

Notes

[1] Thanks to Bas Aarts, Sean Wallis, Sebastian Hoffmann, Nancy Ide and Keith Suderman for advance versions of corpora and/or help with corpus software. No blame attaches to them.

BNC was searched with the BNCweb software and, briefly, a beta version of the new BNCweb. For ANC2, the stand-off tagging was inserted into the text files with ANCTool, searching was done with BareGREP and the results exported to MonoConc Pro. For the two UCL/SEU corpora I had access to the latest version of the ICECUP software (as at March 2006). For LOB I used MonoConc Pro.

[2] Examples from ICE-GB include a link to an image showing the relevant Fuzzy Tree Fragment (FTF).

[3] N-N: the key words, the key element, a key hold, the key issue (× 3), key manoeuvres, the key statistics, a key figure.

ADJ + N: the key figure, the kind of key currency, certain key moments, the key element, key assumptions, any key details, key decisions, the key institution, key issues, the key values, a key part, the key people, a key role, key services and commodities.

[4] In the Penn/Hepple tags but not the Biber tags, IN could also represent a subordinating conjunction.

[5] In three cases, of is tagged as an adjective (JJ) because sort is preceded by a dash mistaken for a hyphen and so the whole sequence is completely misidentified.

[6] See here an insightful paper by Stig Johansson (1985) — which he kindly brought to my attention when the present paper was presented — on the post-editing of the tagged LOB Corpus. Some of the kinds of problem raised here are anticipated in Johansson's discussion of automatic tagging errors and of ambiguity, merger and gradience in synchronic language use.

Sources

ANC2 = American National Corpus, second edition, http://americannationalcorpus.org/

BareGREP software, http://www.baremetalsoft.com/baregrep/

BNC = British National Corpus, http://www.natcorp.ox.ac.uk/

DCPSE = Diachronic Corpus of Present-Day Spoken English, http://www.ucl.ac.uk/english-usage/projects/dcpse/

ICE-GB = British Component of the International Corpus of English, http://www.ucl.ac.uk/english-usage/projects/ice-gb/

LOB = Lancaster-Oslo-Bergen Corpus, http://www.comp.lancs.ac.uk/computing/research/ucrel/corpora.html#lob

  • The tagged LOB corpus: Users' manual by Stig Johansson, in collaboration with Eric Atwell, Roger Garside and Geoffrey Leech. 1986. Bergen: Norwegian Computing Centre for the Humanities. http://khnt.hit.uib.no/icame/manuals/lobman/

MonoConc Pro concordancer, published by Athelstan, http://athel.com/

Oxford English Dictionary, http://www.oed.com/

References

Boyland, Joyce Tang. 1998. "A corpus study of would + have + past-participle". Historical Linguistics 1995: Selected Papers from the 12th International Conference on Historical Linguistics, Manchester, August 1995, vol. 2, Germanic, ed. by Richard M. Hogg & Linda van Bergen, 1-17. (Current Issues in Linguistic Theory 162.) Amsterdam and Philadelphia: John Benjamins.

Denison, David. 1998. "Syntax". The Cambridge History of the English Language, vol. 4, 1776-1997, ed. by Suzanne Romaine, 92-329. Cambridge: Cambridge University Press.

Denison, David. 2001. "Gradience and linguistic change". Historical Linguistics 1999: Selected Papers from the 14th International Conference on Historical Linguistics, Vancouver, 9-13 August 1999, ed. by Laurel J. Brinton, 119-144. (Current Issues in Linguistic Theory 215.) Amsterdam and Philadelphia: John Benjamins.

Denison, David. 2002. "History of the sort of construction family". Paper presented at ICCG2: Second International Conference on Construction Grammar, Helsinki. http://www.llc.manchester.ac.uk/subjects/lel/staff/david-denison/papers/thefile,100126,en.pdf

Denison, David. 2006. "Category change and gradience in the determiner system". The Handbook of the History of English, ed. by Ans van Kemenade & Bettelou Los, 279-304. (Blackwell Handbooks in Linguistics.) Oxford: Blackwell.

Denison, David. in preparation. English Word Classes: Categories and Their Limits. (Cambridge Studies in Linguistics.) Cambridge: Cambridge University Press.

Francis, W. Nelson. 1964. A standard sample of present-day English for use with digital computers: Report to the U.S. Office of Education on Cooperative Research Project No. E-007. Providence, R. I.: Brown University.

Francis, W. Nelson & Henry Kucera. 1968. Frequency Analysis of English Usage: Lexicon and Grammar. Boston: Houghton Mifflin.

Garside, Roger, Geoffrey Leech & Geoffrey Sampson. 1987. The Computational Analysis of English: A Corpus-based Approach. London: Longman.

Huddleston, Rodney & Geoffrey K. Pullum. 2002. The Cambridge Grammar of the English Language. Cambridge: Cambridge University Press.

Johansson, Stig. 1985. "Grammatical tagging and total accountability". Papers on Language and Literature: Presented to Alvar Ellegård and Erik Frykman, ed. by Sven Bäckman & Göran Kjellmer, 208-220. (Gothenburg Studies in English 60.) Gothenburg: Acta Universitatis Gothoburgensis.

Keizer, Evelien. 2001. "A classification of sort/kind/type-constructions". Ms., University College London.

Leech, Geoffrey & Lu Li. 1995. "Indeterminacy between Noun Phrases and Adjective Phrases as complements of the English verb". The Verb in Contemporary English: Theory and Description, ed. by Bas Aarts & Charles F. Meyer, 183-202. Cambridge: Cambridge University Press.

Marcus, Mitchell P., Beatrice Santorini & Mary Ann Marcinkiewicz. 1993. "Building a large annotated corpus of English: The Penn Treebank". Computational Linguistics 19: 313-330.

New English Dictionary on Historical Principles. 1888-1933. Ed. by James A. H. Murray. Oxford: Clarendon Press.

Pullum, Geoffrey K. 1982. "Syncategorematicity and English infinitival to". Glossa 16: 181-215.