Background and history

The Brown University Standard Corpus of Present-Day American English, compiled in the 1960s, was the first publicly available electronic corpus. Its compilation took place "in the face of massive indifference if not outright hostility from those who espoused the conventional wisdom of the new and increasingly dominant paradigm in US linguistics led by Noam Chomsky" (Kennedy 1998: 23). Henry Kučera (1992: 402) describes the reactions he and W. Nelson Francis, the compilers of the Brown Corpus, had to come to terms with:

the prevalent linguistic fashions of the early 1960s were hardly favorable, at least in the United States, to any enterprise that included an examination and analysis of actual language data. The goal then was "to capture", to use the favorite verb of that age, various profound generalizations about the competence of an ideal speaker-listener who, we were instructed, knew his or her language perfectly, had no memory limitations, lived in a completely homogeneous society, and suffered from no distractions, including demands of style or effective communication; all of this inquiry was to be pursued with the ultimate aim, achieved only perhaps in the following millennium, of discovering the basis of a universal grammar by the application of superior reasoning. Collecting empirical data was thus not considered a worthwhile enterprise in the circles of true believers since, as many of our colleagues from Boston and its suburbs so firmly impressed on us on every suitable or unsuitable occasion, a native speaker of English, for example, could provide the linguist in five minutes with a much greater amount of useful information than even a corpus of a billion words could, if one actually existed. The use of computers to discover anything significant about language only increased the severity of our betrayal. (Kučera 1992: 402.)

Computers were still in their infancy and raised suspicions among humanist academics:

There were many members of the humanistic world in various academic institutions, including Brown University, who had a predictable fear of the new "calculating machines" and little more than contempt for those among us who dared to commit the treason of joining the scientists' camp of vacuum tubes, relays and binary numbers. So, from both sides, we were certainly not spared the labels of word-counting fools, and the predictions were boldly made that we would, at best, turn into bad statisticians and intellectual mechanics. (Kučera 1992: 402-403.)

The Brown Corpus consists of a selection of texts published in the United States in 1961, and its first version was available on computer tape as early as 1964. Compared to later corpora, the Brown Corpus is relatively small, mainly because compiling a corpus with the technology available at the time was very laborious: all the texts, for example, "had to be keyed in by hand, a process requiring a tremendous amount of very tedious and time-consuming typing" (Meyer 2002: 32). The corpus was originally recorded on 100,000 punch cards with "70 characters per line plus locational information identifying the texts and line numbers" (Kennedy 1998: 27). The cards were later transferred to magnetic tape and more recently to CD-ROM. In addition to the original version, the corpus is available in six further versions, some of them grammatically tagged.

See especially the accounts by W. Nelson Francis of the completion and tagging of the Brown Corpus in the two papers listed below (Francis 1979, Francis 1980).

Image of Henry Kučera [16] and W. Nelson Francis [7] at the 1991 Nobel symposium.


Francis, W.N. 1979. Problems of assembling and computerizing large corpora. Revised version of Francis (1975). In Bergenholtz & Schaeder (eds) 1979: 110-123. Reprinted in Johansson (ed) 1982: 7-24.

Francis, W.N. 1980. A tagged corpus: Problems and prospects. In Greenbaum, S., G. Leech & J. Svartvik (eds.), Studies in English Linguistics for Randolph Quirk. London: Longman. 192-209.

Kennedy, Graeme. 1998. An Introduction to Corpus Linguistics. London & New York: Longman.

Kučera, Henry. 1992. The odd couple: The linguist and the software engineer. The struggle for high quality computerized language aids. In Svartvik, Jan (ed.), Directions in Corpus Linguistics. Proceedings of Nobel Symposium 82, Stockholm, 4-8 August 1991. (Trends in Linguistics. Studies and Monographs 65.) Berlin: Mouton de Gruyter. 401-420.

Meyer, Charles F. 2002. English Corpus Linguistics. An introduction. Cambridge: Cambridge University Press.