British Academic Spoken English Corpus (BASE)

The BASE corpus was developed by Hilary Nesi, with Paul Thompson. Natalie Snodgrass and Sarah Creer were employed as research assistants and Tim Kelly was video director for the project. Lou Burnard (Oxford University) and Adam Kilgarriff (Lexicography MasterClass Ltd) acted as consultants. The corpus facilitates, amongst other things, investigation of:

  • The frequency and range of academic lexis
  • The meaning and use of individual words and multi-word units
  • The structure of academic lectures
  • The pace, density and delivery styles of academic lectures
  • The discourse function of intonation
  • Patterns of interaction, including turn-taking and topic selection
  • The interplay of visual and aural stimuli
  • The representation of ideas and the expression of attitudes

The lectures and seminars have been transcribed and tagged using a system devised in accordance with the TEI Guidelines. The corpus has been deposited in the Oxford Text Archive and is catalogued by the Arts and Humanities Data Service.

Project leaders: Hilary Nesi, Paul Thompson
Time of compilation: 2000-2005
Size: 1,644,942 tokens, 160 lectures and 39 seminars
Period: 21st c English
Project home page:

Funding: The early stages of corpus development were assisted by funding from the Universities of Warwick and Reading, BALEAP, EURALEX, and The British Academy (2000-2001, Grant reference: SG 30284).

Major funding was provided by the Arts and Humanities Research Council as part of their Resource Enhancement Scheme (2001–2005, Award Number: RE/AN6806/APN13545).






Holdings are distributed across four broad disciplinary groups (Arts and Humanities, Social Sciences, Physical Sciences and Life Sciences), each represented by 40 lectures and 10 seminars. The lectures and seminars have been transcribed and tagged using a system devised in accordance with the TEI Guidelines. A corpus manual and the BASE Corpus Holdings Spreadsheet are also available.

The corpus can be downloaded from the Oxford Text Archive (resource number 2525). It is also available via the Sketch Engine corpus query tool, by subscription or open access.

The Wordtree provides an open-access visualisation tool.

Interview notes, vocabulary lists and video/audio files can be found at

Reference line and copyright

The British Academic Spoken English (BASE) corpus is freely available to researchers who agree to the following conditions:

  1. Corpus holdings should not be reproduced in full for a wider audience or readership (i.e. for publication or for teaching purposes), although researchers are free to quote short passages of text of up to 100 running words, with a total of no more than 200 running words from any given speech event.
  2. No part of the corpus holdings should be reproduced in teaching materials intended for publication (in print or via the internet).
  3. The corpus developers should be informed of all presentations and publications arising from analysis of the corpus.
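The quotation limits in condition 1 can be checked mechanically before publication. The sketch below is only an illustration: it assumes that "running words" can be approximated by splitting on whitespace, and the function names and the (event code, passage) input format are hypothetical, not part of any BASE tooling.

```python
from collections import defaultdict

# Limits taken from condition 1 above.
MAX_WORDS_PER_PASSAGE = 100
MAX_WORDS_PER_EVENT = 200


def count_running_words(passage):
    """Approximate running words by whitespace splitting (an assumption)."""
    return len(passage.split())


def check_quotations(quotations):
    """Return a list of problems for (event_code, passage) pairs.

    An empty list means every passage and every per-event total
    is within the limits.
    """
    problems = []
    totals = defaultdict(int)
    for event_code, passage in quotations:
        n = count_running_words(passage)
        totals[event_code] += n
        if n > MAX_WORDS_PER_PASSAGE:
            problems.append(
                f"{event_code}: passage of {n} words exceeds {MAX_WORDS_PER_PASSAGE}"
            )
    for event_code, total in totals.items():
        if total > MAX_WORDS_PER_EVENT:
            problems.append(
                f"{event_code}: total of {total} words exceeds {MAX_WORDS_PER_EVENT}"
            )
    return problems
```

Transcripts contain markup and pause notation, so a whitespace count may differ from the count the corpus developers intend; treat the result as a rough guide only.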

Researchers must acknowledge their use of the BASE corpus using the following form of words:

The recordings and transcriptions used in this study come from the British Academic Spoken English (BASE) corpus. The corpus was developed at the Universities of Warwick and Reading under the directorship of Hilary Nesi and Paul Thompson. Corpus development was assisted by funding from BALEAP, EURALEX, the British Academy and the Arts and Humanities Research Council.

Files should be referred to by their letter and number codes, indicating disciplinary grouping (e.g. ah = arts and humanities), type of speech event (e.g. lct = lecture) and the file number.
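As an illustration of this naming scheme, the sketch below splits a file code into its three parts. Only "ah" (arts and humanities) and the lecture code are given on this page; the other group codes ("ss", "ps", "ls"), the seminar code "sem", the example code "ahlct001" and the function name are assumptions made for the example and should be checked against the corpus manual.

```python
import re

# "ah" is documented above; the other group codes are assumed.
GROUPS = {
    "ah": "Arts and Humanities",
    "ss": "Social Sciences",
    "ps": "Physical Sciences",
    "ls": "Life Sciences",
}

# "lct" is the lecture code; "sem" for seminars is assumed.
EVENTS = {"lct": "lecture", "sem": "seminar"}

# Two letters for the group, three for the event type, then the file number.
CODE_RE = re.compile(r"^([a-z]{2})([a-z]{3})(\d+)$")


def parse_file_code(code):
    """Split a code such as 'ahlct001' into group, event type and number."""
    match = CODE_RE.match(code.strip().lower())
    if match is None:
        raise ValueError(f"not a recognised file code: {code!r}")
    group, event, number = match.groups()
    if group not in GROUPS or event not in EVENTS:
        raise ValueError(f"unknown group or event type in: {code!r}")
    return {
        "group": GROUPS[group],
        "event": EVENTS[event],
        "number": int(number),
    }
```

For example, `parse_file_code("ahlct001")` returns the group "Arts and Humanities", the event type "lecture" and the number 1.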

When referring to the BASE corpus in your presentations and publications, it is easiest to cite an original publication which describes the project. We recommend: Thompson, P. and Nesi, H. (2001) The British Academic Spoken English (BASE) Corpus Project. Language Teaching Research 5 (3) 263-264.