The Compilation of the Helsinki Corpus: Tracing Early Modern English Texts
The availability of texts that could be excerpted for the Helsinki Corpus varied a great deal according to the period. All Old English and Middle English materials go back to manuscripts, and it was clear that the amount of data from these periods was far more limited than the multitude of books that were created after the advent of the printing press. The process of selection of Early Modern English texts involved checking the suitability of hundreds of books in various libraries.
In the 1980s, the technological conditions for corpus compilation were very different from those of today. Neither the World Wide Web nor digitized materials were yet in existence. In libraries, the main search 'tools' were boxes of cards, catalogues and printed bibliographies. Books were ordered by filling in order slips. Photocopying was available, but rare books and books in poor condition could not be photocopied. Some materials were microfilmed, but making prints from them was a complicated task.
In the compilation of the Helsinki Corpus, the division of labour followed the fields of interest of the team members. From the outset, it was clear that Terttu Nevalainen and I should concentrate on Early Modern English, as both of us were preparing our doctoral dissertations on this period. I was a newcomer in the field, but Terttu had already done a great deal of work on early modern texts, spending a year in London in search of materials for her dissertation. Terttu's command of the textual varieties was significant for the development of the corpus structure, complementing Matti Rissanen's broad acquaintance with the writings from this period.
Terttu Nevalainen and Helena Raumolin-Brunberg
The structure of the corpus in terms of genres and subperiods and the inclusion of individual texts were discussed in regular team meetings, whose participants included Anneli Meurman-Solin, Merja Kytö, and occasionally Ritva Tiusanen, in addition to Terttu, Matti and myself. Anneli and Merja were compiling their own corpora on Early Scots (Anneli) and Early American English (Merja), and the aim was to synchronize the structures of all three corpora. The generic structure of the corpus developed gradually, ultimately resulting in fifteen 'text types' (the term we used for the classification of texts on the basis of external criteria, roughly corresponding to the more common term 'genre' today) for Early Modern English. Chronological continuity with the earlier sections was naturally taken into account in this decision.
It was clear from the beginning that only part of the material was available in the University and English Department libraries in Helsinki. The remedy for this problem was the excellent interlibrary-loan service at the University Library, and in particular the ever-helpful librarian Liisa Koski. She acquired books for us from all over the world, some of which unfortunately had to be discarded immediately because of their modernized spelling or the lack of information about their origin. Luckily, we were able to make some visits to the British Library, and took advantage of every minute of the opening hours to examine potential corpus materials. One of the most important 'tools' in the corpus work was the photocopier. In Helsinki, copying was easy, as we could use the machines ourselves, but in the British Library the copies had to be ordered and could not always be delivered until the following day. If a book was in poor condition, it could not be copied at all and had to be discarded for this reason.
The number of samples per genre and the size of each sample were important decisions to make. For most genres we included two texts for each of the three subperiods, concentrating on material that was as contemporaneous as possible in order to create synchrony within diachrony. For some genres, it seemed appropriate to divide the genre internally into two types; for example, biography texts were divided into autobiographies and ordinary biographies, and fiction covered both short 'merry tales' and longer stories. The actual sampling was carried out by opening the book at two different pages and copying the text until the desired amount of text, usually circa 5000 words, had been reached. If this random method yielded passages inappropriate for our purposes, such as passages in verse or foreign languages, a new page was opened.
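For readers who think in terms of present-day tools, the sampling routine just described can be sketched roughly in Python. This is only an illustration and not something we used in the 1980s, when the work was done by hand with a photocopier. The function names, the representation of a book as a list of page strings, the even split of the word target between the two openings and the `is_suitable` check are all assumptions made for the sake of the example.

```python
import random


def copy_from(pages, start, word_limit):
    """Copy text page by page from index `start` until `word_limit` words
    have been collected or the book ends."""
    words = []
    for page in pages[start:]:
        words.extend(page.split())
        if len(words) >= word_limit:
            break
    return words[:word_limit]


def sample_book(pages, total_words=5000, openings=2,
                is_suitable=lambda words: True, rng=None, max_tries=50):
    """Open the book at `openings` different random pages and copy a passage
    from each. A passage rejected by `is_suitable` (for example verse or a
    foreign language) is dropped and a new page is opened instead."""
    rng = rng or random.Random()
    per_opening = total_words // openings  # assumption: target split evenly
    chosen = {}  # start page index -> list of words
    tries = 0
    while len(chosen) < openings and tries < max_tries:
        tries += 1
        start = rng.randrange(len(pages))
        if start in chosen:
            continue  # the openings must be at different pages
        passage = copy_from(pages, start, per_opening)
        if is_suitable(passage):
            chosen[start] = passage
    return [chosen[start] for start in sorted(chosen)]


# Hypothetical usage: a toy "book" of 50 pages with 100 words each.
book = ["word " * 100 for _ in range(50)]
passages = sample_book(book, total_words=1000, rng=random.Random(0))
```

The sketch does not guard against the two passages overlapping when the openings happen to be close together, and a passage that starts near the end of the book may fall short of its share of the target.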
It is natural for corpus users to wonder why certain texts were included and others not. Very often, especially for the early material, there was not much choice, since the extant materials were limited. It was also rather easy to select many well-known texts on the basis of literary or genre histories; for example, for the late seventeenth century it was natural to select the texts of the two famous diarists Samuel Pepys and John Evelyn, rather than looking for other diarists. For periods where we had a genuine choice, the texts we selected were simply the ones we gained access to first. As regards private correspondence, we worked very hard to find sets of letters exchanged between family members, from parents to children and children to parents and between siblings.
The conditions for the compilation work were far from the ones we now have at our Research Unit. There was not even a room for Terttu and me to sit in and discuss our selections, so we had to try and find a corner in the library where our whispering would not disturb other people. After some time, we were given a locked cupboard in the department library to store our papers. Photocopying cost money, and we had to be careful not to exceed the resources we had been given. Despite the limited material conditions, we were enthusiastic about our unpaid work and set ourselves a goal to carry it out as well as we could. We learned a lot, we were excited about creating something new and, most importantly, we had great fun reading old texts.