Size and sampling
(Adapted from project website)
The method of selecting texts from the general 'universe' of potential Ontario English texts differed by genre. Quasi-random selection was possible only for letters: these came on microfilm, which allowed me to select every fifth or seventh letter until the targeted number was reached.
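The interval-based selection of letters can be illustrated with a minimal sketch. It assumes the letters on a reel can be treated as an ordered list; the function name, the frame numbers, and the optional random starting offset are hypothetical and are not taken from the project's actual procedure.

```python
import random


def systematic_sample(items, step, start=None, seed=None):
    """Return every `step`-th item of an ordered sequence.

    If `start` is not given, a random offset below `step` is drawn
    (an assumption; the source only states the interval).
    """
    if step < 1:
        raise ValueError("step must be a positive integer")
    if start is None:
        start = random.Random(seed).randrange(step)
    return items[start::step]


# Hypothetical microfilm reel holding 35 letters, identified by frame number.
reel = list(range(1, 36))

every_fifth = systematic_sample(reel, step=5, start=0)
every_seventh = systematic_sample(reel, step=7, start=0)
```

Choosing a larger interval on a longer reel, or a smaller one on a shorter reel, is one way such a procedure could arrive at a similar number of letters per period.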
For diaries, the scarcity of verbatim editions and manuscripts ruled out any statistical method of selection. The selection procedure took the holdings of the Archives of Ontario and the University of Toronto Libraries as a starting point. Anne Powell's travel diary of 1789 and the beginning of Ely Playter's diary from 1799 serve as evidence for the first period, from 1776 to 1799. In this genre, whatever was found and proved to be reliable data is included.
With newspapers, however, we are in a better position, since data are readily available for all periods except the first. Again, the holdings of the Archives of Ontario and the University of Toronto Libraries served as a starting point. Generally, newspapers from smaller communities are preferred over those from larger ones. The period from 1875 to 1899 is therefore represented by the Wingham Times and not the Toronto Star. The preference for small local newspapers over large national ones arises from the presumably greater linguistic variation that they offer.
All in all, the corpus comprises some 125,000 words across three genres, with approximately 10,000 to 20,000 words per genre and period. At least two texts are included for each genre and period. The goal of including chunks of between 5,000 and 10,000 words is not always met, but for diaries and newspapers it was ensured that one chunk of at least 2,000 words is included, which should provide the minimum needed to carry out syntactic studies. For letters, the sample sizes depended on the length of the letters, as they were transcribed in full.
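The minimum requirements stated above can be expressed as a simple check on one genre-and-period cell. This is only a sketch: it assumes that each text contributes one chunk and that the 2,000-word minimum applies per cell, and the function name and the word counts in the example are hypothetical, not figures from the corpus.

```python
MIN_TEXTS = 2             # at least two texts per genre and period
MIN_LARGEST_CHUNK = 2000  # words; stated for diaries and newspapers only


def cell_meets_minimum(genre, chunk_sizes):
    """Check one genre/period cell against the stated minimum requirements.

    `chunk_sizes` lists the word count of each text in the cell
    (assumption: one chunk per text).
    """
    if len(chunk_sizes) < MIN_TEXTS:
        return False
    if genre in ("diaries", "newspapers"):
        return max(chunk_sizes) >= MIN_LARGEST_CHUNK
    # Letters were transcribed in full, so no chunk minimum is applied.
    return True


# Hypothetical word counts, for illustration only.
diaries_ok = cell_meets_minimum("diaries", [2400, 900])
newspapers_ok = cell_meets_minimum("newspapers", [1500, 1800])
letters_ok = cell_meets_minimum("letters", [300, 450, 520])
```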