OCR and post-correction of historical newspapers and journals

Senka Drobac (University of Helsinki)

The corpus of historical newspapers and journals published in Finland, with more than 11 million pages of historical text, is of great value to the research community. However, the current OCR accuracy is too low for scientific research.

The material is particularly challenging for OCR because of its large size, its variety of fonts from two font families (Blackletter and Antiqua), and its use of two main languages (Finnish and Swedish), mostly in non-standardized form.

In this presentation, I will talk about the methods we have developed to OCR and post-correct this dataset. I will go through best practices for selecting training data, finding the optimal neural network for OCR, and voting with different models, and I will briefly introduce the post-correction.

Presentation slides

Aalto HELDIG DH pizza seminar on Friday 16 October 2020 at 12.00 (Zoom)