Contact Information

Russian Language and Literature belongs to the Department of Modern Languages.

P.O. Box 24 (Unioninkatu 40 B)
FI-00014 University of Helsinki
Tel. +358 9 1912 2418
Fax +358 9 1912 3072

The Helsinki Annotated Corpus of Russian Texts (HANCO)


The HANCO Corpus project has been running since 2001 in the Department of Slavonic and Baltic Languages and Literatures at the University of Helsinki. The corpus is intended to include morphological, syntactic, and functional information on approximately 100,000 running words, extracted from a modern Russian magazine and representing the modern Russian language.

The Head of the project is Professor Arto Mustajoki.

The main principles behind the creation of the corpus

  1. Orientation to a wider audience. In drawing up the HANCO corpus and its computer interface, we have kept in mind as potential users not only a narrow circle of experts, but also students and teachers of Russian. This certainly does not mean that we completely avoid linguistic terms, but the search parameters have been chosen so as to minimize the amount of specialized knowledge required.
  2. Orientation to the accuracy of the grammatical description, not to the amount of annotated material. Our purpose is to create an annotated corpus containing more exact grammatical and functional information than existing or planned corpora.
  3. Orientation to multilevel grammatical information. The HANCO corpus contains grammatical information at several levels, including morphological, syntactic, and functional (semantic) characteristics. These can be combined in a search.
  4. Orientation to the traditional conception of a language. In compiling the set of parameters we have preferred established theoretical concepts that are used in widely known linguistic works and/or in textbooks on Russian grammar. This follows from the aim of keeping the threshold for using the Corpus low.
  5. Possibility of alternative interpretations. Any researcher working with concrete language material will have found that in many cases it is difficult or even impossible to classify linguistic units unequivocally. In creating the HANCO, we decided to accept the possibility of alternative interpretations of linguistic facts. This apparent lack of strictness demands a lot of manual work, but it makes it easier for the potential user to find the information needed.
The following types of linguistic information are to be included in the HANCO.
  • Morphological information. In the HANCO, a complete morphological description of every running word is given. The morphological analysis and the subsequent disambiguation procedure have been carried out automatically, with further manual processing.
  • Syntactic information. Syntactic information is given at three levels: word collocations, clauses, and sentences. The full description of units for every level will be given according to the Academy Grammar of Russian.
  • Functional-semantic information. The Corpus is a component of the Contrastive Functional Syntax project conducted by Professor A. Mustajoki. Owing to this link, the HANCO will also provide information based on semantic categories, the list of which is being elaborated by A. Mustajoki's research team.
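
As an illustration of how the three levels of annotation and the alternative interpretations might be combined in a search, the following Python sketch may be useful. It is not the HANCO implementation or its data format: the class names (Analysis, Token), the function names (matches, search), and all tag values such as "INS" or "adverbial" are assumptions made for this example only. A word carries one or more alternative analyses, and a search returns it if at least one of its analyses meets every requested criterion.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Analysis:
    """One interpretation of a running word (all tag values are hypothetical)."""
    lemma: str
    pos: str                # morphological level, e.g. "NOUN"
    case: Optional[str]     # morphological level, e.g. "INS"; None if not applicable
    syntax: str             # syntactic level, e.g. "adverbial"
    function: str           # functional-semantic level, e.g. "instrument"


@dataclass
class Token:
    """A running word together with its alternative analyses."""
    form: str
    analyses: List[Analysis] = field(default_factory=list)


def matches(token: Token, **criteria: str) -> bool:
    """Return True if at least one analysis satisfies every criterion."""
    return any(
        all(getattr(analysis, key) == value for key, value in criteria.items())
        for analysis in token.analyses
    )


def search(tokens: List[Token], **criteria: str) -> List[str]:
    """Return the word forms whose annotations match the combined criteria."""
    return [token.form for token in tokens if matches(token, **criteria)]


# "Он пишет карандашом" ("He writes with a pencil"). The instrumental noun
# is given two syntactic readings here, purely to illustrate alternatives.
SENTENCE = [
    Token("Он", [Analysis("он", "PRON", "NOM", "subject", "agent")]),
    Token("пишет", [Analysis("писать", "VERB", None, "predicate", "action")]),
    Token("карандашом", [
        Analysis("карандаш", "NOUN", "INS", "object", "instrument"),
        Analysis("карандаш", "NOUN", "INS", "adverbial", "instrument"),
    ]),
]
```

Under these assumptions, search(SENTENCE, case="INS", syntax="adverbial") and search(SENTENCE, case="INS", syntax="object") both return the word карандашом, so a user finds it whichever interpretation they have in mind.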

The HANCO is being created step by step. The first results (the morphological and syntactic parts) have already been achieved (see here).
The corpus is now also available in MTE format (see here).