Yahoo-based Contrastive Corpus of Questions and Answers (YCCQA)

YCCQA is a contrastive corpus of English, French, German and Spanish, based on the questions and answers submitted by users of the Yahoo Answers website. It thus consists of question-answer interactions between internet users, produced under almost identical circumstances in all four languages. These near-identical production contexts allow contrastive analysis, but the sub-corpora can also be used independently for language-specific research. The language represented in the corpus is characteristically informal and unmonitored, illustrating the casual writing style of internet postings.

Project leader: Hendrik De Smet, Katholieke Universiteit Leuven / Research Foundation Flanders
Time of compilation: 2008–2009
Size: 29,400,000 words
Languages: English, French, German, Spanish
Number of texts/samples: about 90,000 questions and 575,000 answers
Period: 2006–2009
Released: 2009
Funding: Research Foundation Flanders

Reference lines and copyright

Contrastive Corpus of Questions and Answers. 2009. Compiled by Hendrik De Smet. Department of Linguistics, University of Leuven.


Hendrik De Smet


The corpus is available to all who are interested, free of charge, upon agreement to the terms and conditions of use. Please contact the compiler (

Technical information

The corpus consists of .txt files that can be searched using any concordancer.
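
For users without a dedicated concordancer, a basic keyword-in-context (KWIC) search over plain .txt files can be approximated in a few lines of Python. The following is a minimal sketch only, not part of the corpus distribution: the function names kwic and search_corpus are hypothetical, and the UTF-8 encoding and the directory layout are assumptions that should be checked against the actual files.

```python
import re
from pathlib import Path


def kwic(text, pattern, width=40):
    """Yield (left, match, right) keyword-in-context tuples for a regex pattern."""
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Replace newlines with spaces so each hit fits on one output line
        yield (left.replace("\n", " "), m.group(0), right.replace("\n", " "))


def search_corpus(corpus_dir, pattern, width=40, encoding="utf-8"):
    """Run kwic() over every .txt file found under corpus_dir.

    The encoding and the folder structure are assumptions, not documented
    properties of the corpus; undecodable bytes are replaced, not raised.
    """
    for path in sorted(Path(corpus_dir).rglob("*.txt")):
        text = path.read_text(encoding=encoding, errors="replace")
        for left, hit, right in kwic(text, pattern, width):
            yield path.name, left, hit, right
```

Calling search_corpus with the folder holding one sub-corpus and a pattern such as r"\bgonna\b" yields one tuple per hit (file name, left context, match, right context), which can then be printed or written to a spreadsheet. A regular expression is used so that word boundaries and spelling variants can be handled in the query itself.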