Corpus of Video-Mediated English as a Lingua Franca Conversations (ViMELF)
ViMELF, the Corpus of Video-Mediated English as a Lingua Franca Conversations, contains 20 Skype conversations between 40 speakers from Germany (20 speakers), Spain (5), Italy (5), Finland (5), and Bulgaria (5), totaling 744.5 minutes (ca. 12.5 hours), with an average conversation length of 37.23 minutes. The corpus comprises 113,670 words in the plain text version and 152,472 items in the annotated version. The transcripts are available as .docx and .txt files; the videos are in MPEG4 format. Several versions are available: the fully annotated pragmatic version as text and XML, a lexical version, and a POS-tagged version. Sociolinguistic background information on the participants is also provided.
Project leader: Prof. Stefan Diemer, Trier University of Applied Sciences
Time of compilation: 2012–2018
Size: 113,670 words (plain text version); 152,472 items (annotated version)
Number of texts/samples: 20
Contact email: email@example.com
Project home page: http://umwelt-campus.de/case
ViMELF in numbers:
- 20 Conversations
- Conversation length: 744.5 min total, ca. 12.5 hours of conversations
- Average conversation length: 37.23 min.
- Words/Tokens: 113,670 (plain text), 152,472 (annotated version)
- Participants: 40 (20 SB, 5 FL, 5 HE, 5 ST, 5 SF)
- Medium: Video both sides: 11, video one side: 3, audio: 6
Reference line and copyright
ViMELF. 2018. Corpus of Video-Mediated English as a Lingua Franca Conversations. Birkenfeld: Trier University of Applied Sciences. Version 1.0. The CASE project [umwelt-campus.de/case].
ViMELF – Corpus of Video-Mediated English as a Lingua Franca Conversations. © The CASE Project, Trier University of Applied Sciences, compilers: Stefan Diemer, Marie-Louise Brunner, Caroline Collet, Selina Schmidt.
Corpus description: https://www.umwelt-campus.de/ucb/index.php?id=12246
General information on data: https://www.umwelt-campus.de/ucb/index.php?id=11349
Transcription conventions: https://umwelt-campus.de/case-conventions
Project coordination: Stefan Diemer & Marie-Louise Brunner
Transcription and proofreading: Janine Dieterle, Julian Laudwein, Sina Burghardt
The corpus is freely downloadable for non-commercial research purposes upon free subscription.
Four versions of the corpus are available:
- CASE transcription (as docx, rtf and txt): the basic version produced by manual transcription. CASE transcription conventions include spoken language features beyond the words, such as prosodic, paralinguistic and non-verbal features.
- XML version (xml): a version of the annotated CASE transcription encapsulating the original information in a machine-readable form; this version is produced with XTranscript (Gee 2018).
- Lexical version (lex): a version from which all annotation is removed; this version is also produced with XTranscript (Gee 2018).
- Part-of-speech tagged version (pos): a POS-tagged version of the lexical version, produced with the CLAWS POS tagger (C7 tagset).
In addition to these versions, video/audio files are provided in MPEG4 format.
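As a rough illustration of how the POS-tagged version might be explored, the following Python sketch counts tag frequencies. It assumes tokens are stored in the horizontal `word_TAG` layout that CLAWS can produce; the actual layout of the ViMELF pos files may differ. The sample lines and the function names `parse_tagged_line` and `tag_frequencies` are invented for illustration and are not taken from the corpus.

```python
from collections import Counter


def parse_tagged_line(line):
    """Split one line of 'word_TAG' tokens into (word, tag) pairs.

    Assumption: each token is a word and a tag joined by an underscore.
    Tokens that do not match this shape are skipped.
    """
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST underscore, so words that
        # themselves contain underscores keep them.
        word, sep, tag = token.rpartition("_")
        if sep and word and tag:
            pairs.append((word, tag))
    return pairs


def tag_frequencies(lines):
    """Count how often each POS tag occurs across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(tag for _, tag in parse_tagged_line(line))
    return counts


# Invented sample lines using C7 tags (not actual corpus data).
sample = [
    "i_PPIS1 think_VV0 it_PPH1 is_VBZ good_JJ",
    "yeah_UH it_PPH1 is_VBZ",
]
freqs = tag_frequencies(sample)
for tag, count in freqs.most_common():
    print(tag, count)
```

To run this over a real file, the lines of the file can be passed to `tag_frequencies` in place of `sample`, after checking that the file really uses the assumed layout.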
Corpus of Academic Spoken English (CASE)
CoRD Entry submitted on May 30, 2018 by Stefan Diemer.