Research and publications

Project Setup

FoTran builds on multilingual neural models that learn representations from translated documents and possibly other sources. In sub-project 1, we aim to develop effective multilingual language encoders that find language-agnostic meaning representations through cross-lingual grounding. In sub-project 2, we emphasize intrinsic evaluation and the interpretation of the representations that emerge from our models when they are trained on different kinds of data. Sub-projects 3 and 4 focus on extrinsic evaluation and the application of language representations in downstream tasks such as machine translation (sub-project 4) and other tasks that require some kind of semantic reasoning (sub-project 3). The following picture illustrates the interactions between the sub-projects.

FoTran project setup

We strongly believe in open science. Besides the new scientific results, which we will publish in open publications, our project will produce open tools and resources. Below, we link to our resources and publications.

Resources

Software

Data

  • OPUS is a growing collection of translated texts from the web. In the OPUS project we try to convert and align free online data, to add linguistic annotation, and to provide the community with a publicly available parallel corpus.
  • fiskmö data, Finnish-Swedish data for machine translation and computer-assisted translation
  • MuCoW, a test suite of contrastive examples for word sense disambiguation in machine translation
  • Helsinki prosody corpus, a corpus with annotated prosodic prominence (including software for predicting prominence)

Publications

Bjerva, J., Östling, R., Han Veiga, M., Tiedemann, J., & Augenstein, I. (2019). What do Language Representations Really Represent? Computational Linguistics.

Raganato, A., Scherrer, Y. and Tiedemann, J. (2019). The MuCoW test suite at WMT 2019: Automatically harvested multilingual contrastive word sense disambiguation test sets for machine translation. In Proceedings of the Fourth Conference on Machine Translation (WMT): Shared Task Papers.

Raganato, A., Vázquez, R., Creutz, M. and Tiedemann, J. (2019). An Evaluation of Language-Agnostic Inner-Attention-Based Representations in Machine Translation. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP).

Tiedemann, J. & Scherrer, Y. (2019). Measuring Semantic Abstraction of Multilingual NMT with Paraphrase Recognition and Generation Tasks. In Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP. Rogers, A., Drozd, A., Rumshisky, A. & Goldberg, Y. (eds.). Stroudsburg: The Association for Computational Linguistics, pp. 35-42.

Vázquez, R., Raganato, A., Tiedemann, J. and Creutz, M. (2019). Multilingual NMT with a language-independent attention bridge. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP).

Raganato, A. and Tiedemann, J. (2018). An Analysis of Encoder Representations in Transformer-Based Machine Translation. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP.

Tiedemann, J. (2018). Emerging Language Spaces Learned From Massively Multilingual Corpora. In Proceedings of the 3rd Conference on Digital Humanities in the Nordic Countries (DHN 2018), Helsinki, Finland.

Östling, R., Scherrer, Y., Tiedemann, J., Tang, G. and Nieminen, T. (2017). The Helsinki Neural Machine Translation System. In Proceedings of WMT at EMNLP 2017, Copenhagen, Denmark.

Östling, R. and Tiedemann, J. (2017). Continuous multilinguality with language vectors. In Proceedings of EACL, Valencia, Spain, pp. 644-649.

Tiedemann, J. (2016). Finding Alternative Translations in a Large Corpus of Movie Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC-2016).

Lison, P. and Tiedemann, J. (2016). OpenSubtitles2015: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC-2016).

Östling, R. (2015). Word order typology through multilingual word alignment. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Beijing, China. Association for Computational Linguistics, pp. 205-211.