Research and publications

FoTran builds on multilingual neural models that learn representations from translated documents and possibly other sources.

Project Setup

In this project, we aim to develop effective multilingual language encoders (sub-project 1) that find language-agnostic meaning representations through cross-lingual grounding. In sub-project 2, we emphasize intrinsic evaluation and the interpretation of the representations that emerge from our models when training with different kinds of data. Sub-projects 3 and 4 focus on extrinsic evaluation and the application of language representations in downstream tasks such as machine translation (sub-project 4) and other tasks that require some kind of semantic reasoning (sub-project 3). The following picture illustrates the interactions between the various sub-projects.

We strongly believe in open science. Besides the new scientific results, which we will publish in open publications, our project will produce open tools and resources. Below, we link to our resources and publications.

Resources

Software

  • , our software and tools in git repositories
    • , multilingual neural machine translation with language-specific components
    • , machine translation servers, and
    • , local MT plugins for translation workflows
    • , a fork of OpenNMT with our extension of a shared cross-lingual layer (called "attention bridge" or "flexi-bridge")
    • , NLI with iterative refinement sentence encoders
    • and , tools for processing data
    • and implementing the
    • implementing the
    • , a package for movie subtitle alignment
    • and , efficient word aligner based on Gibbs sampling
    • , a multilingual word aligner
    • HNMT, the Helsinki Neural Machine Translation system, a neural network-based machine translation system developed at the University of Helsinki and Stockholm University
  • , some of our git repositories about cross-lingual NLP
    • , cross-lingual parsing
    • (a bit outdated by now)
  • , git repositories at the University's gitlab installation

Data

  • OPUS is a growing collection of translated texts from the web. In the OPUS project, we try to convert and align free online data, to add linguistic annotation, and to provide the community with a publicly available parallel corpus.
  • , Finnish-Swedish data for machine translation and computer-assisted translation
  • , a test suite of contrastive examples for word sense disambiguation in machine translation
  • , a corpus with annotated prosodic prominence (including software for predicting prominence)

 

Recent publications