Projects

Some selected projects of the research groups are collected here.

USING OVERLAPS to ASSEBLE the GENOME

Genome assembly is the problem of reverse engineering the full chromosome content of a species given a set of short sequenced DNA fragments and possibly other information.

Some early contributions to genome assembly were accomplished in Helsinki in the early 80’s. These included computation of approximate overlaps by dynamic programming and modeling assembly through the shortest common superstring formulation.

  • H. Peltola, H. Söderlund, J. Tarhio & E. Ukkonen: Algorithms for some string matching problems arising in molecular genetics. Proc. IFIP Congress 83, 59–64, Elsevier 1983.
  • H. Peltola, H. Söderlund & E. Ukkonen: SEQAID: A DNA sequence assembling program based on a mathematical model. Nucleic Acids Research 12 (1984), 307–321.
  • J. Tarhio & E. Ukkonen: A greedy approximation algorithm for constructing shortest common superstrings. Theoretical Computer Science 57 (1988), 131–145.
  • E. Ukkonen: A linear time algorithm for finding approximate shortest common superstrings. Algorithmica 5 (1990), 313–323.

Very recently,  a space-efficient solution was found to the computation  of approximate shortest common superstrings.

  • Jarno Alanko and Tuukka Norri. Greedy shortest common superstring approximation in compact space. To appear in Proc. SPIRE 2017.

While there are many applications for the shortest superstring problem, the genome assembly problem is now commonly approached with a modular framework: (1) error correction, (2) contig assembly, (3) scaffolding, and (4) gap filling (the figure illustrates this last step).

Error correction

We developed the first sequencing error correction method able to handle reads from different short read technologies including Illumina, SOLiD and 454 together.

We presented a method based on multiple sequence alignments for the correction of sequencing errors in short read data. In contrast to most of the work in the field concentrating on a single technology, the method proposed in this paper can be tuned to different error models of the various sequencing technologies.

We developed the first alignment-free method for the correction of sequencing errors in long but highly erroneous reads with the help of short and accurate reads. The method is based on building a de Bruijn graph of the short reads and then threading the long reads through this graph. The method achieves similar accuracy as previous methods but is six times faster and uses 93% less memory than previous methods. The efficiency of the method is due to the use of state-of-the-art data structures with high quality implementations.

Later we extended this hybrid approach to a method using only long reads.

Contig assembly

There was a long gap in studying the core routine of genome assembly, that is, contig assembly (see below for a definition), after the early contributions in the 80’s. Meanwhile, de Bruijn graphs and Burrows-Wheeler indexing techniques were introduced, and we adopted the latter for space-efficient computation of approximate overlaps.

Assume then we have fixed a notion of solution of the genome assembly problem. The contig assembly problem can be mathematically formulated as finding the strings that are common to all assembly solutions. When the notion of an assembly solution is a circular walk covering all nodes/edges of an assembly graph, we have given a characterization (omnitigs) of the walks in a de Bruijn graph corresponding to all such possible contigs:

In particular, if we assume no sequencing errors, complete genome coverage and no polyploidity, then omnitigs are a full characterization of everything that can be assembled from a de Bruijn graph. Later, we gave an improved algorithm for finding all omnitigs of a graph, that is also optimal in the output size:

  • Cairo, Medvedev, Obscura Acosta, Rizzi, Tomescu, Optimal Omnitig Listing for Safe and Complete Contig Assembly, In Proc. CPM 2017.

Scaffolding

We studied the scaffolding problem in genome assembly where preassembled contigs are ordered based on mate pair data. We presented a new method which divides the problem into small subproblems of restricted size and solves these exactly with mixed integer programming. In particular we develop a technique restricting the size of the subproblems so that the subproblems can be solved exactly.

Gap filling

We presented a rigorous formulation of the gap filling phase of genome assembly and show that the problem is NP-complete. We further give a pseudopolynomial algorithm for gap filling and show that it is practical for bacterial genomes. These approaches used the exact path length problem studied earlier a different application in mind.

  • M. Nykänen & E. Ukkonen: The exact path length problem. Journal of Algorithms 42 (2002), 41–53.

  • Salmela, Sahlin, Mäkinen, Tomescu. Gap Filling as Exact Path Length Problem. Journal of Computational Biology 23(5);347–361. 2016 
  • Salmela, Tomescu. Safely filling gaps with partial solutions common to all solutions. In Proc. WABI 2016.  

Assembly validation etc.

Applications

The scaffolding and validation tools described above have been used in a recent assembly project:

A Collection of Genome Assembly Software from Helsinki