USING OVERLAPS to ASSEBLE the GENOME
Genome assembly is the problem of reverse engineering the full chromosome content of a species given a set of short sequenced DNA fragments and possibly other information.
Some early contributions to genome assembly were accomplished in Helsinki in the early 80’s. These included computation of approximate overlaps by dynamic programming and modeling assembly through the shortest common superstring formulation.
- H. Peltola, H. Söderlund, J. Tarhio & E. Ukkonen: Algorithms for some string matching problems arising in molecular genetics. Proc. IFIP Congress 83, 59–64, Elsevier 1983.
- H. Peltola, H. Söderlund & E. Ukkonen: SEQAID: A DNA sequence assembling program based on a mathematical model. Nucleic Acids Research 12 (1984), 307–321.
- J. Tarhio & E. Ukkonen: A greedy approximation algorithm for constructing shortest common superstrings. Theoretical Computer Science 57 (1988), 131–145.
- E. Ukkonen: A linear time algorithm for finding approximate shortest common superstrings. Algorithmica 5 (1990), 313–323.
Very recently, a space-efficient solution was found to the computation of approximate shortest common superstrings.
- Jarno Alanko and Tuukka Norri. Greedy shortest common superstring approximation in compact space. To appear in Proc. SPIRE 2017.
While there are many applications for the shortest superstring problem, the genome assembly problem is now commonly approached with a modular framework: (1) error correction, (2) contig assembly, (3) scaffolding, and (4) gap filling (the figure illustrates this last step).
We developed the first sequencing error correction method able to handle reads from different short read technologies including Illumina, SOLiD and 454 together.
- Salmela: Correction of sequencing errors in a mixed set of reads. Bioinformatics 26(10):1284-1290, 2010.
We presented a method based on multiple sequence alignments for the correction of sequencing errors in short read data. In contrast to most of the work in the field concentrating on a single technology, the method proposed in this paper can be tuned to different error models of the various sequencing technologies.
- Salmela, Schröder: Correcting errors in short reads by multiple alignments. Bioinformatics 27(11):1455-1461, 2011. (Also in HiTSeq 2011).
We developed the first alignment-free method for the correction of sequencing errors in long but highly erroneous reads with the help of short and accurate reads. The method is based on building a de Bruijn graph of the short reads and then threading the long reads through this graph. The method achieves similar accuracy as previous methods but is six times faster and uses 93% less memory than previous methods. The efficiency of the method is due to the use of state-of-the-art data structures with high quality implementations.
Later we extended this hybrid approach to a method using only long reads.
- Salmela, Rivals: LoRDEC: accurate and efficient long read error correction. Bioinformatics 30(24):3506-3514, 2014.
- Salmela, Walve, Rivals, Ukkonen: Accurate selfcorrection of errors in long reads using de Bruijn graphs. Bioinformatics 33(6): 799-806, 2017. (Also in RECOMB-seq 2016).
There was a long gap in studying the core routine of genome assembly, that is, contig assembly (see below for a definition), after the early contributions in the 80’s. Meanwhile, de Bruijn graphs and Burrows-Wheeler indexing techniques were introduced, and we adopted the latter for space-efficient computation of approximate overlaps.
- Välimäki, Ladra, Mäkinen. Approximate all-pairs suffix/prefix overlaps. Information & Computation, 213:49-58, 2012. CPM 2010 Special Issue.
Assume then we have fixed a notion of solution of the genome assembly problem. The contig assembly problem can be mathematically formulated as finding the strings that are common to all assembly solutions. When the notion of an assembly solution is a circular walk covering all nodes/edges of an assembly graph, we have given a characterization (omnitigs) of the walks in a de Bruijn graph corresponding to all such possible contigs:
Tomescu, Medvedev. Safe and Complete Contig Assembly Via Omnitigs. In Proc. RECOMB 2016.
In particular, if we assume no sequencing errors, complete genome coverage and no polyploidity, then omnitigs are a full characterization of everything that can be assembled from a de Bruijn graph. Later, we gave an improved algorithm for finding all omnitigs of a graph, that is also optimal in the output size:
Cairo, Medvedev, Obscura Acosta, Rizzi, Tomescu, Optimal Omnitig Listing for Safe and Complete Contig Assembly, In Proc. CPM 2017.
We studied the scaffolding problem in genome assembly where preassembled contigs are ordered based on mate pair data. We presented a new method which divides the problem into small subproblems of restricted size and solves these exactly with mixed integer programming. In particular we develop a technique restricting the size of the subproblems so that the subproblems can be solved exactly.
- Salmela, Mäkinen, Välimäki, Ylinen, Ukkonen. Fast Scaffolding with Small Independent Mixed Integer Programs. Bioinformatics 27(23): 3259-3265, 2011.
We presented a rigorous formulation of the gap filling phase of genome assembly and show that the problem is NP-complete. We further give a pseudopolynomial algorithm for gap filling and show that it is practical for bacterial genomes. These approaches used the exact path length problem studied earlier a different application in mind.
M. Nykänen & E. Ukkonen: The exact path length problem. Journal of Algorithms 42 (2002), 41–53.
- Salmela, Sahlin, Mäkinen, Tomescu. Gap Filling as Exact Path Length Problem. Journal of Computational Biology 23(5);347–361. 2016
Salmela, Tomescu. Safely filling gaps with partial solutions common to all solutions. In Proc. WABI 2016.
Assembly validation etc.
- Mäkinen, Salmela, and Ylinen. Normalized N50 Assembly Metric using Gap-Restricted Co-Linear Chaining. BMC Bioinformatics, 13:255 (3 October 2012).
The scaffolding and validation tools described above have been used in a recent assembly project:
- Ahola V., Lehtonen R., Somervuo P. et al. The Glanville fritillary genome retains an ancient karyotype and reveals selective chromosomal fusions in Lepidoptera. Nature Communications, 5:4737, Sept. 2014.