M.Sc. Sebastian Schmidt defends his doctoral thesis "Unitigs Are Not Enough: the Advantages of Superunitig-Based Algorithms in Bioinformatics" on Thursday the 31st of August 2023 at 13 o'clock in the University of Helsinki Exactum building, Auditorium CK112 (Pietari Kalmin katu 5, basement). His opponent is Research Director Pierre Peterlongo (INRIA/Irisa – Campus de Beaulieu, France) and custos Associate Professor Alexandru I. Tomescu (University of Helsinki). The defence will be held in English.
The thesis of Sebastian Schmidt is a part of research done in the Department of Computer Science and in the Graph Algorithms team of the Algorithmic Bioinformatics group. His supervisor has been Associate Professor Alexandru I. Tomescu (University of Helsinki).
Unitigit Are Not Enough: the Advantages of Superunitig-Based Algorithms in Bioinformatics
Unitigs are a central construct in many subfields of bioinformatics, including genome assembly and the compact representation of k-mer spectra. In both of these subfields, using superunitigs (i.e. superstrings of unitigs) instead of unitigs results in considerable improvements. In genome assembly, unitigs are trivially safe, i.e. they are guaranteed to be part of the genome in an error-free setting. However, superunitigs may also be safe, and result in a more complete assembly, even in real genome assemblers. So far, the maximal safe superunitigs have been characterised for bacterial genomes and sets of bacterial genomes. Also, their efficient enumeration has been studied. But the theory behind safe genome assembly remains scattered, as existing publications have aimed to solve specific goals and did not attempt to describe the existing theory in common terms.
In this dissertation, we introduce cut paths, which are the core component of a safe walk from a graph theory point of view. By studying their remainder structure, we get a simple YES certificate for why walks are safe, whereas earlier characterisations worked only via NO certificates. The remainder structure also exhibits useful properties that result in a more efficient algorithm for metagenomic assembly via amortisation. In addition to studying bacterial genomes, we introduce superunitig algorithms for linear genomes that can also be used in an erroneous genome graph.
For the compact representation of k-mer spectra, we show that an optimal superunitig representation without k-mer repetitions can be computed in linear time. This was formerly discussed as a potential candidate for an NP-hard problem due to its similarity with Hamiltonian cycle when using node-centric de Bruijn graphs. However, by modelling the same problem using arc-centric de Bruijn graphs, we get a linear solution via Eulerian cycle, called Eulertigs.
With repetitions, we compute an optimal set of superunitigs, resulting in a representation of minimum size. We call these strings matchtigs. Our algorithm reduces the problem to a special variant of the Chinese postman problem, and then via min-cost integer flow to min-cost perfect matching. Both for our Eulertig and matchtigs algorithm, we provide an engineered implementation that can be applied even to some of the largest datasets available in bioinformatics today. In practice, using matchtigs can reduce the size of the representation by up to 59% over unitigs with small computational overhead. Additionally, using our superunitig-based representation, we can speed up k-mer based queries by up to 4.26 over unitigs.
Availability of the dissertation
An electronic version of the doctoral dissertation will be available on the e-thesis site of the University of Helsinki at http://urn.fi/URN:ISBN:978-951-51-9380-3.
Printed copies will be available on request from Sebastian Schmidt: sebastian.schmidt@helsinki.fi.