Many real-world problems are modeled as computational problems, but unfortunately with incomplete data or knowledge. As such, they may admit a large number of solutions, and we have no way of finding the correct one. This issue is sometimes addressed by outputting all solutions, which is infeasible for many practical problems. We aim to construct a general methodology for finding the set of all sub-solutions common to all solutions. We can ultimately trust these to be part of the correct solution. We call this set "safe". Ultimately, we aim at creating automated and efficient ways of reporting all safe sub-solutions of a problem.

The main motivation of this project comes from Bioinformatics, in particular from the analysis of high-throughput sequencing (HTS) of DNA. One of the main applications of HTS data is to assemble it back into the original DNA sequence. This genome assembly problem admits many solutions, and current research has indeed considered outputting only partial solutions that are likely to be present in the correct original DNA sequence. However, this problem has been approached only from an experimental point of view, with no definite answer on what are all the safe sub-solutions to report. In fact, the issue of safe sub-solutions has been mostly overlooked in Bioinformatics and Computer Science in general.

This project will derive the first safe algorithms for a number of fundamental problems about walks in graphs, network flows, dynamic programming. We will apply these inside practical tools for genome assembly, RNA assembly and pan-genome analysis. This is very relevant at the moment, because HTS goes from research labs to hospitals, and we need answers that are first of all accurate. Our approach changes the perspective from which we address all real-world problems, and could spur a new line of research in Computer Science/Bioinformatics. The grand aim is a mathematical leap into understanding what can be safely reported from the data.

More details

The de Bruijn graph is an important data structure for processing data produced by second generation sequencing machines which produce short but accurate sequencing reads. Many flavours of de Bruijn graphs such as the coloured de Bruijn graph and the variable order de Bruijn graph have been developed but most applications do not take advantage of these novel structures. We will develop genome assembly algorithms based on the variable order de Bruijn graph and we will use the coloured de Bruijn graph to create novel solutions for the assembly reconciliation problem and haplotype aware sequencing error correction.

The current de Bruijn graphs struggle to handle noisy data which limits their use for processing data from third generation sequencing technology which produces long but highly erroneous data. We will propose variants of the de Bruijn graph for handling noisy data which will make the computationally efficient methods based on de Bruijn graphs available for long read data.

Given the tremendous advances in DNA sequencing technology, new and more efficient methods are continuously needed to process the vast amounts of data produced.  Biological sequences can be billions of nucleotides long and only short pieces can be read at a time. Therefore the original sequence needs to be inferred from the pieces. We will develop algorithms and tools for accurate and complete reconstruction of biological sequences which is essential for research in molecular biology.

For genome assembly we will develop methods that can improve the current fragmented draft assemblies to accurate chromosomal assemblies using genetic maps and optical maps. Additionally we will develop algorithms for haplotype assembly that can combine the benefits of high coverage short read data sets with lower coverage of more expensive long read data and thus provide a cost efficient way of producing accurate haplotypes.

The major scientific new direction of this project is to study what happens to selected classical sequence analysis tasks when labeled directed acyclic graphs (labeled DAGs) are used as inputs.

The project studies implications of bidirectional Burrows-Wheeler transform -based techniques to sequence analysis and widens the scope of such analyses to labeled directed acyclic graphs (labeled DAGs). The motivation for these new directions come from several timely applications in genome research, e.g., in new ways to  represent diploid genomes and pan-genomes in place of the commonly adopted use of a single sequence as the basis of analyses.

For the results obtained in this project, see the research themes space-efficient sequence analysis, transcript assembly, and recombination-aware sequence alignment of the genome-scale algorithmics team.