Sequencing technologies have become cheap and accurate, leading to major breakthroughs such as the complete sequence of a human genome, the creation of nationwide population gene banks, and the discovery of novel viruses. As the amount of data produced grows exponentially and its applications become broader and more complex, the community needs accurate computational methods that scale.
At the core of many algorithmic methods for processing sequencing data is the basic primitive of finding a set of paths or walks in graphs of various kinds. Under different formulations and objective functions, the resulting problems can be NP-hard (e.g. flow decompositions) or polynomial-time solvable (e.g. path covers) yet still impractical on large graphs. Thus, many practical tools prefer fast heuristics to exact algorithms. While such heuristics may be optimized for specific inputs, they are not guaranteed to be reliable or accurate in general, a highly relevant issue in, e.g., medical and life-science research.
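As a concrete instance of the polynomial-time side, here is a minimal Python sketch of the classic reduction of minimum vertex-disjoint path cover on a DAG to maximum bipartite matching. The function names and the toy instance are ours, chosen for illustration; it uses a simple augmenting-path matching for brevity and is not code from the project.

```python
def min_path_cover(graph, nodes):
    """Minimum number of vertex-disjoint paths covering all nodes of a DAG,
    via the classic reduction to bipartite matching
    (cover size = |V| - matching size). Augmenting-path matching is used
    for brevity; large inputs need a faster matching algorithm."""
    match_to = {}  # right-copy of v -> left-copy of u, for matched edge (u, v)

    def augment(u, seen):
        # Try to match u, rerouting previously matched edges if needed.
        for v in graph[u]:
            if v not in seen:
                seen.add(v)
                if v not in match_to or augment(match_to[v], seen):
                    match_to[v] = u
                    return True
        return False

    matching = sum(augment(u, set()) for u in nodes)
    return len(nodes) - matching

# A diamond DAG: two paths (s->a->t and the single node b) suffice.
dag = {"s": ["a", "b"], "a": ["t"], "b": ["t"], "t": []}
print(min_path_cover(dag, ["s", "a", "b", "t"]))  # 2
```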
This project will develop general methods to massively scale such exact graph algorithms. First, via novel graph structures usable in a preprocessing step: safety structures, e.g. sets of paths that can be quickly found to appear in all optimal solutions and thus simplify the problem; and variation structures that confine the hardness of a problem to the graph areas rich in genetic variation. Second, via modern algorithmic techniques: parameterizing polynomial-time algorithms to run in time linear in the graph size and superlinear only in a small parameter; and dynamic algorithms that, as the input grows, update solutions based only on the new data.
We will apply these methods in two high-impact applications: long-read RNA transcript discovery, and the indexing of massive and rapidly growing genomic databases. This project paves the way for exact graph algorithms that remain usable regardless of problem complexity or input size, and that are applicable to real-world problems.
Polynomial-time-solvable problems on huge graphs are "practically tractable" only if their complexity is sufficiently low. Likewise, NP-hard problems admitting specialized solvers are "practically tractable" only on small enough inputs. We plan to enlarge these two limits of "practical tractability", motivated by two key applications in Bioinformatics. Pangenomic graphs represent all the genetic variation in a population. On these, we will develop fixed-parameter tractable linear-time algorithms, whose complexity is linear in the size of the input and super-linear only in a small parameter of the input. In transcriptomics, we need to solve NP-hard problems about network flows. Here we will use safe subpaths to simplify the input to these problems. Our expected results will significantly speed up existing Bioinformatics methods and scale them to inputs that were not feasible before. We also expect them to be adopted for other graph problems in Computer Science at large.
The flow decomposition problem asks to decompose a flow in a graph into weighted paths. In Bioinformatics, this has applications in multiassembly problems (of RNA transcripts or viral quasispecies, e.g., HIV or SARS-CoV-2). The state of the art around this problem is lacking both in theory and in practice, with Bioinformatics multiassembly software reaching a precision of just 50% on some datasets. This project aims to develop the algorithmic theory around this problem, which could provide practitioners with solid algorithmic building blocks usable by future multiassembly methods. By also implementing the best of our algorithms, we aim to develop practical Bioinformatics software with significantly improved precision arising from our new exact algorithms, from incorporating all the available information, and from properly handling the issue of multiple optimal multiassembly solutions.
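To make the problem concrete, the following Python sketch implements the simple "widest path" greedy baseline: repeatedly peel off a source-to-sink path of maximum bottleneck weight from a flow on a DAG. The names and toy instance are ours; minimizing the number of paths is NP-hard, and this heuristic is exactly the kind of method the project's exact algorithms aim to supersede.

```python
def widest_path(graph, flow, source, sink, order):
    # DP over a topological order: width[v] = max over in-edges (u, v)
    # of min(width[u], remaining flow on (u, v)).
    width = {v: 0 for v in order}
    parent = {}
    width[source] = float("inf")
    for u in order:
        for v in graph[u]:
            w = min(width[u], flow[(u, v)])
            if w > width[v]:
                width[v], parent[v] = w, u
    if width[sink] == 0:
        return None, 0
    path, v = [sink], sink
    while v != source:
        v = parent[v]
        path.append(v)
    return path[::-1], width[sink]

def greedy_decompose(graph, flow, source, sink, order):
    # Repeatedly peel off the widest path until the flow is exhausted.
    # A heuristic baseline only: it does not minimize the path count.
    paths = []
    while True:
        path, w = widest_path(graph, flow, source, sink, order)
        if not path:
            return paths
        for u, v in zip(path, path[1:]):
            flow[(u, v)] -= w
        paths.append((path, w))

graph = {"s": ["a", "b"], "a": ["t"], "b": ["t"], "t": []}
flow = {("s", "a"): 3, ("s", "b"): 2, ("a", "t"): 3, ("b", "t"): 2}
print(greedy_decompose(graph, flow, "s", "t", ["s", "a", "b", "t"]))
# [(['s', 'a', 't'], 3), (['s', 'b', 't'], 2)]
```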
The field of compressed data structures is a relatively young branch of computer science that aims to arrange data so that it occupies little space, while still supporting fast search over and access to the original data. Compressed data structures have already had a major impact on other fields of science, perhaps most notably genomics, where, for example, they are used to efficiently assemble, represent, and search for patterns in sets of genomes. To date, however, the range of compressed data structures available has been largely inflexible to changes in the data they are built from, with even small changes requiring the structure to be rebuilt from scratch. This project will design the next generation of compressed data structures that can adapt dynamically and remain functional in the face of changing data, thus bringing their virtues to an entirely new range of applications in science and industry.
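A minimal illustration of the static-structure limitation, assuming a toy design of our own (the class name and layout are illustrative, not the project's): a run-length compressed bitvector that answers rank queries in logarithmic time in the number of runs, but must be rebuilt from scratch after any edit to the underlying bits.

```python
from bisect import bisect_right
from itertools import groupby

class RLBitVector:
    """Run-length compressed bitvector answering rank in O(log r) time,
    r being the number of runs. Static by design: even a one-bit edit
    forces a full rebuild, which is exactly the limitation that dynamic
    compressed data structures address."""

    def __init__(self, bits):
        self.first_bit = bits[0] if bits else 0
        self.run_ends = []   # cumulative lengths: position after each run
        self.ones_upto = []  # number of 1s in the prefix up to each run end
        pos = ones = 0
        for bit, run in groupby(bits):
            length = sum(1 for _ in run)
            pos += length
            ones += length if bit else 0
            self.run_ends.append(pos)
            self.ones_upto.append(ones)

    def rank1(self, i):
        """Number of 1s among bits[0:i]."""
        if i <= 0:
            return 0
        j = bisect_right(self.run_ends, i - 1)   # run containing bit i-1
        start = self.run_ends[j - 1] if j else 0
        ones = self.ones_upto[j - 1] if j else 0
        bit = self.first_bit ^ (j & 1)           # runs alternate 0/1
        return ones + (i - start) * bit

bv = RLBitVector([1, 1, 0, 0, 0, 1, 1, 1])
print(bv.rank1(6))  # 3 ones among the first 6 bits
```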
Many real-world problems are modeled as computational problems, but unfortunately with incomplete data or knowledge. As such, they may admit a large number of solutions, and we have no way of finding the correct one. This issue is sometimes addressed by outputting all solutions, which is infeasible for many practical problems. We instead aim to construct a general methodology for finding the set of all sub-solutions common to all solutions; these can be trusted to be part of the correct solution, and we call this set "safe". Ultimately, we aim to create automated and efficient ways of reporting all safe sub-solutions of a problem.
The main motivation of this project comes from Bioinformatics, in particular from the analysis of high-throughput sequencing (HTS) of DNA. One of the main applications of HTS data is to assemble it back into the original DNA sequence. This genome assembly problem admits many solutions, and current research has indeed considered outputting only partial solutions that are likely to be present in the correct original DNA sequence. However, this problem has been approached only from an experimental point of view, with no definite answer on what all the safe sub-solutions to report are. In fact, the issue of safe sub-solutions has been mostly overlooked in Bioinformatics, and in Computer Science in general.
This project will derive the first safe algorithms for a number of fundamental problems on walks in graphs, network flows, and dynamic programming. We will apply these inside practical tools for genome assembly, RNA assembly, and pan-genome analysis. This is very relevant at the moment, because HTS is moving from research labs to hospitals, and we need answers that are first of all accurate. Our approach changes the perspective from which we address real-world problems, and could spur a new line of research in Computer Science and Bioinformatics. The grand aim is a mathematical leap in understanding what can be safely reported from the data.
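The textbook example of a safe sub-solution in genome assembly is a unitig, a maximal non-branching path of the assembly graph: roughly, every walk that explains the data must traverse it in full. The following Python sketch computes unitigs; names are ours, and isolated cycles of non-branching nodes are ignored for brevity.

```python
def unitigs(graph):
    """Maximal non-branching paths of a directed graph, the textbook
    example of sub-solutions that are safe for genome assembly.
    `graph` maps each node to its list of successors."""
    indeg = {v: 0 for v in graph}
    for u in graph:
        for v in graph[u]:
            indeg[v] += 1

    def branching(v):
        return len(graph[v]) != 1 or indeg[v] != 1

    paths = []
    for u in graph:
        if branching(u):
            for v in graph[u]:
                # Extend a path from each out-edge of a branching node
                # until the next branching node.
                path = [u, v]
                while not branching(path[-1]):
                    path.append(graph[path[-1]][0])
                paths.append(path)
    return paths

g = {"A": ["B"], "B": ["C"], "C": ["D", "E"], "D": [], "E": []}
print(unitigs(g))  # [['A', 'B', 'C'], ['C', 'D'], ['C', 'E']]
```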
Recent breakthroughs in population genomics have demonstrated that mapping and understanding pangenomic variation within a species is becoming essential for many high-impact applications, such as biomarker discovery, pathogen surveillance, modeling tumor evolution, and prediction of treatment outcomes. Screening the pangenome of a species produces massive and complex data consisting of all genomic variation observed in the species. Only high-performance computing is able to deal with this explosion of information. The project aims for high-impact results across the full spectrum of theory and practice, and will deliver improved, practical methods for the analysis, compression, and indexing of massive genomic collections using high-performance computing technologies. As powerful demonstrators for chosen high-impact applications, we will deliver software for the next generation of pathogen genomic surveillance and for cancer genomics.
The de Bruijn graph is an important data structure for processing data produced by second-generation sequencing machines, which produce short but accurate sequencing reads. Many flavours of de Bruijn graph, such as the coloured de Bruijn graph and the variable-order de Bruijn graph, have been developed, but most applications do not take advantage of these novel structures. We will develop genome assembly algorithms based on the variable-order de Bruijn graph, and we will use the coloured de Bruijn graph to create novel solutions for the assembly reconciliation problem and for haplotype-aware sequencing error correction.
Current de Bruijn graphs struggle to handle noisy data, which limits their use for processing data from third-generation sequencing technologies, which produce long but error-prone reads. We will propose variants of the de Bruijn graph for handling noisy data, making the computationally efficient methods based on de Bruijn graphs available for long-read data.
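As a baseline for these variants, here is a minimal Python sketch of node-centric de Bruijn graph construction with the standard abundance filter that discards rare k-mers as likely sequencing errors. The threshold and names are illustrative; the noise-tolerant variants proposed in the project go well beyond this simple cleaning step.

```python
from collections import Counter

def de_bruijn(reads, k, min_count=2):
    """Node-centric de Bruijn graph: nodes are k-mers, with an edge
    between k-mers overlapping in k-1 characters. k-mers seen fewer
    than min_count times are dropped as likely sequencing errors, a
    standard (if crude) way to cope with noisy reads."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    solid = {kmer for kmer, c in counts.items() if c >= min_count}
    graph = {kmer: [] for kmer in solid}
    for kmer in solid:
        for base in "ACGT":
            succ = kmer[1:] + base
            if succ in solid:
                graph[kmer].append(succ)
    return graph

reads = ["ACGTACG", "CGTACGT", "ACGTACG"]
print(de_bruijn(reads, 4))
```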
Given the tremendous advances in DNA sequencing technology, new and more efficient methods are continuously needed to process the vast amounts of data produced. Biological sequences can be billions of nucleotides long, yet only short pieces can be read at a time, so the original sequence must be inferred from the pieces. We will develop algorithms and tools for the accurate and complete reconstruction of biological sequences, which is essential for research in molecular biology.
For genome assembly, we will develop methods that can improve current fragmented draft assemblies into accurate chromosomal assemblies using genetic maps and optical maps. Additionally, we will develop algorithms for haplotype assembly that combine the benefits of high-coverage short-read data sets with a lower coverage of the more expensive long-read data, thus providing a cost-efficient way of producing accurate haplotypes.
The major new scientific direction of this project is to study what happens to selected classical sequence analysis tasks when labeled directed acyclic graphs (labeled DAGs) are used as inputs.
The project studies the implications of techniques based on the bidirectional Burrows-Wheeler transform for sequence analysis, and widens the scope of such analyses to labeled DAGs. The motivation for these new directions comes from several timely applications in genome research, e.g., new ways to represent diploid genomes and pan-genomes in place of the commonly adopted single reference sequence as the basis of analyses.
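To illustrate the kind of task involved, the following Python sketch performs exact pattern matching on a character-labeled DAG by dynamic programming over a topological order. The formulation (one character per node) and all names are ours; the BWT-based indexes studied in the project are far more space-efficient than this O((|V| + |E|) * m) scan.

```python
def dag_match(labels, graph, order, pattern):
    """Exact matching of `pattern` on a labeled DAG. labels[v] is the
    single character of node v, graph[v] its successor list, and order
    a topological order. reach[v][j] is True iff pattern[:j+1] can be
    spelled along some path ending at v; returns all end nodes."""
    m = len(pattern)
    reach = {v: [False] * m for v in order}
    for v in order:
        if labels[v] == pattern[0]:
            reach[v][0] = True
        for j in range(m - 1):
            if reach[v][j]:
                for u in graph[v]:
                    if labels[u] == pattern[j + 1]:
                        reach[u][j + 1] = True
    return [v for v in order if reach[v][m - 1]]

# A small DAG spelling "ACG" along two alternative middle nodes.
labels = {1: "A", 2: "C", 3: "G", 4: "C"}
graph = {1: [2, 4], 2: [3], 4: [3], 3: []}
print(dag_match(labels, graph, [1, 2, 4, 3], "ACG"))  # [3]
```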
For the results obtained in this project, see the research themes space-efficient sequence analysis, transcript assembly, and recombination-aware sequence alignment of the genome-scale algorithmics team.