Sequencing technologies have become cheap and accurate, leading to major breakthroughs such as the complete sequence of a human genome, the creation of nationwide population gene banks, and the discovery of novel viruses. As the amount of data produced grows exponentially and its applications become broader and more complex, the community needs accurate computational methods that scale.
At the core of many algorithmic methods for processing sequencing data lies the basic primitive of finding a set of paths or walks in graphs of various kinds. Under different formulations and objective functions, the resulting problems can be NP-hard (e.g. flow decompositions) or polynomial-time solvable (e.g. path covers), yet even the polynomial algorithms are impractical on large graphs. Thus, many practical tools prefer fast heuristics to exact algorithms. While these may be optimized for specific inputs, they may not be reliable or accurate in general, a pressing issue in, e.g., medical and life-science research.
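For concreteness, the sketch below (our own illustration, not code from any existing tool) shows the textbook polynomial-time route to a minimum path cover of a DAG, via maximum bipartite matching. Even this "easy" case makes the scalability issue visible: the matching step is super-linear, which is prohibitive on sequencing-scale graphs.

```python
# Minimal sketch: minimum (vertex-disjoint) path cover of a DAG via
# maximum bipartite matching. Assumes vertices are 0..n-1 and adj maps
# each vertex to its list of successors.

def min_path_cover(n, adj):
    """Size of a minimum vertex-disjoint path cover of a DAG.

    Reduction: split each vertex v into an 'out' copy and an 'in' copy;
    every edge (u, v) becomes a bipartite edge out(u) -- in(v).
    A maximum matching M yields a cover of size n - |M|.
    """
    match_to = [-1] * n  # match_to[v] = u if bipartite edge (u, v) is matched

    def try_augment(u, seen):
        # Kuhn's algorithm: search for an augmenting path from out(u).
        for v in adj[u]:
            if not seen[v]:
                seen[v] = True
                if match_to[v] == -1 or try_augment(match_to[v], seen):
                    match_to[v] = u
                    return True
        return False

    matching = 0
    for u in range(n):
        if try_augment(u, [False] * n):
            matching += 1
    return n - matching  # each matched edge merges two paths into one

# Example: the diamond DAG 0 -> {1, 2} -> 3 needs 2 paths to cover all vertices.
print(min_path_cover(4, {0: [1, 2], 1: [3], 2: [3], 3: []}))  # -> 2
```

Kuhn's matching runs in O(|V|·|E|) time, so even this polynomial primitive does not scale to graphs with billions of edges without further ideas.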
This project will develop general methods to massively scale such exact graph algorithms. First, via novel graph structures usable in a preprocessing step: safety structures, e.g. sets of paths that can be quickly shown to appear in all optimal solutions and thus simplify the problem; and variation structures, which confine the hardness of a problem to the graph regions rich in genetic variation. Second, via modern algorithmic techniques: parameterized algorithms that run in time linear in the graph size and super-linear only in a small parameter; and dynamic algorithms that, as the input grows, update solutions based only on the new data.
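As a toy illustration of the dynamic flavor (a simplified sketch in the spirit of the Pearce-Kelly algorithm, under our own modeling assumptions, not one of the project's algorithms): a topological order of a growing DAG can be maintained under edge insertions by reordering only the affected region between the two endpoints, rather than recomputing from scratch.

```python
class DynamicTopoOrder:
    """Maintain a topological order of a growing DAG under edge insertions,
    updating only the affected region (simplified Pearce-Kelly sketch)."""

    def __init__(self, n):
        self.adj = [[] for _ in range(n)]   # forward edges
        self.radj = [[] for _ in range(n)]  # reverse edges
        self.pos = list(range(n))           # pos[v] = position of v in the order

    def _reachable(self, start, edges, lo, hi):
        # DFS restricted to vertices whose position lies in [lo, hi].
        seen, stack = {start}, [start]
        while stack:
            x = stack.pop()
            for y in edges[x]:
                if y not in seen and lo <= self.pos[y] <= hi:
                    seen.add(y)
                    stack.append(y)
        return seen

    def add_edge(self, u, v):
        self.adj[u].append(v)
        self.radj[v].append(u)
        if self.pos[u] < self.pos[v]:
            return                           # order already consistent
        lo, hi = self.pos[v], self.pos[u]
        fwd = self._reachable(v, self.adj, lo, hi)    # v and its successors
        bwd = self._reachable(u, self.radj, lo, hi)   # u and its predecessors
        if fwd & bwd:
            self.adj[u].pop(), self.radj[v].pop()     # roll back
            raise ValueError("insertion would create a cycle")
        # Reuse the positions of fwd and bwd: u's part first, then v's part,
        # each keeping its old relative order.
        slots = sorted(self.pos[x] for x in fwd | bwd)
        reordered = sorted(bwd, key=lambda x: self.pos[x]) + \
                    sorted(fwd, key=lambda x: self.pos[x])
        for x, p in zip(reordered, slots):
            self.pos[x] = p
```

The point of the sketch is the update pattern: only vertices whose positions lie between the endpoints of the new edge are touched, so the cost is driven by the new data rather than by the total input size.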
We will apply these methods in two high-impact applications: long-read RNA transcript discovery, and indexing massive and rapidly growing genomic databases. This project paves the way for exact graph algorithms that remain usable regardless of problem complexity or input size, and that are applicable to real-world problems.
Polynomially-solvable problems on huge graphs are "practically tractable" only if their complexity is sufficiently low. Likewise, NP-hard problems admitting specialized solvers are "practically tractable" only on small-enough inputs. We plan to push these two limits of "practical tractability", motivated by two key applications in Bioinformatics. Pangenomic graphs represent all the genetic variation in a population. On these, we will develop fixed-parameter tractable linear-time algorithms, whose complexity is linear in the size of the input and super-linear only in a small parameter of the input. In transcriptomics, we need to solve NP-hard problems about network flows. Here we will use safe subpaths to simplify the inputs to these problems. Our expected results will significantly speed up existing Bioinformatics methods and scale them to inputs that were previously infeasible. We expect them to be adopted also for other graph problems in Computer Science at large.
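To make the parameterized regime concrete, here is an illustrative sketch (ours, not one of the project's algorithms) of a classic path-cover-based reachability index for DAGs: if the vertex set is covered by k paths, preprocessing takes O(k(|V|+|E|)) time, so the graph size enters only linearly and the small parameter k carries the entire super-linear cost.

```python
from collections import deque

def build_reach_index(n, adj, paths):
    """Reachability index for a DAG with vertices 0..n-1 whose vertex set
    is covered by the k given paths. Preprocessing is O(k * (n + m)).

    last2reach[v][i] = largest position p such that the p-th vertex of
    paths[i] reaches v (or -1 if no vertex of paths[i] reaches v).
    """
    k = len(paths)
    on_path = [(-1, -1)] * n                  # vertex -> (path id, position)
    for i, path in enumerate(paths):
        for p, v in enumerate(path):
            on_path[v] = (i, p)

    last2reach = [[-1] * k for _ in range(n)]
    indeg = [0] * n
    for u in range(n):
        for v in adj.get(u, []):
            indeg[v] += 1
    queue = deque(v for v in range(n) if indeg[v] == 0)
    while queue:                              # process in topological order
        u = queue.popleft()
        i, p = on_path[u]
        if i >= 0:                            # every vertex should be covered
            last2reach[u][i] = max(last2reach[u][i], p)
        for v in adj.get(u, []):
            for j in range(k):                # the only super-linear factor: k
                if last2reach[u][j] > last2reach[v][j]:
                    last2reach[v][j] = last2reach[u][j]
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return on_path, last2reach

def reaches(u, v, on_path, last2reach):
    """O(1) query: does u reach v?"""
    i, p = on_path[u]
    return last2reach[v][i] >= p

# Example: diamond 0 -> {1, 2} -> 3, covered by k = 2 paths.
on_path, l2r = build_reach_index(4, {0: [1, 2], 1: [3], 2: [3]}, [[0, 1, 3], [2]])
print(reaches(2, 3, on_path, l2r))  # -> True
print(reaches(1, 2, on_path, l2r))  # -> False
```

On pangenomic graphs, such parameters (e.g. the width of the graph) are typically small even when the graph itself is huge, which is what makes this regime attractive.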
The flow decomposition problem asks to decompose a flow in a graph into weighted paths. In Bioinformatics, this has applications in multiassembly problems (of RNA transcripts or viral quasispecies, e.g. HIV or SARS-CoV-2). The state of the art around this problem is lacking both in theory and in practice, with Bioinformatics multiassembly software achieving a precision of just 50% on some datasets. This project aims to develop the algorithmic theory around this problem, providing practitioners with solid algorithmic building blocks usable by future multiassembly methods. By also implementing the best of our algorithms, we aim to develop practical Bioinformatics software with significantly improved precision, arising from our new exact algorithms, from incorporating all the available information, and from properly handling the issue of multiple optimal multiassembly solutions.
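For intuition, the sketch below shows the standard greedy heuristic that practical tools often build on: repeatedly peel off an s-t path of maximum bottleneck weight. It is only illustrative; such greedy decompositions can use more paths than the minimum, which is precisely the gap that exact algorithms must close.

```python
from collections import deque

def greedy_flow_decomposition(adj, s, t):
    """Decompose a flow on a DAG into weighted s-t paths by repeatedly
    peeling off a path of maximum bottleneck weight ("greedy-width").

    adj: dict u -> dict v -> positive flow on edge (u, v), assumed to
    satisfy flow conservation at every vertex except s and t.
    """
    # Fix one topological order up front (peeling flow never invalidates it).
    verts = set(adj) | {v for u in adj for v in adj[u]}
    indeg = {v: 0 for v in verts}
    for u in adj:
        for v in adj[u]:
            indeg[v] += 1
    q = deque(v for v in verts if indeg[v] == 0)
    order = []
    while q:
        u = q.popleft()
        order.append(u)
        for v in adj.get(u, {}):
            indeg[v] -= 1
            if indeg[v] == 0:
                q.append(v)

    paths = []
    while True:
        best, parent = {s: float("inf")}, {}
        for u in order:                      # widest-path DP over the DAG
            if u not in best:
                continue
            for v, f in adj.get(u, {}).items():
                w = min(best[u], f)
                if f > 0 and w > best.get(v, 0):
                    best[v], parent[v] = w, u
        w = best.get(t, 0)
        if w <= 0:
            break                            # all flow decomposed
        path, v = [t], t
        while v != s:
            v = parent[v]
            path.append(v)
        path.reverse()
        for a, b in zip(path, path[1:]):     # subtract the path's weight
            adj[a][b] -= w
        paths.append((path, w))
    return paths

# Example: {'s': {'a': 2, 'b': 1}, 'a': {'t': 2}, 'b': {'t': 1}}
# decomposes into (['s', 'a', 't'], 2) and (['s', 'b', 't'], 1).
```

Finding a decomposition with the minimum number of paths is NP-hard, which is why heuristics like this one dominate in practice despite their precision problems.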
Many real-world problems are modeled as computational problems, but unfortunately with incomplete data or knowledge. As such, they may admit a large number of solutions, and we have no way of identifying the correct one. This issue is sometimes addressed by outputting all solutions, which is infeasible for many practical problems. Instead, we aim to construct a general methodology for finding the set of all sub-solutions common to all solutions. We call such sub-solutions "safe", since we can ultimately trust them to be part of the correct solution. Ultimately, we aim at creating automated and efficient ways of reporting all safe sub-solutions of a problem.
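A toy instance of this notion (our illustration, not one of the project's results): for the problem "output one s-t path in a graph", a vertex is safe exactly when it lies on every s-t path, i.e. when its removal disconnects s from t, so all safe vertices can be enumerated with a handful of graph searches.

```python
from collections import deque

def _connected(adj, s, t, banned):
    # BFS from s to t, skipping the banned vertex.
    if banned in (s, t):
        return False
    seen, q = {s}, deque([s])
    while q:
        u = q.popleft()
        if u == t:
            return True
        for v in adj.get(u, []):
            if v != banned and v not in seen:
                seen.add(v)
                q.append(v)
    return False

def safe_vertices(adj, s, t):
    """All vertices common to EVERY s-t path (assumes t is reachable from s).
    One BFS per candidate gives a simple O(|V| * (|V| + |E|)) method."""
    verts = set(adj) | {v for u in adj for v in adj[u]}
    return {x for x in verts
            if x in (s, t) or not _connected(adj, s, t, banned=x)}

# Example: in the diamond {0: [1, 2], 1: [3], 2: [3]}, only the endpoints
# are safe, since each of 1 and 2 can be avoided by the other branch.
print(safe_vertices({0: [1, 2], 1: [3], 2: [3]}, 0, 3))  # -> {0, 3}
```

The project's goal is the analogous question for far richer problems (and for outputs that are paths, not single vertices), where brute-force checks like this are no longer feasible.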
The main motivation of this project comes from Bioinformatics, in particular from the analysis of high-throughput sequencing (HTS) of DNA. One of the main applications of HTS data is to assemble it back into the original DNA sequence. This genome assembly problem admits many solutions, and current research has indeed considered outputting only partial solutions that are likely to be present in the correct original DNA sequence. However, this question has been approached only from an experimental point of view, with no definitive answer as to what all the safe sub-solutions to report are. In fact, the issue of safe sub-solutions has been mostly overlooked in Bioinformatics, and in Computer Science in general.
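The best-known such partial solutions are unitigs: maximal non-branching paths of the assembly graph, which assemblers routinely report. A minimal sketch of extracting them, under a simple acyclic adjacency-list graph model of our own choosing, is below.

```python
# Sketch: maximal unitigs of an assembly graph, i.e. maximal paths whose
# internal vertices have exactly one incoming and one outgoing edge.

def maximal_unitigs(adj):
    """adj: dict u -> list of successors. Returns the maximal non-branching
    paths; assumes an acyclic graph for brevity (real assemblers must also
    handle isolated circular unitigs)."""
    verts = set(adj) | {v for u in adj for v in adj[u]}
    indeg = {v: 0 for v in verts}
    for u in adj:
        for v in adj[u]:
            indeg[v] += 1
    outdeg = {u: len(adj.get(u, [])) for u in verts}

    def is_internal(v):
        return indeg[v] == 1 and outdeg[v] == 1

    unitigs = []
    for u in verts:
        if is_internal(u):
            continue                  # unitigs start at non-internal vertices
        for v in adj.get(u, []):
            path = [u, v]
            while is_internal(path[-1]):
                path.append(adj[path[-1]][0])
            unitigs.append(path)
    return unitigs

# Example: in {0: [1], 1: [2], 2: [3, 4]}, the maximal unitigs are
# [0, 1, 2], [2, 3] and [2, 4].
print(maximal_unitigs({0: [1], 1: [2], 2: [3, 4]}))
```

Unitigs are safe for natural formulations of genome assembly, but they are not the complete answer; characterizing everything that can be safely reported is exactly the open question this project targets.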
This project will derive the first safe algorithms for a number of fundamental problems on walks in graphs, network flows, and dynamic programming. We will apply these inside practical tools for genome assembly, RNA assembly, and pan-genome analysis. This is especially relevant now, as HTS moves from research labs to hospitals, where we need answers that are, first of all, accurate. Our approach changes the perspective from which we address all real-world problems, and could spur a new line of research in Computer Science and Bioinformatics. The grand aim is a mathematical leap towards understanding what can be safely reported from the data.