Bioinformatics is a new and rapidly evolving discipline stemming from the fields of molecular biology and biochemistry, and from the algorithmic disciplines of computer science and mathematics. Many real-world problems of biological relevance can be modelled as computational problems but due to incomplete data, a large number of solutions are proposed, making it difficult to find the right one. The EU-funded SAFEBIO project derived the first safe algorithms for a number of fundamental algorithmic problems, and created automated and efficient ways of reporting all safe sub-solutions for problems. The project also applied these inside practical tools for genome assembly, RNA assembly and pan-genome analysis.
Many real-world problems are modeled as computational problems, but unfortunately with incomplete data or knowledge. As such, they may admit a large number of solutions, and we have no way of finding the correct one. This issue is sometimes addressed by outputting all solutions, which is infeasible for many practical problems. This project aimed at constructing a general methodology for finding the set of all sub-solutions common to all solutions. We can ultimately trust these to be part of the correct solution. We call this set safe. The ultimate goal is creating automated and efficient ways of reporting all safe sub-solutions of a problem.
The main motivation of this project comes from Bioinformatics, in particular from the analysis of high-throughput sequencing (HTS) of DNA. One of the main applications of HTS data is to assemble it back into the original DNA sequence. This genome assembly problem admits many solutions, and current research has indeed considered outputting only partial solutions that are likely to be present in the correct original DNA sequence. However, this problem has been approached only from an experimental point of view, with no definite answer on what are all the safe sub-solutions to report. In fact, the issue of safe sub-solutions has been mostly overlooked in Bioinformatics and Computer Science in general. This project derived the first safe and complete algorithms for a number of fundamental problems about walks in graphs, path covers, network flows, dynamic programming. These were applied inside practical tools for genome assembly, protein alignments and pangenome analysis.
All these are very relevant at the moment, since HTS goes from research labs to hospitals, and we need answers that are first of all accurate. Our approach changes the perspective from which we address all real-world problems, and could spur a new line of research in Computer Science/Bioinformatics. The grand aim is a mathematical leap into understanding what can be safely reported from the data.
The work performed during this project resulted in articles published in computer science and bioinformatics venues. Below is a list of selected publications, grouped per topic.