SAFEBIO - ERC Starting Grant

Safe and Complete Algorithms for Bioinformatics (SAFEBIO) was a project that took place during Mar 2020 - Feb 2025 at the University of Helsinki. It was funded by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 851093, SAFEBIO).

In brief

Bioinformatics is a new and rapidly evolving discipline stemming from the fields of molecular biology and biochemistry, and from the algorithmic disciplines of computer science and mathematics. Many real-world problems of biological relevance can be modelled as computational problems but due to incomplete data, a large number of solutions are proposed, making it difficult to find the right one. The EU-funded SAFEBIO project derived the first safe algorithms for a number of fundamental algorithmic problems, and created automated and efficient ways of reporting all safe sub-solutions for problems. The project also applied these inside practical tools for genome assembly, RNA assembly and pan-genome analysis.

Summary of the project

Many real-world problems are modeled as computational problems, but unfortunately with incomplete data or knowledge. As such, they may admit a large number of solutions, and we have no way of finding the correct one. This issue is sometimes addressed by outputting all solutions, which is infeasible for many practical problems. This project aimed at constructing a general methodology for finding the set of all sub-solutions common to all solutions. We can ultimately trust these to be part of the correct solution. We call this set safe. The ultimate goal is creating automated and efficient ways of reporting all safe sub-solutions of a problem.

The main motivation of this project comes from Bioinformatics, in particular from the analysis of high-throughput sequencing (HTS) of DNA. One of the main applications of HTS data is to assemble it back into the original DNA sequence. This genome assembly problem admits many solutions, and current research has indeed considered outputting only partial solutions that are likely to be present in the correct original DNA sequence. However, this problem has been approached only from an experimental point of view, with no definite answer on what are all the safe sub-solutions to report. In fact, the issue of safe sub-solutions has been mostly overlooked in Bioinformatics and Computer Science in general. This project derived the first safe and complete algorithms for a number of fundamental problems about walks in graphs, path covers, network flows, dynamic programming. These were applied inside practical tools for genome assembly, protein alignments and pangenome analysis.

All these are very relevant at the moment, since HTS goes from research labs to hospitals, and we need answers that are first of all accurate. Our approach changes the perspective from which we address all real-world problems, and could spur a new line of research in Computer Science/Bioinformatics. The grand aim is a mathematical leap into understanding what can be safely reported from the data.

Publications

The work performed during this project resulted in articles published in computer science and bioinformatics venues. Below is a list of selected publications, grouped per topic.

Reachability problems in graphs

Nidia Obscura Acosta, Alexandru I. Tomescu, Simplicity in Eulerian circuits: Uniqueness and safety
Information Processing Letters 183, 106421, 2024
Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian Schmidt, Alexandru I. Tomescu, Elia C. Zirondelli, Cut paths and their remainder structure, with applications
STACS 2023 - 40th International Symposium on Theoretical Aspects of Computer Science, LIPIcs 254, 17:1--17:17, 2023
Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian Schmidt, Alexandru I. Tomescu, Safety in s-t Paths, Trails and Walks
Algorithmica 84, 719--741, 2022
Massimo Cairo, Shahbaz Khan, Romeo Rizzi, Sebastian Schmidt, Alexandru I. Tomescu, Elia C. Zirondelli, A simplified algorithm computing all s-t bridges and articulation points
Discrete Applied Mathematics, 2021

Genome assembly problems

Francisco Sena, Eliel Ingervo, Shahbaz Khan, Andrey Prjibelski, Sebastian Schmidt, Alexandru I. Tomescu, Flowtigs: safety in flow decompositions for assembly graphs
iScience 27(12), 111208, 2024 (Presented at RECOMB-seq 2024)
Sebastian Schmidt, Santeri Toivonen, Paul Medvedev*, Alexandru I. Tomescu*, Applying the Safe-And-Complete Framework to Practical Genome Assembly
WABI 2024 - 24th International Workshop on Algorithms in Bioinformatics, Leibniz International Proceedings in Informatics (LIPIcs) 312, 8:1--8:16, 2024 (*Equal contribution)
Massimo Cairo, Romeo Rizzi, Alexandru I. Tomescu, Elia C. Zirondelli, Genome Assembly, from Practice to Theory: Safe, Complete and Linear-Time
ACM Transactions on Algorithms 20(1), 1--26, 2023 (Extended version of ICALP 2021 paper)
Andrea Cracco, Alexandru I. Tomescu, Extremely fast construction and querying of compacted and colored de Bruijn graphs with GGCAT
Genome Research 33, 1198--1207, 2023 (Short abstract at RECOMB 2023)
Sebastian Schmidt, Shahbaz Khan, Jarno Alanko, Giulio E. Pibiri, Alexandru I. Tomescu, Matchtigs: minimum plain text representation of kmer sets
Genome Biology 24, 136, 2023 (Selected for talk at ISMB 2022)
Sebastian Schmidt, Jarno N. Alanko, Eulertigs: Minimum Plain Text Representation of k-mer Sets Without Repetitions in Linear Time. WABI 2022: 2:1-2:21, 2022 (Best Paper Award at WABI 2022)

Path covers

Manuel Cáceres, Brendan Mumey, Santeri Toivonen, Alexandru I. Tomescu, Practical Minimum Path Cover
SEA 2024 - 22nd International Symposium on Experimental Algorithms, Leibniz International Proceedings in Informatics (LIPIcs) 301, 3:1--3:19, 2024
Manuel Cáceres, Massimo Cairo, Brendan Mumey, Romeo Rizzi, Alexandru I. Tomescu, Sparsifying, Shrinking and Splicing for Minimum Path Cover in Parameterized Linear Time
SODA 2022 - ACM-SIAM Symposium on Discrete Algorithms, 359-376, 2022
Manuel Cáceres, Brendan Mumey, Edin Husić, Romeo Rizzi, Massimo Cairo, Kristoffer Sahlin, Alexandru I. Tomescu, Safety in multi-assembly via paths appearing in all path covers of a DAG
IEEE/ACM Transactions on Computational Biology and Bioinformatics 19(6), 3673-3684, 2022
Manuel Cáceres, Massimo Cairo, Brendan Mumey, Romeo Rizzi, Alexandru I. Tomescu, A linear-time parameterized algorithm for computing the width of a DAG
WG 2021 - 47th International Workshop on Graph-Theoretic Concepts in Computer Science, Lecture Notes in Computer Science 12911, 257--269, 2021

Flow decompositions

Fernando H. C. Dias, Alexandru I. Tomescu, Accurate Flow Decomposition via Robust Integer Linear Programming
IEEE/ACM Transactions on Computational Biology and Bioinformatics 21(6), 1955-1964, 2024
Manuel Cáceres, Massimo Cairo, Andreas Grigorjew, Shahbaz Khan, Brendan Mumey, Romeo Rizzi, Alexandru I. Tomescu, Lucia Williams, Width Helps and Hinders Splitting Flows
ACM Transactions on Algorithms 20(2), Article No.: 13, 2024 (Extended version of ESA 2022 paper)
Andreas Grigorjew, Fernando H. C. Dias, Andrea Cracco, Romeo Rizzi, Alexandru I. Tomescu, Accelerating ILP solvers for Minimum Flow Decompositions through search space and dimensionality reductions
SEA 2024 - 22nd International Symposium on Experimental Algorithms, Leibniz International Proceedings in Informatics (LIPIcs) 301, 14:1--14:19, 2024 (Runner-up for the Best Paper Award at SEA 2024, i.e. 2nd place)
Fernando H. C. Dias, Manuel Cáceres, Lucia Williams, Brendan Mumey, Alexandru I. Tomescu, A Safety Framework for Flow Decomposition Problems via Integer Linear Programming
Bioinformatics 39(11), btad640, 2023
Lucia Williams, Alexandru I. Tomescu, Brendan Mumey, Flow Decomposition with Subpath Constraints
IEEE/ACM Transactions on Computational Biology and Bioinformatics 20(1), 360--370, 2023 (Extended version of WABI 2021 paper)
Shahbaz Khan, Alexandru I. Tomescu, Optimizing the Safe Flow Decompositions in DAGs
ESA 2022 - European Symposium on Algorithms (Track A), 72:1--72:17, 2022
Shahbaz Khan, Milla Kortelainen, Manuel Cáceres, Lucia Williams, Alexandru I. Tomescu, Safety and Completeness in Flow Decompositions for RNA Assembly
RECOMB 2022 - 26th Annual International Conference on Research in Computational Molecular Biology, Lecture Notes in Computer Science 13278, 177--192, 2022
Fernando H. C. Dias, Lucia Williams, Brendan Mumey, Alexandru I. Tomescu, Fast, Flexible, and Exact Minimum Flow Decompositions via ILP
RECOMB 2022 - 26th Annual International Conference on Research in Computational Molecular Biology, Lecture Notes in Computer Science 13278, 230--245, 2022

Pangenomes and alignments

Andreas Grigorjew, Artur Gynter, Fernando Dias, Benjamin Buchfink, Hajk-Georg Drost*, Alexandru I. Tomescu*, Sensitive inference of alignment-safe intervals from biodiverse protein sequence clusters using EMERALD
Genome Biology 24, 168, 2023 (*Equal contribution. Selected for talk at ISMB 2023)
Jun Ma, Manuel Cáceres, Leena Salmela, Veli Mäkinen, Alexandru I. Tomescu, Chaining for accurate alignment of erroneous long reads to acyclic variation graphs
Bioinformatics 39(8), btad460, 2023
Massimo Equi, Veli Mäkinen, Alexandru I. Tomescu, Roberto Grossi, On the Complexity of String Matching for Graphs
ACM Transactions on Algorithms 19(3), 1--25, 2023 (Extended version of ICALP 2019 paper)
Massimo Equi, Veli Mäkinen, Alexandru I. Tomescu, Graphs Cannot be Indexed in Polynomial Time for Sub-quadratic Time String Matching, unless SETH Fails
Theoretical Computer Science 975(9), 114128, 2023 (Extended version of SOFSEM 2021 paper)
Massimo Equi, Tuukka Norri, Jarno Alanko, Bastien Cazaux, Alexandru I. Tomescu, Veli Mäkinen, Algorithms and Complexity on Indexing Founder Graphs
Algorithmica 85, 1586--1623, 2023 (Extended version of ISAAC 2021 paper)
Nicola Rizzo, Alexandru I. Tomescu, Alberto Policriti, Solving String Problems on Graphs Using the Labeled Direct Product
Algorithmica 84, 3008--3033, 2022
Veli Mäkinen, Bastien Cazaux, Massimo Equi, Tuukka Norri, Alexandru I. Tomescu, Linear Time Construction of Indexable Founder Block Graphs
WABI 2020 - 20th International Workshop on Algorithms in Bioinformatics, Leibniz International Proceedings in Informatics (LIPIcs) 172, 7:1--7:18, 2020