SCALEBIO - ERC Consolidator Grant

Scalable Graph Algorithms for Bioinformatics (SCALEBIO) is a project taking place during Sep 2025 - Aug 2030 at the University of Helsinki. It is funded by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 101169716, SCALEBIO).
In brief

Advancements in sequencing technologies, such as human genome mapping, have paved the way for groundbreaking discoveries. However, as data volumes grow, there is a need for reliable computational methods. The ERC-funded SCALEBIO project aims to enhance the scalability of exact graph algorithms through innovative preprocessing graph structures and modern algorithmic techniques. Specifically, it will introduce safety structures to simplify problem-solving by identifying common paths in optimal solutions, along with variation structures that focus on areas with significant genetic variation. Key methodologies will include parameterised polynomial algorithms and dynamic algorithms that can adapt to new data. These techniques will be applied to areas such as discovering long-read RNA transcripts and indexing large genetic databases.

Project objectives

Sequencing technologies have developed to be cheap and accurate, leading to major breakthroughs, such as the complete sequence of a human genome, the creation of nationwide population gene banks, or the discovery of novel viruses. As the amount of data produced grows exponentially and their applications become more broad and complex, the community needs accurate computational methods that scale.

At the core of many algorithmic methods for processing sequencing data is the basic primitive of finding a set of paths or walks in graphs of various nature. Under different formulations and objective functions, the resulting problems can be NP-hard (e.g. flow decompositions) or polynomial-time (e.g. path covers), which are impractical on large graphs. Thus, many practical tools prefer fast heuristics to exact algorithms. While these may be optimized for specific inputs, they may not be reliable or accurate in general, which is a highly relevant issue in e.g. medical and life-science research.

This project will develop general methods to massively scale such exact graph algorithms. First, via novel graph structures usable in a preprocessing step: safety structures, e.g. sets of paths that can be quickly found to appear in all optimal solutions and thus simplify the problem; variation structures that limit the hardness of a problem only to graph areas rich in genetic variation. Second, via modern algorithmic techniques: parameterizing polynomial algorithms to run in time linear in the graph size and superlinear only in a small parameter; dynamic algorithms that, as the input grows, update solutions based only on the new data.

We will apply these methods in two high-impact applications: long-read RNA transcript discovery, and indexing massive and rapidly growing genomic databases.

This project paves the way for exact graph algorithms usable independently of the problem complexity or of the input size, applicable to real-world problems.