Research

Our latest developments focus on pan-genome indexing, space-efficient sequence analysis, alignments on sequence-like structures, and transcript and genome assembly.

Research is mostly carried out with our international collaborators and local colleagues.

Pan-genome indexing

An example of our recent developments is an extension of Burrows-Wheeler transform to finite automaton representing reference genome together with its common variations among the population (pan-genome). This enables a space-efficient index structure to be constructed to support efficient read alignment to a rich model of the population.

Another way to exploit pan-genomic information is to build an index directly on the multiple alignment representing individual genomes. Recently, we have developed Lempel-Ziv indexing approaches suitable for this scenario. We are working on seamless ways to improve variant calling exploiting these indexes.

  • Sirén,  Välimäki, Mäkinen. . IEEE/ACM Transactions on Computational Biology and Bioinformatics, 11(2): 375-388, 2014.
  • Ferrada, Gagie, Hirvola, Puglisi, , Philosophical Transactions of the Royal Society A, Volume 372, (2014).
  • Travis Gagie and Simon J. Puglisi, , Frontiers in Bioengineering and Biotechnology, 3(12) (2015).
  • Daniel Valenzuela, Tuukka Norri, Niko Välimäki, Esa Pitkänen, Veli Mäkinen: . BMC Genomics, 19, Article 87, 2018.
  • Jarno N. Alanko, Travis Gagie, Gonzalo Navarro, Louisa Seelbach Benkner: . In Proc. DCC 2019.
  • Veli Mäkinen, Tuukka Norri: . Inf. Process. Lett. 146: 17-19 (2019).
  • Tuukka Norri, Bastien Cazaux, Dmitry Kosolobov, Veli Mäkinen: . Algorithms for Molecular Biology 14(1): 12:1-12:15 (2019).
  • Massimo Equi, Roberto Grossi, Veli Mäkinen, Alexandru I. Tomescu: . In Proc. ICALP 2019.
  • Bastien Cazaux, Dmitry Kosolobov, Veli Mäkinen, Tuukka Norri: Linear time maximum segmentation problems in column stream model. In Proc. SPIRE 2019. To appear.
  • Jarno Alanko, Giovanna D'Agostino, Alberto Policriti, Nicola Prezza: . Accepted to SODA 2020.

     

Space-efficient sequence analysis

We consider the classical sequence analysis tasks such as computing maximal repeats, maximal exact matches, maximal unique matches, string kernels, etc. using space within constant factor from the information theoretical minimum and time linear in the input length independent of the alphabet size. 

In ESA 2013 we showed that there exists a space-efficient variant of the bidirectional BWT index that can be used for linear time analysis, and in STOC 2014 Belazzougui showed how to build such an index in linear time and small space. That result required randomization, but in the combined extended version of these paper we show that deterministic construction and analysis is feasible within the same bounds, using a variant called unidirectional BWT index. A corollary is that suffix arrays and suffix trees can be built in linear time using sublinear extra space in the output size.

  • Belazzougui, Cunial, Kärkkäinen, Mäkinen. . In Proc. ESA 2013.
  • Belazzougui. . In Proc. STOC 2014.
  • Cunial, Belazzougui. . In Proc. SPIRE 2014.
  • Belazzougui, Cunial, Kärkkäinen, Mäkinen. . Manuscript in submission.
  • Alanko, Cunial, Belazzougui, Mäkinen: . BMC Bioinformatics 18(S-3): 49-60 (2017).
  • Jarno Alanko, Tuukka Norri: . In Proc. SPIRE 2017.
  • Fabio Cunial, Jarno Alanko, Djamal Belazzougui: . Bioinformatics. To appear.

     

Transcript assembly and quantification

The assembly of RNA-Seq reads can be naturally formulated as a graph theoretic problem of explaining a weighted DAG (directed acyclic graph) with weighted paths. In RECOMB-Seq 2013 we gave a polynomial-time solution for a variant of this problem, based on minimum-cost network flow. In WABI 2013 we showed that another variant is NP-hard, but that it admits an efficient solution based on dynamic programming.

In a later paper, we proposed a unifying problem statement for genome-guided multi-assembly problems. Moreover, we gave a more efficient algorithm for this general problem which exploits DAG decompositions based on a new graph parameter (arc-width). In RECOMB-Seq 2014 we investigated different problems which add long read, or paired-end read information to some multi-assembly formulations, which are solvable by minimum-cost network flow, or NP-hard, respectively.

  • Tomescu, Kuosmanen, Rizzi, Mäkinen. . BMC Bioinformatics, 14(Suppl 5):S15 (RECOMB-Seq 2013 Supplement).
  • Tomescu, Kuosmanen, Rizzi, Mäkinen. . In Proc.  WABI 2013. 
  • Tomescu, Gagie, Popa, Rizzi, Kuosmanen, Mäkinen. . IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2015.
  • Rizzi, Tomescu, Veli Mäkinen. . BMC Bioinformatics, 15(Suppl 9), S5, 2014 (RECOMB-seq 2014 supplement).
  • Kuosmanen, Norri, Mäkinen.. Briefings in Bioinformatics, 2017.
  • Kuosmanen, Paavilainen, Gagie, Chikhi, Tomescu, Mäkinen. . In Proc. RECOMB 2018.
  • Mäkinen, Tomescu, Kuosmanen, Paavilainen, Gagie, Chikhi. . ACM Transaction on Algorithms, 15(2): 29:1-29:21 (2019).
Recombination-aware sequence alignment

We have proposed generalizations of the traditional global alignment algorithms for the scenario in which one or two of the input genomes are represented with two strings; namely, mother-inherited and father-inherited chromosomes. These generalizations take  into account that individual diploid genomes evolve through a mutation and recombination process, and that predictions through variant calling and haplotype assembly may be erroneous in both dimensions. 

Our more general formulation of this alignment problem considers each genome as a DAG. A covering alignment consists of two paths in each DAG that cover all the nodes. The cost is given by the pair-wise edit distance of the strings spelled by these paths. The core problem of finding the optimal covering alignment is still open, but many variants of the setting can be solved efficiently. 

We are currently working in applications to evaluate and optimize variant calling and haplotyping algorithms.

  • Mäkinen, Rahkola. . BMC Bioinformatics 14(S-15): S13, 2013. 
  • Mäkinen, Valenzuela. . BMC Genomics 15(Suppl 6):S15, 2014. 
  • Mäkinen, Valenzuela. . In Proc. ISBRA 2015.
  • Rizzi, Cairo, Mäkinen, Valenzuela, Tomescu. . IEEE/ACM Transactions on Computational Biology and Bioinformatics, 16(1):23-30, 2018.
Genome assembly

Many bioinformatics research groups at University of Helsinki have developed genome assembly methods. We have collected these activities into a .