Seq2DAGChainer

Download

The latest version is freely available at Github.

Citation

If you use Seq2DAGChainer, please cite it as:

Kuosmanen A., Paavilainen T., Gagie T., Chikhi R., Tomescu A., Mäkinen V. (2018) Using Minimum Path Cover to Boost Dynamic Programming on DAGs: Co-linear Chaining Extended. In: Raphael B. (eds) Research in Computational Molecular Biology. RECOMB 2018. Lecture Notes in Computer Science, vol 10812. Springer, Cham

Overview

Aligning sequencing reads on graph representations of genomes is an important ingredient of pan-genomics. Such approaches typically find a set of local anchors that indicate plausible matches between substrings of a read to subpaths of the graph. These anchor matches are then combined to form a (semi-local) alignment of the complete read on a subpath. Co-linear chaining is an algorithmically rigorous approach to combine the anchors. It is a well-known approach for the case of two sequences as inputs.

Seq2DAGChainer is a prototype implementation of an algorithm to extend co-linear chaining from between two sequences to between a sequence and a directed acyclic graph (a DAG), e.g. a splicing graph in transcriptomics or a variant graph in pan-genomics. The approach is based on covering the graph with a minimum number of paths, and treating each path as a sequence.

In co-linear chaining, the goal is to find such set of anchors that the area covered by the anchors in the sequence is maximized. When we consider each path of the graph as a sequence, there remains the problem that the solution with best coverage might jump between the paths. We solve this problem by introducing additional bookkeeping structure called forward links. Basically, forward links show to which positions in other paths you can jump from a given vertex.

The algorithm uses dynamic programming to find for each anchor the best coverage using any subset of anchors that preceded it in both the DAG and the sequence. For fast retrieval of the earlier best coverage values we use range maximum queries, with one data structure per path of the path cover. The forward links propagate the best coverage values to other paths where applicable. Compared to exhaustively checking all the options, our approach was shown to be from twice as fast (small graphs) to two orders of magnitude faster (very large graphs).

The implementation contains two methods for finding anchors (maximal exact matches) between a sequence and a DAG: a slow naïve approach based on graph traversal and a fast approach based on GCSA2.

We combined the co-linear chaining algorithm with our transcript prediction tool Traphlor to create a pipeline that takes as input short read alignments and long read sequences, and predicts transcripts from them in the following steps:

a splicing graph is created from the short reads
long reads are aligned to the splicing graph using colinear chaining
colinear chains are converted into subpath constraints (chains of exons)
Traphlor is used to predict transcripts based on both short and long reads