Interested in pan-genomizing your variant calling pipeline, or in applying network flows to transcriptomics or metagenomics analysis? A selection of tools from the GSA group is listed here.

DAGChainer is a prototype implementation of the algorithms proposed in the article "Using Minimum Path Cover to Boost Dynamic Programming in DAGs: Co-linear Chaining Extended" by A. Kuosmanen, T. Paavilainen, T. Gagie, R. Chikhi, A. Tomescu and V. Mäkinen, in Proc. RECOMB 2018, Springer, LNCS 10812, pp. 105-121, 2018.

ColinearSolver is the main class of DAGChainer, implementing the algorithms. The class SequenceGraph models a graph in which each node represents a single base; currently SequenceGraph supports only creation from a splicing graph. Two approaches are available for finding maximal exact matches (used as anchors for the colinear chaining algorithm) between a sequence graph and a sequence: a slow, naive approach based on graph traversal and a fast approach based on GCSA2. Additionally, the project includes various utility classes, such as a static Range Maximum Query tree implementation, a SAM format reader, and a FASTA format reader.
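For intuition on the chaining step, below is a minimal sketch of classic sequence-to-sequence co-linear chaining over precomputed anchors, using a quadratic-time dynamic program. This is only the textbook baseline: the paper's contribution is extending chaining to DAGs via minimum path cover, which this sketch does not attempt. The anchor representation (triples of start-in-A, start-in-B, length) and the scoring rule (maximize total anchor length, overlaps between chained anchors disallowed) are simplifying assumptions, not DAGChainer's actual interface.

```python
def colinear_chain(anchors):
    """O(k^2) DP: pick a chain of anchors that advances in both sequences,
    maximizing total anchor length. Anchors are (start_a, start_b, length)."""
    if not anchors:
        return [], 0
    anchors = sorted(anchors)           # order by start position in sequence A
    k = len(anchors)
    score = [a[2] for a in anchors]     # best chain score ending at each anchor
    back = [-1] * k                     # backtrace pointers
    for j in range(k):
        xj, yj, lj = anchors[j]
        for i in range(j):
            xi, yi, li = anchors[i]
            # anchor i must end before anchor j starts, in both sequences
            if xi + li <= xj and yi + li <= yj and score[i] + lj > score[j]:
                score[j], back[j] = score[i] + lj, i
    best = max(range(k), key=score.__getitem__)
    chain, j = [], best
    while j != -1:                      # recover the optimal chain
        chain.append(anchors[j])
        j = back[j]
    return chain[::-1], score[best]
```

With real anchors one would sort once and speed the inner search up with a range-maximum structure (such as the RMQ tree mentioned above) to get subquadratic time.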

The folder MC-MPC contains a copy of Topi Paavilainen's Master's thesis project for finding a minimum path cover in a graph. The folder traphlor contains a slightly modified copy of the Traphlor software (see below).

The project includes a pipeline (transcriptPipeline) that takes as input short read alignments and long read sequences, and predicts transcripts from them in the following steps:

- a splicing graph is created from the short reads
- long reads are aligned to the splicing graph using colinear chaining
- colinear chains are converted into subpath constraints (chains of exons)
- Traphlor is used to predict transcripts based on both short and long reads
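The third step above, converting a colinear chain into a subpath constraint, can be sketched as follows. This is a hypothetical helper, not code from transcriptPipeline; it assumes each anchor in the chain has been labeled with the ID of the exon it falls in, so that collapsing consecutive duplicate labels yields the chain of exons passed on as a constraint.

```python
def chain_to_constraint(chain_exon_labels):
    """Collapse the per-anchor exon labels of a colinear chain into a
    chain of distinct consecutive exons (a subpath constraint)."""
    constraint = []
    for exon in chain_exon_labels:
        if not constraint or constraint[-1] != exon:
            constraint.append(exon)   # keep only transitions between exons
    return constraint
```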

A framework to Pan-Genomize your Variant Calling pipeline.

Source code:

Reproducibility:

- Download scripts to generate/fetch input data for the experiments
- Download scripts to run the experiments

Pan-genome indexing:

- Download pre-built index (9.6 GB) for 2001 genomes (including only the numbered chromosomes)

Software for predicting RNA transcript expression from long-read RNA-sequencing data.

Through transcription and alternative splicing, a gene can be transcribed into different RNA sequences (isoforms). The development of third-generation sequencers allowed for sequencing of reads up to several kilobases long. Compared to short next-generation sequencing reads, which generally span at most two exons, long reads can provide additional information about which non-neighboring exons belong to which transcript.

Traphlor is a novel transcript prediction tool that utilizes the connectivity information gained from long reads spanning more than two exons. It is based on the idea of modeling long reads as subpath constraints, presented in the article "On the Complexity of Minimum Path Cover with Subpath Constraints for Multi-assembly" by Rizzi et al.
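As background, here is a sketch of the classic *unconstrained* minimum path cover on a DAG, computed as n minus a maximum bipartite matching (Kuhn's augmenting-path algorithm). This is only the vertex-disjoint textbook variant; the problem Traphlor solves additionally honors subpath constraints (and allows paths to share vertices), which is what the Rizzi et al. article studies.

```python
def min_path_cover(n, edges):
    """Minimum number of vertex-disjoint paths covering a DAG with
    nodes 0..n-1: equals n minus a maximum bipartite matching between
    out-copies and in-copies of the nodes."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
    match = [-1] * n                      # match[v] = u means edge u->v is used

    def try_kuhn(u, seen):
        # try to give u an outgoing matched edge, reassigning others if needed
        for v in adj[u]:
            if v in seen:
                continue
            seen.add(v)
            if match[v] == -1 or try_kuhn(match[v], seen):
                match[v] = u
                return True
        return False

    matched = sum(try_kuhn(u, set()) for u in range(n))
    return n - matched
```

Each matched edge joins two paths of the cover, so maximizing the matching minimizes the number of paths.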

- Download the latest version here.
- The simulated data and scripts used in our experiments are available here.

For any questions, please contact us: aekuosma[at]cs.helsinki.fi

Maximal Unique Matches (MUMs) computed with a bidirectional BWT index. This is an implementation of one of the results in our ESA 2013 paper, plugged into Simon Gog's SDSL library. We conducted a small benchmark to show it is competitive. Two versions of human chromosome 20 (2 × 60 MB) were used as input, and running time and peak space usage were compared against vmatch and mummer. We used suffix array sampling frequency 256 in our implementation; the mechanism described in the ESA paper to avoid suffix array samples completely was not implemented.

| | time (s) | space (MB) |
|---|---|---|
| ours | 751 | 207 |
| vmatch | 437 | 938 |
| mummer | 97 | 930 |
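To make the definition concrete, here is a brute-force, definition-level sketch of MUMs: a MUM is a match that cannot be extended in either direction and whose string occurs exactly once in each input. This bears no resemblance to the BWT-index implementation benchmarked above; it is only usable on toy inputs.

```python
def occurrences(text, sub):
    """Count occurrences of sub in text, overlapping ones included."""
    return sum(text.startswith(sub, k) for k in range(len(text) - len(sub) + 1))

def mums(s, t, min_len=1):
    """Brute-force Maximal Unique Matches between strings s and t."""
    cands = set()
    for i in range(len(s)):
        for j in range(len(t)):
            if s[i] != t[j]:
                continue
            if i > 0 and j > 0 and s[i - 1] == t[j - 1]:
                continue              # extendable to the left: not maximal
            l = 0
            while i + l < len(s) and j + l < len(t) and s[i + l] == t[j + l]:
                l += 1                # extend to the right as far as possible
            if l >= min_len:
                cands.add(s[i:i + l])
    # keep only the strings that are unique in both sequences
    return sorted(m for m in cands
                  if occurrences(s, m) == 1 and occurrences(t, m) == 1)
```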


Software implementing an external-memory algorithm that constructs the so-called LZ-End parsing (a variant of LZ77) of a given string of length n in O(n log L) expected time using O(z + L) space, where z is the number of phrases in the parsing and L is a limit on the length of a phrase. This parsing serves as the basis for a compressed index of Kreft and Navarro that allows fast access to the compressed string without decompression. The algorithm constructs the parsing in streaming fashion, w.h.p. in a single left-to-right pass over the input string, and performs one right-to-left pass to verify the correctness of the result.
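For readers unfamiliar with the format, here is a naive in-memory sketch of the parsing definition itself: each phrase is a copy of a string that ends exactly at a previous phrase boundary, followed by one explicit character. This quadratic-or-worse toy is nothing like the external-memory streaming algorithm above, and the phrase-length limit L is omitted for brevity.

```python
def lz_end_parse(text):
    """Naive LZ-End parser. Each phrase is a triple
    (source phrase index or None, copied length, explicit char)."""
    phrases = []
    ends = []                          # exclusive end position of each phrase
    i, n = 0, len(text)
    while i < n:
        best_l, best_src = 0, None
        for src, e in enumerate(ends):
            # longest l with text[i:i+l] equal to a suffix of text[:e]
            for l in range(min(e, n - 1 - i), best_l, -1):
                if text[i:i + l] == text[e - l:e]:
                    best_l, best_src = l, src
                    break
        ch = text[i + best_l]          # every phrase ends in an explicit char
        phrases.append((best_src, best_l, ch))
        i += best_l + 1
        ends.append(i)
    return phrases

def lz_end_decode(phrases):
    """Invert the parsing: copies are resolved against phrase end positions."""
    out, ends = "", []
    for src, l, ch in phrases:
        if l:
            e = ends[src]
            out += out[e - l:e]
        out += ch
        ends.append(len(out))
    return out
```

Because every copy ends at a phrase boundary, decoding a phrase needs only the stored end positions, which is what makes the format attractive for fast random access.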

Details of the algorithm and its experimental evaluation can be found in the paper: Dominik Kempa and Dmitry Kosolobov, "LZ-End Parsing in Compressed Space", in 2017 Data Compression Conference (DCC 2017), pages 350-359, 2017 (http://ieeexplore.ieee.org/document/7923708/, http://arxiv.org/abs/1611.01769). The up-to-date version of the software is available at https://www.cs.helsinki.fi/u/dkempa/lz_end_toolkit.html