Interested in pan-genomizing your variant calling pipeline, or network flows applied to transcriptomics or metagenomics analysis? Some selected tools from the GSA group can be found here.

# Software

A metanenomic sample is a set of sequences of reads from microbial life living in a particular environment. Standard analysis involves estimating the species composition of the environment by aligning the reads against a reference database. Since the age of pangenomics, alignment is preferentially done against a variation graph encompassing all variation within a species.

Themisto is a space-efficient tool for indexing such variation graphs.

Seq2DAGChainer is a prototype implementation of the algorithms proposed in the article ”Using Minimum Path Cover to Boost Dynamic Programming in DAGs: Co-linear Chaining Extended” by A. Kuosmanen, T. Paavilainen, T. Gagie, R. Chikhi, A. Tomescu and V. Mäkinen. In Proc. RECOMB 2018, Springer, LNCS 10812, pp. 105-121, 2018. Read more...

We published in WABI 2018 a linear time algorithm to segment a multiple alignment into *founder blocks*. A founder block is a shared genome segment. The algorithm is given the minimum length of a block and it finds the optimal segmentation such that the maximum of the number of distinct blocks required is minimized over all segments. If this maximum is *k*, then one can combine the blocks in arbitrary way to form *k* *founder sequences*, such that all rows of the input multiple alignment can be read through the founders block by block, possibly jumping at block boundaries. We also provide a bipartite matching based heuristic to minimize the number of jumps (discontinuities).

The tool is available at https://github.com/tsnorri/founder-sequences.

A framework to Pan-Genomize your Variant Calling pipeline.

Source code:

Reproducibility:

- Download scripts to generate/fetch input data for the experiments
- Download scripts to run the experiments

Pan-genome indexing:

- Download pre-built index (9.6GB) for 2001 genomes (including only numbered-chromosomes)

A program for community profiling of a metagenomic sample, described in our RECOMB 2016 paper. Read more…

A software for RNA transcript expression prediction from long read RNA-sequencing data.

Through transcription and alternative splicing, a gene can be transcribed into different RNA sequences (isoforms). The development of third-generation sequencers allowed for sequencing of reads up to several kilobases long. Compared to the short next-generation sequencing reads, which generally only span two exons at most, long reads can give additional information about which non-neighboring exons are part of which transcript.

Traphlor is a novel transcript prediction tool that utilizes the connectivity information gained from long reads spanning more than two exons. It is based on the idea of modeling long reads as subpath constraints, presented in the article On the complexity of Minimum Path Cover with Subpath Constraints for multi-assembly Rizzi et al.

- Download the latest version here. (Update 2018-5-25: Please note that the original Traphlor cannot be compiled with newer compilers due to bamtools dependency. A working version of Traphlor can be found in the Seq2DAGChainer project.)
- The simulated data and scripts used in our experiments are available here.

For any questions, please contact us: aekuosma[at]cs.helsinki.fi

A software that implements an external-memory algorithm constructing the so-called LZ-End parsing (a variation of LZ77) of a given string of length n in O(n log L) expected time using O(z + L) space, where z is the number of phrases in the parsing and L is the limit on the length of the phrase. This parsing serves as a basis for a compressed index of Kreft and Navarro that allows fast access to the compressed string without uncompression. The algorithm constructs the parsing in streaming fashion in one left to right pass on the input string w.h.p. and performs one right to left pass to verify the correctness of the result.

Details of the algorithms and experimental evaluation can be found in the paper: Dominik Kempa, Dmitry Kosolobov "LZ-End Parsing in Compressed Space". In 2017 Data Compression Conference (DCC 2017), pages 350-359, 2017 (http://ieeexplore.ieee.org/document/7923708/) (http://arxiv.org/abs/1611.01769) The up-to-date version of the software is availiable at https://www.cs.helsinki.fi/u/dkempa/lz_end_toolkit.html

A library that contains some proof-of-concept implementations of the various sequence analysis tasks considered in our ESA 2013 paper. Available on Github.

Maximal Unique Matches (MUMs) computed with a bidirectional BWT-index. This is an implementation of one of the results in our ESA 2013 paper, plugged into to Simon Gog’s SDSL library. We conducted a small benchmark to show it is competive. Two versions of human chromosome 20 (2 × 60 MB) were used as input, and running time and peak space usage was compared againts vmatch and mummer. We used suffix array sampling frequency 256 in our implementation; the mechanism described in ESA paper to avoid suffix array samples completely was not implemented.

time (s) | space (MB) | |

ours | 751 | 207 |
---|---|---|

vmatch | 437 | 938 |

mummer | 97 | 930 |

Software for RNA transcript expression prediction from RNA-sequencing data. See our RECOMB-seq and WABI 2013 papers. Read more…

Computes the unit cost edit distance between a haploid and a reference guided recombination of two diploids. Read more…

A tool to extract correctly aligning parts of (scaffold) assemblies and compute the resulting normalized N50. After publishing we noticed that GAGE also has similar tool to compute *corrected* N50. The strategy to extract correct alignments is different in these two, but on real data they seem to obtain similar results.

Geneneralized Compressed Suffix Array for indexing multiple alignment of several reference genomes or reference genome plus known variants.

A tool for creating overlap graphs for de novo fragment assembly from short reads. Allows approximate overlaps and works in small space.

A tool for mapping (short) DNA reads into reference sequences. This is not as fast as some other Burrows-Wheeler-based aligners, but implements faithfully k-mismatches and k-errors search where some other tools may solve a slighly different or implicitly defined problem. Read more…