Software

Interested in pan-genomizing your variant calling pipeline, or network flows applied to transcriptomics or metagenomics analysis?

Some selected tools from the GSA group can be found here.

Themisto

A metanenomic sample is a set of sequences of reads from microbial life living in a particular environment. Standard analysis involves estimating the species composition of the environment by aligning the reads against a reference database. Since the age of pangenomics, alignment is preferentially done against a variation graph encompassing all variation within a species.

is a space-efficient tool for indexing such variation graphs.

Seq2DAGChainer

Seq2DAGChainer is a prototype implementation of the algorithms proposed in the article by A. Kuosmanen, T. Paavilainen, T. Gagie, R. Chikhi, A. Tomescu and V. Mäkinen. In Proc. RECOMB 2018, Springer, LNCS 10812, pp. 105-121, 2018.

Founders

We published in  a linear time algorithm to segment a multiple sequence alignment (MSA) into founder blocks. Such blocks can be combined into founder sequences that encode the MSA rows. 

 optimizes the number of discontinuities (jumps between founders) to read the MSA rows through the founders. 

 

PanVC

A framework to Pan-Genomize your Variant Calling pipeline.

Source code:

Reproducibility:

  • to generate/fetch input data for the experiments
  • to run the experiments

Pan-genome indexing:

  • (9.6GB) for 2001 genomes (including only numbered-chromosomes)
MetaFlow

A program for community profiling of a metagenomic sample, described in our RECOMB 2016 paper.

Traphlor

A software for RNA transcript expression prediction from long read RNA-sequencing data.

Through transcription and alternative splicing, a gene can be transcribed into different RNA sequences (isoforms). The development of third-generation sequencers allowed for sequencing of reads up to several kilobases long. Compared to the short next-generation sequencing reads, which generally only span two exons at most, long reads can give additional information about which non-neighboring exons are part of which transcript.

Traphlor is a novel transcript prediction tool that utilizes the connectivity information gained from long reads spanning more than two exons. It is based on the idea of modeling long reads as subpath constraints, presented in the article Rizzi et al.

  • Download the latest version . (Update 2018-5-25: Please note that the original Traphlor cannot be compiled with newer compilers due to bamtools dependency. A working version of Traphlor can be found in the .)
  • The simulated data and scripts used in our experiments are available .

For any questions, please contact us: aekuosma[at]cs.helsinki.fi

LZ-End parser

A software that implements an external-memory algorithm constructing the so-called LZ-End parsing (a variation of LZ77) of a given string of length n in O(n log L) expected time using O(z + L) space, where z is the number of phrases in the parsing and L is the limit on the length of the phrase. This parsing serves as a basis for a compressed index of Kreft and Navarro that allows fast access to the compressed string without uncompression. The algorithm constructs the parsing in streaming fashion in one left to right pass on the input string w.h.p. and performs one right to left pass to verify the correctness of the result.

Details of the algorithms and experimental evaluation can be found in the paper: Dominik Kempa, Dmitry Kosolobov "LZ-End Parsing in Compressed Space". In 2017 Data Compression Conference (DCC 2017), pages 350-359, 2017 () () The up-to-date version of the software is availiable at

BW4SA

A library that contains some proof-of-concept implementations of the various sequence analysis tasks considered in our ESA 2013 paper. Available on .

MUMs

(MUMs) computed with a bidirectional BWT-index. This is an implementation of one of the results in our paper, plugged into to Simon Gog’s . We conducted a small benchmark to show it is competive. Two versions of human chromosome 20 (2 × 60 MB) were used as input, and running time and peak space usage was compared againts and . We used suffix array sampling frequency 256 in our implementation; the mechanism described in ESA paper to avoid suffix array samples completely was not implemented.

time (s) space (MB)
ours 751 207
vmatch 437 938
mummer 97 930
Traph

Software for RNA transcript expression prediction from RNA-sequencing data. See our RECOMB-seq and WABI 2013 papers.

DAlign

Computes the unit cost edit distance between a haploid and a reference guided recombination of two diploids.

Normalized N50 Calculator

to extract correctly aligning parts of (scaffold) assemblies and compute the resulting normalized N50. After we noticed that also has similar tool to compute corrected N50. The strategy to extract correct alignments is different in these two, but on real data they seem to obtain similar results.

GCSA

for indexing multiple alignment of several reference genomes or reference genome plus known variants.

all-against-all suffix/prefix alignment

for creating overlap graphs for de novo fragment assembly from short reads. Allows approximate overlaps and works in small space.

readaligner

A tool for mapping (short) DNA reads into reference sequences. This is not as fast as some other Burrows-Wheeler-based aligners, but implements faithfully k-mismatches and k-errors search where some other tools may solve a slighly different or implicitly defined problem.