The tool is freely available at https://github.com/tsnorri/founder-sequences
We developed a linear time algorithm to segment a multiple alignment into founder blocks. A founder block is a shared genome segment. The algorithm is given the minimum length of a block, and it finds the optimal segmentation such that the maximum of the number of distinct blocks required is minimized over all segments. If this maximum is k, then one can combine the blocks in arbitrary way to form k founder sequences, such that all rows of the input multiple alignment can be read through the founders block by block, possibly jumping at block boundaries. We also provide a bipartite matching based heuristic to minimize the number of jumps (discontinuities).
Founders can be used for compactly representing the pangenome of virus strains. To illustrate this, we constucted the MSA of 1484 strains of SARS-CoV-2 stored in NCBI database on 24.04.2020 using ViralMSA. Running founder_sequences on this MSA with block length threshold 105 (--segment-length-bound=105) resulted into 186 founders, forming a founder MSA. To see how many discontinuities this compaction resulted into, we ran match_founder_sequences (after using this script) to map the rows of the original MSA into the founder MSA. In total, the original rows needed to be split into 8508 pieces to be read through the founder MSA. That is, on average one row was split into 8505/1434=5.93.. pieces. As the length of SARS-CoV-2 genome is about 30 000 bases, this means there is discontinuity at every 5058 bases, on average. Another way to think of this, is on a set of 100 bp reads from a strain included in the original MSA, 98% (1-100/5058) should find a match in the founder MSA.
A simple derivative of founder block computation is the creation of founder block graphs: nodes are founder blocks and edges are between consecutive blocks that are supported by at least one row of the original MSA. In our example illustration above, we would have 9 nodes and 11 edges. We provide a simple script to construct such a founder block graph from the output of founder sequence tool. In our COVID-19 example use case data, the script produced a founder block graph of 2418 nodes and 3896 edges.
If you use the tool, kindly cite us as follows:
Tuukka Norri, Bastien Cazaux, Dmitry Kosolobov, Veli Mäkinen:
Linear time minimum segmentation enables scalable founder reconstruction. Algorithms Mol. Biol. 14(1): 12:1-12:15 (2019). Earlier in WABI 2018.