From these one can create an artificial donor diploid genome:
reference | ACCGTATCGA |
donor | CACG-ATTGA |
 | CCCT-ATTGA |
Variation calling can then be applied, and the result compared to the actual variations chosen when the diploid was generated. But this comparison has to deal with the problem of invariance:
reference | ACCGTATCGA |
prediction 1 | CACT-ATTGA |
prediction 2 | -CACTATTGA |
Here both predictions are actually the same genome, but prediction 1 appears to be closer to the truth if one looks only at the variations. One way to avoid this problem is to look at the actual genome strings. This reveals that the two predictions are the same, and that they align perfectly with the donor diploid if we allow recombinations (the characters used are indicated here by lowercase):
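The comparison of genome strings described above can be sketched in C as follows. This is only an illustration, not code from dalign: the function names strip_gaps and same_genome are hypothetical, and the sketch assumes that '-' is the only gap character in an aligned row.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copy an aligned row into out, skipping gap
 * characters ('-'). out must have room for the length of aligned plus
 * one byte. Returns the length of the resulting genome string. */
size_t strip_gaps(const char *aligned, char *out)
{
    size_t n = 0;
    assert(aligned != NULL && out != NULL);
    for (const char *p = aligned; *p != '\0'; p++) {
        if (*p != '-')
            out[n++] = *p;
    }
    out[n] = '\0';
    return n;
}

/* Hypothetical helper: return 1 if two aligned rows spell the same
 * genome string once gaps are ignored, 0 otherwise. */
int same_genome(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    for (;;) {
        while (*a == '-') a++;   /* skip gaps in the first row  */
        while (*b == '-') b++;   /* skip gaps in the second row */
        if (*a != *b) return 0;  /* mismatch or different length */
        if (*a == '\0') return 1;
        a++;
        b++;
    }
}
```

With the rows from the example, same_genome("CACT-ATTGA", "-CACTATTGA") returns 1, since both rows spell CACTATTGA once the gaps are removed.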
prediction | CACTATTGA |
donor | cacGattga |
 | CCCtATTGA |
This kind of alignment, against the best possible recombination of the diploid, is what DAlign computes. DAlign reports the unit cost edit distance between the haploid and the best recombination of the diploid, and this distance can be used as an evaluation metric.
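For readers unfamiliar with the metric, the following is a minimal sketch of the ordinary unit cost edit distance between two plain strings, using the textbook dynamic programming recurrence with two rows. It is not the recombination-aware alignment that DAlign performs and is not taken from the dalign sources. The name edit_distance is an assumption made for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static size_t min3(size_t a, size_t b, size_t c)
{
    size_t m = a < b ? a : b;
    return m < c ? m : c;
}

/* Unit cost edit distance between s and t: insertions, deletions and
 * substitutions each cost 1. Returns (size_t)-1 if allocation fails. */
size_t edit_distance(const char *s, const char *t)
{
    assert(s != NULL && t != NULL);
    size_t n = strlen(s), m = strlen(t);
    size_t *prev = malloc((m + 1) * sizeof *prev);
    size_t *curr = malloc((m + 1) * sizeof *curr);
    if (prev == NULL || curr == NULL) {
        free(prev);
        free(curr);
        return (size_t)-1;
    }
    /* Row 0: turning the empty prefix of s into t[0..j) takes j insertions. */
    for (size_t j = 0; j <= m; j++)
        prev[j] = j;
    for (size_t i = 1; i <= n; i++) {
        curr[0] = i;  /* deleting the first i characters of s */
        for (size_t j = 1; j <= m; j++) {
            size_t sub = prev[j - 1] + (s[i - 1] != t[j - 1]);
            curr[j] = min3(prev[j] + 1, curr[j - 1] + 1, sub);
        }
        size_t *tmp = prev;  /* the finished row becomes the previous row */
        prev = curr;
        curr = tmp;
    }
    size_t d = prev[m];
    free(prev);
    free(curr);
    return d;
}
```

Applied to the example, the prediction CACTATTGA is at distance 1 from each donor haplotype taken alone (CACGATTGA and CCCTATTGA), whereas it matches a recombination of the two exactly.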
To create a reference P guided recombination of A and B, one first needs mappings from A to P and from B to P.
P | ABCDEFG |
A | ABBCEFG |
B | ABCDEF |
The mapping for A would be 0112456, which says that A contains an extra character at position 1 and that the character at position 3 of P is missing from A. The mapping for B would be 012345. Knowing that the reference P is 7 characters long, we can tell that B is missing the last character of P. Note that a mapping does not tell whether the characters in A or B are actually the same as in P, only that they map to certain positions.
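The way such a mapping is read can be sketched in C as follows. The representation is an assumption made for illustration, not the format dalign uses internally: a mapping is taken to be a non-decreasing array of positions in P, a repeated value is read as an extra character, and a position of P that never appears is read as missing. The function names count_extra and find_missing are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count characters that map to the same position
 * of P as the character before them, i.e. the extra characters. */
size_t count_extra(const size_t *map, size_t len)
{
    size_t extra = 0;
    assert(map != NULL || len == 0);
    for (size_t i = 1; i < len; i++) {
        if (map[i] == map[i - 1])
            extra++;
    }
    return extra;
}

/* Hypothetical helper: write the positions of P that no character maps
 * to into missing (room for p_len entries) and return how many there
 * are. Assumes map is sorted in non-decreasing order. */
size_t find_missing(const size_t *map, size_t len, size_t p_len,
                    size_t *missing)
{
    size_t count = 0, i = 0;
    assert(missing != NULL || p_len == 0);
    for (size_t pos = 0; pos < p_len; pos++) {
        while (i < len && map[i] < pos)  /* skip entries before pos */
            i++;
        if (i == len || map[i] != pos)   /* nothing maps to pos */
            missing[count++] = pos;
    }
    return count;
}
```

For the mapping 0112456 of A with a reference of length 7, count_extra reports one extra character and find_missing reports position 3. For the mapping 012345 of B, find_missing reports position 6, the last character of P.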
A reference guided recombination of A and B is then a string built as follows. If only one of A and B has a mapping to a given position in P, the character for that position is taken from the string that has the mapping. If both strings have a mapping to that position, the character can be taken from either string.
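The rule above can be sketched in C as follows. This builds one particular recombination and is not how dalign searches for the best one. Several things here are assumptions made for illustration: the function name recombine, the prefer argument (one character per position of P, 'A' or 'B', used only when both strings map to that position), the choice to copy all characters mapped to a position together when a string has extra characters there, and the choice to emit nothing for a position that neither string maps to. Mappings are assumed to be non-decreasing arrays of positions in P.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: build a recombination of a and b guided by a
 * reference of length p_len. out must have room for a_len + b_len + 1
 * bytes. Returns the length of the recombined string. */
size_t recombine(const char *a, const size_t *map_a, size_t a_len,
                 const char *b, const size_t *map_b, size_t b_len,
                 size_t p_len, const char *prefer, char *out)
{
    size_t i = 0, j = 0, n = 0;
    assert(out != NULL && prefer != NULL);
    for (size_t pos = 0; pos < p_len; pos++) {
        /* Find the run of characters in each string mapped to pos. */
        size_t i_end = i, j_end = j;
        while (i_end < a_len && map_a[i_end] == pos) i_end++;
        while (j_end < b_len && map_b[j_end] == pos) j_end++;
        int has_a = i_end > i;
        int has_b = j_end > j;
        /* Take from a if only a maps here, or if both do and a is preferred. */
        int take_a = has_a && (!has_b || prefer[pos] == 'A');
        if (take_a) {
            memcpy(out + n, a + i, i_end - i);
            n += i_end - i;
        } else if (has_b) {
            memcpy(out + n, b + j, j_end - j);
            n += j_end - j;
        }
        i = i_end;
        j = j_end;
    }
    out[n] = '\0';
    return n;
}
```

With A = ABBCEFG (mapping 0112456) and B = ABCDEF (mapping 012345), always preferring A gives ABBCDEFG, where D comes from B because A has no mapping to position 3. Always preferring B gives ABCDEFG, where G comes from A because B has no mapping to position 6.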
The tool dalign is written in C and has no external dependencies (apart from GNU make for building). Running 'make dalign' should build an executable 'dalign', which takes the following parameters in the given order:
To run the test suite of the tool, issue 'make test && ./test'. This requires that the C test framework Check is installed.
To run the benchmark used in the paper, issue 'make test && ./benchmark input.txt n' where input.txt contains the test data of your choice, in the same format as the genome files for dalign. The program will report the average time over n iterations of the algorithm with the same input.