Tuukka Norri defends his PhD thesis On Founder Segmentations of Aligned Texts

On Friday the 17th of June 2022, M.Sc. Tuukka Norri defends his doctoral thesis On Founder Segmentations of Aligned Texts. The thesis is related to research done in the Department of Computer Science and in the Genome-Scale Algorithmics team of the Algorithmic Bioinformatics group.

M.Sc. Tuukka Norri defends his doctoral thesis On Founder Segmentations of Aligned Texts on Friday the 17th of June 2022 at 13 o´clock in the University of Helsinki Exactum building, Auditorium B123 (Pietari Kalmin katu 5, 1st floor). His opponent is Professor Knut Reinert (Freie Univerität Berlin, Germany) and custos Professor Veli Mäkinen (University of Helsinki). The defence will be held in English. It is possible to follow the defence as a live stream at https://video.helsinki.fi/unitube/live-stream.html?room=l10.

The thesis of Tuukka Norri is a part of research done in the Department of Computer Science and in the Genome-Scale Algorithmics team of the Algorithmic Bioinformatics group at the University of Helsinki. His supervisor has been Professor Veli Mäkinen (University of Helsinki).

On Founder Segmentations of Aligned Texts

Aligned texts, such as sequence alignments or multiple sequence alignments, are sets of two or more texts in which corresponding parts have been identified. Typically the corresponding parts are determined in such a way that the amount of changes needed to transform one text to another by using editing operations, such as replacements, insertions or deletions, is minimised. Such text alignments have a variety of uses when applied to biological sequences.

In this dissertation, using a set of aligned texts as an input, we consider a set of related problems that have to do with finding segmentations, i.e. splitting the texts into shorter parts by some criteria. Our first aim is to identify equivalent parts of the texts for the purpose of data compression. From the resulting segmentation we would like to construct a smaller number of sequences. These are called founder sequences since they can theoretically be seen, considering DNA sequences, as those of the founding members of some population. We solve a variant of the problem where the maximum number of distinct segments over all groups of segments separated by segment boundaries is minimised, given that the segment boundaries occur in the same positions in all input texts. Our algorithm works in linear time and takes the minimum segment length as a parameter. We also adapt the algorithm to process a set of variant records directly, and use the resulting founder sequences to build an index for a genotyping workflow.

Our second aim is to find a segmentation, in which the text segments are distinct by certain criteria. Consequently, a type of graph can be built in which solving the problem of offline string matching on a labelled graph can be done efficiently. The graphs in question are called founder graphs that have the property of being repeat-free or semi-repeat-free. We present algorithms for finding the required segmentation, constructing an index data structure and determining if a given pattern occurs in the built graph in either linear or near-linear time. The achieved time complexities make the algorithms relevant for practical purposes.

The work is based on five published papers and previously unpublished research. In Paper I, we extend the positional Burrows-Wheeler transform to constant alphabets with more than two characters. The transform is used as an elementary part of the algorithm for segmenting a set of aligned texts for the purpose of generating founder sequences, presented in Paper II. These are incorporated into a genotyping workflow for short reads in Paper III, where we compare the variant calling accuracy to other workflows in experiments with both simulated and natural data. In particular, we show that utilising founder sequences this way results in good precision and recall especially in case of single-nucleotide variants.

In Paper IV, we describe founder graphs and show how to construct them from a set of aligned texts, in which the unaligned texts are all of the same length. We also show that if the node labels are repeat-free, i.e. sufficiently unique, the graph admits efficient indexing. We extend the theory on founder graphs in Paper V by showing how to construct an indexable founder graph from a set of aligned texts that also contains insertions and deletions. In this case, we make use of semi-repeat-free founder graphs.

Additionally, we show in this dissertation that semi-repeat-free founder graphs admit a type of prefix property. We make use of the property to augment the index generated from a founder graph to include path information for the purpose of identifying the path on which a given pattern occurs from a set of predefined paths. This part of the work has not been published earlier.

Avail­ab­il­ity of the dis­ser­ta­tion

An electronic version of the doctoral dissertation is available on the e-thesis site of the University of Helsinki at http://urn.fi/URN:ISBN:978-951-51-8215-9.

Printed copies will be available on request from Tuukka Norri: tuukka.norri@helsinki.fi