M.Sc. Miika Leinonen defends his doctoral thesis "Optical Maps in Genome Assembly and Long k-mer Extraction" on Friday the 24th of May 2024 at 13:00 in the University of Helsinki Exactum building, Auditorium CK112 (Pietari Kalmin katu 5, basement). His opponent is Professor Sven Rahmann (Saarland University, Germany) and the custos is Professor Veli Mäkinen (University of Helsinki). The defence will be held in English.
The thesis of Miika Leinonen is a part of research done in the Department of Computer Science and in the Algorithms for Biological Sequencing Data team of the Algorithmic Bioinformatics group at the University of Helsinki. His supervisor has been University Lecturer, Docent Leena Salmela (University of Helsinki).
Optical Maps in Genome Assembly and Long k-mer Extraction
Sequencing the entire human genome has been an ambitious endeavor, culminating in the recent achievement of a gapless human genome sequence in 2022. While this accomplishment is noteworthy, the benefits of genome sequencing have long been recognized. A lot of work goes into gathering, processing, and analyzing sequencing data. In this thesis, we explore some of the behind-the-scenes technical aspects of bioinformatics. More specifically, we take a look at the challenges associated with sequencing data itself and tools for its transformation into formats usable in downstream applications and analysis.
We start this thesis by looking at the genome assembly process and how it can be enhanced to obtain a more complete and accurate picture of the genome. Genome assembly is needed because sequencing data does not represent genomes completely accurately: the data is fragmented and contains errors. The goal of genome assembly is to reconstruct an accurate depiction of the underlying genome from the available imperfect data. The work in this thesis examines how this process can be improved with additional data in the form of optical maps. We propose a genome assembly pipeline that successfully takes advantage of optical maps to produce higher quality contigs.
The second part of this thesis is focused on the k-mer counting problem. Sequencing data is often split into sequences of a chosen length k, called k-mers. Doing this enables easier processing of the data, and analysis based on k-mer counts. Due to the errors in the sequencing data, finding longer k-mers accurately can be difficult. For this problem, we present two approaches that aim to find long k-mers accurately, even in the presence of errors. The first solution works well when only substitution errors are present. However, this is not a realistic situation, and our second solution is intended to address it. The proposed method works well compared to conventional k-mer counting, but cannot compete against it if the reads are corrected beforehand.
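To make the counting problem concrete, the following is a minimal sketch of conventional k-mer counting, together with a small helper that shows why longer k-mers are more sensitive to errors. It is not the method proposed in the thesis; the function names `count_kmers` and `kmers_touching` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        # A read of length n contains n - k + 1 overlapping k-mers.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmers_touching(read_length, k, pos):
    """Number of k-mers in a read that cover position pos (0-based).

    A single erroneous base at pos corrupts exactly these k-mers.
    """
    first = max(0, pos - k + 1)      # leftmost k-mer start covering pos
    last = min(pos, read_length - k)  # rightmost k-mer start covering pos
    return max(0, last - first + 1)
```

For example, `count_kmers(["ACGTACGT"], 4)` counts `ACGT` twice and each of `CGTA`, `GTAC` and `TACG` once. A single erroneous base in the middle of a long read affects up to k of its k-mers, so the share of error-free k-mers shrinks as k grows.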
The last problem discussed in this thesis is also related to k-mers. This time, instead of focusing on the accuracy of the k-mer counting process, we look into how to represent the k-mers as memory-efficiently as possible. When sequencing data is split into k-mers, a lot of repetitive information is shared between them. To take advantage of this fact, we present a dictionary solution for long k-mers, where k-mers are stored implicitly. While our proposed data structure is slower than a plain hash table implementation, it can save a lot of space when processing long k-mers.
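The idea of implicit storage can be illustrated with a small sketch. This is an assumption-laden toy, not the data structure presented in the thesis: the class name `ImplicitKmerDict` is hypothetical, and it simply keeps the source text once and stores each distinct k-mer as an offset into that text instead of as a separate string.

```python
class ImplicitKmerDict:
    """Toy k-mer dictionary that stores offsets into a text, not k-mer strings."""

    def __init__(self, text, k):
        self.text = text
        self.k = k
        # Maps hash value -> list of [start offset, count] entries.
        self.buckets = {}
        for i in range(len(text) - k + 1):
            self._add(i)

    def _kmer_at(self, start):
        # Materialise a k-mer only temporarily, for hashing or comparison.
        return self.text[start:start + self.k]

    def _add(self, i):
        kmer = self._kmer_at(i)
        bucket = self.buckets.setdefault(hash(kmer), [])
        for entry in bucket:
            if self._kmer_at(entry[0]) == kmer:
                entry[1] += 1  # k-mer already present: bump its count
                return
        bucket.append([i, 1])  # first occurrence: remember where it starts

    def count(self, kmer):
        """Return how many times kmer occurs in the text (0 if absent)."""
        if len(kmer) != self.k:
            return 0
        for start, n in self.buckets.get(hash(kmer), []):
            if self._kmer_at(start) == kmer:
                return n
        return 0
```

Each stored entry costs a fixed amount of space regardless of k, whereas an explicit table stores k characters per k-mer. The price is extra work per lookup, since a candidate has to be compared against the text, which mirrors the space-for-speed trade-off described above.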
Availability of the dissertation
An electronic version of the doctoral dissertation will be available on the e-thesis site of the University of Helsinki at http://urn.fi/URN:ISBN:978-952-84-0130-8.
Printed copies will be available on request from Miika Leinonen: miika.leinonen@helsinki.fi.