Ilari Maarala defends his PhD thesis on Scalable computational methods for high-throughput sequencing data analytics

On Friday the 17th of December 2021, M.Sc. (Tech) Ilari Maarala defends his PhD thesis on Scalable computational methods for high-throughput sequencing data analytics in population genomics. The thesis is related to research done in the Department of Computer Science and in the Parallel and Distributed Computing group.

M.Sc. (Tech) Ilari Maarala defends his doctoral thesis Scalable computational methods for high-throughput sequencing data analytics in population genomics on Friday the 17th of December 2021 at 12 o'clock in the University of Helsinki Exactum building, Auditorium CK112 (Pietari Kalmin katu 5, basement). His opponent is PhD Rayan Chikhi (Institut Pasteur, France) and the custos is Professor Keijo Heljanko (University of Helsinki). The defence will be held in English. The defence can also be followed as a live stream at https://helsinki.zoom.us/j/64932724534.

The thesis of Ilari Maarala is a part of research done in the Department of Computer Science and in the Parallel and Distributed Computing group at the University of Helsinki. His supervisor has been Professor Keijo Heljanko (University of Helsinki).

Scalable computational methods for high-throughput sequencing data analytics in population genomics

High-throughput sequencing (HTS) technologies have enabled rapid DNA sequencing of whole genomes collected from various organisms and environments, including human tissues, plants, soil, water, and air. As a result, sequencing data volumes have grown by several orders of magnitude, and the number of assembled whole genomes is increasing rapidly as well. This whole-genome sequencing (WGS) data has revealed the genetic variation in humans and other species, and advanced various fields from human and microbial genomics to drug design and personalized medicine. The amount of sequencing data has almost doubled every six months, creating new possibilities but also big data challenges in genomics. The diverse methods used in modern computational biology require a vast amount of computational power, and advances in HTS technology are further widening the gap between the analysis input data and the analysis outcome.

Currently, many of the existing genomic analysis tools, algorithms, and pipelines do not fully exploit the power of distributed and high-performance computing, which in turn limits the analysis throughput and, in the long run, holds back the deployment of the applications to clinical practice. Thus, harnessing distributed and cloud computing in bioinformatics is more relevant than ever before. In addition, efficient data compression and storage methods for genomic data processing and retrieval, integrated with conventional bioinformatics tools, are essential. These vast datasets have to be stored and structured in formats that can be managed, processed, searched, and analyzed efficiently in distributed systems.

Genomic data contain repetitive sequences, a key property for developing efficient compression algorithms that alleviate the data storage burden. Moreover, indexing compressed sequences appropriately for bioinformatics tools, such as read aligners, offers direct sequence search and alignment capabilities with compressed indexes. Relative Lempel-Ziv (RLZ) has been found to be an efficient compression method for repetitive genomes that complies with the data-parallel computing approach. RLZ has recently been used to build hybrid indexes compatible with read aligners, and we focus on extending it with distributed computing. Data structures found in genomic data formats have properties suitable for parallelizing routine bioinformatics methods, e.g., sequence matching, read alignment, genome assembly, genotype imputation, and variant calling. Compressed indexing fused with the routine bioinformatics methods and data-parallel computing seems a promising approach to building population-scale genome analysis pipelines. Various data decomposition and transformation strategies are studied for optimizing data-parallel computing performance when such routine bioinformatics methods are executed in a complex pipeline. These novel distributed methods are studied in this dissertation and demonstrated in a generalized scalable bioinformatics analysis pipeline design.
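For readers unfamiliar with RLZ, the following is a minimal illustrative sketch of the general idea, not the implementation used in the dissertation. It assumes a greedy parse in which a target sequence is encoded as (position, length) references into a shared reference sequence, with a literal character wherever no match exists. The function names and the naive quadratic search are assumptions made for illustration; practical tools would typically search the reference through an index such as a suffix array.

```python
from typing import List, Tuple, Union

# A phrase is either a (reference position, length) pair or a single literal character.
Phrase = Union[Tuple[int, int], str]


def rlz_compress(reference: str, target: str) -> List[Phrase]:
    """Greedily factorize `target` into longest matches against `reference`."""
    phrases: List[Phrase] = []
    i = 0
    while i < len(target):
        best_pos, best_len = -1, 0
        # Naive search over every reference position (illustration only).
        for start in range(len(reference)):
            length = 0
            while (start + length < len(reference)
                   and i + length < len(target)
                   and reference[start + length] == target[i + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = start, length
        if best_len == 0:
            # Character does not occur in the reference: store it as a literal.
            phrases.append(target[i])
            i += 1
        else:
            phrases.append((best_pos, best_len))
            i += best_len
    return phrases


def rlz_decompress(reference: str, phrases: List[Phrase]) -> str:
    """Rebuild the target by copying substrings from the reference."""
    parts = []
    for phrase in phrases:
        if isinstance(phrase, str):
            parts.append(phrase)
        else:
            pos, length = phrase
            parts.append(reference[pos:pos + length])
    return "".join(parts)


reference = "ACGTACGTTGCA"
target = "ACGTTGCANACGT"
phrases = rlz_compress(reference, target)
restored = rlz_decompress(reference, phrases)
```

Because each target sequence is parsed independently against the same reference, many genomes can be compressed in parallel once the reference is shared among the workers, which is the property that makes the method a natural fit for data-parallel computing.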

The dissertation starts from the main concepts of genomics and DNA sequencing technologies and builds routine bioinformatics methods on the principles of distributed and parallel computing. It advances towards designing fully distributed and scalable bioinformatics pipelines, focusing on population genomic problems where the input data sets are vast and the analysis results are hard to achieve with conventional computing. Finally, the methods studied are applied in scalable population genomics applications using real WGS data and evaluated on a high-performance computing cluster. The experiments include mining virus sequences from human metagenomes, imputing genotypes from large-scale human populations, sequence alignment with compressed pan-genomic indexes, and assembling reference genomes for pan-genomic variant calling.

Availability of the dissertation

An electronic version of the doctoral dissertation is available on the e-thesis site of the University of Helsinki at http://urn.fi/URN:ISBN:978-951-51-7746-9.

Printed copies will be available on request from Ilari Maarala: ilari.maarala@helsinki.fi.