Tommi Mäklin defends his PhD thesis on Probabilistic Methods for High-Resolution Metagenomics

On Friday the 28th of October 2022, M.Sc. Tommi Mäklin defends his doctoral thesis on Probabilistic Methods for High-Resolution Metagenomics. The thesis is related to research done in the Department of Computer Science and in the Probabilistic Inference, Privacy and Computational Biology group.

M.Sc. Tommi Mäklin defends his doctoral thesis, Probabilistic Methods for High-Resolution Metagenomics, on Friday the 28th of October 2022 at 12 o'clock in the University of Helsinki Exactum building, Auditorium CK112 (Pietari Kalmin katu 5, basement). His opponent is Associate Professor Leo Lahti (University of Turku), and the custos is Associate Professor Antti Honkela (University of Helsinki). The defence will be held in English.

Tommi Mäklin's thesis is part of the research done in the Department of Computer Science and in the Probabilistic Inference, Privacy and Computational Biology group at the University of Helsinki. His supervisors have been Associate Professor Antti Honkela and Professor Jukka Corander (University of Helsinki).

Probabilistic Methods for High-Resolution Metagenomics

Metagenomics is the analysis of DNA sequencing data from samples obtained directly from the environment and containing several different organisms at once. Common tasks in metagenomics are taxonomic profiling, where the goal is to identify the organisms present in the sample and assign relative abundances to them, and taxonomic binning, where the sequencing data from the sample is divided into bins that correspond to some sensible taxonomic units. This thesis introduces methods for performing these two tasks at a high resolution capable of distinguishing between lineages of bacterial species. The first of these methods is mSWEEP, which solves the profiling task by utilizing a collection of grouped bacterial reference sequences, pseudoalignment, and a probabilistic model. The second method, mGEMS, builds upon mSWEEP to solve the binning task using an assignment rule derived from the fundamentals of the probabilistic model used by mSWEEP. Both methods are accompanied by efficient implementations that utilize fast variational inference and pseudoalignment to fit the model in a reasonable time, rendering them applicable to large-scale datasets.

Both mSWEEP and mGEMS have been developed for application in either the traditional whole-community metagenomics context, where direct-from-environment samples are analysed, or the plate sweep metagenomics context, where the sample has been plated once on a selective medium. While the latter is not metagenomics in the traditional sense, this thesis advocates for its use when high-depth sequencing data is required from some species and the other organisms are not of interest. Regardless of the type of metagenomics data used, the ultimate goal of both mSWEEP and mGEMS is to allow standard genomic epidemiological analyses to be performed directly from data containing several strains of the same bacteria, skipping the isolation steps typically required to separate them. Owing to the implied cost savings from reducing the number of cultures that need to be performed, as well as the better capture of variation in the samples through the use of metagenomics data, mSWEEP and mGEMS enable entirely novel types of analyses in the field of genomic epidemiology.

Availability of the dissertation

An electronic version of the doctoral dissertation is available on the e-thesis site of the University of Helsinki at

Printed copies will be available on request from Tommi Mäklin: