Jarkko Toivonen defends his PhD thesis on Modeling and Learning Monomeric and Dimeric Transcription Factor Binding Motifs

On Friday the 22nd of November 2019, M.Sc. Jarkko Toivonen will defend his doctoral thesis on Modeling and Learning Monomeric and Dimeric Transcription Factor Binding Motifs. The thesis is a part of research done in the Department of Computer Science and in the Combinatorial Pattern Matching research group at the University of Helsinki.

M.Sc. Jarkko Toivonen defends his doctoral thesis Modeling and Learning Monomeric and Dimeric Transcription Factor Binding Motifs on Friday the 22nd of November 2019 at 12 o'clock noon in the University of Helsinki Exactum building, Room D122 (Pietari Kalmin katu 5, 1st floor). His opponent is Professor Juho Rousu (Aalto University, Finland) and the custos is Professor Veli Mäkinen (University of Helsinki). The defence will be held in English.

The thesis of Jarkko Toivonen is part of the research done in the Department of Computer Science and in the Combinatorial Pattern Matching research group at the University of Helsinki. His supervisor is Professor (Emeritus) Esko Ukkonen (University of Helsinki).

Modeling and Learning Monomeric and Dimeric Transcription Factor Binding Motifs

In this thesis we aim to learn models that describe the sites in DNA that a transcription factor (TF) prefers to bind to. We concentrate on probabilistic models that assign a binding probability to each DNA sequence of a fixed length. The probability models used are inhomogeneous 0th and 1st order Markov chains, which we call the Position-specific Probability Matrix (PPM) and the Adjacent Dinucleotide Model (ADM), respectively. We consider both the case where a single TF binds to DNA in isolation and the case where two TFs bind to proximal locations in DNA, possibly with interactions between the two factors. We use two algorithmic approaches to this learning task.
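To make the difference between the two model families concrete, the following minimal sketch computes the probability of a fixed-length sequence under a PPM and under an ADM. The function names, the data layout and the toy numbers are assumptions made for illustration only; they are not taken from the thesis or its software.

```python
from math import prod

ALPHABET = "ACGT"

def ppm_probability(ppm, seq):
    """0th-order inhomogeneous model: positions are independent.
    ppm[i][c] is the probability of nucleotide c at position i."""
    return prod(ppm[i][c] for i, c in enumerate(seq))

def adm_probability(initial, transitions, seq):
    """1st-order inhomogeneous model: each position depends on the previous one.
    initial[c] is P(first nucleotide = c); transitions[i][a][b] is
    P(nucleotide at position i+1 = b | nucleotide at position i = a)."""
    p = initial[seq[0]]
    for i in range(len(seq) - 1):
        p *= transitions[i][seq[i]][seq[i + 1]]
    return p

# Hypothetical toy models for sequences of length 2.
ppm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
initial = ppm[0]
uniform = {c: 0.25 for c in ALPHABET}
# After an A the toy ADM strongly prefers T; after any other nucleotide it is uniform.
transitions = [{
    "A": {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    "C": dict(uniform),
    "G": dict(uniform),
    "T": dict(uniform),
}]

p_ppm = ppm_probability(ppm, "AT")                   # 0.7 * 0.7
p_adm = adm_probability(initial, transitions, "AT")  # 0.7 * 0.85
```

The ADM can express a dependency between adjacent positions (here, T following A) that a PPM, which multiplies independent per-position probabilities, cannot.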

Both approaches use data that is assumed to be enriched in binding sites of the TF(s) under investigation. The binding sites in the data then need to be located and used to learn the parameters of the binding model. Both methods also assume that the length of the binding sites is known beforehand.

We first introduce a combinatorial approach where we count ℓ-mers that are either binding sites, background noise, or belong partly to both of these categories. The most common ℓ-mer in the data and its Hamming neighbours are declared to be binding sites. Then an algorithm to align these binding sites in an unbiased manner is introduced. To avoid false binding sites, the fraction of signal in the data is estimated and used to subtract the counts that arise from the background.
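The first step of this combinatorial idea, choosing the most common ℓ-mer as a seed and collecting its Hamming neighbours, can be sketched as follows. This is a simplified illustration and not the SeedHam implementation: the function names, the radius parameter and the toy reads are assumptions, and the unbiased alignment, the background subtraction and the handling of reverse complements are all omitted.

```python
from collections import Counter

def count_lmers(sequences, l):
    """Count every length-l substring occurring in the sequences."""
    counts = Counter()
    for s in sequences:
        for i in range(len(s) - l + 1):
            counts[s[i:i + l]] += 1
    return counts

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def seed_and_neighbours(sequences, l, radius=1):
    """Return the most common l-mer (the seed) and the counts of all observed
    l-mers within the given Hamming distance of it."""
    counts = count_lmers(sequences, l)
    seed, _ = counts.most_common(1)[0]
    sites = {k: n for k, n in counts.items() if hamming(seed, k) <= radius}
    return seed, sites

# Hypothetical toy reads; three contain ACGT and one contains the neighbour ACCT.
reads = ["TTACGTAA", "GGACGTCC", "CAACGTTG", "ATACCTGA"]
seed, sites = seed_and_neighbours(reads, 4)
```

In this toy input the seed is ACGT, and the only other observed 4-mer within Hamming distance 1 of it is ACCT.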

The second approach has additional benefits. First, the division into signal and background is done rigorously using a maximum likelihood method, which avoids the problems caused by the ad hoc nature of the first approach. Second, the use of a mixture model allows multiple models to be learned simultaneously. This mixture model is then extended to include dimeric models as combinations of two binding models. We call this reduction of dimers to monomers modularity. It allows investigating the preference for each distance between the two models, including negative distances in the overlapping case, and for each relative orientation. The mixture model that best explains the data in the maximum likelihood sense is learned using an EM algorithm. Since all the submodels belong to the same mixture model, their relative popularity can be directly compared. The mixture model gives an intuitive and unified view of the different binding modes of a single TF or a pair of TFs.
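The core of the mixture-model idea can be illustrated with a much reduced sketch: an EM algorithm for a two-component mixture of one PPM and a uniform background, applied to sequences that are already of motif length. This is not MODER. The function names, the initialization from a seed, the pseudocount and the toy data are assumptions made for illustration, and the sketch does not locate sites within longer sequences, learn several models at once, or handle dimers.

```python
ALPHABET = "ACGT"

def ppm_prob(ppm, seq):
    """Probability of seq under a PPM (independent positions)."""
    p = 1.0
    for i, c in enumerate(seq):
        p *= ppm[i][c]
    return p

def em_motif_vs_background(seqs, seed, iterations=50, pseudocount=0.1):
    """EM for a mixture of one PPM (signal) and a uniform background.
    Returns the signal fraction lam and the learned PPM."""
    l = len(seed)
    # Assumed initialization: a PPM concentrated on the seed sequence.
    ppm = [{c: (0.7 if c == s else 0.1) for c in ALPHABET} for s in seed]
    lam = 0.5
    background = 0.25 ** l
    for _ in range(iterations):
        # E-step: posterior probability that each sequence is signal.
        resp = []
        for s in seqs:
            signal = lam * ppm_prob(ppm, s)
            resp.append(signal / (signal + (1 - lam) * background))
        # M-step: re-estimate the signal fraction and the PPM from the
        # responsibility-weighted nucleotide counts (plus a pseudocount).
        total = sum(resp)
        lam = total / len(seqs)
        ppm = [
            {c: (sum(r for r, s in zip(resp, seqs) if s[i] == c) + pseudocount)
                / (total + len(ALPHABET) * pseudocount)
             for c in ALPHABET}
            for i in range(l)
        ]
    return lam, ppm

# Hypothetical toy data: mostly ACGT-like sites plus a few unrelated 4-mers.
data = ["ACGT"] * 6 + ["ACGA", "TTGC", "GCAT", "CATG"]
lam, ppm = em_motif_vs_background(data, "ACGT")
consensus = "".join(max(col, key=col.get) for col in ppm)
```

Because the signal fraction and the motif parameters are estimated jointly, no separate ad hoc step is needed to decide which sequences count as background.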

Implementations of all introduced algorithms, SeedHam and MODER for learning PPM models and MODER2 for learning ADM models, are freely available from GitHub. In validation experiments, ADM models were observed to be slightly but consistently better than PPM models at explaining binding-site data. In addition, learning modular mixture models confirmed many previously detected dimeric structures and gave new biological insights about different binding modes and their compact representations.

Availability of the dissertation

An electronic version of the doctoral dissertation is available on the e-thesis site of the University of Helsinki at http://urn.fi/URN:ISBN:978-951-51-5602-0.

Printed copies will be available on request from Jarkko Toivonen: jarkko.toivonen@cs.helsinki.fi.