Who is predisposed to diabetes, which migraine drug would be perfect for you – algorithms are combing masses of data to find the answers
Algorithms are already seeking genes that make us susceptible to disease and helping us develop increasingly targeted medicine. Developers of new computational ideas and algorithms are increasingly needed to help develop new drugs.

In recent years, researchers have uncovered hundreds of gene mutations that underlie our susceptibility to diabetes and heart disease. However, it is not yet known precisely what causes these genetic factors or how they work together.

“Making genetics more accessible is becoming increasingly important, as the choice of treatment becomes more individualised. The goal is to harness the patient’s genetic makeup to choose the best drugs and treatments for them,” explains Matti Pirinen, assistant professor of statistical genomics.

At FIMM, the Institute for Molecular Medicine Finland, Pirinen is looking for mutations that influence susceptibility to disease. He works within the Centre of Excellence in Complex Disease Genetics, where he heads the research group in computational genomics. The group applies statistical machine learning to the analysis of genome data and population genetics, and it brings together experts in mathematics, statistics and bioinformatics.

Researchers at the Centre of Excellence are seeking gene variants, or alleles, which influence our susceptibility to, for example, migraines, cardiovascular diseases, diabetes, inflammatory bowel diseases and psychiatric disorders. Their research material includes a dataset with millions of gene markers derived from samples taken from thousands of individuals.

With such huge amounts of information, it would be impossible to draw any useful conclusions without machine learning algorithms that seek out reliable, recurring patterns in the data.

“For example, we can look for cases which are genetically or otherwise similar to the affected individual. After that we’ll try to determine what kinds of treatment methods have or have not worked in previous cases. Realising an idea like this in a systematic and efficient manner requires machine learning algorithms which can classify complex data, and in the future, suggest potential treatment choices to physicians,” Pirinen explains.
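
The idea of looking up similar earlier cases can be illustrated with a small Python sketch. It is only an illustration under stated assumptions, not the group's method: the genotype vectors, the drug names, the similarity measure (the share of markers on which two people agree) and the helper names are all invented for the example.

```python
from collections import defaultdict

def marker_similarity(a, b):
    """Fraction of markers at which two genotype vectors agree (assumed measure)."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)

def outcomes_among_similar(patient, past_cases, k=3):
    """Success rate of each treatment among the k most similar past cases."""
    ranked = sorted(
        past_cases,
        key=lambda case: marker_similarity(patient, case["genotype"]),
        reverse=True,
    )
    tallies = defaultdict(lambda: [0, 0])  # treatment -> [successes, total]
    for case in ranked[:k]:
        tallies[case["treatment"]][1] += 1
        if case["responded"]:
            tallies[case["treatment"]][0] += 1
    return {t: s / n for t, (s, n) in tallies.items()}

# Invented toy records: allele counts (0, 1 or 2) at five markers.
past_cases = [
    {"genotype": [0, 1, 2, 0, 1], "treatment": "drug_A", "responded": True},
    {"genotype": [0, 1, 2, 0, 0], "treatment": "drug_A", "responded": True},
    {"genotype": [0, 1, 1, 0, 1], "treatment": "drug_B", "responded": False},
    {"genotype": [2, 0, 0, 2, 2], "treatment": "drug_B", "responded": True},
    {"genotype": [2, 0, 0, 1, 2], "treatment": "drug_A", "responded": False},
]
new_patient = [0, 1, 2, 0, 1]

rates = outcomes_among_similar(new_patient, past_cases, k=3)
print(rates)
```

In this toy data, the three most similar earlier cases favour drug_A. A real system would need far richer data and a carefully validated similarity measure, and its suggestions would still go to a physician for review.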

Algorithms find risk genes

Pirinen’s group is currently working on two studies. The first one, conducted in collaboration with the research group headed by Aarno Palotie, is searching for alleles that are more common among a cohort of more than 100,000 migraine sufferers than they are among the healthy population. For this purpose, Christian Benner, a member of Pirinen’s group, has developed the FINEMAP algorithm, which uses variable selection to identify alleles that are biologically significant and have an impact on the risk of disease.

“An individual who carries a specific gene variant is more likely to contract a disease than a person with a different variant of the gene. However, only a small minority of gene variants with a statistical connection to disease has a direct biological link to its genesis. With the FINEMAP algorithm, we hope to find genetic factors that have a direct impact on the biology of the disease and that we can control with medication,” Pirinen explains.

The FINEMAP algorithm does not provide treatment recommendations, but it does make the early stage of drug development easier by uncovering alleles that are significant for disease. The algorithm can turn complex data into probabilities that are easier to study, and which can be used to evaluate the biological significance of each gene variant.
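
One way to picture how such probabilities can be read variant by variant is sketched below. It assumes that each candidate set of alleles has already been given a probability, and it sums, for every variant, the probabilities of the sets that contain it. The variant names and numbers are invented for illustration and are not FINEMAP output.

```python
from collections import defaultdict

def per_variant_probability(set_probabilities):
    """For each variant, add up the probabilities of the candidate sets containing it."""
    total = sum(set_probabilities.values())
    marginal = defaultdict(float)
    for candidate_set, prob in set_probabilities.items():
        for variant in candidate_set:
            marginal[variant] += prob / total
    return dict(marginal)

# Invented probabilities for three candidate sets of variants.
set_probabilities = {
    frozenset({"variant_1"}): 0.50,
    frozenset({"variant_1", "variant_2"}): 0.25,
    frozenset({"variant_3"}): 0.25,
}

marginals = per_variant_probability(set_probabilities)
for variant, prob in sorted(marginals.items()):
    print(variant, round(prob, 2))
```

Here variant_1 appears in two of the three candidate sets and therefore ends up with the highest probability, which would make it the first choice for closer study in this made-up example.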

FINEMAP in action: The image shows the region around the LIPC gene on chromosome 15 and the statistical association (y-axis) of each genomic position (x-axis) with HDL cholesterol levels. Top panel: FINEMAP results. Lower panel: results of a standard analysis. The triple of variants that FINEMAP proposes as the causal configuration has a likelihood 190 times higher than that of the triple proposed by the standard analysis.

Another ongoing study focuses on the geographical distribution in Finland of the alleles that are associated with susceptibility to illness.

“We have been able to identify a large number of individual sections of the genome that have a minor impact on a particular disease. We are now trying to determine whether the geographical distribution of these variants within Finland displays differences that could account for the geographical variation of the likelihood of illness. This information is significant in the effort to effectively prevent and treat endemic diseases,” Pirinen explains.

In this study, tens of thousands of gene loci must be combined into sum variables, and the researchers must make sure that these variables genuinely describe susceptibility to illness and not just general genetic variation within Finland. This again calls for the basic principles of machine learning, such as cross-validation and the separation of training and test data.
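
The principle of keeping training and test data apart can be shown with a toy example. The sketch below is a simplified illustration on simulated data, not the study's actual method: the way the weights are estimated, the simulated illness rule and all function names are assumptions made for the example, and it uses a single held-out split rather than full cross-validation. It does not address differences in genetic background between regions.

```python
import random

def simulate(n_people, n_loci, rng):
    """Toy data: allele counts 0/1/2; only the first two loci affect illness."""
    genotypes = [[rng.choice([0, 1, 2]) for _ in range(n_loci)] for _ in range(n_people)]
    phenotypes = [1 if g[0] + g[1] >= 3 else 0 for g in genotypes]
    return genotypes, phenotypes

def estimate_weights(genotypes, phenotypes):
    """Weight per locus: mean allele count in cases minus mean in controls."""
    cases = [g for g, y in zip(genotypes, phenotypes) if y == 1]
    controls = [g for g, y in zip(genotypes, phenotypes) if y == 0]
    return [
        sum(g[j] for g in cases) / len(cases) - sum(g[j] for g in controls) / len(controls)
        for j in range(len(genotypes[0]))
    ]

def sum_variable(genotype, weights):
    """Combine many loci into a single weighted score."""
    return sum(w * a for w, a in zip(weights, genotype))

def mean(values):
    return sum(values) / len(values)

rng = random.Random(0)
genotypes, phenotypes = simulate(n_people=400, n_loci=10, rng=rng)

# Keep the last 100 people aside; weights are learned from the first 300 only.
train_g, test_g = genotypes[:300], genotypes[300:]
train_y, test_y = phenotypes[:300], phenotypes[300:]
weights = estimate_weights(train_g, train_y)

# Check the sum variable on people who played no part in estimating the weights.
test_scores = [sum_variable(g, weights) for g in test_g]
case_mean = mean([s for s, y in zip(test_scores, test_y) if y == 1])
control_mean = mean([s for s, y in zip(test_scores, test_y) if y == 0])
print(f"held-out cases: {case_mean:.2f}, held-out controls: {control_mean:.2f}")
```

If the sum variable separates cases from controls only in the data it was built from, and not in the held-out group, it is probably describing noise rather than genuine susceptibility.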

Human touch still needed

Without algorithms, many future drugs would go undiscovered. However, while the algorithm does the groundwork, the results must be interpreted by a physician, who can determine whether the suggestions made by the artificial intelligence make any sense. Pirinen believes that we will see increasingly complex ways of using machine learning and artificial intelligence in drug discovery.

“For example, we are making strides in connecting health register data with genome data. We are transitioning from a crude ‘sick or healthy’ classification into an increasingly precise description of the symptoms and traits of each individual. For this, we need a new generation of experts, capable of handling such complex data through machine learning and artificial intelligence,” says Pirinen.

How does an algorithm sift through genes?

The FINEMAP software is based on the Shotgun Stochastic Search algorithm. Its purpose is to pick out the most influential factors, such as alleles, from among millions of candidates.

The traditional way of looking for such targets is to comb through the entire dataset with an algorithm, a process that is extremely slow and not necessarily even feasible. FINEMAP’s speciality is that it does not process all of the data. Nevertheless, its results are practically as accurate as those of an exhaustive algorithm.

The algorithm moves from one candidate to the next using two basic operations: evaluating neighbouring candidates and selecting a new candidate.

With each step, it generates neighbouring candidates by making minute changes to the current candidate. It selects the new candidate from among its neighbours based on the probability of each candidate being the specific group of significant alleles that can influence the onset of disease.

The algorithm keeps repeating the process until it can no longer find promising new candidates. At the end, it outputs all the candidates it has encountered, along with the probability that each one is the desired group of significant alleles. Researchers use these probabilities to decide which alleles should be selected for closer study.
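
The search loop described above can be sketched in Python as follows. This is a simplified illustration, not the FINEMAP implementation. It assumes that neighbouring candidates differ by adding, removing or swapping a single variant and that the search starts from the best single variant. The scoring function is a made-up stand-in for the real probability model, and all names are hypothetical.

```python
import math
import random

TRUE_SET = frozenset({2, 5})  # invented "significant" variants for the toy score

def toy_score(candidate):
    """Made-up score: higher the closer the candidate set is to TRUE_SET."""
    return math.exp(-3 * len(candidate ^ TRUE_SET))

def neighbours(candidate, n_variants, max_size):
    """Candidate sets reachable by removing, swapping or adding one variant."""
    outside = [v for v in range(n_variants) if v not in candidate]
    result = set()
    for v in candidate:
        if len(candidate) > 1:
            result.add(candidate - {v})            # remove one variant
        for u in outside:
            result.add((candidate - {v}) | {u})    # swap one variant
    if len(candidate) < max_size:
        for u in outside:
            result.add(candidate | {u})            # add one variant
    return result

def stochastic_search(score, n_variants, max_size=3, n_steps=50, seed=0):
    rng = random.Random(seed)
    # Assumed starting point: the single variant with the best score.
    current = max((frozenset({v}) for v in range(n_variants)), key=score)
    visited = {current: score(current)}
    for _ in range(n_steps):
        candidates = sorted(neighbours(current, n_variants, max_size), key=sorted)
        found_new = False
        for c in candidates:
            if c not in visited:
                visited[c] = score(c)              # evaluate unseen neighbours
                found_new = True
        if not found_new:
            break                                  # nothing new left to explore
        # Select the next candidate with probability proportional to its score.
        current = rng.choices(candidates, weights=[visited[c] for c in candidates], k=1)[0]
    total = sum(visited.values())
    return {c: s / total for c, s in visited.items()}

probabilities = stochastic_search(toy_score, n_variants=8)
best = max(probabilities, key=probabilities.get)
print(sorted(best), round(probabilities[best], 3))
```

The sketch records every candidate it evaluates and finally rescales the scores so that they sum to one over the visited candidates. In this toy setting the set of variants 2 and 5 comes out on top, because the invented score was built to favour it.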

The ideas of the FINEMAP algorithm can also be applied to variable selection problems in other fields.