Computer science research is also furthered by failure

Looking for minute aberrations in the genome is hard work, even though computer science has spent decades coming up with solutions for genomic information retrieval. Veli Mäkinen knows that proofs of impossibility can also benefit the scientific community.

How does it feel when research findings are not what was hoped for? According to Veli Mäkinen, Professor of computer science, it feels good.

For close to a decade, Mäkinen and his research group have been looking for a better way of finding alterations in human DNA. Such variants make up the genome of individuals, in addition to which they may influence the development of diseases.

The project came to a close recently, ending in an impasse of sorts: developing an efficient search algorithm is an impossibility without solving another difficult problem associated with computer science. The results are to be published in July 2019 at the ICALP conference. The project is joint work with Professor Roberto Grossi from the University of Pisa, PhD student Massimo Equi, and University Researcher Alexandru Tomescu.  

How did this come about?

Looking for targeted therapies

Exploring certain DNA sequences is a basic tool of drug development and the identification of hereditary diseases, among other activities. With DNA data, researchers are developing increasingly individual cancer therapies that take into account individual genomes.

Traditionally, specific sections of DNA have been observed from sequenced DNA, entered into a computer as a single long string of characters. From this string, search algorithms look for the desired sequences much like Google combs through the universe of text on the internet. This is something that was already done in the 1970s.

What has also already been known for decades is that when the string of characters gets longer, searches get more cumbersome and the limits of computing power are quickly reached. 

Furthermore, the genome subjected to searches is not an exact depiction of anyone’s genes. Rather, it is an approximation considered accurate enough by the scientific community.

“Actually, the reference genome can be quite far from the actual genome of individuals. This way, a lot of information on genomic variation important to drug development is lost,” Mäkinen explains.

This is why researchers are interested in genomic variants and why more accurate search methods are under development.

Experimenting with a variation graph

To better identify genomic variants, novel methods of looking for them emerged alongside the traditional one roughly 10 years ago. One of them entails instructing a search algorithm to look for DNA data from what is known as a variation graph.

Variation graphs are a way to present information in a manner that efficiently compacts data, which makes them well suited to presenting genomic data. 

variaatioverkko

The genomes of different individuals are very similar. There is no sense in illustrating the genomes of millions of people in separate strings of characters; rather, the focus should be on determining their differences. In order to avoid an uncontrollable deluge of data, the nucleotide sequences common to all should be presented as a single string, like in this picture. Once differences between individuals are encountered, the variation graph branches off, allowing the search algorithm to find the variants. Getting back to identical genomic regions, the characters resume their single compact string. The search string pictured is ‘voting’ for two variants.

Result: impossible

Mäkinen's research group attempted to develop a search algorithm to look for specific DNA strings using such a variation graph as efficiently as traditional algorithms focused on single genomes.

It became apparent that this was impossible. The most obvious kind of search algorithm, one that jumps from each part of the graph to the next and compares characters, proved to be the best possible solution. As the length of character strings grows, running such searches starts to consume so much computing power that with currently available equipment, the calculations would take hundreds of days.

“Accurate searches using a variation graph were just as difficult as those carried out with the traditional method based on the reference genome, which allows for errors. Our findings indicate that there is not much hope of finding a meaningfully better search method,” Mäkinen explains.

Had Mäkinen’s group succeeded in improving search results, it would have been a big deal in the field of computer science, a significant advance involving a classic problem of the discipline, the P versus NP problem.

Lighting the way for colleagues

Still, Mäkinen is not disappointed with the result. Far from it – after all, this is what basic research is all about.

“Basic research advances in small steps. In computer science, something is first demonstrated to be difficult, after which you start narrowing down the field of problems, just like we did. Over time, other researchers may make other discoveries, further closing in on the problem,” Mäkinen says.

According to Mäkinen, researchers across the globe are now slightly better versed in the limitations of genomic information retrieval, a little bit less in the dark than before. Next, researchers should strive to identify in variation graphs features that would not be caught up in known impossibilities.

Mäkinen’s group and his collaborators are also carrying on with this work.

“Of course, it would be nice to skip over these interim stages, but practical research takes time. By proving that something is or is not possible, results start to accumulate, taking us forward. Somewhere down the line, a solution is waiting.”

Read more:

csm_makinen_veli

Veli Mäkinen

Professor of computer science