Bella Zhukova defends her PhD thesis on New space-time trade-offs for pattern matching with compressed indexes

On Thursday the 22nd of February 2024, M.Sc. Bella Zhukova defends her PhD thesis on New space-time trade-offs for pattern matching with compressed. The thesis is related to research done in the Department of Computer Science and in the Compressed Data Structures team of the Algorithmic Bioinformatics group.

M.Sc. Bella Zhukova defends her doctoral thesis "New space-time trade-offs for pattern matching with compressed indexes" on Thursday the 22nd of February 2024 at 13 o'clock in the University of Helsinki Exactum building, Auditorium B123 (Pietari Kalmin katu 5, 1st floor). Her opponent is Professor Johannes Fischer (Technical University of Dortmund, Germany) and custos Professor Simon J. Puglisi (University of Helsinki). The defence will be held in English.

The thesis of Bella Zhukova is a part of research done in the Department of Computer Science and in the Compressed Data Structures team of the Algorithmic Bioinformatics group at the University of Helsinki. Her supervisor has been Professor Simon J. Puglisi (University of Helsinki).

New space-time trade-offs for pattern matching with compressed indexes

The string matching problem is a fundamental problem in computer science and related fields, which involves finding occurrences of a pattern or substring within a larger text. The problem arises in various application areas, such as bioinformatics, text processing, and information retrieval. Classic algorithms for this problem use large working space additional to the input itself, which can become a problem for large modern inputs. To address this, new indexes have emerged that require space close to the information theoretic lower bound, or, in the case of highly-repetitive inputs, even less, yet are still able to support queries efficiently.

In this thesis we design and implement new compressed data structures for various settings of string matching for highly repetitive inputs, such as those arising in bioinformatics and versioned text collections, such as source code. This work is based on five published papers.

Indexed string matching generally presupposes two operations: count, that returns the number of times a pattern occurs in a text, and locate, that lists the positions of those occurrences. In Paper I we consider a means to speed up the count operation of the FM-index data structure, which in turn depends only on the time to perform a rank query on the Burrows-Wheeler Transform of a text.

The next two papers focus on locate queries. In Paper II we describe and implement a very effective compressed index for suffix arrays based on Relative-Lempel Ziv (RLZ) compression, while Paper III improves the results reducing the space usage of the index by creating more efficient RLZ dictionaries and utilizing compact encodings for index components.

In Paper IV we investigate algorithms for document listing queries based on scanning document array intervals, and methods to compress document arrays in a way that still allows fast queries.

Finally, in Paper V we design and implement indexes that are capable of efficiently searching large data sets for patterns with variable-length gaps between subpatterns.

 

Avail­ab­il­ity of the dis­ser­ta­tion

An electronic version of the doctoral dissertation will be available on the e-thesis site of the University of Helsinki at http://urn.fi/URN:ISBN:978-952-84-0088-2.

Printed copies will be available on request from Bella Zhukova: bella.zhukova@helsinki.fi.