Pengfei Xu defends his PhD thesis on Efficient Approximate String Matching with Synonyms and Taxonomies

On Friday the 19th of February 2021, M.Sc. Pengfei Xu will defend his doctoral thesis on Efficient Approximate String Matching with Synonyms and Taxonomies. The thesis is related to research done in the Department of Computer Science and in the Unified Database Management Systems group at the University of Helsinki.

M.Sc. Pengfei Xu defends his doctoral thesis Efficient Approximate String Matching with Synonyms and Taxonomies on Friday the 19th of February 2021 at 14 o'clock in the University of Helsinki Exactum building, Auditorium CK112 (Pietari Kalmin katu 5, Basement). His opponent is Professor Jan Holub (Czech Technical University in Prague, Czech Republic) and custos Professor Jiaheng Lu (University of Helsinki). The defence will be held in English. It is possible to follow the defence as a live stream at https://helsinki.zoom.us/j/65651477920?pwd=K0NibW9OVTNpWGZCcnJ1M2d3NGtEUT09.

The thesis of Pengfei Xu is a part of research done in the Department of Computer Science and in the Unified Database Management Systems group at the University of Helsinki. His supervisor has been Professor Jiaheng Lu (University of Helsinki).

Efficient Approximate String Matching with Synonyms and Taxonomies

Strings are ubiquitous. When being collected from various sources, strings are often inconsistent, which means that they can have the same or similar meaning expressed in different forms, such as with typographical mistakes. Finding similar strings given such inconsistent datasets has been researched extensively during past years under an umbrella problem called approximate string matching.

This thesis aims to enhance the quality of the approximate string matching by detecting similar strings using their meanings besides typographical errors. Specifically, this thesis focuses on utilising synonyms and taxonomies, since both are commonly available knowledge sources. This research is to use each type of knowledge to address either a selection or join tasks, where the first task aims to find strings similar to a given string, and the second task is to find pairs of strings that are similar. The desired output is either all strings similar to a given extent (i.e., all-match) or the top-k most similar strings.

The first contribution of this thesis is to address the top-k selection problem considering synonyms. Here, we propose algorithms with different optimisation goals: to minimise the space cost, to maximise the selection speed, or to maximise the selection speed under a space constraint. We model the last goal as a variant of an 0/1 knapsack problem and propose an efficient solution based on the branch and bound paradigm.

Next, this thesis solves the top-k join problem considering taxonomy relations. Three algorithms, two based on sorted lists and one based on tries, are proposed, in which we use pre-computations to accelerate list scan or use predictions to eliminate unnecessary trie accesses. Experiments show that the trie-based algorithm has a very fast response time on a vast dataset.

The third contribution of this thesis is to deal with the all-match join problem considering taxonomy relations. To this end, we identify the shortcoming of a standard prefix filtering principle and propose an adaptive filtering algorithm that is tuneable towards the minimised join time. We also design a sampling-based estimation procedure to suggest the best parameter in a short time with high accuracy.

Lastly, this thesis researches the all-match join task by integrating typographical errors, synonyms, and taxonomies simultaneously. Key contributions here include a new unified similarity measure that employs multiple measures, as well as a non-trivial approximation algorithm with a tight theoretical guarantee. We furthermore propose two prefix filtering principles: a fast heuristic and accurate dynamic programming, to strive for the minimised join time.

Availability of the dissertation

An electronic version of the doctoral dissertation is available on the e-thesis site of the University of Helsinki at http://urn.fi/URN:ISBN:978-951-51-6988-4.

Printed copies will be available on request from Pengfei Xu: pengfei.xu@helsinki.fi.