PhD candidate Gongsheng Yuan successfully defended his thesis

On Monday, the 9th of May 2022, M.Sc. Gongsheng Yuan successfully defended his doctoral thesis "Keyword Searches and Schema Transformation for Multi-Model Databases". Professor Jiaheng Lu (University of Helsinki) served as Custos and Associate Professor Georgios J. Fakas (Uppsala University, Sweden) as the Opponent. The Faculty representative and grading chair was Professor Jyrki Kivinen (University of Helsinki).

Keyword Searches and Schema Transformation for Multi-Model Databases

Abstract:

The "Variety" of data is driving the evolution and development of databases. One consequence is the emergence of multi-model databases, whose core idea is to use a single, unified platform to manage both well-structured data and NoSQL data. So far, the database community has proposed quite a few multi-model databases that support different data models (e.g., relational, JSON, and graph models), but these databases adopt diverse methods for data storage and querying, which places a heavy burden on novices. The reason is that there is no unified standard for multi-model query languages comparable to SQL, so users have to master a different query language for each multi-model database. Users also need to know the complicated and possibly evolving schema of the multi-model data as background knowledge for writing proper query statements.

Considering these situations, we present our first research topic: how to employ keyword search as an alternative way to explore and query multi-model databases. Empowering users to access multi-model databases with simple keywords relieves them of the steep learning curve of mastering query languages and the schemas of multi-model data. Besides, multi-model databases cannot yet match the mature and robust relational databases that dominate the current market in transaction management, query optimization, security, etc., and they still need time to perfect their mathematical foundations and boost performance. Considering this, we present our second research topic: how to use relational databases as an alternative way to store and query well-structured data and NoSQL data uniformly.

For the first research problem, we utilize the probabilistic formalism of quantum physics to bring the problem into vector spaces and exploit non-classical probabilities to find the top-k most relevant results, where each result may consist of multiple components, drawn from different data models, that correspond to pertinent information. In this process, we apply the quantum language model to represent events (e.g., words) as subspaces, employ density matrices to encapsulate all the information over these subspaces, and use these density matrices to measure the divergence between a query and candidate results. Moreover, by analyzing the quantum language model, we propose the density vector to reduce computational complexity. To construct density vectors, we propose using spatial pattern mining to identify superposition events (i.e., compounds), which improves the accuracy of the method. We also use Principal Component Analysis (PCA) to further improve the efficiency of keyword searches over multi-model databases by reducing query calculation costs. With these techniques, keyword searches over multi-model databases become workable.
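As a rough illustration of the density-matrix idea described above, and not the implementation used in the thesis, the following Python sketch builds a density matrix from a few toy word vectors and ranks two candidates by their divergence from a query. The function names, the 3-dimensional toy vectors, the smoothing constant `eps`, and the choice of the von Neumann divergence as the divergence measure are all assumptions made for this example.

```python
import numpy as np

def density_matrix(vectors, eps=1e-3):
    """Average the projectors of unit-normalised vectors (toy construction)."""
    vs = np.asarray(vectors, dtype=float)
    vs = vs / np.linalg.norm(vs, axis=1, keepdims=True)
    rho = vs.T @ vs / len(vs)          # symmetric, PSD, trace 1
    d = rho.shape[0]
    # Mix with the maximally mixed state so every eigenvalue is positive
    # and the matrix logarithm below is well defined (assumed smoothing).
    return (1 - eps) * rho + eps * np.eye(d) / d

def matrix_log(rho):
    """Logarithm of a symmetric positive-definite matrix via eigendecomposition."""
    w, u = np.linalg.eigh(rho)
    return (u * np.log(w)) @ u.T

def von_neumann_divergence(rho, sigma):
    """S(rho || sigma) = tr(rho (log rho - log sigma)); smaller means closer."""
    return float(np.trace(rho @ (matrix_log(rho) - matrix_log(sigma))))

# Hypothetical embeddings for the keywords of a query and two candidate results.
query = density_matrix([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]])
near = density_matrix([[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]])
far = density_matrix([[0.0, 0.0, 1.0], [0.0, 0.1, 0.9]])

candidates = {"near": near, "far": far}
ranking = sorted(candidates, key=lambda k: von_neumann_divergence(query, candidates[k]))
print(ranking)
```

In this toy setting, the candidate whose vectors point in roughly the same direction as the query's receives the smaller divergence and is ranked first. A top-k search would keep the k candidates with the smallest divergence.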

The second research topic requires designing a good relational schema to store these various data in relational databases. The challenge is the structural difference between flat relational tables and complex multi-model data. To address this problem, we review the relevant works, analyze existing methods, and give a literature review. We find that these works focus on handling a single data model with relational databases; no existing research handles multi-model data. To meet this challenge, we propose to employ reinforcement learning, because this method can automatically obtain an excellent relational schema from the given multi-model data and queries by interacting with the external environment. To make this idea work in the field of databases, we define the input, goal, reward, policy, and observation according to our purpose. In addition, we present a Double Q-tables algorithm to help decrease the complexity of the learning process.
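To give a feel for how learning with two Q-tables can work, here is a minimal Python sketch of standard tabular Double Q-learning on a toy environment. This is a swapped-in textbook technique, and the thesis's Double Q-tables algorithm may differ from it. The environment, its states, actions, and rewards, and all function and parameter names are hypothetical: state `s` stands for "s transformation steps applied", action 1 applies one more step, action 0 wastes a step, and reaching state 3 stands for obtaining a good schema.

```python
import random

def toy_step(state, action):
    """Hypothetical environment: returns (next_state, reward, done)."""
    if action == 1:                       # apply one more transformation step
        nxt = state + 1
        done = nxt == 3
        return nxt, (1.0 if done else 0.0), done
    return state, -0.1, False             # wasted step, small penalty

def double_q_learning(step, n_states, n_actions, episodes=500, alpha=0.1,
                      gamma=0.9, epsilon=0.1, max_steps=50, seed=0):
    rng = random.Random(seed)
    qa = [[0.0] * n_actions for _ in range(n_states)]
    qb = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        for _ in range(max_steps):
            # Epsilon-greedy action selection on the sum of both tables.
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda i: qa[s][i] + qb[s][i])
            s2, r, done = step(s, a)
            # Update one table at random: it selects the best next action,
            # while the other table supplies the value estimate.
            upd, other = (qa, qb) if rng.random() < 0.5 else (qb, qa)
            best = max(range(n_actions), key=lambda i: upd[s2][i])
            target = r if done else r + gamma * other[s2][best]
            upd[s][a] += alpha * (target - upd[s][a])
            s = s2
            if done:
                break
    return qa, qb

qa, qb = double_q_learning(toy_step, n_states=4, n_actions=2)
# Greedy policy for the non-terminal states 0, 1, 2.
policy = [max(range(2), key=lambda i: qa[s][i] + qb[s][i]) for s in range(3)]
print(policy)
```

Splitting action selection and value estimation across the two tables is the usual motivation for Double Q-learning, since it reduces the overestimation bias of a single table. In a real schema-design setting the states, actions, and reward would come from the multi-model data and the query workload.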

Availability of the dissertation

An electronic version of the doctoral dissertation is available on the e-thesis site of the University of Helsinki at http://urn.fi/URN:ISBN:978-951-51-8126-8.