PhD candidate Chao Zhang successfully defended his thesis

On Monday, 19th of April 2021, M.Sc. Chao Zhang successfully defended his doctoral thesis Performance Benchmarking and Query Optimization for Multi-Model Databases. The event has invited Professor Jiaheng Lu (University of Helsinki) as Custos and Associated Professor Kostas Stefanidis (Tampere University) as the Opponent. The Faculty representative and the grading chair is Professor Keijo Heljanko (University of Helsinki).

Performance Benchmarking and Query Optimization for Multi-Model Databases

Abstract:

Multi-Model DataBases (MMDBs) are database management systems that utilize a single platform to store, manipulate, and query data in various data models, e.g., relational, tree, and graph models. Numerous MMDBs have been developed to facilitate multi-model data management, but they differ fundamentally in data storage, query language, and query processing. Existing tools are not suitable for benchmarking MMDBs because they do not consider the multi-model data and workloads. Therefore, it is of paramount importance to provide a new benchmark to evaluate the performance of MMDBs. Furthermore, MMDBs bring new issues for query optimization as traditional techniques cannot properly optimize the queries due to the lack of consideration of multi-model operators and storage. Thus, it calls for new approaches to optimize multi-model queries.

This thesis is divided into two parts to accomplish the above two goals, respectively. In the first part, we developed a new benchmark system for MMDBs in a scenario of social commerce with five data models, i.e., relational, JSON, XML, graph, and key-value. This benchmark system is an end-to-end tool that consists of various components, including synthetic data generation, workload specification, parameter curation, and execution client. Moreover, we leveraged the developed system to conduct a holistic experimental evaluation of the state-of-the-art MMDBs to compare their performance and identify their performance bottlenecks in handling multi-model workloads. In the second part, we proposed two query optimization techniques for MMDBs. Firstly, we proposed a kernel density estimation (KDE)-based model to estimate the selectivity of multi-model joins that involve predicates for relational and tree data models. The estimation method can serve as a building block for selecting an optimal joining order in a cross-model query execution plan. We also proposed two approaches, the max-min approximation (MMA) and grid-based approximation (GBA) models, to approximate the KDE contribution while improving the estimation efficiency for large data samples. Secondly, we studied the problem of view selection in the relational-based graph databases to avoid the costly joins in the relational engine. Particularly, we proposed an end-to-end system that can automatically create, evaluate, and select views for accelerating the query processing. We proposed an extended graph view that can answer the subgraph and supergraph queries simultaneously. We devised a filtering-and-verification framework that enables the verification of the query containment by graph views. We formalized the view selection problem as a 0-1 Knapsack problem, then we developed a view selection algorithm, named graph gene algorithm (GGA), which explores the graph view transformations to reduce the view space and optimize the view benefit. Overall, the present thesis advances MMDBs in three aspects, i.e., performance benchmarking, join selectivity estimation, and automatic view selection.

Avail­ab­il­ity of the dis­ser­ta­tion

An electronic version of the doctoral dissertation is available on the e-thesis site of the University of Helsinki at http://urn.fi/URN:ISBN:978-951-51-7198-6.