Performance Tuning and Query Optimization for Big Data Management
The substantial increase of real-life applications creates a large scale of ever-increasing raw heterogeneous data nowadays, correlating to the four Vs characteristics of big data: volume, variety, velocity and veracity. We discuss volume and variety challenges in this thesis. For volume, efficiently extracting valuable information and making predictions from these large-scale data are interesting to various quarters from academic researchers and industrial data scientists to customers and shareholders. For variety, much research addresses the challenges of effectively storing, collecting, processing, querying, and analyzing heterogeneous data.
This thesis pushes approaches to optimize the performance with volume and variety challenges. For volume challenges, we aim at performance tuning for big data systems. In this part, to tackle cold-start situations with no statistics for models, we leverage cost-model and triangulation to model the performance, thus leading to cost-effective prediction. For variety challenges, we aim at optimizing join queries. In this part, to fill the gap of little research on join queries with heterogeneous data (i.e., relational and tree data), we research the size bound and the worst-case optimal join algorithm with relational and tree data in contrast with only relations.
For parameter tuning, this thesis first contributes to propose a cost model for Spark workloads, which leverages Monte Carlo simulation to achieve cost-effective training. Specifically, we utilize a little part of resources and data to predict dependable performance for larger clusters and datasets even with data skewness and runtime deviations. Particularly, this work considers network and disk bounds so that it performs better with I/O-bounded workloads. Next, the thesis proposes d-simplexed, which models the Spark workloads by leveraging Delaunay Triangulation. Unlike other black-box ML methods, d-simplexed utilizes piece-wise linear regression models, which can be built faster and yield better prediction. Also, d-simplexed is built with an adaptive sampling technique which collects few training points but achieves accurate prediction.
For join queries, this thesis studies the worst-case optimal join with relational and tree data. To this end, we first embark the study on the cross-model conjunctive query (CMCQ) with relational and tree data, and formally define the problem of CMCQ processing. We reveal that the computation of the worst-case size bound of a CMCQ is NP-hard w.r.t query expression complexity. We then develop a worst-case optimal join algorithm called CMJoin to match the size bound of a CMCQ under some circumstances.
Availability of the dissertation
An electronic version of the doctoral dissertation is available on the e-thesis site of the University of Helsinki at http://urn.fi/URN:ISBN:978-951-51-7740-7.