Database and big data analytics systems such as Hadoop and Spark have a large number of configuration parameters that control memory distribution, I/O optimization, parallelism, and compression. Improper parameter settings can cause significant performance degradation and stability issues. However, regular users and even expert administrators struggle to understand and tune these parameters to achieve good performance. In this tutorial, we review existing approaches to automatic parameter tuning for databases, Hadoop, and Spark, which we classify into six categories: rule-based, cost modeling, simulation-based, experiment-driven, machine learning, and adaptive tuning. We describe the foundations of different automatic parameter tuning algorithms and present the pros and cons of each approach. We also highlight real-world applications and systems and identify research challenges for handling cloud services, resource heterogeneity, and real-time analytics.
The tutorial is planned for 1.5 hours and will have the following structure:
Part 1
Motivation (10'). We motivate the need for automatic parameter tuning with several applications and scenarios in the era of Big Data and cloud environments.
History and classification (10'). We introduce the history and classification of parameter tuning approaches.
Pertinent references:
Juwei Shi, Jia Zou, Jiaheng Lu, Zhao Cao, Shiqiang Li, Chen Wang: MRTuner: A Toolkit to Enable Holistic Optimization for MapReduce Jobs. PVLDB 7(13): 1319-1330 (2014)
Part 2
Parameter tuning on databases (20'). We introduce six approaches to tuning performance on database systems.
Pertinent references:
Part 3
Parameter tuning on Hadoop and Spark (35'). We introduce key approaches to tune performance on Hadoop MapReduce and Spark. We compare the solutions in various tuning categories.
Pertinent references:
Part 4
Open problems and challenges (15'). We discuss some real applications and systems for automatic tuning, such as Self-driving Oracle Database, Self-tuning DB2, and the Unravel platform. We conclude with a discussion of open problems and challenges.
Jiaheng Lu is an Associate Professor at the University of Helsinki, Finland. His main research interests lie in big data management and database systems, and specifically in the challenge of efficient data processing over real-life, massive data repositories and the Web. He has written four books on Hadoop and NoSQL databases, and more than 70 journal and conference papers published in venues such as SIGMOD, VLDB, TODS, and TKDE.
Yuxing Chen is a doctoral student at the University of Helsinki. His research topics are parameter tuning on Big Data systems and multi-model query optimization.
Herodotos Herodotou is an Assistant Professor at the Cyprus University of Technology (CUT). His research interests are in large-scale data processing systems, database systems, and cloud computing. In particular, his work focuses on automated performance tuning of both centralized and distributed data-intensive computing systems. His Ph.D. dissertation work on the Starfish platform received the SIGMOD Jim Gray Doctoral Dissertation Award Honorable Mention as well as the Outstanding Ph.D. Dissertation Award in Computer Science at Duke.
Shivnath Babu is the CTO at Unravel Data Systems and an Adjunct Professor at Duke University. His research focuses on ease-of-use and manageability of data-intensive systems, automated problem diagnosis, and cluster sizing for applications running on cloud platforms. Shivnath cofounded Unravel to solve the application management challenges that companies face when they adopt systems like Hadoop and Spark. Unravel originated from the Starfish platform built at Duke, which has been downloaded by over 100 companies. Shivnath has won a US National Science Foundation CAREER Award, three IBM Faculty Awards, and an HP Labs Innovation Research Award.
Duan, Songyun, Vamsidhar Thummala, and Shivnath Babu. "Tuning database configuration parameters with iTuned." Proceedings of the VLDB Endowment 2.1 (2009): 1246-1257.
Babu, S., Borisov, N., Duan, S., Herodotou, H., & Thummala, V. (2009, May). Automated Experiment-Driven Management of (Database) Systems. In HotOS.
Herodotou, H., Lim, H., Luo, G., Borisov, N., Dong, L., Cetin, F. B., & Babu, S. (2011, January). Starfish: A Self-tuning System for Big Data Analytics. In CIDR (Vol. 11, No. 2011, pp. 261-272).
Herodotou, Herodotos, and Shivnath Babu. "Profiling, what-if analysis, and cost-based optimization of mapreduce programs." Proceedings of the VLDB Endowment 4.11 (2011): 1111-1122.
Shi, J., Zou, J., Lu, J., Cao, Z., Li, S., & Wang, C. (2014). MRTuner: a toolkit to enable holistic optimization for mapreduce jobs. Proceedings of the VLDB Endowment, 7(13), 1319-1330.