IEEE Big Data 2020 tutorial

See a demo of UniBench as follows:


Big data system benchmarking enables practitioners and developers to assess the functionality and performance of systems so that they can make informed decisions when choosing suitable big data systems or improving existing ones. As various benchmarks for big data systems emerge and evolve, in the form of either macro-benchmarks or micro-benchmarks, it is crucial to thoroughly study, analyze, and understand their key techniques and applications. In this tutorial, we offer a comprehensive presentation of a wide range of state-of-the-art benchmarks with a focus on big data systems. We classify these benchmarks into five categories: MapReduce-based system benchmarking, SQL-based analytical system benchmarking, NoSQL-based database benchmarking, big graph system benchmarking, and multi-model database benchmarking. We discuss the key techniques of each approach, as well as current practices. We also provide insights into the research challenges and directions for benchmarking different big data systems.

Related References:

Bajaber, Fuad, et al. "Benchmarking big data systems: A survey." Computer Communications 149 (2020): 241-251.

Rui Han, Lizy Kurian John, and Jianfeng Zhan. "Benchmarking big data systems: A review." IEEE Transactions on Services Computing 11, no. 3 (2017): 580-597.

Todor Ivanov, Tilmann Rabl, Meikel Poess, Anna Queralt, John Poelman, Nicolas Poggi, and Jeffrey Buell. "Big data benchmark compendium." In Technology Conference on Performance Evaluation and Benchmarking, pp. 135-155. Springer, Cham, 2015.

Baru, Chaitan, and Tilmann Rabl. "Tutorial 4: Big Data Benchmarking." In 2014 IEEE International Conference on Big Data. 2014.

Markus Dreseler, Martin Boissier, Tilmann Rabl, and Matthias Uflacker. "Quantifying TPC-H choke points and their optimizations." Proceedings of the VLDB Endowment 13, no. 8 (2020): 1206-1220.

Peter Boncz, Thomas Neumann, and Orri Erling. "TPC-H analyzed: Hidden messages and lessons learned from an influential benchmark." In Technology Conference on Performance Evaluation and Benchmarking, pp. 61-76. Springer, Cham, 2013.

Meikel Poess, Tilmann Rabl, and Hans-Arno Jacobsen. "Analysis of TPC-DS: the first standard benchmark for SQL-based big data systems." In Proceedings of the 2017 Symposium on Cloud Computing, pp. 573-585. 2017.

Ahmad Ghazal, Tilmann Rabl, Minqing Hu, Francois Raab, Meikel Poess, Alain Crolotte, and Hans-Arno Jacobsen. "BigBench: towards an industry standard benchmark for big data analytics." In Proceedings of the 2013 ACM SIGMOD international conference on Management of data, pp. 1197-1208. 2013.

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, and Michael Stonebraker. "A comparison of approaches to large-scale data analysis." In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pp. 165-178. 2009.

Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik Paulson, Andrew Pavlo, and Alexander Rasin. "MapReduce and parallel DBMSs: friends or foes?." Communications of the ACM 53, no. 1 (2010): 64-71.

Shengsheng Huang, Jie Huang, Jinquan Dai, Tao Xie, and Bo Huang. "The HiBench benchmark suite: Characterization of the MapReduce-based data analysis." In 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010), pp. 41-51. IEEE, 2010.

Shi, Juwei, Yunjie Qiu, Umar Farooq Minhas, Limei Jiao, Chen Wang, Berthold Reinwald, and Fatma Özcan. "Clash of the titans: Mapreduce vs. spark for large scale data analytics." Proceedings of the VLDB Endowment 8, no. 13 (2015): 2110-2121.

Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. "Benchmarking cloud serving systems with YCSB." In Proceedings of the 1st ACM symposium on Cloud computing, pp. 143-154. 2010.

Sanket Chintapalli, Derek Dagit, Bobby Evans, Reza Farivar, Thomas Graves, Mark Holderbaugh, Zhuo Liu et al. "Benchmarking streaming computation engines: Storm, flink and spark streaming." In 2016 IEEE international parallel and distributed processing symposium workshops (IPDPSW), pp. 1789-1792. IEEE, 2016.

Lei Wang, Jianfeng Zhan, Chunjie Luo, Yuqing Zhu, Qiang Yang, Yongqiang He, Wanling Gao et al. "Bigdatabench: A big data benchmark suite from internet services." In 2014 IEEE 20th international symposium on high performance computer architecture (HPCA), pp. 488-499. IEEE, 2014.

Orri Erling, Alex Averbuch, Josep Larriba-Pey, Hassan Chafi, Andrey Gubichev, Arnau Prat, Minh-Duc Pham, and Peter Boncz. "The LDBC social network benchmark: Interactive workload." In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 619-630. 2015.

Alexandru Iosup, Tim Hegeman, Wing Lung Ngai, Stijn Heldens, Arnau Prat-Pérez, Thomas Manhardt, Hassan Chafi et al. "LDBC Graphalytics: A benchmark for large-scale graph analysis on parallel and distributed platforms." Proceedings of the VLDB Endowment 9, no. 13 (2016): 1317-1328.

Meikel Poess, Tilmann Rabl, Hans-Arno Jacobsen, and Brian Caufield. "TPC-DI: the first industry benchmark for data integration." Proceedings of the VLDB Endowment 7, no. 13 (2014): 1367-1378.

Jeyhun Karimov, Tilmann Rabl, and Volker Markl. "PolyBench: The first benchmark for polystores." In Technology Conference on Performance Evaluation and Benchmarking, pp. 24-41. Springer, Cham, 2018.

Jiaheng Lu. "Towards benchmarking multi-model databases." CIDR, 2017.

Chao Zhang. "Parameter curation and data generation for benchmarking multi-model queries." In VLDB 2018 Ph.D. Workshop. 2018.

Chao Zhang, Jiaheng Lu, Pengfei Xu, and Yuxing Chen. "UniBench: A benchmark for multi-model database management systems." In TPCTC '18, Rio de Janeiro, Brazil, August 27-31, 2018, Revised Selected Papers, volume 11135 of Lecture Notes in Computer Science, pp. 7-23. Springer, 2018.

Chao Zhang and Jiaheng Lu. "Holistic evaluation in multi-model databases benchmarking." Distributed and Parallel Databases, pp. 1-33, 2019.

Mark Raasveldt, Pedro Holanda, Tim Gubner, and Hannes Mühleisen. "Fair benchmarking considered difficult: Common pitfalls in database performance testing." In Proceedings of the Workshop on Testing Database Systems, pp. 1-6. 2018.


Jiaheng Lu is a Professor at the University of Helsinki, Finland. His main research interests lie in big data management and database systems. He has published more than one hundred journal and conference papers. He has extensive experience in industrial collaboration with IBM, Microsoft, and Huawei on projects involving NoSQL databases and performance tuning of distributed systems. He has published several books on XML, Hadoop, and NoSQL databases. His book on Hadoop was one of the top-10 best-selling books in the computer software category in China in 2013. He frequently serves as a PC member for conferences including SIGMOD, VLDB, ICDE, EDBT, and CIKM.

Chao Zhang is a senior Ph.D. candidate at the University of Helsinki (UH), Finland. Prior to joining UH, Chao spent one year in Ph.D. studies at Renmin University of China (RUC). His research focuses on multi-model database benchmarking and query optimization. He is the main contributor to the UniBench project, the first benchmark for multi-model databases. He has published five journal and conference papers in the field of databases, with a focus on multi-model database benchmarking.

Benchmarking tutorial at the IEEE Big Data 2020 conference: