CIKM 2020 TUTORIAL: Multi-Model Data Query Languages and Processing Paradigms

Abstract:

Specifying users' interests in a formal query language is typically a challenging task, and it becomes even harder in the context of multi-model data management because we have to deal with data variety. Such data usually lack a unified schema to help users issue their queries, or have an incomplete schema because the data come from disparate sources. Multi-model databases (MMDBs) have emerged as a promising approach to this problem, as they are capable of accommodating and querying multi-model data in a single system. This tutorial aims to offer a comprehensive presentation of a wide range of query languages for MMDBs and to compare their properties from multiple perspectives. We will discuss the essence of cross-model query processing and provide insights into the research challenges and directions for future work. The tutorial will also offer the participants hands-on experience in using MMDBs to issue multi-model data queries.

Webpage for the hands-on instructions:

https://version.helsinki.fi/chzhang/cikm-2020-hands-on-session-for-multi-model-queries/-/blob/master/hands-on.ipynb

Outline:

The tutorial is planned for 3 hours and is divided into 6 parts as follows:

Part I: Introduction (15 minutes)

We start the tutorial by introducing data variety and motivating the need for multi-model data management.

1.1 Basics on data variety

1.2 The need for and essence of multi-model data management

Related References:

  • E. F. Codd. A relational model of data for large shared data banks. Commun. ACM, 13(6):377–387, 1970.
  • A. Deutsch and Y. Papakonstantinou. Graph data models, query languages and programming paradigms. Proc. VLDB Endow., 11(12):2106–2109, 2018.
  • J. Lu and I. Holubová. Multi-model data management: What’s new and what’s next? In Proceedings of the 20th International Conference on Extending Database Technology, EDBT 2017, Venice, Italy, March 21-24, 2017, pages 602–605. OpenProceedings.org, 2017.
  • J. Lu and I. Holubová. Multi-model Databases: A new journey to handle the variety of data. ACM Computing Surveys, 52(3), 2019.

Part II: Data models (15 minutes)

We will briefly discuss the major data models adopted by database systems.

2.1 The relational model

2.2 Extensions of the relational model

2.3 The semi-structured data models such as XML and JSON

2.4 The graph data models

Related References:

  • S. Abiteboul and C. Beeri. On The Power Of Languages For The Manipulation Of Complex Objects. Technical Report 846, INRIA, Paris, May 1988.
  • R. Angles, M. Arenas, P. Barceló, P. A. Boncz, G. H. L. Fletcher, C. Gutierrez, T. Lindaaker, M. Paradies, S. Plantikow, J. F. Sequeda, O. van Rest, and H. Voigt. G-CORE: A core for future graph query languages. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10-15, 2018, pages 1421–1432. ACM, 2018.
  • E. F. Codd. Derivability, redundancy and consistency of relations stored in large data banks. Research Report / RJ / IBM / San Jose, California, RJ599, August 1969.
  • E. F. Codd. A relational model of data for large shared data banks. Commun. ACM, 13(6):377–387, 1970.
  • E. F. Codd. Extending the database relational model to capture more meaning. ACM Trans. Database Syst., 4(4):397–434, Dec. 1979.
  • A. Deutsch, Y. Xu, M. Wu, and V. Lee. Tigergraph: A native MPP graph database. CoRR, abs/1901.08248, 2019.
  • O. Hartig and J. Pérez. Semantics and complexity of GraphQL. In Proceedings of the 2018 World Wide Web Conference, WWW ’18, pages 1155–1164, Republic and Canton of Geneva, CHE, 2018. International World Wide Web Conferences Steering Committee.
  • I. Robinson, J. Webber, and E. Eifrem. Graph Databases: New Opportunities for Connected Data. O’Reilly Media, Inc., 2nd edition, 2015.
  • M. A. Rodriguez. The Gremlin Graph Traversal Machine and Language. CoRR, abs/1508.03843, 2015.
  • M. H. Scholl. Extensions to the Relational Data Model. In Conceptual Modelling, Databases and CASE: An Integrated View of Information Systems Development. John Wiley & Sons, 1992.
  • M. H. Scholl, H. Paul, and H. Schek. Supporting flat relations by a nested relational kernel. In VLDB’87, Proceedings of 13th International Conference on Very Large Data Bases, September 1-4, 1987, Brighton, England, pages 137–146. Morgan Kaufmann, 1987.

Part III: Multi-model data query languages (45 minutes)

We will discuss several well-known multi-model data query languages, which fall into three categories. We will also provide an E-commerce dataset (e.g., UniBench) and detailed instructions for the participants to write and run some multi-model queries with ArangoDB, giving them hands-on experience.

3.1 The SQL-extensions

3.2 The XML/JSON-extensions

3.3 The graph-extensions

Related References:

  • S. Alsubaiee, Y. Altowim, H. Altwaijry, A. Behm, V. R. Borkar, Y. Bu, M. J. Carey, I. Cetindil, M. Cheelangi, K. Faraaz, E. Gabrielova, R. Grover, Z. Heilbron, Y. Kim, C. Li, G. Li, J. M. Ok, N. Onose, P. Pirzadeh, V. J. Tsotras, R. Vernica, J. Wen, and T. Westmann. AsterixDB: A scalable, open source BDMS. Proc. VLDB Endow., 7(14):1905–1916, 2014.
  • R. Angles, M. Arenas, P. Barceló, A. Hogan, J. L. Reutter, and D. Vrgoc. Foundations of modern query languages for graph databases. ACM Comput. Surv., 50(5):68:1–68:40, 2017.
  • K. W. Ong, Y. Papakonstantinou, and R. Vernoux. The SQL++ semi-structured data model and query language: A capabilities survey of SQL-on-Hadoop, NoSQL and NewSQL databases. CoRR, abs/1405.3631, 2014.
  • P. T. Wood. Query languages for graph databases. SIGMOD Rec., 41(1):50–60, 2012.

Part IV: Comparison of the multi-model query languages (45 minutes)

We will make a comparative study of the query languages from four perspectives: semantic differences, expressive power, internal representation, and the manner of query evaluation.

4.1 The semantic difference

4.2 The expressive power

4.3 The internal representation

4.4 The manners of query evaluation

Related References:

  • E. F. Codd. A Data Base Sublanguage Founded on the Relational Calculus. In Proceedings of the 1971 ACM SIGFIDET (Now SIGMOD) Workshop on Data Description, Access and Control, SIGFIDET ’71, pages 35–68, New York, NY, USA, 1971. Association for Computing Machinery.
  • E. F. Codd. Relational completeness of data base sublanguages. Research Report / RJ / IBM / San Jose, California, RJ987, 1972.
  • J. Marton, G. Szárnyas, and D. Varró. Formalising openCypher Graph Queries in Relational Algebra. In Advances in Databases and Information Systems - 21st European Conference, ADBIS 2017, Nicosia, Cyprus, September 24-27, 2017, Proceedings, volume 10509 of Lecture Notes in Computer Science, pages 182–196. Springer, 2017.
  • V. Z. Moffitt and J. Stoyanovich. Temporal graph algebra. In Proceedings of The 16th International Symposium on Database Programming Languages, DBPL 2017, Munich, Germany, September 1, 2017, pages 10:1–10:12. ACM, 2017.
  • A. Mokhov. Algebraic graphs with class (functional pearl). In Proceedings of the 10th ACM SIGPLAN International Symposium on Haskell, Oxford, United Kingdom, September 7-8, 2017, pages 2–13. ACM, 2017.
  • M. Negri, G. Pelagatti, and L. Sbattella. Formal Semantics of SQL Queries. ACM Trans. Database Syst., 16(3):513–534, Sept. 1991.
  • K. W. Ong, Y. Papakonstantinou, and R. Vernoux. The SQL++ semi-structured data model and query language: A capabilities survey of SQL-on-Hadoop, NoSQL and NewSQL databases. CoRR, abs/1405.3631, 2014.
  • M. A. Rodriguez. The Gremlin Graph Traversal Machine and Language. CoRR, abs/1508.03843, 2015.
  • H. Thakkar, D. Punjani, S. Auer, and M. Vidal. Towards an integrated graph algebra for graph pattern matching with Gremlin. In Database and Expert Systems Applications - 28th International Conference, DEXA 2017, Lyon, France, August 28-31, 2017, Proceedings, Part I, volume 10438 of Lecture Notes in Computer Science, pages 81–91. Springer, 2017.
  • A. M. Turing. On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, 2(42):230–265, 1936.

Part V: Open problems and challenges (5 minutes)

We then conclude with a discussion of open problems and challenges in designing multi-model data query languages.

5.1 Designing an algebra for a multi-model query language

5.2 General approaches for cross-model query processing

Part VI: Hands-on session with ArangoDB (45 minutes)

We will invite the participants to write and run some multi-model queries by using ArangoDB.

6.1 Generate an E-commerce dataset with UniBench

6.2 Hands-on experience for multi-model queries with ArangoDB

Detailed instructions:

https://version.helsinki.fi/chzhang/cikm-2020-hands-on-session-for-multi-model-queries/-/blob/master/hands-on.ipynb
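
As a rough illustration of what a multi-model query involves, the following Python sketch combines a graph step, a relational lookup, and a traversal of nested documents over in-memory toy data. It is not AQL and does not use ArangoDB; the data, the field names, and the function friends_spending are hypothetical and are not taken from UniBench or from the hands-on notebook.

```python
# Hypothetical toy data for illustration only; NOT the UniBench schema.

# Relational model: customer rows as (id, name, country).
customers = [
    (1, "Alice", "FI"),
    (2, "Bob", "DK"),
    (3, "Carol", "CN"),
]

# Document (JSON) model: one order document per purchase, with nested items.
orders = [
    {"customer_id": 2, "items": [{"product": "laptop", "price": 900.0}]},
    {"customer_id": 3, "items": [{"product": "phone", "price": 500.0},
                                 {"product": "case", "price": 20.0}]},
    {"customer_id": 1, "items": [{"product": "book", "price": 15.0}]},
]

# Graph model: directed "knows" edges between customer ids.
knows = [(1, 2), (1, 3), (2, 3)]


def friends_spending(customer_id):
    """Return sorted (friend name, total spent) pairs for one customer."""
    # Graph step: follow outgoing "knows" edges by one hop.
    friend_ids = {dst for src, dst in knows if src == customer_id}
    # Relational step: look up each friend's name by primary key.
    names = {cid: name for cid, name, _country in customers
             if cid in friend_ids}
    # Document step: sum the nested item prices of each friend's orders.
    totals = {cid: 0.0 for cid in friend_ids}
    for order in orders:
        cid = order["customer_id"]
        if cid in friend_ids:
            totals[cid] += sum(item["price"] for item in order["items"])
    return sorted((names[cid], totals[cid]) for cid in friend_ids)


print(friends_spending(1))  # [('Bob', 900.0), ('Carol', 520.0)]
```

In a multi-model database the three steps would be written as a single declarative query and evaluated by one engine; the sketch only makes the cross-model joins explicit.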

Related References:

  • ArangoDB. https://www.arangodb.com/.
  • ArangoDB Query Language (AQL). https://www.arangodb.com/docs/stable/aql/index.html.
  • C. Zhang and J. Lu. Holistic evaluation in multi-model databases benchmarking. Distributed and Parallel Databases, pages 1–33, 2019.
  • C. Zhang, J. Lu, P. Xu, and Y. Chen. UniBench: A Benchmark for Multi-model Database Management Systems. In TPCTC ’18, Rio de Janeiro, Brazil, August 27-31, 2018, Revised Selected Papers, volume 11135 of Lecture Notes in Computer Science, pages 7–23. Springer, 2018.

Presenters:

Qingsong Guo is a postdoctoral researcher at the University of Helsinki, Finland. He received his Ph.D. degree from the University of Southern Denmark in 2016. His current research interests include multi-model data management and learning to manage big data with deep learning algorithms.

Jiaheng Lu is a professor at the University of Helsinki, Finland. His main research interests lie in big data management and database systems. He has published more than one hundred journal and conference papers, as well as several books on XML, Hadoop, and NoSQL databases. He has given several tutorials on multi-model data management and autonomous databases at the VLDB, CIKM, and EDBT conferences. He frequently serves as a PC member for conferences including SIGMOD, VLDB, ICDE, EDBT, and CIKM.

Chao Zhang is a Ph.D. candidate at the Department of Computer Science, University of Helsinki (UH). His research topic is multi-model database benchmarking and query optimization. Prior to joining UH, Chao spent one year at Renmin University of China (RUC) for Ph.D. studies.

Calvin Sun is Chief Database Architect at Huawei Cloud. He has 20+ years of working experience in the development of several database systems, ranging from embedded databases and large-scale distributed databases to cloud-native databases. Calvin joined the Huawei Toronto Distributed Scheduling and Data Engine Lab in October 2017. Prior to joining Huawei, he was a consulting member of technical staff at Oracle Cloud. He also served as manager of the storage engines team at MySQL Inc., manager of the InnoDB team at Oracle, and manager of MySQL development at Twitter.

Steven Yuan is the director of the Huawei Toronto Distributed Scheduling and Data Engine Lab. He leads a research team of more than 30 people in the big data and cloud domains. More specifically, his lab's research focuses on distributed scheduling, from IaaS to PaaS, and on distributed database as a service. Before joining Huawei in August 2014, Steven was a senior manager with 18 years of working experience on the IBM HPC products LSF and Symphony. Steven is an expert in distributed resource management and scheduling and is an inventor of four U.S. patents in SLA scheduling and job placement. Steven received his Ph.D. from Peking University in 1995. The following year, he did postdoctoral research in large-scale heterogeneous distributed computing under Prof. Songnian Zhou at the University of Toronto.