TY - GEN
T1 - Counting types for massive JSON datasets
AU - Baazizi, Mohamed Amine
AU - Colazzo, Dario
AU - Ghelli, Giorgio
AU - Sartiani, Carlo
N1 - Publisher Copyright: © 2017 Copyright held by the owner/author(s).
PY - 2017/9/1
Y1 - 2017/9/1
N2 - Type systems express structural information about data, are human readable and hence crucial for understanding code, and are endowed with a formal definition that makes them a fundamental tool when proving program properties. Internal data structures of a database store quantitative information about data, information that is essential for optimization purposes, but is not used for documentation or for correctness proofs. In this paper we propose a new idea: raising a part of the quantitative information from the system-level structures to the type level. Our proposal is motivated by the problem of schema inference for massive collections of JSON data, which are nowadays often collected from external sources and stored in NoSQL systems without an a-priori schema, which makes a-posteriori schema inference extremely useful. NoSQL systems are oriented towards the management of heterogeneous data, and in this context we claim that quantitative information is important in order to assess the relative weight of different variants. We propose a type system where the same collection can be described at different levels of abstraction. Different abstraction levels are useful for different purposes, hence we describe a parametric inference mechanism, where a single parameter specifies the chosen trade-off between succinctness and precision for the inferred type. This algorithm is designed for massive JSON collections, and hence admits a simple and efficient map-reduce implementation.
KW - Descriptive schemas
KW - JSON
KW - Mapreduce
KW - Schema inference
KW - Type systems
U2 - 10.1145/3122831.3122837
DO - 10.1145/3122831.3122837
M3 - Conference contribution
AN - SCOPUS:85030534708
T3 - ACM International Conference Proceeding Series
BT - Proceedings of the 16th International Symposium on Database Programming Languages, DBPL 2017; Held in conjunction with VLDB 2017
PB - Association for Computing Machinery
T2 - 16th International Symposium on Database Programming Languages, DBPL 2017
Y2 - 1 September 2017
ER -