TY - GEN
T1 - Schema inference for massive JSON datasets
AU - Baazizi, Mohamed Amine
AU - Lahmar, Houssem Ben
AU - Colazzo, Dario
AU - Ghelli, Giorgio
AU - Sartiani, Carlo
N1 - Publisher Copyright: © 2017, Copyright is with the authors.
PY - 2017/1/1
Y1 - 2017/1/1
N2 - In recent years, JSON has established itself as a very popular data format for representing massive data collections. JSON data collections are usually schemaless. While this offers several advantages, the absence of schema information has important negative consequences: the correctness of complex queries and programs cannot be statically checked, users cannot rely on schema information to quickly figure out the structural properties that could speed up the formulation of correct queries, and many schema-based optimizations are not possible. In this paper we deal with the problem of inferring a schema from massive JSON datasets. We first identify a JSON type language which is simple and, at the same time, expressive enough to capture irregularities and to give complete structural information about input data. We then present our main contribution, which is the design of a schema inference algorithm, its theoretical study, and its implementation based on Spark, enabling reasonable schema inference time for massive collections. Finally, we report on an experimental analysis showing the effectiveness of our approach in terms of execution time, precision and conciseness of inferred schemas, and scalability.
AB - In recent years, JSON has established itself as a very popular data format for representing massive data collections. JSON data collections are usually schemaless. While this offers several advantages, the absence of schema information has important negative consequences: the correctness of complex queries and programs cannot be statically checked, users cannot rely on schema information to quickly figure out the structural properties that could speed up the formulation of correct queries, and many schema-based optimizations are not possible. In this paper we deal with the problem of inferring a schema from massive JSON datasets. We first identify a JSON type language which is simple and, at the same time, expressive enough to capture irregularities and to give complete structural information about input data. We then present our main contribution, which is the design of a schema inference algorithm, its theoretical study, and its implementation based on Spark, enabling reasonable schema inference time for massive collections. Finally, we report on an experimental analysis showing the effectiveness of our approach in terms of execution time, precision and conciseness of inferred schemas, and scalability.
KW - Big data collections
KW - JSON
KW - Map-reduce
KW - Schema inference
KW - Spark
U2 - 10.5441/002/edbt.2017.21
DO - 10.5441/002/edbt.2017.21
M3 - Conference contribution
AN - SCOPUS:85044279455
T3 - Advances in Database Technology - EDBT
SP - 222
EP - 233
BT - Advances in Database Technology - EDBT 2017
A2 - Mitschang, Bernhard
A2 - Markl, Volker
A2 - Bress, Sebastian
A2 - Andritsos, Periklis
A2 - Sattler, Kai-Uwe
A2 - Orlando, Salvatore
PB - OpenProceedings.org
T2 - 20th International Conference on Extending Database Technology, EDBT 2017
Y2 - 21 March 2017 through 24 March 2017
ER -