Schema inference for massive JSON datasets

Mohamed Amine Baazizi, Houssem Ben Lahmar, Dario Colazzo, Giorgio Ghelli, Carlo Sartiani

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

In recent years, JSON has established itself as a very popular data format for representing massive data collections. JSON data collections are usually schemaless. While this brings several advantages, the absence of schema information has important negative consequences: the correctness of complex queries and programs cannot be statically checked, users cannot rely on schema information to quickly figure out the structural properties that could speed up the formulation of correct queries, and many schema-based optimizations are not possible. In this paper we deal with the problem of inferring a schema from massive JSON datasets. We first identify a JSON type language which is simple and, at the same time, expressive enough to capture irregularities and to give complete structural information about input data. We then present our main contribution, which is the design of a schema inference algorithm, its theoretical study, and its implementation based on Spark, enabling reasonable schema inference time for massive collections. Finally, we report on an experimental analysis showing the effectiveness of our approach in terms of execution time, precision and conciseness of inferred schemas, and scalability.
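
To make the map-reduce shape of such an inference concrete, here is a minimal Python sketch. It is not the authors' algorithm or type language: the type encoding (`"Num"`, `("Record", ...)`, `("Union", ...)`), the function names `infer` and `fuse`, and the sample `docs` are all assumptions made for illustration. The map step infers a type for each JSON value, and the reduce step fuses two types with an operation that does not depend on the order of its inputs, which is what would let it be distributed over partitions.

```python
from functools import reduce

# Hypothetical type encoding (illustrative only, not the paper's language):
#   basic types: "Null", "Bool", "Num", "Str"
#   ("Array", elem_type)                  elem_type is None for an empty array
#   ("Record", {field: (type, optional)})
#   ("Union", [types])                    at most one member per kind


def infer(value):
    """Map step: infer a type for a single parsed JSON value."""
    if value is None:
        return "Null"
    if isinstance(value, bool):  # checked before int: bool is a subclass of int
        return "Bool"
    if isinstance(value, (int, float)):
        return "Num"
    if isinstance(value, str):
        return "Str"
    if isinstance(value, list):
        # Fuse all element types into a single element type.
        return ("Array", reduce(fuse, (infer(v) for v in value), None))
    if isinstance(value, dict):
        return ("Record", {k: (infer(v), False) for k, v in value.items()})
    raise TypeError(f"not a JSON value: {value!r}")


def kind(t):
    """Kind of a non-union type: its name or its constructor tag."""
    return t if isinstance(t, str) else t[0]


def members(t):
    """Flatten a type into its list of non-union members."""
    if t is None:
        return []
    if isinstance(t, tuple) and t[0] == "Union":
        return list(t[1])
    return [t]


def fuse_same_kind(a, b):
    """Fuse two types that share the same kind."""
    if kind(a) == "Record":
        fa, fb = a[1], b[1]
        fields = {}
        for name in sorted(fa.keys() | fb.keys()):
            if name in fa and name in fb:
                (ta, oa), (tb, ob) = fa[name], fb[name]
                fields[name] = (fuse(ta, tb), oa or ob)
            else:
                # A field missing on one side becomes optional.
                t, _ = fa[name] if name in fa else fb[name]
                fields[name] = (t, True)
        return ("Record", fields)
    if kind(a) == "Array":
        return ("Array", fuse(a[1], b[1]))
    return a  # identical basic types


def fuse(a, b):
    """Reduce step: merge two types, keeping one member per kind."""
    by_kind = {}
    for t in members(a) + members(b):
        k = kind(t)
        by_kind[k] = fuse_same_kind(by_kind[k], t) if k in by_kind else t
    merged = [by_kind[k] for k in sorted(by_kind)]
    if not merged:
        return None
    return merged[0] if len(merged) == 1 else ("Union", merged)


# Made-up sample collection with irregular structure.
docs = [{"a": 1, "b": "x"}, {"a": "y"}, {"a": None, "c": [1, "z"]}]
schema = reduce(fuse, map(infer, docs))
```

With PySpark, the same two functions could plausibly be applied as `rdd.map(infer).reduce(fuse)` over an RDD of parsed documents; that wiring is an assumption here, not a description of the paper's implementation.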

Original language: English
Title of host publication: Advances in Database Technology - EDBT 2017
Subtitle of host publication: 20th International Conference on Extending Database Technology, Proceedings
Editors: Bernhard Mitschang, Volker Markl, Sebastian Bress, Periklis Andritsos, Kai-Uwe Sattler, Salvatore Orlando
Publisher: OpenProceedings.org
Pages: 222-233
Number of pages: 12
ISBN (Electronic): 9783893180738
Publication status: Published - 1 Jan 2017
Externally published: Yes
Event: 20th International Conference on Extending Database Technology, EDBT 2017 - Venice, Italy
Duration: 21 Mar 2017 - 24 Mar 2017

Publication series

Name: Advances in Database Technology - EDBT
Volume: 2017-March
ISSN (Electronic): 2367-2005

Conference

Conference: 20th International Conference on Extending Database Technology, EDBT 2017
Country/Territory: Italy
City: Venice
Period: 21/03/17 - 24/03/17

Keywords

  • Big data collections
  • JSON
  • Map-reduce
  • Schema inference
  • Spark
