Learning and Data Selection in Big Datasets

  • Hossein S. Ghadikolaei
  • Hadi Ghauch
  • Carlo Fischione
  • Mikael Skoglund

Research output: Contribution to journal › Conference article › peer-review

Abstract

Finding a dataset of minimal cardinality that characterizes the optimal parameters of a model is of paramount importance in machine learning and in distributed optimization over a network. This paper investigates the compressibility of large datasets. More specifically, we propose a framework that jointly learns the input-output mapping and the most representative samples of the dataset (the sufficient dataset). Our analytical results show that the cardinality of the sufficient dataset increases sub-linearly with the original dataset size. Numerical evaluations on real datasets reveal large compressibility, up to 95%, without a noticeable drop in learnability, as measured by the generalization error.
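The idea of a sufficient dataset can be illustrated with a simple greedy data-selection sketch. The code below is not the paper's algorithm; it is an assumed illustrative procedure on synthetic linear-regression data: starting from a handful of samples, it repeatedly refits a model on the selected subset and adds the sample the current model predicts worst, so that a small selected set approaches the full dataset's generalization error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data standing in for a large dataset (illustrative only).
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def fit(X_sub, y_sub):
    # Least-squares fit on the currently selected subset.
    w, *_ = np.linalg.lstsq(X_sub, y_sub, rcond=None)
    return w

# Greedy selection: seed with d random samples, then repeatedly add the
# sample with the largest residual under the current model.
selected = list(rng.choice(n, size=d, replace=False))
for _ in range(50):
    w = fit(X[selected], y[selected])
    residuals = np.abs(X @ w - y)
    residuals[selected] = -np.inf  # do not re-select chosen samples
    selected.append(int(np.argmax(residuals)))

w = fit(X[selected], y[selected])
gen_error = np.mean((X @ w - y) ** 2)  # error over the full dataset
print(f"selected {len(selected)} of {n} samples, full-data MSE = {gen_error:.4f}")
```

With 55 of 1000 samples (about 95% compression), the subset model's error over the full dataset is already close to the noise floor, mirroring the compressibility the abstract reports.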

Original language: English
Pages (from-to): 2191-2200
Number of pages: 10
Journal: Proceedings of Machine Learning Research
Volume: 97
Publication status: Published - 1 Jan 2019
Event: 36th International Conference on Machine Learning, ICML 2019 - Long Beach, United States
Duration: 9 Jun 2019 - 15 Jun 2019
