Restoring balance: principled under/oversampling of data for optimal classification

Emanuele Loffredo, Mauro Pastore, Simona Cocco, Rémi Monasson

Research output: Contribution to journalConference articlepeer-review

Abstract

Class imbalance in real-world data poses a common bottleneck for machine learning tasks, since achieving good generalization on underrepresented examples is often challenging. Mitigation strategies, such as under or oversampling the data depending on their abundances, are routinely proposed and tested empirically, but how they should adapt to the data statistics remains poorly understood. In this work, we determine exact analytical expressions of the generalization curves in the high-dimensional regime for linear classifiers (Support Vector Machines). We also provide a sharp prediction of the effects of under/oversampling strategies depending on class imbalance, first and second moments of the data, and the metrics of performance considered. We show that mixed strategies involving under and oversampling of data lead to performance improvement. Through numerical experiments, we show the relevance of our theoretical predictions on real datasets, on deeper architectures and with sampling strategies based on unsupervised probabilistic models.

Original languageEnglish
Pages (from-to)32643-32670
Number of pages28
JournalProceedings of Machine Learning Research
Volume235
Publication statusPublished - 1 Jan 2024
Externally publishedYes
Event41st International Conference on Machine Learning, ICML 2024 - Vienna, Austria
Duration: 21 Jul 202427 Jul 2024

Fingerprint

Dive into the research topics of 'Restoring balance: principled under/oversampling of data for optimal classification'. Together they form a unique fingerprint.

Cite this