A bipartite ranking approach to the two-sample problem

Stephan Clémençon, Myrto Limnios, Nicolas Vayatis

Research output: Contribution to journalArticlepeer-review

Abstract

The two-sample problem consists in testing whether two independent samples are drawn from the same (unknown) distribution. Its study in high-dimension is the subject of much attention, especially because the information acquisition processes at work in the Big Data era often involve various poorly controlled sources, leading to datasets possibly exhibiting strong sampling bias. While the efficiency of classic methods relying on computing a discrepancy measure between the empirical distributions of each sample, is negatively impacted by increasing dimensionality, we develop a two-step approach based on statistical learning and an extension of rank tests. By dividing the initial samples in two, a bipartite ranking algorithm first learns a real-valued scoring function inducing a preorder on the multivariate space. Then, a rank statistic based on the scores of the remaining observations, tests for differences in distribution. Because the ranking algorithm learns how to map the data onto the real line as the likelihood ratio between the original multivariate distributions, the approach resists to large dimensions (ignoring ranking model bias issues) and preserves the advantages of univariate rank tests.We prove nonasymptotic error bounds based on recent results for two-sample linear rank-processes, and experimentally show how the promoted approach surpasses state-of-the-art methods. MSC2020 subject classifications: Primary 62G10, 68Q32; secondary 60E15, 62C12.

Original languageEnglish
Pages (from-to)2733-2779
Number of pages47
JournalElectronic Journal of Statistics
Volume19
Issue number1
DOIs
Publication statusPublished - 1 Jan 2025

Keywords

  • Bipartite ranking
  • nonparametric statistical hypothesis testing
  • two-sample linear rank statistic/process
  • two-sample problem

Fingerprint

Dive into the research topics of 'A bipartite ranking approach to the two-sample problem'. Together they form a unique fingerprint.

Cite this