TY - JOUR
T1 - A bipartite ranking approach to the two-sample problem
AU - Clémençon, Stephan
AU - Limnios, Myrto
AU - Vayatis, Nicolas
N1 - Publisher Copyright:
© 2025, Institute of Mathematical Statistics. All rights reserved.
PY - 2025/1/1
Y1 - 2025/1/1
N2 - The two-sample problem consists in testing whether two independent samples are drawn from the same (unknown) distribution. Its study in high-dimension is the subject of much attention, especially because the information acquisition processes at work in the Big Data era often involve various poorly controlled sources, leading to datasets possibly exhibiting strong sampling bias. While the efficiency of classic methods relying on computing a discrepancy measure between the empirical distributions of each sample, is negatively impacted by increasing dimensionality, we develop a two-step approach based on statistical learning and an extension of rank tests. By dividing the initial samples in two, a bipartite ranking algorithm first learns a real-valued scoring function inducing a preorder on the multivariate space. Then, a rank statistic based on the scores of the remaining observations, tests for differences in distribution. Because the ranking algorithm learns how to map the data onto the real line as the likelihood ratio between the original multivariate distributions, the approach resists to large dimensions (ignoring ranking model bias issues) and preserves the advantages of univariate rank tests.We prove nonasymptotic error bounds based on recent results for two-sample linear rank-processes, and experimentally show how the promoted approach surpasses state-of-the-art methods. MSC2020 subject classifications: Primary 62G10, 68Q32; secondary 60E15, 62C12.
AB - The two-sample problem consists in testing whether two independent samples are drawn from the same (unknown) distribution. Its study in high-dimension is the subject of much attention, especially because the information acquisition processes at work in the Big Data era often involve various poorly controlled sources, leading to datasets possibly exhibiting strong sampling bias. While the efficiency of classic methods relying on computing a discrepancy measure between the empirical distributions of each sample, is negatively impacted by increasing dimensionality, we develop a two-step approach based on statistical learning and an extension of rank tests. By dividing the initial samples in two, a bipartite ranking algorithm first learns a real-valued scoring function inducing a preorder on the multivariate space. Then, a rank statistic based on the scores of the remaining observations, tests for differences in distribution. Because the ranking algorithm learns how to map the data onto the real line as the likelihood ratio between the original multivariate distributions, the approach resists to large dimensions (ignoring ranking model bias issues) and preserves the advantages of univariate rank tests.We prove nonasymptotic error bounds based on recent results for two-sample linear rank-processes, and experimentally show how the promoted approach surpasses state-of-the-art methods. MSC2020 subject classifications: Primary 62G10, 68Q32; secondary 60E15, 62C12.
KW - Bipartite ranking
KW - nonparametric statistical hypothesis testing
KW - two-sample linear rank statistic/process
KW - two-sample problem
UR - https://www.scopus.com/pages/publications/105008376181
U2 - 10.1214/25-EJS2392
DO - 10.1214/25-EJS2392
M3 - Article
AN - SCOPUS:105008376181
SN - 1935-7524
VL - 19
SP - 2733
EP - 2779
JO - Electronic Journal of Statistics
JF - Electronic Journal of Statistics
IS - 1
ER -