A statistical view of clustering performance through the theory of U-processes

Research output: Contribution to journal › Article › peer-review

Abstract

Many clustering techniques aim to optimize empirical criteria that take the form of a U-statistic of degree two. Given a measure of dissimilarity between pairs of observations, the goal is to minimize the within-cluster point scatter over a class of partitions of the feature space. The purpose of this paper is to define a general statistical framework, relying on the theory of U-processes, for studying the performance of such clustering methods. In this setup, under adequate assumptions on the complexity of the subsets forming the partition candidates, the excess clustering risk of the empirical minimizer is proved to be of order O_P(1/√n). A lower bound result shows that this rate is optimal in a minimax sense. Based on recent results on the tail behavior of degenerate U-processes, it is also shown how to establish tighter, and even faster, rate bounds under additional assumptions. Model selection issues, in particular the choice of the number of clusters forming the data partition, are also considered. Finally, it is explained how the theoretical results developed here can provide statistical guarantees for empirical clustering aggregation.
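
To make the criterion described in the abstract concrete, the following is a minimal sketch of an empirical within-cluster point scatter written as a degree-two U-statistic: an average, over all pairs of observations, of the dissimilarity between points assigned to the same cell of the partition. The function names, the 2/(n(n-1)) normalization, and the default squared Euclidean dissimilarity are illustrative assumptions, not notation taken from the paper.

```python
from itertools import combinations


def squared_euclidean(x, y):
    """Assumed default dissimilarity: squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def within_cluster_scatter(points, labels, dissimilarity=squared_euclidean):
    """Empirical within-cluster point scatter as a degree-two U-statistic.

    Averages, over all unordered pairs (i, j), the dissimilarity between
    points i and j when both carry the same cluster label (hypothetical
    helper, written for illustration only).
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two observations")
    if len(labels) != n:
        raise ValueError("points and labels must have the same length")

    total = 0.0
    for i, j in combinations(range(n), 2):
        # Only pairs falling in the same cell of the partition contribute.
        if labels[i] == labels[j]:
            total += dissimilarity(points[i], points[j])

    # Normalize by the number of unordered pairs, n(n-1)/2.
    return 2.0 * total / (n * (n - 1))


if __name__ == "__main__":
    pts = [(0.0,), (1.0,), (10.0,), (11.0,)]
    print(within_cluster_scatter(pts, [0, 0, 1, 1]))  # tight clusters
    print(within_cluster_scatter(pts, [0, 1, 0, 1]))  # mixed clusters
```

On this toy data, the partition that groups nearby points yields a much smaller scatter than the one that mixes them, which is the quantity an empirical minimizer would seek to make small over the class of candidate partitions.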

Original language: English
Pages (from-to): 42-56
Number of pages: 15
Journal: Journal of Multivariate Analysis
Volume: 124
Publication status: Published - 1 Feb 2014
Externally published: Yes

Keywords

  • Cluster analysis
  • Empirical risk minimization
  • Fast rates
  • Median clustering
  • Minimax lower bound
  • Pairwise dissimilarity
  • U-process
