Finding near neighbors through cluster pruning

Flavio Chierichetti, Alessandro Panconesi, Prabhakar Raghavan, Mauro Sozio, Alessandro Tiberi, Eli Upfal

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

Finding near(est) neighbors is a classic, difficult problem in data management and retrieval, with applications in text and image search,in finding similar objects and matching patterns. Here we study cluster pruning, an extremely simple randomized technique. During preprocessing we randomly choose a subset of data points to be leaders the remaining data points are partitioned by which leader is the closest. For query processing, we find the leader(s) closest to the query point. We then seek the nearest neighbors for the query point among only the points in the clusters of the closest leader(s). Recursion may be used in both preprocessing and in search. Such schemes seek approximate nearest neighbors that are "almost as good" as the nearest neighbors. How good are these approximations and how much do they save in computation. Our contributions are: (1) we quantify metrics that allow us to study the tradeoff between processing and the quality of the approximate nearest neighbors; (2) we give rigorous theoretical analysis of our schemes, under natural generative processes (generalizing Gaussian mixtures) for the data points; (3) experiments on both synthetic data from such generative processes, as well as on from a document corpus, confirming that we save orders of magnitude in query processing cost at modest compromises in the quality of retrieved points. In particular, we show that p-spheres, a state-of-the-art solution, is outperformed by our simple scheme whether the data points are stored in main or in external memory.

Original languageEnglish
Title of host publicationProceedings of the Twenty-sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2007
PublisherAssociation for Computing Machinery
Pages103-112
Number of pages10
ISBN (Print)1595936858, 9781595936851
DOIs
Publication statusPublished - 11 Jun 2007
Externally publishedYes
Event26th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2007 - Beijing, China
Duration: 11 Jun 200713 Jun 2007

Publication series

NameProceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems
ISSN (Print)1055-6338

Conference

Conference26th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2007
Country/TerritoryChina
CityBeijing
Period11/06/0713/06/07

Keywords

  • Clustering
  • Generative model
  • Nearest neighbor

Fingerprint

Dive into the research topics of 'Finding near neighbors through cluster pruning'. Together they form a unique fingerprint.

Cite this