Fast simultaneous clustering and feature selection for binary data

Charlotte Laclau, Mohamed Nadif

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

This paper addresses the problem of clustering binary data with feature selection within the context of maximum likelihood (ML) and classification maximum likelihood (CML) approaches. In order to efficiently perform the clustering with feature selection, we propose the use of an appropriate Bernoulli model. We derive two algorithms: Expectation-Maximization (EM) and Classification EM (CEM) with feature selection. Without requiring a knowledge of the number of clusters, both algorithms optimize two approximations of the minimum message length (MML) criterion. To exploit the advantages of EM for clustering and of CEM for fast convergence, we combine the two algorithms. With Monte Carlo simulations and by varying parameters of the model, we rigorously validate the approach. We also illustrate our contribution using real datasets commonly used in document clustering.
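As an illustration of the EM and CEM steps the abstract refers to, the following is a minimal sketch of a plain Bernoulli mixture fitted to binary rows. It is not the authors' algorithm: it omits feature selection and the MML criterion, and it assumes a fixed number of clusters `k`. The function name `fit_bernoulli_mixture` and all of its parameters are hypothetical, chosen only for this example.

```python
import math
import random


def fit_bernoulli_mixture(X, k, n_iter=50, hard=False, seed=0, eps=1e-6):
    """Fit a mixture of multivariate Bernoullis to binary rows.

    hard=False runs EM (soft posteriors); hard=True runs a CEM-style
    variant that hardens the posteriors before the M-step.
    Illustrative only: no feature selection, no MML, fixed k.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    weights = [1.0 / k] * k
    # Random initial Bernoulli parameters, kept away from 0 and 1.
    probs = [[rng.uniform(0.25, 0.75) for _ in range(d)] for _ in range(k)]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior cluster probabilities per row, in the log domain.
        resp = []
        for x in X:
            logp = []
            for c in range(k):
                ll = math.log(weights[c])
                for v, p in zip(x, probs[c]):
                    ll += math.log(p if v else 1.0 - p)
                logp.append(ll)
            top = max(logp)
            post = [math.exp(l - top) for l in logp]
            total = sum(post)
            post = [p / total for p in post]
            if hard:
                # C-step: replace the posteriors by a hard assignment.
                best = post.index(max(post))
                post = [1.0 if c == best else 0.0 for c in range(k)]
            resp.append(post)
        # M-step: update mixing weights and per-feature Bernoulli parameters.
        for c in range(k):
            nc = sum(r[c] for r in resp)
            weights[c] = max(nc / n, eps)
            if nc > 0:  # an empty cluster keeps its previous parameters
                for j in range(d):
                    pj = sum(r[c] * x[j] for r, x in zip(resp, X)) / nc
                    probs[c][j] = min(max(pj, eps), 1.0 - eps)
        norm = sum(weights)
        weights = [w / norm for w in weights]
    # Final labels: most probable cluster under the last computed posteriors.
    labels = [r.index(max(r)) for r in resp]
    return weights, probs, labels
```

Calling `fit_bernoulli_mixture(X, k=2)` on a list of 0/1 rows returns the mixing weights, the per-cluster Bernoulli parameters, and one label per row; passing `hard=True` switches to the classification variant. In this sketch the hard variant can leave a cluster empty after a poor initialization, which is one reason a soft EM phase may be run first.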

Original language: English
Title of host publication: Advances in Intelligent Data Analysis XIII - 13th International Symposium, IDA 2014, Proceedings
Editors: Hendrik Blockeel, Matthijs van Leeuwen, Veronica Vinciotti
Publisher: Springer Verlag
Pages: 192-202
Number of pages: 11
ISBN (Electronic): 9783319125701
Publication status: Published - 1 Jan 2014
Externally published: Yes
Event: PAKDD 2006 International Workshop on Knowledge Discovery in Life Science Literature, KDLL 2006 - Singapore, Singapore
Duration: 9 Apr 2006 → 9 Apr 2006

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 8819
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: PAKDD 2006 International Workshop on Knowledge Discovery in Life Science Literature, KDLL 2006
Country/Territory: Singapore
City: Singapore
Period: 9/04/06 → 9/04/06
