A Sketch-Based Naive Bayes Algorithms for Evolving Data Streams

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

A well-known learning task in big data stream mining is classification. Extensively studied in the offline setting, in the streaming setting - where data are evolving and even infinite - it is still a challenge. In the offline setting, training needs to store all the data in memory for the learning task; yet, in the streaming setting, this is impossible to do due to the massive amount of data that is generated in real-time. To cope with these resource issues, this paper proposes and analyzes several evolving naive Bayes classification algorithms, based on the well-known count-min sketch, in order to minimize the space needed to store the training data. The proposed algorithms also adapt concept drift approaches, such as ADWIN, to deal with the fact that streaming data may be evolving and change over time. However, handling sparse, very high-dimensional data in such framework is highly challenging. Therefore, we include the hashing trick, a technique for dimensionality reduction, to compress that down to a lower dimensional space, which leads to a large memory saving.We give a theoretical analysis which demonstrates that our proposed algorithms provide a similar accuracy quality to the classical big data stream mining algorithms using a reasonable amount of resources. We validate these theoretical results by an extensive evaluation on both synthetic and real-world datasets.

Original languageEnglish
Title of host publicationProceedings - 2018 IEEE International Conference on Big Data, Big Data 2018
EditorsNaoki Abe, Huan Liu, Calton Pu, Xiaohua Hu, Nesreen Ahmed, Mu Qiao, Yang Song, Donald Kossmann, Bing Liu, Kisung Lee, Jiliang Tang, Jingrui He, Jeffrey Saltz
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages604-613
Number of pages10
ISBN (Electronic)9781538650356
DOIs
Publication statusPublished - 2 Jul 2018
Externally publishedYes
Event2018 IEEE International Conference on Big Data, Big Data 2018 - Seattle, United States
Duration: 10 Dec 201813 Dec 2018

Publication series

NameProceedings - 2018 IEEE International Conference on Big Data, Big Data 2018

Conference

Conference2018 IEEE International Conference on Big Data, Big Data 2018
Country/TerritoryUnited States
CitySeattle
Period10/12/1813/12/18

Keywords

  • Concept drift
  • Count-min sketch
  • Data stream classification
  • Hashing trick
  • Naive Bayes

Fingerprint

Dive into the research topics of 'A Sketch-Based Naive Bayes Algorithms for Evolving Data Streams'. Together they form a unique fingerprint.

Cite this