TY - GEN
T1 - Incremental ensemble classifier addressing non-stationary fast data streams
AU - Parker, Brandon S.
AU - Khan, Latifur
AU - Bifet, Albert
N1 - Publisher Copyright: © 2014 IEEE.
PY - 2015/1/26
Y1 - 2015/1/26
N2 - Classification of data points in a data stream poses a fundamentally different set of challenges than data mining on static data. While streaming data is often placed into the context of 'Big Data' (or more specifically 'Fast Data') wherein one-pass algorithms are used, true data streams offer additional hurdles due to their dynamic, evolving, and non-stationary nature. During the stream, the available labels (or concepts) often change, and a concept's definition in the feature space can also evolve (or drift) over time. The core issue is that the hidden generative function of the data is not a constant function, but rather evolves over time. This is known as a non-stationary distribution. In this paper, we describe a new approach to using ensembles for stream classification. While the core method is straightforward, it is specifically designed to adapt quickly with very little overhead to the dynamic and evolving nature of data streams generated from non-stationary functions. Our method, M3, is based on a weighted majority ensemble of heterogeneous model types where model weights are updated on-line using Reinforcement Learning techniques. We compare our method with current leading algorithms as implemented in the Massive Online Analysis (MOA) framework using UCI benchmark and synthetic stream generator data sets, and find that our method shows particularly strong gains over the baseline method when ground truth is of limited availability to the classifiers.
KW - Big Data
KW - Fast Data
KW - Stream mining
KW - classifier
KW - non-stationary distribution
UR - https://www.scopus.com/pages/publications/84936868559
U2 - 10.1109/ICDMW.2014.116
DO - 10.1109/ICDMW.2014.116
M3 - Conference contribution
AN - SCOPUS:84936868559
T3 - IEEE International Conference on Data Mining Workshops, ICDMW
SP - 716
EP - 723
BT - Proceedings - 14th IEEE International Conference on Data Mining Workshops, ICDMW 2014
A2 - Zhou, Zhi-Hua
A2 - Wang, Wei
A2 - Kumar, Ravi
A2 - Toivonen, Hannu
A2 - Pei, Jian
A2 - Huang, Joshua Zhexue
A2 - Wu, Xindong
PB - IEEE Computer Society
T2 - 14th IEEE International Conference on Data Mining Workshops, ICDMW 2014
Y2 - 14 December 2014
ER -