TY - GEN
T1 - Classifying imbalanced data streams via dynamic feature group weighting with importance sampling
AU - Wu, Ke
AU - Edwards, Andrea
AU - Fan, Wei
AU - Gao, Jing
AU - Zhang, Kun
N1 - Publisher Copyright: © SIAM.
PY - 2014
Y1 - 2014
N2 - Data stream classification and imbalanced data learning are two important areas of data mining research. Each has been well studied to date, with many interesting algorithms developed. However, only a few approaches reported in the literature address the intersection of these two fields, owing to their complex interplay. In this work, we propose an importance-sampling-driven, dynamic feature group weighting framework (DFGW-IS) for classifying data streams with imbalanced distributions. Two components are tightly incorporated into the proposed approach to address the intrinsic characteristics of concept-drifting, imbalanced streaming data. Specifically, the ever-evolving concepts are tackled by a weighted ensemble trained on a set of feature groups, with each sub-classifier (i.e., a single classifier or an ensemble) weighted by its discriminative power and stability level. The uneven class distribution, on the other hand, is handled by the sub-classifier built on a specific feature group, with the underlying distribution rebalanced by the importance sampling technique. We derive a theoretical upper bound on the generalization error of the proposed algorithm. We also study the empirical performance of our method on a set of benchmark synthetic and real-world datasets, where it achieves significant improvement over competing algorithms in terms of standard evaluation metrics and parallel running time. Algorithm implementations and datasets are available upon request.
AB - Data stream classification and imbalanced data learning are two important areas of data mining research. Each has been well studied to date, with many interesting algorithms developed. However, only a few approaches reported in the literature address the intersection of these two fields, owing to their complex interplay. In this work, we propose an importance-sampling-driven, dynamic feature group weighting framework (DFGW-IS) for classifying data streams with imbalanced distributions. Two components are tightly incorporated into the proposed approach to address the intrinsic characteristics of concept-drifting, imbalanced streaming data. Specifically, the ever-evolving concepts are tackled by a weighted ensemble trained on a set of feature groups, with each sub-classifier (i.e., a single classifier or an ensemble) weighted by its discriminative power and stability level. The uneven class distribution, on the other hand, is handled by the sub-classifier built on a specific feature group, with the underlying distribution rebalanced by the importance sampling technique. We derive a theoretical upper bound on the generalization error of the proposed algorithm. We also study the empirical performance of our method on a set of benchmark synthetic and real-world datasets, where it achieves significant improvement over competing algorithms in terms of standard evaluation metrics and parallel running time. Algorithm implementations and datasets are available upon request.
KW - Class imbalance
KW - Data stream classification
KW - Ensemble weighting
KW - Feature group ensemble
KW - Importance sampling
UR - https://www.scopus.com/pages/publications/84936948351
U2 - 10.1137/1.9781611973440.83
DO - 10.1137/1.9781611973440.83
M3 - Conference contribution
AN - SCOPUS:84936948351
T3 - SIAM International Conference on Data Mining 2014, SDM 2014
SP - 722
EP - 730
BT - SIAM International Conference on Data Mining 2014, SDM 2014
A2 - Zaki, Mohammed
A2 - Obradovic, Zoran
A2 - Tan, Pang-Ning
A2 - Banerjee, Arindam
A2 - Kamath, Chandrika
A2 - Parthasarathy, Srinivasan
PB - Society for Industrial and Applied Mathematics Publications
T2 - 14th SIAM International Conference on Data Mining, SDM 2014
Y2 - 24 April 2014 through 26 April 2014
ER -