TY - GEN
T1 - An active learning method based on mistake sampling for large scale imbalanced classification
AU - Guo, Jia
AU - Wan, Xin
AU - Lin, Hao
AU - Li, Peng
AU - Liu, Guannan
AU - He, Yueying
N1 - Publisher Copyright: © 2017 IEEE.
PY - 2017/7/28
Y1 - 2017/7/28
AB - The challenge of learning from large-scale, imbalanced data sets has attracted a great deal of attention from both industry and academia, and it is an important task for fraud detection in telecommunications, finance, and online commerce. In general, it is almost impossible to train a classification model on the complete data set, especially in the era of big data, because of the space and time complexity. Thus, how to sample a training set from the original large-scale set so that it yields more accurate predictions has become a focal point of study. Active learning provides a way to iteratively add a small batch of data to the initial training set, so that the training set is augmented with informative samples. However, when dealing with extremely imbalanced data, active learning methods can be ineffective. To that end, in this paper we propose a novel method, based on active learning, for sampling the training set in order to solve the large-scale imbalanced learning problem. Moreover, we use SMOTE, one of the most widely used resampling methods, to balance the training set. The experiment was conducted on real-world data from the telecommunications industry. The results show that our proposed solution performs steadily and better than widely used active learning methods.
KW - Active learning
KW - Fraud detection
KW - Imbalanced classification
KW - Resampling
UR - https://www.scopus.com/pages/publications/85028624657
U2 - 10.1109/ICSSSM.2017.7996301
DO - 10.1109/ICSSSM.2017.7996301
M3 - Conference contribution
AN - SCOPUS:85028624657
T3 - 14th International Conference on Services Systems and Services Management, ICSSSM 2017 - Proceedings
BT - 14th International Conference on Services Systems and Services Management, ICSSSM 2017 - Proceedings
A2 - Cai, Xiaoqiang
A2 - Tang, Jiafu
A2 - Chen, Jian
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 14th International Conference on Services Systems and Services Management, ICSSSM 2017
Y2 - 16 June 2017 through 18 June 2017
ER -