Under-Sampling Approaches for Improving Prediction of the Minority Class in an Imbalanced Dataset

Abstract

Training data is the most important factor in classification accuracy. However, data in real-world applications often have an imbalanced class distribution: most of the data belong to the majority class and only a small portion belong to the minority class. If all of the data are used for training in this case, the classifier tends to predict that most incoming data belong to the majority class. It is therefore important to select suitable training data for classification under an imbalanced class distribution. In this paper, we propose cluster-based under-sampling approaches that select representative data as training data in order to improve classification accuracy for the minority class. The experimental results show that our cluster-based under-sampling approaches outperform the other under-sampling techniques from previous studies.
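
As a rough illustration of the general idea of cluster-based under-sampling, the sketch below clusters the majority class and keeps one real data point per cluster as its representative. This is only one simple variant and is not taken from the chapter: the abstract does not specify the selection rule, so the use of k-means, the farthest-point initialization, the nearest-to-centroid choice, and all function names (`sq_dist`, `init_centroids`, `kmeans`, `cluster_undersample`) are assumptions made for this example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centroids(points, k):
    """Deterministic farthest-point start (an assumption of this sketch)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from its closest existing centroid.
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; returns the final centroids."""
    centroids = init_centroids(points, k)
    for _ in range(iters):
        # Assignment step: index of the nearest centroid for every point.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids


def cluster_undersample(majority, n_keep, iters=20):
    """Keep at most n_keep majority points, one real point per cluster."""
    kept = []
    for centroid in kmeans(majority, n_keep, iters):
        nearest = min(majority, key=lambda p: sq_dist(p, centroid))
        if nearest not in kept:  # two centroids may share a nearest point
            kept.append(nearest)
    return kept


# Toy data (hypothetical): six majority points in two groups, two minority points.
majority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
minority = [(2.0, 2.0), (2.1, 2.0)]

# Reduce the majority class to the size of the minority class, then combine.
balanced = cluster_undersample(majority, len(minority)) + minority
```

Setting the number of clusters to the size of the minority class, as done here, yields a roughly balanced training set while still drawing majority examples from each region of the feature space, rather than discarding them at random.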

Author information

Authors and Affiliations

Department of Computer Science and Information Engineering, Ming Chuan University, 5 The-Ming Rd., Gwei Shan District, Taoyuan County, 333, Taiwan
Show-Jane Yen & Yue-Shi Lee

Editor information

Editors and Affiliations

Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei, Anhui, China
De-Shuang Huang
Queen’s University, Belfast, UK
Kang Li & George William Irwin

About this chapter

Cite this chapter

Yen, SJ., Lee, YS. (2006). Under-Sampling Approaches for Improving Prediction of the Minority Class in an Imbalanced Dataset. In: Huang, DS., Li, K., Irwin, G.W. (eds) Intelligent Control and Automation. Lecture Notes in Control and Information Sciences, vol 344. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-37256-1_89

DOI: https://doi.org/10.1007/978-3-540-37256-1_89
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-37255-4
Online ISBN: 978-3-540-37256-1
eBook Packages: Engineering, Engineering (R0)