Under-Sampling Approaches for Improving Prediction of the Minority Class in an Imbalanced Dataset
Abstract
Training data is the most important factor in classification accuracy. In real-world applications, however, the data often have an imbalanced class distribution: most of the data belong to the majority class and only a little belongs to the minority class. If all of the data are used for training, the classifier tends to predict that most incoming data belong to the majority class. Selecting suitable training data is therefore important for classification under an imbalanced class distribution. In this paper, we propose cluster-based under-sampling approaches that select representative data as training data in order to improve classification accuracy for the minority class. The experimental results show that our cluster-based under-sampling approaches outperform the other under-sampling techniques from previous studies.
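The abstract does not spell out the algorithm, so the following is only a minimal sketch of one plausible reading of "cluster-based under-sampling", not the authors' method. It assumes the majority class is grouped with a plain k-means (Lloyd's algorithm) and that each cluster contributes samples in proportion to its size until roughly as many majority samples as minority samples are kept. The names `kmeans`, `cluster_undersample`, `n_keep`, and `k` are hypothetical and chosen for illustration.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def _dist2(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: Sequence[Point], k: int, iters: int = 20, seed: int = 0) -> List[int]:
    """Return a cluster index for every point (plain Lloyd's algorithm)."""
    rng = random.Random(seed)
    centers = rng.sample(list(points), k)  # assumes k <= len(points)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        labels = [min(range(k), key=lambda c: _dist2(p, centers[c])) for p in points]
        # Update step: move each non-empty cluster's center to its mean.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def cluster_undersample(majority: Sequence[Point], n_keep: int,
                        k: int = 3, seed: int = 0) -> List[Point]:
    """Keep about n_keep majority points, drawn per cluster in proportion to cluster size.

    Assumption (not from the paper): proportional allocation with rounding,
    so the result may differ from n_keep by a point or so.
    """
    rng = random.Random(seed)
    labels = kmeans(majority, k, seed=seed)
    kept: List[Point] = []
    for c in range(k):
        members = [p for p, lab in zip(majority, labels) if lab == c]
        quota = round(n_keep * len(members) / len(majority))
        kept.extend(rng.sample(members, min(quota, len(members))))
    return kept


# Toy data: 90 majority points and 10 minority points in two dimensions.
rng = random.Random(1)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(90)]
minority = [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(10)]

balanced_majority = cluster_undersample(majority, n_keep=len(minority))
# Label 0 = majority, label 1 = minority; this is the reduced training set.
training_set = [(p, 0) for p in balanced_majority] + [(p, 1) for p in minority]
```

Sampling per cluster rather than uniformly at random is meant to keep every region of the majority class represented in the reduced training set, which is one way to interpret "representative data".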
Author information
Authors and Affiliations
- Department of Computer Science and Information Engineering, Ming Chuan University, 5 The-Ming Rd., Gwei Shan District, Taoyuan County, 333, Taiwan
Show-Jane Yen & Yue-Shi Lee
Editor information
Editors and Affiliations
- Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei, Anhui, China
De-Shuang Huang
- Queen’s University, Belfast, UK
Kang Li & George William Irwin
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Yen, SJ., Lee, YS. (2006). Under-Sampling Approaches for Improving Prediction of the Minority Class in an Imbalanced Dataset. In: Huang, DS., Li, K., Irwin, G.W. (eds) Intelligent Control and Automation. Lecture Notes in Control and Information Sciences, vol 344. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-37256-1_89
- DOI: https://doi.org/10.1007/978-3-540-37256-1_89
- Publisher Name: Springer, Berlin, Heidelberg
- Print ISBN: 978-3-540-37255-4
- Online ISBN: 978-3-540-37256-1
- eBook Packages: Engineering, Engineering (R0)