Under-Sampling Approaches for Improving Prediction of the Minority Class in an Imbalanced Dataset

Abstract

Training data is the most important factor in classification accuracy. However, data in real-world applications often have an imbalanced class distribution: most of the data belong to the majority class and only a little belongs to the minority class. If all of the data are used for training in this case, the classifier tends to predict that most incoming data belong to the majority class. It is therefore important to select suitable training data when the class distribution is imbalanced. In this paper, we propose cluster-based under-sampling approaches that select representative data as training data in order to improve classification accuracy for the minority class. The experimental results show that our cluster-based under-sampling approaches outperform the under-sampling techniques of previous studies.
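The abstract does not spell out the selection procedure, so the following is only a minimal sketch of one way cluster-based under-sampling could work, not the authors' algorithm. It assumes the majority class is clustered with a small k-means routine and that each cluster contributes samples in proportion to its size until the majority class matches the minority class in size. The names `kmeans` and `cluster_undersample`, the parameters `k`, `iters`, and `seed`, and the proportional quota rule are all hypothetical choices made for illustration.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Tiny k-means (illustrative only): returns a cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers drawn from the data
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its members; empty clusters stay put.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def cluster_undersample(majority, minority, k=3, seed=0):
    """Assumed rule: keep len(minority) majority points, spread over clusters
    in proportion to cluster size, so every region stays represented."""
    target = min(len(minority), len(majority))
    labels = kmeans(majority, k, seed=seed)
    clusters = [[p for p, lab in zip(majority, labels) if lab == c]
                for c in range(k)]
    total = len(majority)

    # Integer quotas via largest-remainder rounding, so they sum to target.
    shares = [divmod(len(members) * target, total) for members in clusters]
    quotas = [quotient for quotient, _ in shares]
    leftover = target - sum(quotas)
    by_remainder = sorted(range(k), key=lambda c: shares[c][1], reverse=True)
    for c in by_remainder[:leftover]:
        quotas[c] += 1

    # Draw each cluster's quota without replacement.
    rng = random.Random(seed)
    selected = []
    for members, quota in zip(clusters, quotas):
        selected.extend(rng.sample(members, quota))
    return selected, list(minority)
```

Calling `cluster_undersample(majority, minority)` with two lists of equal-length numeric tuples returns a reduced majority set and the untouched minority set, which together form a balanced training set. Sampling per cluster rather than uniformly at random is meant to keep sparse regions of the majority class from being dropped entirely.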

Author information

Authors and Affiliations

  1. Department of Computer Science and Information Engineering, Ming Chuan University, 5 The-Ming Rd., Gwei Shan District, Taoyuan County, 333, Taiwan
    Show-Jane Yen & Yue-Shi Lee

Authors

  1. Show-Jane Yen
  2. Yue-Shi Lee

Editor information

Editors and Affiliations

  1. Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei, Anhui, China
    De-Shuang Huang
  2. Queen’s University, Belfast, UK
    Kang Li & George William Irwin

Rights and permissions

© 2006 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Yen, SJ., Lee, YS. (2006). Under-Sampling Approaches for Improving Prediction of the Minority Class in an Imbalanced Dataset. In: Huang, DS., Li, K., Irwin, G.W. (eds) Intelligent Control and Automation. Lecture Notes in Control and Information Sciences, vol 344. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-37256-1_89
