Software Defect Prediction Using Supervised Machine Learning and Ensemble Techniques: A Comparative Study

Software Defect Prediction Using Ensemble Learning: A Systematic Literature Review

IEEE Access

Recent advances in software defect prediction (SDP) include the integration of multiple classification techniques into an ensemble or hybrid approach. This approach was introduced to improve prediction performance by overcoming the limitations of any single classification technique. This research provides a systematic literature review of ensemble learning for software defect prediction. The review critically analyzes research papers published since 2012 in four well-known online libraries: ACM, IEEE, Springer Link, and Science Direct. The study addresses five research questions covering different aspects of research progress on ensemble learning for software defect prediction. To answer these questions, the 46 most relevant papers were shortlisted through a thorough systematic research process. The study provides compact information on the latest trends and advances in ensemble learning for software defect prediction and offers a baseline for future innovations and further reviews. We found that the ensemble methods most frequently employed by researchers are random forest, boosting, and bagging; less frequently employed methods include stacking, voting, and Extra Trees. Researchers have proposed many promising frameworks based on ensemble learning, such as EMKCA, SMOTE-Ensemble, MKEL, SDAEsTSE, TLEL, and LRCR. AUC, accuracy, F-measure, recall, precision, and MCC were the measures most often used to evaluate prediction performance. WEKA was widely adopted as the machine learning platform. Many researchers showed through empirical analysis that feature selection and data sampling are necessary pre-processing steps that improve the performance of ensemble classifiers. Index terms: systematic literature review (SLR), ensemble classifier, hybrid classifier, software defect prediction.
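The threshold-based measures named in this abstract (accuracy, precision, recall, F-measure, and MCC) can all be derived from a binary confusion matrix. The following is a minimal sketch of the standard definitions, not code from the reviewed papers; the function names and the toy labels (1 = defective, 0 = clean) are assumptions made for illustration.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = defective, 0 = clean)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def defect_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, F-measure, and MCC as a dict."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    # Guard every ratio against an empty denominator.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    pr_sum = precision + recall
    f_measure = 2 * precision * recall / pr_sum if pr_sum else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "mcc": mcc,
    }


# Toy example (made-up labels): 3 defective modules out of 10.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
metrics = defect_metrics(y_true, y_pred)
```

On imbalanced defect data, accuracy can look high even when few defects are found, which is one reason measures such as recall and MCC are reported alongside it.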

Software Defects Prediction At Method Level Using Ensemble Learning Techniques

International Journal of Intelligent Computing and Information Sciences

Creating error-free software artifacts is essential to increase software quality and potential reusability. However, testing software artifacts to find and fix defects is time-consuming and costly, so predicting the most error-prone software components can optimize the testing process by focusing testing resources on those components, saving time and money. Much software defect prediction research has focused on coarser granularity, e.g., the file and package levels, and fewer studies have focused on the method level. In this paper, software defect prediction is performed on highly imbalanced method-level datasets extracted from 23 open source Java projects. Eight ensemble learning algorithms are applied to the datasets: AdaBoost, Bagging, Gradient Boosting, Random Forest, Random Undersampling Boost, Easy Ensemble, Balanced Bagging, and Balanced Random Forest. The results showed that the Balanced Random Forest classifier achieved the best recall and ROC AUC values.
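Several of the algorithms listed in this abstract deal with class imbalance by training each ensemble member on a class-balanced subsample. The sketch below illustrates that general idea only, under the assumption that each bag bootstraps the minority class and randomly undersamples the majority class to the same size; it is not the paper's implementation, and `balanced_bootstrap` is a hypothetical helper name.

```python
import random


def balanced_bootstrap(labels, rng):
    """Return indices for one bag with equal numbers of both classes.

    Assumption for this sketch: the minority class is sampled with
    replacement and the majority class without replacement.
    """
    ones = [i for i, y in enumerate(labels) if y == 1]
    zeros = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (ones, zeros) if len(ones) <= len(zeros) else (zeros, ones)
    n = len(minority)
    bag = [rng.choice(minority) for _ in range(n)]  # bootstrap the minority class
    bag += rng.sample(majority, n)                  # undersample the majority class
    rng.shuffle(bag)
    return bag


# Toy data (made up): 5 defective methods among 100.
labels = [1] * 5 + [0] * 95
rng = random.Random(0)  # fixed seed so the sketch is repeatable
bags = [balanced_bootstrap(labels, rng) for _ in range(10)]
```

Each bag would then be used to train one base learner, and the learners' predictions combined by voting or averaging.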

Software Defect Prediction Using the Machine Learning Methods

Problems of Information Technology

The reliability of software systems is one of the main indicators of quality. Defects introduced during software development directly affect reliability. Precise prediction of defects helps software engineers ensure the reliability of software systems and properly allocate resources for the testing process. The development of ensemble methods that combine several classification methods occupies a prominent place in research on error prediction in software modules. This paper proposes a method based on ensemble learning for defect detection. Data obtained from the PROMISE and GitHub software engineering repositories are used to detect defects. Experiments are conducted using Weka. Prediction efficiency is evaluated using the F-measure and ROC area. The experiments show that the defect detection accuracy of the proposed method is higher than that of individual machine learnin...

Early Prediction of Software Defect using Ensemble Learning: A Comparative Study

Recently, early prediction of software defects using machine learning techniques has attracted increasing attention from researchers because of its importance in producing successful software. It also reduces the cost of software development and makes it easier to estimate the percentage of defect-prone software in the future. There is no conclusive evidence that any specific type of machine learning is more efficient and accurate at predicting software defects. However, some previous related work proposes ensemble learning techniques as a more accurate alternative. This paper introduces the resample technique with three types of ensemble learners (Boosting, Bagging, and Rotation Forest), using eight base learners tested on seven benchmark datasets from the PROMISE repository. Results indicate that ensemble techniques improve accuracy over single learners for most of the algorithms used in the experiments, especially Rotation Forest combined with the resample technique.

Software defect prediction using ensemble learning on selected features

Information and Software Technology, 2015

Context: Several issues hinder software defect data, including redundancy, correlation, feature irrelevance, and missing samples. It is also hard to ensure a balanced distribution between data pertaining to defective and non-defective software. In most experimental cases, data related to the latter class dominates the dataset. Objective: The objective of this paper is to demonstrate the positive effects of combining feature selection and ensemble learning on the performance of defect classification. Along with efficient feature selection, a new two-variant (with and without feature selection) ensemble learning algorithm is proposed to provide robustness to both data imbalance and feature redundancy. Method: We carefully combine selected ensemble learning models with efficient feature selection to address these issues and mitigate their effects on defect classification performance. Results: Forward selection showed that only a few features contribute to a high area under the receiver operating characteristic curve (AUC). On the tested datasets, the greedy forward selection (GFS) method outperformed other feature selection techniques such as Pearson's correlation. This suggests that features are highly unstable. However, ensemble learners such as random forests and the proposed algorithm, average probability ensemble (APE), are not as affected by poor features as weighted support vector machines (W-SVMs). Moreover, the APE model combined with greedy forward selection (enhanced APE) achieved AUC values of approximately 1.0 for the NASA datasets PC2, PC4, and MC1. Conclusion: This paper shows that the features of a software dataset must be carefully selected for accurate classification of defective components. Furthermore, tackling the software data issues mentioned above with the proposed combined learning model resulted in remarkable classification performance, paving the way for successful quality control.
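To make the idea of probability averaging concrete, the sketch below assumes that an "average probability ensemble" takes the unweighted mean of each base learner's predicted defect probability; the abstract does not spell out the exact combination rule, so this is an illustration rather than the authors' algorithm. The AUC is computed with the standard pairwise (Mann-Whitney) formulation, and all probabilities are made-up toy values.

```python
def average_probability(prob_lists):
    """Unweighted mean of per-model defect probabilities, one value per sample."""
    n_models = len(prob_lists)
    return [sum(column) / n_models for column in zip(*prob_lists)]


def auc(y_true, scores):
    """AUC as the fraction of (defective, clean) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example (made up): three hypothetical base learners scoring five modules.
y_true = [1, 1, 0, 0, 0]
model_a = [0.9, 0.4, 0.5, 0.2, 0.1]
model_b = [0.6, 0.7, 0.3, 0.6, 0.2]
model_c = [0.8, 0.5, 0.4, 0.1, 0.6]

ensemble = average_probability([model_a, model_b, model_c])
ensemble_auc = auc(y_true, ensemble)
single_aucs = [auc(y_true, m) for m in (model_a, model_b, model_c)]
```

In this toy case each base learner misranks a different module, so averaging smooths out the individual mistakes; that outcome depends on the chosen numbers and is not a general guarantee.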