Benchmarking Classification Models for Software Defect Prediction: A Proposed Framework and Novel Findings

A Comparison Framework of Classification Models for Software Defect Prediction

Software defects are expensive in terms of quality and cost. Accurate prediction of defect-prone software modules can help direct test effort, reduce costs, and improve software quality. Machine learning classification algorithms are a popular approach for predicting software defects, and various types of classification algorithms have been applied to the problem. However, there is no clear consensus on which algorithm performs best when individual studies are considered separately. In this research, a comparison framework is proposed that aims to benchmark the performance of a wide range of classification models within the field of software defect prediction. For the purpose of this study, 10 classifiers are selected and used to build classification models, and their performance is tested on 9 NASA MDP datasets. Area under the curve (AUC) is employed as the accuracy indicator in our framework to evaluate classifier performance, and Friedman and Nemenyi post hoc tests are used to test the significance of AUC differences between classifiers. The results show that logistic regression performs best on most NASA MDP datasets. Naïve Bayes, neural network, support vector machine, and K* classifiers also perform well. Decision-tree-based classifiers tend to underperform, as do linear discriminant analysis and k-nearest neighbor.
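
A minimal sketch of the benchmarking loop this abstract describes, assuming scikit-learn classifiers, cross-validated AUC, and scipy's Friedman test as stand-ins for the authors' setup; synthetic data replaces the NASA MDP datasets, which are not reproduced here:

```python
# Sketch only: library choices and dataset generation are assumptions, not the authors' code.
import numpy as np
from scipy.stats import friedmanchisquare
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

classifiers = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "NaiveBayes": GaussianNB(),
    "NeuralNetwork": MLPClassifier(max_iter=500),
    "SVM": SVC(probability=True),
    "DecisionTree": DecisionTreeClassifier(),
}

# One synthetic, imbalanced "dataset" per project; replace with the real NASA MDP data.
datasets = [make_classification(n_samples=500, n_features=20,
                                weights=[0.85, 0.15], random_state=i)
            for i in range(9)]

# Rows: datasets, columns: classifiers, values: mean cross-validated AUC.
auc = np.zeros((len(datasets), len(classifiers)))
for i, (X, y) in enumerate(datasets):
    for j, clf in enumerate(classifiers.values()):
        auc[i, j] = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()

# Friedman test over per-dataset AUC values; a Nemenyi post hoc comparison
# (e.g., scikit-posthocs' posthoc_nemenyi_friedman) would follow if the
# null hypothesis of equal performance is rejected.
stat, p = friedmanchisquare(*[auc[:, j] for j in range(len(classifiers))])
print("Friedman chi-square = %.3f, p = %.4f" % (stat, p))
print("Mean AUC per classifier:", dict(zip(classifiers, auc.mean(axis=0).round(3))))
```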

An empirical study on software defect prediction with a simplified metric set

Context: Software defect prediction plays a crucial role in estimating the most defect-prone components of software, and a large number of studies have pursued improving prediction accuracy within a project or across projects. However, the rules for making an appropriate decision between within- and cross-project defect prediction when available historical data are insufficient remain unclear. Objective: The objective of this work is to validate the feasibility of a predictor built with a simplified metric set for software defect prediction in different scenarios, and to investigate practical guidelines for the choice of training data, classifier, and metric subset for a given project. Method: First, based on six typical classifiers, three types of predictors distinguished by the size of their software metric set were constructed in three scenarios. Then, we validated with statistical methods that the predictor based on the Top-k metrics achieves acceptable performance. Finally, we attempted to minimize the Top-k metric subset by removing redundant metrics, and we tested the stability of such a minimum metric subset with one-way ANOVA tests. Results: The study was conducted on 34 releases of 10 open-source projects available in the PROMISE repository. The findings indicate that predictors built with either the Top-k metrics or the minimum metric subset can provide acceptable results compared with benchmark predictors. A guideline for choosing a suitable simplified metric set in different scenarios is presented. Conclusion: The experimental results indicate that (1) the choice of training data for defect prediction should depend on the specific requirement of accuracy; (2) the predictor built with a simplified metric set works well and is very useful when limited resources are available; (3) simple classifiers (e.g., Naïve Bayes) also tend to perform well when using a simplified metric set for defect prediction; and (4) in several cases, the minimum metric subset can be identified to facilitate general defect prediction in practice, with an acceptable loss of prediction precision.
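
The Top-k idea can be sketched roughly as follows, assuming a univariate filter (ANOVA F-score) as the ranking criterion and a Naïve Bayes predictor; the paper's actual ranking method, value of k, and the PROMISE releases are not reproduced here:

```python
# Sketch only: synthetic data stands in for the PROMISE releases.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=20, n_informative=6,
                           weights=[0.8, 0.2], random_state=0)

def mean_auc(model):
    return cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

full = GaussianNB()                                  # benchmark predictor: all metrics
top_k = make_pipeline(SelectKBest(f_classif, k=5),   # predictor on a simplified metric set
                      GaussianNB())

print("All metrics   AUC: %.3f" % mean_auc(full))
print("Top-5 metrics AUC: %.3f" % mean_auc(top_k))
# A one-way ANOVA (scipy.stats.f_oneway) over repeated runs or releases can then
# be used to test whether the reduced predictor's performance is stable.
```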

Evaluating defect prediction approaches: a benchmark and an extensive comparison

Empirical Software Engineering, 2012

Reliably predicting software defects is one of the holy grails of software engineering. Researchers have devised and implemented a plethora of defect/bug prediction approaches varying in terms of accuracy, complexity and the input data they require. However, the absence of an established benchmark makes it hard, if not impossible, to compare approaches.

Performance comparison of Machine learning classifiers in Software Defects Prediction

Background: In the software development life cycle, software testing is the main stage that can minimize software defects. A domain that has received much attention from software researchers over the past couple of years is software defect prediction (SDP). Its aim is to minimize cost and time and to improve software efficiency. The main aim of this research is to present a comparative analysis of software defect prediction based on the support vector machine (SVM) and the extreme learning machine (ELM). In this domain, defect prediction models are created using three different techniques for splitting training and test data: cross-validation prediction, cross-version prediction, and cross-project prediction. In this study we use the cross-version prediction approach: data from an old version of a software system is used as training data to develop the prediction model, and the model is evaluated on the current version of the same project. Materials and Methods: In our study, we consider three different versions of the Eclipse version control system and split the data into training and test sets. We chose different object-oriented metrics and algorithms to build our models, aiming to predict software defects across versions. For training we used SVM and ELM. To validate our prediction models, we calculate their performance using commonly used measures such as accuracy, precision, recall, and AUC (area under the ROC curve). Results: Comparing the file-based results of SVM and ELM by average accuracy and AUC, the extreme learning machine has the highest AUC value, while its accuracy is close to that of SVM; SVM has similar accuracy and a very close AUC value. In package-based prediction, the accuracy and AUC values show that SVM has the best accuracy, but its AUC decreases noticeably. We therefore conclude that SVM gives the best prediction results for file-based defects, and the results demonstrate that the support vector machine is the best fit for cross-version defect prediction. Conclusion: Software testing has become more and more important to software reliability over the last couple of years, but it consumes a great deal of time, resources, and money. Software defect prediction can help improve the efficiency of software testing and guide resource allocation. In this study, we discussed the key techniques, including software metrics, classifiers, defect prediction models, and their evaluation. Python is the most widely used language, especially in data science.
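
A rough illustration of cross-version prediction under stated assumptions: synthetic data stands in for the Eclipse releases, scikit-learn's SVC plays the SVM, and the ELM is a basic random-hidden-layer implementation rather than the authors' exact model:

```python
# Sketch only: train on an "old" release, evaluate on the "new" one.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_old, y_old = make_classification(n_samples=800, n_features=20, weights=[0.8, 0.2], random_state=1)
X_new, y_new = make_classification(n_samples=800, n_features=20, weights=[0.8, 0.2], random_state=2)

scaler = StandardScaler().fit(X_old)
X_old, X_new = scaler.transform(X_old), scaler.transform(X_new)

class SimpleELM:
    """Extreme learning machine: random hidden layer + least-squares readout."""
    def __init__(self, n_hidden=100):
        self.n_hidden = n_hidden
    def fit(self, X, y):
        self.W = rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = rng.normal(size=self.n_hidden)
        H = np.tanh(X @ self.W + self.b)            # random hidden-layer activations
        self.beta = np.linalg.pinv(H) @ y           # output weights via pseudo-inverse
        return self
    def decision_function(self, X):
        return np.tanh(X @ self.W + self.b) @ self.beta
    def predict(self, X):
        return (self.decision_function(X) >= 0.5).astype(int)

for name, model in [("SVM", SVC()), ("ELM", SimpleELM())]:
    model.fit(X_old, y_old)
    score = model.decision_function(X_new)
    pred = model.predict(X_new)
    print(name,
          "acc=%.3f" % accuracy_score(y_new, pred),
          "prec=%.3f" % precision_score(y_new, pred),
          "rec=%.3f" % recall_score(y_new, pred),
          "auc=%.3f" % roc_auc_score(y_new, score))
```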

The impact of software metrics in NASA metric data program dataset modules for software defect prediction

TELKOMNIKA Telecommunication Computing Electronics and Control, 2024

This paper discusses software metrics and their impact on software defect prediction values in the NASA metric data program (MDP) dataset. The NASA MDP dataset consists of four categories of software metrics: Halstead, McCabe, LoC, and misc. However, no study has shown which metrics contribute to increasing the area under the curve (AUC) value on the NASA MDP dataset. This study utilizes 12 modules from the NASA MDP dataset, which are tested against 14 combinations of software metrics derived from the four metric categories. Classification is then performed using the k-nearest neighbor (kNN) method. The research concludes that software metrics have a significant impact on the AUC value, with the LoC+McCabe+misc combination driving the improvement of the AUC value. However, the metric combination with the greatest impact on less optimal AUC values is McCabe, and the Halstead metrics also play a role in decreasing the performance of other metrics.
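
The metric-combination experiment can be sketched as follows, with a hypothetical grouping of feature columns into the four categories; the real NASA MDP column layout and the paper's exact 14 combinations differ:

```python
# Sketch only: evaluate kNN AUC on every combination of hypothetical metric groups.
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=16, weights=[0.85, 0.15], random_state=0)

# Hypothetical mapping of feature indices to metric categories.
groups = {"LoC": [0, 1], "McCabe": [2, 3, 4],
          "Halstead": [5, 6, 7, 8, 9], "misc": [10, 11, 12, 13, 14, 15]}

knn = KNeighborsClassifier(n_neighbors=5)
for r in range(1, len(groups) + 1):
    for combo in combinations(groups, r):
        cols = sum((groups[g] for g in combo), [])
        auc = cross_val_score(knn, X[:, cols], y, cv=5, scoring="roc_auc").mean()
        print("+".join(combo), "AUC = %.3f" % auc)
```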

An empirical analysis of the effectiveness of software metrics and fault prediction model for identifying faulty classes

Computer Standards & Interfaces

Software fault prediction models are used to predict faulty modules at a very early stage of the software development life cycle. Predicting fault proneness using source code metrics is an area that has attracted the attention of several researchers. The performance of a model for assessing fault proneness depends on the source code metrics used as its input. In this work, we propose a framework to validate source code metrics and identify a suitable set of them, with the aim of reducing irrelevant features and improving the performance of the fault prediction model. Initially, we applied a t-test and a univariate logistic regression analysis to each source code metric to evaluate its potential for predicting fault proneness. Next, we performed a correlation analysis and a multivariate linear regression with stepwise forward selection to find the right set of source code metrics for fault prediction. The obtained set of source code metrics is then used as input to develop a fault prediction model using a neural network with five different training algorithms and three different ensemble methods. The effectiveness of the developed fault prediction models is evaluated using a proposed cost evaluation framework. We performed experiments on fifty-six open-source Java projects. The experimental results reveal that the model developed with the set of source code metrics selected by the suggested validation framework achieves better results than models using all other metrics. The experimental results also demonstrate that the fault prediction model is best suited to projects whose proportion of faulty classes is below a threshold value that depends on fault identification efficiency (low: 48.89%, median: 39.26%, high: 27.86%).
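
A compressed sketch of the metric validation steps described above (per-metric t-test and univariate logistic regression, followed by a forward selection pass), assuming scikit-learn and synthetic data; the neural-network training algorithms, ensembles, and cost evaluation framework are omitted:

```python
# Sketch only: the screening thresholds and selection size are illustrative assumptions.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=15, n_informative=5, random_state=0)

# Step 1: screen each metric with a t-test between faulty and non-faulty classes
# and a univariate logistic regression AUC.
for j in range(X.shape[1]):
    _, p = ttest_ind(X[y == 1, j], X[y == 0, j])
    auc = cross_val_score(LogisticRegression(), X[:, [j]], y, cv=5, scoring="roc_auc").mean()
    print("metric %2d  t-test p = %.3f  univariate AUC = %.3f" % (j, p, auc))

# Step 2: multivariate forward selection (a stand-in for the paper's stepwise procedure).
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=5, direction="forward",
                                scoring="roc_auc", cv=5)
sfs.fit(X, y)
print("Selected metric indices:", np.flatnonzero(sfs.get_support()).tolist())
```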

Comparative Classifiers for Software Quality Assessment

International Journal of Engineering and Technology, 2012

Software defect prediction models can help improve software quality and lessen testing effort. A large number of predictive models have been proposed in the software engineering literature; this paper presents a software defect prediction method with comparative results based on two classifiers, a backpropagation neural network and a radial basis function (RBF) network with Gaussian kernels. Comparative results on a NASA dataset are demonstrated and analyzed on the basis of mean squared error and percentage accuracy. Experimental results show that the neural network predicts better than the RBF network on almost all data subsets, by margins of 5.76% to 6.75%.
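
The two classifiers being compared can be sketched as follows, with an MLP as the backpropagation network and an RBF network built from k-means centres and Gaussian kernels; the data here is synthetic rather than the NASA subsets used in the paper:

```python
# Sketch only: architectures and hyperparameters are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=600, n_features=20, weights=[0.8, 0.2], random_state=0)
X = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

# Backpropagation neural network.
mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=0).fit(X_tr, y_tr)

# RBF network: Gaussian activations around k-means centres, linear readout.
centres = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X_tr).cluster_centers_
readout = LogisticRegression(max_iter=1000).fit(rbf_kernel(X_tr, centres, gamma=0.1), y_tr)

for name, proba in [("MLP", mlp.predict_proba(X_te)[:, 1]),
                    ("RBF", readout.predict_proba(rbf_kernel(X_te, centres, gamma=0.1))[:, 1])]:
    print(name,
          "MSE = %.4f" % mean_squared_error(y_te, proba),
          "accuracy = %.2f%%" % (100 * accuracy_score(y_te, (proba >= 0.5).astype(int))))
```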

Software defect prediction using static code metrics underestimates defect-proneness

International Symposium on Neural Networks, 2010

Many studies have been carried out to predict the presence of software code defects using static code metrics. Such studies typically report how a classifier performs with real world data, but usually no analysis of the predictions is carried out. An analysis of this kind may be worthwhile as it can illuminate the motivation behind the predictions and the severity

A set of measures designed to identify overlapped instances in software defect prediction

Computing, 2017

The performance of learning models depends heavily on the characteristics of the training data. Previous results suggest that overlap between classes and the presence of noise have the strongest impact on the performance of learning algorithms, and software defect datasets are no exception. Class overlap, in which data samples appear as valid examples of more than one class, is a critical problem for machine learning classifiers and may be responsible for the presence of noise in datasets. We aim to investigate how the presence of overlapped instances in a dataset influences a classifier's performance, and how to deal with the class overlap problem. To estimate class overlap more closely, we propose four measures: nearest enemy ratio, subconcept ratio, likelihood ratio, and soft margin ratio. We performed our investigations using 327 binary defect classification datasets obtained from 54 software projects, where we first identified overlapped datasets using three data complexity measures proposed in the literature. We also include treatment effort in the prediction process. Subsequently, we used our proposed measures to find overlapped instances in the identified overlapped datasets. Our results indicate that training a classifier on training data free from overlapped instances leads to improved classifier performance on test data containing overlapped instances. The classifiers also perform significantly better when the evaluation measure takes effort into account.
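
One plausible reading of the nearest enemy ratio, sketched below under the assumption that it compares the distance to an instance's nearest same-class neighbour with the distance to its nearest enemy (the closest instance of the other class); the paper's exact definitions and thresholds may differ:

```python
# Sketch only: ratio definition and the flagging threshold are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import pairwise_distances

X, y = make_classification(n_samples=300, n_features=10, weights=[0.8, 0.2],
                           class_sep=0.5, random_state=0)

D = pairwise_distances(X)
np.fill_diagonal(D, np.inf)                 # ignore self-distances

ratios = np.empty(len(X))
for i in range(len(X)):
    friend = D[i, y == y[i]].min()          # nearest same-class neighbour
    enemy = D[i, y != y[i]].min()           # nearest enemy
    ratios[i] = friend / enemy

overlapped = ratios >= 1.0                  # candidate overlapped instances
print("Flagged %d of %d instances as overlapped" % (overlapped.sum(), len(X)))
# Removing these from the training split before fitting a classifier mirrors
# the cleaning step whose benefit is reported above.
```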