Cross-Project Defect Prediction: A Literature Review

Feature Selection in Cross-Project Software Defect Prediction

Journal of Physics: Conference Series, 2020

Advances in technology have increased both the use and the complexity of software. Greater complexity raises the likelihood of defects, and defective software can cause heavy losses. Fixing defective software is costly because it can consume up to 50% of the project schedule. Most software developers do not document their work properly, which makes it difficult to analyse software development history data. The software metrics used in cross-project software defect prediction have many features. Because these metrics usually combine various measurement techniques, some of their features may be similar or irrelevant, which can degrade classifier performance. In this study, several feature selection techniques were proposed to select the relevant features. The classification algorithm used is Naïve Bayes. Based on an analysis using ANOVA, the SBS (sequential backward selection) and SBFS (sequential backward floating selection) models can significantly improve the performance of the Naïve Bayes model.
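The backward selection idea named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the pluggable `score` callback, and the toy scoring function are all assumptions made for illustration. In a real study the score would be something like the cross-validated performance of a Naïve Bayes classifier on the chosen feature subset. The floating variant (SBFS), which may also re-add previously removed features, is not shown.

```python
from typing import Callable, FrozenSet, Tuple


def sequential_backward_selection(
    n_features: int,
    score: Callable[[FrozenSet[int]], float],
    min_features: int = 1,
) -> Tuple[FrozenSet[int], float]:
    """Greedy SBS: start from all features, repeatedly drop the feature
    whose removal scores best, and remember the best subset seen."""
    current = frozenset(range(n_features))
    best_subset, best_score = current, score(current)
    while len(current) > min_features:
        # Score every subset obtained by removing exactly one feature.
        candidates = [(score(current - {f}), f) for f in sorted(current)]
        cand_score, removed = max(candidates, key=lambda c: c[0])
        current = current - {removed}
        if cand_score >= best_score:
            best_subset, best_score = current, cand_score
    return best_subset, best_score


def toy_score(subset: FrozenSet[int]) -> float:
    """Hypothetical stand-in for classifier performance: features 0 and 2
    are treated as relevant, every other feature costs a small penalty."""
    relevant = {0, 2}
    return len(subset & relevant) - 0.1 * len(subset - relevant)


selected, value = sequential_backward_selection(5, toy_score)
print(sorted(selected), value)
```

With the toy score, the search discards the three penalised features and keeps features 0 and 2.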

The influence of machine learning on the predictive performance of cross-project defect prediction: empirical analysis

TELKOMNIKA Telecommunication Computing Electronics and Control, 2024

This empirical investigation examines the influence of machine learning (ML) algorithms on cross-project defect prediction, using the AEEEM dataset as a foundation. The primary objective is to discern how various algorithms affect predictive performance, with the F1 score as the evaluation criterion. Four ML algorithms are assessed in this study: random forest (RF), support vector machines (SVM), k-nearest neighbors (KNN), and logistic regression (LR). The choice of these algorithms reflects their prevalence in the software defect prediction literature and their diversity. Through experimentation and analysis, the investigation finds evidence that RF outperforms its counterparts. The F1 score is used as the evaluation metric because it captures the balance between precision and recall, which is essential in defect prediction scenarios. The examination of algorithmic efficacy provides practical insights for developers and practitioners facing the challenges of cross-project defect prediction. By drawing on the diverse AEEEM dataset, the study explores algorithmic influences across varied software projects. The findings contribute to the academic discourse on defect prediction and offer practical guidance for real-world application, emphasizing the role of RF as a tool for improving predictive accuracy and reliability.
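Since the F1 score is the evaluation criterion here, a short sketch of how it is computed may help. The function names and the example labels are illustrative assumptions, not taken from the study; label 1 is assumed to mean "defective".

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int]:
    """Count true positives, false positives, and false negatives
    for the positive (defective = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall; returns 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for six modules of a target project.
actual = [1, 1, 1, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0]
print(f1_score(*confusion_counts(actual, predicted)))
```

In this example precision and recall are both 2/3, so the F1 score is also 2/3.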

Comparative Analysis of Software Defect Prediction Techniques

2019

Accurate prediction of defects in software components plays a vital role in managing the quality and efficiency of the system to be developed. We therefore conducted a systematic literature review to evaluate the four main defect prediction techniques. Defect prediction helps testers find bugs and fix them in order to achieve input-to-output conformance. In this paper we discuss the open issues in predicting software defects and provide a detailed analysis of different methods, including Machine Learning, the Integrated Approach, Cross-Project prediction and the Deep Forest algorithm, in order to prevent these flaws. However, it is almost impossible to rule that one method is better than another, so each technique should be analyzed separately, and the technique best suited to the problem at hand can be used as is or altered to create a suitable hybrid technique.

Software Defect Prediction Techniques in Software Engineering: A Review

CERN European Organization for Nuclear Research - Zenodo, 2022

Defect prediction is one of the significant challenges in the software development lifecycle for improving software quality and reducing program testing time and cost. Developing a defect prediction model is a difficult task, and several techniques have been developed over time. Previous reviews focused on defect prediction in general, and none have specifically addressed defect prediction based on the semantic representation of programs from source code. This review presents a comprehensive survey of software defect research over three decades, covering motivations, datasets, state-of-the-art techniques, challenges, and future research directions. We concentrate specifically on methods based on source code semantics, because they represent the field's current state of the art. We focus on the process of cross-project defect prediction (CPDP), within-project defect prediction (WPDP), and the most recently used datasets. Defect datasets for 60 projects in different programming languages (C, Java, and C++) are presented and analyzed. Open issues are studied, and potential research directions in defect prediction are proposed to give the reader a point of reference for important topics that deserve study.

A Systematic Literature Review and Meta-Analysis on Cross Project Defect Prediction

2019

Background: Cross project defect prediction (CPDP) recently gained considerable attention, yet there are no systematic efforts to analyse existing empirical evidence. Objective: To synthesise literature to understand the state-of-the-art in CPDP with respect to metrics, models, data approaches, datasets and associated performances. Further, we aim to assess the performance of CPDP versus within project DP models. Method: We conducted a systematic literature review. Results from primary studies are synthesised (thematic, meta-analysis) to answer research questions. Results: We identified 30 primary studies passing quality assessment. Performance measures, except precision, vary with the choice of metrics. Recall, precision, f-measure, and AUC are the most common measures. Models based on Nearest-Neighbour and Decision Tree tend to perform well in CPDP, whereas the popular naïve Bayes yields average performance. Performance of ensembles varies greatly across f-measure and AUC. Data ap...

Improving Cross-Project Software Defect Prediction Method Through Transformation and Feature Selection Approach

IEEE Access

In the traditional software defect prediction methodology, the historical record (dataset) of a single project is partitioned into training and testing data. In a practical situation where the project to be predicted is new, traditional software defect prediction cannot be employed. An alternative method is cross-project defect prediction, where the historical record of one project (the source) is used to predict the defect status of another project (the target). Cross-project defect prediction thus removes the traditional method's dependence on the target project's own historical records. However, the performance of cross-project defect prediction is relatively low because of the distribution differences between the source and target projects. Furthermore, the software defect datasets used for cross-project defect prediction are characterized by high-dimensional features, some of which are irrelevant and contribute to low performance. To resolve these two issues, this study proposes a transformation and feature selection approach to reduce the distribution difference and the number of features in cross-project defect prediction. A comparative experiment was conducted on the publicly available AEEEM datasets. Analysis of the results shows that the proposed approach, in conjunction with random forest as the classification model, outperformed four other state-of-the-art cross-project defect prediction methods on the commonly used F1 score evaluation metric.
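The abstract does not say which transformation the authors apply, so the following is only a sketch of one simple, widely used option: standardizing each project's metrics with that project's own mean and standard deviation, so that source and target values land on a comparable scale. The function name and the toy metric values are assumptions for illustration.

```python
from statistics import mean, pstdev
from typing import List


def standardize(rows: List[List[float]]) -> List[List[float]]:
    """Z-score every metric column using this project's own mean and
    population standard deviation (an assumed, illustrative transformation)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # guard constant columns
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Hypothetical single-metric projects: same shape, very different scale.
source = [[100.0], [200.0], [300.0]]
target = [[1000.0], [2000.0], [3000.0]]

print(standardize(source))
print(standardize(target))
```

After standardization the two toy projects have matching values up to rounding, which is the kind of distribution gap a transformation step is meant to narrow before a classifier trained on the source is applied to the target.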

Cross-project software defect prediction through multiple learning

Bulletin of Electrical Engineering and Informatics

Cross-project defect prediction is a method that predicts defects in one software project by using the historical record of another software project. Because of distribution differences and the weak classifiers used to build the prediction model, this method has poor prediction performance. Cross-project defect prediction may perform better if distribution differences are reduced and an appropriate individual classifier is chosen. However, the prediction performance of an individual classifier may still be limited by its weaknesses. To boost the accuracy of cross-project defect prediction, this study therefore proposes a strategy that uses multiple classifiers and selects attributes that are similar to one another. The proposed method's efficacy was tested in an experiment using the Relink and AEEEM datasets. The experimental findings showed that the proposed method produces superior outcomes. To further validate the method, we employed the Wilcoxon rank-sum test at the 95% confidence level. The approach was found to perform significantly better than the baseline methods.
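The abstract says that multiple classifiers are combined but does not state the combination rule. As an illustration only, the sketch below assumes simple majority voting over the labels predicted by each classifier; the function name and the example predictions are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-classifier label lists by taking, for each module,
    the label predicted by the most classifiers."""
    combined = []
    for votes in zip(*predictions):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Hypothetical predictions from three classifiers for four target modules
# (1 = defective, 0 = clean).
classifier_a = [1, 0, 1, 0]
classifier_b = [1, 1, 0, 0]
classifier_c = [0, 1, 1, 0]

print(majority_vote([classifier_a, classifier_b, classifier_c]))
```

Using an odd number of classifiers with binary labels avoids ties, so a single weak classifier's mistake on a module is outvoted when the other two agree.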

SOFTWARE DEFECT PREDICTION: PAST PRESENT AND FUTURE

IAEME PUBLICATION, 2018

Software development calls for several defect prediction methodologies that use critical parameters such as review effort measurement, test effort estimation, phase gate containment, change request cost, re-usability, size and quality to improve the quality of deliverables. Nonetheless, many of these methodologies are still in development, and further research is required to produce a strong and dependable model. Many research centers have started further research projects in these areas. In this study, we investigated research papers and categorized them according to their importance to the user community. We conducted a survey of software application defect prediction methodologies based on machine learning approaches as well as statistical approaches. This paper contains an outline of works that have been published so far and is not a comprehensive review of all the papers published on the topic. We are confident that our survey will help researchers understand developments in this field of study effectively and easily. We also introduce and discuss the latest trends in defect prediction.

A Systematic Literature Review of Software Defect Prediction: Research Trends, Datasets, Methods and Frameworks

Journal of Software Engineering, 2015

Recent studies of software defect prediction typically produce datasets, methods and frameworks which allow software engineers to focus development activities on defect-prone code, thereby improving software quality and making better use of resources. Many software defect prediction datasets, methods and frameworks are published in a disparate and complex manner, so a comprehensive picture of the current state of defect prediction research is missing. This literature review aims to identify and analyze the research trends, datasets, methods and frameworks used in software defect prediction research between 2000 and 2013. Based on the defined inclusion and exclusion criteria, 71 software defect prediction studies published between January 2000 and December 2013 were retained and selected for further investigation. This literature review has been undertaken as a systematic literature review, defined as a process of identifying, assessing, and interpreting all available research evidence in order to answer specific research questions. Analysis of the selected primary studies revealed that current software defect prediction research focuses on five topics and trends: estimation, association, classification, clustering and dataset analysis. The distribution of defect prediction methods is as follows: 77.46% of the research studies are related to classification methods, 14.08% of the studies focused on estimation methods, and 1.41% of the studies were concerned with clustering and association methods. In addition, 64.79% of the research studies used public datasets and 35.21% used private datasets. Nineteen different methods have been applied to predict software defects, of which the seven most frequently applied are identified.
Researchers have proposed several techniques for improving the accuracy of machine learning classifiers for software defect prediction: ensembling machine learning methods, using boosting algorithms, adding feature selection, and optimizing the parameters of some classifiers. The results of this research also identified three frameworks that are highly cited and therefore influential in the software defect prediction field: the Menzies et al. Framework, the Lessmann et al. Framework, and the Song et al. Framework.

Moving from Cross-Project Defect Prediction to Heterogeneous Defect Prediction: A Partial Replication Study

ArXiv, 2021

Software defect prediction relies heavily on the metrics collected from software projects. Earlier studies often used machine learning techniques to build, validate, and improve bug prediction models using either a set of metrics collected within a project or across different projects. However, the techniques applied and the conclusions derived by those models are restricted by how identical those metrics are. Knowledge coming from those models will not be extensible to a target project if insufficient overlapping metrics have been collected in the source projects. To explore the feasibility of transferring knowledge across projects without common labeled metrics, we systematically integrated Heterogeneous Defect Prediction (HDP) by replicating and validating the obtained results. Our main goal is to extend prior research and explore the feasibility of HDP and finally to compare its performance with that of its predecessor, Cross-Project Defect Prediction. We construct an HDP model on diff...