Bug Localization Using Revision Log Analysis and Open Bug Repository Text Categorization

Code Complexity and Version History for Enhancing Hybrid Bug Localization

IEEE Access, 2021

Software projects are not free of bugs when they are released, so developers keep receiving bug reports that describe technical issues. The process of identifying the buggy code files that correspond to the submitted bug reports is called bug localization. Automating the bug localization process can speed up bug fixing and improve developer productivity, especially when the number of submitted bug reports is large. Several automatic bug localization approaches have been proposed in the literature, based on the textual and/or semantic similarity between bug reports and source code files. Nevertheless, none of the previous approaches made use of source code complexity despite its importance: high-complexity source code files are more likely to be modified than low-complexity files and are more prone to bugs. To improve the accuracy of automatic bug localization, this paper proposes a Hybrid Bug Localization (HBL) approach that makes full use of the textual and semantic features of source code files and previously fixed bug reports, in addition to source code complexity and version history properties. The effectiveness of the proposed approach was assessed using three open-source Java projects of different sizes: ZXing, SWT, and AspectJ. Experimental results showed that the proposed approach outperforms several state-of-the-art approaches in terms of the mean average precision (MAP) and mean reciprocal rank (MRR) metrics.
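
One way such a hybrid approach can combine its evidence sources is a weighted sum over normalized per-file scores. The sketch below is illustrative only: the weights, file names, and score values are made up, not taken from the HBL paper.

```python
def hybrid_score(textual_sim, complexity, churn, weights=(0.6, 0.2, 0.2)):
    """Combine per-file evidence into a single ranking score.

    textual_sim, complexity, and churn (version-history activity) are
    assumed to be normalized to [0, 1]; the weights are illustrative.
    """
    w_t, w_c, w_h = weights
    return w_t * textual_sim + w_c * complexity + w_h * churn

def rank_files(files):
    # files: {name: (textual_sim, complexity, churn)}
    scored = {name: hybrid_score(*feats) for name, feats in files.items()}
    return sorted(scored, key=scored.get, reverse=True)

files = {
    "Parser.java": (0.80, 0.90, 0.70),  # similar text AND complex, churned
    "Utils.java":  (0.85, 0.10, 0.05),  # similar text but stable and simple
    "Render.java": (0.20, 0.30, 0.40),
}
ranking = rank_files(files)
```

Note how complexity and churn break the near-tie in textual similarity, which is the intuition the abstract describes.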

Locating Bug IDs and Development Logs in Open Source Software (OSS) projects: An Experience Report

2018 International Conference on Innovation and Intelligence for Informatics, Computing, and Technologies (3ICT), 2018

The development logs of software projects, contained in Version Control (VC) systems, can be severely incomplete when tracking bugs, especially in open-source projects, resulting in reduced traceability of defects. At other times, such logs can contain bug information that is not available in bug tracking system (BT system) repositories, and vice versa: if development logs and BT system data were used together, researchers and practitioners would often have a larger set of bug IDs for a software project and a better picture of a bug's life cycle, evolution, and maintenance. Considering a sample of 10 OSS projects and their development logs and BT system data, the two objectives of this paper are (i) to determine which of the keywords 'Fix' and 'Bug' or the '#' identifier provides better precision, and (ii) to analyse their respective precision and recall at manually locating as many bug IDs as possible. Overall, our results suggest that the use of the '#' identifier in conjunction with the bug ID digits (e.g., #1234) is more precise for locating bugs in development logs than the use of the 'Bug' and 'Fix' keywords. Such keywords are indeed present in the development logs, but they are less useful when trying to connect development actions with the bug traces in a software project.
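
The two matching strategies the paper compares can be sketched with two regular expressions over commit messages; the sample log lines below are invented for illustration.

```python
import re

# '#' followed by bug ID digits, e.g. #1234
BUG_REF = re.compile(r"#\d+")
# the 'Bug'/'Fix' keyword heuristic, case-insensitive, whole words only
KEYWORDS = re.compile(r"\b(?:bug|fix)\b", re.IGNORECASE)

def find_bug_refs(message):
    return BUG_REF.findall(message)

log = [
    "Fix crash on startup, see #1234",
    "bug in parser handled",
    "Refactor build scripts",
]
refs = [find_bug_refs(m) for m in log]
keyword_hits = [bool(KEYWORDS.search(m)) for m in log]
```

The second message shows why the keyword heuristic is less useful for traceability: it flags a bug-related commit but yields no bug ID to link back to the BT system.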

Managing Open Bug Repositories through Bug Report Prioritization Using SVMs

Managing the incoming deluge of new bug reports received in the bug repository of a large open-source project is a challenging task. Handling these reports manually consumes developers' time and resources, which delays the resolution of crucial bugs that need to be identified and resolved early to prevent major losses in a software project. In this paper, we present a machine learning approach to developing a bug priority recommender that automatically assigns an appropriate priority level to newly arrived bugs, so that they are resolved in order of importance and an important bug is not left untreated for a long time. Our approach is based on a classification technique, for which we use Support Vector Machines. Experimental evaluation of our recommender using precision and recall measures reveals the feasibility of our approach for automatic bug priority assignment.
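
The per-priority precision and recall measures used in such an evaluation can be computed as below; the priority labels and predictions are invented for illustration, not taken from the paper.

```python
def precision_recall(pred, gold, label):
    """Precision and recall for one priority level (e.g. 'P1')."""
    tp = sum(p == label and g == label for p, g in zip(pred, gold))
    fp = sum(p == label and g != label for p, g in zip(pred, gold))
    fn = sum(p != label and g == label for p, g in zip(pred, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

gold = ["P1", "P2", "P1", "P3", "P1"]  # triager-assigned priorities
pred = ["P1", "P1", "P1", "P3", "P2"]  # recommender output
p, r = precision_recall(pred, gold, "P1")
```

Reporting these per priority level matters because the rare, highest-priority class is exactly the one whose recall the recommender must not sacrifice.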

Improving Bug Localization using IR-based Textual Similarity and Vectorization Scoring Framework

2020

A major challenge faced by the software industry is meeting deadlines while delivering a quality product. The main reason behind delays is not only the development work itself but, fundamentally, the detection and localization of bugs and errors. Whenever a bug is reported, developers use the bug report to reach the code fragments that need to be modified to fix the bug. Bug reports contain suitable semantic information, yet developers otherwise resort to exhaustive manual searching to find the bug location. To minimize this manual effort, a framework for information-retrieval-based bug localization is proposed that exploits the textual content of the bug report to rank relevant buggy source files, i.e. the files having a higher probability of containing the bug. The dataset used consists of a total of 925 bugs from four projects: SWT, ZXing, Eclipse and AspectJ. The framework outputs the top-N results, here the top 5 ranked (related) terms, identifying the files containing these terms as having a higher probability ...
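
The core of such an IR-based framework is scoring each source file against the bug report text, e.g. with TF-IDF weighting and cosine similarity. The sketch below is a minimal stand-in with invented file contents, not the paper's scoring framework.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: {name: token list}. Returns {name: {term: tf-idf weight}}."""
    n = len(docs)
    df = Counter(t for toks in docs.values() for t in set(toks))
    vecs = {}
    for name, toks in docs.items():
        tf = Counter(toks)
        vecs[name] = {t: tf[t] * math.log(1 + n / df[t]) for t in tf}
    return vecs

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

source = {
    "Button.java": "button click handler widget".split(),
    "Parser.java": "parse token stream grammar".split(),
}
report = "crash when button click".split()
vecs = tfidf_vectors({**source, "report": report})
query = vecs.pop("report")
ranked = sorted(source, key=lambda f: cosine(query, vecs[f]), reverse=True)
```

Files sharing report terms ("button", "click") rise to the top of the ranked list handed to the developer.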

Automated classification of software issue reports using machine learning techniques: an empirical study

Innovations in Systems and Software Engineering, 2017

Software developers, testers and customers routinely submit issue reports to software issue trackers to record the problems they face in using software. The issues are then directed to appropriate experts for analysis and fixing. However, submitters often misclassify an improvement request as a bug and vice versa. This costs valuable developer time. Hence, automated classification of the submitted reports would be of great practical utility. In this paper, we analyze how machine learning techniques may be used to perform this task. We apply different classification algorithms, namely naive Bayes, linear discriminant analysis, k-nearest neighbors, support vector machine (SVM) with various kernels, decision tree and random forest, separately to classify the reports from three open-source projects. We evaluate their performance in terms of F-measure, average accuracy and weighted average F-measure. Our experiments show that random forests perform best, while SVMs with certain kernels also achieve high performance.
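
The weighted average F-measure used to compare the classifiers weights each class's F-measure by its support, so that a frequent class dominates a rare one. A small worked sketch (the class names and scores are invented):

```python
def f_measure(precision, recall):
    # harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def weighted_avg_f(per_class):
    """per_class: {label: (precision, recall, support)}."""
    total = sum(s for _, _, s in per_class.values())
    return sum(f_measure(p, r) * s for p, r, s in per_class.values()) / total

scores = {
    "bug":         (0.90, 0.80, 60),  # 60 reports in the test set
    "improvement": (0.70, 0.75, 40),
}
wf = weighted_avg_f(scores)
```

Here the "bug" class, with the larger support, pulls the weighted average toward its own F-measure.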

Mining Software Repositories for Defect Categorization

Journal of Communications Software and Systems, 2015

Early detection of software defects is very important to decrease software cost and consequently increase software quality. The success of software companies not only depends on gaining knowledge about software defects, but is largely reflected in the manner in which information about defects is collected and used. In the software industry, individuals at different levels, from customers to engineers, apply diverse mechanisms to assign defects to a particular class. Categorizing bugs based on their characteristics helps the software development team take appropriate actions to reduce similar defects that might get reported in future releases. Classification, if performed manually, consumes considerable time and effort, and human resources with expert testing skills and domain knowledge are required for labeling the data. Therefore, the need for automatic classification of software defects is high. This work attempts to categorize defects by proposing an algorithm called Software Defect CLustering (SDCL). It aims at mining existing online bug repositories such as Eclipse, Bugzilla and JIRA to analyze defect descriptions and categorize them. The proposed algorithm is designed using text clustering and works with three major modules to find the class to which a defect should be assigned. Software bug repositories hold software defect data with attributes such as defect description, status, and defect open and close dates. The defect extraction module extracts defect descriptions from the various bug repositories and converts them into a unified format for further processing. Unnecessary and irrelevant text is removed from the defect data by the data preprocessing module. Finally, the defect data are grouped into clusters of similar defects using a clustering technique. The algorithm provides classification accuracy of more than 80% on all three of the above-mentioned repositories.
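
The final clustering step can be sketched with a simple single-pass scheme over preprocessed defect descriptions; this is an illustrative stand-in (Jaccard overlap, invented threshold and data), not the SDCL algorithm itself.

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_defects(descriptions, threshold=0.25):
    """Single-pass text clustering: attach each defect to the most
    similar existing cluster (compared against its first member), or
    start a new cluster when nothing is similar enough."""
    clusters = []  # each cluster is a list of token lists
    for toks in descriptions:
        best, best_sim = None, 0.0
        for c in clusters:
            sim = jaccard(toks, c[0])
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= threshold:
            best.append(toks)
        else:
            clusters.append([toks])
    return clusters

defects = [
    "null pointer exception in editor".split(),
    "editor throws null pointer exception".split(),
    "ui freezes on large file open".split(),
]
groups = cluster_defects(defects)
```

The two null-pointer descriptions land in one cluster while the UI freeze starts its own, mirroring the grouping-of-similar-defects module described above.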

Bug Report Triaging Using Textual, Categorical and Contextual Features Using Latent Dirichlet Allocation

Software bugs occur for a wide range of reasons. Bug reports can be generated automatically or drafted by the user of the software. Bug reports can also accompany other malfunctions of the software, mostly in beta or unstable versions. Most often, these bug reports are enriched with user-contributed experiences describing what the user actually faced. Addressing these bugs accounts for the majority of the effort spent in the maintenance phase of a software project's life cycle. Most often, several bug reports sent by different users correspond to the same defect. Nevertheless, every bug report has to be analyzed separately and carefully for the possibility of a potential bug. The person responsible for processing newly reported bugs, checking for duplicates and passing them to suitable developers to get fixed is called a triager, and this process is called triaging. The utility of bug tracking systems is hindered by the large number of duplicate bug reports; in many open-source software projects, as many as one third of all reports are duplicates. Identifying duplicate bug reports is time-consuming and adds to the already high cost of software maintenance. In this dissertation, a model of the automated triaging process is proposed based on textual, categorical and contextual similarity features. The contribution of this dissertation is twofold. In the proposed scheme, a total of 80 textual features are extracted from the bug reports. Moreover, topics are modeled from the complete text corpus using Latent Dirichlet Allocation (LDA). These topics are specific to a category, class or functionality of the software; for example, possible topics for an Android bug repository might be Bluetooth, Download, Network, etc. Bug reports are analyzed for context to relate them to the domain-specific topics of the software, thereby enhancing the feature set that is used for tabulating the similarity score.
Finally, two sets are formed, for duplicate and non-duplicate bug reports, for binary classification using a Support Vector Machine. Simulation is performed over a dataset from Bugzilla. The proposed system improves the efficiency of duplicate checking by 15% as compared to the contextual model proposed by Anahita Alipour et al. The system is able to reduce development cost by improving duplicate checking while allowing at least one bug report for each real defect to reach developers.
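
To show the shape of the contextual features, the sketch below replaces LDA with hand-written topic word lists (a loud simplification: real topics would be learned from the corpus) and compares two reports through their topic-feature vectors.

```python
import math

TOPICS = {  # hand-written stand-ins for LDA-discovered topics
    "Bluetooth": {"bluetooth", "pairing", "headset"},
    "Download":  {"download", "progress", "resume"},
    "Network":   {"network", "wifi", "connection"},
}

def topic_features(tokens):
    """Fraction of each topic's vocabulary present in the report."""
    toks = set(tokens)
    return {name: len(toks & words) / len(words) for name, words in TOPICS.items()}

def contextual_similarity(a, b):
    # cosine similarity over the topic-feature vectors
    fa, fb = topic_features(a), topic_features(b)
    dot = sum(fa[k] * fb[k] for k in TOPICS)
    na = math.sqrt(sum(v * v for v in fa.values()))
    nb = math.sqrt(sum(v * v for v in fb.values()))
    return dot / (na * nb) if na and nb else 0.0

r1 = "bluetooth headset pairing fails".split()
r2 = "cannot complete pairing with bluetooth device".split()
r3 = "download progress bar stuck".split()
```

Two Bluetooth reports score high against each other and near zero against the download report, even where they share few exact words; that is the signal the contextual feature set adds on top of textual similarity.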

Online SVM based Optimized Bug Report Triaging using Feature Extraction

Triage is a medical term referring to the process of prioritizing patients based on the severity of their condition so as to maximize benefit (help as many as possible) when resources are limited. Bug report triaging is a process where tracker issues are screened and prioritized. Triage should help ensure that all reported issues are properly managed: bugs as well as improvements and feature requests. The large number of new bug reports received in the bug repositories of software systems makes their management a challenging task. Handling these reports manually is time-consuming and often results in delaying the resolution of important bugs. The most critical issue with bug reports is that their number is vast and most of them are duplicates of previously submitted reports. The solution to this problem requires that bug reports be categorized into groups, where each group consists of all the bug reports that belong to the same bug, and the number of groups equals the number of unique bugs addressed so far. A bug report corresponding to a new bug is placed in a separate group, followed by its duplicates, if any. Classifying whether a bug report that arrived from a user, written in natural language, is a duplicate or a unique report is a time-consuming task, especially when the number of received bug reports is large. Thus, this process needs to be automated. Bug reports have textual, contextual and categorical features, and these features need to be extracted to check for duplicates and non-duplicates. Moreover, within a group of reports, a particular report can be designated as the master, and all the reports that correspond to the same bug are linked to it. Thus, duplicates need not be discarded, so that a complete description of the bug can be provided later. In this paper, a much more extended set of textual features is considered for bug report duplicate checking.
A Support Vector Machine classifier is used to classify an incoming bug report as duplicate or non-duplicate. The simulation of the prescribed model is done using the R statistical package. A sample of bug reports from the Mozilla repository is considered. The results of the simulation model establish that the proposed classifier has higher efficiency compared to the existing BM25F technique, which employs 25 feature sets.
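
The master/duplicate grouping described above maps onto a simple index structure; the class and report IDs below are illustrative, not taken from the paper.

```python
class TriageIndex:
    """One master report per unique bug; duplicates are linked to the
    master rather than discarded, so the full bug description survives."""

    def __init__(self):
        self.groups = {}  # master report id -> list of duplicate ids

    def add_master(self, report_id):
        self.groups[report_id] = []

    def add_duplicate(self, report_id, master_id):
        self.groups[master_id].append(report_id)

    def full_description_ids(self, master_id):
        # master first, then every duplicate: the complete bug picture
        return [master_id] + self.groups[master_id]

idx = TriageIndex()
idx.add_master("BR-1")
idx.add_duplicate("BR-7", "BR-1")
idx.add_duplicate("BR-9", "BR-1")
```

The number of keys in `groups` equals the number of unique bugs seen so far, exactly as the text requires.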

Vulnerability identification and classification via text mining bug databases

IECON 2014 - 40th Annual Conference of the IEEE Industrial Electronics Society, 2014

As critical and sensitive systems increasingly rely on complex software systems, identifying software vulnerabilities is becoming increasingly important. It has been suggested in previous work that some bugs are only identified as vulnerabilities long after the bug has been made public. These bugs are known as Hidden Impact Bugs (HIBs). This paper presents a hidden impact bug identification methodology by means of text mining bug databases. The presented methodology utilizes the textual description of the bug report for extracting textual information. The text mining process extracts syntactical information of the bug reports and compresses the information for easier manipulation. The compressed information is then utilized to generate a feature vector that is presented to a classifier. The proposed methodology was tested on Linux vulnerabilities that were discovered in the time period from 2006 to 2011. Three different classifiers were tested and 28% to 88% of the hidden impact bugs were identified correctly by using the textual information from the bug descriptions alone. Further analysis of the Bayesian detection rate showed the applicability of the presented method according to the requirements of a development team.
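
One common way to "compress" extracted textual information into a fixed-length feature vector for a classifier is the hashing trick; this sketch is an assumption about the general technique, not the paper's specific pipeline.

```python
import hashlib

def hashed_features(tokens, dim=16):
    """Compress a bag of words into a fixed-length count vector by
    hashing each token into one of `dim` buckets (hashing trick)."""
    vec = [0] * dim
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        vec[h % dim] += 1
    return vec

# invented bug-description text for illustration
desc = "buffer overflow in packet parser allows remote crash".split()
vec = hashed_features(desc)
```

Every description maps to the same vector length regardless of vocabulary size, which keeps the representation small and easy to feed to a classifier, at the cost of occasional bucket collisions.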

A Supervised Bug Report Classification with Incorporate and Textual field Knowledge

Procedia Computer Science, 2018

The performance of a bug prediction model depends directly on the misclassification of bug reports; misclassification sacrifices the accuracy of the system. Resolving this issue requires manual examination of bug reports, which is a very time-consuming and tedious job for developers and testers. In this paper, a hybrid approach merging text mining, natural language processing and machine learning techniques is used to identify a report as a bug or a non-bug. Four incorporated fields, together with the textual fields, are added to bug reports to improve classifier performance. TF-IDF and bigram feature extraction methods are used with feature selection and a k-nearest neighbor (K-NN) classifier. The performance of the proposed system is evaluated using precision, recall and F-measure on five datasets. It is observed that the performance of the K-NN classifier varies with the dataset, and that adding the bigram method improves classifier performance.
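
The unigram-plus-bigram K-NN pipeline can be sketched as below; the feature weighting, similarity measure, and training examples are simplified stand-ins, not the paper's exact setup.

```python
from collections import Counter

def features(tokens):
    # unigrams plus adjacent-word bigrams
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)

def similarity(a, b):
    # overlap of shared feature counts (simple stand-in for TF-IDF cosine)
    return sum(min(a[t], b[t]) for t in set(a) & set(b))

def knn_classify(query, train, k=3):
    """train: list of (token list, label). Majority vote over the k
    most similar labeled reports."""
    q = features(query)
    ranked = sorted(train, key=lambda ex: similarity(q, features(ex[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

train = [
    ("app crashes with null pointer".split(), "bug"),
    ("crash on save null pointer".split(), "bug"),
    ("please add dark mode".split(), "non-bug"),
    ("feature request add export".split(), "non-bug"),
]
label = knn_classify("null pointer crash when saving".split(), train)
```

The bigram "null pointer" matches as a single feature, which is the kind of extra signal the abstract credits for the bigram method's improvement over unigrams alone.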