DURFEX: A Feature Extraction Technique for Efficient Detection of Duplicate Bug Reports

An HMM-based approach for automatic detection and classification of duplicate bug reports

Information and Software Technology, 2019

Context: Software projects rely on their issue tracking systems to guide the maintenance activities of software developers. Bug reports submitted to these systems carry crucial information about the nature of a crash (such as text from users or developers and execution information about the functions running before the crash occurred). Typically, big software projects receive thousands of reports every day. Objective: The aim is to reduce the time and effort required to fix bugs while improving software quality overall. Previous studies have shown that a large number of bug reports are duplicates of previously reported ones. For example, as many as 30% of all reports for Firefox are duplicates. Method: While there exists a wide variety of approaches to automatically detect duplicate bug reports using natural language processing, only a few have considered the execution information (the so-called stack traces) inside bug reports. In this paper, we propose a novel approach that automatically detects duplicate bug reports using stack traces and Hidden Markov Models. Results: When applying our approach to Firefox and GNOME datasets, we show that, for Firefox, the average recall at rank k=1 is 59% and at rank k=2 is 75.55%. We start reaching 90% recall from k=10. The Mean Average Precision (MAP) value is up to 76.5%. For GNOME, the recall at k=1 is around 63%, and this value increases by about 10% for k=2. The recall increases to 97% for k=11. A MAP value of up to 73% is achieved. Conclusion: We show that HMMs and stack traces are a powerful combination for detecting and classifying duplicate bug reports in large bug repositories.
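The recall-at-rank-k and MAP figures quoted in this abstract can be illustrated with a minimal sketch. It assumes the common information-retrieval definitions of both metrics; the paper's exact evaluation protocol may differ. The function names and the toy rankings are hypothetical and not taken from the paper.

```python
from typing import Hashable, List, Sequence, Set


def recall_at_k(rankings: Sequence[Sequence[Hashable]],
                relevant: Sequence[Set[Hashable]],
                k: int) -> float:
    """Fraction of queries with at least one relevant item in the top k."""
    hits = sum(
        1
        for ranked, rel in zip(rankings, relevant)
        if any(item in rel for item in ranked[:k])
    )
    return hits / len(rankings)


def average_precision(ranked: Sequence[Hashable], rel: Set[Hashable]) -> float:
    """Mean of the precision values at each position holding a relevant item."""
    hits = 0
    total = 0.0
    for position, item in enumerate(ranked, start=1):
        if item in rel:
            hits += 1
            total += hits / position
    return total / len(rel) if rel else 0.0


def mean_average_precision(rankings: Sequence[Sequence[Hashable]],
                           relevant: Sequence[Set[Hashable]]) -> float:
    """Average precision averaged over all queries."""
    scores: List[float] = [
        average_precision(ranked, rel)
        for ranked, rel in zip(rankings, relevant)
    ]
    return sum(scores) / len(scores)


# Hypothetical toy data: two incoming reports, each with one true duplicate group.
rankings = [["a", "b", "c"], ["x", "y", "z"]]
relevant = [{"b"}, {"x"}]
print(recall_at_k(rankings, relevant, 1))
print(recall_at_k(rankings, relevant, 2))
print(mean_average_precision(rankings, relevant))
```

When each query has exactly one relevant item, as in the toy data, average precision reduces to the reciprocal of the rank at which that item appears.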

ANALYZING THE IMPACT OF SIMILARITY MEASURES IN DUPLICATE BUG REPORT DETECTION

IAEME PUBLICATION, 2020

Duplicate bug report detection is an important task performed when bug reports are assigned to the concerned developer. Because bug reports for open-source projects are submitted by people in many geographical locations, the submission process is uncoordinated, and this uncoordinated submission leads to duplicate bug reports. The bug report triager usually has to go through the tedious process of detecting duplicate bug reports manually. Automatic duplicate bug report detection eases this work. A survey shows that comparing bug reports using similarity measures is the best way to perform duplicate bug report detection, because the unbalanced data causes an imbalance problem for machine learning approaches. In this paper, we analyze how different similarity measures affect the task of duplicate bug report detection. For our analysis, we use the Levenshtein, Jaccard, Cosine, BM25, LSI and K-Means similarity measures. By including these measures, the analysis covers Natural Language Processing, Machine Learning and Information Retrieval techniques.
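Three of the measures named in this abstract (Levenshtein, Jaccard and Cosine) can be sketched with the standard library alone. This is an illustrative sketch, not the paper's implementation: whitespace tokenization, raw term counts for the cosine vectors, and the example report summaries are all assumptions.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace (an assumed, simplistic tokenizer)."""
    return text.lower().split()


def jaccard(a: str, b: str) -> float:
    """Size of the token-set intersection divided by the size of the union."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine(a: str, b: str) -> float:
    """Cosine of the angle between raw term-count vectors."""
    count_a, count_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(count_a[t] * count_b[t] for t in count_a.keys() & count_b.keys())
    norm_a = math.sqrt(sum(v * v for v in count_a.values()))
    norm_b = math.sqrt(sum(v * v for v in count_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Hypothetical report summaries.
first = "app crashes on start"
second = "app crashes on exit"
print(jaccard(first, second), cosine(first, second), levenshtein(first, second))
```

Note that Levenshtein is a distance (lower means more similar), whereas Jaccard and Cosine are similarities (higher means more similar), so they need to be put on a common footing before being compared.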

An Approach to Detecting Duplicate Bug Reports using N-gram Features and Cluster Shrinkage Technique

A duplicate bug report describes a problem for which there is already a report in a bug repository. For many open source projects, duplicate reports represent a significant percentage of the repository, so automatic identification of duplicate reports is very important and helps avoid wasting the time a triager spends searching for duplicates of an incoming report. In this paper we present a novel approach that improves duplicate bug report identification. The proposed approach has two novel features: first, it uses n-gram features for the task of duplicate bug report detection; second, it applies a cluster shrinkage technique to improve detection performance. We tested our approach on three popular open source projects: Apache, ArgoUML, and SVN. We have also conducted empirical studies. The experimental results show that the proposed scheme can effectively improve detection performance compared with previous methods.
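The n-gram feature idea can be sketched as follows. The abstract does not say whether the n-grams are over characters or words, so character n-grams are an assumption here, as are the function name and the normalization step. The cluster shrinkage step is not shown.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams after simple normalization.

    Normalization (lowercasing, collapsing whitespace) is an assumption,
    not something the abstract specifies.
    """
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


# Hypothetical report summary.
features = char_ngrams("Crash on startup", n=3)
print(features.most_common(5))
```

Character n-grams tend to tolerate misspellings and inflected word forms better than whole-word features, since two variants of a word still share most of their n-grams.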

Towards Understanding the Impacts of Textual Dissimilarity on Duplicate Bug Report Detection

arXiv (Cornell University), 2022

About 40% of software bug reports are duplicates of one another, which poses a major overhead during software maintenance. Traditional techniques often focus on detecting duplicate bug reports that are textually similar. However, in bug tracking systems, many duplicate bug reports might not be textually similar, and for these the traditional techniques might fall short. In this paper, we conduct a large-scale empirical study to better understand the impacts of textual dissimilarity on the detection of duplicate bug reports. First, we collect a total of 92,854 bug reports from three open-source systems and construct two datasets containing textually similar and textually dissimilar duplicate bug reports. Then we determine the performance of three existing techniques in detecting duplicate bug reports and show that their performance is significantly poor for textually dissimilar duplicate reports. Second, we analyze the two groups of bug reports using a combination of descriptive analysis, word embedding visualization, and manual analysis. We found that textually dissimilar duplicate bug reports often miss important components (e.g., expected behaviors and steps to reproduce), which could lead to their textual differences and to the poor performance of the existing techniques. Finally, we apply domain-specific embedding to duplicate bug report detection problems, which shows mixed results. All these findings warrant further investigation and more effective solutions for detecting textually dissimilar duplicate bug reports.

Software Engineering Domain Knowledge to Identify Duplicate Bug Reports

Earlier, many methodologies were proposed for detecting duplicate bug reports by comparing the textual content of bug reports to subject-specific contextual material, namely lists of software-engineering terms such as non-functional requirements and architecture keywords. When a bug report includes a word from one of these contextual word lists, the bug report is considered to be associated with that context, and this information is likely to improve bug-deduplication methods. Here, we propose a technique to partially automate the extraction of contextual word lists from software-engineering literature. Evaluating this software-literature context technique on real-world bug reports yields useful results, which indicate that this semi-automated method has the potential to significantly decrease the manual effort used in contextual bug deduplication while suffering only a minor loss in accuracy.

Online SVM based Optimized Bug Report Triaging using Feature Extraction

Triage is a medical term referring to the process of prioritizing patients based on the severity of their condition so as to maximize benefit (help as many as possible) when resources are limited. Bug report triaging is a process in which tracker issues are screened and prioritized. Triage should help ensure that all reported issues are properly managed: bugs as well as improvements and feature requests. The large number of new bug reports received in the bug repositories of software systems makes their management a challenging task. Handling these reports manually is time consuming and often delays the resolution of important bugs. The most critical issue with bug reports is that their number is vast and most of them are duplicates of some previously submitted bug report. The solution to this problem requires that bug reports be categorized into groups, where each group consists of all the bug reports that belong to the same bug, and the number of groups equals the number of unique bugs addressed so far. A bug report corresponding to a new bug is placed in a separate group, followed by its duplicates, if any. Classifying whether a bug report submitted by a user, written in natural language, is a duplicate or a unique report is a time-consuming task, especially when the number of bug reports received is large. Thus, this process needs to be automated. Bug reports have textual, contextual and categorical features, and these features need to be extracted to distinguish duplicates from non-duplicates. Moreover, within a group of reports, a particular report can be designated as the master, and all the reports that correspond to the same bug are linked to it. Thus, duplicates need not be discarded, so that a complete description of the bug can be provided later. In this paper, a much more extended set of textual features is considered for bug report duplicate checking. A Support Vector Machine classifier is used to classify each incoming bug report as duplicate or non-duplicate. The simulation of the prescribed model is done using the R statistical package. A sample of bug reports from the Mozilla repository is considered. Results of the simulation establish that the proposed classifier has higher efficiency than the existing technique BM25F, which employs 25 feature sets.

A Dataset of High Impact Bugs: Manually-Classified Issue Reports

2015

The importance of supporting test and maintenance activities in software development has been increasing, since recent software systems have become large and complex. Although in the field of Mining Software Repositories (MSR) there are many promising approaches to predicting, localizing, and triaging bugs, most of them do not consider the impact of each bug on users and developers but rather treat all bugs with equal weighting, except for a few studies on high impact bugs including security, performance, blocking, and so forth. To make MSR techniques more actionable and effective in practice, we need a deeper understanding of high impact bugs. In this paper we introduce our dataset of high impact bugs, which was created by manually reviewing four thousand issue reports in four open source projects (Ambari, Camel, Derby and Wicket).

Bug Report Triaging Using Textual, Categorical and Contextual Features Using Latent Dirichlet Allocation

Software bugs occur for a wide range of reasons. Bug reports can be generated automatically or drafted by users of the software. Bug reports can also accompany other malfunctions of the software, mostly in beta or unstable versions. Most often, these bug reports are enriched with the users' own accounts of what they actually experienced. Addressing these bugs accounts for the majority of the effort spent in the maintenance phase of a software project life cycle. Most often, several bug reports sent by different users correspond to the same defect. Nevertheless, every bug report has to be analyzed separately and carefully for the possibility of a potential bug. The person responsible for processing newly reported bugs, checking for duplicates and passing them to suitable developers to get fixed is called a triager, and this process is called triaging. The utility of bug tracking systems is hindered by a large number of duplicate bug reports. In many open source software projects, as many as one third of all reports are duplicates. Identifying duplicate bug reports is time consuming and adds to the already high cost of software maintenance. In this dissertation, a model of an automated triaging process is proposed based on textual, categorical and contextual similarity features. The contribution of this dissertation is twofold. First, in the proposed scheme a total of 80 textual features are extracted from the bug reports. Second, topics are modeled from the complete text corpus using Latent Dirichlet Allocation (LDA). These topics are specific to the category, class or functionality of the software. For example, a possible list of topics for an Android bug repository might be Bluetooth, Download, Network, etc. Bug reports are analyzed for context to relate them to the domain-specific topics of the software, thereby enhancing the feature set used for computing the similarity score. Finally, two sets are made, of duplicate and non-duplicate bug reports, for binary classification using a Support Vector Machine. Simulation is performed over a Bugzilla dataset. The proposed system improves the efficiency of duplicate checking by 15% compared to the contextual model proposed by Anahita Alipour et al. The system is able to reduce development cost by improving duplicate checking while allowing at least one bug report for each real defect to reach developers.

Automatic clustering of bug reports

International Journal of Advanced Computer Research

It is widely accepted that most development cost is spent on maintenance and that most of the maintenance cost is spent on comprehension. Maintainers need to understand the current status of the code before updating it. For this reason, they examine previous change requests and previous code changes to understand how the current code evolved. The problem they face is how to locate related previous change requests that handled a specific feature or topic in the code. Quickly locating previous related change requests helps developers quickly understand the current status of the code and hence reduces the maintenance cost, which is our ultimate goal. This paper proposes an automated technique to identify related previous change requests stored in bug reports. The technique is based on clustering bug reports by their textual similarities. The result of the clustering is a set of disjoint clusters of related bug reports that share a common issue, topic or feature. A set of terms is extracted from each cluster, as tags, to help maintainers understand the issue, topic or feature handled by the bug reports in the cluster. An experimental study is conducted and discussed, followed by a manual evaluation of the bug reports in the generated clusters.
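The idea of grouping reports into disjoint clusters by textual similarity and then tagging each cluster can be sketched as below. The abstract does not name the clustering algorithm, so the greedy single-pass scheme, the Jaccard measure, the threshold value and all function names are assumptions made purely for illustration.

```python
from collections import Counter
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased whitespace tokens (an assumed, simplistic tokenizer)."""
    return set(text.lower().split())


def cluster_reports(reports: List[str], threshold: float = 0.3) -> List[List[int]]:
    """Greedy single-pass clustering into disjoint clusters.

    Each report joins the cluster whose first member it resembles most
    (Jaccard similarity), provided the similarity reaches the threshold;
    otherwise it starts a new cluster.
    """
    clusters: List[List[int]] = []
    for index, report in enumerate(reports):
        report_tokens = tokens(report)
        best_cluster, best_similarity = None, threshold
        for cluster in clusters:
            representative = tokens(reports[cluster[0]])
            union = report_tokens | representative
            similarity = len(report_tokens & representative) / len(union) if union else 0.0
            if similarity >= best_similarity:
                best_cluster, best_similarity = cluster, similarity
        if best_cluster is None:
            clusters.append([index])
        else:
            best_cluster.append(index)
    return clusters


def cluster_tags(reports: List[str], cluster: List[int], top_n: int = 3) -> List[str]:
    """Terms appearing in the most reports of the cluster, used as tags."""
    counts = Counter(term for i in cluster for term in tokens(reports[i]))
    return [term for term, _ in counts.most_common(top_n)]


# Hypothetical report summaries.
reports = [
    "login page crashes on submit",
    "login page crashes on load",
    "export to pdf fails",
]
clusters = cluster_reports(reports)
print(clusters)
print([cluster_tags(reports, cluster) for cluster in clusters])
```

A practical version would likely remove stop words before tagging, so that terms such as "on" do not appear as tags.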