Software Engineering Domain Knowledge to Identify Duplicate Bug Reports
Related papers
DURFEX: A Feature Extraction Technique for Efficient Detection of Duplicate Bug Reports
The detection of duplicate bug reports can help reduce the time needed to handle field crashes. This is especially important for software companies with a large client base, where multiple customers may submit bug reports caused by the same fault. Several techniques exist for detecting duplicate bug reports; many of them apply classification techniques to information extracted from stack traces, classifying each report using the functions invoked in the stack trace associated with the bug report. The problem is that typical bug repositories may contain stack traces with tens of thousands of distinct functions, which leads to the curse of dimensionality. In this paper, we propose a feature extraction technique that reduces the feature size while retaining the information most critical for classification. The proposed approach starts by abstracting stack traces of function calls into sequences of package names, replacing each function with the package in which it is defined. We then segment these traces into N-grams of variable length and map them to fixed-size sparse feature vectors, which are used to measure the distance between the stack trace of an incoming bug report and the stack traces of a historical set of bug reports. A linear combination of stack trace similarity and non-textual fields such as component and severity is then used to measure the distance between a bug report and the historical set. We show the effectiveness of our approach by applying it to the Eclipse bug repository, which contains tens of thousands of bug reports. Our approach outperforms an approach that uses distinct function names, while significantly reducing the processing time.
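As a rough illustration of the kind of pipeline this abstract describes, the sketch below abstracts a stack trace into package names, extracts variable-length N-grams, and compares two traces with a cosine distance. The function-to-package mapping, trace contents, and distance choice are illustrative assumptions, not the authors' implementation (DURFEX also folds in non-textual fields such as component and severity, which are omitted here).

```python
# Sketch of DURFEX-style feature extraction (assumed mapping and distance, not the authors' code).
from collections import Counter
from math import sqrt

def to_package_trace(stack_trace, package_of):
    """Abstract a stack trace (list of function names) into package names."""
    return [package_of.get(fn, "unknown") for fn in stack_trace]

def ngrams(seq, max_n=3):
    """Segment a package sequence into N-grams of variable length (1..max_n)."""
    grams = []
    for n in range(1, max_n + 1):
        grams += [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]
    return Counter(grams)  # sparse feature vector as a dict of N-gram counts

def cosine_distance(a, b):
    """Distance between two sparse N-gram vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)

# Hypothetical usage: compare an incoming trace against a historical one.
package_of = {"open_file": "org.eclipse.core", "render": "org.eclipse.ui"}
incoming = ngrams(to_package_trace(["open_file", "render"], package_of))
historical = ngrams(to_package_trace(["render", "open_file"], package_of))
print(cosine_distance(incoming, historical))
```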
Analyzing the Impact of Similarity Measures in Duplicate Bug Report Detection
IAEME PUBLICATION, 2020
Duplicate bug report detection is an important task performed when assigning bug reports to the appropriate developers. Because the bug reports of open-source projects are usually submitted by people from many geographical locations, the submission process is uncoordinated, and this uncoordinated submission also leads to duplicate bug reports. The bug report triager usually has to go through the tedious process of manually detecting duplicate bug reports, and automatic duplicate bug report detection eases this work. Surveys show that comparing bug reports on the basis of similarity measures is the best way to perform duplicate bug report detection, since unbalanced data causes a class-imbalance problem for machine learning approaches. In this paper, we analyze how different similarity measures impact the task of duplicate bug report detection. For our analysis, we use the Levenshtein, Jaccard, cosine, BM25, LSI, and K-Means similarity measures. By including these measures, the analysis covers natural language processing, machine learning, and information retrieval techniques.
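For concreteness, here is a minimal sketch of three of the similarity measures listed above (Levenshtein, Jaccard, and cosine) applied to two bug report summaries; it assumes plain whitespace tokenization and omits BM25, LSI, and K-Means for brevity.

```python
# Minimal sketch of three similarity measures over two toy bug report summaries.
from collections import Counter
from math import sqrt

def levenshtein(a, b):
    """Character-level edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def jaccard(a, b):
    """Jaccard similarity over the word sets of two reports."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def cosine(a, b):
    """Cosine similarity over word-count vectors."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

r1 = "app crashes when opening a large file"
r2 = "application crash on opening large files"
print(levenshtein(r1, r2), jaccard(r1, r2), cosine(r1, r2))
```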
Software bugs occur for a wide range of reasons. Bug reports can be generated automatically or drafted by users of the software, and they can accompany other malfunctions, especially in beta or unstable versions. Most often, these bug reports are enriched with user-contributed descriptions of what the user actually encountered. Addressing these bugs accounts for the majority of effort spent in the maintenance phase of a software project's life cycle. Frequently, several bug reports sent by different users correspond to the same defect; nevertheless, every bug report has to be analyzed separately and carefully for the possibility of a genuine bug. The person responsible for processing newly reported bugs, checking for duplicates, and passing them to suitable developers to be fixed is called a triager, and this process is called triaging. The utility of bug tracking systems is hindered by the large number of duplicate bug reports; in many open-source software projects, as many as one third of all reports are duplicates. Identifying duplicate bug reports is time-consuming and adds to the already high cost of software maintenance. In this dissertation, a model of the automated triaging process is proposed based on textual, categorical, and contextual similarity features. The contribution of this dissertation is twofold. In the proposed scheme, a total of 80 textual features are extracted from the bug reports. Moreover, topics are modeled from the complete text corpus using Latent Dirichlet Allocation (LDA). These topics are specific to the category, class, or functionality of the software; for example, possible topics for the Android bug repository might be Bluetooth, Download, Network, etc. Bug reports are analyzed for context to relate them to the domain-specific topics of the software, thereby enhancing the feature set used for computing similarity scores. Finally, two sets of duplicate and non-duplicate bug reports are formed for binary classification using a Support Vector Machine. Experiments are performed on a Bugzilla dataset. The proposed system improves the efficiency of duplicate checking by 15% compared to the contextual model proposed by Alipour et al., and it reduces development cost by improving duplicate checking while still allowing at least one bug report for each real defect to reach developers.
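The sketch below illustrates, with scikit-learn, the general shape of the pipeline this abstract describes: LDA topic distributions as contextual features and an SVM for duplicate versus non-duplicate pair classification. The pair construction, the absolute-difference pair feature, and the toy data are assumptions made for illustration; the dissertation's 80 textual features are not reproduced.

```python
# Rough sketch: LDA topics as contextual features plus an SVM for duplicate-pair classification.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import SVC

reports = [
    "bluetooth pairing fails after update",
    "cannot pair bluetooth headset since last update",
    "download stops at 99 percent on wifi",
    "network drops when switching to mobile data",
]
# Labeled pairs: (index_a, index_b, is_duplicate) -- toy data for illustration only.
pairs = [(0, 1, 1), (0, 2, 0), (2, 3, 0)]

vec = CountVectorizer(stop_words="english")
X_counts = vec.fit_transform(reports)
lda = LatentDirichletAllocation(n_components=3, random_state=0)
topics = lda.fit_transform(X_counts)           # per-report topic distributions

# Pair feature: element-wise absolute difference of the two topic vectors.
X = np.array([np.abs(topics[i] - topics[j]) for i, j, _ in pairs])
y = np.array([label for _, _, label in pairs])

clf = SVC(kernel="linear").fit(X, y)
print(clf.predict(np.abs(topics[0] - topics[1]).reshape(1, -1)))
```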
Towards Understanding the Impacts of Textual Dissimilarity on Duplicate Bug Report Detection
arXiv (Cornell University), 2022
About 40% of software bug reports are duplicates of one another, which poses a major overhead during software maintenance. Traditional techniques often focus on detecting duplicate bug reports that are textually similar. However, in bug tracking systems, many duplicate bug reports are not textually similar, and for these the traditional techniques may fall short. In this paper, we conduct a large-scale empirical study to better understand the impact of textual dissimilarity on the detection of duplicate bug reports. First, we collect a total of 92,854 bug reports from three open-source systems and construct two datasets containing textually similar and textually dissimilar duplicate bug reports. We then determine the performance of three existing techniques in detecting duplicate bug reports and show that their performance is significantly poorer for textually dissimilar duplicate reports. Second, we analyze the two groups of bug reports using a combination of descriptive analysis, word embedding visualization, and manual analysis. We find that textually dissimilar duplicate bug reports often miss important components (e.g., expected behaviors and steps to reproduce), which could explain their textual differences and the poor performance of existing techniques. Finally, we apply domain-specific embeddings to the duplicate bug report detection problem, with mixed results. These findings warrant further investigation and more effective solutions for detecting textually dissimilar duplicate bug reports.
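As an illustration of the domain-specific embedding idea mentioned at the end of the abstract, the sketch below trains word vectors on a toy bug-report corpus with gensim and compares two reports by the cosine similarity of their averaged vectors; the corpus, model settings, and averaging scheme are assumptions rather than the study's actual configuration.

```python
# Illustrative sketch: domain-specific word embeddings trained on the bug-report corpus itself.
import numpy as np
from gensim.models import Word2Vec

corpus = [
    "browser crashes when opening pdf attachment".split(),
    "pdf viewer causes crash on open".split(),
    "dark theme resets after restart".split(),
]
model = Word2Vec(corpus, vector_size=50, min_count=1, seed=0)

def report_vector(tokens):
    """Average the word vectors of the tokens present in the vocabulary."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

a, b = report_vector(corpus[0]), report_vector(corpus[1])
similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(similarity)
```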
Duplicate bug report detection with a combination of information retrieval and topic modeling
2012
Abstract: Detecting duplicate bug reports helps reduce triaging effort and saves developers time in fixing the same issues. Among several automated detection approaches, text-based information retrieval (IR) approaches have been shown to outperform others in terms of both accuracy and time efficiency. However, those IR-based approaches do not detect well duplicate reports that describe the same technical issue in different terms.
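A minimal sketch of the kind of combination this abstract hints at: an IR score (TF-IDF cosine) blended with a topic-model score (LDA topic-distribution cosine) through a weighted sum. The specific features, scikit-learn components, and 0.7/0.3 weights are illustrative assumptions, not the paper's method.

```python
# Sketch of combining an IR similarity with a topic-model similarity (weights assumed).
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

reports = [
    "editor freezes when saving a large project",
    "saving big projects hangs the editor",
    "font rendering broken on high dpi displays",
]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(reports)
ir_score = cosine_similarity(tfidf)            # textual (IR) similarity

counts = CountVectorizer(stop_words="english").fit_transform(reports)
topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)
topic_score = cosine_similarity(topics)        # topic-level similarity

combined = 0.7 * ir_score + 0.3 * topic_score  # weighted combination (weights are an assumption)
print(combined[0, 1], combined[0, 2])          # candidate ranking for report 0
```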
An HMM-based approach for automatic detection and classification of duplicate bug reports
Information and Software Technology, 2019
Context: Software projects rely on their issue tracking systems to guide the maintenance activities of software developers. Bug reports submitted to these systems carry crucial information about the nature of a crash (such as text from users or developers and execution information about the functions running before the crash occurred). Typically, big software projects receive thousands of reports every day. Objective: The aim is to reduce the time and effort required to fix bugs while improving overall software quality. Previous studies have shown that a large proportion of bug reports are duplicates of previously reported ones; for example, as many as 30% of all reports for Firefox are duplicates. Method: While there exists a wide variety of approaches to automatically detect duplicate bug reports by natural language processing, only a few approaches have considered the execution information (the so-called stack traces) inside bug reports. In this paper, we propose a novel approach that automatically detects duplicate bug reports using stack traces and Hidden Markov Models. Results: When applying our approach to the Firefox and GNOME datasets, we show that, for Firefox, the average recall for rank k=1 is 59% and for rank k=2 it is 75.55%; recall reaches 90% from k=10 onward, and the Mean Average Precision (MAP) value is up to 76.5%. For GNOME, the recall at k=1 is around 63%, increasing by about 10% for k=2 and reaching 97% at k=11; a MAP value of up to 73% is achieved. Conclusion: We show that HMMs and stack traces are a powerful combination for detecting and classifying duplicate bug reports in large bug repositories.
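The sketch below conveys the general idea of scoring an incoming stack trace against a model trained on historical traces, here using hmmlearn's CategoricalHMM; the shared vocabulary, toy traces, and parameters are assumptions and do not reproduce the paper's actual model.

```python
# Rough sketch: score an incoming stack trace against an HMM trained on historical traces.
import numpy as np
from hmmlearn.hmm import CategoricalHMM

# Toy vocabulary of stack-frame functions, shared by all traces (assumed).
vocab = {"XRE_main": 0, "nsAppShell::Run": 1, "js::RunScript": 2, "mozalloc_abort": 3}

def encode(trace):
    """Map a trace (list of function names) to an integer column vector."""
    return np.array([[vocab[f]] for f in trace])

# Historical traces grouped under one hypothetical known bug.
historical = [["XRE_main", "nsAppShell::Run", "js::RunScript", "mozalloc_abort"],
              ["XRE_main", "js::RunScript", "mozalloc_abort", "nsAppShell::Run"]]
X = np.concatenate([encode(t) for t in historical])
lengths = [len(t) for t in historical]

hmm = CategoricalHMM(n_components=2, n_iter=50, random_state=0)
hmm.fit(X, lengths)

# Score an incoming trace: a higher log-likelihood suggests a duplicate of this bug.
incoming = encode(["XRE_main", "nsAppShell::Run", "js::RunScript", "mozalloc_abort"])
print(hmm.score(incoming))
```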
Automatic clustering of bug reports
International Journal of Advanced Computer Research
It is widely accepted that most development cost is spent on maintenance and most of the maintenance cost is spent on comprehension. Maintainers need to understand the current status of the code before updating it. For this reason, they examine previous change requests and previous code changes to understand how the current code evolved. The problem that faces them is how to locate the previous change requests that handled a specific feature or topic in the code. Quickly locating related previous change requests helps developers quickly understand the current status of the code and hence reduces the maintenance cost, which is our ultimate goal. This paper proposes an automated technique to identify related previous change requests stored in bug reports. The technique clusters bug reports based on their textual similarities; the result is a set of disjoint clusters of related bug reports that share a common issue, topic, or feature. A set of terms is extracted from each cluster, as tags, to help maintainers understand the issue, topic, or feature handled by the bug reports in that cluster. An experimental study is presented and discussed, followed by a manual evaluation of the bug reports in the generated clusters.
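As a small illustration of the clustering-plus-tags idea, the sketch below clusters toy bug report summaries with TF-IDF and k-means and tags each cluster with its highest-weighted terms; the number of clusters, tag count, and data are assumptions, not the paper's setup.

```python
# Small sketch: cluster bug reports by textual similarity and extract tags per cluster.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

reports = [
    "crash when printing to pdf",
    "printing a document crashes the app",
    "login fails with wrong password message",
    "cannot log in even with correct password",
]
vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(reports)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
terms = np.array(vec.get_feature_names_out())

# Tag each cluster with its highest-weighted terms (largest centroid components).
for c in range(km.n_clusters):
    top = km.cluster_centers_[c].argsort()[::-1][:3]
    members = list(np.where(km.labels_ == c)[0])
    print(f"cluster {c}: tags={list(terms[top])}, reports={members}")
```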
Measuring the semantic similarity of comments in bug reports
2008
Abstract: Bug-tracking systems, such as Bugzilla, contain a large amount of information about software defects, most of it stored in textual rather than structured form. This information is used not only for locating and fixing bugs, but also for detecting duplicate bugs, triaging incoming bugs, automatically assigning bugs to developers, etc. Given the importance of the textual information in bug reports, it is desirable that this text be highly coherent, so that readers can easily understand it.
Duplication Detection for Software Bug Reports Based on BM25 Term Weighting
2012 Conference on Technologies and Applications of Artificial Intelligence, 2012
Handling bug reports is an important issue in software maintenance, and the detection of duplicate bug reports has recently received much attention. There are two main reasons: first, duplicate bug reports waste human resources on processing redundant reports; second, duplicate bug reports can provide abundant information for further software maintenance. In past studies, many schemes have been proposed using information retrieval and natural language processing techniques. In this thesis, we propose a novel detection scheme based on BM25 term weighting. We have conducted empirical experiments on three open-source projects: Apache, ArgoUML, and SVN. The experimental results show that the BM25-based scheme effectively improves detection performance in nearly all cases.
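For reference, here is a minimal BM25 scoring sketch corresponding to the term-weighting idea above; k1 and b use common default values, which is an assumption rather than the paper's tuning, and the toy documents are illustrative only.

```python
# Minimal BM25 scoring sketch: rank candidate reports against a query report.
from collections import Counter
from math import log

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document (list of tokens) against a query (list of tokens)."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))      # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            idf = log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            denom = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1) / denom
        scores.append(s)
    return scores

docs = ["svn commit hangs on large binary files".split(),
        "argouml diagram export produces blank image".split(),
        "commit of big binary file never finishes in svn".split()]
query = "svn commit hangs large file".split()
print(bm25_scores(query, docs))   # higher score = more likely duplicate candidate
```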