Textual Analysis and Software Quality: Challenges and Opportunities

Expanding identifiers to normalize source code vocabulary

2011

Maintaining modern software requires significant tool support. Effective tools exploit a variety of information and techniques to aid a software maintainer. One area of recent interest in tool development exploits the natural language information found in source code. Such Information Retrieval (IR) based tools complement traditional static analysis tools and have tackled problems, such as feature location, that otherwise require considerable human effort.

Text mining and software engineering: an integrated source code and document analysis approach

IET Software, 2008

Documents written in natural languages constitute a major part of the artifacts produced during the software engineering lifecycle. Especially during software maintenance or reverse engineering, semantic information conveyed in these documents can provide important knowledge for the software engineer. In this paper, we present a text mining system capable of populating a software ontology with information detected in documents. A particular novelty is the integration of results from automated source code analysis into a natural language processing (NLP) pipeline, which makes it possible to cross-link software artifacts represented in code and natural language on a semantic level.

Labeling source code with information retrieval methods: an empirical study

Empirical Software Engineering, 2013

Context: To support program comprehension, software artifacts can be labeled (for example, within software visualization tools) with a set of representative words, hereafter referred to as labels. Such labels can be obtained using various approaches, including Information Retrieval (IR) methods or other simple heuristics. They provide a bird's-eye view of the source code, allowing developers to look over software components quickly and make more informed decisions on which parts of the source code they need to analyze in detail. However, few empirical studies have been conducted to verify whether the extracted labels make sense to software developers. Aim: This paper investigates (i) to what extent various IR techniques and other simple heuristics overlap with (and differ from) labeling performed by humans, (ii) what kinds of source code terms humans use when labeling software artifacts, and (iii) what factors, in particular what characteristics of the artifacts to be labeled, influence the performance of automatic labeling techniques. Method: We conducted two experiments in which we asked a group of subjects (38 in total) to label 20 classes from two Java software systems, JHotDraw and eXVantage. Then, we analyzed to what extent the words identified with an automated technique (including Vector Space Models, Latent Semantic Indexing, Latent Dirichlet Allocation, as well as customized heuristics extracting words from specific source code elements) overlap with those identified by humans. Results: Results indicate that, in most cases, simpler automatic labeling techniques (based on the use of words extracted from class and method names as well as from class comments) better reflect human-based labeling. Indeed, clustering-based approaches (LSI and LDA) are

Normalizing source code vocabulary

2010

Information Retrieval (IR) based tools complement traditional static and dynamic analysis tools by exploiting the natural language found within a program's text. Tools incorporating IR have tackled problems, such as feature location, that previously required considerable human effort. However, to reap the full benefit of IR-based techniques, the language used across all software artifacts (e.g., requirements and design documents, test plans, as well as the source code) must be consistent.
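One common first step toward a consistent vocabulary is to split identifiers into words and expand known abbreviations. The following is a minimal sketch of that idea, not the method of the paper above: the function names and the small `ABBREVIATIONS` dictionary are hypothetical, and a real tool would need a far richer source of expansions.

```python
import re

# Hypothetical abbreviation dictionary (an assumption for illustration);
# a real tool would derive expansions from comments, documentation,
# or a curated word list.
ABBREVIATIONS = {
    "cnt": "count",
    "idx": "index",
    "msg": "message",
    "btn": "button",
}


def split_identifier(identifier):
    """Split an identifier on underscores and camelCase boundaries."""
    words = []
    for part in re.split(r"[_\W]+", identifier):
        # Matches acronyms (HTTP), capitalized or lowercase words, digit runs.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [w.lower() for w in words]


def normalize_identifier(identifier, abbreviations=ABBREVIATIONS):
    """Split an identifier and replace known abbreviations with full words."""
    return [abbreviations.get(w, w) for w in split_identifier(identifier)]


print(normalize_identifier("msgCnt"))            # ['message', 'count']
print(normalize_identifier("HTTPResponse_idx"))  # ['http', 'response', 'index']
```

Words missing from the dictionary pass through unchanged, so the sketch never guesses an expansion it has no evidence for.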

Descriptive compound identifier names improve source code comprehension

Proceedings of the 26th Conference on Program Comprehension, 2018

Reading and understanding source code is a major task in software development. Code comprehension depends on the quality of code, which is impacted by code structure and identifier naming. In this paper we empirically investigated whether longer but more descriptive identifier names improve code comprehension compared to short names, as they represent useful information in more detail. In a web-based study 88 Java developers were asked to locate a semantic defect in source code snippets. With descriptive identifier names, developers spent more time in the lines of code before the actual defect occurred and changed their reading direction less often, finding the semantic defect about 14% faster than with shorter but less descriptive identifier names. These effects disappeared when developers searched for a syntax error, i.e., when no in-depth understanding of the code was required. Interestingly, the style of identifier names had a clear impact on program comprehension for more experienced developers but not for less experienced developers.

CCS Concepts: Human-centered computing → Empirical studies in HCI; Software and its engineering → Software usability; Error handling and recovery; Maintaining software

Semantic Impact and Faults in Source Code Changes: An Empirical Study

2009 Australian Software Engineering Conference, 2009

Changes to source code have become a critical factor in fault predictions. Textual or syntactic approaches have been widely used. Textual analysis focuses on changed text fragments while syntactic analysis focuses on changed syntactic entities. Although both of them have demonstrated their advantages in experimental results, they only study code fragments modified during changes. Because of semantic dependencies within programs, we believe that code fragments impacted by changes are also helpful. Given a source code change, we identify its impact by program slicing along the variable def-use chains. To evaluate the effectiveness of change impacts in fault detection and prediction, we compare impacted code with changed code according to size and fault density. Our experiment on the change history of a successful industrial project shows that: fault density in changed and impacted fragments is higher than in other areas; for large changes, their impacts have higher fault density than the changes themselves; interferences within change impact contribute to the high fault density in large changes. Our study suggests that, like change itself, change impact is also a high-priority indicator in fault prediction, especially for large-scale changes.
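The idea of following def-use chains from a changed statement can be illustrated on a toy example. The sketch below is a simplification under stated assumptions, not the authors' implementation: it handles only straight-line code (no branches or loops), and the statement encoding, function names, and sample program are all hypothetical.

```python
def def_use_edges(statements):
    """Map each line to the lines that use the value it defines.

    statements: list of (line, defined_var, used_vars) tuples in execution
    order, for straight-line code only (an assumption of this sketch).
    """
    last_def = {}  # variable -> line of its most recent definition
    edges = {line: set() for line, _, _ in statements}
    for line, defined, used in statements:
        for var in used:
            if var in last_def:
                edges[last_def[var]].add(line)
        if defined is not None:
            last_def[defined] = line
    return edges


def impacted_lines(statements, changed_line):
    """Collect every line reachable from changed_line along def-use edges."""
    edges = def_use_edges(statements)
    impacted, worklist = set(), [changed_line]
    while worklist:
        current = worklist.pop()
        for successor in edges.get(current, ()):
            if successor not in impacted:
                impacted.add(successor)
                worklist.append(successor)
    return impacted


# Hypothetical five-line program, encoded by hand.
program = [
    (1, "a", []),          # a = read()
    (2, "b", ["a"]),       # b = a + 1
    (3, "c", []),          # c = 10
    (4, "d", ["b", "c"]),  # d = b * c
    (5, None, ["d"]),      # print(d)
]

print(sorted(impacted_lines(program, 1)))  # [2, 4, 5]
print(sorted(impacted_lines(program, 3)))  # [4, 5]
```

Here a change to line 1 touches one statement but impacts three others, which is the kind of changed-versus-impacted distinction the study measures.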

Semantic clustering: Identifying topics in source code

2007

Many of the existing approaches in Software Comprehension focus on program structure or external documentation. However, by analyzing only formal information, the informal semantics contained in the vocabulary of source code are overlooked. To understand software as a whole, we need to enrich software analysis with the developer knowledge hidden in the code naming. This paper proposes the use of information retrieval to exploit linguistic information found in source code, such as identifier names and comments.
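As a rough illustration of comparing source artifacts by vocabulary, the sketch below computes cosine similarity between plain term-frequency vectors. This is a deliberately simpler stand-in for the information retrieval techniques the paper refers to, and the function name and sample term lists are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(terms_a, terms_b):
    """Cosine similarity between two bags of terms (0.0 if either is empty)."""
    a, b = Counter(terms_a), Counter(terms_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Hypothetical terms extracted from identifiers and comments of three classes.
parser = ["parse", "token", "syntax", "token"]
lexer = ["token", "scan", "char", "token"]
logger = ["log", "level", "write"]

# Classes that share vocabulary score higher than unrelated ones.
print(cosine_similarity(parser, lexer) > cosine_similarity(parser, logger))
```

Grouping artifacts whose pairwise similarity exceeds some threshold would then give a crude notion of topic; choosing that threshold is left open here.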

Using IR methods for labeling source code artifacts: Is it worthwhile?

2012 20th IEEE International Conference on Program Comprehension (ICPC), 2012

Information Retrieval (IR) techniques have been used for various software engineering tasks, including the labeling of software artifacts by extracting "keywords" from them. Such techniques include Vector Space Models, Latent Semantic Indexing, Latent Dirichlet Allocation, as well as customized heuristics extracting words from specific source code elements. This paper investigates how source code artifact labeling performed by IR techniques overlaps with (and differs from) labeling performed by humans. This was done by asking a group of subjects to label 20 classes from two Java software systems, JHotDraw and eXVantage. Results indicate that, in most cases, automatic labeling is more similar to human-based labeling when simpler techniques are used, while clustering-based approaches (LSI and LDA) are more worthwhile for source code artifacts that have high verbosity and that required more effort to label manually.
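To make the Vector Space Model style of labeling concrete, the sketch below ranks the terms of each artifact by a basic TF-IDF weight and keeps the top few as labels. It is an assumption-laden illustration rather than the configuration used in the study: the weighting formula is one common TF-IDF variant, and the function name and the three-class corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_labels(documents, top_k=3):
    """Return the top_k highest-weighted TF-IDF terms for each document.

    documents: dict mapping an artifact name to its list of terms.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for terms in documents.values():
        doc_freq.update(set(terms))

    labels = {}
    for name, terms in documents.items():
        tf = Counter(terms)
        weights = {
            term: (count / len(terms)) * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()
        }
        # Sort by descending weight, then alphabetically for determinism.
        ranked = sorted(weights, key=lambda t: (-weights[t], t))
        labels[name] = ranked[:top_k]
    return labels


# Hypothetical terms extracted from three classes.
corpus = {
    "FigureEditor": ["figure", "draw", "figure", "editor", "tool"],
    "UndoManager": ["undo", "redo", "command", "undo", "tool"],
    "FileLoader": ["file", "load", "path", "file", "tool"],
}

for name, words in tfidf_labels(corpus, top_k=2).items():
    print(name, words)
```

The term "tool" appears in every class, so its inverse document frequency is zero and it is never chosen ahead of more distinctive terms.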

Analyzing the evolution of the source code vocabulary

European Conference …, 2009

Source code is a mixed software artifact, containing information for both the compiler and the developers. While programming language grammar dictates how the source code is written, developers have a lot of freedom in writing identifiers and comments. These are intentional in nature and become means of communication between developers. The goal of this paper is to analyze how the source code vocabulary changes during evolution, through an exploratory study of two software systems. Specifically, we collected ...