An empirical investigation into a large-scale Java open source code repository
Related papers
The adherence of open source java programmers to standard coding practices
The 6th IASTED International Conference on Software …, 2002
The use of agreed-upon coding practices is believed to enhance program comprehension, which directly affects reuse and maintainability. This paper describes a controlled small-scale experiment that tries to determine how well open source Java programmers adhere to a set of well-publicized coding practices. The experiment evaluated 100 arbitrarily selected open source Java classes from different programmers with respect to 16 standard coding practices. The results of this experiment indicate that open source Java programmers do not always adhere to standard coding practices. It was found that only 4% of the subject classes have no violations of any of the 16 standard coding practices and there were only 5 of 16 coding practices that all subjects followed. It was also found that there are positive correlations between the number of violations found in a class and its lines-of-code, number of methods, and number of attributes.
Understanding the shape of Java software
ACM SIGPLAN Notices, 2006
Large amounts of Java software have been written since the language's escape into unsuspecting software ecology more than ten years ago. Surprisingly little is known about the structure of Java programs in the wild: about the way methods are grouped into classes and then into packages, the way packages relate to each other, or the way inheritance and composition are used to put these programs together. We present the results of the first in-depth study of the structure of Java programs. We have collected a number of Java programs and measured their key structural attributes. We have found evidence that some relationships follow power-laws, while others do not. We have also observed variations that seem related to some characteristic of the application itself. This study provides important information for researchers who can investigate how and why the structural relationships we find may have originated, what they portend, and how they can be managed.
Prevalence of ‘Atoms of Confusion’ in Open Source Java Systems: An Empirical Study
Authorea (Authorea), 2023
Atoms of confusion, or simply "atoms," are pieces of code that lead to misunderstanding while being interpreted. Previous research has shown that the presence of atoms has an effect on code readability. Aside from simple misunderstanding in a lab setting, atoms of confusion are common and meaningful in open source C and C++ projects, and are thus removed by bug-fix commits. However, due to syntactical differences between language paradigms, the prevalence of atoms may vary in projects written in other languages (e.g. Java), which is yet to be explored. In this study, the first step is taken towards investigating the prevalence of 12 different atoms in the 13 most popular open-source Java projects. The relationship between the presence of atoms and aspects of code maintainability is also studied. Results show that atoms are 4.7 times more prevalent in Java projects than in open source C/C++ projects, based on occurrence per line. The corpus contains a total of 1085223 atoms, which occur once every 4.8 lines. Some atoms are very rare (e.g. the Logic As Control Flow atom, which occurs once in 440060 lines). Others occur frequently (e.g. the Infix Operator Precedence atom, which occurs once in 6.4 lines). The impact of the presence of atoms on code maintainability is also explored, and correlations between atoms are investigated. Results indicate that object-oriented metrics contribute less to atom prevalence, whereas fine-grained code metrics have a relatively better association.
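The per-line figures quoted in this abstract can be related to each other with simple arithmetic. The sketch below is only an illustration: the function names are hypothetical, and the derived totals are implied by the rounded figures in the abstract (1085223 atoms, one per 4.8 lines), not numbers reported by the authors.

```python
# Back-of-the-envelope check of the rates quoted in the abstract.
# All derived values are approximations based on rounded inputs.

def implied_total_lines(atom_count: int, lines_per_atom: float) -> float:
    """Corpus size implied by a total atom count and a 'once every N lines' rate."""
    return atom_count * lines_per_atom


def implied_occurrences(total_lines: float, lines_per_occurrence: float) -> float:
    """Number of occurrences implied by a 'once in N lines' rate."""
    return total_lines / lines_per_occurrence


TOTAL_ATOMS = 1_085_223        # from the abstract
OVERALL_LINES_PER_ATOM = 4.8   # from the abstract (rounded)

corpus_lines = implied_total_lines(TOTAL_ATOMS, OVERALL_LINES_PER_ATOM)

# Rates for two individual atoms, as quoted in the abstract.
rare = implied_occurrences(corpus_lines, 440_060)   # Logic As Control Flow
common = implied_occurrences(corpus_lines, 6.4)     # Infix Operator Precedence

print(f"implied corpus size: about {corpus_lines:,.0f} lines")
print(f"Logic As Control Flow: about {rare:.0f} occurrences")
print(f"Infix Operator Precedence: about {common:,.0f} occurrences")
```

Under these assumptions the corpus comes out at roughly 5.2 million lines, which is why a "once in 440060 lines" atom would appear only about a dozen times.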
DR-Tools: a suite of lightweight open-source tools to measure and visualize Java source code
ICSME 2020 (Tool Demo Track), 2020
In Software Engineering, some of the most critical activities are maintenance and evolution. However, to perform both with quality, minimizing impacts and risks, developers first need to analyze and identify where the main problems come from. In this paper, we introduce DR-Tools Suite, a set of lightweight open-source tools that analyze and calculate source code metrics, allowing developers to visualize the results in different formats and graphs. Also, we define a set of heuristics to help the code analysis. We conducted two case studies (one academic and one industrial) to collect feedback on the tools suite, on how we will evolve the tools, as well as insights to develop new tools that support developers in their daily work. Videos: https://bit.ly/30weexX
An empirical study of package coupling in Java open-source
2010
Excessive coupling between object-oriented classes in systems is generally acknowledged as harmful and is recognised as a maintenance problem that can result in a higher propensity for faults in systems and a "stored up" future problem. Characterising and understanding coupling at different levels of abstraction is therefore important for both the project manager and the developer, each of whom has a vested interest in software quality. In this thesis, coupling trends are empirically investigated over multiple versions of seven Java open-source systems (OSS). The first investigation explores the trends in longitudinal changes to open-
A Study of "Wheat" and "Chaff" in Source Code
Natural language is robust against noise. The meaning of many sentences survives the loss of words, sometimes many of them. Some words in a sentence, however, cannot be lost without changing the meaning of the sentence. We call these words "wheat" and the rest "chaff". The word "not" in the sentence "I do not like rain" is wheat and "do" is chaff. For human understanding of the purpose and behavior of source code, we hypothesize that the same holds. To quantify the extent to which we can separate code into "wheat" and "chaff", we study a large (100M LOC), diverse corpus of real-world projects in Java. Since methods represent natural, likely distinct units of code, we use the ∼9M Java methods in the corpus to approximate a universe of "sentences." We "thresh", or lex, functions, then "winnow" them to extract their wheat by computing the function's minimal distinguishing subset (MINSET). Our results confirm that programs contain much chaff. On average, MINSETS have 1.56 words (none exceeds 6) and comprise 4% of their methods. Beyond its intrinsic scientific interest, our work offers the first quantitative evidence for recent promising work on keyword-based programming and insight into how to develop powerful, alternative programming systems.
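To make the "thresh" and "winnow" steps concrete, here is a minimal brute-force sketch. It assumes that a minimal distinguishing subset is the smallest set of a method's words that is not contained in the word set of any other method. The regex lexer, the function names, the toy corpus, and the size cap of 6 (borrowed from the "none exceeds 6" remark) are all illustrative assumptions; the authors' actual lexer and algorithm are not described in this abstract and are likely more sophisticated.

```python
import re
from itertools import combinations


def thresh(source: str) -> frozenset:
    """Crude stand-in for lexing: collect the identifier-like words of a method."""
    return frozenset(re.findall(r"[A-Za-z_]\w*", source))


def minset(target: frozenset, others: list, max_size: int = 6):
    """Brute-force search for a smallest subset of `target` that no other
    method's word set contains. Returns None if none is found within max_size."""
    words = sorted(target)  # sorted for deterministic output
    for size in range(1, min(max_size, len(words)) + 1):
        for combo in combinations(words, size):
            candidate = set(combo)
            if not any(candidate <= other for other in others):
                return candidate
    return None


# Hypothetical three-method corpus (Java bodies as plain strings).
corpus = {
    "sum": "int total = 0; for (int x : xs) total += x; return total;",
    "max": "int best = xs[0]; for (int x : xs) if (x > best) best = x; return best;",
    "count": "int n = 0; for (int x : xs) n++; return n;",
}
lexed = {name: thresh(body) for name, body in corpus.items()}

for name, words in lexed.items():
    others = [w for other, w in lexed.items() if other != name]
    print(name, "->", minset(words, others))
```

In this toy corpus a single word separates each method from the other two; words such as "int", "for" and "return" are shared by every method and so behave like chaff. The exhaustive search is exponential in the subset size, so it is only practical for small caps like the one used here.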
CAM: A Collection of Snapshots of GitHub Java Repositories Together with Metrics
arXiv (Cornell University), 2024
Even though numerous researchers require stable datasets along with source code and basic metrics calculated on them, neither GitHub nor any other code hosting platform provides such a resource. Consequently, each researcher must download their own data, compute the necessary metrics, and then publish the dataset somewhere to ensure it remains accessible indefinitely. Our CAM (stands for "Classes and Metrics") project addresses this need. It is open-source software capable of cloning Java repositories from GitHub, filtering out unnecessary files, parsing Java classes, and computing metrics such as Cyclomatic Complexity, Halstead Effort and Volume, C&K metrics, Maintainability Metrics, LCOM5 and HND, as well as some Git-based Metrics. At least once a year, we execute the entire script, a process which requires a minimum of ten days on a very powerful server, to generate a new dataset. Subsequently, we publish it on Amazon S3, thereby ensuring its availability as a reference for researchers. The latest archive of 2.2Gb that we published on the 2nd of March, 2024 includes 532K Java classes with 48 metrics for each class.
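As an illustration of what one of the listed metrics measures, the sketch below approximates McCabe's cyclomatic complexity for a single Java method by counting decision points in the source text and adding one. This is not CAM's implementation, which the abstract does not describe; the function name and the keyword list are assumptions, and a real tool would work on a parsed syntax tree instead of raw text.

```python
import re

# Decision points counted by this approximation: branching keywords,
# short-circuit boolean operators, and the ternary operator.
# Caveats: comments and string literals are not stripped, and '?' also
# matches generic wildcards such as List<?>, so results can be inflated.
DECISION_PATTERN = re.compile(r"\b(?:if|for|while|case|catch)\b|&&|\|\||\?")


def approx_cyclomatic_complexity(java_method_source: str) -> int:
    """Return 1 plus the number of decision points found in the text."""
    return 1 + len(DECISION_PATTERN.findall(java_method_source))


sample = """
int sign(int x) {
    if (x > 0 && x < 100) return 1;
    for (int i = 0; i < 3; i++) { }
    return x == 0 ? 0 : -1;
}
"""

# Decision points in the sample: if, &&, for, ?  (four in total)
print(approx_cyclomatic_complexity(sample))  # prints 5
```

A straight-line method with no branches scores 1, and each additional branch, loop, or short-circuit condition adds one independent path.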
The Order of Things: How developers sort fields and methods
2012 28th IEEE International Conference on Software Maintenance (ICSM), 2012
In source code files, fields and methods are arranged in linear order. Modern programming languages such as Java do not constrain this order; developers are free to choose any sequence. In this paper we examine the largely unexplored strategies developers apply for ordering fields and methods: First, we use visualization to explore different ordering criteria within two open source Java projects. Second, we verify our observations in a metric-based analysis on an extended set of 16 projects. Third, we present the results of a survey that reflects the opinion and applied ordering strategies of 52 developers. 87% of the participants agreed that ordering of fields and methods is meaningful or important. Our results suggest that there exists a set of criteria repeatedly used for ordering. Among these, the categories defined in the official Java Code Conventions appear to be the primary ordering criterion. However, in the individual strategies of the participants of the survey, we identified 15 ordering criteria additional to the five criteria we considered in the empirical analysis.
Analyzing code evolution to uncover relations
2015 IEEE 2nd International Workshop on Patterns Promotion and Anti-patterns Prevention (PPAP), 2015
This paper reports on evidence found of five possible relations (Plain Support, Mutual Support, Rejection, Common Refactoring, and Inclusion) among four bad smells (God Class, Long Method, Feature Envy, and Type Checking). We analyzed several releases of three open-source applications (16 for Log4j, 34 for Jmol, and 45 for JFreeChart) using four direct and two indirect metrics. This analysis uncovered correlations between three of these bad smells, namely, Feature Envy, Long Method, and God Class. The strongest correlation discovered was between Feature Envy and Long Method, followed by a mild correlation between Long Method and God Class, and between Feature Envy and God Class. These findings seem to provide initial evidence of the coexistence of bad smells and therefore, the need for bad smell removal plans to take into account these correlations in order to minimize code improvement efforts.