Themistoklis Diamantopoulos | Aristotle University of Thessaloniki
Papers by Themistoklis Diamantopoulos
Springer eBooks, 2022
Nowadays, software development is accelerated through the reuse of code snippets found online in question-answering platforms and software repositories. In order to be efficient, this process requires forming an appropriate query and identifying the most suitable code snippet, which can sometimes be challenging and particularly time-consuming. Over the last years, several code recommendation systems have been developed to offer a solution to this problem. Nevertheless, most of them recommend API calls or sequences instead of reusable code snippets. Furthermore, they do not employ architectures advanced enough to exploit the semantics of natural language and code in order to form the optimal query from the question posed. To overcome these issues, we propose CodeTransformer, a code recommendation system that provides useful, reusable code snippets extracted from open-source GitHub repositories. By employing a neural network architecture that comprises advanced attention mechanisms, our system effectively understands and models natural language queries and code snippets in a joint vector space. Upon evaluating CodeTransformer quantitatively against a similar system and qualitatively using a dataset from Stack Overflow, we conclude that our approach can recommend useful and reusable snippets to developers.
Most software teams nowadays host their projects online and monitor software development in the form of issues/tasks. This process entails communicating through comments and reporting progress through commits and closing issues. In this context, assigning new issues, tasks or bugs to the most suitable contributor largely improves efficiency. Thus, several automated issue assignment approaches have been proposed, which however have major limitations. Most systems focus only on assigning bugs using textual data, are limited to projects explicitly using bug tracking systems, and may require manually tuning parameters per project. In this work, we build an automated issue assignment system for GitHub, taking into account the commits and issues of the repository under analysis. Our system aggregates feature probabilities using a neural network that adapts to each project, thus not requiring manual parameter tuning. Upon evaluating our methodology, we conclude that it can be efficient for automated issue assignment.
When developers search online to find software components to reuse, they usually first need to understand the container projects/libraries, and subsequently identify the required functionality. Several approaches identify and summarize the offerings of projects from their source code, however they often require that the developer has knowledge of the underlying topic modeling techniques; they do not provide a mechanism for tuning the number of topics, and they offer no control over the top terms for each topic. In this work, we use a vectorizer to extract information from variable/method names and comments, and apply Latent Dirichlet Allocation to cluster the source code files of a project into different semantic topics. The number of topics is optimized based on their purity with respect to project packages, while topic categories are constructed to provide further intuition and Stack Exchange tags are used to express the topics in more abstract terms.
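The purity criterion mentioned in this abstract can be illustrated with a minimal sketch. It assumes each source file has already been assigned a topic (for example by an LDA model) and that purity is the fraction of files falling in the majority package of their topic; the abstract does not give the exact formula, so this definition, the function name `topic_purity`, and the sample data are all assumptions.

```python
from collections import Counter, defaultdict


def topic_purity(assignments):
    """Return the share of files that sit in the majority package of their topic.

    assignments: list of (topic_id, package_name) pairs, one per source file.
    This is the usual cluster-purity definition, assumed here for illustration.
    """
    if not assignments:
        return 0.0
    packages_per_topic = defaultdict(Counter)
    for topic, package in assignments:
        packages_per_topic[topic][package] += 1
    # For every topic, count only the files of its most frequent package.
    majority_total = sum(
        counts.most_common(1)[0][1] for counts in packages_per_topic.values()
    )
    return majority_total / len(assignments)


# Hypothetical assignments: five files, two topics, two packages.
files = [(0, "io"), (0, "io"), (0, "net"), (1, "net"), (1, "net")]
purity = topic_purity(files)
```

Under this assumption, one could compute the purity for several candidate topic counts and keep the count that scores highest.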
International Conference on Software Engineering Advances, Oct 27, 2013
Locating software bugs is a difficult task, especially if they do not lead to crashes. Current research on automating non-crashing bug detection dictates collecting function call traces and representing them as graphs, and reducing the graphs before applying a subgraph mining algorithm. A ranking of potentially buggy functions is derived using frequency statistics for each node (function) in the correct and incorrect set of traces. Although most existing techniques are effective, they do not achieve scalability. To address this issue, this paper suggests reducing the graph dataset in order to isolate the graphs that are significant in localizing bugs. To this end, we propose the use of tree edit distance algorithms to identify the traces that are closer to each other, while belonging to different sets. The scalability of two proposed algorithms, an exact and a faster approximate one, is evaluated using a dataset derived from a real-world application. Finally, although the main scope of this work lies in scalability, the results indicate that there is no compromise in effectiveness.
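The idea of ranking functions by frequency statistics over correct and incorrect traces can be sketched as follows. This is a deliberately simplified stand-in: it treats each trace as a set of function names and scores a function by how much more often it appears in failing traces than in passing ones. It does not implement the subgraph mining or tree edit distance steps described in the abstract, and the scoring formula, the name `rank_suspicious`, and the sample traces are assumptions.

```python
from collections import Counter


def rank_suspicious(correct_traces, incorrect_traces):
    """Rank functions from most to least suspicious.

    Each trace is a set of function names. The (assumed) score of a function
    is the fraction of failing traces containing it minus the fraction of
    passing traces containing it. Ties are broken alphabetically.
    """
    in_fail = Counter(f for trace in incorrect_traces for f in set(trace))
    in_pass = Counter(f for trace in correct_traces for f in set(trace))

    def score(function):
        fail_ratio = in_fail[function] / len(incorrect_traces) if incorrect_traces else 0.0
        pass_ratio = in_pass[function] / len(correct_traces) if correct_traces else 0.0
        return fail_ratio - pass_ratio

    functions = set(in_fail) | set(in_pass)
    return sorted(functions, key=lambda f: (-score(f), f))


# Hypothetical traces: "flush" only ever appears in failing runs.
correct = [{"main", "parse"}, {"main", "parse", "render"}]
incorrect = [{"main", "parse", "flush"}, {"main", "flush"}]
ranking = rank_suspicious(correct, incorrect)
```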
With the help of project management tools and code hosting facilities, software development has been transformed into an easy-to-decentralize business. However, determining the importance of tasks within a software engineering process in order to better prioritize and act on them has always been an interesting challenge. Although several approaches on bug severity/priority prediction exist, the challenge of task importance prediction has not been sufficiently addressed in current research. Most approaches do not consider the meta-data and the temporal characteristics of the data, while they also do not take into account the ordinal characteristics of the importance/severity variable. In this work, we analyze the challenge of task importance prediction and propose a prototype methodology that extracts both textual (titles, descriptions) and meta-data (type, assignee) characteristics from tasks and employs a sliding window technique to model their time frame. After that, we evaluate three different prediction methods, a multi-class classifier, a regression algorithm, and an ordinal classification technique, in order to assess which model is the most effective for encompassing the relative ordering between different importance values. The results of our evaluation are promising, leaving room for future research.
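To show what respecting the ordinal nature of an importance variable can look like, here is a minimal sketch of a common threshold decomposition: a label with K ordered levels is encoded as K-1 binary targets ("is the label above level k?"), and a prediction is decoded by counting how many thresholds are passed. The abstract does not say which ordinal technique was used, so this decomposition, the level names, and the function names are assumptions.

```python
def ordinal_targets(label, levels):
    """Encode an ordinal label as len(levels) - 1 binary targets.

    Target k is 1 when the label ranks strictly above levels[k].
    """
    rank = levels.index(label)
    return [1 if rank > k else 0 for k in range(len(levels) - 1)]


def decode(probabilities, levels):
    """Map per-threshold probabilities back to a level.

    probabilities[k] is an estimate of P(label > levels[k]); the predicted
    rank is the number of thresholds passed with probability above 0.5.
    """
    rank = sum(1 for p in probabilities if p > 0.5)
    return levels[rank]


# Hypothetical importance scale, ordered from least to most important.
LEVELS = ["low", "medium", "high", "critical"]
encoded = ordinal_targets("high", LEVELS)
decoded = decode([0.9, 0.7, 0.2], LEVELS)
```

Unlike a plain multi-class encoding, this representation makes "medium" closer to "high" than to "critical", which is the relative ordering the abstract refers to.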
Nowadays, software development has been greatly influenced by question-answering communities, such as Stack Overflow. A new problem-solving paradigm has emerged, as developers post problems they encounter that are then answered by the community. In this paper, we propose a methodology that allows searching for solutions in Stack Overflow, using the main elements of a question post, including not only its title, tags, and body, but also its source code snippets. We describe a similarity scheme for these elements and demonstrate how structural information can be extracted from source code snippets and compared to further improve the retrieval of questions. The results of our evaluation indicate that our methodology is effective in recommending similar question posts, allowing community members to search without fully forming a question.
Data
The availability of code snippets in online repositories like GitHub has led to an uptick in code reuse, this way further supporting an open-source component-based development paradigm. The likelihood of code reuse rises when the code components or snippets are of high quality, especially in terms of readability, making their integration and upkeep simpler. Toward this direction, we have developed a dataset of code snippets that takes into account both the functional and the quality characteristics of the snippets. The dataset is based on the CodeSearchNet corpus and comprises additional information, including static analysis metrics, code violations, readability assessments, and source code similarity metrics. Thus, using this dataset, both software researchers and practitioners can conveniently find and employ code snippets that satisfy diverse functional needs while also demonstrating excellent readability and maintainability.
Proceedings of the 18th International Conference on Software Technologies
Proceedings of the 18th International Conference on Software Technologies
IET Software
As more and more software teams use online issue tracking systems to collaborate on software projects, the accurate assignment of new issues to the most suitable contributors may have significant impact on the success of the project. As a result, several research efforts have been directed towards automating this process to save considerable time and effort. However, most approaches focus mainly on software bugs and employ models that do not sufficiently take into account the semantics and the non‐textual metadata of issues and/or produce models that may require manual tuning. A methodology that extracts both textual and non‐textual features from different types of issues is designed, providing a Jira dataset that involves not only bugs but also new features, issues related to documentation, patches, etc. Moreover, the semantics of issue text are effectively captured by employing a topic modelling technique that is optimised using the assignment result. Finally, this methodology agg...
Lecture Notes in Computer Science, 2022
Nowadays, software development is accelerated through the reuse of code snippets found online in question-answering platforms and software repositories. In order to be efficient, this process requires forming an appropriate query and identifying the most suitable code snippet, which can sometimes be challenging and particularly time-consuming. Over the last years, several code recommendation systems have been developed to offer a solution to this problem. Nevertheless, most of them recommend API calls or sequences instead of reusable code snippets. Furthermore, they do not employ architectures advanced enough to exploit the semantics of natural language and code in order to form the optimal query from the question posed. To overcome these issues, we propose CodeTransformer, a code recommendation system that provides useful, reusable code snippets extracted from open-source GitHub repositories. By employing a neural network architecture that comprises advanced attention mechanisms, our system effectively understands and models natural language queries and code snippets in a joint vector space. Upon evaluating CodeTransformer quantitatively against a similar system and qualitatively using a dataset from Stack Overflow, we conclude that our approach can recommend useful and reusable snippets to developers.
2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)
The current state of practice dictates that in order to solve a problem encountered when building software, developers ask for help in online platforms, such as Stack Overflow. In this context of collaboration, answers to question posts often undergo several edits to provide the best solution to the problem stated. In this work, we explore the potential of mining Stack Overflow answer edits to extract common patterns when answering a post. In particular, we design a similarity scheme that takes into account the text and code of answer edits and cluster edits according to their semantics. Upon applying our methodology, we provide frequent edit patterns and indicate how they could be used to answer future research questions. Our evaluation indicates that our approach can be effective for identifying commonly applied edits, thus illustrating the transformation path from the initial answer to the optimal solution.
Proceedings of the 14th International Conference on Software Technologies
The sharing and growth of open source software packages in the npm JavaScript (JS) ecosystem has been exponential, not only in numbers but also in terms of interconnectivity, to the extent that often the size of dependencies has become more than the size of the written code. This reuse-oriented paradigm, often attributed to the lack of a standard library in node and/or to the micropackaging culture of the ecosystem, yields interesting insights on the way developers build their packages. In this work we view the dependency network of the npm ecosystem from a "culinary" perspective. We assume that dependencies are the ingredients in a recipe, which corresponds to the produced software package. We employ network analysis and information retrieval techniques in order to capture the dependencies that tend to co-occur in the development of npm packages and identify the communities that have evolved as the main drivers for npm's exponential growth.
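The "ingredients in a recipe" view can be illustrated by counting how often two dependencies appear together across packages. The sketch below is only a first step toward a co-occurrence network; it does not reproduce the network analysis or information retrieval techniques of the paper, and the function name `cooccurrence` and the dependency lists of the three sample packages are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(packages):
    """Count, for every pair of dependencies, how many packages use both.

    packages: mapping from package name to its list of dependency names.
    Pairs are stored as alphabetically sorted tuples so (a, b) == (b, a).
    """
    counts = Counter()
    for dependencies in packages.values():
        for pair in combinations(sorted(set(dependencies)), 2):
            counts[pair] += 1
    return counts


# Hypothetical "recipes": each package lists its "ingredients".
packages = {
    "app-a": ["express", "lodash", "debug"],
    "app-b": ["express", "debug"],
    "app-c": ["lodash", "chalk"],
}
pairs = cooccurrence(packages)
most_common_pair = pairs.most_common(1)[0][0]
```

The resulting counts could serve as edge weights of a dependency graph, on which community detection might then be applied.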
Nowadays, developers tend to adopt a component-based software engineering approach, reusing their own implementations and/or resorting to third-party source code. This practice is in principle cost-effective; however, it may also lead to low quality software products if the components to be reused exhibit low quality. Thus, several approaches have been developed to measure the quality of software components. Most of them, however, rely on the aid of experts for defining target quality scores and deriving metric thresholds, leading to results that are context-dependent and subjective. In this work, we build a mechanism that employs static analysis metrics extracted from GitHub projects and defines a target quality score based on repositories’ stars and forks, which indicate their adoption/acceptance by developers. Upon removing outliers with a one-class classifier, we employ Principal Feature Analysis and examine the semantics among metrics to provide an analysis on five axes for source co...
IFIP Advances in Information and Communication Technology, 2020
The increase of the adoption of IoT devices and the contemporary problem of food production have given rise to numerous applications of IoT in agriculture. These applications typically comprise a set of sensors that are installed in open fields and measure metrics, such as temperature or humidity, which are used for irrigation control systems. Though useful, most contemporary systems have high installation and maintenance costs, and they do not offer automated control or, if they do, they are usually not interpretable, and thus cannot be trusted for such critical applications. In this work, we design Vital, a system that incorporates a set of low-cost sensors, a robust data store, and most importantly an explainable AI decision support system. Our system outputs a fuzzy rule-base, which is interpretable and allows fully automating the irrigation of the fields. Upon evaluating Vital in two pilot cases, we conclude that it can be effective for monitoring open-field installations.
Proceedings of the 17th International Conference on Mining Software Repositories, 2020
The full integration of online repositories in the contemporary software development process promotes remote work and remote collaboration. Apart from the apparent benefits, online repositories offer a deluge of data that can be utilized to monitor and improve the software development process. Towards this direction, we have designed and implemented a platform that analyzes data from GitHub in order to compute a series of metrics that quantify the contributions of project collaborators, both from a development as well as an operations (communication) perspective. We analyze contributions in an evolutionary manner throughout the projects' lifecycle and track the number of coding violations generated, this way aspiring to identify cases of software development that need closer monitoring and (possibly) further actions to be taken. In this context, we have analyzed the 3000 most popular Java GitHub projects and provide the data to the community.
Language Resources and Evaluation, 2017
Mapping functional requirements first to specifications and then to code is one of the most challenging tasks in software development. Since requirements are commonly written in natural language, they can be prone to ambiguity, incompleteness and inconsistency. Structured semantic representations allow requirements to be translated to formal models, which can be used to detect problems at an early stage of the development process through validation. Storing and querying such models can also facilitate software reuse. Several approaches constrain the input format of requirements to produce specifications, however they usually require considerable human effort in order to adopt domain-specific heuristics and/or controlled languages. We propose a mechanism that automates the mapping of requirements to formal representations using semantic role labeling. We describe the first publicly available dataset for this task, employ a hierarchical framework that allows requirements concepts to be annotated, and discuss how semantic role labeling can be adapted for parsing software requirements.
2016 IEEE International Conference on Software Quality, Reliability and Security (QRS), 2016
The popularity of open source software repositories and the highly adopted paradigm of software reuse have led to the development of several tools that aspire to assess the quality of source code. However, most software quality estimation tools, even the ones using adaptable models, depend on fixed metric thresholds for defining the ground truth. In this work we argue that the popularity of software components, as perceived by developers, can be considered as an indicator of software quality. We present a generic methodology that relates quality with source code metrics and estimates the quality of software components residing in popular GitHub repositories. Our methodology employs two models: a one-class classifier, used to rule out low quality code, and a neural network, that computes a quality score for each software component. Preliminary evaluation indicates that our approach can be effective for identifying high quality software components in the context of reuse.
Proceedings of the Sixth International Symposium on Business Modeling and Software Design, 2016
In order to maintain, extend or reuse software projects one has to primarily understand what a system does and how well it does it. And, while in some cases information on system functionality exists, information covering the non-functional aspects is usually unavailable. Thus, one has to infer such knowledge by extracting design patterns directly from the source code. Several tools have been developed to identify design patterns, however most of them are limited to compilable and in most cases executable code, they rely on complex representations, and do not offer the developer any control over the detected patterns. In this paper we present DP-CORE, a design pattern detection tool that defines a highly descriptive representation to detect known and define custom patterns. DP-CORE is flexible, identifying exact and approximate pattern versions even in non-compilable code. Our analysis indicates that DP-CORE provides an efficient alternative to existing design pattern detection tools.
Automated Software Engineering, 2016
Springer eBooks, 2022
Nowadays, software development is accelerated through the reuse of code snippets found online in ... more Nowadays, software development is accelerated through the reuse of code snippets found online in question-answering platforms and software repositories. In order to be efficient, this process requires forming an appropriate query and identifying the most suitable code snippet, which can sometimes be challenging and particularly time-consuming. Over the last years, several code recommendation systems have been developed to offer a solution to this problem. Nevertheless, most of them recommend API calls or sequences instead of reusable code snippets. Furthermore, they do not employ architectures advanced enough to exploit the semantics of natural language and code in order to form the optimal query from the question posed. To overcome these issues, we propose CodeTransformer, a code recommendation system that provides useful, reusable code snippets extracted from open-source GitHub repositories. By employing a neural network architecture that comprises advanced attention mechanisms, our system effectively understands and models natural language queries and code snippets in a joint vector space. Upon evaluating CodeTransformer quantitatively against a similar system and qualitatively using a dataset from Stack Overflow, we conclude that our approach can recommend useful and reusable snippets to developers.
Most software teams nowadays host their projects online and monitor software development in the f... more Most software teams nowadays host their projects online and monitor software development in the form of issues/tasks. This process entails communicating through comments and reporting progress through commits and closing issues. In this context, assigning new issues, tasks or bugs to the most suitable contributor largely improves efficiency. Thus, several automated issue assignment approaches have been proposed, which however have major limitations. Most systems focus only on assigning bugs using textual data, are limited to projects explicitly using bug tracking systems, and may require manually tuning parameters per project. In this work, we build an automated issue assignment system for GitHub, taking into account the commits and issues of the repository under analysis. Our system aggregates feature probabilities using a neural network that adapts to each project, thus not requiring manual parameter tuning. Upon evaluating our methodology, we conclude that it can be efficient for automated issue assignment.
When developers search online to find software components to reuse, they usually first need to un... more When developers search online to find software components to reuse, they usually first need to understand the container projects/libraries, and subsequently identify the required functionality. Several approaches identify and summarize the offerings of projects from their source code, however they often require that the developer has knowledge of the underlying topic modeling techniques; they do not provide a mechanism for tuning the number of topics, and they offer no control over the top terms for each topic. In this work, we use a vectorizer to extract information from variable/method names and comments, and apply Latent Dirichlet Allocation to cluster the source code files of a project into different semantic topics. The number of topics is optimized based on their purity with respect to project packages, while topic categories are constructed to provide further intuition and Stack Exchange tags are used to express the topics in more abstract terms.
International Conference on Software Engineering Advances, Oct 27, 2013
Locating software bugs is a difficult task, especially if they do not lead to crashes. Current re... more Locating software bugs is a difficult task, especially if they do not lead to crashes. Current research on automating non-crashing bug detection dictates collecting function call traces and representing them as graphs, and reducing the graphs before applying a subgraph mining algorithm. A ranking of potentially buggy functions is derived using frequency statistics for each node (function) in the correct and incorrect set of traces. Although most existing techniques are effective, they do not achieve scalability. To address this issue, this paper suggests reducing the graph dataset in order to isolate the graphs that are significant in localizing bugs. To this end, we propose the use of tree edit distance algorithms to identify the traces that are closer to each other, while belonging to different sets. The scalability of two proposed algorithms, an exact and a faster approximate one, is evaluated using a dataset derived from a real-world application. Finally, although the main scope of this work lies in scalability, the results indicate that there is no compromise in effectiveness.
With the help of project management tools and code hosting facilities, software development has b... more With the help of project management tools and code hosting facilities, software development has been transformed into an easy-to-decentralize business. However, determining the importance of tasks within a software engineering process in order to better prioritize and act on has always been an interesting challenge. Although several approaches on bug severity/priority prediction exist, the challenge of task importance prediction has not been sufficiently addressed in current research. Most approaches do not consider the meta-data and the temporal characteristics of the data, while they also do not take into account the ordinal characteristics of the importance/severity variable. In this work, we analyze the challenge of task importance prediction and propose a prototype methodology that extracts both textual (titles, descriptions) and meta-data (type, assignee) characteristics from tasks and employs a sliding window technique to model their time frame. After that, we evaluate three different prediction methods, a multi-class classifier, a regression algorithm, and an ordinal classification technique, in order to assess which model is the most effective for encompassing the relative ordering between different importance values. The results of our evaluation are promising, leaving room for future research.
Nowadays, software development has been greatly influenced by question-answering communities, suc... more Nowadays, software development has been greatly influenced by question-answering communities, such as Stack Overflow. A new problem-solving paradigm has emerged, as developers post problems they encounter that are then answered by the community. In this paper, we propose a methodology that allows searching for solutions in Stack Overflow, using the main elements of a question post, including not only its title, tags, and body, but also its source code snippets. We describe a similarity scheme for these elements and demonstrate how structural information can be extracted from source code snippets and compared to further improve the retrieval of questions. The results of our evaluation indicate that our methodology is effective on recommending similar question posts allowing community members to search without fully forming a question.
Data
The availability of code snippets in online repositories like GitHub has led to an uptick in code... more The availability of code snippets in online repositories like GitHub has led to an uptick in code reuse, this way further supporting an open-source component-based development paradigm. The likelihood of code reuse rises when the code components or snippets are of high quality, especially in terms of readability, making their integration and upkeep simpler. Toward this direction, we have developed a dataset of code snippets that takes into account both the functional and the quality characteristics of the snippets. The dataset is based on the CodeSearchNet corpus and comprises additional information, including static analysis metrics, code violations, readability assessments, and source code similarity metrics. Thus, using this dataset, both software researchers and practitioners can conveniently find and employ code snippets that satisfy diverse functional needs while also demonstrating excellent readability and maintainability.
Proceedings of the 18th International Conference on Software Technologies
Proceedings of the 18th International Conference on Software Technologies
IET Software
As more and more software teams use online issue tracking systems to collaborate on software proj... more As more and more software teams use online issue tracking systems to collaborate on software projects, the accurate assignment of new issues to the most suitable contributors may have significant impact on the success of the project. As a result, several research efforts have been directed towards automating this process to save considerable time and effort. However, most approaches focus mainly on software bugs and employ models that do not sufficiently take into account the semantics and the non‐textual metadata of issues and/or produce models that may require manual tuning. A methodology that extracts both textual and non‐textual features from different types of issues is designed, providing a Jira dataset that involves not only bugs but also new features, issues related to documentation, patches, etc. Moreover, the semantics of issue text are effectively captured by employing a topic modelling technique that is optimised using the assignment result. Finally, this methodology agg...
Lecture Notes in Computer Science, 2022
Nowadays, software development is accelerated through the reuse of code snippets found online in ... more Nowadays, software development is accelerated through the reuse of code snippets found online in question-answering platforms and software repositories. In order to be efficient, this process requires forming an appropriate query and identifying the most suitable code snippet, which can sometimes be challenging and particularly time-consuming. Over the last years, several code recommendation systems have been developed to offer a solution to this problem. Nevertheless, most of them recommend API calls or sequences instead of reusable code snippets. Furthermore, they do not employ architectures advanced enough to exploit the semantics of natural language and code in order to form the optimal query from the question posed. To overcome these issues, we propose CodeTransformer, a code recommendation system that provides useful, reusable code snippets extracted from open-source GitHub repositories. By employing a neural network architecture that comprises advanced attention mechanisms, our system effectively understands and models natural language queries and code snippets in a joint vector space. Upon evaluating CodeTransformer quantitatively against a similar system and qualitatively using a dataset from Stack Overflow, we conclude that our approach can recommend useful and reusable snippets to developers.
2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)
The current state of practice dictates that in order to solve a problem encountered when building... more The current state of practice dictates that in order to solve a problem encountered when building software, developers ask for help in online platforms, such as Stack Overflow. In this context of collaboration, answers to question posts often undergo several edits to provide the best solution to the problem stated. In this work, we explore the potential of mining Stack Overflow answer edits to extract common patterns when answering a post. In particular, we design a similarity scheme that takes into account the text and code of answer edits and cluster edits according to their semantics. Upon applying our methodology, we provide frequent edit patterns and indicate how they could be used to answer future research questions. Our evaluation indicates that our approach can be effective for identifying commonly applied edits, thus illustrating the transformation path from the initial answer to the optimal solution.
Proceedings of the 14th International Conference on Software Technologies
The sharing and growth of open source software packages in the npm JavaScript (JS) ecosystem has been exponential, not only in numbers but also in terms of interconnectivity, to the extent that the size of the dependencies often exceeds the size of the written code. This reuse-oriented paradigm, often attributed to the lack of a standard library in node and/or to the micropackaging culture of the ecosystem, yields interesting insights on the way developers build their packages. In this work we view the dependency network of the npm ecosystem from a "culinary" perspective. We assume that dependencies are the ingredients in a recipe, which corresponds to the produced software package. We employ network analysis and information retrieval techniques in order to capture the dependencies that tend to co-occur in the development of npm packages and identify the communities that have evolved as the main drivers for npm's exponential growth.
Nowadays, developers tend to adopt a component-based software engineering approach, reusing their own implementations and/or resorting to third-party source code. This practice is in principle cost-effective; however, it may also lead to low quality software products if the components to be reused exhibit low quality. Thus, several approaches have been developed to measure the quality of software components. Most of them, however, rely on the aid of experts for defining target quality scores and deriving metric thresholds, leading to results that are context-dependent and subjective. In this work, we build a mechanism that employs static analysis metrics extracted from GitHub projects and defines a target quality score based on repositories’ stars and forks, which indicate their adoption/acceptance by developers. Upon removing outliers with a one-class classifier, we employ Principal Feature Analysis and examine the semantics among metrics to provide an analysis on five axes for source co...
IFIP Advances in Information and Communication Technology, 2020
The increase of the adoption of IoT devices and the contemporary problem of food production have given rise to numerous applications of IoT in agriculture. These applications typically comprise a set of sensors that are installed in open fields and measure metrics, such as temperature or humidity, which are used for irrigation control systems. Though useful, most contemporary systems have high installation and maintenance costs, and they do not offer automated control or, if they do, they are usually not interpretable, and thus cannot be trusted for such critical applications. In this work, we design Vital, a system that incorporates a set of low-cost sensors, a robust data store, and most importantly an explainable AI decision support system. Our system outputs a fuzzy rule-base, which is interpretable and allows fully automating the irrigation of the fields. Upon evaluating Vital in two pilot cases, we conclude that it can be effective for monitoring open-field installations.
Proceedings of the 17th International Conference on Mining Software Repositories, 2020
The full integration of online repositories in the contemporary software development process promotes remote work and remote collaboration. Apart from the apparent benefits, online repositories offer a deluge of data that can be utilized to monitor and improve the software development process. Towards this direction, we have designed and implemented a platform that analyzes data from GitHub in order to compute a series of metrics that quantify the contributions of project collaborators, both from a development as well as an operations (communication) perspective. We analyze contributions in an evolutionary manner throughout the projects' lifecycle and track the number of coding violations generated, this way aspiring to identify cases of software development that need closer monitoring and (possibly) further actions to be taken. In this context, we have analyzed the 3000 most popular Java GitHub projects and provide the data to the community.
Language Resources and Evaluation, 2017
Mapping functional requirements first to specifications and then to code is one of the most challenging tasks in software development. Since requirements are commonly written in natural language, they can be prone to ambiguity, incompleteness and inconsistency. Structured semantic representations allow requirements to be translated to formal models, which can be used to detect problems at an early stage of the development process through validation. Storing and querying such models can also facilitate software reuse. Several approaches constrain the input format of requirements to produce specifications; however, they usually require considerable human effort in order to adopt domain-specific heuristics and/or controlled languages. We propose a mechanism that automates the mapping of requirements to formal representations using semantic role labeling. We describe the first publicly available dataset for this task, employ a hierarchical framework that allows requirements concepts to be annotated, and discuss how semantic role labeling can be adapted for parsing software requirements.
2016 IEEE International Conference on Software Quality, Reliability and Security (QRS), 2016
The popularity of open source software repositories and the highly adopted paradigm of software reuse have led to the development of several tools that aspire to assess the quality of source code. However, most software quality estimation tools, even the ones using adaptable models, depend on fixed metric thresholds for defining the ground truth. In this work we argue that the popularity of software components, as perceived by developers, can be considered as an indicator of software quality. We present a generic methodology that relates quality with source code metrics and estimates the quality of software components residing in popular GitHub repositories. Our methodology employs two models: a one-class classifier, used to rule out low quality code, and a neural network, that computes a quality score for each software component. Preliminary evaluation indicates that our approach can be effective for identifying high quality software components in the context of reuse.
Proceedings of the Sixth International Symposium on Business Modeling and Software Design, 2016
In order to maintain, extend or reuse software projects, one has to primarily understand what a system does and how well it does it. While in some cases information on system functionality exists, information covering the non-functional aspects is usually unavailable. Thus, one has to infer such knowledge by extracting design patterns directly from the source code. Several tools have been developed to identify design patterns; however, most of them are limited to compilable and in most cases executable code, rely on complex representations, and do not offer the developer any control over the detected patterns. In this paper we present DP-CORE, a design pattern detection tool that defines a highly descriptive representation to detect known and define custom patterns. DP-CORE is flexible, identifying exact and approximate pattern versions even in non-compilable code. Our analysis indicates that DP-CORE provides an efficient alternative to existing design pattern detection tools.
Automated Software Engineering, 2016