Chanchal Roy | University of Saskatchewan (original) (raw)
Papers by Chanchal Roy
In today's open source era, developers look for similar software applications in source code repo... more In today's open source era, developers look for similar software applications in source code repositories for a number of reasons, including, exploring alternative implementations, reusing source code, or looking for a better application. However, while there are a great many studies for finding similar applications written in the same programming language, there is a marked lack of studies for finding similar software applications written in different languages. In this paper, we fill the gap by proposing a novel model CroLSim which is able to detect similar software applications across different programming languages. In our approach, we use the API documentation to find relationships among the API calls used by the different programming languages. We adopt a deep learning based wordvector learning method to identify semantic relationships among the API documentation which we then use to detect crosslanguage similar software applications. For evaluating CroLSim, we formed a repository consisting of 8,956 Java, 7,658 C#, and 10,232 Python applications collected from GitHub. We observed that CroLSim can successfully detect similar software applications across different programming languages with a mean average precision rate of 0.65, an average confidence rate of 3.6 (out of 5) with 75% high rated successful queries, which outperforms all related existing approaches with a significant performance improvement.
Software developers often look for solutions to their code level problems at Stack Overflow. Henc... more Software developers often look for solutions to their code level problems at Stack Overflow. Hence, they frequently submit their questions with sample code segments and issue descriptions. Unfortunately, it is not always possible to reproduce their reported issues from such code segments. This phenomenon might prevent their questions from getting prompt and appropriate solutions. In this paper, we report an exploratory study on the reproducibility of the issues discussed in 400 questions of Stack Overflow. In particular, we parse, compile, execute and even carefully examine the code segments from these questions, spent a total of 200 man hours, and then attempt to reproduce their programming issues. The outcomes of our study are two-fold. First, we find that 68% of the code segments require minor and major modifications in order to reproduce the issues reported by the developers. On the contrary, 22% code segments completely fail to reproduce the issues. We also carefully investigate why these issues could not be reproduced and then provide evidence-based guidelines for writing effective code examples for Stack Overflow questions. Second, we investigate the correlation between issue reproducibility status (of questions) and corresponding answer meta-data such as the presence of an accepted answer. According to our analysis, a question with reproducible issues has at least three times higher chance of receiving an accepted answer than the question with irreproducible issues.
Developers often prefer dynamically typed programming languages, such as JavaScript, because such... more Developers often prefer dynamically typed programming languages, such as JavaScript, because such languages do not require explicit type declarations. However, such a feature hinders software engineering tasks, such as code completion, type related bug fixes and so on. Deep learning-based techniques are proposed in the literature to infer the types of code elements in JavaScript snippets. These techniques are computationally expensive. While several type inference techniques have been developed to detect types in code snippets written in statically typed languages, it is not clear how effective those techniques are for inferring types in dynamically typed languages, such as JavaScript. In this paper, we investigate the type inference techniques of JavaScript to understand the above two issues further. While doing that we propose a new technique that considers the locally specific code tokens as the context to infer the types of code elements. The evaluation result shows that the proposed technique is 20-47% more accurate than the statically typed language-based techniques and 5-14 times faster than the deep learning techniques without sacrificing accuracy. Our analysis of sensitivity, overlapping of predicted types and the number of training examples justify the importance of our technique.
Recent ndings from a user study suggest that IR-based bug localization techniques do not perform ... more Recent ndings from a user study suggest that IR-based bug localization techniques do not perform well if the bug report lacks rich structured information such as relevant program entity names. On the contrary, excessive structured information such as stack traces in the bug report might always not be helpful for the automated bug localization. In this paper, we conduct a large empirical study using 5,500 bug reports from eight subject systems and replicating three existing studies from the literature. Our ndings (1) empirically demonstrate how quality dynamics of bug reports a ect the performances of IR-based bug localization, and (2) suggest potential ways (e.g., query reformulations) to overcome such limitations.
Code reuse by copying and pasting from one place to another place in a codebase is a very common ... more Code reuse by copying and pasting from one place to another place in a codebase is a very common scenario in software development which is also one of the most typical reasons for introducing code clones. There is a huge availability of tools to detect such cloned fragments and a lot of studies have already been done for efficient clone detection. There are also several studies for evaluating those tools considering their clone detection effectiveness. Unfortunately, we find no study which compares different clone detection tools in the perspective of detecting cloned co-change candidates during software evolution. Detecting cloned co-change candidates is essential for clone tracking. In this study, we wanted to explore this dimension of code clone research. We used six promising clone detection tools to identify cloned and non-cloned co-change candidates from six C and Java-based subject systems and evaluated the performance of those clone detection tools in detecting the cloned co-change fragments. Our findings show that a good clone detector may not perform well in detecting cloned co-change candidates. The amount of unique lines covered by a clone detector and the number of detected clone fragments plays an important role in its performance. The findings of this study can enrich a new dimension of code clone research.
Code clones, identical or nearly similar code fragments in a software system's code-base, have mi... more Code clones, identical or nearly similar code fragments in a software system's code-base, have mixed impacts on software evolution and maintenance. Focusing on the issues of clones researchers suggest managing them through refactoring, and tracking. In this paper we present a survey on the stateof-the-art of clone refactoring and tracking techniques, and identify future research possibilities in these areas. We define the quality assessment features for the clone refactoring and tracking tools, and make a comparison among these tools considering these features. To the best of our knowledge, our survey is the first comprehensive study on clone refactoring and tracking. According to our survey on clone refactoring we realize that automatic refactoring cannot eradicate the necessity of manual e↵ort regarding finding refactoring opportunities, and post refactoring testing of system behaviour. Post refactoring testing can require a significant amount of time and e↵ort from the quality assurance engineers. There is a marked lack of research on the e↵ect of clone refactoring on system performance. Future investigations in this direction will add much value to clone refactoring research. We also feel the necessity of future research towards real-time detection, and tracking of code clones in a big-data environment.
Scientific workflow management systems such as Galaxy, Taverna and Workspace, have been developed... more Scientific workflow management systems such as Galaxy, Taverna and Workspace, have been developed to automate scientific workflow management and are increasingly being used to accelerate the specification, execution, visualization, and monitoring of data-intensive tasks. For example, the popular bioinformatics platform Galaxy is installed on over 168 servers around the world and the social networking space myExperiment shares almost 4,000 Galaxy scientific workflows among its 10,665 members. Most of these systems offer graphical interfaces for composing workflows. However, while graphical languages are considered easier to use, graphical workflow models are more difficult to comprehend and maintain as they become larger and more complex. Text-based languages are considered harder to use but have the potential to provide a clean and concise expression of workflow even for large and complex workflows. A recent study showed that some scientists prefer script/text-based environments to perform complex scientific analysis with workflows. Unfortunately, such environments are unable to meet the needs of scientists who prefer graphical workflows. In order to address the needs of both types of scientists and at the same time to have script-based workflow models because of their underlying benefits, we propose a visually guided workflow modeling framework that combines interactive graphical user interface elements in an integrated development environment with the power of a domain-specific language to compose independently developed and loosely coupled services into workflows. Our domain-specific language provides scientists with a clean, concise, and abstract view of workflow to better support workflow modeling. As a proof of concept, we developed VizSciFlow, a generalized scientific workflow management system that can be customized for use in a variety of scientific domains. As a first use case, we configured and customized VizSciFlow for the bioinformatics domain. We conducted three user studies to assess its usability, expressiveness, efficiency, and flexibility. Results are promising, and in particular, our user studies show that VizSciFlow is more desirable for users to use than either Python or Galaxy for solving complex scientific problems.
The identical or nearly similar code fragments in a code-base are called code clones.
Identical or nearly similar code fragments in a software system's code-base are known as code clo... more Identical or nearly similar code fragments in a software system's code-base are known as code clones. Code clones from the same clone class have a tendency of co-changing (changing together) consistently during evolution. Focusing on this co-change tendency, existing studies have investigated prediction and ranking co-change candidates of regular clones. However, a recent study shows that micro-clones which are smaller than the minimum size threshold of regular clones might also need to be co-changed consistently during evolution. Thus, identifying and ranking co-change candidates of micro-clones is also important. In this paper, we investigate factors that influence the co-change tendency of the co-change candidates of a target micro-clone fragment.
Scientific Workflow Management Systems (SWfMSs) have become popular in recent years for accelerat... more Scientific Workflow Management Systems (SWfMSs) have become popular in recent years for accelerating the specification, execution, visualization, and monitoring of data-intensive tasks. Unfortunately, to the best of our knowledge no existing SWfMSs directly support collaboration. Data is increasing in complexity, dimensionality, and volume, and the efficient analysis of data often goes beyond the realm of an individual and requires collaboration with multiple researchers from varying domains. In this paper, we propose a groupware system architecture for data analysis that in addition to supporting collaboration, also incorporates features from SWfMSs to support modern data analysis processes. As a proof of concept for the proposed architecture we developed SciWorCS -a groupware system for scientific data analysis. We present two real-world use-cases: collaborative software repository analysis and bioinformatics data analysis. The results of the experiments evaluating the proposed system are promising. Our bioinformatics user study demonstrates that SciWorCS can leverage real-world data analysis tasks by supporting real-time collaboration among users.
Code cloning is a recurrent operation in everyday software development. Whether it is a good or b... more Code cloning is a recurrent operation in everyday software development. Whether it is a good or bad practice is an ongoing debate among researchers and developers for the last few decades. In this paper, we conduct a comparative study on bugproneness in clone code and non-clone code by analyzing commit logs. According to our inspection on thousands of revisions of seven diverse subject systems, the percentage of changed files due to bug-fix commits is significantly higher in clone code compared with non-clone code. We perform a Mann-Whitney-Wilcoxon (MWW) test to show the statistical significance of our findings. Finally, the possibility of severe bugs occurring is higher in clone code than in non-clone code. Bug-fixing changes affecting clone code should be considered more carefully. According to our findings, clone code appears to be more bug-prone than non-clone code.
While designing a software system, minimizing coupling among program entities (such as files, cla... more While designing a software system, minimizing coupling among program entities (such as files, classes, methods) is always desirable. If a software entity is coupled with many other entities, this might be an indication of poor software design because changing that entity will likely have ripple change effects on the other coupled entities. Evolutionary coupling, also known as change coupling, is a well investigated way of identifying coupling among program entities. Existing studies have investigated whether file level or class level evolutionary couplings are related with software bug-proneness. While these existing studies have mixed findings regarding the relationship between bug-proneness and evolutionary coupling, none of these studies investigated whether method level (i.e., function level for procedural languages) evolutionary coupling is correlated with bug-proneness. Investigation considering a finer granularity (i.e., such as method level granularity) can help us pinpoint which methods in the files or classes are actually responsible for coupling as well as bug-proneness.
Developers often reuse code snippets from online forums, such as Stack Overflow, to learn API usa... more Developers often reuse code snippets from online forums, such as Stack Overflow, to learn API usages of software frameworks or libraries. These code snippets often contain ambiguous undeclared external references. Such external references make it difficult to learn and use those APIs correctly. In particular, reusing code snippets containing such ambiguous undeclared external references requires significant manual efforts and expertise to resolve them. Manually resolving fully qualified names (FQN) of API elements is a non-trivial task. In this paper, we propose a novel context-sensitive technique, called COSTER, to resolve FQNs of API elements in such code snippets. The proposed technique collects locally specific source code elements as well as globally related tokens as the context of FQNs, calculates likelihood scores, and builds an occurrence likelihood dictionary (OLD). Given an API element as a query, COSTER captures the context of the query API element, matches that with the FQNs of API elements stored in the OLD, and rank those matched FQNs leveraging three different scores: likelihood, context similarity, and name similarity scores. Evaluation with more than 600K code examples collected from GitHub and two different Stack Overflow datasets shows that our proposed technique improves precision by 4-6% and recall by 3-22% compared to state-ofthe-art techniques. The proposed technique significantly reduces the training time compared to the StatType, a state-of-the-art technique, without sacrificing accuracy. Extensive analyses on results demonstrate the robustness of the proposed technique.
Code clones are exactly or nearly similar code fragments in the code-base of a software system. E... more Code clones are exactly or nearly similar code fragments in the code-base of a software system. Existing studies show that clones are directly related to bugs and inconsistencies in the code-base. Code cloning (making code clones) is suspected to be responsible for replicating bugs in the code fragments. However, there is no study on the possibilities of bug-replication through cloning process. Such a study can help us discover ways of minimizing bug-replication. Focusing on this we conduct an empirical study on the intensities of bug-replication in the code clones of the major clone-types: Type 1, Type 2, and Type 3.
Code cloning is a recurrent operation in everyday software development. Whether it is a good or b... more Code cloning is a recurrent operation in everyday software development. Whether it is a good or bad practice is an ongoing debate among researchers and developers for the last few decades. In this paper, we conduct a comparative study on bug-proneness in clone code and non-clone code by analyzing commit logs. According to our inspection of thousands of revisions of seven diverse subject systems, the percentage of changed¯les due to bug-¯x commits is signi¯cantly higher in clone code compared with non-clone code. We perform a Mann-Whitney-Wilcoxon (MWW) test to show the statistical signi¯cance of our¯ndings. In addition, the possibility of occurrence of severe bugs is higher in clone code than in non-clone code. Bug-¯xing changes a®ecting clone code should be considered more carefully. Finally, our manual investigation shows that clone code containing if-condition and if-else blocks has a high risk of having severing bugs. Changes to such types of clone fragments should be done carefully during software maintenance. According to our¯ndings, clone code appears to be more bug-prone than non-clone code.
Big-data analytics or systems developed with parallel distributed processing frameworks (e.g., Ha... more Big-data analytics or systems developed with parallel distributed processing frameworks (e.g., Hadoop and Spark) are becoming popular for finding important insights from a huge amount of heterogeneous data (e.g., image, text and sensor data). These systems o↵er a wide range of tools and connect them to form workflows for processing Big Data. Independent schemes from di↵erent studies for managing programs and data of workflows have been already proposed by many researchers and most of the systems have been presented with data or metadata management. However, to the best of our knowledge, no study particularly discusses the performance implications of utilizing intermediate states of data and programs generated at various execution steps of a workflow in distributed platforms. In order to address the shortcomings, we propose a scheme of Big-data management for micro-level modular computation-intensive programs in a Spark and Hadoop based platform. In this paper, we investigate whether management of the intermediate states can speed up the execution of an image processing pipeline consisting of various image processing tools/APIs in Hadoop Distributed File System (HDFS) while ensuring appropriate reusability and error monitoring. From our experiments, we obtained prominent results, e.g., we have reported that with the intermediate data management, we can gain up to 87% computation time for an image processing job.
Nowadays, workflows are being frequently built and used for systematically processing large datas... more Nowadays, workflows are being frequently built and used for systematically processing large datasets in workflow management systems (WMS). A workflow (i.e., a pipeline) is a sequential organization of a finite set of processing modules that are applied on a particular dataset for producing a desired output. In a workflow management system, the users generally create workflows manually for their own investigations. However, workflows can sometimes be lengthy and the constituent processing modules might often be computationally expensive. In this situation, it would be beneficial if a user could reuse intermediate stage results generated by previously executed workflows for executing his current workflow.
Scientific Workflow Management System (SWFMS) is one of the inherent parts of Big Data analytics ... more Scientific Workflow Management System (SWFMS) is one of the inherent parts of Big Data analytics systems. Analyses in such dataintensive research using workflows are very costly. SWFMSs or workflows, keep track of every bit of executions through logs, which later could be used on demand. For example, in the case of errors, security breaches or even any conditions, we may need to trace back to the previous steps or look at the intermediate data elements. Such fashion of logging is known as workflow provenance. However, prominent workflows being domain specific and developed following different programming paradigms, their architectures, logging mechanisms, information in the logs, provenance queries and so on differ significantly. So, provenance technology of one workflow from a certain domain is not easily applicable in another domain. Facing the lack of a general workflow provenance standard, we propose a programming model for automated workflow logging. The programming model is easy to implement and easily configurable by domain experts independent of workflow users. We implement our workflow programming model on Bioinformatics research-for evaluation and collect workflow logs from various scientific pipelines' executions. Then we focus on some fundamental provenance questions inspired by recent literature that can derive many other complex provenance questions. Finally, the end users are provided with discovered insights from the workflow provenance through online data visualization as a separate web service.
Software systems in this big data era are growing larger and becoming more intricate. Tracking an... more Software systems in this big data era are growing larger and becoming more intricate. Tracking and managing code clones in such evolving software systems are challenging tasks. To understand how clone fragments are evolving, the programmers often analyze the co-evolution of clone fragments manually to decide about refactoring, tracking, and bug removal. Such manual analysis is infeasible for a large number of clones with clones evolving over hundreds of software revisions. We propose a visual analytics framework, that leverages big data visualization techniques to manage code clones in large software systems. Our framework combines multiple information-linked zoomable views, where users can explore and analyze clones through interactive exploration in real time. We discuss several scenarios where our framework may assist developers in real-life software development and clone maintenance. Experts' reviews reveal many future potentials of our framework.
In today's open source era, developers look for similar software applications in source code repo... more In today's open source era, developers look for similar software applications in source code repositories for a number of reasons, including, exploring alternative implementations, reusing source code, or looking for a better application. However, while there are a great many studies for finding similar applications written in the same programming language, there is a marked lack of studies for finding similar software applications written in different languages. In this paper, we fill the gap by proposing a novel model CroLSim which is able to detect similar software applications across different programming languages. In our approach, we use the API documentation to find relationships among the API calls used by the different programming languages. We adopt a deep learning based wordvector learning method to identify semantic relationships among the API documentation which we then use to detect crosslanguage similar software applications. For evaluating CroLSim, we formed a repository consisting of 8,956 Java, 7,658 C#, and 10,232 Python applications collected from GitHub. We observed that CroLSim can successfully detect similar software applications across different programming languages with a mean average precision rate of 0.65, an average confidence rate of 3.6 (out of 5) with 75% high rated successful queries, which outperforms all related existing approaches with a significant performance improvement.
Software developers often look for solutions to their code level problems at Stack Overflow. Henc... more Software developers often look for solutions to their code level problems at Stack Overflow. Hence, they frequently submit their questions with sample code segments and issue descriptions. Unfortunately, it is not always possible to reproduce their reported issues from such code segments. This phenomenon might prevent their questions from getting prompt and appropriate solutions. In this paper, we report an exploratory study on the reproducibility of the issues discussed in 400 questions of Stack Overflow. In particular, we parse, compile, execute and even carefully examine the code segments from these questions, spent a total of 200 man hours, and then attempt to reproduce their programming issues. The outcomes of our study are two-fold. First, we find that 68% of the code segments require minor and major modifications in order to reproduce the issues reported by the developers. On the contrary, 22% code segments completely fail to reproduce the issues. We also carefully investigate why these issues could not be reproduced and then provide evidence-based guidelines for writing effective code examples for Stack Overflow questions. Second, we investigate the correlation between issue reproducibility status (of questions) and corresponding answer meta-data such as the presence of an accepted answer. According to our analysis, a question with reproducible issues has at least three times higher chance of receiving an accepted answer than the question with irreproducible issues.
Developers often prefer dynamically typed programming languages, such as JavaScript, because such... more Developers often prefer dynamically typed programming languages, such as JavaScript, because such languages do not require explicit type declarations. However, such a feature hinders software engineering tasks, such as code completion, type related bug fixes and so on. Deep learning-based techniques are proposed in the literature to infer the types of code elements in JavaScript snippets. These techniques are computationally expensive. While several type inference techniques have been developed to detect types in code snippets written in statically typed languages, it is not clear how effective those techniques are for inferring types in dynamically typed languages, such as JavaScript. In this paper, we investigate the type inference techniques of JavaScript to understand the above two issues further. While doing that we propose a new technique that considers the locally specific code tokens as the context to infer the types of code elements. The evaluation result shows that the proposed technique is 20-47% more accurate than the statically typed language-based techniques and 5-14 times faster than the deep learning techniques without sacrificing accuracy. Our analysis of sensitivity, overlapping of predicted types and the number of training examples justify the importance of our technique.
Recent ndings from a user study suggest that IR-based bug localization techniques do not perform ... more Recent ndings from a user study suggest that IR-based bug localization techniques do not perform well if the bug report lacks rich structured information such as relevant program entity names. On the contrary, excessive structured information such as stack traces in the bug report might always not be helpful for the automated bug localization. In this paper, we conduct a large empirical study using 5,500 bug reports from eight subject systems and replicating three existing studies from the literature. Our ndings (1) empirically demonstrate how quality dynamics of bug reports a ect the performances of IR-based bug localization, and (2) suggest potential ways (e.g., query reformulations) to overcome such limitations.
Code reuse by copying and pasting from one place to another place in a codebase is a very common ... more Code reuse by copying and pasting from one place to another place in a codebase is a very common scenario in software development which is also one of the most typical reasons for introducing code clones. There is a huge availability of tools to detect such cloned fragments and a lot of studies have already been done for efficient clone detection. There are also several studies for evaluating those tools considering their clone detection effectiveness. Unfortunately, we find no study which compares different clone detection tools in the perspective of detecting cloned co-change candidates during software evolution. Detecting cloned co-change candidates is essential for clone tracking. In this study, we wanted to explore this dimension of code clone research. We used six promising clone detection tools to identify cloned and non-cloned co-change candidates from six C and Java-based subject systems and evaluated the performance of those clone detection tools in detecting the cloned co-change fragments. Our findings show that a good clone detector may not perform well in detecting cloned co-change candidates. The amount of unique lines covered by a clone detector and the number of detected clone fragments plays an important role in its performance. The findings of this study can enrich a new dimension of code clone research.
Code clones, identical or nearly similar code fragments in a software system's code-base, have mi... more Code clones, identical or nearly similar code fragments in a software system's code-base, have mixed impacts on software evolution and maintenance. Focusing on the issues of clones researchers suggest managing them through refactoring, and tracking. In this paper we present a survey on the stateof-the-art of clone refactoring and tracking techniques, and identify future research possibilities in these areas. We define the quality assessment features for the clone refactoring and tracking tools, and make a comparison among these tools considering these features. To the best of our knowledge, our survey is the first comprehensive study on clone refactoring and tracking. According to our survey on clone refactoring we realize that automatic refactoring cannot eradicate the necessity of manual e↵ort regarding finding refactoring opportunities, and post refactoring testing of system behaviour. Post refactoring testing can require a significant amount of time and e↵ort from the quality assurance engineers. There is a marked lack of research on the e↵ect of clone refactoring on system performance. Future investigations in this direction will add much value to clone refactoring research. We also feel the necessity of future research towards real-time detection, and tracking of code clones in a big-data environment.
Scientific workflow management systems such as Galaxy, Taverna and Workspace, have been developed... more Scientific workflow management systems such as Galaxy, Taverna and Workspace, have been developed to automate scientific workflow management and are increasingly being used to accelerate the specification, execution, visualization, and monitoring of data-intensive tasks. For example, the popular bioinformatics platform Galaxy is installed on over 168 servers around the world and the social networking space myExperiment shares almost 4,000 Galaxy scientific workflows among its 10,665 members. Most of these systems offer graphical interfaces for composing workflows. However, while graphical languages are considered easier to use, graphical workflow models are more difficult to comprehend and maintain as they become larger and more complex. Text-based languages are considered harder to use but have the potential to provide a clean and concise expression of workflow even for large and complex workflows. A recent study showed that some scientists prefer script/text-based environments to perform complex scientific analysis with workflows. Unfortunately, such environments are unable to meet the needs of scientists who prefer graphical workflows. In order to address the needs of both types of scientists and at the same time to have script-based workflow models because of their underlying benefits, we propose a visually guided workflow modeling framework that combines interactive graphical user interface elements in an integrated development environment with the power of a domain-specific language to compose independently developed and loosely coupled services into workflows. Our domain-specific language provides scientists with a clean, concise, and abstract view of workflow to better support workflow modeling. As a proof of concept, we developed VizSciFlow, a generalized scientific workflow management system that can be customized for use in a variety of scientific domains. As a first use case, we configured and customized VizSciFlow for the bioinformatics domain. We conducted three user studies to assess its usability, expressiveness, efficiency, and flexibility. Results are promising, and in particular, our user studies show that VizSciFlow is more desirable for users to use than either Python or Galaxy for solving complex scientific problems.
The identical or nearly similar code fragments in a code-base are called code clones.
Identical or nearly similar code fragments in a software system's code-base are known as code clo... more Identical or nearly similar code fragments in a software system's code-base are known as code clones. Code clones from the same clone class have a tendency of co-changing (changing together) consistently during evolution. Focusing on this co-change tendency, existing studies have investigated prediction and ranking co-change candidates of regular clones. However, a recent study shows that micro-clones which are smaller than the minimum size threshold of regular clones might also need to be co-changed consistently during evolution. Thus, identifying and ranking co-change candidates of micro-clones is also important. In this paper, we investigate factors that influence the co-change tendency of the co-change candidates of a target micro-clone fragment.
Scientific Workflow Management Systems (SWfMSs) have become popular in recent years for accelerat... more Scientific Workflow Management Systems (SWfMSs) have become popular in recent years for accelerating the specification, execution, visualization, and monitoring of data-intensive tasks. Unfortunately, to the best of our knowledge no existing SWfMSs directly support collaboration. Data is increasing in complexity, dimensionality, and volume, and the efficient analysis of data often goes beyond the realm of an individual and requires collaboration with multiple researchers from varying domains. In this paper, we propose a groupware system architecture for data analysis that in addition to supporting collaboration, also incorporates features from SWfMSs to support modern data analysis processes. As a proof of concept for the proposed architecture we developed SciWorCS -a groupware system for scientific data analysis. We present two real-world use-cases: collaborative software repository analysis and bioinformatics data analysis. The results of the experiments evaluating the proposed system are promising. Our bioinformatics user study demonstrates that SciWorCS can leverage real-world data analysis tasks by supporting real-time collaboration among users.
Code cloning is a recurrent operation in everyday software development. Whether it is a good or b... more Code cloning is a recurrent operation in everyday software development. Whether it is a good or bad practice is an ongoing debate among researchers and developers for the last few decades. In this paper, we conduct a comparative study on bugproneness in clone code and non-clone code by analyzing commit logs. According to our inspection on thousands of revisions of seven diverse subject systems, the percentage of changed files due to bug-fix commits is significantly higher in clone code compared with non-clone code. We perform a Mann-Whitney-Wilcoxon (MWW) test to show the statistical significance of our findings. Finally, the possibility of severe bugs occurring is higher in clone code than in non-clone code. Bug-fixing changes affecting clone code should be considered more carefully. According to our findings, clone code appears to be more bug-prone than non-clone code.
While designing a software system, minimizing coupling among program entities (such as files, cla... more While designing a software system, minimizing coupling among program entities (such as files, classes, methods) is always desirable. If a software entity is coupled with many other entities, this might be an indication of poor software design because changing that entity will likely have ripple change effects on the other coupled entities. Evolutionary coupling, also known as change coupling, is a well investigated way of identifying coupling among program entities. Existing studies have investigated whether file level or class level evolutionary couplings are related with software bug-proneness. While these existing studies have mixed findings regarding the relationship between bug-proneness and evolutionary coupling, none of these studies investigated whether method level (i.e., function level for procedural languages) evolutionary coupling is correlated with bug-proneness. Investigation considering a finer granularity (i.e., such as method level granularity) can help us pinpoint which methods in the files or classes are actually responsible for coupling as well as bug-proneness.
Developers often reuse code snippets from online forums, such as Stack Overflow, to learn API usa... more Developers often reuse code snippets from online forums, such as Stack Overflow, to learn API usages of software frameworks or libraries. These code snippets often contain ambiguous undeclared external references. Such external references make it difficult to learn and use those APIs correctly. In particular, reusing code snippets containing such ambiguous undeclared external references requires significant manual efforts and expertise to resolve them. Manually resolving fully qualified names (FQN) of API elements is a non-trivial task. In this paper, we propose a novel context-sensitive technique, called COSTER, to resolve FQNs of API elements in such code snippets. The proposed technique collects locally specific source code elements as well as globally related tokens as the context of FQNs, calculates likelihood scores, and builds an occurrence likelihood dictionary (OLD). Given an API element as a query, COSTER captures the context of the query API element, matches that with the FQNs of API elements stored in the OLD, and rank those matched FQNs leveraging three different scores: likelihood, context similarity, and name similarity scores. Evaluation with more than 600K code examples collected from GitHub and two different Stack Overflow datasets shows that our proposed technique improves precision by 4-6% and recall by 3-22% compared to state-ofthe-art techniques. The proposed technique significantly reduces the training time compared to the StatType, a state-of-the-art technique, without sacrificing accuracy. Extensive analyses on results demonstrate the robustness of the proposed technique.
Code clones are exactly or nearly similar code fragments in the code-base of a software system. E... more Code clones are exactly or nearly similar code fragments in the code-base of a software system. Existing studies show that clones are directly related to bugs and inconsistencies in the code-base. Code cloning (making code clones) is suspected to be responsible for replicating bugs in the code fragments. However, there is no study on the possibilities of bug-replication through cloning process. Such a study can help us discover ways of minimizing bug-replication. Focusing on this we conduct an empirical study on the intensities of bug-replication in the code clones of the major clone-types: Type 1, Type 2, and Type 3.
Code cloning is a recurrent operation in everyday software development. Whether it is a good or b... more Code cloning is a recurrent operation in everyday software development. Whether it is a good or bad practice is an ongoing debate among researchers and developers for the last few decades. In this paper, we conduct a comparative study on bug-proneness in clone code and non-clone code by analyzing commit logs. According to our inspection of thousands of revisions of seven diverse subject systems, the percentage of changed¯les due to bug-¯x commits is signi¯cantly higher in clone code compared with non-clone code. We perform a Mann-Whitney-Wilcoxon (MWW) test to show the statistical signi¯cance of our¯ndings. In addition, the possibility of occurrence of severe bugs is higher in clone code than in non-clone code. Bug-¯xing changes a®ecting clone code should be considered more carefully. Finally, our manual investigation shows that clone code containing if-condition and if-else blocks has a high risk of having severing bugs. Changes to such types of clone fragments should be done carefully during software maintenance. According to our¯ndings, clone code appears to be more bug-prone than non-clone code.
Big-data analytics or systems developed with parallel distributed processing frameworks (e.g., Ha... more Big-data analytics or systems developed with parallel distributed processing frameworks (e.g., Hadoop and Spark) are becoming popular for finding important insights from a huge amount of heterogeneous data (e.g., image, text and sensor data). These systems o↵er a wide range of tools and connect them to form workflows for processing Big Data. Independent schemes from di↵erent studies for managing programs and data of workflows have been already proposed by many researchers and most of the systems have been presented with data or metadata management. However, to the best of our knowledge, no study particularly discusses the performance implications of utilizing intermediate states of data and programs generated at various execution steps of a workflow in distributed platforms. In order to address the shortcomings, we propose a scheme of Big-data management for micro-level modular computation-intensive programs in a Spark and Hadoop based platform. In this paper, we investigate whether management of the intermediate states can speed up the execution of an image processing pipeline consisting of various image processing tools/APIs in Hadoop Distributed File System (HDFS) while ensuring appropriate reusability and error monitoring. From our experiments, we obtained prominent results, e.g., we have reported that with the intermediate data management, we can gain up to 87% computation time for an image processing job.
Nowadays, workflows are being frequently built and used for systematically processing large datas... more Nowadays, workflows are being frequently built and used for systematically processing large datasets in workflow management systems (WMS). A workflow (i.e., a pipeline) is a sequential organization of a finite set of processing modules that are applied on a particular dataset for producing a desired output. In a workflow management system, the users generally create workflows manually for their own investigations. However, workflows can sometimes be lengthy and the constituent processing modules might often be computationally expensive. In this situation, it would be beneficial if a user could reuse intermediate stage results generated by previously executed workflows for executing his current workflow.
Scientific Workflow Management System (SWFMS) is one of the inherent parts of Big Data analytics ... more Scientific Workflow Management System (SWFMS) is one of the inherent parts of Big Data analytics systems. Analyses in such dataintensive research using workflows are very costly. SWFMSs or workflows, keep track of every bit of executions through logs, which later could be used on demand. For example, in the case of errors, security breaches or even any conditions, we may need to trace back to the previous steps or look at the intermediate data elements. Such fashion of logging is known as workflow provenance. However, prominent workflows being domain specific and developed following different programming paradigms, their architectures, logging mechanisms, information in the logs, provenance queries and so on differ significantly. So, provenance technology of one workflow from a certain domain is not easily applicable in another domain. Facing the lack of a general workflow provenance standard, we propose a programming model for automated workflow logging. The programming model is easy to implement and easily configurable by domain experts independent of workflow users. We implement our workflow programming model on Bioinformatics research-for evaluation and collect workflow logs from various scientific pipelines' executions. Then we focus on some fundamental provenance questions inspired by recent literature that can derive many other complex provenance questions. Finally, the end users are provided with discovered insights from the workflow provenance through online data visualization as a separate web service.
Software systems in this big data era are growing larger and becoming more intricate. Tracking an... more Software systems in this big data era are growing larger and becoming more intricate. Tracking and managing code clones in such evolving software systems are challenging tasks. To understand how clone fragments are evolving, the programmers often analyze the co-evolution of clone fragments manually to decide about refactoring, tracking, and bug removal. Such manual analysis is infeasible for a large number of clones with clones evolving over hundreds of software revisions. We propose a visual analytics framework, that leverages big data visualization techniques to manage code clones in large software systems. Our framework combines multiple information-linked zoomable views, where users can explore and analyze clones through interactive exploration in real time. We discuss several scenarios where our framework may assist developers in real-life software development and clone maintenance. Experts' reviews reveal many future potentials of our framework.