Tim Menzies - Academia.edu
Papers by Tim Menzies
2009 IEEE/ACM International Conference on Automated Software Engineering, 2009
When AI search methods are applied to software process models, appropriate technologies can be discovered for a software project. We show that those recommendations are greatly affected by the business context of their use. For example, the automatic defect reduction tools explored by the ASE community are only relevant to a subset of software projects, and only according to certain value criteria. Therefore, when arguing for the value of a particular technology, that argument should include a description of the value function of the target user community.
The current generation of software analytics tools are mostly prediction algorithms (e.g. support vector machines, naive Bayes, logistic regression, etc). While prediction is useful, after prediction comes planning about what actions to take in order to improve quality. This research seeks methods that generate demonstrably useful guidance on "what to do" within the context of a specific software project. Specifically, we propose XTREE (for within-project planning) and BELLTREE (for cross-project planning) to generate plans that can improve software quality. Each such plan has the property that, if followed, it reduces the probability of future defect reports. When compared to other planning algorithms from the SE literature, we find that this new approach is most effective at learning plans from one project, then applying those plans to another. In 10 open-source JAVA systems, several hundreds of defects were reduced in sections of the code that followed the pla...
Bibliographic information published by the Deutsche Nationalbibliothek: The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.d-nb.de. License: This work is licensed under a Creative Commons Attribution 3.0 Unported license: CC-BY. In brief, this license authorizes each and everybody to share (to copy, distribute and transmit) the work under the following conditions, without impairing or restricting the authors' moral rights: Attribution: The work must be attributed to its authors. The copyright is retained by the corresponding authors. Digital Object Identifier: 10.4230/DagRep.4.6.i. Aims and Scope: The periodical Dagstuhl Reports documents the program and the results of Dagstuhl Seminars and Dagstuhl Perspectives Workshops. In principle, for each Dagstuhl Seminar or Dagstuhl Perspectives Workshop a report is published that contains the following: an executive summary of the seminar program and the fundamental results, an overview of the talks given during the seminar (summarized as talk abstracts), and summaries from working groups (if applicable). This basic framework can be extended by suitable contributions that are related to the program of the seminar, e.g. summaries from panel discussions or open problem sessions.
This report documents the program and the outcomes of Dagstuhl Seminar 14261 "Software Development Analytics". We briefly summarize the goals and format of the seminar, the results of the breakout groups, and a draft of a manifesto for software analytics. The report also includes the abstracts of the talks presented at the seminar. Seminar June 22-27, 2014 - http://www.dagstuhl.de/14261. 1998 ACM Subject Classification: D.2 Software Engineering
ArXiv, 2017
Transfer learning has been the subject of much recent research. In practice, that research means that the models are unstable since they are continually revised whenever new data arrives. This paper offers a very simple “bellwether” transfer learner. Given N datasets, we find which one produces the best predictions on all the others. This “bellwether” dataset is then used for all subsequent predictions (when its predictions start failing, one may seek another bellwether). Bellwethers are interesting since they are very simple to find (wrap a for-loop around standard data miners). They simplify the task of making general policies in software engineering since as long as one bellwether remains useful, stable conclusions for N datasets can be achieved by reasoning over that bellwether. This paper shows that this bellwether approach works for multiple datasets from various domains in SE. From this, we conclude that (1) the bellwether method is a useful (and simple) transfer learner; (2) Unl...
Before researchers rush to reason across all available data, they should first check if the information is densest within some small region. We say this since, in 240 GitHub projects, we find that the information in that data “clumps” towards the earliest parts of the project. In fact, a defect prediction model learned from just the first 150 commits works as well as, or better than, state-of-the-art alternatives. Using just this early life cycle data, we can build models very quickly (using weeks, not months, of CPU time). Also, we can find simple models (with just two features) that generalize to hundreds of software projects. Based on this experience, we warn that prior work on generalizing software engineering defect prediction models may have needlessly complicated an inherently simple process. Further, prior work that focused on later-life cycle data now needs to be revisited since its conclusions were drawn from relatively uninformative regions. Replication note: all our data a...
There has been much recent interest in the application of deep learning neural networks in software engineering. Some researchers are worried that deep learning is being applied with insufficient critical assessment. Hence, for one well-studied software analytics task (defect prediction), this paper compares deep learning versus prior state-of-the-art results. Deep learning will outperform those prior results, but only after adjusting its hyperparameters using GHOST (Goal-oriented Hyperparameter Optimization for Scalable Training). For defect prediction, GHOST terminates in just a few minutes and scales to larger data sets; i.e. it is practical to tune deep learning for defect prediction. Hence this paper recommends deep learning for defect prediction, but only after adjusting its goal predicates and tuning its hyperparameters (using some hyperparameter optimization tool, like GHOST).
Despite decades of research, SE lacks widely accepted models (that offer precise quantitative predictions) about what factors most influence software quality. This paper provides a “good news” result that such general models can be generated using a new transfer learning framework called “GENERAL”. Given a tree of recursively clustered projects (using project meta-data), GENERAL promotes a model upwards if it performs best in the lower clusters (stopping when the promoted model performs worse than the models seen at a lower level). The number of models found by GENERAL is minimal: one for defect prediction (756 projects) and fewer than a dozen for project health (1628 projects). Hence, via GENERAL, it is possible to make conclusions that hold across hundreds of projects at a time. Further, the models produced in this manner offer predictions that perform as well as or better than the prior state-of-the-art. To the best of our knowledge, this is the largest demonstration of the generalizabil...
2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), 2021
Many methods in defect prediction are "data-hungry"; i.e. (1) given a choice of using more data, or some smaller sample, researchers assume that more is better; (2) when data is missing, researchers take elaborate steps to transfer data from another project; and (3) given a choice of older data or some more recent sample, researchers usually ignore older data. Based on the analysis of hundreds of popular GitHub projects (with 1.2 million commits), we suggest that for defect prediction, there is limited value in such data-hungry approaches. Data for our sample of projects lasts for 84 months and contains 3,728 commits (median values). Across these projects, most of the defects occur very early in their life cycle. Hence, defect predictors learned from the first 150 commits and four months perform just as well as anything else. This means that, contrary to the "data-hungry" approach, (1) small samples of data from these projects are all that is needed for defect prediction; (2) transfer learning has limited value since it is needed only for the first 4 of 84 months (i.e. just 4% of the life cycle); (3) after the first few months, we need not continually update our defect prediction models. We hope these results inspire other researchers to adopt a "simplicity-first" approach to their work. Certainly, there are domains that require a complex and data-hungry analysis. But before assuming complexity, it is prudent to check the raw data looking for "short cuts" that simplify the whole analysis.
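To make the "first 150 commits and four months" idea concrete, here is a minimal sketch of how such an early-life-cycle training window might be selected. It is not the authors' code: the function name `early_slice`, the `(timestamp, payload)` commit layout, the reading of "and" as an intersection of the two limits, and the approximation of four months as 120 days are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

def early_slice(commits, max_commits=150, max_days=120):
    """Return the early-life-cycle commits of one project.

    commits: list of (timestamp, payload) tuples, in any order.
    Keeps commits that are BOTH among the first `max_commits`
    AND no later than `max_days` after the first commit
    (both limits are assumptions, see the text above).
    """
    ordered = sorted(commits, key=lambda c: c[0])
    if not ordered:
        return []
    cutoff = ordered[0][0] + timedelta(days=max_days)
    return [c for c in ordered[:max_commits] if c[0] <= cutoff]

if __name__ == "__main__":
    start = datetime(2020, 1, 1)
    # Hypothetical project with one commit per day for 300 days.
    daily = [(start + timedelta(days=i), f"c{i}") for i in range(300)]
    print(len(early_slice(daily)))
```

A defect predictor would then be trained only on the returned slice; later commits are used for testing rather than for retraining.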
IEEE Transactions on Software Engineering, 2021
Software comes in releases. An implausible change to software is something that has never been changed in prior releases. When planning how to reduce defects, it is better to use plausible changes, i.e., changes with some precedent in the prior releases. To demonstrate these points, this paper compares several defect reduction planning tools. LIME is a local sensitivity analysis tool that can report the fewest changes needed to alter the classification of some code module (e.g., from "defective" to "non-defective"). TimeLIME is a new tool, introduced in this paper, that improves LIME by restricting its plans to just those attributes which change the most within a project. In this study, we compared the performance of LIME and TimeLIME and several other defect reduction planning algorithms. The generated plans were assessed via (a) the similarity scores between the proposed code changes and the real code changes made by developers; and (b) the improvement scores seen within projects that followed the plans. For nine project trials, we found that TimeLIME outperformed all other algorithms (in 8 out of 9 trials). Hence, we strongly recommend using past releases as a source of knowledge for computing fixes for new releases (using TimeLIME). Apart from these specific results about planning defect reductions and TimeLIME, the more general point of this paper is that our community should be more careful about using off-the-shelf AI tools without first applying SE knowledge. In this case study, it was not difficult to augment a standard AI algorithm with SE knowledge (that past releases are a good source of knowledge for planning defect reductions). As shown here, once that SE knowledge is applied, this can result in dramatically better systems.
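The idea of restricting plans to the attributes that change most across past releases can be sketched as follows. This is an illustrative toy, not the TimeLIME implementation: the function names, the metric names (`loc`, `cbo`, `dit`), the release data, and the use of a simple change count as the measure of "changes the most" are all assumptions.

```python
from collections import Counter

def most_changed_attributes(releases, top_k):
    """Count, per attribute, how often its value differs between
    consecutive releases of the same module; return the top_k names."""
    changes = Counter()
    for older, newer in zip(releases, releases[1:]):
        for module, old_attrs in older.items():
            new_attrs = newer.get(module)
            if new_attrs is None:
                continue  # module absent from the next release
            for attr, old_value in old_attrs.items():
                if attr in new_attrs and new_attrs[attr] != old_value:
                    changes[attr] += 1
    return {attr for attr, _ in changes.most_common(top_k)}

def restrict_plan(plan, allowed):
    """Drop proposed changes to attributes with no precedent of change."""
    return {attr: target for attr, target in plan.items() if attr in allowed}

# Hypothetical metrics for two modules over three releases.
RELEASES = [
    {"Parser": {"loc": 100, "cbo": 5, "dit": 2}, "Lexer": {"loc": 50, "cbo": 3, "dit": 1}},
    {"Parser": {"loc": 120, "cbo": 6, "dit": 2}, "Lexer": {"loc": 55, "cbo": 3, "dit": 1}},
    {"Parser": {"loc": 130, "cbo": 6, "dit": 2}, "Lexer": {"loc": 60, "cbo": 4, "dit": 1}},
]

if __name__ == "__main__":
    allowed = most_changed_attributes(RELEASES, top_k=2)
    raw_plan = {"loc": 90, "dit": 1, "cbo": 4}  # e.g. produced by some planner
    print(restrict_plan(raw_plan, allowed))
```

In this toy data `dit` never changes between releases, so a proposal to change it is treated as implausible and removed from the plan.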
Proceedings of the 15th International Conference on Mining Software Repositories, 2018
Deep learning methods are useful for high-dimensional data and are becoming widely used in many areas of software engineering. Deep learners utilize extensive computational power and can take a long time to train, making it difficult to widely validate and repeat and improve their results. Further, they are not the best solution in all domains. For example, recent results show that for finding related Stack Overflow posts, a tuned SVM performs similarly to a deep learner, but is significantly faster to train. This paper extends that recent result by clustering the dataset, then tuning learners within each cluster. This approach is over 500 times faster than deep learning (and over 900 times faster if we use all the cores on a standard laptop computer). Significantly, this faster approach generates classifiers nearly as good (within 2% F1 Score) as the much slower deep learning method. Hence we recommend this faster method since it is much easier to reproduce and utilizes far fewer CPU resources. More generally, we recommend that before researchers release research results, they compare their supposedly sophisticated methods against simpler alternatives (e.g. applying simpler learners to build local models).
Empirical Software Engineering, 2020
The current generation of software analytics tools are mostly prediction algorithms (e.g. support vector machines, naive Bayes, logistic regression, etc). While prediction is useful, after prediction comes planning about what actions to take in order to improve quality. This research seeks methods that generate demonstrably useful guidance on "what to do" within the context of a specific software project. Specifically, we propose XTREE (for within-project planning) and BELLTREE (for cross-project planning) to generate plans that can improve software quality. Each such plan has the property that, if followed, it reduces the expected number of future defect reports. To find this expected number, planning was first applied to data from release x. Next, we looked for changes in release x + 1 that conformed to our plans. This procedure was applied using a range of planners from the literature, as well as XTREE. In 10 open-source JAVA systems, several hundreds of defects were reduced in sections of the code that conformed to XTREE's plans. Further, when compared to other planners, XTREE's plans were found to be easier to implement (since they were shorter) and more effective at reducing the expected number of defects.
IEEE Transactions on Software Engineering, 2018
Transfer learning has been the subject of much recent research. In practice, that research means that the models are unstable since they are continually revised whenever new data arrives. This paper offers a very simple "bellwether" transfer learner. Given N datasets, we find which one produces the best predictions on all the others. This "bellwether" dataset is then used for all subsequent predictions (when its predictions start failing, one may seek another bellwether). Bellwethers are interesting since they are very simple to find (wrap a for-loop around standard data miners). They simplify the task of making general policies in software engineering since as long as one bellwether remains useful, stable conclusions for N datasets can be achieved by reasoning over that bellwether. This paper shows that this bellwether approach works for multiple datasets from various domains in SE. From this, we conclude that (1) the bellwether method is a useful (and simple) transfer learner; (2) unlike bellwethers, other complex transfer learners do not generalize to all domains in SE; (3) "bellwethers" are a baseline method against which future transfer learners should be compared; (4) when building increasingly complex automatic methods, researchers should pause and compare more sophisticated methods against simpler alternatives.
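The "wrap a for-loop around standard data miners" step can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the nearest-centroid learner stands in for any data miner, plain accuracy stands in for whatever performance measure is preferred, and the three tiny datasets (`A`, `B`, `C`) are made up.

```python
from statistics import mean

def train_centroids(rows):
    """Toy learner (an assumption, standing in for any data miner):
    store the mean feature vector of each class."""
    by_label = {}
    for features, label in rows:
        by_label.setdefault(label, []).append(features)
    return {label: [mean(col) for col in zip(*vecs)]
            for label, vecs in by_label.items()}

def predict(model, features):
    """Return the label whose centroid is closest to `features`."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: dist(model[label]))

def accuracy(model, rows):
    return sum(predict(model, f) == y for f, y in rows) / len(rows)

def find_bellwether(datasets):
    """Train on each dataset in turn, test on all the others, and
    return the dataset whose model has the best mean score."""
    scores = {}
    for name, train in datasets.items():
        model = train_centroids(train)
        scores[name] = mean(accuracy(model, test)
                            for other, test in datasets.items()
                            if other != name)
    return max(scores, key=scores.get), scores

# Hypothetical one-feature datasets: ([feature], defect label).
DATASETS = {
    "A": [([1.0], 0), ([2.0], 0), ([8.0], 1), ([9.0], 1)],
    "B": [([1.0], 0), ([2.0], 0), ([4.0], 1), ([8.0], 1)],
    "C": [([1.0], 0), ([2.0], 0), ([3.0], 1), ([4.0], 1)],
}

if __name__ == "__main__":
    best, scores = find_bellwether(DATASETS)
    print(best, scores)
```

Once a bellwether is chosen, its model is reused for later predictions; the loop only needs to be rerun when that model's predictions start to fail.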
Proceedings of the 9th International Conference on Predictive Models in Software Engineering, 2013
SE data mining tools can be reconfigured to define and explore the space of decisions made by a community.
BACKGROUND: Given many possible changes to a software project, which ones are recommended? AIM: To comparatively assess different decision procedures for recommending project changes. METHOD: We search for project recommendations within data from eight projects using various AI tools: six model-based methods and one instance-based method called W2. Results were assessed by comparing effort, defect, and development time values in the raw data versus the subset of the data selected by those recommendations. RESULTS: In the majority case, significantly large reductions in effort, defects and development time were achieved. Further, W2 performed as well as, or better than, any other method in this study. W2 does not rely on an underlying model of software process so it does not demand that domain data be expressed in the terminology of that model. Hence, it can be quickly adapted to a new domain and is easy to maintain (just add more instances). CONCLUSION: We recommend instance-based methods s...
IEEE Software, 2013
The predictive modeling community applies data miners to artifacts from software projects. This work has been very successful: we now know how to build predictive models for software effects and defects and many other tasks such as learning developers' programming patterns (see the extended version of this article at http://menzies.us/pdf/13idea.pdf for more detail). That said, to truly impact the work of industrial practitioners, we need to change the predictive modeling community's focus. To date, it has spent too much time on algorithm mining when the field is moving into what I call landscape mining. To support industrial practitioners, we're going to have to move on to something I call decision mining and then discussion mining.
2011 26th IEEE/ACM International Conference on Automated Software Engineering (ASE 2011), 2011
Data miners can infer rules showing how to improve either (a) the effort estimates of a project or (b) the defect predictions of a software module. Such studies often exhibit conclusion instability regarding what is the most effective action for different projects or modules. This instability can be explained by data heterogeneity. We show that effort and defect data contain many local regions with markedly different properties from the global space. In other words, what appears to be useful in a global context is often irrelevant for particular local contexts. This result raises questions about the generality of conclusions from empirical SE. At the very least, SE researchers should test if their supposedly general conclusions are valid within subsets of their data. At the very most, empirical SE should become a search for local regions with similar properties (and conclusions should be constrained to just those regions).
2013 10th Working Conference on Mining Software Repositories (MSR), 2013
How can we find data for quality prediction? Early in the life cycle, projects may lack the data needed to build such predictors. Prior work assumed that relevant training data was found nearest to the local project. But is this the best approach? This paper introduces the Peters filter which is based on the following conjecture: When local data is scarce, more information exists in other projects. Accordingly, this filter selects training data via the structure of other projects. To assess the performance of the Peters filter, we compare it with two other approaches for quality prediction: within-company learning and cross-company learning with the Burak filter (the state-of-the-art relevancy filter). This paper finds that: 1) within-company predictors are weak for small data-sets; 2) the Peters filter+cross-company builds better predictors than both within-company and the Burak filter+cross-company; and 3) the Peters filter builds 64% more useful predictors than both within-company and the Burak filter+cross-company approaches. Hence, we recommend the Peters filter for cross-company learning.
2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2013
Defect prediction approaches use software metrics and fault data to learn which software properties associate with faults in classes. Existing techniques predict fault-prone classes in the same release (intra) or in a subsequent release (inter) of a subject software system. We propose an intra-release fault prediction technique, which learns from clusters of related classes, rather than from the entire system. Classes are clustered using structural information and fault prediction models are built using the metrics on the classes in each cluster identified. We present an empirical investigation on data from 29 releases of 8 open source software systems from the PROMISE repository, with predictors built using multivariate linear regression. The results indicate that the prediction models built on clusters outperform those built on all the classes of the system.
2009 IEEE/ACM International Conference on Automated Software Engineering, 2009
When AI search methods are applied to software process models, then appropriate technologies can ... more When AI search methods are applied to software process models, then appropriate technologies can be discovered for a software project. We show that those recommendations are greatly affected by the business context of its use. For example, the automatic defect reduction tools explored by the ASE community are only relevant to a subset of software projects, and only according to certain value criteria. Therefore, when arguing for the value of a particular technology, that argument should include a description of the value function of the target user community.
The current generation of software analytics tools are mostly prediction algorithms (e.g. support... more The current generation of software analytics tools are mostly prediction algorithms (e.g. support vector machines, naive bayes, logistic regression, etc). While prediction is useful, after prediction comes planning about what actions to take in order to improve quality. This research seeks methods that generate demonstrably useful guidance on ''what to do'' within the context of a specific software project. Specifically, we propose XTREE (for within-project planning) and BELLTREE (for cross-project planning) to generating plans that can improve software quality. Each such plan has the property that, if followed, it reduces the probability of future defect reports. When compared to other planning algorithms from the SE literature, we find that this new approach is most effective at learning plans from one project, then applying those plans to another. In 10 open-source JAVA systems, several hundreds of defects were reduced in sections of the code that followed the pla...
Bibliographic information published by the Deutsche Nationalbibliothek The Deutsche Nationalbibli... more Bibliographic information published by the Deutsche Nationalbibliothek The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available in the Internet at http://dnb.d-nb.de. License This work is licensed under a Creative Commons Attribution 3.0 Unported license: CC-BY. In brief, this license authorizes each and everybody to share (to copy, distribute and transmit) the work under the following conditions, without impairing or restricting the authors' moral rights: Attribution: The work must be attributed to its authors. The copyright is retained by the corresponding authors. Digital Object Identifier: 10.4230/DagRep.4.6.i Aims and Scope The periodical Dagstuhl Reports documents the program and the results of Dagstuhl Seminars and Dagstuhl Perspectives Workshops. In principal, for each Dagstuhl Seminar or Dagstuhl Perspectives Workshop a report is published that contains the following: an executive summary of the seminar program and the fundamental results, an overview of the talks given during the seminar (summarized as talk abstracts), and summaries from working groups (if applicable). This basic framework can be extended by suitable contributions that are related to the program of the seminar, e. g. summaries from panel discussions or open problem sessions.
This report documents the program and the outcomes of Dagstuhl Seminar 14261 "Software Devel... more This report documents the program and the outcomes of Dagstuhl Seminar 14261 "Software Development Analytics". We briefly summarize the goals and format of the seminar, the results of the break out groups, and a draft of a manifesto for software analytics. The report also includes the abstracts of the talks presented at the seminar. Seminar June 22-27, 2014-http://www.dagstuhl.de/14261 1998 ACM Subject Classification D.2 Software Engineering
ArXiv, 2017
Transfer learning has been the subject of much recent research. In practice, that research means ... more Transfer learning has been the subject of much recent research. In practice, that research means that the models are unstable since they are continually revised whenever new data arrives. This paper offers a very simple “bellwether” transfer learner. Given N datasets, we find which one produces the best predictions on all the others. This “bellwether” dataset is then used for all subsequent predictions (when its predictions start failing, one may seek another bellwether). Bellwethers are interesting since they are very simple to find (wrap a for-loop around standard data miners). They simplify the task of making general policies in software engineering since as long as one bellwether remains useful, stable conclusions for N datasets can be achieved by reasoning over that bellwether. This paper shows that this bellwether approach works for multiple datasets from various domains in SE. From this, we conclude that (1) bellwether method is a useful (and simple) transfer learner; (2) Unl...
Before researchers rush to reason across all available data, they should first check if the infor... more Before researchers rush to reason across all available data, they should first check if the information is densest within some small region. We say this since, in 240 GitHub projects, we find that the information in that data “clumps” towards the earliest parts of the project. In fact, a defect prediction model learned from just the first 150 commits works as well, or better than state-of-the-art alternatives. Using just this early life cycle data, we can build models very quickly (using weeks, not months, of CPU time). Also, we can find simple models (with just two features) that generalize to hundreds of software projects. Based on this experience, we warn that prior work on generalizing software engineering defect prediction models may have needlessly complicated an inherently simple process. Further, prior work that focused on later-life cycle data now needs to be revisited since their conclusions were drawn from relatively uninformative regions. Replication note: all our data a...
There has been much recent interest in the application of deep learning neural networks in softwa... more There has been much recent interest in the application of deep learning neural networks in software engineering. Some researchers are worried that deep learning is being applied with insufficient critical ssessment. Hence, for one well-studied software analytics task (defect prediction), this paper compares deep learning versus prior-state-of-the-art results. Deep learning will outperform those prior results, but only after adjusting its hyperparameters using GHOST (Goal-oriented Hyperparameter Optimization for Scalable Training). For defect prediction, GHOST terminates in just a few minutes and scales to larger data sets; i.e. it is practical to tune deep learning tuning for defect prediction. Hence this paper recommends deep learning for defect prediction, but only adjusting its goal predicates and tuning its hyperparameters (using some hyperparameter optimization tool, like GHOST)
Despite decades of research, SE lacks widely accepted models (that offer precise quantitative pre... more Despite decades of research, SE lacks widely accepted models (that offer precise quantitative predictions) about what factors most influence software quality. This paper provides a “good news” result that such general models can be generated using a new transfer learning framework called “GENERAL”. Given a tree of recursively clustered projects (using project meta-data), GENERAL promotes a model upwards if it performs best in the lower clusters (stopping when the promoted model performs worse than the models seen at a lower level). The number of models found by GENERAL is minimal: one for defect prediction (756 projects) and less than a dozen for project health (1628 projects). Hence, via GENERAL, it is possible to make conclusions that hold across hundreds of projects at a time. Further, the models produced in this manner offer predictions that perform as well or better than prior state-of-the-art. To the best of our knowledge, this is the largest demonstration of the generalizabil...
2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), 2021
Many methods in defect prediction are "datahungry"; i.e. (1) given a choice of using more data, o... more Many methods in defect prediction are "datahungry"; i.e. (1) given a choice of using more data, or some smaller sample, researchers assume that more is better; (2) when data is missing, researchers take elaborate steps to transfer data from another project; and (3) given a choice of older data or some more recent sample, researchers usually ignore older data. Based on the analysis of hundreds of popular Github projects (with 1.2 million commits), we suggest that for defect prediction, there is limited value in such data-hungry approaches. Data for our sample of projects last for 84 months and contains 3,728 commits (median values). Across these projects, most of the defects occur very early in their life cycle. Hence, defect predictors learned from the first 150 commits and four months perform just as well as anything else. This means that, contrary to the "data-hungry" approach, (1) small samples of data from these projects are all that is needed for defect prediction; (2) transfer learning has limited value since it is needed only for the first 4 of 84 months (i.e. just 4% of the life cycle); (3) after the first few months, we need not continually update our defect prediction models. We hope these results inspire other researchers to adopt a 'simplicity-first" approach to their work. Certainly, there are domains that require a complex and data-hungry analysis. But before assuming complexity, it is prudent to check the raw data looking for "short cuts" that simplify the whole analysis.
IEEE Transactions on Software Engineering, 2021
Software comes in releases. An implausible change to software is something that has never been ch... more Software comes in releases. An implausible change to software is something that has never been changed in prior releases. When planning how to reduce defects, it is better to use plausible changes, i.e., changes with some precedence in the prior releases. To demonstrate these points, this paper compares several defect reduction planning tools. LIME is a local sensitivity analysis tool that can report the fewest changes needed to alter the classification of some code module (e.g., from "defective" to "non-defective"). TimeLIME is a new tool, introduced in this paper, that improves LIME by restricting its plans to just those attributes which change the most within a project. In this study, we compared the performance of LIME and TimeLIME and several other defect reduction planning algorithms. The generated plans were assessed via (a) the similarity scores between the proposed code changes and the real code changes made by developers; and (b) the improvement scores seen within projects that followed the plans. For nine project trails, we found that TimeLIME outperformed all other algorithms (in 8 out of 9 trials). Hence, we strongly recommend using past releases as a source of knowledge for computing fixes for new releases (using TimeLIME). Apart from these specific results about planning defect reductions and TimeLIME, the more general point of this paper is that our community should be more careful about using off-the-shelf AI tools,without first applying SE knowledge. In this case study, it was not difficult to augment a standard AI algorithm with SE knowledge (that past releases are a good source of knowledge for planning defect reductions). As shown here, once that SE knowledge is applied, this can result in dramatically better systems.
Proceedings of the 15th International Conference on Mining Software Repositories, 2018
Deep learning methods are useful for high-dimensional data and are becoming widely used in many areas of software engineering. Deep learners utilize extensive computational power and can take a long time to train, making it difficult to widely validate, repeat, and improve their results. Further, they are not the best solution in all domains. For example, recent results show that for finding related Stack Overflow posts, a tuned SVM performs similarly to a deep learner, but is significantly faster to train. This paper extends that recent result by clustering the dataset, then tuning every learner within each cluster. This approach is over 500 times faster than deep learning (and over 900 times faster if we use all the cores on a standard laptop computer). Significantly, this faster approach generates classifiers nearly as good (within 2% F1 Score) as the much slower deep learning method. Hence we recommend this faster method since it is much easier to reproduce and utilizes far fewer CPU resources. More generally, we recommend that before researchers release research results, they compare their supposedly sophisticated methods against simpler alternatives (e.g. applying simpler learners to build local models).
Empirical Software Engineering, 2020
The current generation of software analytics tools are mostly prediction algorithms (e.g. support vector machines, naive bayes, logistic regression, etc). While prediction is useful, after prediction comes planning about what actions to take in order to improve quality. This research seeks methods that generate demonstrably useful guidance on "what to do" within the context of a specific software project. Specifically, we propose XTREE (for within-project planning) and BELLTREE (for cross-project planning) to generate plans that can improve software quality. Each such plan has the property that, if followed, it reduces the expected number of future defect reports. To find this expected number, planning was first applied to data from release x. Next, we looked for changes in release x + 1 that conformed to our plans. This procedure was applied using a range of planners from the literature, as well as XTREE. In 10 open-source JAVA systems, several hundreds of defects were reduced in sections of the code that conformed to XTREE's plans. Further, when compared to other planners, XTREE's plans were found to be easier to implement (since they were shorter) and more effective at reducing the expected number of defects.
IEEE Transactions on Software Engineering, 2018
Transfer learning has been the subject of much recent research. In practice, that research means that the models are unstable since they are continually revised whenever new data arrives. This paper offers a very simple "bellwether" transfer learner. Given N datasets, we find which one produces the best predictions on all the others. This "bellwether" dataset is then used for all subsequent predictions (when its predictions start failing, one may seek another bellwether). Bellwethers are interesting since they are very simple to find (wrap a for-loop around standard data miners). They simplify the task of making general policies in software engineering since as long as one bellwether remains useful, stable conclusions for N datasets can be achieved by reasoning over that bellwether. This paper shows that this bellwether approach works for multiple datasets from various domains in SE. From this, we conclude that (1) the bellwether method is a useful (and simple) transfer learner; (2) unlike bellwethers, other complex transfer learners do not generalize to all domains in SE; (3) "bellwethers" are a baseline method against which future transfer learners should be compared; (4) when building increasingly complex automatic methods, researchers should pause and compare more sophisticated methods against simpler alternatives.
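The remark that a bellwether can be found by wrapping "a for-loop around standard data miners" can be made concrete with a small sketch. This is an illustration under stated assumptions rather than the paper's implementation: `find_bellwether` and its `train` and `score` parameters are hypothetical names, taking the median score over the other datasets is an assumed way of defining "best predictions on all the others", and the threshold learner is a toy stand-in for a real data miner.

```python
from statistics import mean, median


def find_bellwether(datasets, train, score):
    """Return (name, score) of the dataset whose model predicts the others best.

    `datasets` maps a name to its rows; `train(rows)` builds a model and
    `score(model, rows)` returns a number where higher is better.
    """
    best_name, best_score = None, float("-inf")
    for name, data in datasets.items():  # the for-loop around a data miner
        model = train(data)
        others = [score(model, other)
                  for other_name, other in datasets.items()
                  if other_name != name]
        if not others:  # a single dataset has nothing to be tested on
            continue
        overall = median(others)  # assumption: summarize by the median score
        if overall > best_score:
            best_name, best_score = name, overall
    return best_name, best_score


def train_threshold(rows):
    """Toy learner: midpoint between the mean x of each class."""
    positives = [x for x, label in rows if label]
    negatives = [x for x, label in rows if not label]
    return (mean(positives) + mean(negatives)) / 2


def accuracy(threshold, rows):
    """Fraction of rows where 'x > threshold' matches the label."""
    hits = sum((x > threshold) == label for x, label in rows)
    return hits / len(rows)
```

Calling `find_bellwether(datasets, train_threshold, accuracy)` on a dict of labeled `(x, label)` rows returns the name of the dataset to reuse for subsequent predictions; any learner and scoring function with the same shape could be substituted.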
Proceedings of the 9th International Conference on Predictive Models in Software Engineering, 2013
SE data mining tools can be reconfigured to define and explore the space of decisions made by a community.
BACKGROUND: Given many possible changes to a software project, which ones are recommended? AIM: To comparatively assess different decision procedures for recommending project changes. METHOD: We search for project recommendations within data from eight projects using various AI tools: six model-based methods and one instance-based method called W2. Results were assessed by comparing effort, defects, and development time values in the raw data versus the subset of the data selected by those recommendations. RESULTS: In the majority of cases, significantly large reductions in effort, defects and development time were achieved. Further, W2 performed as well as, or better than, any other method in this study. W2 does not rely on an underlying model of software process so it does not demand that domain data be expressed in the terminology of that model. Hence, it can be quickly adapted to a new domain and is easy to maintain (just add more instances). CONCLUSION: We recommend instance-based methods s...
IEEE Software, 2013
THE PREDICTIVE MODELING community applies data miners to artifacts from software projects. This work has been very successful: we now know how to build predictive models for software effort and defects and many other tasks such as learning developers' programming patterns (see the extended version of this article at http://menzies.us/pdf/13idea.pdf for more detail). That said, to truly impact the work of industrial practitioners, we need to change the predictive modeling community's focus. To date, it has spent too much time on algorithm mining when the field is moving into what I call landscape mining. To support industrial practitioners, we're going to have to move on to something I call decision mining and then discussion mining.
2011 26th IEEE/ACM International Conference on Automated Software Engineering (ASE 2011), 2011
Data miners can infer rules showing how to improve either (a) the effort estimates of a project or (b) the defect predictions of a software module. Such studies often exhibit conclusion instability regarding what is the most effective action for different projects or modules. This instability can be explained by data heterogeneity. We show that effort and defect data contain many local regions with markedly different properties to the global space. In other words, what appears to be useful in a global context is often irrelevant for particular local contexts. This result raises questions about the generality of conclusions from empirical SE. At the very least, SE researchers should test if their supposedly general conclusions are valid within subsets of their data. At the very most, empirical SE should become a search for local regions with similar properties (and conclusions should be constrained to just those regions).
2013 10th Working Conference on Mining Software Repositories (MSR), 2013
How can we find data for quality prediction? Early in the life cycle, projects may lack the data needed to build such predictors. Prior work assumed that relevant training data was found nearest to the local project. But is this the best approach? This paper introduces the Peters filter, which is based on the following conjecture: when local data is scarce, more information exists in other projects. Accordingly, this filter selects training data via the structure of other projects. To assess the performance of the Peters filter, we compare it with two other approaches for quality prediction: within-company learning and cross-company learning with the Burak filter (the state-of-the-art relevancy filter). This paper finds that: 1) within-company predictors are weak for small data-sets; 2) the Peters filter+cross-company builds better predictors than both within-company and the Burak filter+cross-company; and 3) the Peters filter builds 64% more useful predictors than both within-company and the Burak filter+cross-company approaches. Hence, we recommend the Peters filter for cross-company learning.
2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2013
Defect prediction approaches use software metrics and fault data to learn which software properties associate with faults in classes. Existing techniques predict fault-prone classes in the same release (intra) or in subsequent releases (inter) of a subject software system. We propose an intra-release fault prediction technique, which learns from clusters of related classes, rather than from the entire system. Classes are clustered using structural information and fault prediction models are built using the metrics on the classes in each cluster identified. We present an empirical investigation on data from 29 releases of 8 open source software systems from the PROMISE repository, with predictors built using multivariate linear regression. The results indicate that the prediction models built on clusters outperform those built on all the classes of the system.