Jesus Gonzalez-Barahona - Academia.edu
Papers by Jesus Gonzalez-Barahona
In this data paper we describe a data set obtained from an online survey of over 2,000 Free/Libre/Open Source Software (FLOSS) contributors. The survey includes questions related to personal characteristics (gender, age, civil status, nationality, etc.), education and level of English, professional status, dedication to FLOSS projects, reasons and motivations, involvement and goals. We also describe the possibilities and challenges of using private information from the survey when linked with other, publicly available data sources. In this regard, an example of data sharing will be presented and legal, ethical and technical issues will be discussed.
Libre (free, open source) software is providing huge quantities of data suitable to be used in studies of software evolution. Many different aspects of its development process can be studied from data available in public repositories, ranging from the source code in release archives to mailing lists, bug report systems and version control systems. There are already several software evolution
2009 42nd Hawaii International Conference on System Sciences, 2009
Developer turnover can be a major problem when developing software. When senior developers abandon a software project, they leave a knowledge gap that has to be managed. In addition, new (junior) developers require some time to reach the desired level of productivity. In this paper, we present a methodology to measure the effect of knowledge loss due to developer turnover in software projects. For a given software project, we measure the quantity of code that has been authored by developers who do not belong to the current development team, which we define as orphaned code. We also study how orphaned code is managed by the project. Our methodology is based on the concept of software archaeology, a derivation of software evolution. As case studies we have selected four FLOSS (free, libre, open source software) projects, ranging from purely volunteer-driven to company-supported. The application of our methodology to these case studies gives insight into the turnover that these projects suffer and how they have managed it, and shows that this methodology is worth extending in future research.
This paper examines the claim that libre (free, open source) software involves global development. The anecdotal evidence is that developers usually work in teams including individuals residing in many different geographical areas, time zones and even continents and that, ...
"Collaboration, Conflict and Control: The 4th Workshop on Open Source Software Engineering" W8S Workshop - 26th International Conference on Software Engineering, 2004
The relationships among modules in a software project of a certain size can give us much information about its internal organization and a way to control and monitor development activities and evolution of large libre software projects. In this paper, we show how information available in CVS repositories can be used to study the structure of the modules in a project when they are related by the people working in them, and how techniques taken from the social networks field can be used to highlight the ...
Studying the evolution of libre software projects using publicly available data. Gregorio Robles-Martínez, Jesús M. González-Barahona, José Centeno-González, Vicente Matellán-Olivera, and Luis Rodero-Merino. GSyC, Universidad Rey Juan Carlos {grex, ...
IFIP — The International Federation for Information Processing, 2007
The tutorial will begin with reviews of the main source code repositories, including popular code forges such as SourceForge, and techniques for collecting data directly from the forges as well as from aggregation projects such as FLOSSmole. The tutorial will then discuss tools designed for analyzing the data found on forges, such as CVSAnalY, Pyternity, and SLOCCount, among others. Most importantly, participants will have a chance to analyze data with the help of the presenters. Teams of participants will solve open-ended analysis problems ...
IFIP — The International Federation for Information Processing, 2007
Exchange of detailed data about software development between research teams, and specifically of data available from public repositories of libre (free, open source) software projects, is becoming more and more common. This workshop will explore the benefits and problems of such exchange, and the steps needed to foster it. As a case example of data exchange, the workshop organizers suggest two large datasets to be analyzed by participants.
IFIP — The International Federation for Information Processing, 2007
Although much of the research on the libre (free, open source) phenomenon has been focused on the involvement of volunteers, the role of companies is also important in many projects. In fact, in recent years, the involvement of companies in the libre software ...
... Louis, Missouri, USA, May 2005. [4] Gregorio Robles, Jesús M. González-Barahona, José Centeno-González, Vicente Matellán-Olivera, and Luis Rodero-Merino. Studying the evolution of libre software projects using publicly available data. ...
Libre (free, open source) software is one of the paradigmatic cases where heavy use of telematic tools and user-driven software development are key points. This paper proposes a methodology for remotely measuring and analyzing large libre software projects using publicly available data from their version control repositories. By means of a tool called CVSAnalY that has been implemented
Libre (free/open source) software provides an ample range of publicly available data sources about its development, which can be retrieved and analyzed. Consequently, it offers a good opportunity to build predictive estimation and evolution models. The main challenge in understanding libre software development is that its nature is radically different from that of 'classical' in-house software development, common in industry in recent decades. Developers and other human resources are generally a mixture of a few hired developers and many volunteers whose contribution (in number of hours per week and in total time devoted to the project) cannot be foreseen in advance. This paper is a first step towards finding predictive models in the libre software world. We have studied three data repositories (versioning system, mailing lists and bug tracking system) of GNOME, a large libre software project with several thousand contributors and several million lines of code, measuring activity and participation in it in recent years. Results and correlations for these sources allow us to venture some initial estimates of how participation and activity will evolve in the future.
Proceedings - ICSE 2007 Workshops: Fourth International Workshop on Mining Software Repositories, MSR 2007, 2007
During 2003, the Mozilla project transitioned from company-promoted (sponsored by AOL) to community-promoted (sponsored by the Mozilla Foundation). What happened to the group of developers during this transition? Was there any significant impact on its activity or composition? To answer these questions, we have performed an analysis of the CVS repository of Mozilla, using the CVSAnalY tool, finding little impact on activity, but dramatic changes in the composition of the development team.
Proceedings - ICSE 2007 Workshops: Fourth International Workshop on Mining Software Repositories, MSR 2007, 2007
In order to predict the number of changes in the following months for the Eclipse project, we have applied a statistical (non-explanatory) model based on time series analysis. We have obtained the monthly number of changes in the CVS repository of Eclipse, using the CVSAnalY tool. The input to our model was the filtered series of the number of changes per month, and the output was the number of changes per month for the next three months. We then aggregated the results of the three months to obtain the total number of changes in the period given in the challenge.
International Workshop on Principles of Software Evolution (IWPSE), 2009
What is the future of software evolution? In 1974, Meir M. Lehman had a vision of software evolution being driven by empirical studies of software repositories, and of a theory based on those empirical results. However, that scenario is yet to come. Software evolution studies are often based on a few cases, because the needed information is scarce, dispersed and incomplete. Their conclusions are not generalizable, which slows down the progress of this research discipline. Libre (free / open source) software offers an opportunity to alleviate this situation. In this paper we describe the existing approaches to providing research datasets by mining libre software repositories, and propose an agenda based on the concept of research-friendly software repositories, which provide finer granularity and integrated data.
Concepts, Methodologies, Tools, and Applications, 2009
Due to the open nature of Free/Libre/Open Source software projects, researchers have gained access to a rich set of development-related information. Although this information is publicly available on the Internet, obtaining and analyzing it in a convenient way is not an easy task and many considerations have to be taken into account. In this paper we present the most important data sources that can be found in libre software projects and that are studied by the research community: source code, source code management systems, mailing lists and bug tracking systems. We give advice on the problems that can arise when retrieving and preparing the data sources for later analysis, and provide information about the tools that support these tasks.
Proceedings of the 2008 international workshop on Mining software repositories - MSR '08, 2008
We believe that the bug report form of Eclipse contains too many fields, and that for some fields, there are too many options. In this MSR challenge report, we focus on the case of the severity field. That field contains seven different levels of severity. Some of them seem very similar, and it is hard to distinguish among them. Users assign severity, and developers give priority to the reports depending on their severity. However, if users cannot distinguish well among the various severity options, they will probably assign different priorities to bugs that require the same priority. We study the mean time to close bugs reported in Eclipse, and how the severity assigned by users affects this time. The results show that, when classifying by time to close, there are fewer clusters of bugs than levels of severity. We therefore conclude that there is a need for a simpler bug report form.
2007 IEEE International Conference on Software Maintenance, 2007
, in many cases volunteers, interact in complex patterns without the constraints of formal hierarchical structures or organizational ties. Understanding this complex behavior in enough detail to build explanatory models suitable for prediction is an open challenge, and few results have been published to date in this area. Therefore statistical, non-explanatory models (such as the traditional regression model) have a clear role, and have been used in some evolution studies. Our proposal goes in this direction, but uses a model that we have found more useful: time series analysis. Data available from the source code management repository is used to compute the size of the software over its past life, and this information is used to estimate the future evolution of the project. In this paper we present this methodology and apply it to three large projects, showing how in these cases predictions are more accurate than those of regression models, and precise enough to estimate their near-future evolution with little error. * This work has been funded in part by the European Commission, under the FLOSSMETRICS (FP6-IST-5-033982) and QUALOSS (FP6-IST-5-033547) projects. Israel Herraiz has been funded in part by Consejería de Educación of Comunidad de Madrid and European Social Fund, under grant number 01/FPI/0582/2005.
14th Working Conference on Reverse Engineering (WCRE 2007), 2007
The notion of functional or modular dependency is fundamental to understand the architecture and inner workings of any software system. In this paper, we propose to extend that notion to consider dependencies at a larger scale, between software applications (usually ...
Proceedings of the European Conference on Software Maintenance and Reengineering, CSMR, 2006
There are some concerns in the research community about the convenience of using low-level metrics (such as SLOC, source lines of code) for characterizing the evolution of software, instead of the more traditional higher-level metrics (such as the number of modules or files). This issue has been raised in particular after some studies that suggest that libre (free, open source) software evolves differently than 'traditional' software, and therefore does not conform to Lehman's laws of software evolution. Since those studies on libre software evolution use SLOCs as the base metric, while Lehman's and other traditional studies use modules or files, it is difficult to compare the two. To overcome this difficulty, and to explore the differences between SLOC and file/module counts in libre software projects, we have selected a large sample of programs and have calculated both size metrics over time. Our study shows that the evolution patterns in both cases (counting SLOCs or files) are the same, and that some patterns not conforming to Lehman's laws are indeed apparent.