Multi-Disciplinary Advancement in Open Source Software and Processes

Stefan Koch (Bogazici University, Turkey)
Release Date: March, 2011|Copyright: © 2011 |Pages: 308|DOI: 10.4018/978-1-60960-513-1
ISBN13: 9781609605131|ISBN10: 1609605136|EISBN13: 9781609605148

Description

By its very nature, free and open source software encourages collaboration within and across virtual teams and promotes interdisciplinary methods and perspectives.

Multi-Disciplinary Advancement in Open Source Software and Processes reviews the development, design, and use of free and open source software, providing relevant topics of discussion for programmers, as well as researchers in human-computer studies, online and virtual collaboration, and e-learning. This reference explores successes and failures in the discipline, providing a foundation for future research and new approaches for the development and application of free and open source projects.

Topics Covered

The many academic areas covered in this publication include, but are not limited to:

  • Agile and free software approaches
  • Collaboration in open source
  • Competition between open source and proprietary software
  • FLOSS projects
  • Multiple open source code forges
  • Open source communities
  • Open source software adoption
  • Software Engineering Education
  • Unusual data sources

Table of Contents and List of Contributors

Preface

ABSTRACT

In this paper we review the literature on open source software development projects, focusing on the effort they expend and the efficiency they achieve. The paper is based mostly on empirical analyses of open source communities. The main method employed is mining the associated software repositories, which are publicly available and contain records of past interactions between participants, or between participants and the repository itself. We present a short overview of this methodology and reference some papers which deal with the technicalities in more detail. From that we distill some characteristics found in open source software development projects, foremost among them the high concentration of work on a few heads. We then focus on the estimated effort expended for open source development projects, which in comparison to commercial organizational settings seems to be quite small, and relate these points to the efficiency that this form of software development achieves.

Keywords: Open Source, Effort Estimation, Efficiency, Software Quality, Software Repository Mining.

1. INTRODUCTION

In recent years, free and open source software, i.e. software under a license that grants the user several rights such as free redistribution, has become more and more important, with this importance now stretching beyond the mere use of well-known projects in both private and commercial settings (Fitzgerald, 2006; von Krogh and Spaeth, 2007). Examples of this type of adoption naturally include the operating system Linux with the utilities of the GNU project and various desktops as well as office packages, the Apache web server, databases such as MySQL and many others.

It should be noted that several terms are in use within this field, most notably open source software and free software, which need to be discussed briefly. The term open source as used by the Open Source Initiative (OSI) is defined using the Open Source Definition (Perens, 1999), which lists a number of rights a license has to grant in order to constitute an open source license. These include most notably free redistribution, inclusion of source code, permission for derived works which can be redistributed under the same license, integrity of the author's source code, absence of discrimination against persons, groups or fields of endeavor, and some clauses for the license itself, its distribution, and that it must neither be specific to a product nor contaminate other software. The Free Software Foundation (FSF) advocates the term free software, explicitly alluding to “free” as in “free speech”, not as in “free beer” (Stallman, 2002), and defines software as free if the user has the freedom to run the program for any purpose, to study how the program works and adapt it to his needs, to redistribute copies, and to improve the program and release these improvements to the public. According to this definition, open source and free software are largely interchangeable. The GNU project itself prefers copylefted software, which is free software whose distribution terms do not let redistributors add any additional restrictions when they redistribute or modify the software. This means that every copy of the software, even if it has been modified, must be free software. This is a more stringent proposition than found in the Open Source Definition, which merely allows this. The best-known and most important free and open source license, the GNU General Public License (GPL), is an example of such a copyleft license, with the associated viral characteristics, as any program using or built upon GPLed software must itself be under the GPL.
There are a number of other licenses, some of which are free software licenses without being copyleft, like the X11 license or clarified versions of the original, vague Artistic License, and others which can be considered free or open source, like BSD, Apache or the Mozilla Public License and Sun Public License. In this paper, the term open source is used to refer to free software as well; if a particular license is of importance, this is noted.

More importantly, open source software is not only unique in its licenses and legal implications, but also in its development process and organization of work. The seminal work on this topic was written by Eric S. Raymond, ‘The Cathedral and the Bazaar’, in which he contrasts the traditional type of software development of a few people planning a cathedral in splendid isolation with the new collaborative bazaar form of open source software development (Raymond, 1999). In the latter, a large number of developer-turned users come together without monetary compensation to cooperate under a model of rigorous peer review and take advantage of parallel debugging, which leads to innovation and rapid advancement in developing and evolving software products. In order to allow for this to happen and to minimize duplicated work, the source code of the software needs to be accessible, which necessitates suitable licenses, and new versions need to be released in short cycles.

This theoretical work is going to be our starting point in examining open source software development projects, and the implications for effort and efficiency as well as quality. It is necessary to base this discussion on empirical assessments of real-world projects. Empirical research on open source processes often employs the analysis of data available through mining the communication and coordination tools and their repositories. For this paper, we will mostly focus on this approach and results from several studies using it. Other approaches taken include ethnographic studies of development communities (Coleman and Hill, 2004; Elliott and Scacchi, 2004), sometimes coupled with repository mining (Basset, 2004).

The structure of this paper is as follows: We will start with a short introduction to mining software repositories, and then provide a discussion and empirical data concerning characteristics of open source software development projects based on this methodology. Then we are going to focus on the implications that these aspects have on effort, as well as efficiency and quality.

2. SOFTWARE REPOSITORY MINING

Software development repositories contain a plethora of information on software and underlying, associated development processes (Cook et al., 1998; Atkins et al., 1999). Studying software systems and development processes using these sources of data offers several advantages (Cook et al., 1998): It is very cost-effective, as no additional instrumentation is necessary, and does not influence the process under consideration. In addition, longitudinal data are available, allowing for analyses that consider the whole project history. Depending on the tools used in a project, possible repositories available for analysis include source code versioning systems (Atkins et al., 1999; Kemerer and Slaughter, 1999), bug reporting systems, and mailing lists. In open source projects, repositories in several forms are also in use, and in fact form the most important communication and coordination channels as participants are not collocated. Therefore, there is only a small amount of information that cannot be captured because it is transmitted inter-personally. Repositories in use must also be available openly, in order to enable persons to access them and participate in the project.

Prior studies have included both in-depth analyses of small numbers of successful projects (Gallivan, 2001) like Apache and Mozilla (Mockus et al., 2002), GNOME (Koch and Schneider, 2002) or FreeBSD (Dinh-Trong and Bieman, 2005), and also large data samples, such as those derived from Sourceforge.net (Koch, 2004; Long and Siau, 2007). Primarily, information provided by version control systems has been used, but also aggregated data provided by software repositories (Crowston and Scozzi, 2002; Hunt and Johnson, 2002; Krishnamurthy, 2002), meta-information included in Linux Software Map entries (Dempsey et al., 2002), or data retrieved directly from the source code itself (Ghosh and Prakash, 2000).

Although this data is available, the task is made more complicated by the large size and scope of the project repositories or code forges, and the heterogeneity of the projects being studied (Howison and Crowston, 2004; Robles et al., 2009). Therefore, in recent years RoRs (“repositories of repositories”) have been developed, which collect, aggregate, and clean the targeted repository data (Sowe et al., 2007). Two examples are FLOSSMetrics and FLOSSmole. These RoRs usually hold data collected from project repositories, and some of them also store analyses and metrics calculated on the retrieved data. The results (raw data, summary data, and/or analyses) are stored in a database and made accessible to the rest of the research community. The researcher therefore does not need to collect data independently.

3. OPEN SOURCE SOFTWARE DEVELOPMENT PROJECTS

Open source software development in many ways constitutes a new production mode, in which people are no longer collocated, and self-organization is prevalent. While there is one seminal description of the bazaar style of development by Raymond (1999), it should be noted that open source projects do differ significantly in the processes they employ (Scacchi et al., 2006), and that reality has been found to differ from this very theoretical description.

For example, there are strict release processes in place in several open source projects (Jorgensen, 2001; Holck and Jorgensen, 2004), and a considerable level of commercial involvement. Several ways of describing different open source development processes have been discussed: Crowston et al. (2006) operationalize a process characteristic based on the speed of bug fixing, Michlmayr (2005) uses a construct of process maturity, and concentration indices have also been used to characterize development forms (Koch and Neumann, 2008). We find that there is considerable variance in the practices actually employed, as well as in the technical infrastructure. The research on similarities and dissimilarities between open source development in general and other development models is still proceeding (Mockus et al., 2002; Koch, 2004; Scacchi et al., 2006), and remains a hotly debated issue (Bollinger et al., 1999; McConnell, 1999; Vixie, 1999).

Numerous quantitative studies of development projects and communities (Dempsey et al., 2002; Dinh-Trong and Bieman, 2005; Ghosh and Prakash, 2000; Koch and Schneider, 2002; Koch, 2004; Krishnamurthy, 2002; Mockus et al., 2002) have proposed process metrics to study open source work practices, such as the commit, which refers to a single change of a file by a single programmer, or the number of distinct programmers involved in writing and maintaining a file or project. One of the most consistent results coming out of this research is a heavily skewed distribution of effort between participants (Koch, 2004; Mockus et al., 2002; Ghosh and Prakash, 2000; Dinh-Trong and Bieman, 2005). Several studies have adopted the normalized Gini coefficient (Robles et al., 2004), a measure of concentration, for this. The Gini coefficient is a number between 0 and 1, where 0 indicates perfect equality and 1 total inequality or concentration, and can be based either on commits or on lines-of-code contributed, with studies showing no major difference. For example, Mockus et al. (2002) have shown that the top 15 of nearly 400 programmers in the Apache project added 88 per cent of the total lines-of-code. In the GNOME project, the top 15 out of 301 programmers were responsible for only 48 per cent, while the top 52 persons were necessary to reach 80 per cent (Koch and Schneider, 2002), with clustering hinting at the existence of a still smaller group of 11 programmers within this larger group. A similar distribution for the lines-of-code contributed to the project was found in a community of Linux kernel developers by Hertel et al. (2003). The results of the Orbiten Free Software survey (Ghosh and Prakash, 2000) are also similar: the first decile of programmers was responsible for 72 per cent, the second for 9 per cent of the total code.
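The concentration measures mentioned above can be sketched in a few lines. This is only an illustration, not the procedure of any cited study: the function names are hypothetical, and it assumes that "normalized" refers to the common small-sample correction that multiplies the Gini coefficient by n/(n-1), so that a single contributor doing all the work yields exactly 1.

```python
def gini(contributions, normalized=True):
    """Gini coefficient of per-programmer contributions (commits or LOC).

    Assumption: 'normalized' applies the n/(n-1) correction so that the
    maximum value for n contributors is exactly 1.
    """
    values = sorted(contributions)          # ascending order
    n = len(values)
    total = sum(values)
    if n < 2 or total == 0:
        return 0.0
    # Rank-weighted sum, with ranks starting at 1 for the smallest value.
    weighted = sum(rank * v for rank, v in enumerate(values, start=1))
    g = 2.0 * weighted / (n * total) - (n + 1.0) / n
    return g * n / (n - 1) if normalized else g


def top_share(contributions, k):
    """Fraction of the total contributed by the k largest contributors."""
    ordered = sorted(contributions, reverse=True)
    total = sum(ordered)
    return sum(ordered[:k]) / total if total else 0.0


# Hypothetical lines-of-code counts for five programmers.
loc = [12000, 3000, 800, 150, 50]
print(gini(loc), top_share(loc, 1))
```

A value near 1 from `gini` together with a high `top_share` for small k corresponds to the "few heads" pattern reported in the studies above.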

A second major result regarding organization of work is a low number of people working together on file level. For example, Koch and Neumann (2008) have found that only 12.2% of the files have more than three distinct authors. Most of the files have one (24.0%) or two (56.1%) programmers and only 3% have more than five distinct authors, in accordance with other studies on file or project level (Koch, 2004; Krishnamurthy, 2002; Mockus et al., 2002; Ghosh and Prakash, 2000).
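As a rough sketch of how such file-level figures can be derived from version control data, the following counts distinct authors per file. The record layout (pairs of author and file path) and all names are assumptions made for illustration; real repository logs need parsing and cleaning first.

```python
from collections import defaultdict


def authors_per_file(commits):
    """Map each file path to the set of distinct authors who changed it.

    `commits` is an iterable of (author, path) pairs, a hypothetical
    simplification of a version control log.
    """
    authors = defaultdict(set)
    for author, path in commits:
        authors[path].add(author)
    return authors


def share_with_more_than(authors, threshold):
    """Fraction of files with more than `threshold` distinct authors."""
    if not authors:
        return 0.0
    hits = sum(1 for names in authors.values() if len(names) > threshold)
    return hits / len(authors)


# Hypothetical log excerpt: four commits touching two files.
log = [("ann", "a.c"), ("bob", "a.c"), ("ann", "b.c"), ("ann", "a.c")]
per_file = authors_per_file(log)
print(share_with_more_than(per_file, 1))
```

Calling `share_with_more_than(per_file, 3)` on a full project log would give the kind of "more than three distinct authors" share quoted above.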

A similar distribution can also be found on project level in large-scale studies. For example, Koch (2004) in his study of several thousand projects found that the vast majority of projects had only a very small number of programmers (67.5 per cent had only 1 programmer). Only 1.3 per cent had more than 10 programmers. Analyzing the 100 most active mature projects on Sourceforge.net, Krishnamurthy (2002) also showed that most of the projects had only a small number of participants (median of 4). Only 19 per cent had more than 10 developers, and 22 per cent had only 1. While this share of single-developer projects is much smaller than that found by Koch (2004), this is not surprising as Krishnamurthy only used the 100 most active projects, not the full population.

4. EFFORT OF OPEN SOURCE SOFTWARE DEVELOPMENT PROJECTS

4.1 EFFORT ESTIMATION

Any discussion on open source projects, their success, efficiency or quality is incomplete without addressing the topic of effort, and relating other constructs to it. Unfortunately, this effort is basically unknown, even to the leaders of these projects, and therefore needs to be estimated. In addition such an estimation offers important insights to stakeholders, e.g. in the context of decisions about how an ongoing project might be managed, whether to join or to remain in a project, as well as adoption decisions by current or prospective users, including companies deciding whether to pursue a related business model.

Software engineering research has for many years focused on the topic of effort estimation, and has produced numerous models and methods. The best known of these is probably COCOMO (Boehm, 1981), which offers an algorithmic formula for estimating effort based on a quantification of the lines of code; this was modified and updated with the publication of COCOMO II (Boehm et al., 2000). Other options for effort estimation include the software equation (Putnam, 1978), approaches based on the function point metric (Albrecht and Gaffney, 1983), diverse machine-learning approaches and proprietary models such as ESTIMACS and SLIM. Many of these approaches are based on the general development project formulation introduced by Norden (1960), which develops a manpower function based on the number of people participating in a project at a given time. Differences in work organization in open source projects raise the question of whether participation in OS projects can be modeled and predicted using approaches created in the context of traditional software development, or whether new models need to be developed.

In general, we can employ two different approaches to estimating effort, one based on output, i.e. software and some measure of its size, the other based on evidence of participation. Especially the comparison of both approaches can yield important insights related to comparing open source to commercial software development, as the estimation based on size basically assumes an organization equivalent to commercial settings. In this paper, we will mostly focus on the results of this process, while for a more in-depth coverage, the reader is referred to Koch (2008), where the analysis is based on a set of more than 8000 projects. Other related work is limited (Koch and Schneider, 2002; Koch, 2004; Gonzalez-Barahona et al., 2004; Wheeler, 2005; Amor et al., 2006), and for the most part applies only basic models without further discussion, or indirect effort measures (Yu, 2006).

For participation-based estimation, the basis is formed by the work of Norden (1960), which models any development project as a series of problem-solving efforts by the manpower involved. The manpower involved at each moment in an open source project can be inferred from an analysis of the source code management system logs. The number of people usefully employed at any given time is assumed to be approximately proportional to the number of problems ready for solution at that time. The manpower function then represents a Rayleigh-type curve. While Norden postulates a finite and fixed number of problems, additional requests would lead to the generation of new problems to be worked on. While this effect might be small to negligible until the time of operation, it might be a driving factor later on. Therefore, while there are similarities in the early stages of a project, in the later stages, distinct differences in processes and organization of work show up, linked to differences in goal setting and eliciting. We will therefore explore this possible effect of work organization in the next section.
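The classical Rayleigh-type manpower curve can be written down compactly. The sketch below uses the standard Norden-Rayleigh form m(t) = 2·K·a·t·exp(-a·t²); the symbol names K (total effort) and a (shape parameter) are notation chosen here, not taken from the text, and the modified variants with newly generated problems are not shown.

```python
import math


def rayleigh_manpower(t, total_effort, a):
    """Manpower in use at time t: m(t) = 2*K*a*t*exp(-a*t^2)."""
    return 2.0 * total_effort * a * t * math.exp(-a * t * t)


def rayleigh_cumulative(t, total_effort, a):
    """Effort expended up to time t: K * (1 - exp(-a*t^2))."""
    return total_effort * (1.0 - math.exp(-a * t * t))


def peak_time(a):
    """Time of peak manpower, where dm/dt = 0: t = 1/sqrt(2a)."""
    return 1.0 / math.sqrt(2.0 * a)


# Hypothetical project: 100 person-months in total, shape parameter 0.02.
K, a = 100.0, 0.02
for month in (1, 5, 10, 20):
    print(month, rayleigh_manpower(month, K, a), rayleigh_cumulative(month, K, a))
```

Because the cumulative curve approaches K from below, a classical fit implies that work eventually dries up, which is the assumption the discussion above calls into question for open source projects in their later stages.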

The first model in the output-based estimation category is the original COCOMO (Boehm, 1981), and while severe problems related to the use of this model exist due to violated assumptions, it is still employed for comparison with other models and with existing studies (Gonzalez-Barahona et al., 2004; Wheeler, 2005). The necessary data can easily be gathered from the source code management system of any open source project, or by downloading the source code itself and submitting it to a counting program. The updated version COCOMO II (Boehm et al., 2000) eliminates some of these concerns, and so forms another option. An approach that is similar to COCOMO in that it is also based on an output metric is the function point method (Albrecht and Gaffney, 1983). This method in general offers several advantages, most importantly the possibility to quantify the function points relatively early, based on analysis and design. When estimating the effort of an existing system, however, it is difficult, especially for an outsider, to correctly quantify the function points, even after delivery. Another way of arriving at a number is to run the usual conversion of function point count to lines-of-code in reverse (Albrecht and Gaffney, 1983; Boehm et al., 2000): the literature proposes a mean number of lines-of-code required to implement a single function point in a given programming language. Once the number of function points for a given system is known, the literature provides several equations, basically production functions, to relate this number to effort, naturally all based on data collected in a commercial software development environment. Examples include Albrecht and Gaffney (1983), Kemerer (1987) or Matson et al. (1994), who propose both a linear and a logarithmic model.
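For illustration, an output-based estimate with the original COCOMO can be sketched as follows. This assumes the Basic variant of the model, effort = a · KLOC^b in person-months, with the coefficients published for its three development modes; the studies cited above may have used other variants or cost drivers, and the function name is hypothetical.

```python
# Basic COCOMO coefficients (a, b) per development mode (Boehm, 1981).
BASIC_COCOMO = {
    "organic": (2.4, 1.05),
    "semidetached": (3.0, 1.12),
    "embedded": (3.6, 1.20),
}


def basic_cocomo_effort(kloc, mode="organic"):
    """Estimated effort in person-months for `kloc` thousand lines of code."""
    a, b = BASIC_COCOMO[mode]
    return a * kloc ** b


# Hypothetical project size of 250,000 lines of code.
for mode in BASIC_COCOMO:
    print(mode, round(basic_cocomo_effort(250.0, mode), 1))
```

The only input is code size, which is why the estimate implicitly carries over the productivity assumptions of the commercial projects the coefficients were calibrated on.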

4.2 ORGANIZATION AND EFFORT

We first explore what the organization of work in open source projects means for manpower modeling based on the work of Norden (1960). As discussed, while Norden postulates a finite and fixed number of problems, additional requests from the user and developer community could lead to the generation of new problems to be worked on. Based on extensive modeling and comparisons using several alternative manpower functions for a set of more than 8000 projects (Koch, 2008), modified Norden-Rayleigh functions incorporating this effect significantly outperform the classical variant over complete project lifespans. We can see this as evidence that in the open source form of organization, additional requirements and functionalities are introduced to a higher degree than in commercial settings. Several possible forms for adding this effect to the Norden-Rayleigh model have also been explored (Koch, 2008): the features added seem to depend on the starting problem more than on the cumulative effort expended up to that time. This highlights the importance of the first requirements, often the vision of the project founder, in determining future enhancements. Also, a different proportionality factor, i.e. learning rate, has to be assumed compared to the main, i.e. initial, project, with a quadratic function better suited to modeling the addition of new problems. These results underline how much open source software development is actually driven by the participants and users, who truly shape the direction of such a project according to their needs or ideas.

We next turn to an analysis of the differences between participation-based estimation and output-based estimation. As the approaches in the latter group are all based on data from commercial settings, these differences can give an idea about the relation between participation-based effort and the effort that would be necessary to develop the same system in a different environment. The empirical analysis (Koch, 2008) shows distinctive differences: estimates derived from Norden-Rayleigh modeling were tested against each output-based method, and were significantly lower. An analytical comparison is also possible between COCOMO in both versions (Boehm, 1981; Boehm et al., 2000) and the Norden-Rayleigh model, because COCOMO is based on the latter. Londeix (1987) detailed how the Rayleigh curve corresponding to a given COCOMO estimation can be determined. In this case the other direction is employed, to find a parameter set in COCOMO corresponding to the Rayleigh curve derived from programmer participation. As COCOMO offers both development mode and a number of cost drivers as parameters, there is no single solution. In fact, no solution at all is possible within the parameter space, so open source development cannot be modeled using the original COCOMO, which leads to the conclusion that development has been more efficient than the model considers possible. When the possible parameters of COCOMO II are explored, the result is once again that development is very efficient, as both the cost drivers and the scale factors replacing the modes of development in COCOMO II have to be rated rather favorably, but this time the resulting combinations are within the range.

These differences showing up between the effort estimates based on participation and output might be due to several reasons. First, open source development organization might constitute a more efficient way of producing software (Bollinger et al., 1999; McConnell, 1999), due mostly to self-selection outperforming management intervention. Participants might be able to determine more accurately whether and where they are able to work productively on the project overall, or on particular tasks. In addition, overhead costs are very much reduced. The second explanation might be that the difference is caused by non-programmer participation, i.e. people participating by discussing on mailing lists, reporting bugs, maintaining web sites and the like. These are not included in the participation-based manpower modeling because the data is based on the source code management systems. If the Norden-Rayleigh and COCOMO estimates are compared, COCOMO results for effort are eight times higher. If it were assumed that this difference could only come from the invisible effort expended by these participants, this effort would have to be enormous. It would account for about 88 percent of the effort, translating to about 7 persons assisting each programmer. As has been shown, in open source projects the number of participants other than programmers is about one order of magnitude larger than the number of programmers (Dinh-Trong and Bieman, 2005; Mockus et al., 2002; Koch and Schneider, 2002), but their expended effort is implicitly assumed to be much smaller. We are therefore going to relate this to the ‘chief programmer team organization’, proposed more than 30 years ago (Mills, 1971; Baker, 1972).
This has also been termed the ‘surgical team’ by Brooks (1995), and is a form of organization in which system development is divided into tasks, each handled by a chief programmer who has responsibility for most of the actual design and coding, supported by a larger number of other specialists such as documentation writers or testers.
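The figures of about 88 percent and about 7 assistants follow directly from the factor of eight. The short sketch below reproduces the arithmetic; it assumes, as the text implicitly does, that each assisting person expends as much effort as a programmer, and the function names are hypothetical.

```python
def hidden_effort_share(ratio):
    """Share of total effort not visible in the version control data,
    if the output-based estimate is `ratio` times the participation-based one."""
    return 1.0 - 1.0 / ratio


def helpers_per_programmer(ratio):
    """Assisting persons per programmer needed to fill the gap, assuming
    each helper expends the same effort as a programmer."""
    return ratio - 1.0


ratio = 8.0  # COCOMO estimate divided by Norden-Rayleigh estimate
print(hidden_effort_share(ratio))      # 0.875, i.e. about 88 percent
print(helpers_per_programmer(ratio))   # 7.0
```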

5. EFFICIENCY OF OPEN SOURCE SOFTWARE DEVELOPMENT PROJECTS

5.1 EFFICIENCY AND SUCCESS

Before the effects of the organization of work on the efficiency of open source projects are explored, a short discussion of the conceptualization of efficiency as well as success of open source projects is necessary. These topics, especially success, are more difficult to define and grasp than in commercial software development. Consequently, there is increased discussion on how the success of open source projects can be defined (Stewart, 2004; Stewart et al., 2006; Stewart and Gosain, 2006; Crowston et al., 2003; Crowston et al., 2004; Crowston et al., 2006), using, for example, search engine results as proxies (Weiss, 2005), or measures like the number of downloads achieved. Over time, research has thus suggested several possible success measures, but aggregating them into an overall picture has been a major problem. For example, Crowston et al. (2006) present more than 15 success measures, and Stewart and Gosain (2006) also included subjective success measures from a survey. As a result, most studies choose a different set of success measures, and for the most part do not aggregate the chosen measures.

Efficiency and productivity in software development are most often expressed as the relation of an effort measure to an output measure, using either lines-of-code or, preferably due to independence from programming language, function points (Albrecht and Gaffney, 1983). This approach can be problematic even in an environment of commercial software development because components are missing, especially from the output; Kitchenham and Mendes (2004), for example, also agree that productivity measures need to be based on multiple size measures.

As discussed, the effort invested is normally unknown and consequently needs to be estimated, and the participants are also more diverse than in commercial projects, as they include core team members, committers, bug reporters and several other groups with varying intensity of participation. The outputs can be more diverse as well. In the general case, the inputs of an open source project can encompass a set of metrics, especially concerned with the participants. In the simplest cases the number of programmers and other participants can be used. The output of a project can be measured using several software metrics, most easily the lines-of-code, files, or others. This range of metrics for both inputs and outputs, and their different scales, necessitates the application of a more sophisticated and appropriate method. Many of the results presented in the next two chapters are therefore based on applying Data Envelopment Analysis (DEA) to this problem. DEA (Farrell, 1957; Charnes et al., 1978; Banker et al., 1984) is a non-parametric optimization method for efficiency comparisons without any need to define any relations between different factors or a production function. In addition, DEA can account for economies or dis-economies of scale, and is able to deal with multi-input, multi-output systems in which the factors have different scales.

The main result of applying DEA to a set of projects is an efficiency score for each project. This score can serve different purposes: single projects can be compared directly, but groups of projects, for example those following similar process models, located in different application domains or simply of different scale, can also be compared to determine whether any of these characteristics leads to higher efficiency.
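To give a feel for what such an efficiency score expresses, the sketch below covers only the simplest special case: one input, one output and constant returns to scale, where the DEA score reduces to each project's output/input ratio divided by the best ratio observed. The general multi-input, multi-output case requires solving a linear program per project and is not shown; the project data and names are hypothetical.

```python
def crs_efficiency(projects):
    """Efficiency scores for the single-input, single-output case under
    constant returns to scale.

    `projects` maps a project name to (input, output), for example
    (number of programmers, lines-of-code). Scores lie in (0, 1], and
    projects on the efficient frontier score exactly 1.0.
    """
    ratios = {name: out / inp for name, (inp, out) in projects.items()}
    best = max(ratios.values())
    return {name: ratio / best for name, ratio in ratios.items()}


# Hypothetical projects: (programmers, thousand lines of code).
sample = {"alpha": (2, 100), "beta": (4, 100), "gamma": (1, 50)}
print(crs_efficiency(sample))
```

Scores computed this way can then be averaged per group of projects to compare, say, application domains or size classes.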

5.2 ORGANIZATION AND EFFICIENCY

In this section, we will give an overview of the interrelationship between different attributes of open source projects characterizing their organization of work, as well as infrastructure, and their efficiency.

The first element to be explored naturally is the generally large number of participants. Following the reasoning of Brooks, an increased number of people working together will decrease productivity due to communication costs that grow much faster than team size (Brooks, 1995). Interestingly, this effect has not turned up in prior studies (Koch, 2004; Koch, 2007). This leads to the conclusion that Brooks’s Law seemingly does not apply to open source software development. There are several possible explanations for this, including very strict modularization, which increases the possible division of labor while reducing the need for communication. The low number of programmers working together on single files can also be taken as a hint of this. We will also explore the notion of superior tool and infrastructure use as a possible factor later.
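One common way to quantify the communication overhead Brooks describes is to count pairwise communication channels, n(n-1)/2 for a team of n people. The sketch below contrasts one undivided team with the same people split into modules; treating modules as fully independent is a deliberate simplification, and the function names are hypothetical.

```python
def communication_channels(n):
    """Number of pairwise channels in a team of n people: n*(n-1)/2."""
    return n * (n - 1) // 2


def modular_channels(team_sizes):
    """Total channels if people only communicate within their own module.

    Simplifying assumption: no communication across modules at all.
    """
    return sum(communication_channels(size) for size in team_sizes)


print(communication_channels(10))   # 45 channels in one team of ten
print(modular_channels([5, 5]))     # 20 channels in two modules of five
```

The drop from 45 to 20 channels illustrates why strict modularization is a plausible explanation for the missing productivity penalty.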

As empirical results have shown that the effort within open source projects is distributed very unequally, which seems to be a major characteristic of this type of organization, any effects this could have on efficiency should also be explored. Using a data set of projects from SourceForge and DEA, Koch (2008a) showed that there is in fact no connection: there was no significant difference in efficiency between projects with different levels of inequality, so this form of organization does not seem to incur a penalty. In some works, the license is also hypothesized to have an impact on success or efficiency. Subramanian et al. (2009) found such an effect, as did Stewart et al. (2006), while Koch (2008a) did not.

Finally, the infrastructure employed for communication and coordination naturally shapes the work done in a project. It has been hypothesized that the advent of the Internet, and especially of coordination and communication tools, is at least a precondition for open source development (Raymond, 1999; Rusovan et al., 2005; Robbins, 2005). For example, Michlmayr (2005) used a sample of projects to investigate whether process maturity, based on version control, mailing lists and testing strategies, has had any influence on the success of open source projects, and could confirm this. Koch (2009) analyzed the impact of the adoption of different tools offered by SourceForge, as well as of tool diversity, on project efficiency using DEA, and found surprising results: In a data set of successful projects, negative influences of tool adoption were actually found, while the results were more positive in a random data set.

Two explanations were proposed for this. The first is that projects, especially larger ones, might be using other tools. The second is that tools for communicating with users and potential co-developers can become more of a hindrance in successful projects, as they could increase the load to a degree that diverts developers' attention and time, which might be better spent on actual development work. In general, the successful projects are also at a more advanced stage, so these results seem to correspond to those of Stewart and Gosain (2006), who stress the importance of development stage as a moderator of project performance.

In addition, these projects in general have a higher number of developers, which, counter-intuitively, seems to be linked to negative effects of communication and coordination tool adoption. One explanation might be that projects with problems in communication and coordination due to team size adopt tools to a higher degree, but the tools cannot completely solve the problem once it has passed some threshold. Projects adopting these tools therefore have a lower efficiency, which might nevertheless be even lower without tool adoption. The same reasoning could apply to communication channels with users: Tool adoption alone might be unable to prevent total communication overload.

5.3 ORGANIZATION AND QUALITY

In addition to effects of organization on efficiency, quality is a major concern in software development, and a hugely debated topic in open source (Dinh-Trong and Bieman, 2005; Stamelos et al., 2002; Zhao and Elbaum, 2000). We will therefore highlight a few results which link elements of organization to the quality achieved, although related studies are quite rare. For capturing this, attributes of the development process as used before need to be related to characteristics of quality, for which diverse metrics from software engineering like McCabe's cyclomatic complexity (McCabe, 1976) or Chidamber and Kemerer's object-oriented metrics suite (Chidamber and Kemerer, 1994) can be employed.

Koch and Neumann (2008) attempted such an analysis using Java frameworks, and found that a high number of programmers and commits, as well as a high concentration, is associated with problems in quality on class level, mostly violations of size and design guidelines. This underlines the results of Koru and Tian (2005), who found that modules with many changes rate quite high on structural measures like size or inheritance. Two explanations are possible. If the architecture is not modular enough, a high concentration might show up as a result, as this can preclude more diverse participation. The other explanation is that classes that are programmed and/or maintained by a small core team are more complex because these programmers ‘know’ their own code and do not see the need for splitting large and complex methods. One possibility in this case is a refactoring (Fowler, 1999) towards a more modular architecture with smaller classes and more pronounced use of inheritance. This would increase the possible participation, thus maybe in turn leading to lower concentration, and would improve maintainability together with other quality aspects.

Underlining these results, MacCormack et al. (2006) used design structure matrices in a similar study to examine the difference between open source and proprietary software, without further discrimination in development practices. They find significant differences between Linux, which is more modular, and the first version of Mozilla. The evolution of Mozilla then shows purposeful redesign aiming for a more modular architecture, which resulted in modularity even higher than that of Linux. They conclude that a product's design mirrors the organization developing it, in that a product developed by a distributed team, such as Linux, was more modular than Mozilla, which was developed by a collocated team. Alternatively, the design also reflects purposeful choices made by the developers based on contextual challenges, in that Mozilla was successfully redesigned for higher modularity at a later stage.

On project level, there is a distinct difference: Those projects with a high overall quality ranking have more authors and commits, but a smaller concentration than those ranking poorly. Thus, on class level a negative impact of more programmers was found, while on project level the effect was positive. This underlines a central statement of open source software development on a general level, namely that as many people as possible should be attracted to a project. On the other hand, these resources should, from the viewpoint of product quality, be organized in small teams. Ideally, on both levels, the effort is not concentrated on too few of the relevant participants. Again, this seems to point to the organizational form of ‘chief programmer team organization’ (Mills, 1971; Baker, 1972; Brooks, 1995).
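
The design structure matrices mentioned above record, for each pair of elements, whether one depends on the other. As a rough sketch of how modularity can be read off such a matrix, the code below computes the share of ordered element pairs that are linked directly or indirectly. The function name `reachability_density` and both example matrices are hypothetical, and this is not claimed to be the exact measure used by MacCormack et al. (2006).

```python
def reachability_density(dsm):
    """Share of ordered pairs (i, j), i != j, where j is reachable from i.

    dsm[i][j] is truthy if element i depends directly on element j.
    Lower values mean that a change to one element can affect fewer others.
    """
    n = len(dsm)
    if n < 2:
        raise ValueError("need at least two elements")
    # Transitive closure via Warshall's algorithm on a boolean copy.
    reach = [[bool(cell) for cell in row] for row in dsm]
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    linked = sum(1 for i in range(n) for j in range(n) if i != j and reach[i][j])
    return linked / (n * (n - 1))


# Hypothetical four-element designs.
chain = [  # 0 -> 1 -> 2 -> 3: every change can ripple down the chain
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
modular = [  # two independent pairs: 0 -> 1 and 2 -> 3
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
chain_density = reachability_density(chain)
modular_density = reachability_density(modular)
```

The chain design links half of all ordered pairs, the modular one only a sixth, which corresponds to the intuition that a more modular architecture limits how far dependencies propagate.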

6. CONCLUSIONS

In this paper, we have surveyed the available literature related to characteristics of open source software development projects, and implications for effort and quality. Most of the empirical works have been based on mining the associated software repositories, and the results show that this is a promising way of achieving insights into projects and their characteristics.

One of the main principles and results found is the high concentration of programming work on a small number of individuals, which seems to hold true for most projects, similar to a very skewed distribution between projects. In addition, the number of people working on files cooperatively is quite small, with commercial involvement even increasing this trend. Under some circumstances, this high concentration can be linked to problems in quality and maintainability. The number of programmers attracted to a project forms a main focus point, and generally has positive implications: Having more programmers, quite interestingly and contrary to software engineering theory, does not reduce productivity, and does not negatively affect quality if the concentration is kept in check. When considering the effort, estimations based on programmer involvement are significantly below estimations based on project output. Besides high efficiency due to self-organization and absence of management overhead, this points to an enormous effort expended by non-programming participants. Many of these results point to one especially important characteristic that determines the success of an open source project, which is modularity. A modular architecture allows for high participation while avoiding the problems of high concentration on a lower level such as file or class.

There are many avenues for future research in the context of open source software development, and a few have been touched upon here. The topic of effort and effort estimation is far from closed, and especially participants other than programmers are not adequately reflected here. The relations between projects, for example inclusion of results or reuse, and their links to efficiency, effort and organization, similar to a market versus hierarchy discussion, would also be highly interesting to study. Numerous works have used social network analysis, e.g. Grewal et al. (2006) or Oh and Jeon (2007), both between and within projects, and this could be an interesting lens through which to inspect the chief programmer team aspect discussed here. Dalle and David (2005) have also started to analyze the allocation of resources in projects. Finally, the definition of success for open source projects is still difficult, and this uncertainty undermines some of the results from other studies or aspects.

REFERENCES
 
Albrecht, A. J., & Gaffney, J. E. (1983). Software Function, Source Lines of Code, and Development Effort Prediction: A Software Science Validation. IEEE Transactions on Software Engineering, 9(6), 639-648.

Amor, J. J., Robles, G., & Gonzalez-Barahona, J. M. (2006). Effort estimation by characterizing developer activity. In Proc. of the 2006 International Workshop on Economics Driven Software Engineering Research. International Conference on Software Engineering, Shanghai, China, pp. 3-6.

Atkins, D., Ball, T., Graves, T., & Mockus, A. (1999). Using Version Control Data to Evaluate the Impact of Software Tools. In Proc. 21st International Conference on Software Engineering. Los Angeles, CA, pp. 324-333.

Baker, F. T. (1972). Chief Programmer Team Management of Production Programming. IBM Systems Journal, 11(1), 56-73.

Banker, R.D., Charnes, A., & Cooper, W. (1984). Some Models for Estimating Technical and Scale Inefficiencies in Data Envelopment Analysis. Management Science, 30, 1078-1092.

Basset, T. (2004). Coordination and social structures in an open source project: Videolan. In Koch, S. (Ed.), Free/Open Source Software Development, pp. 125-151. Hershey, PA: Idea Group Publishing.

Boehm, B.W. (1981). Software Engineering Economics. Englewood Cliffs, NJ: Prentice Hall.

Boehm, B. W., Abts, C., Brown, A. W., Chulani, S., Clark, B. K., Horowitz, E., Madachy, R., Reifer, D. J., & Steece, B. (2000). Software Cost Estimation with COCOMO II. Upper Saddle River, NJ: Prentice Hall.

Bollinger, T., Nelson, R., Self, K. M., & Turnbull, S. J. (1999). Open-source methods: Peering through the clutter. IEEE Software, 16(4), 8-11.

Brooks Jr., F. P. (1995). The Mythical Man-Month: Essays on Software Engineering. Anniversary ed., Reading, MA: Addison-Wesley.

Charnes, A., Cooper, W., & Rhodes, E. (1978). Measuring the Efficiency of Decision Making Units. European Journal of Operational Research, 2, 429-444.

Chidamber, S., & Kemerer, C. F. (1994). A metrics suite for object oriented design. IEEE Transactions on Software Engineering, 20(6), 476-493.

Coleman, E. G., & Hill, B. (2004). The social production of ethics in Debian and free software communities: Anthropological lessons for vocational ethics. In Koch, S. (Ed.), Free/Open Source Software Development, pp. 273-295. Hershey, PA: Idea Group Publishing.

Cook, J. E., Votta, L. G., & Wolf, A. L. (1998). Cost-effective analysis of in-place software processes. IEEE Transactions on Software Engineering, 24(8), 650-663.

Crowston, K., & Scozzi, B. (2002). Open source software projects as virtual organizations: Competency rallying for software development. IEE Proceedings - Software Engineering, 149(1), 3-17.

Crowston, K., Annabi, H., & Howison, J. (2003). Defining Open Source Software Project Success. In Proceedings of ICIS 2003, Seattle, WA.

Crowston, K., Annabi, H., Howison, J., & Masango, C. (2004). Towards A Portfolio of FLOSS Project Success Measures. In Collaboration, Conflict and Control: The 4th Workshop on Open Source Software Engineering (ICSE 2004), Edinburgh, Scotland.

Crowston, K., Howison, J., & Annabi, H. (2006). Information systems success in free and open source software development: theory and measures. Software Process: Improvement and Practice, 11(2), 123-148.

Crowston, K., Li, Q., Wei, K., Eseryel, Y., & Howison, J. (2007). Self-organization of teams for free/libre open source software development. Information and Software Technology, 49, 564-575.

Dalle, J.-M., & David, P. A. (2005). The Allocation of Software Development Resources in Open Source Production Mode. In Feller, J., Fitzgerald, B., Hissam, S. A., & Lakhani, K. R. (eds.), Perspectives on Free and Open Source Software, pp. 297-328. Cambridge, MA: MIT Press.

Dempsey, B. J., Weiss, D., Jones, P., & Greenberg, J. (2002). Who is an open source software developer? CACM, 45(2), 67-72.

Dinh-Trong, T.T., & Bieman, J. M. (2005). The FreeBSD Project: A Replication Case Study of Open Source Development. IEEE Transactions on Software Engineering, 31(6), 481-494.

Elliott, M. S., & Scacchi, W. (2004). Free software development: Cooperation and conflict in a virtual organizational culture. In Koch, S. (Ed.), Free/Open Source Software Development, pp. 152-172. Hershey, PA: Idea Group Publishing.

Farrell, M. J. (1957). The Measurement of Productive Efficiency. Journal of the Royal Statistical Society, Series A, 120(3), 250-290.

Fitzgerald, B. (2006). The Transformation of Open Source Software. MIS Quarterly, 30(3), 587-598.

Gallivan, M. J. (2002). Striking a balance between trust and control in a virtual organization: A content analysis of open source software case studies. Information Systems Journal, 11(4), 277-304.

Ghosh, R., & Prakash, V. V. (2000). The Orbiten Free Software Survey. First Monday, 5(7).

Gonzalez-Barahona, J. M., Robles, G., Ortuno Perez, M., Rodero-Merino, L., Centeno-Gonzalez, J., Matellan-Olivera, V., Castro-Barbero, E., & de-las Heras-Quiros, P. (2004). Analyzing the anatomy of GNU/Linux distributions: methodology and case studies (Red Hat and Debian). In Koch, S. (Ed.), Free/Open Source Software Development. Hershey, PA: Idea Group Publishing.

Grewal, R., Lilien, G. L., & Mallapragada, G. (2006). Location, Location, Location: How Network Embeddedness Affects Project Success in Open Source Systems. Management Science, 52(7), 1043-1056.

Hertel, G., Niedner, S., & Hermann, S. (2003). Motivation of software developers in open source projects: An internet-based survey of contributors to the Linux kernel. Research Policy, 32(7), 1159-1177.

Holck, J., & Jorgensen, N. (2004). Do not check in on red: Control meets anarchy in two open source projects. In Koch, S. (Ed.), Free/Open Source Software Development, pp. 1–26. Hershey, PA: Idea Group Publishing.

Howison, J., & Crowston, K. (2004). The perils and pitfalls of mining SourceForge. In Proc. of the International Workshop on Mining Software Repositories. Edinburgh, Scotland, pp. 7-11.

Hunt, F., & Johnson, P. (2002). On the Pareto distribution of SourceForge projects. In Proc. Open Source Software Development Workshop. Newcastle, UK, pp. 122-129.

Jorgensen, N. (2001). Putting it all in the trunk: Incremental software engineering in the FreeBSD Open Source project. Information Systems Journal, 11(4), 321–336.

Kemerer, C. F. (1987). An Empirical Validation of Software Cost Estimation Models. CACM, 30(5), 416-429.

Kemerer, C. F., & Slaughter, S. (1999). An Empirical Approach to Studying Software Evolution. IEEE Transactions on Software Engineering, 25(4), 493-509.

Kitchenham, B., & Mendes, E. (2004). Software Productivity Measurement Using Multiple Size Measures. IEEE Transactions on Software Engineering, 30(12), 1023-1035.

Koch, S. (2004). Profiling an open source project ecology and its programmers. Electronic Markets, 14(2), 77-88.

Koch, S. (2007). Software Evolution in Open Source Projects - A Large-Scale Investigation. Journal of Software Maintenance and Evolution, 19(6), 361-382.

Koch, S. (2008). Effort Modeling and Programmer Participation in Open Source Software Projects. Information Economics and Policy, 20(4), 345-355. 

Koch, S. (2008a). Measuring the Efficiency of Free and Open Source Software Projects Using Data Envelopment Analysis. In Sowe, S.K., Stamelos, I., & Samoladas, I. (eds.): Emerging Free and Open Source Software Practices, pp. 25-44, Hershey, PA: IGI Publishing.

Koch, S. (2009). Exploring the Effects of SourceForge.net Coordination and Communication Tools on the Efficiency of Open Source Projects using Data Envelopment Analysis. Empirical Software Engineering, 14(4), 397-417. 

Koch, S., & Neumann, C. (2008). Exploring the Effects of Process Characteristics on Product Quality in Open Source Software Development. Journal of Database Management, 19(2), 31-57.

Koch, S., & Schneider, G. (2002). Effort, Cooperation and Coordination in an Open Source Software Project: Gnome. Information Systems Journal, 12(1), 27-42.

Krishnamurthy, S. (2002). Cave or community? an empirical investigation of 100 mature open source projects. First Monday, 7(6).

Londeix, B. (1987). Cost Estimation for Software Development. Addison-Wesley, Wokingham, UK.

Long, Y., & Siau, K. (2007). Social Network Structures in Open Source Software Development Teams. Journal of Database Management, 18(2), 25-40.

MacCormack, A., Rusnak, J., & Baldwin, C.Y. (2006). Exploring the Structure of Complex Software Designs: An Empirical Study of Open Source and Proprietary Code. Management Science, 52(7), 1015-1030.

Matson, J. E., Barrett, B. E., & Mellichamp, J. M. (1994). Software Development Cost Estimation Using Function Points. IEEE Transactions on Software Engineering, 20(4), 275-287.

McCabe, T. (1976). A complexity measure. IEEE Transactions on Software Engineering, 2(4), 308-320.

McConnell, S. (1999). Open-source methodology: Ready for prime time? IEEE Software, 16(4), 6-8.

Michlmayr, M. (2005). Software Process Maturity and the Success of Free Software Projects. In Zielinski, K. and Szmuc, T. (eds.): Software Engineering: Evolution and Emerging Technologies, pp. 3-14, Amsterdam, The Netherlands: IOS Press.

Mills, H. D. (1971). Chief Programmer Teams: Principles and Procedures. Report FSC 71-5108. IBM Federal Systems Division.

Mockus, A., Fielding, R., & Herbsleb, J. (2002). Two case studies of open source software development: Apache and Mozilla. ACM Transactions on Software Engineering and Methodology, 11(3), 309-346.

Norden, P. V. (1960). On the anatomy of development projects. IRE Transactions on Engineering Management, 7(1), 34-42.

Oh, W., & Jeon, S. (2007). Membership Herding and Network Stability in the Open Source Community: The Ising Perspective. Management Science, 53(7), 1086-1101.

Perens, B. (1999). The Open Source Definition. In DiBona, C., Ockman, S., & Stone, M. (eds.), Open Sources: Voices from the Open Source Revolution. Cambridge, MA: O’Reilly & Associates.

Putnam, L. H. (1978). A general empirical solution to the macro software sizing and estimating problem, IEEE Transactions on Software Engineering, 4(4), 345-361.

Raymond, E.S. (1999). The Cathedral and the Bazaar. Cambridge, MA: O’Reilly & Associates.

Robbins, J. (2005). Adopting Open Source Software Engineering (OSSE) Practices by Adopting OSSE Tools. In Feller, J., Fitzgerald, B., Hissam, S.A., & Lakhani, K.R. (eds), Perspectives on Free and Open Source Software, Cambridge, MA: MIT Press.

Robles, G., Koch, S., & Gonzalez-Barahona, J. M. (2004). Remote analysis and measurement of libre software systems by means of the CVSanalY tool. In ICSE 2004 - Proceedings of the Second International Workshop on Remote Analysis and Measurement of Software Systems (RAMSS ’04), pp. 51–55, Edinburgh, Scotland.

Robles, G., Gonzalez-Barahona, J.M., & Merelo, J.J. (2006). Beyond source code: The importance of other artifacts in software development (a case study). Journal of Systems and Software, 79(9), 1233-1248.

Robles, G., González-Barahona, J. M., Izquierdo-Cortazar, D., & Herraiz, I. (2009). Tools for the study of the usual data sources found in libre software projects. International Journal of Open Source Software & Processes, 1(1), 24-45.

Rusovan, S., Lawford, M., & Parnas, D.L. (2005). Open Source Software Development: Future or Fad? In Feller, J., Fitzgerald, B., Hissam, S.A., & Lakhani, K.R. (eds), Perspectives on Free and Open Source Software, Cambridge, MA: MIT Press.

Scacchi, W., Feller, J., Fitzgerald, B., Hissam, S., & Lakhani, K. (2006). Understanding Free/Open Source Software Development Processes. Software Process: Improvement and Practice, 11(2), 95–105.

Sowe, S. K., Angelis, L., Stamelos, I., & Manolopoulos, Y. (2007). Using Repository of Repositories (RoRs) to Study the Growth of F/OSS Projects: A Meta-Analysis Research Approach. OSS 2007, pp. 147-160.

Stallman, R. M. (2002). Free Software, Free Society: Selected Essays of Richard M. Stallman. Boston, MA: GNU Press.

Stamelos, I., Angelis, L., Oikonomou, A., & Bleris, G.L. (2002). Code quality analysis in open source software development. Information Systems Journal, 12, 43-60.

Stewart, K.J. (2004). OSS Project Success: From Internal Dynamics to External Impact. In Collaboration, Conflict and Control: The 4th Workshop on Open Source Software Engineering (ICSE 2004), Edinburgh, Scotland.

Stewart, K.J., Ammeter, A.P., & Maruping, L.M. (2006). Impacts of Licence Choice and Organisational Sponsorship on User Interest and Development Activity in Open Source Software Projects. Information Systems Research, 17(2), 126-144.

Stewart, K. J., & Gosain, S. (2006). The Moderating Role of Development Stage in Affecting Free/Open Source Software Project Performance. Software Process: Improvement and Practice, 11(2), 177-191.

Subramaniam, C., Sen, R., & Nelson, M. L. (2009). Determinants of open source software project success: A longitudinal study. Decision Support Systems, 46, 576-585.

Vixie, P. (1999). Software Engineering. In DiBona, C., Ockman, S., & Stone, M. (eds.), Open Sources: Voices from the Open Source Revolution. Cambridge, MA: O’Reilly & Associates.

von Krogh, G., & Spaeth, S. (2007). The open source software phenomenon: Characteristics that promote research. Journal of Strategic Information Systems, 16(3), 236–253.

Weiss, D. (2005). Measuring Success of Open Source Projects Using Web Search Engines. In Proceedings of the 1st International Conference on Open Source Systems, pp. 93-99, Genoa, Italy.

Wheeler, D. A. (2005). More Than a Gigabuck: Estimating GNU/Linux's Size - Version 1.07 (updated 2002). Retrieved May 21, 2010, from http://www.dwheeler.com/sloc/redhat71-v1/redhat71sloc.html

Yu, L. (2006). Indirectly predicting the maintenance effort of open-source software. Journal of Software Maintenance and Evolution, 18(3), 311-332.

Zhao, L., & Elbaum, S. (2000). A survey on quality related activities in open source. Software Engineering Notes, 25(3), 54-57.

Author(s)/Editor(s) Biography

Stefan Koch is Professor and Chair at Bogazici University, Department of Management. His research interests include user innovation, cost estimation for software projects, the open source development model, the evaluation of benefits from information systems, and ERP systems. He has published over 20 papers in peer-reviewed journals, including Information Systems Journal, Information Economics and Policy, Decision Support Systems, Empirical Software Engineering, Electronic Markets, Information Systems Management, Journal of Database Management, Journal of Software Maintenance and Evolution, Enterprise Information Systems and Wirtschaftsinformatik, and over 30 in international conference proceedings and book collections. He has also edited a book titled Free/Open Source Software Development for an international publisher in 2004, and serves as Editor-in-Chief of the International Journal on Open Source Software and Processes.