Appendix 1: estimating the number of active FLOSS projects

From SME Guide

A recurring debate among FLOSS supporters and detractors concerns the real number of active FLOSS projects. While it is easy to look at the main repository site (sourceforge.net), which boasts more than 100,000 projects, it is equally easy to look in more depth and realize that a significant number of those projects are abandoned or show no significant development.

To obtain some unbiased estimates, we performed a first search among the main repository sites and FLOSS announcement portals. We also set a strict activity requirement, namely an activity index from 80% to 100% and at least one file release in the last 6 months. Of the 155959 projects examined, only 10656 (6.8%) are "active" under this admittedly very restrictive definition; relaxing the release period to 1 year gives 14455 active projects, or 9.2%.
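The percentages above can be re-derived from the quoted counts. The following is a minimal sketch, not the authors' tooling; the constant and function names are chosen here purely for illustration.

```python
# Figures quoted in the text; the names are illustrative assumptions.
TOTAL_PROJECTS = 155959
ACTIVE_6_MONTHS = 10656   # activity index 80-100%, release in the last 6 months
ACTIVE_12_MONTHS = 14455  # same activity index, release in the last 12 months


def active_share(active: int, total: int) -> float:
    """Return the active projects as a percentage of all projects."""
    return 100.0 * active / total


share_6 = active_share(ACTIVE_6_MONTHS, TOTAL_PROJECTS)
share_12 = active_share(ACTIVE_12_MONTHS, TOTAL_PROJECTS)

print(f"6-month window:  {share_6:.2f}%")
print(f"12-month window: {share_12:.2f}%")
```

The first ratio matches the 6.8% given in the text. The second comes out slightly above 9.2%, which suggests the quoted figure was truncated rather than rounded.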

However, while SourceForge can rightly be considered the largest single repository, it is not the only potential source of projects. There are many other vertical repositories, among them BerliOS, Savannah and Gna!; some are derived from the original version of the SourceForge code, and many more are based on a rewritten version called GForge.[1]

The results are summarized below:

Repository name        Number of projects
All GForge sites[2]    16776
Berlios Sourcewell     3340
Savannah               2793
Gna!                   1039

That gives a total of 23948 projects, among which (sampling 100 projects from each repository) we found a similar share of active projects, between 8% and 10%.
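As a quick check of the total, and to turn the sampled 8% to 10% share into absolute numbers, the figures can be combined as in the sketch below. The variable names are illustrative, and applying the sampled share uniformly to the whole total is a simplifying assumption made here, not a step stated by the authors.

```python
# Project counts per repository, as listed in the table above.
repositories = {
    "All GForge sites": 16776,
    "Berlios Sourcewell": 3340,
    "Savannah": 2793,
    "Gna!": 1039,
}

total = sum(repositories.values())

# Active share observed in the 100-project samples (between 8% and 10%).
# Assumption: the same range is applied uniformly to every repository.
LOW_SHARE, HIGH_SHARE = 0.08, 0.10
active_low = round(total * LOW_SHARE)
active_high = round(total * HIGH_SHARE)

print(f"Total projects: {total}")
print(f"Estimated active projects: {active_low} to {active_high}")
```

Under that assumption, these additional repositories contribute roughly two thousand active projects.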

The next step is to estimate how much of the overall FLOSS landscape is hosted on those sites. For this estimate we took the entire FreshMeat[3] announcement database, as processed by the FLOSSmole project[4], and found that projects with a homepage on one of the repository sites make up 23% of the total. This count is biased, however, because not all projects are equally likely to be announced on FreshMeat: English-language projects aimed at a large audience have a much higher probability of being listed.

To take this into account, we searched for non-English forges and for software oriented towards very specific areas, using data from past IST projects such as Spirit and AMOS. We found that non-English projects are significantly underrepresented in FreshMeat, but as their overall "business-readiness" is unclear (for example, there may be no translations available, or a project may be specific to the legal environment of a single country), we have ignored them. Vertical projects are also underrepresented, especially in scientific and technical areas, where the probability of being included is around 10 times lower than for other kinds of software. By using the results from Spirit, a sampling of project announcements in scientific mailing lists, and some repositories for the largest or most visible projects (such as the CRAN archive of libraries and packages for the R statistics language, which hosts 1195 projects), we reached a lower-bound estimate of around 12000 "vertical" and industry-specific projects.

So, we have an overall lower-bound estimate of around 195000 projects, of which we estimate that 7% are active, leading to around 13000 active projects. Of those, we estimate (using data from Slashdot, FreshMeat and the largest GForge sites) that 36% fall in the "stable" or "mature" stage, leading to a total of around 5000 projects that can be considered suitable for an SME, that is, stable, with an active community and with recent releases.
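The chain of estimates above can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: the names are illustrative, and summing the three counts quoted earlier (155959, 23948 and 12000) is only a guess at how the "around 195000" figure was composed, since the text does not spell that out.

```python
# Counts quoted earlier in the text. Assumption: these three figures are the
# components of the overall estimate; the text does not confirm this.
FIRST_SEARCH = 155959
OTHER_REPOSITORIES = 23948
VERTICAL_PROJECTS = 12000
component_sum = FIRST_SEARCH + OTHER_REPOSITORIES + VERTICAL_PROJECTS

# Figures stated directly in the paragraph above.
TOTAL_LOWER_BOUND = 195000
ACTIVE_PERCENT = 7    # share of projects considered active
STABLE_PERCENT = 36   # share of active projects that are stable or mature

active = TOTAL_LOWER_BOUND * ACTIVE_PERCENT / 100
suitable = active * STABLE_PERCENT / 100

print(f"Sum of quoted components: {component_sum}")
print(f"Active projects:          {active:.0f}")
print(f"Suitable for an SME:      {suitable:.0f}")
```

The component sum lands a little under the quoted 195000, and the unrounded products land close to the "around 13000" and "around 5000" figures in the text, which are evidently given only to a coarse approximation.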

It should be noted that this number is a lower bound, obtained with rather severe assumptions. This estimate also does not try to assess the number of projects that are not listed in the announcement sites (or even in vertical application portals). The omission is deliberate: it would be difficult to estimate the reliability of such a measure, and both the "findability" of a project and its probability of attracting sustained community participation are lower if it is difficult to find information on the project in the first place. Such "out of bounds" projects would therefore probably not be a good opportunity for SME adoption in any case.


  1. It has been suggested to the authors that in this way we may end up counting twice those projects that move from one site to another. In reality, as the "old" project becomes inactive it is removed from the count, so this risk is limited to projects that moved in the last 12 months. As moving is rather uncommon, this is a very small number that should not influence the overall percentages.
  2. As reported in the GForge site count, http://gforge.org/docman/view.php/1/52/gforge-sites.html
  3. A popular FLOSS announcement portal, www.freshmeat.net
  4. A collaborative collection and analysis of FLOSS data, http://ossmole.sourceforge.net/