Bitergia
August 18, 2016

Determining the Geographical Distribution of a Community by Means of a Time Zone Analysis

Presented at OpenSym 2016

Transcript

  1. Determining the Geographical Distribution of a Community by Means of a Time-zone Analysis

    Jesus M. Gonzalez-Barahona, Gregorio Robles, Daniel Izquierdo-Cortazar
    OpenSym FLOSS Track, Berlin (Germany), Aug 17-19, 2016
  2. Tracking FLOSS communities

    FLOSS is developed by geographically distributed teams
    In many cases, those can be large: > 1,000 developers
    Extended communities much larger: 10,000s and more
    Increased interest in knowing and tracking those communities
    Software development dashboards
  3. Geographical information

    Interesting to many communities
    Useful to set up meetings, conference locations, …
    Fundamental to track inclusion policies
    Example (Mozilla's mission): “people worldwide can be informed contributors and creators of the Web”
  4. Obtaining geolocation is not easy

    Data is usually poor if it depends on developers registering it (e.g., the location of GitHub accounts)
    Surveys and similar methods are expensive and intrusive
    GeoIP is usually not possible
    In summary: better to use already available information
  5. Our approach: already available data

    Some tools capture some geodata of a person “by default”
    git:
        commit af4145239a05eb3b9dbf198a43e06d8a6acd0195
        Author: Jesus M. Gonzalez-Barahona <[email protected]>
        Date: Sun Aug 14 01:00:51 2016 +0200
    Email:
        From: "Jesus M. Gonzalez-Barahona" <[email protected]>
        To: ...
        Date: Fri, 12 Aug 2016 18:59:38 +0200
    Time zone is not perfect, but may be a good enough indicator
  6. Methodology

    Identification of data sources (git, mailing list repositories)
    Extraction of information (CVSAnalY, MLStats)
    Analysis and curation of the dataset
    Time zone analysis (GrimoireLib)
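The final step, the time zone analysis, can be pictured as counting events per UTC offset and per year. The sketch below is a stand-alone illustration using only the Python standard library; it does not use or reproduce the GrimoireLib API, and the function name `activity_by_offset` is hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta, timezone


def activity_by_offset(timestamps):
    """Map year -> Counter of {UTC offset in hours: number of events}."""
    per_year = defaultdict(Counter)
    for ts in timestamps:
        offset = ts.utcoffset()
        if offset is None:
            continue  # naive timestamp: no time zone information to count
        per_year[ts.year][offset.total_seconds() / 3600] += 1
    return per_year


if __name__ == "__main__":
    # Made-up timestamps, for illustration only (not data from the study).
    pacific = timezone(timedelta(hours=-8))
    india = timezone(timedelta(hours=5, minutes=30))
    sample = [
        datetime(2010, 11, 2, 9, 0, tzinfo=pacific),
        datetime(2010, 12, 5, 14, 30, tzinfo=pacific),
        datetime(2014, 3, 1, 11, 15, tzinfo=india),
    ]
    for year, counts in sorted(activity_by_offset(sample).items()):
        print(year, dict(counts))
```

Comparing the per-year counters is one simple way to see a community shifting from one offset to several over time.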
  7. Practical issues

    Continuously update data
    Extract relevant data (e.g., discard merge commits)
    Identify real contributors (e.g., discard bots)
    Represent time zones uniformly (e.g., IST is +5:30)
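The filtering and normalization steps above could look roughly like the sketch below. It assumes that a merge commit is one with more than one parent, that bots can be recognized from a hand-maintained name list (the names in `KNOWN_BOTS` are hypothetical), and that offsets arrive as `+HHMM` strings as git prints them. The `Commit` record is an illustrative structure, not one from CVSAnalY.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    author: str
    parent_count: int  # merge commits have more than one parent
    utc_offset: str    # as printed by git, e.g. "+0530"


# Hypothetical bot names; a real project would curate its own list.
KNOWN_BOTS = {"jenkins", "build-bot"}


def offset_to_hours(raw: str) -> float:
    """Convert "+0530" to 5.5 so half-hour zones fit on the same numeric scale."""
    if len(raw) != 5 or raw[0] not in "+-":
        raise ValueError(f"unexpected offset format: {raw!r}")
    sign = -1 if raw[0] == "-" else 1
    hours, minutes = int(raw[1:3]), int(raw[3:5])
    return sign * (hours + minutes / 60)


def relevant_offsets(commits):
    """Offsets of non-merge commits written by authors not listed as bots."""
    return [
        offset_to_hours(c.utc_offset)
        for c in commits
        if c.parent_count <= 1 and c.author.lower() not in KNOWN_BOTS
    ]


if __name__ == "__main__":
    history = [
        Commit("Alice", 1, "+0530"),
        Commit("Bob", 2, "-0800"),      # merge commit, discarded
        Commit("jenkins", 1, "+0000"),  # bot, discarded
    ]
    print(relevant_offsets(history))  # [5.5]
```

Storing offsets as fractional hours is one possible uniform representation; minutes east of UTC would work equally well.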
  8. Case study: Apache CloudStack

    Deployment & management of large collections of virtual machines
    Git repos active since August 2010
    Git activity 2010-2015: 40,000 commits, 300 developers
    Mailing lists for development activities
  9. CloudStack: some findings

    The project started with developers mainly on the US West Coast
    Evolved towards developers on both US coasts, in Europe and in India
    Data for commits and git authors seems reasonable
    Data for mailing lists is probably wrong for UTC+0
    Mailing list activity is high in East Asia (not in git data)
  10. Conclusions

    Timezone analysis allows for determination of main geographical areas for a community
    It is non-intrusive, uses already public data
    It is subject to data-quality problems
    The methodology can be fully automated
  11. Limitations

    Summer / winter time changes mangle results
    Time zones are large (e.g., both Europe and Africa fall in the same ones)
    Detection of “true” UTC+0 causes problems
    Sometimes the data is not available in archives (e.g., some mailing lists)
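To illustrate the first limitation: a contributor in a region with summer time shows up under two offsets one hour apart, so naive counting splits them across two bins. The sketch below flags such authors with a simple heuristic of my own (exactly two offsets, one hour apart) and keeps the lower one as the standard-time offset. This is an assumption for illustration, not the authors' method, and it would mislabel someone who travels between adjacent zones.

```python
from collections import defaultdict


def offsets_by_author(events):
    """Collect the set of UTC offsets (in hours) seen for each author."""
    seen = defaultdict(set)
    for author, offset_hours in events:
        seen[author].add(offset_hours)
    return seen


def likely_dst_authors(events):
    """Map authors seen under exactly two offsets one hour apart to the lower one.

    Summer time moves clocks forward, so the lower offset is taken as
    the standard-time (winter) offset.
    """
    result = {}
    for author, offsets in offsets_by_author(events).items():
        if len(offsets) == 2 and max(offsets) - min(offsets) == 1:
            result[author] = min(offsets)
    return result


if __name__ == "__main__":
    # Made-up (author, offset) pairs, for illustration only.
    events = [
        ("ana", 1.0), ("ana", 2.0),    # UTC+1 in winter, UTC+2 in summer
        ("raj", 5.5),                  # single offset all year
        ("kim", -8.0), ("kim", -7.0),  # UTC-8 in winter, UTC-7 in summer
    ]
    print(likely_dst_authors(events))  # {'ana': 1.0, 'kim': -8.0}
```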
  12. Determining the Geographical Distribution of a Community by Means of a Time-zone Analysis

    Jesus M. Gonzalez-Barahona, Gregorio Robles, Daniel Izquierdo-Cortazar
    Methodology to learn about the geographical distribution of a FLOSS project.
    Based on already available data, non-intrusive, can be fully automated.
    Case study: CloudStack