
Open source project data is not open data

Open source works in the open, so git logs, email archives, etc. are open data, right? WRONG!

This talk walks through the legal and technical challenges of analyzing open source projects and shares solutions and best practices. The audience is expected to have experience with open source projects and an interest in learning more about analyzing them. This can include maintainers, community managers, organizational open source program officers, academics, or data scientists with an interest in data from open source.



March 07, 2020


  1. @GeorgLink Open source project data is not open data

    Georg Link, SCALE 18x, 6pm, room 104, March 5-8, 2020, Pasadena, CA
  2. Today’s Topic: Analyzing Communities

    • Legal Challenges
    • Technical Challenges
    • Metric Best Practices
  3. About Georg Link

    Omaha, NE
    Ph.D. on Open Source Metrics
    Cofounder of CHAOSS Project
    Bitergia
  4. I am not the expert

  5. Thought Experiment: Contributor

    Imagine you are an open source contributor. You engage in an open source community to get work done (or for other reasons). How would you react to the following scenarios?
    • The community creates contributor profiles that show how much everyone contributed.
    • The community regularly recognizes the most active contributors.
    • A dashboard shows how the level of contributions evolves over time to show how strong the community is.
  6. Thought Experiment: Company

    Now imagine you work at a company and have developed a piece of software. Your company agrees to your proposal to open source the software but requires you to show the impact that this has for the company. How would you approach the following?
    • Ensure that a vibrant community forms around the software project.
    • Identify areas that need attention in the project.
    • Report progress to your manager.
  7. The role of Data

    In the thought experiments: What data did we talk about? Who were the data producers? Who were the data users?
  8. The role of Data

    In the thought experiments: What was the relationship between data, its producers, and its users?
  9. Community Data as Trace Data

    Created accidentally
    Contributions and their metadata
    Metadata typically has no license
    Contains personally identifiable information (PII) like names and emails
  10. Is community data open data?

    “Open data is data that can be freely used, re-used and redistributed by anyone - subject only, at most, to the requirement to attribute and sharealike.” -- https://opendatahandbook.org/
  11. Legal Challenges

  12. General Data Protection Regulation (GDPR) https://gdpr-info.eu/

  13. A gist of GDPR

    Art. 4 (1) ‘personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person;
  14. A gist of GDPR

    Art. 4 (2) ‘processing’ means any operation or set of operations which is performed on personal data or on sets of personal data, whether or not by automated means, such as collection, recording, organisation, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, restriction, erasure or destruction;
  15. A gist of GDPR

    Art. 4 (7) ‘controller’ means the natural or legal person, public authority, agency or other body which, alone or jointly with others, determines the purposes and means of the processing of personal data; where the purposes and means of such processing are determined by Union or Member State law, the controller or the specific criteria for its nomination may be provided for by Union or Member State law;
  16. A gist of GDPR

    Art. 4 (11) ‘consent’ of the data subject means any freely given, specific, informed and unambiguous indication of the data subject’s wishes by which he or she, by a statement or by a clear affirmative action, signifies agreement to the processing of personal data relating to him or her;
  17. GDPR: Rights of the data subject (Chapter 3 excerpt)

    Section 1 -- Transparency and modalities
    Section 2 -- Information and access to personal data
    Section 3 -- Rectification and erasure
    • “Right to be forgotten”
    • Restriction of Processing
    • Notification obligation
  18. GDPR grey area: Open Source

    GDPR was created for business and government organizations that collect data. Open source communities collect data for their own purposes, but not in the same sense that organizations do. Personal data is freely given to an open source project.
  19. Prior Informed Consent in Open Source

    GDPR is agnostic to the fact that data is already public!
    Two options for consent:
    • OPT-IN
      ◦ safe option
    • OPT-OUT
      ◦ requires demonstrating “legitimate interest”
  20. What we need to do

    Inform community members before processing data
    • How can we reach all community members and document that we informed them?
    Obtain consent
    • Provide means to OPT-IN or OPT-OUT
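The OPT-IN and OPT-OUT options could be enforced in code before any analysis runs. The sketch below is one possible way to do that, not an established tool: `ConsentRegistry`, its fields, and the sample records are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ConsentRegistry:
    """Hypothetical record of who opted in or out of data processing."""
    mode: str  # "opt-in" or "opt-out"
    opted_in: set = field(default_factory=set)
    opted_out: set = field(default_factory=set)

    def may_process(self, email: str) -> bool:
        key = email.strip().lower()
        if self.mode == "opt-in":
            # Only process people who explicitly agreed.
            return key in self.opted_in
        if self.mode == "opt-out":
            # Process everyone except people who objected.
            return key not in self.opted_out
        raise ValueError(f"unknown consent mode: {self.mode}")


def filter_contributions(contributions, registry):
    """Keep only contributions whose author may be processed."""
    return [c for c in contributions if registry.may_process(c["author_email"])]


# Made-up sample data for illustration.
contributions = [
    {"author_email": "alice@example.org", "commit": "a1"},
    {"author_email": "bob@example.org", "commit": "b2"},
    {"author_email": "Carol@example.org", "commit": "c3"},
]

opt_in = ConsentRegistry(mode="opt-in", opted_in={"alice@example.org"})
opt_out = ConsentRegistry(mode="opt-out", opted_out={"carol@example.org"})

opt_in_result = filter_contributions(contributions, opt_in)
opt_out_result = filter_contributions(contributions, opt_out)
```

With opt-in, only the contributor who agreed remains; with opt-out, everyone except the contributor who objected remains.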
  21. CCPA: California Consumer Privacy Act

    1798.140. (c) “business” means: ...
    • An open source project may not qualify as a business
  22. Recap: Legal Challenges

    Inform community before processing data
    Demonstrate “legitimate interest”
    Provide OPT-IN or OPT-OUT
    Be transparent and communicative
  23. Technical Challenges

  24. Where is the data? Where is the community?

    Git, GitHub, GitLab, BitBucket, Jira, Gerrit, Confluence, …
    Wiki, Discourse, Mailing List, IRC, Slack, Meetup.com, StackOverflow, …
  25. How to get the data

    Extract:
    • Get data from data sources
    Transform:
    • Unify data
    • Manage identities
    • Calculate metrics
    Load:
    • Visualize
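The extract, transform, and load steps can be pictured as three small functions chained together. This is a minimal sketch under stated assumptions: the record fields (`author`, `timestamp`) and the in-memory source are made up for illustration, and a real pipeline would read from the data sources listed earlier and feed a dashboard instead of returning strings.

```python
from collections import Counter
from datetime import datetime, timezone


def extract(source):
    """Extract: read raw records from a data source (here, an in-memory list)."""
    return list(source)


def transform(raw_records):
    """Transform: unify fields, then compute a simple per-author count."""
    unified = [
        {
            "author": r["author"].strip().lower(),
            "date": datetime.fromtimestamp(r["timestamp"], tz=timezone.utc),
        }
        for r in raw_records
    ]
    return Counter(rec["author"] for rec in unified)


def load(metrics):
    """Load: present the result as text lines, standing in for a visualization."""
    return [f"{author}: {count}" for author, count in sorted(metrics.items())]


# Hypothetical raw data with inconsistent author spellings.
raw = [
    {"author": "Alice ", "timestamp": 1583400000},
    {"author": "bob", "timestamp": 1583486400},
    {"author": "alice", "timestamp": 1583572800},
]
report = load(transform(extract(raw)))
```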
  26. Extracting data

    Easiest step
    Query logs
    Use APIs
    Challenge: Changing APIs and data formats
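As an example of querying logs, the sketch below reads a Git history through `git log` with a tab-separated format string and parses it into records. The function names and the chosen fields are assumptions for illustration. The parser assumes author names contain no tab characters, and the sample text is made up.

```python
import subprocess
from datetime import datetime

# Commit hash, author name, author email, strict ISO 8601 author date.
LOG_FORMAT = "%H%x09%an%x09%ae%x09%aI"


def read_git_log(repo_path):
    """Run `git log` in repo_path and return its raw text output."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", f"--pretty=format:{LOG_FORMAT}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_git_log(text):
    """Turn tab-separated log lines into a list of commit dictionaries."""
    commits = []
    for line in text.splitlines():
        if not line.strip():
            continue
        commit_hash, name, email, date = line.split("\t")
        commits.append({
            "hash": commit_hash,
            "name": name,
            "email": email,
            "date": datetime.fromisoformat(date),
        })
    return commits


# Made-up sample in the same shape that read_git_log() would return.
sample = (
    "a1b2c3\tAlice Example\talice@example.org\t2020-03-05T18:00:00-08:00\n"
    "d4e5f6\tBob Example\tbob@example.org\t2020-03-06T09:30:00+01:00\n"
)
commits = parse_git_log(sample)
```

Calling `parse_git_log(read_git_log("path/to/repo"))` would apply the same parsing to a local clone. Pinning the output format explicitly is one way to reduce exposure to changing data formats.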
  27. Transforming data

    Unify data
    • Date formats
    • Level of detail
    • Metadata about different contributions
    • Convert everything into the desired database structure
    Manage identities
    Calculate metrics
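Date formats are a typical case for unification: different sources may deliver epoch seconds, ISO 8601 strings, or email-style dates. The sketch below converts all three to UTC. The helper name `to_utc` is hypothetical, and treating timestamps without a time zone as UTC is an assumption that would need checking per data source.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def to_utc(value):
    """Convert epoch seconds, ISO 8601, or RFC 2822 dates to an aware UTC datetime."""
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value, tz=timezone.utc)
    text = value.strip()
    # Older Python versions do not accept a trailing "Z" in fromisoformat().
    iso_text = text[:-1] + "+00:00" if text.endswith("Z") else text
    try:
        parsed = datetime.fromisoformat(iso_text)
    except ValueError:
        # Fall back to the email (RFC 2822) date format used by mailing lists.
        parsed = parsedate_to_datetime(text)
    if parsed.tzinfo is None:
        # Assumption: timestamps without a zone are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


# The same instant written in four different styles.
samples = [
    1583460000,
    "2020-03-05T18:00:00-08:00",
    "Thu, 05 Mar 2020 18:00:00 -0800",
    "2020-03-06T02:00:00Z",
]
unified = [to_utc(s) for s in samples]
```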
  28. Transforming data

    Unify data
    Manage identities
    • Who is who in the community
    • Who do contributors work for (now and before)
    • Different usernames and emails
    • Assigning contributions to the correct person
    Calculate metrics
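One simple way to approach identity management is to merge records that share an email address or a username. The sketch below does this with a small union-find structure. It is an illustrative heuristic with made-up sample data, not a description of how any particular tool works, and shared identifiers can also produce false merges that need manual review.

```python
def merge_identities(identities):
    """Group identity records that share an email or a username."""
    parent = list(range(len(identities)))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    seen = {}  # (field, value) -> index of the first record that used it
    for index, identity in enumerate(identities):
        for key in ("email", "username"):
            value = identity.get(key)
            if not value:
                continue
            token = (key, value.lower())
            if token in seen:
                # Same identifier seen before: join the two groups.
                parent[find(index)] = find(seen[token])
            else:
                seen[token] = index

    groups = {}
    for index, identity in enumerate(identities):
        groups.setdefault(find(index), []).append(identity)
    return list(groups.values())


# Hypothetical identities collected from different data sources.
identities = [
    {"source": "git", "name": "Alice Example", "email": "alice@example.org"},
    {"source": "github", "username": "alice-e", "email": "alice@example.org"},
    {"source": "slack", "username": "alice-e"},
    {"source": "git", "name": "Bob Example", "email": "bob@example.org"},
]
people = merge_identities(identities)
```

Here the Slack record is linked to the Git record only indirectly, through the GitHub record that shares the email with one and the username with the other.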
  29. Transforming data

    Unify data
    Manage identities
    Calculate metrics
    • Primary metrics - summarizing original data
    • Secondary metrics
      ◦ Calculation from different data fields
      ◦ Combining data from different data sources
      ◦ Value judgements on data - e.g., quality models
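The difference between primary and secondary metrics can be shown with two small functions. Both metric choices and all sample data below are assumptions for illustration: commits per month summarizes original data, while median days to close is calculated from two different data fields.

```python
from collections import Counter
from datetime import datetime
from statistics import median


def commits_per_month(commits):
    """Primary metric: summarize the original data by counting commits per month."""
    return Counter(c["date"].strftime("%Y-%m") for c in commits)


def median_days_to_close(issues):
    """Secondary metric: calculated from two fields (opened and closed dates)."""
    durations = [
        (i["closed"] - i["opened"]).days
        for i in issues
        if i["closed"] is not None  # skip issues that are still open
    ]
    return median(durations) if durations else None


# Made-up sample data.
commits = [
    {"author": "alice", "date": datetime(2020, 1, 15)},
    {"author": "bob", "date": datetime(2020, 1, 20)},
    {"author": "alice", "date": datetime(2020, 2, 3)},
]
issues = [
    {"opened": datetime(2020, 1, 1), "closed": datetime(2020, 1, 4)},
    {"opened": datetime(2020, 1, 10), "closed": datetime(2020, 1, 11)},
    {"opened": datetime(2020, 2, 1), "closed": datetime(2020, 2, 8)},
    {"opened": datetime(2020, 2, 5), "closed": None},
]

monthly = commits_per_month(commits)
days_to_close = median_days_to_close(issues)
```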
  30. Loading data

    Who is the data user? How should the data be presented? What visualizations are most meaningful? What story does the data tell?
  31. Technical Solutions

  32. Software solutions

    CHAOSS GrimoireLab
    CHAOSS Augur
    Apache Kibble
    CNCF DevStats
  33. GrimoireLab Tools for Each Challenge

    Extract, Transform, Load, Manage Identities
  34. Extract with GrimoireLab

    Tool: Perceval
  35. Transform with GrimoireLab

    Tool: GrimoireELK
    • Unify data
    • Calculate metrics
  36. Transform with GrimoireLab

    Tool: SortingHat
    • Manage identities
  37. Load with GrimoireLab

    Tool: Kibiter (Kibana)

  38. GrimoireLab

  39. GrimoireLab

  40. GrimoireLab

  41. GrimoireLab

  42. GrimoireLab

  43. Too complicated? Try Cauldron.io

  44. Cauldron.io

  45. Cauldron.io

  46. Cauldron.io

  47. Cauldron.io

  48. gitlab.com/jsmanrique/cauldron-dashboards

    CHAOSS metrics
    Prebuilt dashboard
  49. https://community.cauldron.io/c/docs

    Best place to find help with getting started

  50. Metric Best Practices

  51. About CHAOSS

    Short for: Community Health Analytics Open Source Software
    Linux Foundation project
    Started in 2017
    Focused on “creating analytics and metrics to help define community health” -- https://chaoss.community
  52. The Idea Behind CHAOSS

    Metrics
    Software
  53. Defining Metrics in CHAOSS

    5 Working Groups:
    • Common Metrics
    • Diversity and Inclusion
    • Evolution
    • Risk
    • Value
    https://chaoss.community/metrics
  54. Goal-Question-Metric Approach

    Diagram labels: Goal; Question, Question; Metric, Metric, Metric, Metric
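The Goal-Question-Metric approach can be written down as a small tree: a goal is refined into questions, and each question is answered by metrics. The sketch below is one possible representation. The class names, the example goal, and the questions and metrics attached to it are hypothetical and not taken from the CHAOSS metric definitions.

```python
from dataclasses import dataclass, field


@dataclass
class Question:
    text: str
    metrics: list = field(default_factory=list)


@dataclass
class Goal:
    text: str
    questions: list = field(default_factory=list)

    def all_metrics(self):
        """Return every metric needed for this goal, without duplicates."""
        collected = []
        for question in self.questions:
            for metric in question.metrics:
                if metric not in collected:
                    collected.append(metric)
        return collected


# Hypothetical example: one goal, two questions, four metric slots.
goal = Goal(
    text="Ensure that a vibrant community forms around the project",
    questions=[
        Question(
            text="Are new contributors arriving?",
            metrics=["new contributors per month", "first-time commits"],
        ),
        Question(
            text="Do contributors stay?",
            metrics=["returning contributors", "new contributors per month"],
        ),
    ],
)
needed = goal.all_metrics()
```

Starting from the goal keeps the list of metrics short: only metrics that answer a stated question are collected, and a metric shared by two questions is counted once.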

  55. Metric Best Practices

    Follow Goal-Question-Metric approach
    Use metrics to tell a story
    Evaluate the usefulness of metric strategy
    Minimize gaming of metrics
    Start small and “get off zero”
  56. Recap

    Legal Challenges
    Technical Challenges
    Metric Best Practices
  57. Q&A

    LinkedIn/GitHub/Twitter @georglink
    Email georglink@bitergia.com