Slide 1

Slide 1 text

Mining GitHub for fun and profit Georgios Gousios // @gousiosg TU Delft

Slide 2

Slide 2 text

api.github.com
Entities: • static view • interlinked • current state
Events: • dynamic view • generated by user actions • affect current entity state • can be browsing roots

Slide 3

Slide 3 text

WatchEvent PushEvent ForkEvent . . . CreateEvent Events

Slide 4

Slide 4 text

Events: WatchEvent PushEvent ForkEvent . . . CreateEvent
{ "type": "WatchEvent", "payload": {...}, "public": true, "repo": {...}, "created_at": "2012-05-28T12:42...", "id": "1556481024", "actor": {"login": "Sarukhan"} }
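A minimal sketch of how a consumer might group events like the one above by their "type" field. The sample records are illustrative: only the first one borrows values from the slide, and the helper names are hypothetical.

```python
import json
from collections import Counter

# Illustrative events shaped like the slide's example (most fields omitted).
RAW_EVENTS = """
[
  {"type": "WatchEvent", "public": true, "id": "1556481024", "actor": {"login": "Sarukhan"}},
  {"type": "PushEvent", "public": true, "id": "2", "actor": {"login": "gousiosg"}},
  {"type": "WatchEvent", "public": true, "id": "3", "actor": {"login": "gousiosg"}}
]
"""


def count_event_types(raw: str) -> Counter:
    """Count how many events of each type appear in a JSON array of events."""
    return Counter(event["type"] for event in json.loads(raw))


def actors_for(raw: str, event_type: str) -> list:
    """Return the sorted, de-duplicated logins that generated a given event type."""
    events = json.loads(raw)
    return sorted({e["actor"]["login"] for e in events if e["type"] == event_type})


if __name__ == "__main__":
    print(count_event_types(RAW_EVENTS))
    print(actors_for(RAW_EVENTS, "WatchEvent"))
```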

Slide 5

Slide 5 text

Entities
repositories users organizations issues
/users/:user /user/repos /repos/:user/:repo/issues /orgs/:org
{ "type": "User", "public_gists": 10, "login": "gousiosg", "followers": 64, "name": "Georgios Gousios", "public_repos": 20, "created_at": ..., "id": 386172, "following": 16 }
{ . . .
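The route templates above use ":name" placeholders. As a small sketch, they could be expanded into full URLs like this. The function name is hypothetical, and the base URL assumes the api.github.com host shown earlier in the deck.

```python
import re

BASE_URL = "https://api.github.com"  # host taken from the deck; scheme assumed


def expand(template: str, **params: str) -> str:
    """Replace each :placeholder in a route template with the matching parameter."""
    path = re.sub(r":(\w+)", lambda m: str(params[m.group(1)]), template)
    return BASE_URL + path


if __name__ == "__main__":
    print(expand("/users/:user", user="gousiosg"))
    print(expand("/repos/:user/:repo/issues", user="rails", repo="rails"))
```

A missing parameter raises KeyError, which is usually preferable to silently requesting a malformed URL.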

Slide 6

Slide 6 text

<> PushEvent

Slide 7

Slide 7 text

<> PushEvent <> /:user/:repo/sha ensure_commit

Slide 8

Slide 8 text

<> PushEvent <> /repos/:user/:repo/ ensure_repo <> /:user/:repo/sha ensure_commit

Slide 9

Slide 9 text

<> PushEvent <> /users/:user ensure_user <> /repos/:user/:repo/ ensure_repo <> /:user/:repo/sha ensure_commit

Slide 10

Slide 10 text

<> PushEvent <> /users/:user ensure_user <> /repos/:user/:repo/ ensure_repo <> /repos/:user/:repo/commits ensure_commits <> /:user/:repo/sha ensure_commit

Slide 11

Slide 11 text

<> PushEvent <> /users/:user ensure_user <> /repos/:user/:repo/ ensure_repo <> /repos/:user/:repo/commits ensure_commits <> /:user/:repo/sha ensure_commit <> /repos/:user/:repo/commits/:sha/comments ensure_commit_comments

Slide 12

Slide 12 text

<> PushEvent <> /users/:user ensure_user <> /repos/:user/:repo/ ensure_repo <> /repos/:user/:repo/commits ensure_commits <> /:user/:repo/sha ensure_commit <> /users/:user/followers ensure_followers <> /repos/:user/:repo/commits/:sha/comments ensure_commit_comments

Slide 13

Slide 13 text

<> PushEvent <> /users/:user ensure_user <> /repos/:user/:repo/ ensure_repo <> /repos/:user/:repo/commits ensure_commits <> /:user/:repo/sha ensure_commit <> /users/:user/followers ensure_followers <> /repos/:user/:repo/commits/:sha/comments ensure_commit_comments <> /users/:user/orgs ensure_orgs

Slide 14

Slide 14 text

<> PushEvent <> /users/:user ensure_user <> /repos/:user/:repo/ ensure_repo <> /repos/:user/:repo/commits ensure_commits <> /:user/:repo/sha ensure_commit <> /users/:user/followers ensure_followers <> /repos/:user/:repo/commits/:sha/comments ensure_commit_comments <> /users/:user/orgs ensure_orgs <> /orgs/:org/teams ensure_teams

Slide 15

Slide 15 text

<> PushEvent <> /users/:user ensure_user <> /repos/:user/:repo/ ensure_repo <> /repos/:user/:repo/commits ensure_commits ensure_user <> /:user/:repo/sha ensure_commit <> /users/:user/followers ensure_followers <> /repos/:user/:repo/commits/:sha/comments ensure_commit_comments <> /users/:user/orgs ensure_orgs <> /orgs/:org/teams ensure_teams

Slide 16

Slide 16 text

<> PushEvent <> /users/:user ensure_user <> /repos/:user/:repo/ ensure_repo <> /repos/:user/:repo/commits ensure_commits ensure_user <> /:user/:repo/sha ensure_commit ensure_user <> /users/:user/followers ensure_followers <> /repos/:user/:repo/commits/:sha/comments ensure_commit_comments <> /users/:user/orgs ensure_orgs <> /orgs/:org/teams ensure_teams
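The chain above suggests that each ensure_* step first makes sure its dependencies exist, then fetches its own entity only if it is missing. The following is a rough sketch of that idea, not the GHTorrent implementation: the Mirror class, the fake_fetch stub, the "author_login" field, and the event shape (repo.name plus payload.commits) are all assumptions made for illustration.

```python
from typing import Callable, Dict, Tuple


class Mirror:
    """Toy retrieval layer: every ensure_* call fetches an entity at most once."""

    def __init__(self, fetch: Callable[[str], dict]):
        self.fetch = fetch                                # stand-in for an API client
        self.store: Dict[Tuple[str, str], dict] = {}      # stand-in for the database
        self.requests = 0                                 # number of API requests made

    def _ensure(self, kind: str, key: str, url: str) -> dict:
        # Fetch only when the entity is not stored yet.
        if (kind, key) not in self.store:
            self.requests += 1
            self.store[(kind, key)] = self.fetch(url)
        return self.store[(kind, key)]

    def ensure_user(self, user: str) -> dict:
        return self._ensure("user", user, f"/users/{user}")

    def ensure_repo(self, user: str, repo: str) -> dict:
        self.ensure_user(user)  # a repo depends on its owner
        return self._ensure("repo", f"{user}/{repo}", f"/repos/{user}/{repo}")

    def ensure_commit(self, user: str, repo: str, sha: str) -> dict:
        self.ensure_repo(user, repo)  # a commit depends on its repo
        commit = self._ensure(
            "commit", f"{user}/{repo}/{sha}", f"/repos/{user}/{repo}/commits/{sha}"
        )
        author = commit.get("author_login")  # hypothetical field name
        if author:
            self.ensure_user(author)  # a commit also depends on its author
        return commit

    def on_push_event(self, event: dict) -> list:
        user, repo = event["repo"]["name"].split("/", 1)
        return [self.ensure_commit(user, repo, c["sha"]) for c in event["payload"]["commits"]]


def fake_fetch(url: str) -> dict:
    """Offline stub that pretends every commit was authored by 'alice'."""
    doc = {"url": url}
    if "/commits/" in url:
        doc["author_login"] = "alice"
    return doc


if __name__ == "__main__":
    mirror = Mirror(fake_fetch)
    push = {"repo": {"name": "rails/rails"},
            "payload": {"commits": [{"sha": "abc"}, {"sha": "def"}]}}
    mirror.on_push_event(push)
    print(mirror.requests, sorted(mirror.store))
```

Because shared dependencies (the owner, the repo, the author) are stored after the first fetch, replaying the same event costs no further requests.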

Slide 17

Slide 17 text

Distributed data processing
[Diagram labels: GitHub API; Event Retrieval; queues (Events, Commits Queue, Project Events Queue); messages evt.commit, evt.watch, evt.fork; Data Retrieval workers (Projects, Commits); Mirroring Cluster]
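One way to read the diagram is as a producer that tags each event with a key such as evt.commit and workers that consume by key. The sketch below imitates that with Python's in-process queue module. The mapping from event types to keys is an assumption, and a real deployment would use a message broker rather than a local queue.

```python
import queue
from collections import defaultdict

# Assumed mapping from GitHub event types to the keys shown in the diagram.
ROUTING_KEYS = {
    "PushEvent": "evt.commit",
    "WatchEvent": "evt.watch",
    "ForkEvent": "evt.fork",
}


def publish(events, q: queue.Queue) -> None:
    """Event retrieval side: enqueue every event that has a known routing key."""
    for event in events:
        key = ROUTING_KEYS.get(event["type"])
        if key is not None:
            q.put((key, event))


def drain(q: queue.Queue, handlers: dict) -> dict:
    """Data retrieval side: hand each queued event to the handler for its key."""
    processed = defaultdict(int)
    while True:
        try:
            key, event = q.get_nowait()
        except queue.Empty:
            break
        handlers[key](event)
        processed[key] += 1
    return dict(processed)


if __name__ == "__main__":
    q = queue.Queue()
    publish([{"type": "PushEvent"}, {"type": "WatchEvent"}, {"type": "CreateEvent"}], q)
    print(drain(q, {key: print for key in ROUTING_KEYS.values()}))
```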

Slide 18

Slide 18 text

Relational database for querying

Slide 19

Slide 19 text

MongoDB as query-able cache
repositories users organizations issues
/users/:user /user/repos /repos/:user/:repo/issues /orgs/:org
{ "type": "User", "public_gists": 10, "login": "gousiosg", "followers": 64, "name": "Georgios Gousios", "public_repos": 20, "created_at": ..., "id": 386172, "following": 16 }
{ . . .
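The "query-able cache" idea can be sketched as follows: raw API responses are stored as documents keyed by URL, repeat lookups are served from storage, and the stored documents can still be filtered by field. A plain dict stands in for the MongoDB collection here, and the class and method names are hypothetical.

```python
class EntityCache:
    """Stores raw JSON responses by URL and lets callers query them by field."""

    def __init__(self, fetch):
        self.fetch = fetch   # stand-in for an API client
        self.docs = {}       # url -> document; stands in for a MongoDB collection
        self.misses = 0      # number of lookups that had to go to the API

    def get(self, url: str) -> dict:
        # Serve from storage when possible; otherwise fetch and remember.
        if url not in self.docs:
            self.misses += 1
            self.docs[url] = self.fetch(url)
        return self.docs[url]

    def find(self, **criteria) -> list:
        # Return every stored document whose fields match all criteria.
        return [
            doc for doc in self.docs.values()
            if all(doc.get(field) == value for field, value in criteria.items())
        ]


if __name__ == "__main__":
    cache = EntityCache(lambda url: {"type": "User", "login": url.rsplit("/", 1)[-1]})
    cache.get("/users/gousiosg")
    cache.get("/users/gousiosg")  # second call is served from the cache
    print(cache.misses, cache.find(type="User"))
```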

Slide 20

Slide 20 text

Open from the beginning

Slide 21

Slide 21 text

Periodic dumps of DBs

Slide 22

Slide 22 text

Query relational DB online

Slide 23

Slide 23 text

Query MySQL programmatically

Slide 24

Slide 24 text

Query MongoDB programmatically

Slide 25

Slide 25 text

Streaming updates

Slide 26

Slide 26 text

Real time analytics

Slide 27

Slide 27 text

User geolocation

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

Full language retrieval

Slide 30

Slide 30 text

Roll your own dataset
$ gem install sqlite3 bundler
$ git clone https://github.com/gousiosg/github-mirror
$ cd github-mirror
$ bundle install
$ mv config.yaml.standalone config.yaml
$ ruby -Ilib bin/ght-retrieve-repo -t token rails rails

Slide 31

Slide 31 text

Statistics
Since Feb 2012
12TB in MongoDB
4.5B rows in MySQL
2GB per hour
120k API reqs/hour
46 user-donated API keys
130+ users, 80+ institutions
80+ papers
3 data mining challenges
2 best paper awards
G. Gousios and D. Spinellis, “GHTorrent: GitHub’s Data from a Firehose,” in MSR, 2012, pp. 12–21
G. Gousios, “The GHTorrent dataset and tool suite,” in MSR, 2013, pp. 233–236

Slide 32

Slide 32 text

10x Growth

                 MongoDB   MySQL   Diff 2016/2013
Events           476               11.1x
Users            6.7       9.2     8.4x
Repos            28        25.5    21.8x
Commits          367       362     12.3x
Issues           24.1      25.3    10.3x
Pull requests    11.9      11.1    9.7x
Issue comments   42        43      14.6x
Watchers         51        37      6.6x

G. Gousios, The Evolution of GHTorrent: Growing an Open Access Dataset

Slide 33

Slide 33 text

10x
[Figure: number of events by date, 2012 to 2016, log-scale y-axis (1,000 to 100,000), one series per event type: CommitCommentEvent, FollowEvent, ForkEvent, IssueCommentEvent, IssuesEvent, MemberEvent, PullRequestEvent, PullRequestReviewCommentEvent, PushEvent, TeamAddEvent, WatchEvent]
G. Gousios, The Evolution of GHTorrent: Growing an Open Access Dataset

Slide 34

Slide 34 text

10x
[Figure: number of events by date, 2012 to 2016, linear y-axis (0 to 300,000), one series per event type: CommitCommentEvent, FollowEvent, ForkEvent, IssueCommentEvent, IssuesEvent, MemberEvent, PullRequestEvent, PullRequestReviewCommentEvent, PushEvent, TeamAddEvent, WatchEvent]
G. Gousios, The Evolution of GHTorrent: Growing an Open Access Dataset

Slide 35

Slide 35 text

https://octodex.github.com

Slide 36

Slide 36 text

https://octodex.github.com

Slide 37

Slide 37 text

contributor: Here are my changes
integrator (changes examined): Please fix those issues
contributor: Here are my updates
integrator (changes re-examined): Looks great, thanks!
changes integrated

Slide 38

Slide 38 text

What factors affect PR acceptance?
Across 5k repos / ~1M pull requests
85% merged, 70% with merge button
80% < 150 lines, < 7 files, 3 commits
66% < 1 day to merge
80% 4 comments, 3 participants
Mostly rejected due to observability/awareness issues (not technical!)
G. Gousios, M. Pinzger, and A. van Deursen, “An Exploratory Study of the Pull-based Software Development Model,” ICSE 2014, pp. 345–355

Slide 39

Slide 39 text

Which factors affect PR acceptance? ?

Slide 40

Slide 40 text

G. Gousios et al., “Distributed software development with pull requests”. Unpublished. Which factors affect PR acceptance? Do we know the submitter? Can we handle the workload? How ready is our project for PRs? What does the PR look like?

Slide 41

Slide 41 text

G. Gousios et al., “Distributed software development with pull requests”. Unpublished. Which factors affect PR acceptance? Do we know the submitter? Can we handle the workload? How ready is our project for PRs? Do we know the submitter? What does the PR look like?

Slide 42

Slide 42 text

G. Gousios et al., “Distributed software development with pull requests”. Unpublished. Which factors affect PR acceptance? Do we know the submitter? Can we handle the workload? How ready is our project for PRs? Can we handle the workload? What does the PR look like?

Slide 43

Slide 43 text

G. Gousios et al., “Distributed software development with pull requests”. Unpublished. Which factors affect PR acceptance? Do we know the submitter? Can we handle the workload? How ready is our project for PRs? What does the PR look like? What does the PR look like?

Slide 44

Slide 44 text

G. Gousios et al., “Distributed software development with pull requests”. Unpublished. Which factors affect PR acceptance? Do we know the submitter? Can we handle the workload? How ready is our project for PRs? What does the PR look like? How ready is our project for PRs?

Slide 45

Slide 45 text

Can we handle the workload? G. Gousios et al., “Distributed software development with pull requests”. Unpublished. Which factors affect the time to process PRs? Do we know the submitter? How ready is our project for PRs? What does the PR look like?

Slide 46

Slide 46 text

Can we handle the workload? G. Gousios et al., “Distributed software development with pull requests”. Unpublished. Which factors affect the time to process PRs? Do we know the submitter? How ready is our project for PRs? What does the PR look like? Can we handle the workload?

Slide 47

Slide 47 text

Can we handle the workload? G. Gousios et al., “Distributed software development with pull requests”. Unpublished. Which factors affect the time to process PRs? Do we know the submitter? How ready is our project for PRs? What does the PR look like? Do we know the submitter?

Slide 48

Slide 48 text

Can we handle the workload? G. Gousios et al., “Distributed software development with pull requests”. Unpublished. Which factors affect the time to process PRs? Do we know the submitter? How ready is our project for PRs? What does the PR look like? How ready is our project for PRs?

Slide 49

Slide 49 text

Can we handle the workload? G. Gousios et al., “Distributed software development with pull requests”. Unpublished. Which factors affect the time to process PRs? Do we know the submitter? How ready is our project for PRs? What does the PR look like? What does the PR look like?

Slide 50

Slide 50 text

What do integrators actually believe?
Survey of 650 integrators
Generally, few complaints about the process
Points of pain are mostly social (workload, drive-by PRs, explaining rejection)
Needed tools: quality analysis, impact analysis, work prioritization
G. Gousios, A. Zaidman, M.-A. Storey, and A. van Deursen, “Work Practices and Challenges in Pull-Based Development: The Integrator’s Perspective,” ICSE 2015, pp. 358–368.

Slide 51

Slide 51 text

What do contributors actually believe?
Survey of 640 contributors
Similar issues, reversed: awareness, asynchrony, responsiveness
G. Gousios, M.-A. Storey, and A. Bacchelli, “Work Practices and Challenges in Pull-Based Development: The Contributor’s Perspective,” ICSE, 2016.

Slide 52

Slide 52 text

Decide change Code/Check quality Submit Discuss intentions Feedback Code review Submit Proposed fixes Discuss fixes Code review Accept Code/Check quality contributors integrators

Slide 53

Slide 53 text

Decide change Code/Check quality Submit Discuss intentions Feedback Code review Submit Proposed fixes Discuss fixes Code review Accept Code/Check quality contributors integrators

Slide 54

Slide 54 text

quality

Slide 55

Slide 55 text

quality lack of process

Slide 56

Slide 56 text

quality lack of process workload and responsiveness

Slide 57

Slide 57 text

quality lack of process workload and responsiveness communication

Slide 58

Slide 58 text

IMPORTANT EVERYTHING ELSE

Slide 59

Slide 59 text

Which are IMPORTANT?

Slide 60

Slide 60 text

Which are IMPORTANT? Day 1 Day 2 Day 3 Day 4

Slide 61

Slide 61 text

Which are IMPORTANT? Day 1 Day 2 Day 3 Day 4

Slide 62

Slide 62 text

Which are IMPORTANT? Day 1 Day 2 Day 3 Day 4

Slide 63

Slide 63 text

Which are IMPORTANT? Day 1 Day 2 Day 3 Day 4

Slide 64

Slide 64 text

Which are IMPORTANT? Day 1 Day 2 Day 3 Day 4

Slide 65

Slide 65 text

Which are IMPORTANT? Day 1 Day 2 Day 3 Day 4

Slide 66

Slide 66 text

Which are IMPORTANT? Day 1 Day 2 Day 3 Day 4

Slide 67

Slide 67 text

Which are IMPORTANT? Day 1 Day 2 Day 3 Day 4

Slide 68

Slide 68 text

Which are IMPORTANT? Day 1 Day 2 Day 3 Day 4

Slide 69

Slide 69 text

Which are IMPORTANT? Day 1 Day 2 Day 3 Day 4 IMPORTANT = about to be active

Slide 70

Slide 70 text

Precision Recall AUC Accuracy Random Forests 0.66 0.63 0.89 0.86 Naive Bayes 0.34 0.79 0.75 0.60 Logistic regression 0.36 0.84 0.81 0.62 E. van der Veen, G. Gousios, and A. Zaidman, “Automatically Prioritizing Pull Requests,” MSR, 2015, pp. 357–361.
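For readers less familiar with the columns above, this sketch shows how precision, recall, and accuracy follow from confusion-matrix counts. The counts used are made up for illustration and are not taken from the paper.

```python
def metrics(tp: int, fp: int, fn: int, tn: int):
    """Return (precision, recall, accuracy) from confusion-matrix counts."""
    precision = tp / (tp + fp)                   # of the PRs flagged important, how many were
    recall = tp / (tp + fn)                      # of the important PRs, how many were flagged
    accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of all PRs classified correctly
    return precision, recall, accuracy


if __name__ == "__main__":
    # Hypothetical counts: 30 true positives, 10 false positives,
    # 20 false negatives, 40 true negatives.
    precision, recall, accuracy = metrics(tp=30, fp=10, fn=20, tn=40)
    print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```

A classifier that flags many pull requests as important tends to raise recall at the cost of precision, which matches the pattern in the Naive Bayes and logistic regression rows.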

Slide 71

Slide 71 text

Precision Recall AUC Accuracy Random Forests 0.66 0.63 0.89 0.86 Naive Bayes 0.34 0.79 0.75 0.60 Logistic regression 0.36 0.84 0.81 0.62 E. van der Veen, G. Gousios, and A. Zaidman, “Automatically Prioritizing Pull Requests,” MSR, 2015, pp. 357–361.

Slide 72

Slide 72 text

Pourquoi

Slide 73

Slide 73 text

Openness perf reports How open is your project to community contributions? • 5k projects • every 15 days http://ghtorrent.org/pullreq-perf

Slide 74

Slide 74 text

Reviewer recommendation Yue Yu, Huaimin Wang, Gang Yin, and Tao Wang, “Reviewer recommendation for pull-requests in GitHub: What can we learn from code review and bug assignment?,” IST, 2016 Exploit @mention networks to propose top-3 reviewers for incoming pull requests. Accuracy ~60% on top-3 recommendation
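As a heavily simplified illustration of using @mentions to pick reviewers, the sketch below counts how often each login is mentioned in past comments and proposes the three most mentioned. It is not the algorithm from the cited paper; the function name and the login pattern are assumptions.

```python
import re
from collections import Counter

# Simplified login pattern: letters, digits, and hyphens after "@".
MENTION = re.compile(r"@([A-Za-z0-9-]+)")


def top_reviewers(comments, k: int = 3, exclude=()):
    """Rank logins by how often they are @mentioned across the given comments."""
    counts = Counter()
    for body in comments:
        for login in MENTION.findall(body):
            if login not in exclude:  # e.g. skip the pull request's own author
                counts[login] += 1
    return [login for login, _ in counts.most_common(k)]


if __name__ == "__main__":
    history = [
        "@alice please review",
        "cc @bob @alice",
        "@carol @alice @bob what do you think?",
        "@dave fyi",
    ]
    print(top_reviewers(history))
```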

Slide 75

Slide 75 text

Automated code review Vincent J. Hellendoorn, Premkumar T. Devanbu, and Alberto Bacchelli. 2015. Will they like this?: evaluating code contributions with language models. MSR ’15, pp. 157–167 Examine how “natural” the PR code is with respect to the project’s code base. Accepted PRs are significantly more similar to the project. More debated PRs are significantly less similar

Slide 76

Slide 76 text

Gender and Tenure Vasilescu, Posnett, Ray, van den Brand, Serebrenik, Devanbu, and Filkov. Gender and Tenure Diversity in GitHub Teams. CHI. 2015 “Our study suggests that, overall, when forming or recruiting a software team, increased gender and tenure diversity are associated with greater productivity.”

Slide 77

Slide 77 text

Geographical equality A. Rastogi, N. Nagappan, and G. Gousios, “All contributors are equal; some contributors are more equal than others,” TR, 2016 Traces of bias on contributions from certain countries • Contributors perceive it • Integrators do not

Slide 78

Slide 78 text

Gender & Contributions Terrell J, Kofink A, Middleton J, Rainear C, Murphy-Hill E, Parnin C. (2016) Gender bias in open source: Pull request acceptance of women versus men. PeerJ PrePrints 4:e1733v1 When gender is identifiable: women rejected more often When gender is not identifiable: women accepted more often

Slide 79

Slide 79 text

The #issue32 incident

Slide 80

Slide 80 text

No content

Slide 81

Slide 81 text

No content

Slide 82

Slide 82 text

I am not a lawyer! • Other commenters are not lawyers either • The law is complicated and open to interpretation

Slide 83

Slide 83 text

Two important issues • Copyright: Who owns the data? • Privacy: How does GHTorrent protect users from personal data misuse?

Slide 84

Slide 84 text

Copyright —General terms • For original content, the publisher maintains full copyright by default • Licenses restrict the effect of copyright • Events (e.g. the fact that an issue comment was created) are not copyrightable, but their content may be

Slide 85

Slide 85 text

Copyright — GitHub’s POV • GitHub: We claim no intellectual property rights over the material you provide to the Service. (TOS F.1) • Structure of API responses is GitHub’s IP • Several fields in API responses may contain copyrighted material

Slide 86

Slide 86 text

Copyright situation example { "id": "4141500869", "type": "IssueCommentEvent", "actor": {}, "repo": {}, "payload": { "action": "created", "issue": { "id": 158442053, "number": 138, "title": "Issue in CopyrightedProjectName", "user": {}, "labels": [], "state": "closed", "body": "Added data holding classes and a map manager. Will add a system soon" }, "comment": { "created_at": "2016-06-14T05:51:16Z", "updated_at": "2016-06-14T05:51:16Z", "body": "continuing in #141 \r\n" } } }

Slide 87

Slide 87 text

Copyright situation example { "id": "4141500869", "type": "IssueCommentEvent", "actor": {}, "repo": {}, "payload": { "action": "created", "issue": { "id": 158442053, "number": 138, "title": "Issue in CopyrightedProjectName", "user": {}, "labels": [], "state": "closed", "body": "Added data holding classes and a map manager. Will add a system soon" }, "comment": { "created_at": "2016-06-14T05:51:16Z", "updated_at": "2016-06-14T05:51:16Z", "body": "continuing in #141 \r\n" } } } ©Issue initiator

Slide 88

Slide 88 text

Copyright situation example { "id": "4141500869", "type": "IssueCommentEvent", "actor": {}, "repo": {}, "payload": { "action": "created", "issue": { "id": 158442053, "number": 138, "title": "Issue in CopyrightedProjectName", "user": {}, "labels": [], "state": "closed", "body": "Added data holding classes and a map manager. Will add a system soon" }, "comment": { "created_at": "2016-06-14T05:51:16Z", "updated_at": "2016-06-14T05:51:16Z", "body": "continuing in #141 \r\n" } } } ©Issue initiator ©Issue commenter

Slide 89

Slide 89 text

Copyright situation example { "id": "4141500869", "type": "IssueCommentEvent", "actor": {}, "repo": {}, "payload": { "action": "created", "issue": { "id": 158442053, "number": 138, "title": "Issue in CopyrightedProjectName", "user": {}, "labels": [], "state": "closed", "body": "Added data holding classes and a map manager. Will add a system soon" }, "comment": { "created_at": "2016-06-14T05:51:16Z", "updated_at": "2016-06-14T05:51:16Z", "body": "continuing in #141 \r\n" } } } ©Project Name ©Issue initiator ©Issue commenter

Slide 90

Slide 90 text

Copyright situation example { "id": "4141500869", "type": "IssueCommentEvent", "actor": {}, "repo": {}, "payload": { "action": "created", "issue": { "id": 158442053, "number": 138, "title": "Issue in CopyrightedProjectName", "user": {}, "labels": [], "state": "closed", "body": "Added data holding classes and a map manager. Will add a system soon" }, "comment": { "created_at": "2016-06-14T05:51:16Z", "updated_at": "2016-06-14T05:51:16Z", "body": "continuing in #141 \r\n" } } } ©GitHub ©Project Name ©Issue initiator ©Issue commenter

Slide 91

Slide 91 text

Privacy is the ability of an individual or group to seclude themselves, or information about themselves, and thereby express themselves selectively.

Slide 92

Slide 92 text

Privacy provisions — EU • Personal data identify a person uniquely • Facts are not personal data • GHTorrent processes personal data, therefore is a controller • Controllers must • get consent for processing (except in the case of legitimate interest) • include mechanisms for opting out

Slide 93

Slide 93 text

Privacy provisions — USA • No single law/directive • Consent only required for specific types of data storage (e.g. social security numbers) • Offering an opting out mechanism

Slide 94

Slide 94 text

What did GHTorrent do? • Stopped distributing user names and emails in MySQL data dumps • Researchers can “sign” a form to get access to private data • Created an opt-out process • In the process of creating Terms of Fair Use
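A minimal sketch of what removing personal fields and honoring opt-outs before an export could look like. The column names ("id", "name", "email") and the function name are hypothetical and do not describe the actual GHTorrent dump process.

```python
PRIVATE_FIELDS = ("name", "email")  # hypothetical column names to withhold


def export_users(rows, opted_out_ids=frozenset()):
    """Drop private columns and skip users who opted out entirely."""
    exported = []
    for row in rows:
        if row["id"] in opted_out_ids:
            continue  # opted-out users are left out of the dump
        exported.append({k: v for k, v in row.items() if k not in PRIVATE_FIELDS})
    return exported


if __name__ == "__main__":
    users = [
        {"id": 1, "name": "A. Example", "email": "a@example.org", "country": "NL"},
        {"id": 2, "name": "B. Example", "email": "b@example.org", "country": "GR"},
    ]
    print(export_users(users, opted_out_ids={2}))
```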

Slide 95

Slide 95 text

A question of research ethics Can we, in the name of science, • send emails to developers? • create developer profiles? • recommend work to developers? • rank developers based on contributions? • compare project characteristics? • characterise community practices?

Slide 96

Slide 96 text

 http://ghtorrent.org