Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
GitHub Insights: Understanding Open Source
Search
Georgios Gousios
May 19, 2016
Technology
0
380
GitHub Insights: Understanding Open Source
Talk given at OSCON 2016
Georgios Gousios
May 19, 2016
Tweet
Share
More Decks by Georgios Gousios
See All by Georgios Gousios
NLP + SE = ❤️
gousiosg
0
300
The troubles of modern dependency management and what to do about them
gousiosg
0
560
Mining Repositories with Apache Spark
gousiosg
0
670
My adventures with open everything
gousiosg
0
310
Structure and Evolution of Package Dependency Networks
gousiosg
0
790
Mining Github for fun and profit
gousiosg
9
63k
Work Practices and Challenges in Pull-Based Development: The Contributor’s Perspective
gousiosg
0
940
Big Data in Software Engineering panel and Privacy: Should we care?
gousiosg
0
300
The #issue32 incident
gousiosg
2
16k
Other Decks in Technology
See All in Technology
なぜ新機能リリース翌日にモニタリング可能なのか? 〜リードタイム短縮とリソース問題を「自走」で改善した話〜 / data_summit_findy_Session_2
sansan_randd
1
130
Oracle Cloud Infrastructure:2025年10月度サービス・アップデート
oracle4engineer
PRO
0
110
どうなる Remix 3
tanakahisateru
1
270
ubuntu-latest から ubuntu-slim へ移行しよう!コスト削減うれしい~!
asumikam
0
260
決済システムの信頼性を支える技術と運用の実践
ykagano
0
130
Playwrightで始めるUI自動テスト入門
devops_vtj
0
180
Design and implementation of "Markdown to Google Slides" / phpconfuk 2025
k1low
1
260
Copilotの精度を上げる!カスタムプロンプト入門.pdf
ismk
9
1.8k
3年ぶりの re:Invent 今年の意気込みと前回の振り返り
kazzpapa3
0
110
よくわからない人向けの IAM Identity Center とちょっとした落とし穴
kazzpapa3
2
290
データとAIで明らかになる、私たちの課題 ~Snowflake MCP,Salesforce MCPに触れて~ / Data and AI Insights
kaonavi
0
340
AI時代におけるドメイン駆動設計 入門 / Introduction to Domain-Driven Design in the AI Era
fendo181
0
270
Featured
See All Featured
Designing for Performance
lara
610
69k
jQuery: Nuts, Bolts and Bling
dougneiner
65
7.9k
How To Stay Up To Date on Web Technology
chriscoyier
791
250k
Mobile First: as difficult as doing things right
swwweet
225
10k
10 Git Anti Patterns You Should be Aware of
lemiorhan
PRO
658
61k
Raft: Consensus for Rubyists
vanstee
140
7.2k
Building Applications with DynamoDB
mza
96
6.7k
Responsive Adventures: Dirty Tricks From The Dark Corners of Front-End
smashingmag
253
22k
Building Adaptive Systems
keathley
44
2.8k
Making the Leap to Tech Lead
cromwellryan
135
9.6k
Evolution of real-time – Irina Nazarova, EuRuKo, 2024
irinanazarova
9
1k
Facilitating Awesome Meetings
lara
57
6.6k
Transcript
GitHub Insights Understanding Open Source @jeffmcaffer–Microsoft Georgios Gousios –Delft University
of Technology (TU Delft) Kevin Lewis – Microsoft
Snapshot overview
Inspire confidence
How open is a project? http://ghtorrent.org/pullreq-perf/
Commits (core vs community)
Commits (origin)
Comments (core vs community)
PR lifelines
Are we using git in a distributed way?
How may devs are there per country?
Insights
Business insights
Research insights
Cross-domain insights
Operational insights
Approach Data for the masses
GitHub by the numbers (Mid 2016)
Approach http://ghtorrent.org
How does it work? http://api.github.com/events
Example event (condensed) https://api.github.com/users/Cephei https://api.github.com/repos/PowerDMS/Owin.Scim https://api.github.com/repos/PowerDMS/Owin.Scim/commits/c751014f634d73e0b72f78a53c8cf137888b3 https://api.github.com/orgs/PowerDMS
Entities
GHTorrent architecture Github API Event Retrieval Commits Queue Project Events
Queue Events Data Retrieval Projects Commits evt.commit evt.watch evt.fork Data Retrieval Data Retrieval Data Retrieval Mirroring Cluster
GHTorrent by the numbers
Using the data You can do it too!
Using the data: Hosted http://ghtorrent.org
Using the data: Download
Using the data: Self-service https://github.com/ghtorrent/ghtorrent-webhook
Using the data: Azure Data Lake
Resources http://ghtorrent.org https://github.com/Microsoft/ghinsights @gousiosg @jeffmcaffer @kelewis