GitHub Insights: Understanding Open Source
Georgios Gousios
May 19, 2016
Talk given at OSCON 2016
Transcript
GitHub Insights: Understanding Open Source
@jeffmcaffer, Microsoft
Georgios Gousios, Delft University of Technology (TU Delft)
Kevin Lewis, Microsoft
Snapshot overview
Inspire confidence
How open is a project? http://ghtorrent.org/pullreq-perf/
Commits (core vs community)
Commits (origin)
Comments (core vs community)
PR lifelines
Are we using Git in a distributed way?
How many devs are there per country?
Insights
Business insights
Research insights
Cross-domain insights
Operational insights
Approach: Data for the masses
GitHub by the numbers (Mid 2016)
Approach: http://ghtorrent.org
How does it work? http://api.github.com/events
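The slide points at GitHub's public events feed. As a rough illustration of what polling that feed might look like, here is a minimal sketch using only the Python standard library. The helper names (`fetch_public_events`, `count_event_types`) and the `User-Agent` string are made up for this example, and the sketch is not GHTorrent's actual retrieval code. It assumes unauthenticated access, which GitHub rate-limits.

```python
import json
import urllib.request
from collections import Counter

# Public events endpoint named on the slide (https form).
EVENTS_URL = "https://api.github.com/events"


def fetch_public_events(url=EVENTS_URL, timeout=10):
    """Fetch one page of recent public events as a list of dicts.

    Unauthenticated requests are rate-limited by GitHub, so a real
    collector would add a token and back off between polls.
    """
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/vnd.github+json",
            "User-Agent": "events-sketch",  # hypothetical client name
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def count_event_types(events):
    """Count how many events of each type (PushEvent, ForkEvent, ...) appear."""
    return Counter(event["type"] for event in events)


# Usage (performs a network call):
#   for event_type, n in count_event_types(fetch_public_events()).most_common():
#       print(event_type, n)
```

Keeping the counting step separate from the network call means it can be tried offline on any saved list of events.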
Example event (condensed)
https://api.github.com/users/Cephei
https://api.github.com/repos/PowerDMS/Owin.Scim
https://api.github.com/repos/PowerDMS/Owin.Scim/commits/c751014f634d73e0b72f78a53c8cf137888b3
https://api.github.com/orgs/PowerDMS
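The URLs on this slide are the links a single event carries to other entities (user, repository, commit, organization). The sketch below shows one way such links could be pulled out of an event so that each can be retrieved in turn. The function name `linked_urls` is hypothetical, and the sample event is an assumption: it is modeled as a `PushEvent` and reuses the slide's URLs exactly as shown, with all other fields omitted.

```python
def linked_urls(event):
    """Return the API URLs an event links to: actor, repo, commits, then org."""
    urls = []
    for key in ("actor", "repo"):
        entity = event.get(key) or {}
        if "url" in entity:
            urls.append(entity["url"])
    # Push events list their commits under payload.commits.
    for commit in event.get("payload", {}).get("commits", []):
        if "url" in commit:
            urls.append(commit["url"])
    org = event.get("org") or {}
    if "url" in org:
        urls.append(org["url"])
    return urls


# Assumed, heavily condensed event built from the URLs on the slide.
sample_event = {
    "type": "PushEvent",  # assumption: the slide does not state the event type
    "actor": {"url": "https://api.github.com/users/Cephei"},
    "repo": {"url": "https://api.github.com/repos/PowerDMS/Owin.Scim"},
    "payload": {
        "commits": [
            {
                "url": "https://api.github.com/repos/PowerDMS/Owin.Scim/"
                       "commits/c751014f634d73e0b72f78a53c8cf137888b3"
            }
        ]
    },
    "org": {"url": "https://api.github.com/orgs/PowerDMS"},
}

for url in linked_urls(sample_event):
    print(url)
```

Each URL returned this way identifies one further API resource that a mirroring tool could fetch and store.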
Entities
GHTorrent architecture
[Diagram. Labels: GitHub API; Event Retrieval; Events queue; Commits queue; Projects queue; Data Retrieval workers (evt.commit, evt.watch, evt.fork); Mirroring Cluster]
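The architecture diagram shows events flowing through queues to data retrieval workers, with labels such as evt.commit, evt.watch and evt.fork. The following is a toy sketch of that routing idea only, with an in-memory structure standing in for a real message broker. The class `EventRouter` and the mapping from GitHub event types to the slide's labels are assumptions made for illustration, not GHTorrent's implementation.

```python
from collections import defaultdict, deque

# Assumed mapping from GitHub event types to the routing keys named on the slide.
ROUTING_KEYS = {
    "PushEvent": "evt.commit",
    "WatchEvent": "evt.watch",
    "ForkEvent": "evt.fork",
}


class EventRouter:
    """In-memory stand-in for a broker with one queue per routing key."""

    def __init__(self):
        self.queues = defaultdict(deque)

    def publish(self, event):
        """Queue an event under its routing key; return the key, or None if unmapped."""
        key = ROUTING_KEYS.get(event.get("type"))
        if key is not None:
            self.queues[key].append(event)
        return key

    def drain(self, key, handler):
        """Run handler over every event queued under key, in arrival order."""
        queue = self.queues[key]
        results = []
        while queue:
            results.append(handler(queue.popleft()))
        return results
```

In this sketch a separate worker per routing key would call `drain` with its own handler, which mirrors the diagram's several independent "Data Retrieval" boxes.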
GHTorrent by the numbers
Using the data: You can do it too!
Using the data: Hosted http://ghtorrent.org
Using the data: Download
Using the data: Self-service https://github.com/ghtorrent/ghtorrent-webhook
Using the data: Azure Data Lake
Resources
http://ghtorrent.org
https://github.com/Microsoft/ghinsights
@gousiosg @jeffmcaffer @kelewis