Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
GitHub Insights: Understanding Open Source
Search
Georgios Gousios
May 19, 2016
Technology
450
0
Share
Embed
Copy iframe code
Copy JS code
Copy link
Start on current slide
GitHub Insights: Understanding Open Source
Talk given at OSCON 2016
Georgios Gousios
May 19, 2016
More Decks by Georgios Gousios
See All by Georgios Gousios
NLP + SE = ❤️
gousiosg
0
380
The troubles of modern dependency management and what to do about them
gousiosg
0
710
Mining Repositories with Apache Spark
gousiosg
0
740
My adventures with open everything
gousiosg
0
370
Structure and Evolution of Package Dependency Networks
gousiosg
0
930
Mining Github for fun and profit
gousiosg
9
63k
Work Practices and Challenges in Pull-Based Development: The Contributor’s Perspective
gousiosg
0
990
Big Data in Software Engineering panel and Privacy: Should we care?
gousiosg
0
350
The #issue32 incident
gousiosg
2
16k
Other Decks in Technology
See All in Technology
「勝手に広まる」人気 AI エージェントを爆速で作ろう!(AWS Summit Japan 2026講演資料)
minorun365
PRO
10
2.6k
“詰む”前に仕組みを作れ 〜技術の波に溺れないためのキャッチアップ術〜
takasyou
7
4.4k
AWS Summit Japan 2026の振り返りと2027へ向けて / AWS Summit Japan 2026 Recap and Prospects for 2027
kaminashi
1
100
Comment regagner la souveraineté de vos données tout en étant payé grâce à Nostr !
rlifchitz
0
220
クラウドファンディング版StackChan 3体(4体)をインタラクティブな体験型作品にして展示もした話 / スタックチャンお誕生日会2026
you
PRO
0
250
5分でわかる Amazon Connect_20260608
hwangbyeonghun
0
130
Hatena Engineer Seminar 37 jj1uzh
jj1uzh
0
180
スタートアップにAmazon EKSは早すぎる? マルチプロダクト戦略を加速する Platform Engineeringの実践 / Is Amazon EKS Too Soon for Startups? Practical Platform Engineering to Accelerate a Multi-Product Strategy
elmodev09
1
1.9k
打造你的 AI 工作流:Agent Skill + MCP 實戰工作坊
appleboy
0
180
Amazon Redshift zero-ETL 統合を活用した軽量なマルチプロダクトデータ可視化基盤 / Lightweight Multi-Product Data Visualization with Amazon Redshift Zero-ETL
kaminashi
0
110
徹底討論!ECS vs EKS!
daitak
3
1.8k
Fabricをフル活用する AI Agent Hub -製造業特化AIエージェントの設計
iotcomjpadmin
0
160
Featured
See All Featured
B2B Lead Gen: Tactics, Traps & Triumph
marketingsoph
0
160
Automating Front-end Workflow
addyosmani
1370
210k
How to train your dragon (web standard)
notwaldorf
97
6.7k
Fashionably flexible responsive web design (full day workshop)
malarkey
408
66k
Bridging the Design Gap: How Collaborative Modelling removes blockers to flow between stakeholders and teams @FastFlow conf
baasie
0
590
Stewardship and Sustainability of Urban and Community Forests
pwiseman
0
240
HTML-Aware ERB: The Path to Reactive Rendering @ RubyCon 2026, Rimini, Italy
marcoroth
2
250
No one is an island. Learnings from fostering a developers community.
thoeni
21
3.8k
Jamie Indigo - Trashchat’s Guide to Black Boxes: Technical SEO Tactics for LLMs
techseoconnect
PRO
0
210
Large-scale JavaScript Application Architecture
addyosmani
515
110k
Scaling GitHub
holman
464
140k
How Fast Is Fast Enough? [PerfNow 2025]
tammyeverts
3
620
Transcript
GitHub Insights Understanding Open Source @jeffmcaffer–Microsoft Georgios Gousios –Delft University
of Technology (TU Delft) Kevin Lewis – Microsoft
Snapshot overview
Inspire confidence
How open is a project? http://ghtorrent.org/pullreq-perf/
Commits (core vs community)
Commits (origin)
Comments (core vs community)
PR lifelines
Are we using git in a distributed way?
How may devs are there per country?
Insights
Business insights
Research insights
Cross-domain insights
Operational insights
Approach Data for the masses
GitHub by the numbers (Mid 2016)
Approach http://ghtorrent.org
How does it work? http://api.github.com/events
Example event (condensed) https://api.github.com/users/Cephei https://api.github.com/repos/PowerDMS/Owin.Scim https://api.github.com/repos/PowerDMS/Owin.Scim/commits/c751014f634d73e0b72f78a53c8cf137888b3 https://api.github.com/orgs/PowerDMS
Entities
GHTorrent architecture Github API Event Retrieval Commits Queue Project Events
Queue Events Data Retrieval Projects Commits evt.commit evt.watch evt.fork Data Retrieval Data Retrieval Data Retrieval Mirroring Cluster
GHTorrent by the numbers
Using the data You can do it too!
Using the data: Hosted http://ghtorrent.org
Using the data: Download
Using the data: Self-service https://github.com/ghtorrent/ghtorrent-webhook
Using the data: Azure Data Lake
Resources http://ghtorrent.org https://github.com/Microsoft/ghinsights @gousiosg @jeffmcaffer @kelewis