GitHub Insights: Understanding Open Source
Georgios Gousios
May 19, 2016
Talk given at OSCON 2016
Transcript
GitHub Insights: Understanding Open Source. @jeffmcaffer (Microsoft), Georgios Gousios (Delft University of Technology, TU Delft), Kevin Lewis (Microsoft)
Snapshot overview
Inspire confidence
How open is a project? http://ghtorrent.org/pullreq-perf/
Commits (core vs community)
Commits (origin)
Comments (core vs community)
PR lifelines
Are we using git in a distributed way?
How many devs are there per country?
Insights
Business insights
Research insights
Cross-domain insights
Operational insights
Approach: Data for the masses
GitHub by the numbers (Mid 2016)
Approach: http://ghtorrent.org
How does it work? http://api.github.com/events
Example event (condensed) https://api.github.com/users/Cephei https://api.github.com/repos/PowerDMS/Owin.Scim https://api.github.com/repos/PowerDMS/Owin.Scim/commits/c751014f634d73e0b72f78a53c8cf137888b3 https://api.github.com/orgs/PowerDMS
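The condensed event above links to a user, a repository, a commit, and an organization. A minimal sketch of pulling those links out of one event object follows. The sample is hand-written for illustration, not captured API output: the field names follow the general shape of GitHub event objects (`actor`, `repo`, `org`, `payload.commits`), the `PushEvent` type is an assumption because the slide does not name the event type, and the commit URL is copied exactly as it appears on the slide. The `linked_urls` helper is hypothetical.

```python
# Hand-written sample shaped like one entry of the events feed.
# Assumption: the event type is not given on the slide; "PushEvent" is a guess.
SAMPLE_EVENT = {
    "type": "PushEvent",
    "actor": {"login": "Cephei", "url": "https://api.github.com/users/Cephei"},
    "repo": {
        "name": "PowerDMS/Owin.Scim",
        "url": "https://api.github.com/repos/PowerDMS/Owin.Scim",
    },
    "org": {"login": "PowerDMS", "url": "https://api.github.com/orgs/PowerDMS"},
    "payload": {
        "commits": [
            {
                # URL reproduced as it appears on the slide.
                "url": "https://api.github.com/repos/PowerDMS/Owin.Scim/commits/"
                       "c751014f634d73e0b72f78a53c8cf137888b3",
            }
        ]
    },
}


def linked_urls(event):
    """Return the API URLs an event points at: actor, repo, org, then commits."""
    urls = []
    for key in ("actor", "repo", "org"):
        entity = event.get(key)
        if entity and "url" in entity:
            urls.append(entity["url"])
    # Only push-style events carry a commit list; other events yield none.
    for commit in event.get("payload", {}).get("commits", []):
        if "url" in commit:
            urls.append(commit["url"])
    return urls


if __name__ == "__main__":
    for url in linked_urls(SAMPLE_EVENT):
        print(url)
```

Each returned URL is a further resource that a mirroring tool would have to fetch to resolve the entities behind the event.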
Entities
GHTorrent architecture [diagram: GitHub API; Event Retrieval; queues for events, projects, and commits; Data Retrieval workers; evt.commit, evt.watch, evt.fork; Mirroring Cluster]
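The architecture slide shows events being retrieved, placed on queues, and consumed by data retrieval workers, with the labels evt.commit, evt.watch, and evt.fork. A rough sketch of that routing pattern is below. It is not GHTorrent's code: the in-memory deques stand in for whatever message broker the real system uses, the mapping from GitHub event types to the slide's labels is an assumption, and `route`, `drain`, and the handlers are hypothetical names.

```python
from collections import defaultdict, deque

# Assumed mapping from GitHub event types to the routing keys on the slide.
ROUTING_KEYS = {
    "PushEvent": "evt.commit",
    "WatchEvent": "evt.watch",
    "ForkEvent": "evt.fork",
}


def route(events):
    """Place each event on the queue for its routing key; skip unmapped types."""
    queues = defaultdict(deque)
    for event in events:
        key = ROUTING_KEYS.get(event["type"])
        if key is not None:
            queues[key].append(event)
    return queues


def drain(queues, handlers):
    """Run the handler registered for each routing key over its queue, in FIFO order."""
    results = []
    for key, queue in queues.items():
        handler = handlers.get(key)
        if handler is None:
            continue  # no worker registered for this key; leave the queue untouched
        while queue:
            results.append(handler(queue.popleft()))
    return results


if __name__ == "__main__":
    feed = [
        {"type": "PushEvent", "repo": "a/b"},
        {"type": "WatchEvent", "repo": "a/b"},
        {"type": "IssuesEvent", "repo": "c/d"},  # not mapped, so it is dropped
        {"type": "ForkEvent", "repo": "a/b"},
    ]
    # Hypothetical workers: each just describes what it would retrieve.
    handlers = {
        "evt.commit": lambda e: ("commits", e["repo"]),
        "evt.watch": lambda e: ("watchers", e["repo"]),
        "evt.fork": lambda e: ("forks", e["repo"]),
    }
    print(drain(route(feed), handlers))
```

Splitting retrieval by routing key lets each kind of follow-up request be scaled separately, which is one plausible reason for the several Data Retrieval boxes in the diagram.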
GHTorrent by the numbers
Using the data: You can do it too!
Using the data: Hosted http://ghtorrent.org
Using the data: Download
Using the data: Self-service https://github.com/ghtorrent/ghtorrent-webhook
Using the data: Azure Data Lake
Resources http://ghtorrent.org https://github.com/Microsoft/ghinsights @gousiosg @jeffmcaffer @kelewis