Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
GitHub Insights: Understanding Open Source
Search
Georgios Gousios
May 19, 2016
Technology
440
0
Share
GitHub Insights: Understanding Open Source
Talk given at OSCON 2016
Georgios Gousios
May 19, 2016
More Decks by Georgios Gousios
See All by Georgios Gousios
NLP + SE = ❤️
gousiosg
0
360
The troubles of modern dependency management and what to do about them
gousiosg
0
690
Mining Repositories with Apache Spark
gousiosg
0
710
My adventures with open everything
gousiosg
0
360
Structure and Evolution of Package Dependency Networks
gousiosg
0
900
Mining Github for fun and profit
gousiosg
9
63k
Work Practices and Challenges in Pull-Based Development: The Contributor’s Perspective
gousiosg
0
980
Big Data in Software Engineering panel and Privacy: Should we care?
gousiosg
0
340
The #issue32 incident
gousiosg
2
16k
Other Decks in Technology
See All in Technology
続 運用改善、不都合な真実 〜 物理制約のない運用改善はほとんど無価値 / 20260518-ssmjp-kaizen-no-value-without-physical-constraints
opelab
2
290
キャリア25年目にしてTypeScript に出会うまで - 「型」を通じて振り返るプログラミング言語遍歴 / Meeting TypeScript After 25 Years in Tech - Looking Back at My Programming Language Journey Through "Types"
bitkey
PRO
2
130
SpeechTranscriber + AIによる文字起こし機能
kazuki1220
0
120
ECSのTerraformモジュールにコントリビュートした話
harukasakihara
0
270
AWS運用におけるAI Agent活用術 / JAWS-UG 神戸 #11 LT大会
genda
1
320
[みん強]AIの価値を最大化するデータ基盤戦略:Self-Service型Data Meshへの転換とAgentic AI Meshに向けた取り組み with Snowflake他
y_matsubara
1
160
TSKaigi 2026 - enumよ、さようなら
teamlab
PRO
2
240
React Compiler導入の効果と運用の工夫
kakehashi
PRO
3
300
ジュニアエンジニアはSREとどう向き合うべきか
nrinetcom
PRO
0
100
The Bag-of-Documents Model for Query Understanding and Retrieval
dtunkelang
0
180
R&D 祭 2024 UE5で絵コンテ・作画の制作支援ツールをつくる話
olmdrd
PRO
0
200
PdM・Eng・QAで進めるAI駆動開発の現在地/aidd-with-pdm-eng-qa
shota_kusaba
0
260
Featured
See All Featured
Conquering PDFs: document understanding beyond plain text
inesmontani
PRO
4
2.7k
Practical Tips for Bootstrapping Information Extraction Pipelines
honnibal
25
1.9k
GraphQLの誤解/rethinking-graphql
sonatard
75
12k
Designing for Timeless Needs
cassininazir
1
220
End of SEO as We Know It (SMX Advanced Version)
ipullrank
3
4.2k
The Straight Up "How To Draw Better" Workshop
denniskardys
239
140k
Templates, Plugins, & Blocks: Oh My! Creating the theme that thinks of everything
marktimemedia
31
2.8k
Faster Mobile Websites
deanohume
310
31k
KATA
mclloyd
PRO
35
15k
BBQ
matthewcrist
89
10k
Git: the NoSQL Database
bkeepers
PRO
432
67k
Java REST API Framework Comparison - PWX 2021
mraible
34
9.3k
Transcript
GitHub Insights Understanding Open Source @jeffmcaffer–Microsoft Georgios Gousios –Delft University
of Technology (TU Delft) Kevin Lewis – Microsoft
Snapshot overview
Inspire confidence
How open is a project? http://ghtorrent.org/pullreq-perf/
Commits (core vs community)
Commits (origin)
Comments (core vs community)
PR lifelines
Are we using git in a distributed way?
How may devs are there per country?
Insights
Business insights
Research insights
Cross-domain insights
Operational insights
Approach Data for the masses
GitHub by the numbers (Mid 2016)
Approach http://ghtorrent.org
How does it work? http://api.github.com/events
Example event (condensed) https://api.github.com/users/Cephei https://api.github.com/repos/PowerDMS/Owin.Scim https://api.github.com/repos/PowerDMS/Owin.Scim/commits/c751014f634d73e0b72f78a53c8cf137888b3 https://api.github.com/orgs/PowerDMS
Entities
GHTorrent architecture Github API Event Retrieval Commits Queue Project Events
Queue Events Data Retrieval Projects Commits evt.commit evt.watch evt.fork Data Retrieval Data Retrieval Data Retrieval Mirroring Cluster
GHTorrent by the numbers
Using the data You can do it too!
Using the data: Hosted http://ghtorrent.org
Using the data: Download
Using the data: Self-service https://github.com/ghtorrent/ghtorrent-webhook
Using the data: Azure Data Lake
Resources http://ghtorrent.org https://github.com/Microsoft/ghinsights @gousiosg @jeffmcaffer @kelewis