Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
The GHTorrent dataset and toolsuite
Search
Georgios Gousios
May 17, 2013
Technology
4
130k
The GHTorrent dataset and toolsuite
MSR2013 data paper presentation
Georgios Gousios
May 17, 2013
Tweet
Share
More Decks by Georgios Gousios
See All by Georgios Gousios
NLP + SE = ❤️
gousiosg
0
280
The troubles of modern dependency management and what to do about them
gousiosg
0
520
Mining Repositories with Apache Spark
gousiosg
0
650
My adventures with open everything
gousiosg
0
290
Structure and Evolution of Package Dependency Networks
gousiosg
0
760
Mining Github for fun and profit
gousiosg
9
63k
GitHub Insights: Understanding Open Source
gousiosg
0
360
Work Practices and Challenges in Pull-Based Development: The Contributor’s Perspective
gousiosg
0
920
Big Data in Software Engineering panel and Privacy: Should we care?
gousiosg
0
280
Other Decks in Technology
See All in Technology
さくらのIaaS基盤のモニタリングとOpenTelemetry/OSC Hokkaido 2025
fujiwara3
2
260
Lazy application authentication with Tailscale
bluehatbrit
0
110
事業成長の裏側:エンジニア組織と開発生産性の進化 / 20250703 Rinto Ikenoue
shift_evolve
PRO
1
770
mrubyと micro-ROSが繋ぐロボットの世界
kishima
2
380
マーケットプレイス版Oracle WebCenter Content For OCI
oracle4engineer
PRO
3
940
モバイル界のMCPを考える
naoto33
0
360
250627 関西Ruby会議08 前夜祭 RejectKaigi「DJ on Ruby Ver.0.1」
msykd
PRO
2
370
作曲家がボカロを使うようにPdMはAIを使え
itotaxi
0
390
FOSS4G 2025 KANSAI QGISで点群データをいろいろしてみた
kou_kita
0
270
rubygem開発で鍛える設計力
joker1007
2
270
How Community Opened Global Doors
hiroramos4
PRO
1
130
Core Audio tapを使ったリアルタイム音声処理のお話
yuta0306
0
160
Featured
See All Featured
Build your cross-platform service in a week with App Engine
jlugia
231
18k
Stop Working from a Prison Cell
hatefulcrawdad
270
20k
GitHub's CSS Performance
jonrohan
1031
460k
Being A Developer After 40
akosma
90
590k
Learning to Love Humans: Emotional Interface Design
aarron
273
40k
Six Lessons from altMBA
skipperchong
28
3.9k
Embracing the Ebb and Flow
colly
86
4.7k
Why You Should Never Use an ORM
jnunemaker
PRO
58
9.4k
Documentation Writing (for coders)
carmenintech
72
4.9k
Measuring & Analyzing Core Web Vitals
bluesmoon
7
500
Gamification - CAS2011
davidbonilla
81
5.3k
A designer walks into a library…
pauljervisheath
207
24k
Transcript
The GHTorrent Dataset and Tool Suite Georgios Gousios Software Engineering
Research Group TU Delft
All data from Github
Ready to be queried
ghtorrent.org
Repositories
Commits
Pull requests
Issues
Users and Organizations
Mirror event stream
<<event>> PushEvent <<api>> /users/:user ensure_user <<api>> /repos/:user/:repo/ ensure_repo <<api>> /repos/:user/:repo/commits
ensure_commits ensure_user <<api>> /:user/:repo/sha ensure_commit ensure_user <<api>> /users/:user/ followers ensure_followers <<api>> /repos/:user/:repo/ commits/:sha/comments ensure_commit_comments <<api>> /users/:user/orgs ensure_orgs <<api>> /orgs/:org/teams ensure_teams Recursive dependency retrieval
Build relational database to query
repositories users organizations issues /users/:user /user/repos /repos/:user/:repo/issues /orgs/:org { 88"type":8"User",
88"public_gists":80, 88"login":8"gousiosg", 88"followers":88, 88"name":8"Georgios8Gousios", 88"public_repos":84, 88"created_at":8..., 88"id":8386172, 88"following":84, } { . . . CoSQL database as cache
Periodic dumps of DBs online
Query relational DB online
$ gem install sqlite3 ghtorrent $ ght-retrieve-repo mojombo jekyll $
(edit config.yaml) Roll your own tools
Research !
Single developer identities
Single developer identities
Single developer identities
Single developer identities
Single developer identities
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Network analysis
Distributed development Text Text TUD-SERG-2013-10 An Exploratory Study of the
Pull- based Software Development model
None
None
None
None
None
ghtorrent.org Octicons font: courtesy Github