The GHTorrent dataset and toolsuite
Georgios Gousios
May 17, 2013
MSR2013 data paper presentation
Transcript
The GHTorrent Dataset and Tool Suite
Georgios Gousios
Software Engineering Research Group, TU Delft
All data from GitHub
Ready to be queried
ghtorrent.org
Repositories
Commits
Pull requests
Issues
Users and Organizations
Mirror event stream
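Mirroring an event stream usually means polling it repeatedly and keeping only the events not seen before, since successive polls tend to overlap. The following is a minimal Python sketch of that idea, not GHTorrent's actual code: the `mirror_events` function and the sample polls are hypothetical, and the only assumption about the data is that each event carries a unique `id`.

```python
from collections import deque


def mirror_events(pages, seen=None):
    """Return the events from successive polls that were not seen before."""
    seen = set() if seen is None else seen
    queue = deque()
    for page in pages:
        for event in page:
            if event["id"] in seen:
                continue  # overlap with an earlier poll, skip it
            seen.add(event["id"])
            queue.append(event)
    return list(queue)


# Two hypothetical polls that overlap on event "2".
polls = [
    [{"id": "1", "type": "PushEvent"}, {"id": "2", "type": "IssuesEvent"}],
    [{"id": "2", "type": "IssuesEvent"}, {"id": "3", "type": "PullRequestEvent"}],
]
mirrored = mirror_events(polls)
```

Each event in `mirrored` can then be handed to a type-specific handler, such as one for `PushEvent`.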
<<event>> PushEvent
<<api>> /users/:user : ensure_user
<<api>> /repos/:user/:repo/ : ensure_repo
<<api>> /repos/:user/:repo/commits : ensure_commits
<<api>> /:user/:repo/sha : ensure_commit
<<api>> /users/:user/followers : ensure_followers
<<api>> /repos/:user/:repo/commits/:sha/comments : ensure_commit_comments
<<api>> /users/:user/orgs : ensure_orgs
<<api>> /orgs/:org/teams : ensure_teams
Recursive dependency retrieval
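The recursive dependency retrieval on this slide can be sketched as a set of `ensure_*` functions that each fetch their prerequisites before fetching themselves, and skip anything already stored. This is a rough Python illustration under stated assumptions, not the GHTorrent implementation: `FAKE_API`, the `Mirror` class, and the simplified event fields (`user`, `repo`, `shas`) are all hypothetical stand-ins, and no network calls are made.

```python
# Hypothetical stand-in for the GitHub API: URL path -> JSON-like response.
FAKE_API = {
    "/users/gousiosg": {"login": "gousiosg"},
    "/repos/gousiosg/demo": {"name": "demo", "owner": "gousiosg"},
    "/repos/gousiosg/demo/commits/abc123": {"sha": "abc123", "author": "gousiosg"},
}


class Mirror:
    def __init__(self, api):
        self.api = api
        self.stored = {}    # entities already retrieved, keyed by API path
        self.requests = []  # log of API calls actually made

    def _ensure(self, path):
        # Fetch only if the entity is not stored yet.
        if path not in self.stored:
            self.requests.append(path)
            self.stored[path] = self.api[path]
        return self.stored[path]

    def ensure_user(self, user):
        return self._ensure(f"/users/{user}")

    def ensure_repo(self, user, repo):
        self.ensure_user(user)  # a repo depends on its owner
        return self._ensure(f"/repos/{user}/{repo}")

    def ensure_commit(self, user, repo, sha):
        self.ensure_repo(user, repo)  # a commit depends on its repo
        commit = self._ensure(f"/repos/{user}/{repo}/commits/{sha}")
        self.ensure_user(commit["author"])  # and on its author
        return commit

    def handle_push_event(self, event):
        for sha in event["shas"]:
            self.ensure_commit(event["user"], event["repo"], sha)


mirror = Mirror(FAKE_API)
push = {"user": "gousiosg", "repo": "demo", "shas": ["abc123"]}
mirror.handle_push_event(push)
mirror.handle_push_event(push)  # second pass triggers no new requests
```

Because every `ensure_*` call checks the store first, replaying the same event is harmless, which is useful when the event stream delivers duplicates.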
Build relational database to query
repositories, users, organizations, issues
/users/:user
/user/repos
/repos/:user/:repo/issues
/orgs/:org
{
  "type": "User",
  "public_gists": 0,
  "login": "gousiosg",
  "followers": 8,
  "name": "Georgios Gousios",
  "public_repos": 4,
  "created_at": ...,
  "id": 386172,
  "following": 4
}
{ . . .
CoSQL database as cache
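One way to read this slide is that raw API responses are kept verbatim in a document store, and selected fields are then projected into relational tables. The sketch below illustrates that split in Python, with a plain dict standing in for the document store and an in-memory SQLite database standing in for the relational side. The table layout and the helper names `cache_response` and `project_user` are assumptions made for illustration, not GHTorrent's schema or API.

```python
import json
import sqlite3

cache = {}  # stand-in for the document store: API path -> raw JSON text


def cache_response(path, body):
    """Keep the raw response so it never has to be fetched again."""
    cache[path] = body


def project_user(conn, path):
    """Copy selected fields of a cached user document into the users table."""
    doc = json.loads(cache[path])
    conn.execute(
        "INSERT OR IGNORE INTO users (login, name, type) VALUES (?, ?, ?)",
        (doc["login"], doc.get("name"), doc["type"]),
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (login TEXT PRIMARY KEY, name TEXT, type TEXT)")

cache_response(
    "/users/gousiosg",
    '{"type": "User", "login": "gousiosg", "name": "Georgios Gousios"}',
)
project_user(conn, "/users/gousiosg")
project_user(conn, "/users/gousiosg")  # idempotent: no duplicate row
rows = conn.execute("SELECT login, name, type FROM users").fetchall()
```

Keeping the raw documents means the relational tables can be rebuilt with a different set of fields without contacting the API again.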
Periodic dumps of DBs online
Query relational DB online
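As an illustration of the kind of question a relational copy makes easy to ask, the following Python sketch counts commits per project with a single SQL query. The two-table schema and the sample rows are a toy example invented here, not the actual GHTorrent schema or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy schema and rows, for illustration only.
conn.executescript("""
CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE commits (
    sha TEXT PRIMARY KEY,
    project_id INTEGER REFERENCES projects(id)
);
INSERT INTO projects VALUES (1, 'jekyll'), (2, 'demo');
INSERT INTO commits VALUES ('a1', 1), ('a2', 1), ('b1', 2);
""")

# Commits per project, most active project first.
query = """
SELECT p.name, COUNT(c.sha) AS n_commits
FROM projects p
JOIN commits c ON c.project_id = p.id
GROUP BY p.name
ORDER BY n_commits DESC, p.name
"""
commit_counts = conn.execute(query).fetchall()
```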
$ gem install sqlite3 ghtorrent
$ ght-retrieve-repo mojombo jekyll
$ (edit config.yaml)
Roll your own tools
Research!
Single developer identities
Source tracking
Network analysis
Distributed development
TUD-SERG-2013-10: An Exploratory Study of the Pull-based Software Development Model
ghtorrent.org
Octicons font: courtesy GitHub