Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
The GHTorrent dataset and toolsuite
Search
Georgios Gousios
May 17, 2013
Technology
4
130k
The GHTorrent dataset and toolsuite
MSR2013 data paper presentation
Georgios Gousios
May 17, 2013
Tweet
Share
More Decks by Georgios Gousios
See All by Georgios Gousios
NLP + SE = ❤️
gousiosg
0
330
The troubles of modern dependency management and what to do about them
gousiosg
0
650
Mining Repositories with Apache Spark
gousiosg
0
700
My adventures with open everything
gousiosg
0
350
Structure and Evolution of Package Dependency Networks
gousiosg
0
870
Mining Github for fun and profit
gousiosg
9
63k
GitHub Insights: Understanding Open Source
gousiosg
0
420
Work Practices and Challenges in Pull-Based Development: The Contributor’s Perspective
gousiosg
0
970
Big Data in Software Engineering panel and Privacy: Should we care?
gousiosg
0
340
Other Decks in Technology
See All in Technology
Scrumは歪む — 組織設計の原理原則
dashi
0
130
タスク管理も1on1も、もう「管理」じゃない ― KiroとBedrock AgentCoreで変わった"判断の仕事"
yusukeshimizu
5
2.6k
Datadog の RBAC のすべて
nulabinc
PRO
3
450
[JAWSDAYS2026]Who is responsible for IAM
mizukibbb
0
440
8万デプロイ
iwamot
PRO
2
230
2026-03-11 JAWS-UG 茨城 #12 改めてALBを便利に使う
masasuzu
2
360
[JAWS DAYS 2026]私の AWS DevOps Agent 推しポイント
furuton
0
140
EMからVPoEを経てCTOへ:マネジメントキャリアパスにおける葛藤と成長
kakehashi
PRO
9
1.7k
元エンジニアPdM、IDEが恋しすぎてCursorに全業務を集約したら、スライド作成まで爆速になった話
doiko123
1
590
Claude Code 2026年 最新アップデート
oikon48
10
8.2k
Yahoo!ショッピングのレコメンデーション・システムにおけるML実践の一例
lycorptech_jp
PRO
1
190
ナレッジワークのご紹介(第88回情報処理学会 )
kworkdev
PRO
0
180
Featured
See All Featured
How Software Deployment tools have changed in the past 20 years
geshan
0
32k
Primal Persuasion: How to Engage the Brain for Learning That Lasts
tmiket
0
290
Collaborative Software Design: How to facilitate domain modelling decisions
baasie
0
160
AI in Enterprises - Java and Open Source to the Rescue
ivargrimstad
0
1.2k
Game over? The fight for quality and originality in the time of robots
wayneb77
1
130
Why You Should Never Use an ORM
jnunemaker
PRO
61
9.8k
How to Ace a Technical Interview
jacobian
281
24k
Automating Front-end Workflow
addyosmani
1370
200k
Why Mistakes Are the Best Teachers: Turning Failure into a Pathway for Growth
auna
0
78
AI Search: Implications for SEO and How to Move Forward - #ShenzhenSEOConference
aleyda
1
1.1k
The Psychology of Web Performance [Beyond Tellerrand 2023]
tammyeverts
49
3.3k
AI: The stuff that nobody shows you
jnunemaker
PRO
3
370
Transcript
The GHTorrent Dataset and Tool Suite Georgios Gousios Software Engineering
Research Group TU Delft
All data from Github
Ready to be queried
ghtorrent.org
Repositories
Commits
Pull requests
Issues
Users and Organizations
Mirror event stream
<<event>> PushEvent <<api>> /users/:user ensure_user <<api>> /repos/:user/:repo/ ensure_repo <<api>> /repos/:user/:repo/commits
ensure_commits ensure_user <<api>> /:user/:repo/sha ensure_commit ensure_user <<api>> /users/:user/ followers ensure_followers <<api>> /repos/:user/:repo/ commits/:sha/comments ensure_commit_comments <<api>> /users/:user/orgs ensure_orgs <<api>> /orgs/:org/teams ensure_teams Recursive dependency retrieval
Build relational database to query
repositories users organizations issues /users/:user /user/repos /repos/:user/:repo/issues /orgs/:org { 88"type":8"User",
88"public_gists":80, 88"login":8"gousiosg", 88"followers":88, 88"name":8"Georgios8Gousios", 88"public_repos":84, 88"created_at":8..., 88"id":8386172, 88"following":84, } { . . . CoSQL database as cache
Periodic dumps of DBs online
Query relational DB online
$ gem install sqlite3 ghtorrent $ ght-retrieve-repo mojombo jekyll $
(edit config.yaml) Roll your own tools
Research !
Single developer identities
Single developer identities
Single developer identities
Single developer identities
Single developer identities
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Network analysis
Distributed development Text Text TUD-SERG-2013-10 An Exploratory Study of the
Pull- based Software Development model
None
None
None
None
None
ghtorrent.org Octicons font: courtesy Github