Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
The GHTorrent dataset and toolsuite
Search
Georgios Gousios
May 17, 2013
Technology
4
130k
The GHTorrent dataset and toolsuite
MSR2013 data paper presentation
Georgios Gousios
May 17, 2013
Tweet
Share
More Decks by Georgios Gousios
See All by Georgios Gousios
NLP + SE = ❤️
gousiosg
0
290
The troubles of modern dependency management and what to do about them
gousiosg
0
540
Mining Repositories with Apache Spark
gousiosg
0
650
My adventures with open everything
gousiosg
0
300
Structure and Evolution of Package Dependency Networks
gousiosg
0
780
Mining Github for fun and profit
gousiosg
9
63k
GitHub Insights: Understanding Open Source
gousiosg
0
370
Work Practices and Challenges in Pull-Based Development: The Contributor’s Perspective
gousiosg
0
920
Big Data in Software Engineering panel and Privacy: Should we care?
gousiosg
0
290
Other Decks in Technology
See All in Technology
第64回コンピュータビジョン勉強会@関東(後編)
tsukamotokenji
0
210
認知戦の理解と、市民としての対抗策
hogehuga
0
240
サービスロボット最前線:ugoが挑むPhysical AI活用
kmatsuiugo
0
180
KiroでGameDay開催してみよう(準備編)
yuuuuuuu168
1
110
AI時代の大規模データ活用とセキュリティ戦略
ken5scal
1
280
[CV勉強会@関東 CVPR2025 読み会] MegaSaM: Accurate, Fast, and Robust Structure and Motion from Casual Dynamic Videos (Li+, CVPR2025)
abemii
0
180
Exadata Database Service on Dedicated Infrastructure セキュリティ、ネットワーク、および管理について
oracle4engineer
PRO
1
350
広島発!スタートアップ開発の裏側
tsankyo
0
190
Claude Code x Androidアプリ 開発
kgmyshin
1
510
第4回 関東Kaggler会 [Training LLMs with Limited VRAM]
tascj
11
1.5k
[OCI Skill Mapping] AWSユーザーのためのOCI(2025年8月20日開催)
oracle4engineer
PRO
2
110
広島銀行におけるAWS活用の取り組みについて
masakimori
0
110
Featured
See All Featured
The Art of Programming - Codeland 2020
erikaheidi
55
13k
Helping Users Find Their Own Way: Creating Modern Search Experiences
danielanewman
29
2.8k
Testing 201, or: Great Expectations
jmmastey
45
7.6k
Building an army of robots
kneath
306
45k
Designing for Performance
lara
610
69k
Improving Core Web Vitals using Speculation Rules API
sergeychernyshev
18
1.1k
Being A Developer After 40
akosma
90
590k
Fantastic passwords and where to find them - at NoRuKo
philnash
51
3.4k
Design and Strategy: How to Deal with People Who Don’t "Get" Design
morganepeng
131
19k
How GitHub (no longer) Works
holman
315
140k
ReactJS: Keep Simple. Everything can be a component!
pedronauck
667
120k
Gamification - CAS2011
davidbonilla
81
5.4k
Transcript
The GHTorrent Dataset and Tool Suite Georgios Gousios Software Engineering
Research Group TU Delft
All data from Github
Ready to be queried
ghtorrent.org
Repositories
Commits
Pull requests
Issues
Users and Organizations
Mirror event stream
<<event>> PushEvent <<api>> /users/:user ensure_user <<api>> /repos/:user/:repo/ ensure_repo <<api>> /repos/:user/:repo/commits
ensure_commits ensure_user <<api>> /:user/:repo/sha ensure_commit ensure_user <<api>> /users/:user/ followers ensure_followers <<api>> /repos/:user/:repo/ commits/:sha/comments ensure_commit_comments <<api>> /users/:user/orgs ensure_orgs <<api>> /orgs/:org/teams ensure_teams Recursive dependency retrieval
Build relational database to query
repositories users organizations issues /users/:user /user/repos /repos/:user/:repo/issues /orgs/:org { 88"type":8"User",
88"public_gists":80, 88"login":8"gousiosg", 88"followers":88, 88"name":8"Georgios8Gousios", 88"public_repos":84, 88"created_at":8..., 88"id":8386172, 88"following":84, } { . . . CoSQL database as cache
Periodic dumps of DBs online
Query relational DB online
$ gem install sqlite3 ghtorrent $ ght-retrieve-repo mojombo jekyll $
(edit config.yaml) Roll your own tools
Research !
Single developer identities
Single developer identities
Single developer identities
Single developer identities
Single developer identities
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Network analysis
Distributed development Text Text TUD-SERG-2013-10 An Exploratory Study of the
Pull- based Software Development model
None
None
None
None
None
ghtorrent.org Octicons font: courtesy Github