Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
The GHTorrent dataset and toolsuite
Search
Georgios Gousios
May 17, 2013
Technology
4
130k
The GHTorrent dataset and toolsuite
MSR2013 data paper presentation
Georgios Gousios
May 17, 2013
Tweet
Share
More Decks by Georgios Gousios
See All by Georgios Gousios
NLP + SE = ❤️
gousiosg
0
290
The troubles of modern dependency management and what to do about them
gousiosg
0
550
Mining Repositories with Apache Spark
gousiosg
0
660
My adventures with open everything
gousiosg
0
310
Structure and Evolution of Package Dependency Networks
gousiosg
0
790
Mining Github for fun and profit
gousiosg
9
63k
GitHub Insights: Understanding Open Source
gousiosg
0
380
Work Practices and Challenges in Pull-Based Development: The Contributor’s Perspective
gousiosg
0
930
Big Data in Software Engineering panel and Privacy: Should we care?
gousiosg
0
300
Other Decks in Technology
See All in Technology
生成AIとM5Stack / M5 Japan Tour 2025 Autumn 東京
you
PRO
0
250
20251007: What happens when multi-agent systems become larger? (CyberAgent, Inc)
ornew
1
190
Modern_Data_Stack最新動向クイズ_買収_AI_激動の2025年_.pdf
sagara
0
240
なぜAWSを活かしきれないのか?技術と組織への処方箋
nrinetcom
PRO
4
690
OCI Network Firewall 概要
oracle4engineer
PRO
2
7.9k
三菱電機・ソニーグループ共同の「Agile Japan企業内サテライト」_2025
sony
0
140
[Keynote] What do you need to know about DevEx in 2025
salaboy
0
160
Where will it converge?
ibknadedeji
0
210
大規模サーバーレスAPIの堅牢性・信頼性設計 〜AWSのベストプラクティスから始まる現実的制約との向き合い方〜
maimyyym
6
4.1k
セキュアな認可付きリモートMCPサーバーをAWSマネージドサービスでつくろう! / Let's build an OAuth protected remote MCP server based on AWS managed services
kaminashi
3
290
Exadata Database Service on Dedicated Infrastructure(ExaDB-D) UI スクリーン・キャプチャ集
oracle4engineer
PRO
3
5.5k
Adapty_東京AI祭ハッカソン2025ピッチスライド
shinoyamada
0
270
Featured
See All Featured
Design and Strategy: How to Deal with People Who Don’t "Get" Design
morganepeng
132
19k
Build The Right Thing And Hit Your Dates
maggiecrowley
37
2.9k
Visualization
eitanlees
149
16k
Embracing the Ebb and Flow
colly
88
4.8k
Build your cross-platform service in a week with App Engine
jlugia
232
18k
Agile that works and the tools we love
rasmusluckow
331
21k
Creating an realtime collaboration tool: Agile Flush - .NET Oxford
marcduiker
32
2.3k
The Cost Of JavaScript in 2023
addyosmani
54
9k
A better future with KSS
kneath
239
18k
Reflections from 52 weeks, 52 projects
jeffersonlam
352
21k
Music & Morning Musume
bryan
46
6.8k
Visualizing Your Data: Incorporating Mongo into Loggly Infrastructure
mongodb
48
9.7k
Transcript
The GHTorrent Dataset and Tool Suite Georgios Gousios Software Engineering
Research Group TU Delft
All data from Github
Ready to be queried
ghtorrent.org
Repositories
Commits
Pull requests
Issues
Users and Organizations
Mirror event stream
<<event>> PushEvent <<api>> /users/:user ensure_user <<api>> /repos/:user/:repo/ ensure_repo <<api>> /repos/:user/:repo/commits
ensure_commits ensure_user <<api>> /:user/:repo/sha ensure_commit ensure_user <<api>> /users/:user/ followers ensure_followers <<api>> /repos/:user/:repo/ commits/:sha/comments ensure_commit_comments <<api>> /users/:user/orgs ensure_orgs <<api>> /orgs/:org/teams ensure_teams Recursive dependency retrieval
Build relational database to query
repositories users organizations issues /users/:user /user/repos /repos/:user/:repo/issues /orgs/:org { 88"type":8"User",
88"public_gists":80, 88"login":8"gousiosg", 88"followers":88, 88"name":8"Georgios8Gousios", 88"public_repos":84, 88"created_at":8..., 88"id":8386172, 88"following":84, } { . . . CoSQL database as cache
Periodic dumps of DBs online
Query relational DB online
$ gem install sqlite3 ghtorrent $ ght-retrieve-repo mojombo jekyll $
(edit config.yaml) Roll your own tools
Research !
Single developer identities
Single developer identities
Single developer identities
Single developer identities
Single developer identities
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Network analysis
Distributed development Text Text TUD-SERG-2013-10 An Exploratory Study of the
Pull- based Software Development model
None
None
None
None
None
ghtorrent.org Octicons font: courtesy Github