Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
The GHTorrent dataset and toolsuite
Search
Georgios Gousios
May 17, 2013
Technology
4
130k
The GHTorrent dataset and toolsuite
MSR2013 data paper presentation
Georgios Gousios
May 17, 2013
Tweet
Share
More Decks by Georgios Gousios
See All by Georgios Gousios
NLP + SE = ❤️
gousiosg
0
280
The troubles of modern dependency management and what to do about them
gousiosg
0
530
Mining Repositories with Apache Spark
gousiosg
0
650
My adventures with open everything
gousiosg
0
290
Structure and Evolution of Package Dependency Networks
gousiosg
0
770
Mining Github for fun and profit
gousiosg
9
63k
GitHub Insights: Understanding Open Source
gousiosg
0
370
Work Practices and Challenges in Pull-Based Development: The Contributor’s Perspective
gousiosg
0
920
Big Data in Software Engineering panel and Privacy: Should we care?
gousiosg
0
290
Other Decks in Technology
See All in Technology
みんなのSRE 〜チーム全員でのSRE活動にするための4つの取り組み〜
kakehashi
PRO
2
110
AI エンジニアの立場からみた、AI コーディング時代の開発の品質向上の取り組みと妄想
soh9834
8
620
怖くない!GritQLでBiomeプラグインを作ろうよ
pal4de
1
150
増え続ける脆弱性に立ち向かう: 事前対策と優先度づけによる 持続可能な脆弱性管理 / Confronting the Rise of Vulnerabilities: Sustainable Management Through Proactive Measures and Prioritization
nttcom
1
230
LLMでAI-OCR、実際どうなの? / llm_ai_ocr_layerx_bet_ai_day_lt
sbrf248
0
370
【CEDEC2025】大規模言語モデルを活用したゲーム内会話パートのスクリプト作成支援への取り組み
cygames
PRO
1
540
P2P ではじめる WebRTC のつまづきどころ
tnoho
1
280
製造業の課題解決に向けた機械学習の活用と、製造業特化LLM開発への挑戦
knt44kw
0
110
地域コミュニティへの「感謝」と「恩返し」 / 20250726jawsug-tochigi
kasacchiful
0
110
隙間時間で爆速開発! Claude Code × Vibe Coding で作るマニュアル自動生成サービス
akitomonam
2
240
MCPに潜むセキュリティリスクを考えてみる
milix_m
1
920
興味の胞子を育て 業務と技術に広がる”きのこ力”
fumiyasac0921
0
440
Featured
See All Featured
Fight the Zombie Pattern Library - RWD Summit 2016
marcelosomers
234
17k
Building a Scalable Design System with Sketch
lauravandoore
462
33k
The Art of Delivering Value - GDevCon NA Keynote
reverentgeek
15
1.6k
The Pragmatic Product Professional
lauravandoore
35
6.8k
CSS Pre-Processors: Stylus, Less & Sass
bermonpainter
357
30k
GraphQLとの向き合い方2022年版
quramy
49
14k
Performance Is Good for Brains [We Love Speed 2024]
tammyeverts
10
1k
Building Applications with DynamoDB
mza
95
6.5k
Art, The Web, and Tiny UX
lynnandtonic
301
21k
Cheating the UX When There Is Nothing More to Optimize - PixelPioneers
stephaniewalter
283
13k
The Straight Up "How To Draw Better" Workshop
denniskardys
235
140k
Keith and Marios Guide to Fast Websites
keithpitt
411
22k
Transcript
The GHTorrent Dataset and Tool Suite Georgios Gousios Software Engineering
Research Group TU Delft
All data from Github
Ready to be queried
ghtorrent.org
Repositories
Commits
Pull requests
Issues
Users and Organizations
Mirror event stream
<<event>> PushEvent <<api>> /users/:user ensure_user <<api>> /repos/:user/:repo/ ensure_repo <<api>> /repos/:user/:repo/commits
ensure_commits ensure_user <<api>> /:user/:repo/sha ensure_commit ensure_user <<api>> /users/:user/ followers ensure_followers <<api>> /repos/:user/:repo/ commits/:sha/comments ensure_commit_comments <<api>> /users/:user/orgs ensure_orgs <<api>> /orgs/:org/teams ensure_teams Recursive dependency retrieval
Build relational database to query
repositories users organizations issues /users/:user /user/repos /repos/:user/:repo/issues /orgs/:org { 88"type":8"User",
88"public_gists":80, 88"login":8"gousiosg", 88"followers":88, 88"name":8"Georgios8Gousios", 88"public_repos":84, 88"created_at":8..., 88"id":8386172, 88"following":84, } { . . . CoSQL database as cache
Periodic dumps of DBs online
Query relational DB online
$ gem install sqlite3 ghtorrent $ ght-retrieve-repo mojombo jekyll $
(edit config.yaml) Roll your own tools
Research !
Single developer identities
Single developer identities
Single developer identities
Single developer identities
Single developer identities
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Network analysis
Distributed development Text Text TUD-SERG-2013-10 An Exploratory Study of the
Pull- based Software Development model
None
None
None
None
None
ghtorrent.org Octicons font: courtesy Github