Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
The GHTorrent dataset and toolsuite
Search
Sponsored
·
SiteGround - Reliable hosting with speed, security, and support you can count on.
→
Georgios Gousios
May 17, 2013
Technology
130k
4
Share
The GHTorrent dataset and toolsuite
MSR2013 data paper presentation
Georgios Gousios
May 17, 2013
More Decks by Georgios Gousios
See All by Georgios Gousios
NLP + SE = ❤️
gousiosg
0
340
The troubles of modern dependency management and what to do about them
gousiosg
0
660
Mining Repositories with Apache Spark
gousiosg
0
710
My adventures with open everything
gousiosg
0
350
Structure and Evolution of Package Dependency Networks
gousiosg
0
890
Mining Github for fun and profit
gousiosg
9
63k
GitHub Insights: Understanding Open Source
gousiosg
0
430
Work Practices and Challenges in Pull-Based Development: The Contributor’s Perspective
gousiosg
0
970
Big Data in Software Engineering panel and Privacy: Should we care?
gousiosg
0
340
Other Decks in Technology
See All in Technology
AIがコードを書く時代の ジェネレーティブプログラミング
polidog
PRO
3
550
機能・非機能の学びを一つに!Agent Skillsで月間レポート作成始めてみた / Unifying Bug & Infra Insights — Building Monthly Quality Reports with Agent Skills
bun913
5
3.5k
組織的なAI活用を阻む 最大のハードルは コンテキストデザインだった
ixbox
1
910
ASTのGitHub CopilotとCopilot CLIの現在地をお話しします/How AST Operates GitHub Copilot and Copilot CLI
aeonpeople
1
190
本番環境でPHPコードに触れずに「使われていないコード」を調べるにはどうしたらよいか?
egmc
1
200
推し活エージェント
yuntan_t
1
860
会社紹介資料 / Sansan Company Profile
sansan33
PRO
16
410k
自己組織化を試される緑茶ハイを求めて、今日も全力であそんで学ぼう / Self-Organization and Shochu Green Tea
naitosatoshi
0
150
チームで育てるAI自走環境_20260409
fuktig
0
880
ログ基盤・プラグイン・ダッシュボード、全部整えた。でも最後は人だった。
makikub
4
780
【関西電力KOI×VOLTMIND 生成AIハッカソン】空間AIブレイン ~⼤阪おばちゃんフィジカルAIに続く道~
tanakaseiya
0
170
プロダクトを触って語って理解する、チーム横断バグバッシュのすすめ / 20260411 Naoki Takahashi
shift_evolve
PRO
1
180
Featured
See All Featured
Heart Work Chapter 1 - Part 1
lfama
PRO
5
35k
Highjacked: Video Game Concept Design
rkendrick25
PRO
1
340
Getting science done with accelerated Python computing platforms
jacobtomlinson
2
160
Exploring the relationship between traditional SERPs and Gen AI search
raygrieselhuber
PRO
2
3.8k
How to audit for AI Accessibility on your Front & Back End
davetheseo
0
230
The AI Revolution Will Not Be Monopolized: How open-source beats economies of scale, even for LLMs
inesmontani
PRO
3
3.3k
Un-Boring Meetings
codingconduct
0
250
Unsuck your backbone
ammeep
672
58k
Technical Leadership for Architectural Decision Making
baasie
3
310
How to Grow Your eCommerce with AI & Automation
katarinadahlin
PRO
1
160
The innovator’s Mindset - Leading Through an Era of Exponential Change - McGill University 2025
jdejongh
PRO
1
150
Build The Right Thing And Hit Your Dates
maggiecrowley
39
3.1k
Transcript
The GHTorrent Dataset and Tool Suite Georgios Gousios Software Engineering
Research Group TU Delft
All data from Github
Ready to be queried
ghtorrent.org
Repositories
Commits
Pull requests
Issues
Users and Organizations
Mirror event stream
<<event>> PushEvent <<api>> /users/:user ensure_user <<api>> /repos/:user/:repo/ ensure_repo <<api>> /repos/:user/:repo/commits
ensure_commits ensure_user <<api>> /:user/:repo/sha ensure_commit ensure_user <<api>> /users/:user/ followers ensure_followers <<api>> /repos/:user/:repo/ commits/:sha/comments ensure_commit_comments <<api>> /users/:user/orgs ensure_orgs <<api>> /orgs/:org/teams ensure_teams Recursive dependency retrieval
Build relational database to query
repositories users organizations issues /users/:user /user/repos /repos/:user/:repo/issues /orgs/:org { 88"type":8"User",
88"public_gists":80, 88"login":8"gousiosg", 88"followers":88, 88"name":8"Georgios8Gousios", 88"public_repos":84, 88"created_at":8..., 88"id":8386172, 88"following":84, } { . . . CoSQL database as cache
Periodic dumps of DBs online
Query relational DB online
$ gem install sqlite3 ghtorrent $ ght-retrieve-repo mojombo jekyll $
(edit config.yaml) Roll your own tools
Research !
Single developer identities
Single developer identities
Single developer identities
Single developer identities
Single developer identities
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Network analysis
Distributed development Text Text TUD-SERG-2013-10 An Exploratory Study of the
Pull- based Software Development model
None
None
None
None
None
ghtorrent.org Octicons font: courtesy Github