The GHTorrent dataset and toolsuite
Georgios Gousios
May 17, 2013
MSR2013 data paper presentation
Transcript
The GHTorrent Dataset and Tool Suite
Georgios Gousios
Software Engineering Research Group, TU Delft
All data from GitHub
Ready to be queried
ghtorrent.org
Repositories
Commits
Pull requests
Issues
Users and Organizations
Mirror event stream
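A minimal sketch of what mirroring an event stream can involve, assuming events arrive in overlapping batches from repeated polling and carry a unique `id` field. The function name `mirror_events` and the de-duplication strategy are illustrative assumptions, not taken from the GHTorrent code.

```python
from typing import Callable, Iterable


def mirror_events(events: Iterable[dict], seen_ids: set,
                  handle: Callable[[dict], None]) -> int:
    """Pass each not-yet-seen event to `handle`; return how many were new."""
    new = 0
    for event in events:
        event_id = event["id"]
        if event_id in seen_ids:
            continue  # already mirrored during an earlier poll
        seen_ids.add(event_id)
        handle(event)
        new += 1
    return new


# Two polls whose results overlap on event "2" (made-up sample data).
seen = set()
mirrored = []
first_poll = [{"id": "1", "type": "PushEvent"},
              {"id": "2", "type": "IssuesEvent"}]
second_poll = [{"id": "2", "type": "IssuesEvent"},
               {"id": "3", "type": "PushEvent"}]

mirror_events(first_poll, seen, mirrored.append)
mirror_events(second_poll, seen, mirrored.append)
print([e["id"] for e in mirrored])  # ['1', '2', '3']
```

Keeping the set of seen ids outside the function lets the same state survive across polls, so an event that shows up twice is handled only once.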
Recursive dependency retrieval
<<event>> PushEvent
<<api>> /users/:user  ensure_user
<<api>> /repos/:user/:repo/  ensure_repo
<<api>> /repos/:user/:repo/commits  ensure_commits, ensure_user
<<api>> /:user/:repo/sha  ensure_commit, ensure_user
<<api>> /users/:user/followers  ensure_followers
<<api>> /repos/:user/:repo/commits/:sha/comments  ensure_commit_comments
<<api>> /users/:user/orgs  ensure_orgs
<<api>> /orgs/:org/teams  ensure_teams
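The recursive retrieval idea can be sketched as follows: each `ensure_*` step first ensures the entities it depends on, then fetches its own entity only if it is not already stored. The `ensure_*` names come from the slide. Everything else here is a hypothetical stand-in: the in-memory `FAKE_API` and `db`, the `api_get` helper, and the simplified event shape (real PushEvent payloads are structured differently).

```python
# Hypothetical in-memory stand-ins for the remote API and the database.
FAKE_API = {
    "/users/alice": {"login": "alice"},
    "/repos/alice/demo": {"name": "demo", "owner": "alice"},
    "/repos/alice/demo/commits": [{"sha": "abc123", "author": "alice"}],
}
db = {"users": {}, "repos": {}, "commits": {}}
api_calls = []  # records which paths were actually requested


def api_get(path):
    api_calls.append(path)
    return FAKE_API[path]


def ensure_user(login):
    if login not in db["users"]:
        db["users"][login] = api_get(f"/users/{login}")
    return db["users"][login]


def ensure_repo(owner, name):
    key = (owner, name)
    if key not in db["repos"]:
        ensure_user(owner)  # a repo depends on its owner
        db["repos"][key] = api_get(f"/repos/{owner}/{name}")
    return db["repos"][key]


def ensure_commits(owner, name):
    ensure_repo(owner, name)  # commits depend on their repo
    for commit in api_get(f"/repos/{owner}/{name}/commits"):
        ensure_user(commit["author"])  # and on their author
        db["commits"].setdefault(commit["sha"], commit)


def on_push_event(event):
    ensure_commits(event["owner"], event["repo"])


on_push_event({"type": "PushEvent", "owner": "alice", "repo": "demo"})
print(api_calls)
# ['/users/alice', '/repos/alice/demo', '/repos/alice/demo/commits']
```

The second `ensure_user("alice")` call, made for the commit author, triggers no request because the user is already stored. This is the main point of the pattern: dependencies are fetched at most once.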
Build relational database to query
CoSQL database as cache
repositories, users, organizations, issues
/users/:user
/user/repos
/repos/:user/:repo/issues
/orgs/:org
{
  "type": "User",
  "public_gists": 0,
  "login": "gousiosg",
  "followers": 8,
  "name": "Georgios Gousios",
  "public_repos": 4,
  "created_at": ...,
  "id": 386172,
  "following": 4,
}
{ . . .
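One way to picture the two storage layers is a raw JSON cache keyed by API path, with selected fields copied into a relational table for querying. This is a sketch under stated assumptions: a plain dict stands in for the document store, SQLite stands in for the relational database, and the `users` table layout and the helper names `cache_response` and `store_user` are hypothetical rather than GHTorrent's actual schema or API. The field values are the ones shown in the JSON document on the slide.

```python
import json
import sqlite3

raw_cache = {}  # stand-in for the document store; key is the API path


def cache_response(path, body):
    """Keep the raw API response so it never has to be fetched again."""
    raw_cache[path] = json.loads(body)
    return raw_cache[path]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users ("
    "id INTEGER PRIMARY KEY, login TEXT UNIQUE, name TEXT, type TEXT)"
)


def store_user(doc):
    """Copy the queryable fields of a cached user document into a row."""
    conn.execute(
        "INSERT OR IGNORE INTO users (id, login, name, type) "
        "VALUES (?, ?, ?, ?)",
        (doc["id"], doc["login"], doc["name"], doc["type"]),
    )


body = ('{"type": "User", "login": "gousiosg", '
        '"name": "Georgios Gousios", "id": 386172}')
doc = cache_response("/users/gousiosg", body)
store_user(doc)
store_user(doc)  # repeating the insert does not create a duplicate row

rows = conn.execute("SELECT login, name FROM users").fetchall()
print(rows)  # [('gousiosg', 'Georgios Gousios')]
```

Because the raw document stays in the cache, the relational table can be extended or rebuilt later without querying the remote API again.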
Periodic dumps of DBs online
Query relational DB online
Roll your own tools
$ gem install sqlite3 ghtorrent
$ ght-retrieve-repo mojombo jekyll
$ (edit config.yaml)
Research!
Single developer identities
Source tracking
Network analysis
Distributed development
TUD-SERG-2013-10: An Exploratory Study of the Pull-based Software Development Model
ghtorrent.org
Octicons font: courtesy of GitHub