The GHTorrent dataset and toolsuite
Georgios Gousios
May 17, 2013
MSR2013 data paper presentation
Transcript
The GHTorrent Dataset and Tool Suite
Georgios Gousios
Software Engineering Research Group, TU Delft
All data from Github
Ready to be queried
ghtorrent.org
Repositories
Commits
Pull requests
Issues
Users and Organizations
Mirror event stream
Recursive dependency retrieval
<<event>> PushEvent
<<api>> /users/:user  ensure_user
<<api>> /repos/:user/:repo/  ensure_repo
<<api>> /repos/:user/:repo/commits  ensure_commits
ensure_user
<<api>> /:user/:repo/sha  ensure_commit
ensure_user
<<api>> /users/:user/followers  ensure_followers
<<api>> /repos/:user/:repo/commits/:sha/comments  ensure_commit_comments
<<api>> /users/:user/orgs  ensure_orgs
<<api>> /orgs/:org/teams  ensure_teams
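The recursive dependency retrieval on this slide can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not GHTorrent's own code: the `ensure_*` names come from the slide, while `FAKE_API`, `fetch`, `store`, the URL paths used as keys, and the simplified event shape are hypothetical stand-ins for the GitHub API and the database.

```python
# Hypothetical stand-in for GitHub API responses, keyed by URL path.
FAKE_API = {
    "/users/alice": {"login": "alice"},
    "/repos/alice/demo": {"name": "demo", "owner": "alice"},
    "/repos/alice/demo/commits/abc123": {"sha": "abc123", "author": "alice"},
}

store = {}      # stands in for the database: entity key -> stored record
api_calls = []  # records each simulated API request


def fetch(path):
    """Simulate one API request (assumption: real code would use HTTP)."""
    api_calls.append(path)
    return FAKE_API[path]


def ensure_user(login):
    """Return the stored user, retrieving it only if it is missing."""
    key = ("user", login)
    if key not in store:
        store[key] = fetch(f"/users/{login}")
    return store[key]


def ensure_repo(owner, name):
    """A repository depends on its owner, so ensure the owner first."""
    key = ("repo", owner, name)
    if key not in store:
        ensure_user(owner)
        store[key] = fetch(f"/repos/{owner}/{name}")
    return store[key]


def ensure_commit(owner, name, sha):
    """A commit depends on its repository and on its author."""
    key = ("commit", owner, name, sha)
    if key not in store:
        ensure_repo(owner, name)
        commit = fetch(f"/repos/{owner}/{name}/commits/{sha}")
        ensure_user(commit["author"])
        store[key] = commit
    return store[key]


def handle_push_event(event):
    """Process a simplified PushEvent by ensuring every commit it lists."""
    for sha in event["shas"]:
        ensure_commit(event["owner"], event["repo"], sha)


event = {"owner": "alice", "repo": "demo", "shas": ["abc123"]}
handle_push_event(event)
handle_push_event(event)  # replaying the event triggers no new requests
```

Because every `ensure_*` function checks the store before fetching, replaying an event or reaching the same user through several dependency paths does not repeat API requests.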
Build relational database to query
CoSQL database as cache
repositories
users
organizations
issues
/users/:user
/user/repos
/repos/:user/:repo/issues
/orgs/:org
{
  "type": "User",
  "public_gists": 0,
  "login": "gousiosg",
  "followers": 8,
  "name": "Georgios Gousios",
  "public_repos": 4,
  "created_at": ...,
  "id": 386172,
  "following": 4,
}
{ . . .
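One way to picture the cache described on this slide is to keep each raw JSON response keyed by its URL and derive relational rows from the cached document. The sketch below is only an illustration in Python: the dictionary `raw_cache` stands in for the document store, `fake_fetch` stands in for an API request, and the `users` table layout is an assumption rather than GHTorrent's schema. The sample fields are taken from the user document on the slide.

```python
import json
import sqlite3

raw_cache = {}      # URL path -> raw JSON text, standing in for the document store
requests_made = []  # records each simulated API request


def fake_fetch(path):
    """Hypothetical API call returning part of the user document on the slide."""
    requests_made.append(path)
    return json.dumps({
        "type": "User",
        "login": "gousiosg",
        "id": 386172,
        "name": "Georgios Gousios",
    })


def cached_get(path):
    """Serve from the cache when possible; otherwise fetch and remember."""
    if path not in raw_cache:
        raw_cache[path] = fake_fetch(path)
    return json.loads(raw_cache[path])


# Hypothetical relational table derived from the cached documents.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, login TEXT, name TEXT)")


def store_user(path):
    """Copy the fields of interest from the cached document into a row."""
    doc = cached_get(path)
    db.execute(
        "INSERT OR REPLACE INTO users VALUES (?, ?, ?)",
        (doc["id"], doc["login"], doc["name"]),
    )


store_user("/users/gousiosg")
store_user("/users/gousiosg")  # the second call is answered from the cache
rows = db.execute("SELECT id, login, name FROM users").fetchall()
```

Keeping the unmodified response means the relational rows can be rebuilt later, for example after a schema change, without querying the API again.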
Periodic dumps of DBs online
Query relational DB online
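As a rough idea of what querying such a relational database can look like, the sketch below counts commits per author for one project. It uses an in-memory SQLite database in Python; the table names, columns, and sample rows are hypothetical and chosen for illustration, so they should not be read as the actual GHTorrent schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Hypothetical, heavily simplified schema with a few sample rows.
db.executescript("""
CREATE TABLE users    (id INTEGER PRIMARY KEY, login TEXT);
CREATE TABLE projects (id INTEGER PRIMARY KEY, owner_id INTEGER, name TEXT);
CREATE TABLE commits  (id INTEGER PRIMARY KEY, project_id INTEGER,
                       author_id INTEGER, sha TEXT);
INSERT INTO users    VALUES (1, 'alice'), (2, 'bob');
INSERT INTO projects VALUES (10, 1, 'demo');
INSERT INTO commits  VALUES (100, 10, 1, 'aaa'),
                            (101, 10, 2, 'bbb'),
                            (102, 10, 2, 'ccc');
""")

# Commits per author for a single project, most active author first.
query = """
SELECT u.login, COUNT(*) AS n_commits
FROM commits c
JOIN users u    ON u.id = c.author_id
JOIN projects p ON p.id = c.project_id
WHERE p.name = ?
GROUP BY u.login
ORDER BY n_commits DESC, u.login
"""
result = db.execute(query, ("demo",)).fetchall()
```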
Roll your own tools
$ gem install sqlite3 ghtorrent
$ ght-retrieve-repo mojombo jekyll
$ (edit config.yaml)
Research!
Single developer identities
Source tracking
Network analysis
Distributed development
TUD-SERG-2013-10: An Exploratory Study of the Pull-based Software Development Model
ghtorrent.org
Octicons font: courtesy Github