Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
The GHTorrent dataset and toolsuite
Search
Georgios Gousios
May 17, 2013
Technology
4
130k
The GHTorrent dataset and toolsuite
MSR2013 data paper presentation
Georgios Gousios
May 17, 2013
Tweet
Share
More Decks by Georgios Gousios
See All by Georgios Gousios
NLP + SE = ❤️
gousiosg
0
260
The troubles of modern dependency management and what to do about them
gousiosg
0
490
Mining Repositories with Apache Spark
gousiosg
0
610
My adventures with open everything
gousiosg
0
260
Structure and Evolution of Package Dependency Networks
gousiosg
0
720
Mining Github for fun and profit
gousiosg
9
63k
GitHub Insights: Understanding Open Source
gousiosg
0
340
Work Practices and Challenges in Pull-Based Development: The Contributor’s Perspective
gousiosg
0
900
Big Data in Software Engineering panel and Privacy: Should we care?
gousiosg
0
250
Other Decks in Technology
See All in Technology
一度 Expo の採用を断念したけど、 再度 Expo の導入を検討している話
ichiki1023
1
170
Oracle Cloud Infrastructure:2025年2月度サービス・アップデート
oracle4engineer
PRO
1
220
RSNA2024振り返り
nanachi
0
590
リーダブルテストコード 〜メンテナンスしやすい テストコードを作成する方法を考える〜 #DevSumi #DevSumiB / Readable test code
nihonbuson
11
7.3k
プロセス改善による品質向上事例
tomasagi
2
2.6k
Culture Deck
optfit
0
430
JEDAI Meetup! Databricks AI/BI概要
databricksjapan
0
150
Helm , Kustomize に代わる !? 次世代 k8s パッケージマネージャー Glasskube 入門 / glasskube-entry
parupappa2929
0
250
データマネジメントのトレードオフに立ち向かう
ikkimiyazaki
6
1k
プロダクトエンジニア構想を立ち上げ、プロダクト志向な組織への成長を続けている話 / grow into a product-oriented organization
hiro_torii
1
220
次世代KYC活動報告 / 20250219-BizDay17-KYC-nextgen
oidfj
0
260
Oracle Base Database Service 技術詳細
oracle4engineer
PRO
6
57k
Featured
See All Featured
A Philosophy of Restraint
colly
203
16k
How GitHub (no longer) Works
holman
314
140k
How to Ace a Technical Interview
jacobian
276
23k
CSS Pre-Processors: Stylus, Less & Sass
bermonpainter
356
29k
VelocityConf: Rendering Performance Case Studies
addyosmani
328
24k
Six Lessons from altMBA
skipperchong
27
3.6k
GraphQLの誤解/rethinking-graphql
sonatard
68
10k
Optimizing for Happiness
mojombo
376
70k
Visualizing Your Data: Incorporating Mongo into Loggly Infrastructure
mongodb
45
9.4k
Statistics for Hackers
jakevdp
797
220k
It's Worth the Effort
3n
184
28k
A better future with KSS
kneath
238
17k
Transcript
The GHTorrent Dataset and Tool Suite Georgios Gousios Software Engineering
Research Group TU Delft
All data from Github
Ready to be queried
ghtorrent.org
Repositories
Commits
Pull requests
Issues
Users and Organizations
Mirror event stream
<<event>> PushEvent <<api>> /users/:user ensure_user <<api>> /repos/:user/:repo/ ensure_repo <<api>> /repos/:user/:repo/commits
ensure_commits ensure_user <<api>> /:user/:repo/sha ensure_commit ensure_user <<api>> /users/:user/ followers ensure_followers <<api>> /repos/:user/:repo/ commits/:sha/comments ensure_commit_comments <<api>> /users/:user/orgs ensure_orgs <<api>> /orgs/:org/teams ensure_teams Recursive dependency retrieval
Build relational database to query
repositories users organizations issues /users/:user /user/repos /repos/:user/:repo/issues /orgs/:org { 88"type":8"User",
88"public_gists":80, 88"login":8"gousiosg", 88"followers":88, 88"name":8"Georgios8Gousios", 88"public_repos":84, 88"created_at":8..., 88"id":8386172, 88"following":84, } { . . . CoSQL database as cache
Periodic dumps of DBs online
Query relational DB online
$ gem install sqlite3 ghtorrent $ ght-retrieve-repo mojombo jekyll $
(edit config.yaml) Roll your own tools
Research !
Single developer identities
Single developer identities
Single developer identities
Single developer identities
Single developer identities
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Network analysis
Distributed development Text Text TUD-SERG-2013-10 An Exploratory Study of the
Pull- based Software Development model
None
None
None
None
None
ghtorrent.org Octicons font: courtesy Github