Georgios Gousios
May 17, 2013
The GHTorrent dataset and toolsuite
MSR2013 data paper presentation
Transcript
The GHTorrent Dataset and Tool Suite
Georgios Gousios
Software Engineering Research Group, TU Delft
All data from GitHub
Ready to be queried
ghtorrent.org
Repositories
Commits
Pull requests
Issues
Users and Organizations
Mirror event stream
<<event>> PushEvent
<<api>> /users/:user  ensure_user
<<api>> /repos/:user/:repo/  ensure_repo
<<api>> /repos/:user/:repo/commits  ensure_commits  ensure_user
<<api>> /:user/:repo/sha  ensure_commit  ensure_user
<<api>> /users/:user/followers  ensure_followers
<<api>> /repos/:user/:repo/commits/:sha/comments  ensure_commit_comments
<<api>> /users/:user/orgs  ensure_orgs
<<api>> /orgs/:org/teams  ensure_teams
Recursive dependency retrieval
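The recursive retrieval idea on this slide can be illustrated with a minimal Python sketch. It is not GHTorrent's actual code (the tool itself is distributed as a Ruby gem); the `ensure` function, the `FAKE_API` table, and the dependency edges in it are assumptions made for illustration, and the sketch assumes the dependencies contain no cycles.

```python
from typing import Callable, Dict, List, Tuple

Entity = Tuple[str, str]  # (kind, key), e.g. ("user", "mojombo")


def ensure(entity: Entity,
           fetch: Callable[[Entity], List[Entity]],
           stored: List[Entity]) -> None:
    """Store `entity` only after everything it depends on has been stored."""
    if entity in stored:
        return  # already retrieved, nothing to do
    for dependency in fetch(entity):
        ensure(dependency, fetch, stored)
    stored.append(entity)


# Stand-in for GitHub API calls (hypothetical data): maps an entity to the
# entities it references. A real implementation would issue HTTP requests.
FAKE_API: Dict[Entity, List[Entity]] = {
    ("commit", "abc123"): [("repo", "mojombo/jekyll"), ("user", "mojombo")],
    ("repo", "mojombo/jekyll"): [("user", "mojombo")],
    ("user", "mojombo"): [],
}


def fake_fetch(entity: Entity) -> List[Entity]:
    return FAKE_API[entity]


stored: List[Entity] = []
ensure(("commit", "abc123"), fake_fetch, stored)
print(stored)  # dependencies appear before the entities that need them
```

Storing dependencies first means that a commit row is never written before the repository and user rows it points to.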
Build relational database to query
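As a rough illustration of what querying such a relational database might look like, the sketch below builds a toy in-memory SQLite database and counts commits per author for one project. The table and column names are hypothetical and do not reproduce the real GHTorrent schema; the rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, heavily simplified schema with invented sample rows.
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, login TEXT);
CREATE TABLE projects (
    id INTEGER PRIMARY KEY,
    owner_id INTEGER REFERENCES users(id),
    name TEXT
);
CREATE TABLE commits (
    id INTEGER PRIMARY KEY,
    project_id INTEGER REFERENCES projects(id),
    author_id INTEGER REFERENCES users(id),
    sha TEXT
);
INSERT INTO users VALUES (1, 'mojombo'), (2, 'gousiosg');
INSERT INTO projects VALUES (1, 1, 'jekyll');
INSERT INTO commits VALUES (1, 1, 1, 'aaa'), (2, 1, 2, 'bbb'), (3, 1, 1, 'ccc');
""")

# Count commits per author for a single project, most active author first.
rows = conn.execute("""
SELECT u.login, COUNT(*) AS n_commits
FROM commits c
JOIN users u ON u.id = c.author_id
JOIN projects p ON p.id = c.project_id
WHERE p.name = 'jekyll'
GROUP BY u.login
ORDER BY n_commits DESC, u.login
""").fetchall()

print(rows)
```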
repositories users organizations issues
/users/:user
/user/repos
/repos/:user/:repo/issues
/orgs/:org
{
  "type": "User",
  "public_gists": 0,
  "login": "gousiosg",
  "followers": 8,
  "name": "Georgios Gousios",
  "public_repos": 4,
  "created_at": ...,
  "id": 386172,
  "following": 4,
}
{ . . .
CoSQL database as cache
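Using a document store as a cache for raw API responses can be sketched as follows. This is an assumption-laden illustration, not the tool's implementation: `CachedApi` and `fake_fetch` are hypothetical names, a plain dictionary stands in for the database, and no network request is made.

```python
import json
from typing import Callable, Dict


class CachedApi:
    """Return raw JSON documents, calling the remote API only on a cache miss."""

    def __init__(self, fetch: Callable[[str], dict]) -> None:
        self._fetch = fetch
        self._cache: Dict[str, str] = {}  # path -> raw JSON text
        self.remote_calls = 0

    def get(self, path: str) -> dict:
        if path not in self._cache:
            self.remote_calls += 1
            self._cache[path] = json.dumps(self._fetch(path))
        return json.loads(self._cache[path])


def fake_fetch(path: str) -> dict:
    # Stand-in for an HTTP request; returns a fragment of the user document.
    return {"type": "User", "login": "gousiosg"}


api = CachedApi(fake_fetch)
first = api.get("/users/gousiosg")
second = api.get("/users/gousiosg")  # served from the cache
print(api.remote_calls)
```

Keeping the unmodified responses around means the relational tables can be rebuilt later without asking the API again.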
Periodic dumps of DBs online
Query relational DB online
$ gem install sqlite3 ghtorrent
$ ght-retrieve-repo mojombo jekyll
$ (edit config.yaml)
Roll your own tools
Research!
Single developer identities
Source tracking
Network analysis
Distributed development
TUD-SERG-2013-10: An Exploratory Study of the Pull-based Software Development Model
ghtorrent.org
Octicons font: courtesy of GitHub