Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
The GHTorrent dataset and toolsuite
Search
Georgios Gousios
May 17, 2013
Technology
4
130k
The GHTorrent dataset and toolsuite
MSR2013 data paper presentation
Georgios Gousios
May 17, 2013
Tweet
Share
More Decks by Georgios Gousios
See All by Georgios Gousios
NLP + SE = ❤️
gousiosg
0
250
The troubles of modern dependency management and what to do about them
gousiosg
0
480
Mining Repositories with Apache Spark
gousiosg
0
600
My adventures with open everything
gousiosg
0
250
Structure and Evolution of Package Dependency Networks
gousiosg
0
710
Mining Github for fun and profit
gousiosg
9
63k
GitHub Insights: Understanding Open Source
gousiosg
0
330
Work Practices and Challenges in Pull-Based Development: The Contributor’s Perspective
gousiosg
0
880
Big Data in Software Engineering panel and Privacy: Should we care?
gousiosg
0
250
Other Decks in Technology
See All in Technology
KubeCon NA 2024 Recap / Running WebAssembly (Wasm) Workloads Side-by-Side with Container Workloads
z63d
1
250
非機能品質を作り込むための実践アーキテクチャ
knih
5
1.3k
5分でわかるDuckDB
chanyou0311
10
3.2k
どちらを使う?GitHub or Azure DevOps Ver. 24H2
kkamegawa
0
790
多領域インシデントマネジメントへの挑戦:ハードウェアとソフトウェアの融合が生む課題/Challenge to multidisciplinary incident management: Issues created by the fusion of hardware and software
bitkey
PRO
2
100
LINEスキマニにおけるフロントエンド開発
lycorptech_jp
PRO
0
330
10分で学ぶKubernetesコンテナセキュリティ/10min-k8s-container-sec
mochizuki875
3
340
DUSt3R, MASt3R, MASt3R-SfM にみる3D基盤モデル
spatial_ai_network
2
120
re:Invent をおうちで楽しんでみた ~CloudWatch のオブザーバビリティ機能がスゴい!/ Enjoyed AWS re:Invent from Home and CloudWatch Observability Feature is Amazing!
yuj1osm
0
120
Postman と API セキュリティ / Postman and API Security
yokawasa
0
200
【re:Invent 2024 アプデ】 Prompt Routing の紹介
champ
0
140
KnowledgeBaseDocuments APIでベクトルインデックス管理を自動化する
iidaxs
1
260
Featured
See All Featured
The Language of Interfaces
destraynor
154
24k
Side Projects
sachag
452
42k
Art, The Web, and Tiny UX
lynnandtonic
298
20k
Six Lessons from altMBA
skipperchong
27
3.5k
Fashionably flexible responsive web design (full day workshop)
malarkey
405
66k
Visualizing Your Data: Incorporating Mongo into Loggly Infrastructure
mongodb
44
9.3k
Into the Great Unknown - MozCon
thekraken
33
1.5k
Git: the NoSQL Database
bkeepers
PRO
427
64k
Designing on Purpose - Digital PM Summit 2013
jponch
116
7k
Understanding Cognitive Biases in Performance Measurement
bluesmoon
26
1.5k
Building Better People: How to give real-time feedback that sticks.
wjessup
365
19k
GraphQLとの向き合い方2022年版
quramy
44
13k
Transcript
The GHTorrent Dataset and Tool Suite Georgios Gousios Software Engineering
Research Group TU Delft
All data from Github
Ready to be queried
ghtorrent.org
Repositories
Commits
Pull requests
Issues
Users and Organizations
Mirror event stream
<<event>> PushEvent <<api>> /users/:user ensure_user <<api>> /repos/:user/:repo/ ensure_repo <<api>> /repos/:user/:repo/commits
ensure_commits ensure_user <<api>> /:user/:repo/sha ensure_commit ensure_user <<api>> /users/:user/ followers ensure_followers <<api>> /repos/:user/:repo/ commits/:sha/comments ensure_commit_comments <<api>> /users/:user/orgs ensure_orgs <<api>> /orgs/:org/teams ensure_teams Recursive dependency retrieval
Build relational database to query
repositories users organizations issues /users/:user /user/repos /repos/:user/:repo/issues /orgs/:org { 88"type":8"User",
88"public_gists":80, 88"login":8"gousiosg", 88"followers":88, 88"name":8"Georgios8Gousios", 88"public_repos":84, 88"created_at":8..., 88"id":8386172, 88"following":84, } { . . . CoSQL database as cache
Periodic dumps of DBs online
Query relational DB online
$ gem install sqlite3 ghtorrent $ ght-retrieve-repo mojombo jekyll $
(edit config.yaml) Roll your own tools
Research !
Single developer identities
Single developer identities
Single developer identities
Single developer identities
Single developer identities
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Network analysis
Distributed development Text Text TUD-SERG-2013-10 An Exploratory Study of the
Pull- based Software Development model
None
None
None
None
None
ghtorrent.org Octicons font: courtesy Github