The GHTorrent dataset and toolsuite
Georgios Gousios
May 17, 2013
MSR2013 data paper presentation
Transcript
The GHTorrent Dataset and Tool Suite
Georgios Gousios, Software Engineering Research Group, TU Delft
All data from Github
Ready to be queried
ghtorrent.org
Repositories
Commits
Pull requests
Issues
Users and Organizations
Mirror event stream
Recursive dependency retrieval
<<event>> PushEvent
  ensure_user: <<api>> /users/:user
  ensure_repo: <<api>> /repos/:user/:repo/
  ensure_commits: <<api>> /repos/:user/:repo/commits
  ensure_user
  ensure_commit: <<api>> /:user/:repo/sha
  ensure_user
  ensure_followers: <<api>> /users/:user/followers
  ensure_commit_comments: <<api>> /repos/:user/:repo/commits/:sha/comments
  ensure_orgs: <<api>> /users/:user/orgs
  ensure_teams: <<api>> /orgs/:org/teams
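The recursive retrieval idea on this slide can be illustrated with a minimal sketch. It is not GHTorrent's actual code: the `Mirror` class, the simplified push-event shape, and the `FAKE_API` stub are assumptions made for illustration, and only the `ensure_*` names come from the slide. Each `ensure_*` method first checks a local store, and only on a miss does it retrieve the entity's dependencies and then the entity itself.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for the GitHub API: URL path -> JSON-like payload.
FAKE_API: Dict[str, dict] = {
    "/users/alice": {"login": "alice"},
    "/users/bob": {"login": "bob"},
    "/repos/alice/demo": {"name": "demo", "owner": "alice"},
    "/repos/alice/demo/commits/abc123": {"sha": "abc123", "author": "bob"},
}


class Mirror:
    """Toy mirror: every ensure_* call is idempotent and pulls in dependencies."""

    def __init__(self, fetch: Callable[[str], dict]) -> None:
        self.fetch = fetch
        self.store: Dict[Tuple[str, str], dict] = {}  # (kind, key) -> entity
        self.api_calls: List[str] = []                # log of paths requested

    def _get(self, path: str) -> dict:
        self.api_calls.append(path)
        return self.fetch(path)

    def ensure_user(self, login: str) -> dict:
        key = ("user", login)
        if key not in self.store:
            self.store[key] = self._get(f"/users/{login}")
        return self.store[key]

    def ensure_repo(self, owner: str, repo: str) -> dict:
        key = ("repo", f"{owner}/{repo}")
        if key not in self.store:
            self.ensure_user(owner)  # dependency: the owner comes first
            self.store[key] = self._get(f"/repos/{owner}/{repo}")
        return self.store[key]

    def ensure_commit(self, owner: str, repo: str, sha: str) -> dict:
        key = ("commit", sha)
        if key not in self.store:
            self.ensure_repo(owner, repo)  # dependency: the repository
            commit = self._get(f"/repos/{owner}/{repo}/commits/{sha}")
            self.ensure_user(commit["author"])  # dependency: the author
            self.store[key] = commit
        return self.store[key]

    def handle_push_event(self, event: dict) -> None:
        # Simplified, assumed event shape: owner, repo and a list of SHAs.
        for sha in event["shas"]:
            self.ensure_commit(event["owner"], event["repo"], sha)


if __name__ == "__main__":
    mirror = Mirror(FAKE_API.__getitem__)
    mirror.handle_push_event({"owner": "alice", "repo": "demo", "shas": ["abc123"]})
    print(mirror.api_calls)
```

Because every method checks the store before fetching, replaying the same event triggers no further API requests, which is the property that makes a recursive scheme like this affordable under API rate limits.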
Build relational database to query
CoSQL database as cache
Entities: repositories, users, organizations, issues
API endpoints: /users/:user, /user/repos, /repos/:user/:repo/issues, /orgs/:org
{
  "type": "User",
  "public_gists": 0,
  "login": "gousiosg",
  "followers": 8,
  "name": "Georgios Gousios",
  "public_repos": 4,
  "created_at": ...,
  "id": 386172,
  "following": 4,
}
{ . . .
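As a rough sketch of how cached API responses could feed the relational database: the code below keeps raw JSON bodies in a dictionary (a stand-in for the document store) and projects a few fields into an in-memory SQLite table. The `users` table layout and the function names are assumptions for illustration, not GHTorrent's real schema. The sample document reuses the fields shown on the slide and omits `created_at`, whose value is elided there.

```python
import json
import sqlite3
from typing import Dict

# Stand-in for the raw-response cache: API path -> JSON body as received.
RAW_CACHE: Dict[str, str] = {
    "/users/gousiosg": json.dumps({
        "type": "User",
        "public_gists": 0,
        "login": "gousiosg",
        "followers": 8,
        "name": "Georgios Gousios",
        "public_repos": 4,
        "id": 386172,
        "following": 4,
    }),
}


def build_relational_db(cache: Dict[str, str]) -> sqlite3.Connection:
    """Project cached user documents into a (hypothetical) users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users ("
        " id INTEGER PRIMARY KEY,"
        " login TEXT UNIQUE NOT NULL,"
        " name TEXT,"
        " type TEXT)"
    )
    for path, body in cache.items():
        if not path.startswith("/users/"):
            continue  # only user documents are handled in this sketch
        doc = json.loads(body)
        conn.execute(
            "INSERT INTO users (id, login, name, type) VALUES (?, ?, ?, ?)",
            (doc["id"], doc["login"], doc.get("name"), doc.get("type")),
        )
    conn.commit()
    return conn


if __name__ == "__main__":
    db = build_relational_db(RAW_CACHE)
    print(db.execute("SELECT login, name FROM users").fetchall())
```

Keeping the untouched JSON alongside the relational projection means the tables can be rebuilt or extended later without calling the API again.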
Periodic dumps of DBs online
Query relational DB online
Roll your own tools
$ gem install sqlite3 ghtorrent
$ ght-retrieve-repo mojombo jekyll
$ (edit config.yaml)
Research!
Single developer identities
Source tracking
Network analysis
Distributed development
TUD-SERG-2013-10: An Exploratory Study of the Pull-based Software Development model
ghtorrent.org
Octicons font: courtesy Github