The GHTorrent dataset and toolsuite
Georgios Gousios
May 17, 2013
MSR2013 data paper presentation
Transcript
The GHTorrent Dataset and Tool Suite
Georgios Gousios
Software Engineering Research Group, TU Delft
All data from GitHub
Ready to be queried
ghtorrent.org
Repositories
Commits
Pull requests
Issues
Users and Organizations
Mirror event stream
<<event>> PushEvent
<<api>> /users/:user ensure_user
<<api>> /repos/:user/:repo/ ensure_repo
<<api>> /repos/:user/:repo/commits ensure_commits ensure_user
<<api>> /:user/:repo/sha ensure_commit ensure_user
<<api>> /users/:user/followers ensure_followers
<<api>> /repos/:user/:repo/commits/:sha/comments ensure_commit_comments
<<api>> /users/:user/orgs ensure_orgs
<<api>> /orgs/:org/teams ensure_teams
Recursive dependency retrieval
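The recursive retrieval idea in the diagram can be sketched roughly as follows. This is a minimal illustration, not GHTorrent's actual code: the `DEPENDS_ON` table, the `ensure` function, and the `fetch` callback are hypothetical, the dependency edges are assumed for the example, and entities are keyed by kind only (a real tool would key them by owner, repository, SHA, and so on).

```python
from typing import Callable, Dict, List

# Hypothetical dependency table: each entity kind lists the kinds that must
# be stored before it. The names echo the ensure_* steps in the diagram,
# but the exact edges are an assumption made for illustration.
DEPENDS_ON: Dict[str, List[str]] = {
    "user": [],
    "repo": ["user"],
    "commit": ["repo", "user"],
    "commit_comment": ["commit", "user"],
}


def ensure(kind: str, store: Dict[str, dict], fetch: Callable[[str], dict]) -> dict:
    """Return the stored entity of `kind`, retrieving its dependencies first."""
    if kind in store:              # already retrieved: nothing to do
        return store[kind]
    for dep in DEPENDS_ON[kind]:   # depth-first: satisfy prerequisites first
        ensure(dep, store, fetch)
    store[kind] = fetch(kind)      # only now fetch and store the entity itself
    return store[kind]


if __name__ == "__main__":
    calls: List[str] = []

    def fake_fetch(kind: str) -> dict:
        # Stand-in for an API request; records the order of retrieval.
        calls.append(kind)
        return {"kind": kind}

    ensure("commit_comment", {}, fake_fetch)
    print(calls)
```

Because every `ensure` call checks the store before fetching, a dependency shared by several entities (here the user) is retrieved only once.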
Build relational database to query
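To make the relational idea concrete, here is a small sketch using an in-memory SQLite database. The table and column names (`users`, `projects`, `commits`) and the sample rows are invented for illustration and should not be read as the actual GHTorrent schema or data.

```python
import sqlite3
from typing import List, Tuple

# Assumed, simplified schema: users own projects, commits belong to both.
SCHEMA = """
CREATE TABLE users (id INTEGER PRIMARY KEY, login TEXT UNIQUE NOT NULL);
CREATE TABLE projects (
    id INTEGER PRIMARY KEY,
    owner_id INTEGER NOT NULL REFERENCES users(id),
    name TEXT NOT NULL
);
CREATE TABLE commits (
    id INTEGER PRIMARY KEY,
    sha TEXT UNIQUE NOT NULL,
    author_id INTEGER NOT NULL REFERENCES users(id),
    project_id INTEGER NOT NULL REFERENCES projects(id)
);
"""


def build_db() -> sqlite3.Connection:
    """Create the in-memory database and load a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])
    conn.execute("INSERT INTO projects VALUES (1, 1, 'demo')")
    conn.executemany(
        "INSERT INTO commits VALUES (?, ?, ?, ?)",
        [(1, "a1", 1, 1), (2, "b2", 1, 1), (3, "c3", 2, 1)],
    )
    return conn


def commits_per_author(conn: sqlite3.Connection) -> List[Tuple[str, int]]:
    """Count commits per author login, most active author first."""
    return conn.execute(
        """
        SELECT u.login, COUNT(*) AS n
        FROM commits c JOIN users u ON u.id = c.author_id
        GROUP BY u.login
        ORDER BY n DESC, u.login
        """
    ).fetchall()


if __name__ == "__main__":
    print(commits_per_author(build_db()))
```

Once the linked entities sit in tables with foreign keys, questions such as "who commits most to a project" reduce to a single join.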
repositories, users, organizations, issues
/users/:user
/user/repos
/repos/:user/:repo/issues
/orgs/:org
{
  "type": "User",
  "public_gists": 0,
  "login": "gousiosg",
  "followers": 8,
  "name": "Georgios Gousios",
  "public_repos": 4,
  "created_at": ...,
  "id": 386172,
  "following": 4,
}
{ . . .
CoSQL database as cache
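One way to read the "database as cache" slide is as a cache-aside lookup: keep the raw JSON response for each API path and only go back to the API on a miss. The sketch below assumes that reading. `RawCache` and its `fetch` callback are hypothetical names, and a plain dictionary stands in for the document store.

```python
import json
from typing import Callable, Dict


class RawCache:
    """Cache-aside store for raw API responses, keyed by request path."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch              # returns the JSON body for a path
        self._docs: Dict[str, dict] = {} # stand-in for a document collection
        self.misses = 0                  # how often the API had to be called

    def get(self, path: str) -> dict:
        if path not in self._docs:       # miss: call the API and keep the result
            self.misses += 1
            self._docs[path] = json.loads(self._fetch(path))
        return self._docs[path]          # hit: served from the stored document


if __name__ == "__main__":
    def fake_api(path: str) -> str:
        # Made-up response body used only for this demonstration.
        return '{"login": "gousiosg", "type": "User"}'

    cache = RawCache(fake_api)
    cache.get("/users/gousiosg")
    cache.get("/users/gousiosg")
    print(cache.misses)
```

Storing the unmodified response means the relational tables can be rebuilt later from the cached documents without issuing the API requests again.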
Periodic dumps of DBs online
Query relational DB online
$ gem install sqlite3 ghtorrent
$ ght-retrieve-repo mojombo jekyll
$ (edit config.yaml)
Roll your own tools
Research!
Single developer identities
Source tracking
Network analysis
Distributed development
TUD-SERG-2013-10: An Exploratory Study of the Pull-based Software Development Model
ghtorrent.org
Octicons font: courtesy of GitHub