Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
The GHTorrent dataset and toolsuite
Search
Georgios Gousios
May 17, 2013
Technology
4
130k
The GHTorrent dataset and toolsuite
MSR2013 data paper presentation
Georgios Gousios
May 17, 2013
Tweet
Share
More Decks by Georgios Gousios
See All by Georgios Gousios
NLP + SE = ❤️
gousiosg
0
260
The troubles of modern dependency management and what to do about them
gousiosg
0
480
Mining Repositories with Apache Spark
gousiosg
0
610
My adventures with open everything
gousiosg
0
250
Structure and Evolution of Package Dependency Networks
gousiosg
0
710
Mining Github for fun and profit
gousiosg
9
63k
GitHub Insights: Understanding Open Source
gousiosg
0
340
Work Practices and Challenges in Pull-Based Development: The Contributor’s Perspective
gousiosg
0
890
Big Data in Software Engineering panel and Privacy: Should we care?
gousiosg
0
250
Other Decks in Technology
See All in Technology
Reactフレームワークプロダクトを モバイルアプリにして、もっと便利に。 ユーザに価値を届けよう。/React Framework with Capacitor
rdlabo
0
130
Copilotの力を実感!3ヶ月間の生成AI研修の試行錯誤&成功事例をご紹介。果たして得たものとは・・?
ktc_shiori
0
350
タイミーのデータ活用を支えるdbt Cloud導入とこれから
ttccddtoki
1
150
Kotlin Multiplatformのポテンシャル
recruitengineers
PRO
2
150
Building Scalable Backend Services with Firebase
wisdommatt
0
110
re:Invent2024 KeynoteのAmazon Q Developer考察
yusukeshimizu
1
150
機械学習を「社会実装」するということ 2025年版 / Social Implementation of Machine Learning 2025 Version
moepy_stats
5
1.2k
ABWGのRe:Cap!
hm5ug
1
120
Oracle Base Database Service:サービス概要のご紹介
oracle4engineer
PRO
1
16k
シフトライトなテスト活動を適切に行うことで、無理な開発をせず、過剰にテストせず、顧客をビックリさせないプロダクトを作り上げているお話 #RSGT2025 / Shift Right
nihonbuson
3
2.1k
20250116_自部署内でAmazon Nova体験会をやってみた話
riz3f7
1
100
新卒1年目、はじめてのアプリケーションサーバー【IBM WebSphere Liberty】
ktgrryt
0
120
Featured
See All Featured
The Pragmatic Product Professional
lauravandoore
32
6.4k
Speed Design
sergeychernyshev
25
740
How GitHub (no longer) Works
holman
312
140k
How STYLIGHT went responsive
nonsquared
96
5.3k
The Cult of Friendly URLs
andyhume
78
6.1k
Practical Orchestrator
shlominoach
186
10k
Creating an realtime collaboration tool: Agile Flush - .NET Oxford
marcduiker
26
1.9k
The Straight Up "How To Draw Better" Workshop
denniskardys
232
140k
Designing for Performance
lara
604
68k
個人開発の失敗を避けるイケてる考え方 / tips for indie hackers
panda_program
98
18k
Java REST API Framework Comparison - PWX 2021
mraible
28
8.3k
A designer walks into a library…
pauljervisheath
205
24k
Transcript
The GHTorrent Dataset and Tool Suite Georgios Gousios Software Engineering
Research Group TU Delft
All data from Github
Ready to be queried
ghtorrent.org
Repositories
Commits
Pull requests
Issues
Users and Organizations
Mirror event stream
<<event>> PushEvent <<api>> /users/:user ensure_user <<api>> /repos/:user/:repo/ ensure_repo <<api>> /repos/:user/:repo/commits
ensure_commits ensure_user <<api>> /:user/:repo/sha ensure_commit ensure_user <<api>> /users/:user/ followers ensure_followers <<api>> /repos/:user/:repo/ commits/:sha/comments ensure_commit_comments <<api>> /users/:user/orgs ensure_orgs <<api>> /orgs/:org/teams ensure_teams Recursive dependency retrieval
Build relational database to query
repositories users organizations issues /users/:user /user/repos /repos/:user/:repo/issues /orgs/:org { 88"type":8"User",
88"public_gists":80, 88"login":8"gousiosg", 88"followers":88, 88"name":8"Georgios8Gousios", 88"public_repos":84, 88"created_at":8..., 88"id":8386172, 88"following":84, } { . . . CoSQL database as cache
Periodic dumps of DBs online
Query relational DB online
$ gem install sqlite3 ghtorrent $ ght-retrieve-repo mojombo jekyll $
(edit config.yaml) Roll your own tools
Research !
Single developer identities
Single developer identities
Single developer identities
Single developer identities
Single developer identities
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Network analysis
Distributed development Text Text TUD-SERG-2013-10 An Exploratory Study of the
Pull- based Software Development model
None
None
None
None
None
ghtorrent.org Octicons font: courtesy Github