Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
The GHTorrent dataset and toolsuite
Search
Georgios Gousios
May 17, 2013
Technology
4
130k
The GHTorrent dataset and toolsuite
MSR2013 data paper presentation
Georgios Gousios
May 17, 2013
Tweet
Share
More Decks by Georgios Gousios
See All by Georgios Gousios
NLP + SE = ❤️
gousiosg
0
290
The troubles of modern dependency management and what to do about them
gousiosg
0
550
Mining Repositories with Apache Spark
gousiosg
0
660
My adventures with open everything
gousiosg
0
310
Structure and Evolution of Package Dependency Networks
gousiosg
0
790
Mining Github for fun and profit
gousiosg
9
63k
GitHub Insights: Understanding Open Source
gousiosg
0
380
Work Practices and Challenges in Pull-Based Development: The Contributor’s Perspective
gousiosg
0
930
Big Data in Software Engineering panel and Privacy: Should we care?
gousiosg
0
290
Other Decks in Technology
See All in Technology
DataOpsNight#8_Terragruntを用いたスケーラブルなSnowflakeインフラ管理
roki18d
1
320
about #74462 go/token#FileSet
tomtwinkle
1
280
Findy Team+のSOC2取得までの道のり
rvirus0817
0
300
Oracle Cloud Infrastructure:2025年9月度サービス・アップデート
oracle4engineer
PRO
0
370
From Prompt to Product @ How to Web 2025, Bucharest, Romania
janwerner
0
110
AI ReadyなData PlatformとしてのAutonomous Databaseアップデート
oracle4engineer
PRO
0
150
AWSにおけるTrend Vision Oneの効果について
shimak
0
110
C# 14 / .NET 10 の新機能 (RC 1 時点)
nenonaninu
1
1.5k
analysis パッケージの仕組みの上でMulti linter with configを実現する / Go Conference 2025
k1low
1
260
FastAPIの魔法をgRPC/Connect RPCへ
monotaro
PRO
1
680
関係性が駆動するアジャイル──GPTに人格を与えたら、対話を通してふりかえりを習慣化できた話
mhlyc
0
130
Windows で省エネ
murachiakira
0
160
Featured
See All Featured
Connecting the Dots Between Site Speed, User Experience & Your Business [WebExpo 2025]
tammyeverts
9
570
[RailsConf 2023] Rails as a piece of cake
palkan
57
5.9k
Embracing the Ebb and Flow
colly
88
4.8k
The Success of Rails: Ensuring Growth for the Next 100 Years
eileencodes
46
7.6k
The Pragmatic Product Professional
lauravandoore
36
6.9k
Improving Core Web Vitals using Speculation Rules API
sergeychernyshev
19
1.2k
Easily Structure & Communicate Ideas using Wireframe
afnizarnur
194
16k
We Have a Design System, Now What?
morganepeng
53
7.8k
How STYLIGHT went responsive
nonsquared
100
5.8k
The World Runs on Bad Software
bkeepers
PRO
71
11k
ReactJS: Keep Simple. Everything can be a component!
pedronauck
667
120k
"I'm Feeling Lucky" - Building Great Search Experiences for Today's Users (#IAC19)
danielanewman
229
22k
Transcript
The GHTorrent Dataset and Tool Suite Georgios Gousios Software Engineering
Research Group TU Delft
All data from Github
Ready to be queried
ghtorrent.org
Repositories
Commits
Pull requests
Issues
Users and Organizations
Mirror event stream
<<event>> PushEvent <<api>> /users/:user ensure_user <<api>> /repos/:user/:repo/ ensure_repo <<api>> /repos/:user/:repo/commits
ensure_commits ensure_user <<api>> /:user/:repo/sha ensure_commit ensure_user <<api>> /users/:user/ followers ensure_followers <<api>> /repos/:user/:repo/ commits/:sha/comments ensure_commit_comments <<api>> /users/:user/orgs ensure_orgs <<api>> /orgs/:org/teams ensure_teams Recursive dependency retrieval
Build relational database to query
repositories users organizations issues /users/:user /user/repos /repos/:user/:repo/issues /orgs/:org { 88"type":8"User",
88"public_gists":80, 88"login":8"gousiosg", 88"followers":88, 88"name":8"Georgios8Gousios", 88"public_repos":84, 88"created_at":8..., 88"id":8386172, 88"following":84, } { . . . CoSQL database as cache
Periodic dumps of DBs online
Query relational DB online
$ gem install sqlite3 ghtorrent $ ght-retrieve-repo mojombo jekyll $
(edit config.yaml) Roll your own tools
Research !
Single developer identities
Single developer identities
Single developer identities
Single developer identities
Single developer identities
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Source tracking
Network analysis
Distributed development Text Text TUD-SERG-2013-10 An Exploratory Study of the
Pull- based Software Development model
None
None
None
None
None
ghtorrent.org Octicons font: courtesy Github