The GHTorrent dataset and toolsuite
Georgios Gousios
May 17, 2013
MSR2013 data paper presentation
Transcript
The GHTorrent Dataset and Tool Suite
Georgios Gousios, Software Engineering Research Group, TU Delft
All data from GitHub
Ready to be queried
ghtorrent.org
Repositories
Commits
Pull requests
Issues
Users and Organizations
Mirror event stream
Recursive dependency retrieval
<<event>> PushEvent
<<api>> /users/:user: ensure_user
<<api>> /repos/:user/:repo/: ensure_repo
<<api>> /repos/:user/:repo/commits: ensure_commits, ensure_user
<<api>> /:user/:repo/sha: ensure_commit, ensure_user
<<api>> /users/:user/followers: ensure_followers
<<api>> /repos/:user/:repo/commits/:sha/comments: ensure_commit_comments
<<api>> /users/:user/orgs: ensure_orgs
<<api>> /orgs/:org/teams: ensure_teams
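The dependency chain on this slide can be sketched roughly as follows. This is a minimal Python illustration, not GHTorrent's actual code. The `FAKE_API` dictionary stands in for GitHub API responses, and `api_get`, `db`, `handle_push_event` and all function bodies are assumptions made for the example. Only the `ensure_*` names and the endpoint shapes come from the slide.

```python
# Hypothetical stand-in for the GitHub API: endpoint path -> JSON-like data.
FAKE_API = {
    "/users/alice": {"login": "alice"},
    "/users/bob": {"login": "bob"},
    "/repos/alice/demo": {"name": "demo", "owner": "alice"},
    "/repos/alice/demo/commits": [
        {"sha": "abc123", "author": "alice"},
        {"sha": "def456", "author": "bob"},
    ],
}

api_calls = []  # records every simulated API request, in order


def api_get(path):
    """Simulated API request (assumption: no network, no rate limiting)."""
    api_calls.append(path)
    return FAKE_API[path]


# In-memory stand-in for the database that holds retrieved entities.
db = {"users": {}, "repos": {}, "commits": {}}


def ensure_user(login):
    """Return the stored user, fetching it only if it is not stored yet."""
    if login not in db["users"]:
        db["users"][login] = api_get(f"/users/{login}")
    return db["users"][login]


def ensure_repo(owner, name):
    """A repository depends on its owner, so the owner is ensured first."""
    key = f"{owner}/{name}"
    if key not in db["repos"]:
        ensure_user(owner)
        db["repos"][key] = api_get(f"/repos/{owner}/{name}")
    return db["repos"][key]


def ensure_commits(owner, name):
    """Commits depend on their repository and on each commit's author."""
    ensure_repo(owner, name)
    for commit in api_get(f"/repos/{owner}/{name}/commits"):
        if commit["sha"] not in db["commits"]:
            ensure_user(commit["author"])
            db["commits"][commit["sha"]] = commit


def handle_push_event(event):
    """Entry point: one event triggers retrieval of everything it refers to."""
    ensure_commits(event["owner"], event["repo"])


handle_push_event({"type": "PushEvent", "owner": "alice", "repo": "demo"})
print(api_calls)
```

Each `ensure_*` function first satisfies its own dependencies and skips entities that are already stored, so a single event pulls in only the missing pieces of the graph.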
Build relational database to query
CoSQL database as cache
repositories, users, organizations, issues
/users/:user  /user/repos  /repos/:user/:repo/issues  /orgs/:org
{
  "type": "User",
  "public_gists": 0,
  "login": "gousiosg",
  "followers": 8,
  "name": "Georgios Gousios",
  "public_repos": 4,
  "created_at": ...,
  "id": 386172,
  "following": 4,
}
{ . . .
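One way to picture the cache idea is to keep each raw API response keyed by its path and to derive relational rows from the stored document. The sketch below assumes this reading of the slide. `raw_cache`, `fetch_from_api`, `cached_get`, `store_user` and the `users` table layout are hypothetical names, not GHTorrent's schema or API. The field values are those read from the slide, with `created_at` left out because it is elided there.

```python
import json
import sqlite3

# Raw response with the values read from the slide (created_at is elided there).
RAW_USER = json.dumps({
    "type": "User", "public_gists": 0, "login": "gousiosg", "followers": 8,
    "name": "Georgios Gousios", "public_repos": 4, "id": 386172, "following": 4,
})

raw_cache = {}   # stand-in for the document store: API path -> raw JSON text
fetch_count = 0  # how many times the (simulated) API was actually hit


def fetch_from_api(path):
    """Hypothetical network call; here it only returns the canned response."""
    global fetch_count
    fetch_count += 1
    return RAW_USER


def cached_get(path):
    """Serve the raw document from the cache, fetching it on the first request."""
    if path not in raw_cache:
        raw_cache[path] = fetch_from_api(path)
    return json.loads(raw_cache[path])


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, login TEXT, name TEXT, type TEXT)"
)


def store_user(path):
    """Project selected fields of the cached document into a relational row."""
    doc = cached_get(path)
    conn.execute(
        "INSERT OR IGNORE INTO users (id, login, name, type) VALUES (?, ?, ?, ?)",
        (doc["id"], doc["login"], doc["name"], doc["type"]),
    )


store_user("/users/gousiosg")
store_user("/users/gousiosg")  # the second call is served from the cache
rows = conn.execute("SELECT id, login, name FROM users").fetchall()
print(rows, fetch_count)
```

Keeping the raw documents means the relational rows can be rebuilt later without going back to the API.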
Periodic dumps of DBs online
Query relational DB online
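As an illustration of the kind of question a relational view makes easy, the sketch below counts commits per author for one project. The three tables are a simplified, hypothetical schema invented for this example, with made-up rows. The real GHTorrent schema should be checked on ghtorrent.org before writing queries against it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, simplified schema and sample rows (not the real GHTorrent schema).
conn.executescript("""
CREATE TABLE users    (id INTEGER PRIMARY KEY, login TEXT);
CREATE TABLE projects (id INTEGER PRIMARY KEY, owner_id INTEGER, name TEXT);
CREATE TABLE commits  (id INTEGER PRIMARY KEY, sha TEXT,
                       author_id INTEGER, project_id INTEGER);
INSERT INTO users    VALUES (1, 'alice'), (2, 'bob');
INSERT INTO projects VALUES (1, 1, 'demo');
INSERT INTO commits  VALUES (1, 'abc123', 1, 1),
                            (2, 'def456', 2, 1),
                            (3, '0a1b2c', 1, 1);
""")

# Commits per author for a given project, most active authors first.
QUERY = """
SELECT u.login, COUNT(*) AS n_commits
FROM commits c
JOIN users u    ON u.id = c.author_id
JOIN projects p ON p.id = c.project_id
WHERE p.name = ?
GROUP BY u.login
ORDER BY n_commits DESC, u.login
"""

result = conn.execute(QUERY, ("demo",)).fetchall()
print(result)
```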
Roll your own tools
$ gem install sqlite3 ghtorrent
$ ght-retrieve-repo mojombo jekyll
$ (edit config.yaml)
Research!
Single developer identities
Source tracking
Network analysis
Distributed development
TUD-SERG-2013-10: An Exploratory Study of the Pull-based Software Development Model
ghtorrent.org
Octicons font: courtesy of GitHub