Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
A tale of two datasets
Search
Sponsored
·
Ship Features Fearlessly
Turn features on and off without deploys. Used by thousands of Ruby developers.
→
Georgios Gousios
September 25, 2013
Technology
0
550
A tale of two datasets
Presentation given at the 2013 ICSM panel on Open Access
Georgios Gousios
September 25, 2013
Tweet
Share
More Decks by Georgios Gousios
See All by Georgios Gousios
NLP + SE = ❤️
gousiosg
0
320
The troubles of modern dependency management and what to do about them
gousiosg
0
640
Mining Repositories with Apache Spark
gousiosg
0
700
My adventures with open everything
gousiosg
0
340
Structure and Evolution of Package Dependency Networks
gousiosg
0
840
Mining Github for fun and profit
gousiosg
9
63k
GitHub Insights: Understanding Open Source
gousiosg
0
420
Work Practices and Challenges in Pull-Based Development: The Contributor’s Perspective
gousiosg
0
960
Big Data in Software Engineering panel and Privacy: Should we care?
gousiosg
0
330
Other Decks in Technology
See All in Technology
Ruby版 JSXのRuxが気になる
sansantech
PRO
0
160
SchooでVue.js/Nuxtを技術選定している理由
yamanoku
3
160
ブロックテーマ、WordPress でウェブサイトをつくるということ / 2026.02.07 Gifu WordPress Meetup
torounit
0
190
ランサムウェア対策としてのpnpm導入のススメ
ishikawa_satoru
0
210
コンテナセキュリティの最新事情 ~ 2026年版 ~
kyohmizu
2
610
データの整合性を保ちたいだけなんだ
shoheimitani
8
3.2k
日本の85%が使う公共SaaSは、どう育ったのか
taketakekaho
1
230
OpenShiftでllm-dを動かそう!
jpishikawa
0
130
AWS Network Firewall Proxyを触ってみた
nagisa53
1
240
モダンUIでフルサーバーレスなAIエージェントをAmplifyとCDKでサクッとデプロイしよう
minorun365
4
220
私たち準委任PdEは2つのプロダクトに挑戦する ~ソフトウェア、開発支援という”二重”のプロダクトエンジニアリングの実践~ / 20260212 Naoki Takahashi
shift_evolve
PRO
1
110
Amazon Bedrock Knowledge Basesチャンキング解説!
aoinoguchi
0
160
Featured
See All Featured
Groundhog Day: Seeking Process in Gaming for Health
codingconduct
0
94
Facilitating Awesome Meetings
lara
57
6.8k
SEOcharity - Dark patterns in SEO and UX: How to avoid them and build a more ethical web
sarafernandez
0
120
Test your architecture with Archunit
thirion
1
2.2k
Organizational Design Perspectives: An Ontology of Organizational Design Elements
kimpetersen
PRO
1
350
The Art of Delivering Value - GDevCon NA Keynote
reverentgeek
16
1.8k
Jamie Indigo - Trashchat’s Guide to Black Boxes: Technical SEO Tactics for LLMs
techseoconnect
PRO
0
63
How GitHub (no longer) Works
holman
316
140k
The untapped power of vector embeddings
frankvandijk
1
1.6k
For a Future-Friendly Web
brad_frost
182
10k
Keith and Marios Guide to Fast Websites
keithpitt
413
23k
Introduction to Domain-Driven Design and Collaborative software design
baasie
1
590
Transcript
A tale of two datasets Georgios Gousios TU Delft
open access
None
None
None
Software Quality Observatory for OSS
None
50k LOC!
750 OSS repositories, SVN, bugs, emails 1.5GB processed data
dump
demo.sqo-oss.org
SQO-OSS facts • 6 partners • ~30 publications • 4
PhDs funded • press releases on project releases • rated excellent by the EC
1 external user 2 external publications
None
GHTorrent
GHTorrent
mirror event stream
<<event>> PushEvent <<api>> /users/:user ensure_user <<api>> /repos/:user/:repo/ ensure_repo <<api>> /repos/:user/:repo/commits
ensure_commits ensure_user <<api>> /:user/:repo/sha ensure_commit ensure_user <<api>> /users/:user/ followers ensure_followers <<api>> /repos/:user/:repo/ commits/:sha/comments ensure_commit_comments <<api>> /users/:user/orgs ensure_orgs <<api>> /orgs/:org/teams ensure_teams recursive dependency retrieval
relational database
repositories users organizations issues /users/:user /user/repos /repos/:user/:repo/issues /orgs/:org {
"type": "User", "public_gists": 0, "login": "gousiosg", "followers": 8, "name": "Georgios Gousios", "public_repos": 4, "created_at": ..., "id": 386172, "following": 4, } { . . . NoSQL database as cache
periodic dumps of DBs online
query DBs online
GHTorrent facts • 2.0 TB in MongoDB, 40GB in MySQL
• 1 developer • 3 papers • advertised on social media • 1.5 years
5 external users 3 external papers MSR14 challenge dataset
why the difference?
but Github is hot!
but Github is hot!
but Github is hot! so was SourceForge, Gnome, KDE etc
but Github is hot! so was SourceForge, Gnome, KDE etc
the Github Archive project offers a subset of the data in an easier to query format
SQO-OSS GHTorrent Tools pluggable platform ruby library and cmd-line tools
Data initially none, then raw repos raw + processed Documentation mostly programming mostly data formats Released after 2 years immediately Attracted first user 2 years 2 months
open source research
aim for lean and mean
infrastructures and platforms are overrated
open now trumps open when it’s done