A tale of two datasets
Georgios Gousios
September 25, 2013
Presentation given at the 2013 ICSM panel on Open Access
Transcript
A tale of two datasets Georgios Gousios TU Delft
open access
Software Quality Observatory for OSS
50k LOC!
750 OSS repositories, SVN, bugs, emails
1.5GB processed data
dump
demo.sqo-oss.org
SQO-OSS facts
• 6 partners
• ~30 publications
• 4 PhDs funded
• press releases on project releases
• rated excellent by the EC
1 external user
2 external publications
GHTorrent
mirror event stream
<<event>> PushEvent
recursive dependency retrieval:
  ensure_user              <<api>> /users/:user
  ensure_repo              <<api>> /repos/:user/:repo/
  ensure_commits           <<api>> /repos/:user/:repo/commits
  ensure_commit            <<api>> /:user/:repo/sha
  ensure_followers         <<api>> /users/:user/followers
  ensure_commit_comments   <<api>> /repos/:user/:repo/commits/:sha/comments
  ensure_orgs              <<api>> /users/:user/orgs
  ensure_teams             <<api>> /orgs/:org/teams
(ensure_user appears again in the diagram next to ensure_commits and ensure_commit)
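The "recursive dependency retrieval" idea on this slide can be sketched as follows. This is a minimal illustration in Python, not GHTorrent's actual code (which the slides describe as a Ruby library). The `Retriever` class, the `fake_fetch` client, the shape of the event dictionary, and the exact commit path are assumptions made for the example; only the `ensure_*` naming comes from the slide.

```python
from typing import Callable, Dict, List, Tuple

Fetch = Callable[[str], dict]


class Retriever:
    """Fetch an entity only if it is not stored yet, resolving dependencies first."""

    def __init__(self, fetch: Fetch) -> None:
        self.fetch = fetch  # hypothetical API client
        self.store: Dict[Tuple[str, ...], dict] = {}  # stand-in for the database

    def _ensure(self, key: Tuple[str, ...], path: str) -> dict:
        if key not in self.store:  # already retrieved: no API call needed
            self.store[key] = self.fetch(path)
        return self.store[key]

    def ensure_user(self, user: str) -> dict:
        return self._ensure(("user", user), f"/users/{user}")

    def ensure_repo(self, user: str, repo: str) -> dict:
        self.ensure_user(user)  # a repository depends on its owner
        return self._ensure(("repo", user, repo), f"/repos/{user}/{repo}")

    def ensure_commit(self, user: str, repo: str, sha: str) -> dict:
        self.ensure_repo(user, repo)  # a commit depends on its repository
        path = f"/repos/{user}/{repo}/commits/{sha}"  # assumed path layout
        return self._ensure(("commit", user, repo, sha), path)

    def on_push_event(self, event: dict) -> None:
        # The event shape here is invented for the sketch.
        user, repo = event["repo"].split("/")
        for sha in event["shas"]:
            self.ensure_commit(user, repo, sha)


calls: List[str] = []


def fake_fetch(path: str) -> dict:
    """Record the requested path instead of calling a real API."""
    calls.append(path)
    return {"path": path}


retriever = Retriever(fake_fetch)
retriever.on_push_event({"repo": "gousiosg/ghtorrent", "shas": ["abc", "def"]})
```

In this sketch, two commits in one push trigger a single user lookup and a single repository lookup, because each `ensure_*` call returns early once its entity is stored.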
relational database
NoSQL database as cache
  users           /users/:user
  repositories    /user/repos
  issues          /repos/:user/:repo/issues
  organizations   /orgs/:org

{
  "type": "User",
  "public_gists": 0,
  "login": "gousiosg",
  "followers": 8,
  "name": "Georgios Gousios",
  "public_repos": 4,
  "created_at": ...,
  "id": 386172,
  "following": 4
}
{ . . .
periodic dumps of DBs online
query DBs online
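The two storage layers named on the preceding slides (a relational database, with a NoSQL database as cache for raw API responses) might fit together roughly as below. This is a sketch under stated assumptions: the dictionary stands in for a document store, `CachedApi`, `to_user_row`, and `fake_fetch` are hypothetical names, and the three-column row is an invented schema, not GHTorrent's. The field values are taken from the user document shown on the slide.

```python
import json
from typing import Callable, Dict, Optional, Tuple


class CachedApi:
    """Cache-aside lookup: raw JSON responses are kept verbatim, keyed by path."""

    def __init__(self, fetch: Callable[[str], dict]) -> None:
        self.fetch = fetch  # hypothetical API client
        self.documents: Dict[str, str] = {}  # stand-in for a NoSQL collection
        self.misses = 0

    def get(self, path: str) -> dict:
        if path not in self.documents:  # cache miss: call the API once
            self.misses += 1
            self.documents[path] = json.dumps(self.fetch(path))
        return json.loads(self.documents[path])


def to_user_row(doc: dict) -> Tuple[int, str, Optional[str]]:
    """Project a raw user document onto a few relational columns (assumed schema)."""
    return (doc["id"], doc["login"], doc.get("name"))


def fake_fetch(path: str) -> dict:
    """Return a fixed document instead of calling a real API."""
    return {
        "type": "User",
        "login": "gousiosg",
        "id": 386172,
        "name": "Georgios Gousios",
        "followers": 8,
    }


api = CachedApi(fake_fetch)
first = api.get("/users/gousiosg")
second = api.get("/users/gousiosg")  # served from the cache, no second fetch
row = to_user_row(second)
```

Keeping the raw documents means the relational rows can be rebuilt later with a different projection without asking the API again.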
GHTorrent facts
• 2.0 TB in MongoDB, 40GB in MySQL
• 1 developer
• 3 papers
• advertised on social media
• 1.5 years
5 external users
3 external papers
MSR14 challenge dataset
why the difference?
but GitHub is hot!
but GitHub is hot! so was SourceForge, Gnome, KDE etc
the GitHub Archive project offers a subset of the data in an easier-to-query format
                       SQO-OSS                          GHTorrent
Tools                  pluggable platform               ruby library and cmd-line tools
Data                   initially none, then raw repos   raw + processed
Documentation          mostly programming               mostly data formats
Released               after 2 years                    immediately
Attracted first user   2 years                          2 months
open source research
aim for lean and mean
infrastructures and platforms are overrated
open now trumps open when it’s done