Georgios Gousios
September 25, 2013
A tale of two datasets
Presentation given at the 2013 ICSM panel on Open Access
Transcript
A tale of two datasets Georgios Gousios TU Delft
open access
Software Quality Observatory for OSS
50k LOC!
750 OSS repositories, SVN, bugs, emails
1.5GB processed data
dump
demo.sqo-oss.org
SQO-OSS facts
• 6 partners
• ~30 publications
• 4 PhDs funded
• press releases on project releases
• rated excellent by the EC
1 external user
2 external publications
GHTorrent
mirror event stream
<<event>> PushEvent
<<api>> /users/:user -> ensure_user
<<api>> /repos/:user/:repo/ -> ensure_repo
<<api>> /repos/:user/:repo/commits -> ensure_commits, ensure_user
<<api>> /:user/:repo/sha -> ensure_commit, ensure_user
<<api>> /users/:user/followers -> ensure_followers
<<api>> /repos/:user/:repo/commits/:sha/comments -> ensure_commit_comments
<<api>> /users/:user/orgs -> ensure_orgs
<<api>> /orgs/:org/teams -> ensure_teams
recursive dependency retrieval
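The "recursive dependency retrieval" idea on this slide can be sketched roughly as follows. This is a minimal illustration, not GHTorrent's actual code (which is a Ruby library): the `Mirror` class, the `FAKE_API` dictionary, the event shape, and the exact paths are all assumptions made up for the example. Only the `ensure_*` naming and the idea that each entity first ensures the entities it depends on come from the slide.

```python
# Hypothetical in-memory stand-in for the GitHub API: path -> JSON-like dict.
# All paths and records here are invented for illustration.
FAKE_API = {
    "/users/alice": {"login": "alice"},
    "/users/bob": {"login": "bob"},
    "/repos/alice/demo": {"name": "demo", "owner": "alice"},
    "/repos/alice/demo/commits/abc123": {"sha": "abc123", "author": "bob"},
}


class Mirror:
    """Retrieves an entity only after the entities it depends on exist."""

    def __init__(self, api):
        self.api = api
        self.store = {}      # entities already retrieved, keyed by API path
        self.requests = []   # log of API paths that were actually fetched

    def _ensure(self, path):
        # Fetch once; later calls for the same path are served from the store.
        if path not in self.store:
            self.requests.append(path)
            self.store[path] = self.api[path]
        return self.store[path]

    def ensure_user(self, user):
        return self._ensure(f"/users/{user}")

    def ensure_repo(self, user, repo):
        self.ensure_user(user)                  # a repo depends on its owner
        return self._ensure(f"/repos/{user}/{repo}")

    def ensure_commit(self, user, repo, sha):
        self.ensure_repo(user, repo)            # a commit depends on its repo
        commit = self._ensure(f"/repos/{user}/{repo}/commits/{sha}")
        self.ensure_user(commit["author"])      # ...and on its author
        return commit

    def on_push_event(self, event):
        # Event shape (user, repo, shas) is an assumption for this sketch.
        for sha in event["shas"]:
            self.ensure_commit(event["user"], event["repo"], sha)


mirror = Mirror(FAKE_API)
push = {"user": "alice", "repo": "demo", "shas": ["abc123"]}
mirror.on_push_event(push)
mirror.on_push_event(push)   # replaying the event triggers no new requests
print(mirror.requests)
```

Because every `ensure_*` call is idempotent, a replayed or overlapping event costs no extra API requests in this sketch.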
relational database
repositories, users, organizations, issues
/users/:user
/user/repos
/repos/:user/:repo/issues
/orgs/:org
{
  "type": "User",
  "public_gists": 0,
  "login": "gousiosg",
  "followers": 8,
  "name": "Georgios Gousios",
  "public_repos": 4,
  "created_at": ...,
  "id": 386172,
  "following": 4
}
{ . . .
NoSQL database as cache
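One way to read "NoSQL database as cache" next to the relational database is sketched below: the raw API response is stored untouched, and only selected fields are projected into a relational table. This is an assumption-laden illustration, not the project's schema: a plain dictionary stands in for MongoDB, an in-memory SQLite database stands in for MySQL, and `fetch_from_api`, `get_raw`, `ensure_user_row`, and the `users` table layout are hypothetical. The sample document reuses the fields shown on the slide and leaves out the elided `created_at`.

```python
import json
import sqlite3

raw_cache = {}  # stand-in for the NoSQL store: API path -> raw JSON text

db = sqlite3.connect(":memory:")  # stand-in for the relational database
db.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, login TEXT UNIQUE, name TEXT)"
)


def fetch_from_api(path):
    """Hypothetical stub for a remote call; counts how often it is used."""
    fetch_from_api.calls += 1
    return json.dumps({
        "type": "User", "login": "gousiosg", "name": "Georgios Gousios",
        "id": 386172, "followers": 8, "following": 4,
        "public_repos": 4, "public_gists": 0,
    })


fetch_from_api.calls = 0


def get_raw(path):
    # Serve the raw document from the cache; hit the API only on a miss.
    if path not in raw_cache:
        raw_cache[path] = fetch_from_api(path)
    return json.loads(raw_cache[path])


def ensure_user_row(login):
    # Project a few fields of the raw document into the relational table.
    doc = get_raw(f"/users/{login}")
    db.execute(
        "INSERT OR IGNORE INTO users (id, login, name) VALUES (?, ?, ?)",
        (doc["id"], doc["login"], doc["name"]),
    )
    return db.execute(
        "SELECT id, login, name FROM users WHERE login = ?", (login,)
    ).fetchone()


row = ensure_user_row("gousiosg")
ensure_user_row("gousiosg")  # second call is answered from the cache
print(row, fetch_from_api.calls)
```

Keeping the raw responses means the relational schema could be rebuilt or extended later without querying the API again, which is the main appeal of this split in the sketch.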
periodic dumps of DBs online
query DBs online
GHTorrent facts
• 2.0 TB in MongoDB, 40GB in MySQL
• 1 developer
• 3 papers
• advertised on social media
• 1.5 years
5 external users
3 external papers
MSR14 challenge dataset
why the difference?
but GitHub is hot!
so were SourceForge, Gnome, KDE, etc.
the GitHub Archive project offers a subset of the data in an easier-to-query format
                      SQO-OSS                         GHTorrent
Tools                 pluggable platform              Ruby library and cmd-line tools
Data                  initially none, then raw repos  raw + processed
Documentation         mostly programming              mostly data formats
Released              after 2 years                   immediately
Attracted first user  2 years                         2 months
open source research
aim for lean and mean
infrastructures and platforms are overrated
open now trumps open when it’s done