Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
A tale of two datasets
Search
Georgios Gousios
September 25, 2013
Technology
0
500
A tale of two datasets
Presentation given at the 2013 ICSM panel on Open Access
Georgios Gousios
September 25, 2013
Tweet
Share
More Decks by Georgios Gousios
See All by Georgios Gousios
NLP + SE = ❤️
gousiosg
0
290
The troubles of modern dependency management and what to do about them
gousiosg
0
550
Mining Repositories with Apache Spark
gousiosg
0
660
My adventures with open everything
gousiosg
0
310
Structure and Evolution of Package Dependency Networks
gousiosg
0
790
Mining Github for fun and profit
gousiosg
9
63k
GitHub Insights: Understanding Open Source
gousiosg
0
380
Work Practices and Challenges in Pull-Based Development: The Contributor’s Perspective
gousiosg
0
930
Big Data in Software Engineering panel and Privacy: Should we care?
gousiosg
0
300
Other Decks in Technology
See All in Technology
Codexとも仲良く。CodeRabbit CLIの紹介
moongift
PRO
0
190
LLMアプリの地上戦開発計画と運用実践 / 2025.10.15 GPU UNITE 2025
smiyawaki0820
1
470
プロポーザルのコツ ~ Kaigi on Rails 2025 初参加で3名の登壇を実現 ~
naro143
1
220
HR Force における DWH の併用事例 ~ サービス基盤としての BigQuery / 分析基盤としての Snowflake ~@Cross Data Platforms Meetup #2「BigQueryと愉快な仲間たち」
ryo_suzuki
0
130
自動テストのコストと向き合ってみた
qa
1
220
オープンソースでどこまでできる?フォーマル検証チャレンジ
msyksphinz
0
130
OpenAI gpt-oss ファインチューニング入門
kmotohas
2
1.2k
能登半島地震において デジタルができたこと・できなかったこと
ditccsugii
0
150
Vibe Coding Year in Review. From Karpathy to Real-World Agents by Niels Rolland, CEO Paatch
vcoisne
0
130
Reflections of AI: A Trilogy in Four Parts (GOTO; Copenhagen 2025)
ondfisk
0
110
サイバーエージェント流クラウドコスト削減施策「みんなで金塊堀太郎」
kurochan
1
320
エンタメとAIのための3Dパラレルワールド構築(GPU UNITE 2025 特別講演)
pfn
PRO
0
260
Featured
See All Featured
Easily Structure & Communicate Ideas using Wireframe
afnizarnur
194
16k
Writing Fast Ruby
sferik
629
62k
個人開発の失敗を避けるイケてる考え方 / tips for indie hackers
panda_program
114
20k
The Power of CSS Pseudo Elements
geoffreycrofte
79
6k
The Myth of the Modular Monolith - Day 2 Keynote - Rails World 2024
eileencodes
26
3.1k
Put a Button on it: Removing Barriers to Going Fast.
kastner
60
4k
Git: the NoSQL Database
bkeepers
PRO
431
66k
KATA
mclloyd
32
15k
Thoughts on Productivity
jonyablonski
70
4.9k
What's in a price? How to price your products and services
michaelherold
246
12k
Fight the Zombie Pattern Library - RWD Summit 2016
marcelosomers
234
17k
The World Runs on Bad Software
bkeepers
PRO
72
11k
Transcript
A tale of two datasets Georgios Gousios TU Delft
open access
None
None
None
Software Quality Observatory for OSS
None
50k LOC!
750 OSS repositories, SVN, bugs, emails 1.5GB processed data
dump
demo.sqo-oss.org
SQO-OSS facts • 6 partners • ~30 publications • 4
PhDs funded • press releases on project releases • rated excellent by the EC
1 external user 2 external publications
None
GHTorrent
GHTorrent
mirror event stream
<<event>> PushEvent <<api>> /users/:user ensure_user <<api>> /repos/:user/:repo/ ensure_repo <<api>> /repos/:user/:repo/commits
ensure_commits ensure_user <<api>> /:user/:repo/sha ensure_commit ensure_user <<api>> /users/:user/ followers ensure_followers <<api>> /repos/:user/:repo/ commits/:sha/comments ensure_commit_comments <<api>> /users/:user/orgs ensure_orgs <<api>> /orgs/:org/teams ensure_teams recursive dependency retrieval
relational database
repositories users organizations issues /users/:user /user/repos /repos/:user/:repo/issues /orgs/:org {
"type": "User", "public_gists": 0, "login": "gousiosg", "followers": 8, "name": "Georgios Gousios", "public_repos": 4, "created_at": ..., "id": 386172, "following": 4, } { . . . NoSQL database as cache
periodic dumps of DBs online
query DBs online
GHTorrent facts • 2.0 TB in MongoDB, 40GB in MySQL
• 1 developer • 3 papers • advertised on social media • 1.5 years
5 external users 3 external papers MSR14 challenge dataset
why the difference?
but Github is hot!
but Github is hot!
but Github is hot! so was SourceForge, Gnome, KDE etc
but Github is hot! so was SourceForge, Gnome, KDE etc
the Github Archive project offers a subset of the data in an easier to query format
SQO-OSS GHTorrent Tools pluggable platform ruby library and cmd-line tools
Data initially none, then raw repos raw + processed Documentation mostly programming mostly data formats Released after 2 years immediately Attracted first user 2 years 2 months
open source research
aim for lean and mean
infrastructures and platforms are overrated
open now trumps open when it’s done