A tale of two datasets
Georgios Gousios
September 25, 2013
Presentation given at the 2013 ICSM panel on Open Access
Transcript
A tale of two datasets (Georgios Gousios, TU Delft)
open access
Software Quality Observatory for OSS
50k LOC!
750 OSS repositories: SVN, bugs, emails
1.5GB processed data
dump
demo.sqo-oss.org
SQO-OSS facts
• 6 partners
• ~30 publications
• 4 PhDs funded
• press releases on project releases
• rated excellent by the EC
1 external user, 2 external publications
GHTorrent
mirror event stream
recursive dependency retrieval
<<event>> PushEvent
<<api>> /users/:user  ensure_user
<<api>> /repos/:user/:repo/  ensure_repo
<<api>> /repos/:user/:repo/commits  ensure_commits
ensure_user
<<api>> /:user/:repo/sha  ensure_commit
ensure_user
<<api>> /users/:user/followers  ensure_followers
<<api>> /repos/:user/:repo/commits/:sha/comments  ensure_commit_comments
<<api>> /users/:user/orgs  ensure_orgs
<<api>> /orgs/:org/teams  ensure_teams
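The retrieval pattern named on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not GHTorrent's actual code (the deck describes the real tools as a Ruby library): the `Mirror` class, the `FAKE_API` dictionary, the event shape, and the exact paths are all hypothetical. The idea shown is that each `ensure_*` step first makes sure the entities it depends on exist, and only calls the API for entities that are not stored yet.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory stand-in for the remote API: path -> payload.
FAKE_API: Dict[str, dict] = {
    "/users/alice": {"login": "alice"},
    "/users/bob": {"login": "bob"},
    "/repos/alice/demo": {"name": "demo", "owner": "alice"},
    "/repos/alice/demo/commits/abc123": {"sha": "abc123", "author": "bob"},
}


class Mirror:
    """Toy mirror: every ensure_* call retrieves its dependencies first."""

    def __init__(self, fetch: Callable[[str], dict]) -> None:
        self.fetch = fetch
        self.db: Dict[str, dict] = {}   # stand-in for the relational store
        self.api_calls: List[str] = []  # records which paths hit the API

    def _ensure(self, path: str) -> dict:
        # Call the API only when the entity has not been stored yet.
        if path not in self.db:
            self.api_calls.append(path)
            self.db[path] = self.fetch(path)
        return self.db[path]

    def ensure_user(self, login: str) -> dict:
        return self._ensure(f"/users/{login}")

    def ensure_repo(self, owner: str, repo: str) -> dict:
        self.ensure_user(owner)  # a repository depends on its owner
        return self._ensure(f"/repos/{owner}/{repo}")

    def ensure_commit(self, owner: str, repo: str, sha: str) -> dict:
        self.ensure_repo(owner, repo)  # a commit depends on its repository
        commit = self._ensure(f"/repos/{owner}/{repo}/commits/{sha}")
        self.ensure_user(commit["author"])  # and on its author
        return commit

    def on_push_event(self, event: dict) -> None:
        # A push event triggers retrieval of every commit it mentions.
        for sha in event["shas"]:
            self.ensure_commit(event["owner"], event["repo"], sha)


mirror = Mirror(FAKE_API.__getitem__)
mirror.on_push_event({"owner": "alice", "repo": "demo", "shas": ["abc123"]})
```

Replaying the same event a second time adds nothing to `mirror.api_calls`, because every entity is already stored; that property is what makes this style of retrieval safe to re-run.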
relational database
NoSQL database as cache
repositories, users, organizations, issues
/users/:user
/user/repos
/repos/:user/:repo/issues
/orgs/:org
{
  "type": "User",
  "public_gists": 0,
  "login": "gousiosg",
  "followers": 8,
  "name": "Georgios Gousios",
  "public_repos": 4,
  "created_at": ...,
  "id": 386172,
  "following": 4
}
{ . . .
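The two storage layers on these slides (raw API responses kept in a NoSQL cache, selected fields stored in a relational database) might be combined roughly as follows. This is a sketch under assumptions rather than the project's actual schema: a plain dictionary stands in for the NoSQL store, SQLite stands in for the relational database, and the `users` table layout and all function names are invented for illustration. The sample document reuses the fields shown on the slide, leaving out `created_at` because its value is elided there.

```python
import json
import sqlite3
from typing import Dict

# Stand-in for the NoSQL cache: API path -> raw JSON text, stored unmodified.
raw_cache: Dict[str, str] = {}


def cache_raw(path: str, payload: dict) -> None:
    """Keep the full API response so it can be re-processed later."""
    raw_cache[path] = json.dumps(payload)


def make_db() -> sqlite3.Connection:
    """Create an in-memory relational store with a hypothetical users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, login TEXT UNIQUE, name TEXT)"
    )
    return conn


def load_user(conn: sqlite3.Connection, path: str) -> None:
    """Extract a few fields from the cached document into the relational table."""
    doc = json.loads(raw_cache[path])
    conn.execute(
        "INSERT OR REPLACE INTO users (id, login, name) VALUES (?, ?, ?)",
        (doc["id"], doc["login"], doc["name"]),
    )


# Fields taken from the slide; created_at is omitted because it is elided there.
cache_raw(
    "/users/gousiosg",
    {
        "type": "User",
        "public_gists": 0,
        "login": "gousiosg",
        "followers": 8,
        "name": "Georgios Gousios",
        "public_repos": 4,
        "id": 386172,
        "following": 4,
    },
)
conn = make_db()
load_user(conn, "/users/gousiosg")
```

One consequence of keeping the raw documents is that the relational schema can change later: fields such as `followers` remain in the cache and can be extracted without calling the API again.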
periodic dumps of DBs online
query DBs online
GHTorrent facts
• 2.0 TB in MongoDB, 40GB in MySQL
• 1 developer
• 3 papers
• advertised on social media
• 1.5 years
5 external users, 3 external papers, MSR14 challenge dataset
why the difference?
but GitHub is hot! so was SourceForge, Gnome, KDE etc.
the GitHub Archive project offers a subset of the data in an easier-to-query format
                      SQO-OSS                         GHTorrent
Tools                 pluggable platform              ruby library and cmd-line tools
Data                  initially none, then raw repos  raw + processed
Documentation         mostly programming              mostly data formats
Released              after 2 years                   immediately
Attracted first user  2 years                         2 months
open source research
aim for lean and mean
infrastructures and platforms are overrated
open now trumps open when it’s done