Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
A tale of two datasets
Search
Georgios Gousios
September 25, 2013
Technology
0
470
A tale of two datasets
Presentation given at the 2013 ICSM panel on Open Access
Georgios Gousios
September 25, 2013
Tweet
Share
More Decks by Georgios Gousios
See All by Georgios Gousios
NLP + SE = ❤️
gousiosg
0
260
The troubles of modern dependency management and what to do about them
gousiosg
0
490
Mining Repositories with Apache Spark
gousiosg
0
620
My adventures with open everything
gousiosg
0
260
Structure and Evolution of Package Dependency Networks
gousiosg
0
730
Mining Github for fun and profit
gousiosg
9
63k
GitHub Insights: Understanding Open Source
gousiosg
0
340
Work Practices and Challenges in Pull-Based Development: The Contributor’s Perspective
gousiosg
0
900
Big Data in Software Engineering panel and Privacy: Should we care?
gousiosg
0
260
Other Decks in Technology
See All in Technology
Aurora PostgreSQLがCloudWatch Logsに 出力するログの課金を削減してみる #jawsdays2025
non97
1
250
Pwned Labsのすゝめ
ken5scal
2
570
IAMのマニアックな話2025
nrinetcom
PRO
6
1.4k
遷移の高速化 ヤフートップの試行錯誤
narirou
6
1.9k
User Story Mapping + Inclusive Team
kawaguti
PRO
3
460
2025/3/1 公共交通オープンデータデイ2025
morohoshi
0
110
ライフステージの変化を乗り越える 探索型のキャリア選択
tenshoku_draft
1
100
大規模アジャイルフレームワークから学ぶエンジニアマネジメントの本質
staka121
PRO
3
1.7k
エンジニア主導の企画立案を可能にする組織とは?
recruitengineers
PRO
1
310
Log Analytics を使った実際の運用 - Sansan Data Hub での取り組み
sansantech
PRO
0
120
役員・マネージャー・著者・エンジニアそれぞれの立場から見たAWS認定資格
nrinetcom
PRO
5
6.8k
スクラムというコンフォートゾーンから抜け出そう!プロジェクト全体に目を向けるインセプションデッキ / Inception Deck for seeing the whole project
takaking22
3
170
Featured
See All Featured
Helping Users Find Their Own Way: Creating Modern Search Experiences
danielanewman
29
2.5k
[RailsConf 2023 Opening Keynote] The Magic of Rails
eileencodes
28
9.3k
A better future with KSS
kneath
238
17k
Making the Leap to Tech Lead
cromwellryan
133
9.1k
Code Review Best Practice
trishagee
67
18k
The Myth of the Modular Monolith - Day 2 Keynote - Rails World 2024
eileencodes
21
2.5k
CSS Pre-Processors: Stylus, Less & Sass
bermonpainter
356
29k
Writing Fast Ruby
sferik
628
61k
I Don’t Have Time: Getting Over the Fear to Launch Your Podcast
jcasabona
32
2.2k
Learning to Love Humans: Emotional Interface Design
aarron
273
40k
Optimizing for Happiness
mojombo
377
70k
The Illustrated Children's Guide to Kubernetes
chrisshort
48
49k
Transcript
A tale of two datasets Georgios Gousios TU Delft
open access
None
None
None
Software Quality Observatory for OSS
None
50k LOC!
750 OSS repositories, SVN, bugs, emails 1.5GB processed data
dump
demo.sqo-oss.org
SQO-OSS facts • 6 partners • ~30 publications • 4
PhDs funded • press releases on project releases • rated excellent by the EC
1 external user 2 external publications
None
GHTorrent
GHTorrent
mirror event stream
<<event>> PushEvent <<api>> /users/:user ensure_user <<api>> /repos/:user/:repo/ ensure_repo <<api>> /repos/:user/:repo/commits
ensure_commits ensure_user <<api>> /:user/:repo/sha ensure_commit ensure_user <<api>> /users/:user/ followers ensure_followers <<api>> /repos/:user/:repo/ commits/:sha/comments ensure_commit_comments <<api>> /users/:user/orgs ensure_orgs <<api>> /orgs/:org/teams ensure_teams recursive dependency retrieval
relational database
repositories users organizations issues /users/:user /user/repos /repos/:user/:repo/issues /orgs/:org {
"type": "User", "public_gists": 0, "login": "gousiosg", "followers": 8, "name": "Georgios Gousios", "public_repos": 4, "created_at": ..., "id": 386172, "following": 4, } { . . . NoSQL database as cache
periodic dumps of DBs online
query DBs online
GHTorrent facts • 2.0 TB in MongoDB, 40GB in MySQL
• 1 developer • 3 papers • advertised on social media • 1.5 years
5 external users 3 external papers MSR14 challenge dataset
why the difference?
but Github is hot!
but Github is hot!
but Github is hot! so was SourceForge, Gnome, KDE etc
but Github is hot! so was SourceForge, Gnome, KDE etc
the Github Archive project offers a subset of the data in an easier to query format
SQO-OSS GHTorrent Tools pluggable platform ruby library and cmd-line tools
Data initially none, then raw repos raw + processed Documentation mostly programming mostly data formats Released after 2 years immediately Attracted first user 2 years 2 months
open source research
aim for lean and mean
infrastructures and platforms are overrated
open now trumps open when it’s done