A tale of two datasets
Georgios Gousios
September 25, 2013
Presentation given at the 2013 ICSM panel on Open Access
Transcript
A tale of two datasets Georgios Gousios TU Delft
open access
Software Quality Observatory for OSS
50k LOC!
750 OSS repositories, SVN, bugs, emails
1.5GB processed data
dump
demo.sqo-oss.org
SQO-OSS facts
• 6 partners
• ~30 publications
• 4 PhDs funded
• press releases on project releases
• rated excellent by the EC
1 external user
2 external publications
GHTorrent
mirror event stream
<<event>> PushEvent
<<api>> /users/:user : ensure_user
<<api>> /repos/:user/:repo/ : ensure_repo
<<api>> /repos/:user/:repo/commits : ensure_commits, ensure_user
<<api>> /:user/:repo/sha : ensure_commit, ensure_user
<<api>> /users/:user/followers : ensure_followers
<<api>> /repos/:user/:repo/commits/:sha/comments : ensure_commit_comments
<<api>> /users/:user/orgs : ensure_orgs
<<api>> /orgs/:org/teams : ensure_teams
recursive dependency retrieval
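The recursive dependency retrieval on this slide can be illustrated with a minimal sketch. It is not GHTorrent's actual code (GHTorrent is a Ruby library). The `FAKE_API` table, the `store` dictionary, the event shape and the function bodies are all assumptions made for illustration; only the `ensure_*` names and the general idea come from the slide. Each `ensure_*` function first makes sure the entities it depends on exist, then fetches its own entity only if it is not stored yet.

```python
# Hypothetical stand-in for the GitHub API: endpoint path -> canned response.
FAKE_API = {
    "/users/alice": {"login": "alice"},
    "/repos/alice/demo": {"name": "demo", "owner": "alice"},
    "/repos/alice/demo/commits/abc123": {"sha": "abc123", "author": "alice"},
}

store = {}   # entity key -> stored record (stands in for the database)
calls = []   # log of the API requests that were actually issued


def api_get(path):
    """Record the request and return the canned response."""
    calls.append(path)
    return FAKE_API[path]


def ensure_user(login):
    key = ("user", login)
    if key not in store:
        store[key] = api_get(f"/users/{login}")
    return store[key]


def ensure_repo(owner, repo):
    key = ("repo", owner, repo)
    if key not in store:
        ensure_user(owner)  # a repo depends on its owner
        store[key] = api_get(f"/repos/{owner}/{repo}")
    return store[key]


def ensure_commit(owner, repo, sha):
    key = ("commit", owner, repo, sha)
    if key not in store:
        ensure_repo(owner, repo)  # a commit depends on its repo
        commit = api_get(f"/repos/{owner}/{repo}/commits/{sha}")
        ensure_user(commit["author"])  # and on its author
        store[key] = commit
    return store[key]


def handle_push_event(event):
    """Resolve every commit named in a (simplified, hypothetical) push event."""
    for sha in event["shas"]:
        ensure_commit(event["owner"], event["repo"], sha)
```

Handling the same event twice issues no further requests, because every entity is already in `store`; that is what keeps the recursion from refetching shared dependencies such as the user.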
relational database
repositories users organizations issues
/users/:user /user/repos /repos/:user/:repo/issues /orgs/:org
{
  "type": "User",
  "public_gists": 0,
  "login": "gousiosg",
  "followers": 8,
  "name": "Georgios Gousios",
  "public_repos": 4,
  "created_at": ...,
  "id": 386172,
  "following": 4,
}
{ . . .
NoSQL database as cache
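One way to picture the split between the raw cache and the relational database is the sketch below. It assumes a plain dictionary in place of the NoSQL store and an in-memory SQLite database in place of the relational one; the `users` table layout and the helper names are hypothetical and do not reproduce the GHTorrent schema. The raw API response is kept verbatim, and only selected fields are copied into a relational row.

```python
import json
import sqlite3

raw_cache = {}  # endpoint path -> raw JSON text, kept exactly as received


def cache_response(path, body):
    """Keep the raw API response so it never has to be fetched again."""
    raw_cache[path] = body


def open_db():
    """Create an in-memory database with a hypothetical users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, login TEXT UNIQUE, name TEXT)"
    )
    return conn


def load_user(conn, path):
    """Copy selected fields of a cached user document into the users table."""
    doc = json.loads(raw_cache[path])
    conn.execute(
        "INSERT OR REPLACE INTO users (id, login, name) VALUES (?, ?, ?)",
        (doc["id"], doc["login"], doc["name"]),
    )
```

Because the raw documents stay in the cache, the relational tables could be rebuilt or extended with further fields later without going back to the API.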
periodic dumps of DBs online
query DBs online
GHTorrent facts
• 2.0 TB in MongoDB, 40GB in MySQL
• 1 developer
• 3 papers
• advertised on social media
• 1.5 years
5 external users
3 external papers
MSR14 challenge dataset
why the difference?
but GitHub is hot! so was SourceForge, Gnome, KDE etc
the GitHub Archive project offers a subset of the data in an easier-to-query format
                      SQO-OSS                          GHTorrent
Tools                 pluggable platform               ruby library and cmd-line tools
Data                  initially none, then raw repos   raw + processed
Documentation         mostly programming               mostly data formats
Released              after 2 years                    immediately
Attracted first user  2 years                          2 months
open source research
aim for lean and mean
infrastructures and platforms are overrated
open now trumps open when it’s done