A tale of two datasets
Georgios Gousios
September 25, 2013
Presentation given at the 2013 ICSM panel on Open Access
Transcript
A tale of two datasets (Georgios Gousios, TU Delft)
open access
Software Quality Observatory for OSS
50k LOC!
750 OSS repositories, SVN, bugs, emails; 1.5GB processed data
dump
demo.sqo-oss.org
SQO-OSS facts
• 6 partners
• ~30 publications
• 4 PhDs funded
• press releases on project releases
• rated excellent by the EC
1 external user, 2 external publications
GHTorrent
mirror event stream
recursive dependency retrieval
<<event>> PushEvent
<<api>> /users/:user                            ensure_user
<<api>> /repos/:user/:repo/                     ensure_repo
<<api>> /repos/:user/:repo/commits              ensure_commits
                                                ensure_user
<<api>> /:user/:repo/sha                        ensure_commit
                                                ensure_user
<<api>> /users/:user/followers                  ensure_followers
<<api>> /repos/:user/:repo/commits/:sha/comments  ensure_commit_comments
<<api>> /users/:user/orgs                       ensure_orgs
<<api>> /orgs/:org/teams                        ensure_teams
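The recursive retrieval on this slide can be sketched roughly as follows. This is a minimal illustration in Python, not GHTorrent's own code (which is a Ruby library): the `Mirror` class, the `FAKE_API` table, and the event layout are all hypothetical stand-ins, and only the `ensure_*` names come from the slide. The idea shown is that each `ensure_*` step first ensures the entities it depends on, and each entity is fetched at most once.

```python
from typing import Callable, Dict

# Hypothetical stand-in for the remote API: maps a path to a JSON-like dict.
FAKE_API: Dict[str, dict] = {
    "/users/alice": {"login": "alice"},
    "/repos/alice/demo": {"name": "demo", "owner": "alice"},
    "/repos/alice/demo/commits/abc123": {"sha": "abc123", "author": "alice"},
}


class Mirror:
    """Retrieves an entity together with everything it depends on."""

    def __init__(self, fetch: Callable[[str], dict]) -> None:
        self.fetch = fetch
        self.store: Dict[str, dict] = {}  # entities retrieved so far
        self.api_calls = 0                # number of remote fetches made

    def _ensure(self, path: str) -> dict:
        # Fetch only when the entity is not already stored locally.
        if path not in self.store:
            self.api_calls += 1
            self.store[path] = self.fetch(path)
        return self.store[path]

    def ensure_user(self, user: str) -> dict:
        return self._ensure(f"/users/{user}")

    def ensure_repo(self, user: str, repo: str) -> dict:
        self.ensure_user(user)  # a repository depends on its owner
        return self._ensure(f"/repos/{user}/{repo}")

    def ensure_commit(self, user: str, repo: str, sha: str) -> dict:
        self.ensure_repo(user, repo)  # a commit depends on its repository
        commit = self._ensure(f"/repos/{user}/{repo}/commits/{sha}")
        self.ensure_user(commit["author"])  # and on its author
        return commit

    def on_push_event(self, event: dict) -> None:
        # Hypothetical event layout: owner, repository, list of commit SHAs.
        for sha in event["shas"]:
            self.ensure_commit(event["user"], event["repo"], sha)


mirror = Mirror(FAKE_API.__getitem__)
push = {"user": "alice", "repo": "demo", "shas": ["abc123"]}
mirror.on_push_event(push)
calls_after_first = mirror.api_calls
mirror.on_push_event(push)  # replaying the event triggers no new fetches
calls_after_second = mirror.api_calls
```

Replaying the same event leaves `api_calls` unchanged, which is the property that makes it safe to follow dependencies recursively from every incoming event.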
relational database
NoSQL database as cache
collections: repositories, users, organizations, issues
API paths: /users/:user, /user/repos, /repos/:user/:repo/issues, /orgs/:org
{
  "type": "User",
  "public_gists": 0,
  "login": "gousiosg",
  "followers": 8,
  "name": "Georgios Gousios",
  "public_repos": 4,
  "created_at": ...,
  "id": 386172,
  "following": 4
}
{ . . .
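One way to read "NoSQL database as cache" is a cache-aside lookup: keep the raw JSON response per collection and key, and call the API only on a miss. The sketch below assumes that reading. `RawCache` and `fake_fetch_user` are hypothetical names, an in-memory dict stands in for the document store, and the sample fields are the ones shown on the slide (the elided `created_at` value is left out).

```python
import json
from typing import Callable, Dict, Tuple


class RawCache:
    """In-memory stand-in for a document store: one raw JSON string per (collection, key)."""

    def __init__(self) -> None:
        self._docs: Dict[Tuple[str, str], str] = {}
        self.misses = 0  # how many times the fetch callback had to run

    def get_or_fetch(self, collection: str, key: str,
                     fetch: Callable[[], dict]) -> dict:
        slot = (collection, key)
        if slot not in self._docs:
            # Cache miss: fetch once and keep the raw response verbatim.
            self.misses += 1
            self._docs[slot] = json.dumps(fetch())
        return json.loads(self._docs[slot])


def fake_fetch_user() -> dict:
    # Hypothetical API response, using field values shown on the slide.
    return {
        "type": "User",
        "public_gists": 0,
        "login": "gousiosg",
        "followers": 8,
        "name": "Georgios Gousios",
        "public_repos": 4,
        "id": 386172,
        "following": 4,
    }


cache = RawCache()
first = cache.get_or_fetch("users", "gousiosg", fake_fetch_user)
second = cache.get_or_fetch("users", "gousiosg", fake_fetch_user)  # served from cache
```

Storing the response unmodified means the relational schema can be rebuilt or extended later without calling the API again.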
periodic dumps of DBs online
query DBs online
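To give a feel for what querying the relational data might look like, here is a small sketch. The two-table schema (`users`, `commits`) and the sample rows are invented for illustration and are not the GHTorrent schema; SQLite from the Python standard library is used in place of MySQL so the example runs without a server.

```python
import sqlite3

# In-memory database with a hypothetical, heavily simplified schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (id INTEGER PRIMARY KEY, login TEXT NOT NULL);
    CREATE TABLE commits (
        sha TEXT PRIMARY KEY,
        author_id INTEGER REFERENCES users(id)
    );
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO commits VALUES ('a1', 1), ('a2', 1), ('b1', 2);
    """
)

# Count commits per user, most active first.
rows = conn.execute(
    """
    SELECT u.login, COUNT(c.sha) AS n_commits
    FROM users u
    LEFT JOIN commits c ON c.author_id = u.id
    GROUP BY u.login
    ORDER BY n_commits DESC, u.login
    """
).fetchall()
conn.close()
```

Questions of this kind (who committed how much, and to what) become a single join once the data is in relational form.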
GHTorrent facts
• 2.0 TB in MongoDB, 40GB in MySQL
• 1 developer
• 3 papers
• advertised on social media
• 1.5 years
5 external users, 3 external papers, MSR14 challenge dataset
why the difference?
but GitHub is hot! so was SourceForge, GNOME, KDE etc.
the GitHub Archive project offers a subset of the data in an easier-to-query format
                      SQO-OSS                          GHTorrent
Tools                 pluggable platform               ruby library and cmd-line tools
Data                  initially none, then raw repos   raw + processed
Documentation         mostly programming               mostly data formats
Released              after 2 years                    immediately
Attracted first user  2 years                          2 months
open source research
aim for lean and mean
infrastructures and platforms are overrated
open now trumps open when it’s done