A tale of two datasets
Georgios Gousios
September 25, 2013
Presentation given at the 2013 ICSM panel on Open Access
Transcript
A tale of two datasets Georgios Gousios TU Delft
open access
Software Quality Observatory for OSS
50k LOC!
750 OSS repositories, SVN, bugs, emails
1.5GB processed data
dump
demo.sqo-oss.org
SQO-OSS facts
• 6 partners
• ~30 publications
• 4 PhDs funded
• press releases on project releases
• rated excellent by the EC
1 external user
2 external publications
GHTorrent
mirror event stream
<<event>> PushEvent
<<api>> /users/:user  ensure_user
<<api>> /repos/:user/:repo/  ensure_repo
<<api>> /repos/:user/:repo/commits  ensure_commits
ensure_user
<<api>> /:user/:repo/sha  ensure_commit
ensure_user
<<api>> /users/:user/followers  ensure_followers
<<api>> /repos/:user/:repo/commits/:sha/comments  ensure_commit_comments
<<api>> /users/:user/orgs  ensure_orgs
<<api>> /orgs/:org/teams  ensure_teams
recursive dependency retrieval
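The recursive dependency retrieval on this slide can be illustrated with a small Python sketch. It is not GHTorrent's actual code (the slides describe a Ruby library): `FAKE_API`, `store`, `api_calls`, `ensure` and `handle_push_event` are hypothetical names, the API is replaced by an in-memory dictionary, and the dependency order (a commit needs its repository, a repository needs its owner) is an assumption read off the diagram. Only the `ensure_*` naming comes from the slide.

```python
from typing import Dict, List

# Hypothetical in-memory stand-in for the remote API: path -> JSON-like payload.
FAKE_API: Dict[str, dict] = {
    "/users/alice": {"login": "alice"},
    "/repos/alice/demo": {"name": "demo", "owner": "alice"},
    "/repos/alice/demo/commits/abc123": {"sha": "abc123", "author": "alice"},
}

store: Dict[str, dict] = {}  # entities already retrieved, keyed by API path
api_calls: List[str] = []    # log of the paths that were actually requested


def ensure(path: str) -> dict:
    """Return the entity at `path`, requesting it only if it is not stored yet."""
    if path not in store:
        api_calls.append(path)
        store[path] = FAKE_API[path]
    return store[path]


def ensure_user(user: str) -> dict:
    return ensure(f"/users/{user}")


def ensure_repo(user: str, repo: str) -> dict:
    ensure_user(user)  # assumed dependency: a repository needs its owner
    return ensure(f"/repos/{user}/{repo}")


def ensure_commit(user: str, repo: str, sha: str) -> dict:
    ensure_repo(user, repo)  # assumed dependency: a commit needs its repository
    commit = ensure(f"/repos/{user}/{repo}/commits/{sha}")
    ensure_user(commit["author"])  # assumed dependency: a commit needs its author
    return commit


def handle_push_event(event: dict) -> None:
    """Resolve every commit named in a (simplified, hypothetical) push event."""
    for sha in event["shas"]:
        ensure_commit(event["user"], event["repo"], sha)
```

Calling `handle_push_event({"user": "alice", "repo": "demo", "shas": ["abc123"]})` requests the user, the repository and the commit once each; a second call with the same event requests nothing, because every `ensure_*` step finds its entity already stored.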
relational database
repositories users organizations issues
/users/:user
/user/repos
/repos/:user/:repo/issues
/orgs/:org
{
  "type": "User",
  "public_gists": 0,
  "login": "gousiosg",
  "followers": 8,
  "name": "Georgios Gousios",
  "public_repos": 4,
  "created_at": ...,
  "id": 386172,
  "following": 4,
}
{ . . .
NoSQL database as cache
periodic dumps of DBs online
query DBs online
GHTorrent facts
• 2.0 TB in MongoDB, 40GB in MySQL
• 1 developer
• 3 papers
• advertised on social media
• 1.5 years
5 external users
3 external papers
MSR14 challenge dataset
why the difference?
but GitHub is hot!
so were SourceForge, GNOME, KDE etc.
the GitHub Archive project offers a subset of the data in an easier-to-query format
                      SQO-OSS                          GHTorrent
Tools                 pluggable platform               ruby library and cmd-line tools
Data                  initially none, then raw repos   raw + processed
Documentation         mostly programming               mostly data formats
Released              after 2 years                    immediately
Attracted first user  2 years                          2 months
open source research
aim for lean and mean
infrastructures and platforms are overrated
open now trumps open when it’s done