Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
A tale of two datasets
Search
Georgios Gousios
September 25, 2013
Technology
0
550
A tale of two datasets
Presentation given at the 2013 ICSM panel on Open Access
Georgios Gousios
September 25, 2013
Tweet
Share
More Decks by Georgios Gousios
See All by Georgios Gousios
NLP + SE = ❤️
gousiosg
0
320
The troubles of modern dependency management and what to do about them
gousiosg
0
640
Mining Repositories with Apache Spark
gousiosg
0
700
My adventures with open everything
gousiosg
0
340
Structure and Evolution of Package Dependency Networks
gousiosg
0
840
Mining Github for fun and profit
gousiosg
9
63k
GitHub Insights: Understanding Open Source
gousiosg
0
420
Work Practices and Challenges in Pull-Based Development: The Contributor’s Perspective
gousiosg
0
960
Big Data in Software Engineering panel and Privacy: Should we care?
gousiosg
0
330
Other Decks in Technology
See All in Technology
Amazon S3 Vectorsを使って資格勉強用AIエージェントを構築してみた
usanchuu
3
450
AzureでのIaC - Bicep? Terraform? それ早く言ってよ会議
torumakabe
1
590
ブロックテーマでサイトをリニューアルした話 / 2026-01-31 Kansai WordPress Meetup
torounit
0
470
We Built for Predictability; The Workloads Didn’t Care
stahnma
0
140
SREチームをどう作り、どう育てるか ― Findy横断SREのマネジメント
rvirus0817
0
320
SchooでVue.js/Nuxtを技術選定している理由
yamanoku
3
110
SRE Enabling戦記 - 急成長する組織にSREを浸透させる戦いの歴史
markie1009
0
130
セキュリティについて学ぶ会 / 2026 01 25 Takamatsu WordPress Meetup
rocketmartue
1
310
レガシー共有バッチ基盤への挑戦 - SREドリブンなリアーキテクチャリングの取り組み
tatsukoni
0
220
会社紹介資料 / Sansan Company Profile
sansan33
PRO
15
400k
ファインディの横断SREがTakumi byGMOと取り組む、セキュリティと開発スピードの両立
rvirus0817
1
1.5k
StrandsとNeptuneを使ってナレッジグラフを構築する
yakumo
1
120
Featured
See All Featured
The Organizational Zoo: Understanding Human Behavior Agility Through Metaphoric Constructive Conversations (based on the works of Arthur Shelley, Ph.D)
kimpetersen
PRO
0
240
Keith and Marios Guide to Fast Websites
keithpitt
413
23k
Facilitating Awesome Meetings
lara
57
6.8k
Believing is Seeing
oripsolob
1
56
Evolving SEO for Evolving Search Engines
ryanjones
0
130
Designing Dashboards & Data Visualisations in Web Apps
destraynor
231
54k
Exploring the relationship between traditional SERPs and Gen AI search
raygrieselhuber
PRO
2
3.6k
How to Get Subject Matter Experts Bought In and Actively Contributing to SEO & PR Initiatives.
livdayseo
0
67
Chrome DevTools: State of the Union 2024 - Debugging React & Beyond
addyosmani
10
1.1k
Technical Leadership for Architectural Decision Making
baasie
2
250
I Don’t Have Time: Getting Over the Fear to Launch Your Podcast
jcasabona
34
2.6k
XXLCSS - How to scale CSS and keep your sanity
sugarenia
249
1.3M
Transcript
A tale of two datasets Georgios Gousios TU Delft
open access
None
None
None
Software Quality Observatory for OSS
None
50k LOC!
750 OSS repositories, SVN, bugs, emails 1.5GB processed data
dump
demo.sqo-oss.org
SQO-OSS facts • 6 partners • ~30 publications • 4
PhDs funded • press releases on project releases • rated excellent by the EC
1 external user 2 external publications
None
GHTorrent
GHTorrent
mirror event stream
<<event>> PushEvent <<api>> /users/:user ensure_user <<api>> /repos/:user/:repo/ ensure_repo <<api>> /repos/:user/:repo/commits
ensure_commits ensure_user <<api>> /:user/:repo/sha ensure_commit ensure_user <<api>> /users/:user/ followers ensure_followers <<api>> /repos/:user/:repo/ commits/:sha/comments ensure_commit_comments <<api>> /users/:user/orgs ensure_orgs <<api>> /orgs/:org/teams ensure_teams recursive dependency retrieval
relational database
repositories users organizations issues /users/:user /user/repos /repos/:user/:repo/issues /orgs/:org {
"type": "User", "public_gists": 0, "login": "gousiosg", "followers": 8, "name": "Georgios Gousios", "public_repos": 4, "created_at": ..., "id": 386172, "following": 4, } { . . . NoSQL database as cache
periodic dumps of DBs online
query DBs online
GHTorrent facts • 2.0 TB in MongoDB, 40GB in MySQL
• 1 developer • 3 papers • advertised on social media • 1.5 years
5 external users 3 external papers MSR14 challenge dataset
why the difference?
but Github is hot!
but Github is hot!
but Github is hot! so was SourceForge, Gnome, KDE etc
but Github is hot! so was SourceForge, Gnome, KDE etc
the Github Archive project offers a subset of the data in an easier to query format
SQO-OSS GHTorrent Tools pluggable platform ruby library and cmd-line tools
Data initially none, then raw repos raw + processed Documentation mostly programming mostly data formats Released after 2 years immediately Attracted first user 2 years 2 months
open source research
aim for lean and mean
infrastructures and platforms are overrated
open now trumps open when it’s done