Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
A tale of two datasets
Search
Georgios Gousios
September 25, 2013
Technology
580
0
Share
Embed
Copy iframe code
Copy JS code
Copy link
Start on current slide
A tale of two datasets
Presentation given at the 2013 ICSM panel on Open Access
Georgios Gousios
September 25, 2013
More Decks by Georgios Gousios
See All by Georgios Gousios
NLP + SE = ❤️
gousiosg
0
370
The troubles of modern dependency management and what to do about them
gousiosg
0
700
Mining Repositories with Apache Spark
gousiosg
0
730
My adventures with open everything
gousiosg
0
360
Structure and Evolution of Package Dependency Networks
gousiosg
0
920
Mining Github for fun and profit
gousiosg
9
63k
GitHub Insights: Understanding Open Source
gousiosg
0
450
Work Practices and Challenges in Pull-Based Development: The Contributor’s Perspective
gousiosg
0
990
Big Data in Software Engineering panel and Privacy: Should we care?
gousiosg
0
350
Other Decks in Technology
See All in Technology
AI Testing Talks: Challenges of Applying AI in Software Testing: From Hype to Practical Use
exactpro
PRO
1
140
Oracle AI Database@Google Cloud:サービス概要のご紹介
oracle4engineer
PRO
6
1.5k
サイバーセキュリティ概論 / Introduction to Cybersecurity
ks91
PRO
0
170
製造業のクラウド活用最適解〜AI,DXを加速するデータ基盤の作り方〜
hamadakoji
0
400
GoとSIMDとWasmの今。
askua
3
510
LLMと共に進化するプロセスを目指して
ymatsuwitter
12
3.5k
探して_入れて_作って_使う_Agent_Skills___LT.pdf
peintangos
2
180
【Gen-AX】20260530開催_JJUG CCC 2026 Spring
genax
0
430
AI と創る新たな世界 / A New World Created with AI
ks91
PRO
0
120
Mastering Ruby Box
tagomoris
3
150
AI Adaptable なテストを整える工夫 / Ways to Make Your Tests AI-Adaptable
bitkey
PRO
3
220
トークン数だけでは測れない — Claude Code 組織展開の効果検証から学んだこと
makikub
0
130
Featured
See All Featured
A Modern Web Designer's Workflow
chriscoyier
698
190k
State of Search Keynote: SEO is Dead Long Live SEO
ryanjones
0
200
A brief & incomplete history of UX Design for the World Wide Web: 1989–2019
jct
2
390
How to audit for AI Accessibility on your Front & Back End
davetheseo
0
400
The Spectacular Lies of Maps
axbom
PRO
1
790
How to Grow Your eCommerce with AI & Automation
katarinadahlin
PRO
1
200
Accessibility Awareness
sabderemane
1
130
Mobile First: as difficult as doing things right
swwweet
225
10k
Facilitating Awesome Meetings
lara
57
6.9k
Evolution of real-time – Irina Nazarova, EuRuKo, 2024
irinanazarova
9
1.4k
Navigating the Design Leadership Dip - Product Design Week Design Leaders+ Conference 2024
apolaine
1
340
AI Search: Implications for SEO and How to Move Forward - #ShenzhenSEOConference
aleyda
1
1.3k
Transcript
A tale of two datasets Georgios Gousios TU Delft
open access
None
None
None
Software Quality Observatory for OSS
None
50k LOC!
750 OSS repositories, SVN, bugs, emails 1.5GB processed data
dump
demo.sqo-oss.org
SQO-OSS facts • 6 partners • ~30 publications • 4
PhDs funded • press releases on project releases • rated excellent by the EC
1 external user 2 external publications
None
GHTorrent
GHTorrent
mirror event stream
<<event>> PushEvent <<api>> /users/:user ensure_user <<api>> /repos/:user/:repo/ ensure_repo <<api>> /repos/:user/:repo/commits
ensure_commits ensure_user <<api>> /:user/:repo/sha ensure_commit ensure_user <<api>> /users/:user/ followers ensure_followers <<api>> /repos/:user/:repo/ commits/:sha/comments ensure_commit_comments <<api>> /users/:user/orgs ensure_orgs <<api>> /orgs/:org/teams ensure_teams recursive dependency retrieval
relational database
repositories users organizations issues /users/:user /user/repos /repos/:user/:repo/issues /orgs/:org {
"type": "User", "public_gists": 0, "login": "gousiosg", "followers": 8, "name": "Georgios Gousios", "public_repos": 4, "created_at": ..., "id": 386172, "following": 4, } { . . . NoSQL database as cache
periodic dumps of DBs online
query DBs online
GHTorrent facts • 2.0 TB in MongoDB, 40GB in MySQL
• 1 developer • 3 papers • advertised on social media • 1.5 years
5 external users 3 external papers MSR14 challenge dataset
why the difference?
but Github is hot!
but Github is hot!
but Github is hot! so was SourceForge, Gnome, KDE etc
but Github is hot! so was SourceForge, Gnome, KDE etc
the Github Archive project offers a subset of the data in an easier to query format
SQO-OSS GHTorrent Tools pluggable platform ruby library and cmd-line tools
Data initially none, then raw repos raw + processed Documentation mostly programming mostly data formats Released after 2 years immediately Attracted first user 2 years 2 months
open source research
aim for lean and mean
infrastructures and platforms are overrated
open now trumps open when it’s done