A tale of two datasets
Georgios Gousios
September 25, 2013
Presentation given at the 2013 ICSM panel on Open Access
Transcript
A tale of two datasets
Georgios Gousios, TU Delft
open access
Software Quality Observatory for OSS
50k LOC!
750 OSS repositories: SVN, bugs, emails
1.5GB processed data
dump
demo.sqo-oss.org
SQO-OSS facts
• 6 partners
• ~30 publications
• 4 PhDs funded
• press releases on project releases
• rated excellent by the EC
1 external user, 2 external publications
GHTorrent
mirror event stream
<<event>> PushEvent
<<api>> /users/:user    ensure_user
<<api>> /repos/:user/:repo/    ensure_repo
<<api>> /repos/:user/:repo/commits    ensure_commits
ensure_user
<<api>> /:user/:repo/sha    ensure_commit
ensure_user
<<api>> /users/:user/followers    ensure_followers
<<api>> /repos/:user/:repo/commits/:sha/comments    ensure_commit_comments
<<api>> /users/:user/orgs    ensure_orgs
<<api>> /orgs/:org/teams    ensure_teams
recursive dependency retrieval
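The recursive dependency retrieval on this slide can be illustrated with a minimal Python sketch. It is not GHTorrent's actual code (which is a Ruby library): `FAKE_API`, `api_get`, `db`, `handle_push_event`, and the event layout are all hypothetical stand-ins, and the endpoint paths are only loosely modelled on the ones named above. The idea shown is that each `ensure_*` function first checks the store, fetches the entity only if it is missing, and ensures the entities it depends on along the way.

```python
# Hypothetical canned responses standing in for the GitHub API.
FAKE_API = {
    "/users/alice": {"login": "alice"},
    "/repos/alice/demo": {"name": "demo", "owner": "alice"},
    "/repos/alice/demo/commits/abc123": {"sha": "abc123", "author": "alice"},
}

api_calls = []  # every simulated API request, in order


def api_get(path):
    """Pretend to call the API and record the request."""
    api_calls.append(path)
    return FAKE_API[path]


# Dict-based stand-in for the relational database.
db = {"users": {}, "repos": {}, "commits": {}}


def ensure_user(login):
    """Fetch and store the user only if it is not stored yet."""
    if login not in db["users"]:
        db["users"][login] = api_get(f"/users/{login}")
    return db["users"][login]


def ensure_repo(owner, name):
    """A repository depends on its owner, so ensure the owner first."""
    key = (owner, name)
    if key not in db["repos"]:
        ensure_user(owner)
        db["repos"][key] = api_get(f"/repos/{owner}/{name}")
    return db["repos"][key]


def ensure_commit(owner, name, sha):
    """A commit depends on its repository and on its author."""
    key = (owner, name, sha)
    if key not in db["commits"]:
        ensure_repo(owner, name)
        commit = api_get(f"/repos/{owner}/{name}/commits/{sha}")
        ensure_user(commit["author"])
        db["commits"][key] = commit
    return db["commits"][key]


def handle_push_event(event):
    """Resolve everything a (simplified, hypothetical) push event refers to."""
    for sha in event["shas"]:
        ensure_commit(event["owner"], event["repo"], sha)


event = {"owner": "alice", "repo": "demo", "shas": ["abc123"]}
handle_push_event(event)
handle_push_event(event)  # second delivery triggers no further API requests
```

Because every `ensure_*` call is idempotent, events can arrive in any order or more than once without producing duplicate rows or redundant requests.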
relational database
repositories, users, organizations, issues
/users/:user, /user/repos, /repos/:user/:repo/issues, /orgs/:org
{
  "type": "User",
  "public_gists": 0,
  "login": "gousiosg",
  "followers": 8,
  "name": "Georgios Gousios",
  "public_repos": 4,
  "created_at": ...,
  "id": 386172,
  "following": 4,
}
{ . . .
NoSQL database as cache
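One way to read "NoSQL database as cache" is as a cache-aside lookup: raw API responses are stored as documents keyed by their endpoint, and the API is contacted only on a miss. The sketch below assumes that reading. `RawCache` and `fake_fetch` are hypothetical names, a plain dict stands in for the document store, and the sample response reuses a few fields from the slide.

```python
import json


class RawCache:
    """Dict-backed stand-in for a document store holding raw API responses."""

    def __init__(self, fetch):
        self._fetch = fetch   # function called on a cache miss
        self._docs = {}       # endpoint path -> serialized JSON document
        self.misses = 0

    def get(self, path):
        """Return the document for `path`, fetching it only if not cached."""
        if path not in self._docs:
            self.misses += 1
            self._docs[path] = json.dumps(self._fetch(path))
        return json.loads(self._docs[path])


def fake_fetch(path):
    """Hypothetical API call returning a trimmed-down user document."""
    return {"type": "User", "login": "gousiosg", "id": 386172}


cache = RawCache(fake_fetch)
first = cache.get("/users/gousiosg")   # miss: goes to the (fake) API
second = cache.get("/users/gousiosg")  # hit: served from the stored document
```

Keeping the unprocessed responses means the relational view can be rebuilt or extended later without requesting the same data again.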
periodic dumps of DBs online
query DBs online
GHTorrent facts
• 2.0 TB in MongoDB, 40GB in MySQL
• 1 developer
• 3 papers
• advertised on social media
• 1.5 years
5 external users, 3 external papers, MSR14 challenge dataset
why the difference?
but GitHub is hot!
so were SourceForge, GNOME, KDE, etc.
the GitHub Archive project offers a subset of the data in an easier-to-query format
                       SQO-OSS                          GHTorrent
Tools                  pluggable platform               Ruby library and cmd-line tools
Data                   initially none, then raw repos   raw + processed
Documentation          mostly programming               mostly data formats
Released               after 2 years                    immediately
Attracted first user   2 years                          2 months
open source research
aim for lean and mean
infrastructures and platforms are overrated
open now trumps open when it’s done