A tale of two datasets
Georgios Gousios
September 25, 2013
Presentation given at the 2013 ICSM panel on Open Access
Transcript
A tale of two datasets
Georgios Gousios
TU Delft
open access
Software Quality Observatory for OSS
50k LOC!
750 OSS repositories, SVN, bugs, emails
1.5 GB processed data
dump
demo.sqo-oss.org
SQO-OSS facts
• 6 partners
• ~30 publications
• 4 PhDs funded
• press releases on project releases
• rated excellent by the EC
1 external user
2 external publications
GHTorrent
mirror event stream
<<event>> PushEvent
<<api>> /users/:user  ensure_user
<<api>> /repos/:user/:repo/  ensure_repo
<<api>> /repos/:user/:repo/commits  ensure_commits
ensure_user
<<api>> /:user/:repo/sha  ensure_commit
ensure_user
<<api>> /users/:user/followers  ensure_followers
<<api>> /repos/:user/:repo/commits/:sha/comments  ensure_commit_comments
<<api>> /users/:user/orgs  ensure_orgs
<<api>> /orgs/:org/teams  ensure_teams
recursive dependency retrieval
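The "recursive dependency retrieval" idea on this slide can be sketched as follows. This is a minimal illustration, not GHTorrent's actual code (which is a Ruby library): the `Mirror` class, the event shape, the `FAKE_API` table, and the commit path `/repos/:user/:repo/commits/:sha` are assumptions made for the example. Each `ensure_*` call first makes sure the entities it depends on exist, and only calls the API when the entity is not already stored.

```python
# Hypothetical stand-in for the GitHub API: path -> JSON-like dict.
FAKE_API = {
    "/users/alice": {"login": "alice"},
    "/repos/alice/demo": {"name": "demo", "owner": "alice"},
    "/repos/alice/demo/commits/abc123": {"sha": "abc123", "author": "alice"},
}


def fake_fetch(path):
    """Pretend API call; returns a copy of the canned response."""
    return dict(FAKE_API[path])


class Mirror:
    def __init__(self, fetch):
        self.fetch = fetch
        self.store = {}      # entities already retrieved, keyed by tuple
        self.api_calls = []  # paths actually requested, for inspection

    def _ensure(self, key, path):
        # Only hit the API if the entity has not been stored yet.
        if key not in self.store:
            self.api_calls.append(path)
            self.store[key] = self.fetch(path)
        return self.store[key]

    def ensure_user(self, user):
        return self._ensure(("user", user), f"/users/{user}")

    def ensure_repo(self, user, repo):
        self.ensure_user(user)  # a repo depends on its owner
        return self._ensure(("repo", user, repo), f"/repos/{user}/{repo}")

    def ensure_commit(self, user, repo, sha):
        self.ensure_repo(user, repo)  # a commit depends on its repo
        commit = self._ensure(
            ("commit", user, repo, sha),
            f"/repos/{user}/{repo}/commits/{sha}",
        )
        self.ensure_user(commit["author"])  # ...and on its author
        return commit

    def on_push_event(self, event):
        # Assumed event shape: {"user": ..., "repo": ..., "shas": [...]}
        for sha in event["shas"]:
            self.ensure_commit(event["user"], event["repo"], sha)


mirror = Mirror(fake_fetch)
push = {"user": "alice", "repo": "demo", "shas": ["abc123"]}
mirror.on_push_event(push)
mirror.on_push_event(push)  # second pass finds everything already stored
```

Because every `ensure_*` call is idempotent, replaying the same event costs no extra API requests, which matters when the upstream API is rate limited.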
relational database
repositories, users, organizations, issues
/users/:user  /user/repos  /repos/:user/:repo/issues  /orgs/:org
{
  "type": "User",
  "public_gists": 0,
  "login": "gousiosg",
  "followers": 8,
  "name": "Georgios Gousios",
  "public_repos": 4,
  "created_at": ...,
  "id": 386172,
  "following": 4
}
{ . . .
NoSQL database as cache
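A rough sketch of the "NoSQL database as cache" arrangement: raw API responses are kept verbatim, keyed by request path, and selected fields are copied into a relational table for querying. Here a plain dict stands in for the document store and an in-memory SQLite database for the relational one; both choices, the `users` table layout, and the function names are assumptions for illustration. The sample document reuses the fields shown on the slide, minus the elided `created_at`.

```python
import json
import sqlite3

raw_cache = {}    # stand-in for the NoSQL store: path -> raw JSON text
fetch_count = 0   # how many times the (fake) API was actually called


def fake_fetch(path):
    """Pretend API call returning the raw JSON text for a user."""
    global fetch_count
    fetch_count += 1
    return json.dumps({
        "type": "User", "public_gists": 0, "login": "gousiosg",
        "followers": 8, "name": "Georgios Gousios", "public_repos": 4,
        "id": 386172, "following": 4,
    })


def cached_get(path, fetch=fake_fetch):
    # Serve from the raw cache when possible; otherwise fetch and keep it.
    if path not in raw_cache:
        raw_cache[path] = fetch(path)
    return json.loads(raw_cache[path])


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, login TEXT, name TEXT)")


def store_user(path):
    # Copy only the fields of interest into the relational schema.
    doc = cached_get(path)
    db.execute(
        "INSERT OR REPLACE INTO users (id, login, name) VALUES (?, ?, ?)",
        (doc["id"], doc["login"], doc["name"]),
    )
    return doc


store_user("/users/gousiosg")
store_user("/users/gousiosg")  # served from the cache, no second fetch
row = db.execute(
    "SELECT login, name FROM users WHERE id = ?", (386172,)
).fetchone()
```

Keeping the raw responses means the relational schema can be rebuilt or extended later without asking the API again.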
periodic dumps of DBs online
query DBs online
GHTorrent facts
• 2.0 TB in MongoDB, 40 GB in MySQL
• 1 developer
• 3 papers
• advertised on social media
• 1.5 years
5 external users
3 external papers
MSR14 challenge dataset
why the difference?
but GitHub is hot!
so were SourceForge, GNOME, KDE, etc.
the GitHub Archive project offers a subset of the data in an easier-to-query format
                      SQO-OSS                         GHTorrent
Tools                 pluggable platform              Ruby library and cmd-line tools
Data                  initially none, then raw repos  raw + processed
Documentation         mostly programming              mostly data formats
Released              after 2 years                   immediately
Attracted first user  2 years                         2 months
open source research
aim for lean and mean
infrastructures and platforms are overrated
open now trumps open when it’s done