Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
A tale of two datasets
Search
Georgios Gousios
September 25, 2013
Technology
0
490
A tale of two datasets
Presentation given at the 2013 ICSM panel on Open Access
Georgios Gousios
September 25, 2013
Tweet
Share
More Decks by Georgios Gousios
See All by Georgios Gousios
NLP + SE = ❤️
gousiosg
0
280
The troubles of modern dependency management and what to do about them
gousiosg
0
520
Mining Repositories with Apache Spark
gousiosg
0
640
My adventures with open everything
gousiosg
0
290
Structure and Evolution of Package Dependency Networks
gousiosg
0
760
Mining Github for fun and profit
gousiosg
9
63k
GitHub Insights: Understanding Open Source
gousiosg
0
360
Work Practices and Challenges in Pull-Based Development: The Contributor’s Perspective
gousiosg
0
910
Big Data in Software Engineering panel and Privacy: Should we care?
gousiosg
0
280
Other Decks in Technology
See All in Technology
Clineを含めたAIエージェントを 大規模組織に導入し、投資対効果を考える / Introducing AI agents into your organization
i35_267
4
1.4k
変化する開発、進化する体系時代に適応するソフトウェアエンジニアの知識と考え方(JaSST'25 Kansai)
mizunori
0
140
Agentic Workflowという選択肢を考える
tkikuchi1002
1
390
Uniadex__公開版_20250617-AIxIoTビジネス共創ラボ_ツナガルチカラ_.pdf
iotcomjpadmin
0
150
第9回情シス転職ミートアップ_テックタッチ株式会社
forester3003
0
140
Oracle Cloud Infrastructure:2025年6月度サービス・アップデート
oracle4engineer
PRO
1
140
キャディでのApache Iceberg, Trino採用事例 -Apache Iceberg and Trino Usecase in CADDi--
caddi_eng
0
170
AI技術トレンド勉強会 #1MCPの基礎と実務での応用
nisei_k
1
240
Claude Code どこまでも/ Claude Code Everywhere
nwiizo
53
32k
Observability infrastructure behind the trillion-messages scale Kafka platform
lycorptech_jp
PRO
0
130
Workflows から Agents へ ~ 生成 AI アプリの成長過程とアプローチ~
belongadmin
3
170
Navigation3でViewModelにデータを渡す方法
mikanichinose
0
210
Featured
See All Featured
The Cost Of JavaScript in 2023
addyosmani
51
8.4k
Building a Modern Day E-commerce SEO Strategy
aleyda
41
7.3k
The Art of Delivering Value - GDevCon NA Keynote
reverentgeek
15
1.5k
Sharpening the Axe: The Primacy of Toolmaking
bcantrill
44
2.4k
Learning to Love Humans: Emotional Interface Design
aarron
273
40k
RailsConf & Balkan Ruby 2019: The Past, Present, and Future of Rails at GitHub
eileencodes
137
34k
"I'm Feeling Lucky" - Building Great Search Experiences for Today's Users (#IAC19)
danielanewman
228
22k
Optimising Largest Contentful Paint
csswizardry
37
3.3k
For a Future-Friendly Web
brad_frost
179
9.8k
[RailsConf 2023 Opening Keynote] The Magic of Rails
eileencodes
29
9.5k
Build The Right Thing And Hit Your Dates
maggiecrowley
36
2.8k
YesSQL, Process and Tooling at Scale
rocio
173
14k
Transcript
A tale of two datasets Georgios Gousios TU Delft
open access
None
None
None
Software Quality Observatory for OSS
None
50k LOC!
750 OSS repositories, SVN, bugs, emails 1.5GB processed data
dump
demo.sqo-oss.org
SQO-OSS facts • 6 partners • ~30 publications • 4
PhDs funded • press releases on project releases • rated excellent by the EC
1 external user 2 external publications
None
GHTorrent
GHTorrent
mirror event stream
<<event>> PushEvent <<api>> /users/:user ensure_user <<api>> /repos/:user/:repo/ ensure_repo <<api>> /repos/:user/:repo/commits
ensure_commits ensure_user <<api>> /:user/:repo/sha ensure_commit ensure_user <<api>> /users/:user/ followers ensure_followers <<api>> /repos/:user/:repo/ commits/:sha/comments ensure_commit_comments <<api>> /users/:user/orgs ensure_orgs <<api>> /orgs/:org/teams ensure_teams recursive dependency retrieval
relational database
repositories users organizations issues /users/:user /user/repos /repos/:user/:repo/issues /orgs/:org {
"type": "User", "public_gists": 0, "login": "gousiosg", "followers": 8, "name": "Georgios Gousios", "public_repos": 4, "created_at": ..., "id": 386172, "following": 4, } { . . . NoSQL database as cache
periodic dumps of DBs online
query DBs online
GHTorrent facts • 2.0 TB in MongoDB, 40GB in MySQL
• 1 developer • 3 papers • advertised on social media • 1.5 years
5 external users 3 external papers MSR14 challenge dataset
why the difference?
but Github is hot!
but Github is hot!
but Github is hot! so was SourceForge, Gnome, KDE etc
but Github is hot! so was SourceForge, Gnome, KDE etc
the Github Archive project offers a subset of the data in an easier to query format
SQO-OSS GHTorrent Tools pluggable platform ruby library and cmd-line tools
Data initially none, then raw repos raw + processed Documentation mostly programming mostly data formats Released after 2 years immediately Attracted first user 2 years 2 months
open source research
aim for lean and mean
infrastructures and platforms are overrated
open now trumps open when it’s done