A tale of two datasets
Georgios Gousios
September 25, 2013
Presentation given at the 2013 ICSM panel on Open Access
Transcript
A tale of two datasets
Georgios Gousios, TU Delft
open access
Software Quality Observatory for OSS
50k LOC!
750 OSS repositories: SVN, bugs, emails
1.5GB processed data
dump
demo.sqo-oss.org
SQO-OSS facts
• 6 partners
• ~30 publications
• 4 PhDs funded
• press releases on project releases
• rated excellent by the EC
1 external user
2 external publications
GHTorrent
mirror event stream
recursive dependency retrieval
<<event>> PushEvent
<<api>> /users/:user    ensure_user
<<api>> /repos/:user/:repo/    ensure_repo
<<api>> /repos/:user/:repo/commits    ensure_commits
ensure_user
<<api>> /:user/:repo/sha    ensure_commit
ensure_user
<<api>> /users/:user/followers    ensure_followers
<<api>> /repos/:user/:repo/commits/:sha/comments    ensure_commit_comments
<<api>> /users/:user/orgs    ensure_orgs
<<api>> /orgs/:org/teams    ensure_teams
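The diagram above suggests that handling one event pulls in every entity it depends on, each through its own ensure_* step. A minimal Python sketch of that idea follows. It is not GHTorrent's code (the slides describe a Ruby library): the in-memory FAKE_API, the store dictionary, the example paths, and the event shape are all hypothetical stand-ins, and only the ensure_* names come from the slide.

```python
# Hypothetical stand-in for the remote API: endpoint path -> JSON-like payload.
FAKE_API = {
    "/users/alice": {"login": "alice"},
    "/repos/alice/demo": {"name": "demo", "owner": "alice"},
    "/repos/alice/demo/commits/abc123": {"sha": "abc123", "author": "alice"},
}

store = {}      # stand-in for the database: path -> stored entity
api_calls = []  # every simulated API request, in order


def ensure(path):
    """Return the entity at `path`, requesting it only if it is not stored yet."""
    if path not in store:
        api_calls.append(path)
        store[path] = FAKE_API[path]
    return store[path]


def ensure_user(login):
    return ensure(f"/users/{login}")


def ensure_repo(owner, repo):
    ensure_user(owner)  # assumed dependency: a repo needs its owner first
    return ensure(f"/repos/{owner}/{repo}")


def ensure_commit(owner, repo, sha):
    ensure_repo(owner, repo)  # assumed dependency: a commit needs its repo
    commit = ensure(f"/repos/{owner}/{repo}/commits/{sha}")
    ensure_user(commit["author"])  # and its author
    return commit


def handle_push_event(event):
    """Resolve everything a (hypothetically shaped) push event refers to."""
    for sha in event["shas"]:
        ensure_commit(event["owner"], event["repo"], sha)


handle_push_event({"owner": "alice", "repo": "demo", "shas": ["abc123"]})
print(api_calls)
```

Because every ensure_* step checks the store first, the second ensure_user call for the commit author triggers no further request, and replaying the same event is a no-op.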
relational database
NoSQL database as cache
repositories, users, organizations, issues
/users/:user
/user/repos
/repos/:user/:repo/issues
/orgs/:org
{
  "type": "User",
  "public_gists": 0,
  "login": "gousiosg",
  "followers": 8,
  "name": "Georgios Gousios",
  "public_repos": 4,
  "created_at": ...,
  "id": 386172,
  "following": 4,
}
{ . . .
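The slides name both a relational database and a NoSQL cache of raw API documents, but they do not show how the two are connected. One plausible arrangement, sketched here purely as an assumption, keeps each raw document whole in the cache and projects a few fields into a relational table. The dictionary standing in for the NoSQL store, the users table schema, and the project_user helper are hypothetical; the document fields are the ones shown on the slide, minus created_at, whose value is elided there.

```python
import json
import sqlite3

# Raw user document from the slide; created_at is elided there and omitted here.
raw_user = {
    "type": "User", "public_gists": 0, "login": "gousiosg", "followers": 8,
    "name": "Georgios Gousios", "public_repos": 4, "id": 386172, "following": 4,
}

# Stand-in for a NoSQL collection: the untouched JSON text, keyed by login.
raw_cache = {"users": {raw_user["login"]: json.dumps(raw_user)}}

# Hypothetical relational schema holding only the fields needed for querying.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, login TEXT UNIQUE, name TEXT)"
)


def project_user(login):
    """Copy selected fields of a cached raw document into the users table."""
    doc = json.loads(raw_cache["users"][login])
    db.execute(
        "INSERT OR IGNORE INTO users (id, login, name) VALUES (?, ?, ?)",
        (doc["id"], doc["login"], doc["name"]),
    )
    return doc


project_user("gousiosg")
project_user("gousiosg")  # projecting twice does not duplicate the row
rows = db.execute("SELECT id, login, name FROM users").fetchall()
print(rows)
```

Keeping the raw documents means the relational view can be rebuilt or extended later without querying the API again.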
periodic dumps of DBs online
query DBs online
GHTorrent facts
• 2.0 TB in MongoDB, 40GB in MySQL
• 1 developer
• 3 papers
• advertised on social media
• 1.5 years
5 external users
3 external papers
MSR14 challenge dataset
why the difference?
but GitHub is hot!
so were SourceForge, GNOME, KDE, etc.
the GitHub Archive project offers a subset of the data in an easier-to-query format
                       SQO-OSS                          GHTorrent
Tools                  pluggable platform               ruby library and cmd-line tools
Data                   initially none, then raw repos   raw + processed
Documentation          mostly programming               mostly data formats
Released               after 2 years                    immediately
Attracted first user   2 years                          2 months
open source research
aim for lean and mean
infrastructures and platforms are overrated
open now trumps open when it’s done