Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
A tale of two datasets
Search
Georgios Gousios
September 25, 2013
Technology
0
470
A tale of two datasets
Presentation given at the 2013 ICSM panel on Open Access
Georgios Gousios
September 25, 2013
Tweet
Share
More Decks by Georgios Gousios
See All by Georgios Gousios
NLP + SE = ❤️
gousiosg
0
250
The troubles of modern dependency management and what to do about them
gousiosg
0
480
Mining Repositories with Apache Spark
gousiosg
0
590
My adventures with open everything
gousiosg
0
250
Structure and Evolution of Package Dependency Networks
gousiosg
0
700
Mining Github for fun and profit
gousiosg
9
63k
GitHub Insights: Understanding Open Source
gousiosg
0
330
Work Practices and Challenges in Pull-Based Development: The Contributor’s Perspective
gousiosg
0
880
Big Data in Software Engineering panel and Privacy: Should we care?
gousiosg
0
250
Other Decks in Technology
See All in Technology
Exadata Database Service on Dedicated Infrastructure(ExaDB-D) UI スクリーン・キャプチャ集
oracle4engineer
PRO
2
3.2k
Can We Measure Developer Productivity?
ewolff
1
150
Lambda10周年!Lambdaは何をもたらしたか
smt7174
2
110
Engineer Career Talk
lycorp_recruit_jp
0
160
エンジニア人生の拡張性を高める 「探索型キャリア設計」の提案
tenshoku_draft
1
120
dev 補講: プロダクトセキュリティ / Product security overview
wa6sn
1
2.3k
いざ、BSC討伐の旅
nikinusu
2
780
TanStack Routerに移行するのかい しないのかい、どっちなんだい! / Are you going to migrate to TanStack Router or not? Which one is it?
kaminashi
0
580
強いチームと開発生産性
onk
PRO
34
11k
Shopifyアプリ開発における Shopifyの機能活用
sonatard
4
250
Incident Response Practices: Waroom's Features and Future Challenges
rrreeeyyy
0
160
マルチプロダクトな開発組織で 「開発生産性」に向き合うために試みたこと / Improving Multi-Product Dev Productivity
sugamasao
1
300
Featured
See All Featured
Designing Dashboards & Data Visualisations in Web Apps
destraynor
229
52k
Building Better People: How to give real-time feedback that sticks.
wjessup
364
19k
Cheating the UX When There Is Nothing More to Optimize - PixelPioneers
stephaniewalter
280
13k
How to train your dragon (web standard)
notwaldorf
88
5.7k
The Illustrated Children's Guide to Kubernetes
chrisshort
48
48k
Building a Scalable Design System with Sketch
lauravandoore
459
33k
A Philosophy of Restraint
colly
203
16k
Why Our Code Smells
bkeepers
PRO
334
57k
GraphQLとの向き合い方2022年版
quramy
43
13k
No one is an island. Learnings from fostering a developers community.
thoeni
19
3k
Statistics for Hackers
jakevdp
796
220k
Reflections from 52 weeks, 52 projects
jeffersonlam
346
20k
Transcript
A tale of two datasets Georgios Gousios TU Delft
open access
None
None
None
Software Quality Observatory for OSS
None
50k LOC!
750 OSS repositories, SVN, bugs, emails 1.5GB processed data
dump
demo.sqo-oss.org
SQO-OSS facts • 6 partners • ~30 publications • 4
PhDs funded • press releases on project releases • rated excellent by the EC
1 external user 2 external publications
None
GHTorrent
GHTorrent
mirror event stream
<<event>> PushEvent <<api>> /users/:user ensure_user <<api>> /repos/:user/:repo/ ensure_repo <<api>> /repos/:user/:repo/commits
ensure_commits ensure_user <<api>> /:user/:repo/sha ensure_commit ensure_user <<api>> /users/:user/ followers ensure_followers <<api>> /repos/:user/:repo/ commits/:sha/comments ensure_commit_comments <<api>> /users/:user/orgs ensure_orgs <<api>> /orgs/:org/teams ensure_teams recursive dependency retrieval
relational database
repositories users organizations issues /users/:user /user/repos /repos/:user/:repo/issues /orgs/:org {
"type": "User", "public_gists": 0, "login": "gousiosg", "followers": 8, "name": "Georgios Gousios", "public_repos": 4, "created_at": ..., "id": 386172, "following": 4, } { . . . NoSQL database as cache
periodic dumps of DBs online
query DBs online
GHTorrent facts • 2.0 TB in MongoDB, 40GB in MySQL
• 1 developer • 3 papers • advertised on social media • 1.5 years
5 external users 3 external papers MSR14 challenge dataset
why the difference?
but Github is hot!
but Github is hot!
but Github is hot! so was SourceForge, Gnome, KDE etc
but Github is hot! so was SourceForge, Gnome, KDE etc
the Github Archive project offers a subset of the data in an easier to query format
SQO-OSS GHTorrent Tools pluggable platform ruby library and cmd-line tools
Data initially none, then raw repos raw + processed Documentation mostly programming mostly data formats Released after 2 years immediately Attracted first user 2 years 2 months
open source research
aim for lean and mean
infrastructures and platforms are overrated
open now trumps open when it’s done