ToroDB: All your MongoDB Data Are Belong to SQL

ToroDB: All your MongoDB data are belong to SQL Álvaro
Hernández <[email protected]>

ToroDB About $self and 8Kdata

ToroDB What do Stripe and Buffer have in common? Use
MongoDB for high load OLTP Successful companies They are Web Scale

ToroDB What else do they have in common? Both ETL
to a RDBMS for Analytics

ToroDB MongoDB analytics performance: Github dataset { "id": "2489368070", "type":
"PushEvent", "public": true, "created_at": "2015-01-01T00:00:00Z", "actor": { "id": 9152315, "login": "davidjhulse", "gravatar_id": "", "url": "https://api.github.com/users/davidjhulse", "avatar_url": "https://avatars.githubusercontent.com/u/9152315?" }, "repo": { "id": 28635890, "name": "davidjhulse/davesbingrewardsbot", "url": "https://api.github.com/repos/davidjhulse/davesbingrewardsbot" },

ToroDB MongoDB analytics performance: Github dataset "payload": { "push_id": 536740396,
"size": 1, "distinct_size": 1, "ref": "refs/heads/master", "head": "a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81", "before": "86ffa724b4d70fce46e760f8cc080f5ec3d7d85f", "commits": [ { "sha": "a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81", "author": { "email": "[email protected]", "name": "davidjhulse" }, "message": "Altered BingBot.jar\n\nFixed issue with multiple account", "distinct": true, "url": "https://api.github.com/repos/..." } ] } }

ToroDB MongoDB analytics performance: Github dataset Milliseconds 0 300.000 600.000
900.000 1.200.000 1.500.000 1.800.000 2.100.000 2.400.000 2.700.000 Github dataset size (GB) 1 GB 10 GB 100 GB MongoDB db.github.ﬁnd({ type: 'PushEvent', 'actor.login': 'davidjhulse' });

ToroDB Why is it so slow? • For most aggregated
queries, the whole collection is scanned for every query. • For every document, many keys and offsets are parsed and computed to ﬁnd the possible keys. Worst case: all keys are scanned. • Different documents with different in the same collection. Mixing apples and oranges!

ToroDB How MongoDB performs an aggregated query

What if we had all our data in a RDBMS?

ToroDB How a RDBMS performs an aggregate query

ToroDB Read I/O required to answer the query db.githubarchive.aggregate([ {
$group: { _id: '$actor.login', events: { $sum: 1 } } }, { $sort: { events: -1 }}, { $limit: 10 } ]) SELECT count(*), login FROM actor GROUP BY login ORDER BY 1 DESC LIMIT 10;

ToroDB Read I/O required to answer the query (“iotop -o
-a”) Github Archive: top 10 actors (1,4GB dataset) Disk Read (MB) 0 125 250 375 500 MongoDB PostgreSQL 87,93 536,36 MongoDB storageSize: 536.37 MB MongoDB size: 1410.35 MB Exactly 100% of the storageSize!

ToroDB What if we use a columnar store?

ToroDB And if we use a columnar store? (“iotop -o
-a”) Github Archive: top 10 actors (1,4GB dataset) Disk Read (MB) 0 125 250 375 500 MongoDB PostgreSQL PostgreSQL + cstore 6,5 87,93 536,36 compressed, only 1 column read

ToroDB What about query performance? Milliseconds 0 43.750 87.500 131.250
175.000 Github dataset size (GB) 1 GB 10 GB 100 GB Milliseconds 0 300.000 600.000 900.000 1.200.000 1.500.000 1.800.000 2.100.000 2.400.000 2.700.000 Github dataset size (GB) 1 GB 10 GB 100 GB PostgreSQL MongoDB

ToroDB What if we take this to the extreme •
What if we do an aggregate query for a 1:N relationship which is empty 99% of the time? • MongoDB will still scan the whole collection • A RDBMS will only scan an almost empty table

ToroDB Show me another benchmark! YASP dataset db.getCollection('player_matches').aggregate([ { $match:
{ 'item_uses.key': 'quelling_blade' } }, { $group: { '_id': null, avg: { $avg: '$level' } } } ]); SELECT AVG(matches.level) FROM uses INNER JOIN matches ON uses.did = matches.did WHERE uses.key = 'quelling_blade';

ToroDB Show me another benchmark! YASP dataset YASP dataset. Avg
level of users using 'quelling_blade' object execution time (ms) 0K 14K 28K 42K 56K 71K 85K 99K 113K 127K MongoDB PostgreSQL 63,36 131.678,49 That's more than 2000x faster!

ToroDB So what about indexes? • For most aggregated queries,
indexes will not be used! • You cannot plan in advance all the queries that will require an index. •Worst, but common use case: its the users who craft the queries they want to make.

ToroDB Show me another benchmark! YASP dataset YASP dataset. Avg
level of users using 'quelling_blade' object execution time (ms) 0 8 16 24 32 MongoDB (Indexed) PostgreSQL (Indexed) 19,65 30,85

ToroDB I’m sold! Gimme SQL • Signiﬁcant performance improvement! •
No connectors required: native SQL! • Design DDL, implement ETL • Not real-time, not HA Maybe not that fast…

ToroDB ToroDB: MongoDB to relational

ToroDB ToroDB for MongoDB Analytics • No need to design
the SQL schema: •ToroDB automagically does it for you! •Real-time: insert in MongoDB, automatically shows up in ToroDB. Even if tables need created. •Native SQL: your data ends in a RDBMS like PostgreSQL

ToroDB ToroDB works as a secondary on a replica set
ToroD this is your SQL replica!

ToroDB ToroDB performance: yet another example :) Github Archive 10GB
dataset Seconds 0 6 12 18 24 30 36 42 48 54 60 MongoDB ToroDB on PostgreSQLToroDB on Greenplum 2,46 7,901 58,519

ToroDB ToroDB main characteristics • Works as a MongoDB secondary
node •No need to run drdl or deﬁne schema. Accepts any input document, even with type conﬂicts •Query w/ native SQL (PostgreSQL as of today) •Open source!

ToroDB ToroDB vs MongoDB BI Connector ToroDB Mongo BI C
License Open Source Proprietary Query Language PostgreSQL PostgreSQL (v1) or reduced SQL set (v2) Performance 100x faster than MongoDB 100x slower plus big (v1) or smaller (v2) connector overhead NoSQL to SQL transformation Once (insert time) Many (per query) Distributed analytics With Greenplum or CitusDB No Columnar store and compression Yes No

ToroDB ToroDB main use cases Native SQL BI Connector Native
SQL BI Connector Data Integration Platform: SQL and NoSQL apps in the same RDBMS Live MongoDB to RDBMS migration Apps: Write data with Mongo API, query with SQL!

ToroDB Last but deﬁnitely not least BEGIN TRANSACTION ISOLATION LEVEL
REPEATABLE READ READ ONLY; Consistent reads!

ToroDB ToroDB: coming to Github soon!

ToroDB Rate My Session!

Let’s Talk! Ediﬁcio 4B - Loft 33 Avda. Fuencarral, 44
Campus Empresarial Tribeca 28108 Alcobendas, Madrid (SPAIN) (+34) 91 867 55 54

ToroDB: All your MongoDB Data Are Belong to SQL

ToroDB: All your MongoDB Data Are Belong to SQL

More Decks by 8Kdata

Other Decks in Programming

Featured

Transcript