Upgrade to Pro — share decks privately, control downloads, hide ads and more …

ToroDB: All your MongoDB Data Are Belong to SQL

8Kdata
October 04, 2016

ToroDB: All your MongoDB Data Are Belong to SQL

What if you could seamlessly replicate all your MongoDB data to a native SQL database? With NoSQL you can scale out, you don’t need to pre-define your schema, and it maps naturally to the OO world. But how do you perform data discovery? How do you interact with the vast SQL ecosystem, specially the BI tools? How do your SQL-savvy users do without SQL? The answer to most of these questions is usually not a very convincing one. But there are good news: you can bring all your MongoDB data to the SQL world. With ToroDB, an open-source database that participates in a MongoDB cluster. ToroDB acts as a hidden secondary node and replicates in real-time your MongoDB data to a native, SQL database such as PostgreSQL. The process does not involve defining any schema or creating any definition files: ToroDB creates everything needed to represent your MongoDB data as relational tables. Now you can leverage all the SQL ecosystem with your MongoDB data, be it for data migration, connecting to BI connectors, using native SQL tools or simply querying with the powerful SQL language.And using native SQL sharded databases like Greenplum and columnar storage, your aggregate queries, compared to MongoDB’s sharding and aggregation, run up to 100x faster!

8Kdata

October 04, 2016
Tweet

More Decks by 8Kdata

Other Decks in Programming

Transcript

  1. ToroDB What do Stripe and Buffer have in common? Use

    MongoDB for high load OLTP Successful companies They are Web Scale
  2. ToroDB MongoDB analytics performance: Github dataset { "id": "2489368070", "type":

    "PushEvent", "public": true, "created_at": "2015-01-01T00:00:00Z", "actor": { "id": 9152315, "login": "davidjhulse", "gravatar_id": "", "url": "https://api.github.com/users/davidjhulse", "avatar_url": "https://avatars.githubusercontent.com/u/9152315?" }, "repo": { "id": 28635890, "name": "davidjhulse/davesbingrewardsbot", "url": "https://api.github.com/repos/davidjhulse/davesbingrewardsbot" },
  3. ToroDB MongoDB analytics performance: Github dataset "payload": { "push_id": 536740396,

    "size": 1, "distinct_size": 1, "ref": "refs/heads/master", "head": "a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81", "before": "86ffa724b4d70fce46e760f8cc080f5ec3d7d85f", "commits": [ { "sha": "a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81", "author": { "email": "[email protected]", "name": "davidjhulse" }, "message": "Altered BingBot.jar\n\nFixed issue with multiple account", "distinct": true, "url": "https://api.github.com/repos/..." } ] } }
  4. ToroDB MongoDB analytics performance: Github dataset Milliseconds 0 300.000 600.000

    900.000 1.200.000 1.500.000 1.800.000 2.100.000 2.400.000 2.700.000 Github dataset size (GB) 1 GB 10 GB 100 GB MongoDB db.github.find({ type: 'PushEvent', 'actor.login': 'davidjhulse' });
  5. ToroDB Why is it so slow? • For most aggregated

    queries, the whole collection is scanned for every query. • For every document, many keys and offsets are parsed and computed to find the possible keys. Worst case: all keys are scanned. • Different documents with different in the same collection. Mixing apples and oranges!
  6. ToroDB Read I/O required to answer the query db.githubarchive.aggregate([ {

    $group: { _id: '$actor.login', events: { $sum: 1 } } }, { $sort: { events: -1 }}, { $limit: 10 } ]) SELECT count(*), login FROM actor GROUP BY login ORDER BY 1 DESC LIMIT 10;
  7. ToroDB Read I/O required to answer the query (“iotop -o

    -a”) Github Archive: top 10 actors (1,4GB dataset) Disk Read (MB) 0 125 250 375 500 MongoDB PostgreSQL 87,93 536,36 MongoDB storageSize: 536.37 MB MongoDB size: 1410.35 MB Exactly 100% of the storageSize!
  8. ToroDB And if we use a columnar store? (“iotop -o

    -a”) Github Archive: top 10 actors (1,4GB dataset) Disk Read (MB) 0 125 250 375 500 MongoDB PostgreSQL PostgreSQL + cstore 6,5 87,93 536,36 compressed, only 1 column read
  9. ToroDB What about query performance? Milliseconds 0 43.750 87.500 131.250

    175.000 Github dataset size (GB) 1 GB 10 GB 100 GB Milliseconds 0 300.000 600.000 900.000 1.200.000 1.500.000 1.800.000 2.100.000 2.400.000 2.700.000 Github dataset size (GB) 1 GB 10 GB 100 GB PostgreSQL MongoDB
  10. ToroDB What if we take this to the extreme •

    What if we do an aggregate query for a 1:N relationship which is empty 99% of the time? • MongoDB will still scan the whole collection • A RDBMS will only scan an almost empty table
  11. ToroDB Show me another benchmark! YASP dataset db.getCollection('player_matches').aggregate([ { $match:

    { 'item_uses.key': 'quelling_blade' } }, { $group: { '_id': null, avg: { $avg: '$level' } } } ]); SELECT AVG(matches.level) FROM uses INNER JOIN matches ON uses.did = matches.did WHERE uses.key = 'quelling_blade';
  12. ToroDB Show me another benchmark! YASP dataset YASP dataset. Avg

    level of users using 'quelling_blade' object execution time (ms) 0K 14K 28K 42K 56K 71K 85K 99K 113K 127K MongoDB PostgreSQL 63,36 131.678,49 That's more than 2000x faster!
  13. ToroDB So what about indexes? • For most aggregated queries,

    indexes will not be used! • You cannot plan in advance all the queries that will require an index. •Worst, but common use case: its the users who craft the queries they want to make.
  14. ToroDB Show me another benchmark! YASP dataset YASP dataset. Avg

    level of users using 'quelling_blade' object execution time (ms) 0 8 16 24 32 MongoDB (Indexed) PostgreSQL (Indexed) 19,65 30,85
  15. ToroDB I’m sold! Gimme SQL • Significant performance improvement! •

    No connectors required: native SQL! • Design DDL, implement ETL • Not real-time, not HA Maybe not that fast…
  16. ToroDB ToroDB for MongoDB Analytics • No need to design

    the SQL schema: •ToroDB automagically does it for you! •Real-time: insert in MongoDB, automatically shows up in ToroDB. Even if tables need created. •Native SQL: your data ends in a RDBMS like PostgreSQL
  17. ToroDB ToroDB works as a secondary on a replica set

    ToroD this is your SQL replica!
  18. ToroDB ToroDB performance: yet another example :) Github Archive 10GB

    dataset Seconds 0 6 12 18 24 30 36 42 48 54 60 MongoDB ToroDB on PostgreSQLToroDB on Greenplum 2,46 7,901 58,519
  19. ToroDB ToroDB main characteristics • Works as a MongoDB secondary

    node •No need to run drdl or define schema. Accepts any input document, even with type conflicts •Query w/ native SQL (PostgreSQL as of today) •Open source!
  20. ToroDB ToroDB vs MongoDB BI Connector ToroDB Mongo BI C

    License Open Source Proprietary Query Language PostgreSQL PostgreSQL (v1) or reduced SQL set (v2) Performance 100x faster than MongoDB 100x slower plus big (v1) or smaller (v2) connector overhead NoSQL to SQL transformation Once (insert time) Many (per query) Distributed analytics With Greenplum or CitusDB No Columnar store and compression Yes No
  21. ToroDB ToroDB main use cases Native SQL BI Connector Native

    SQL BI Connector Data Integration Platform: SQL and NoSQL apps in the same RDBMS Live MongoDB to RDBMS migration Apps: Write data with Mongo API, query with SQL!
  22. Let’s Talk! Edificio 4B - Loft 33 Avda. Fuencarral, 44

    Campus Empresarial Tribeca 28108 Alcobendas, Madrid (SPAIN) (+34) 91 867 55 54