Upgrade to Pro — share decks privately, control downloads, hide ads and more …

ToroDB Stampede: Open Source BI Connector for MongoDB

8Kdata
April 26, 2017

ToroDB Stampede: Open Source BI Connector for MongoDB

There are many valid reasons to want your MongoDB data to interact with the relational/SQL ecosystem: the humongous amount of tools available; the ability to use the SQL query language; or the need for really fast aggregate queries. However, this process is not easy: it requires designing the schema, which is ever changing; and a batch ETL process, which might require external tools. Enter ToroDB Stampede: a secondary node for your replica set, that automatically replicates the data and transforms it in real-time to a set of relational tables on a RDBMS, which you can query with native SQL. But 100x faster. Want so see the benchmarks? ToroDB Stampede is the open-source solution to MongoDB's BI Connector and is compatible with both MongoDB and Percona Server for MongoDB.

8Kdata

April 26, 2017
Tweet

More Decks by 8Kdata

Other Decks in Programming

Transcript

  1. Who I am ALVARO HERNANDEZ CEO, 8Kdata.com Founder, President Spanish

    Postgres User Group postgrespana.es ~ 750 members • What we do @8Kdata: ✓Creators of ToroDB.com, NoSQL & SQL database ✓Database R&D, product development ✓Training and consulting in PostgreSQL ✓PostgreSQL Support Linkedin: http://es.linkedin.com/in/alvarohernandeztortosa/ Twitter: @ahachete
  2. MongoDB BI Connector •How do they do it? You don’t

    know it. •Harder to detect bugs •They’re part of bigger solutions which may or may not fit you •You need Enterprise version, you are paying for the whole solution
  3. How do we generate the schema? { name: “Alice”, address:

    { street: “Wonderland”, number: 42 }, hobbies: [ “reading”, “talking to rabbits” ] } ★There’s no need to generate it, it’s implicit! implicit schema
  4. How do we generate the schema? •ToroDB analyzes every incoming

    document and separates metadata (schema) from data (tuples) •With that schema info, ToroDB creates 1+ tables per MongoDB collection •ToroDB also creates a RDBMS catalog schema per MongoDB database
  5. How do we generate the schema? did name_s _id_x 1

    Alice \x01d4a63dc6944f37a61cf3d6 SELECT * FROM torodb.people; SELECT * FROM torodb.people_address; did name_s seq number_d street_s 1 1 42 Wonderland SELECT * FROM torodb.people_hobbies; did rid seq value_s 1 1 0 “reading” 1 2 1 “talking to rabbits” database: torodb collection: people { name: “Alice”, _id: (implicit val), address: { street: “Wonderland”, number: 42 }, hobbies: [ “reading”, “talking to rabbits” ] }
  6. How to react to dynamic changes in the schema? [

    { name: “Alice” }, { name: “Harry”, surname: “Potter” } ] ★What happens if a new document in the collection brings new fields?
  7. How to react to dynamic changes in the schema? ★ToroDB

    automatically performs: ✓ALTER TABLE ... ADD COLUMN surname_s text did _id_x name_s surname_s 1 \x01d4a63dc6944f37a61cf3d6 Alice 2 \x21f4f563dc6944f37a61cf436 Harry Potter
  8. How to react to dynamic changes in the schema? [

    { name: “Harry”, surname: “Potter” } { name: “Henry”, surname: 8 }, ] ★What happens if a new document in the collection brings type conflicts?
  9. How to react to dynamic changes in the schema? ★

    Type conflicts are resolved by suffixing column names with the type of the data: "s" for text, "i" for integer and so forth did _id_x name_s surname_s surname_i 1 \x01d4a63dc6944f37a61cf3d6 Harry Potter 2 \x21f4f563dc6944f37a61cf436 Henry 8
  10. ToroDB Architecture ToroDB ToroDB works as a secondary node on

    a MongoDB replica set This is your SQL replica!
  11. ToroDB Architecture • ToroDB works as a secondary node on

    a replica set • This node has two main modes: ✓ First, it imports all MongoDB data into PostgreSQL ✓ Then it stays tuned to changes in MongoDB Oplog and applies them in PostgreSQL
  12. Backends Data2Relational Abstract Doc Protocols MongoDB*, Couchbase, Cassandra… Here is

    where ‘the magic’ happens transforming Docs into Tables and Rows { PostgreSQL*, MySQL, SQL Server, Oracle… ToroDB offers the interface of any NoSQL database, but persisting data in tables and rows within a SQL database. * currently available ToroDB Architecture
  13. Measuring I/O with iotop required to answer a query Github

    Archive: top 10 actors (1,4GB dataset) Disk Read (MB) 0 125 250 375 500 MongoDB PostgreSQL PostgreSQL + cstore 6,5 87,93 536,36 Exactly 100% of the storageSize! MongoDB storageSize: 536.37 MB db.githubarchive.aggregate([ { $group: { _id: '$actor.login', events: { $sum: 1 } } }, { $sort: { events: -1 }}, { $limit: 10 } ]) SELECT count(*), login FROM actor GROUP BY login ORDER BY 1 DESC LIMIT 10; MongoDB size: 1410,35 MB
  14. Building the benchmarks • Data from Github archive • Timing

    three different environments that contain the same data: ✓ MongoDB ✓ MongoDB with 3 shards ✓ PostgreSQL (obtained by migrating MongoDB environment via ToroDB) •We run 6 queries (PostgreSQL queries being the relational equivalent to MongoDB queries) ✓ Available at https://gist.github.com/ahachete
  15. A B C D E F Seconds (Less is better)

    0 4.000 8.000 12.000 16.000 1.021 1.418 3.691 1.191 1.184 621 4.126 4.732 14.504 4.253 4.333 1.851 402 1.650 254 239 83 880 PostgreSQL MongoDB MongoDB (3 Shards) Github 500GB 57x
  16. A B C D E F Seconds (Less is better)

    0 750 1.500 2.250 3.000 117 193 383 163 194 89 664 768 2.405 747 750 361 53 240 9 38 13 34 PostgreSQL MongoDB MongoDB (3 Shards) Github 100GB That’s 267x faster!
  17. Github 100GB no idx Important because many BI queries don’t

    use indexes A B C D E F Seconds (Less is better) 0 200 400 600 800 164 211 158 193 209 199 667 793 658 668 694 715 53 241 7 37 13 31 PostgreSQL MongoDB MongoDB (3 Shards) 94x
  18. Installation Prerequisites ToroDB Stampede • PostgreSQL
 • Java SE 8

    MongoDB connector for BI • MongoDB Enterprise 3.2+
 • OpenSSL
  19. ToroDB Stampede wget https://www.torodb.com/download/torodb-stampede-1.0.0- beta1.tar.bz2 tar xjf torodb-stampede-1.0.0-beta1.tar.bz2 MongoDB connector

    for BI wget https://fastdl.mongodb.org/linux/mongodb-linux-x86_64- ubuntu1604-3.4.3.tgz tar xjf mongodb-linux-x86_64-ubuntu1604-3.4.3.tgz sudo install -m755 bin/mongo* /usr/local/bin/ Installation
  20. MongoDB connector for BI • You have to create an

    schema for each database (or collection). mongodrdl --host example.com -d dbname [-c collname] -o schema.drdl • You can edit manually… ... It’s something like this schema: - db: database tables: - table: name_table collection: name_collection pipeline: [] columns: - Name: _id MongoType: bson.ObjectId SqlName: _id SqlType: varchar - Name: column_name MongoType: type SqlName: sqlname SqlType: sqltype Configuration
  21. ToroDB Stampede • Requires a new user (‘torodb') and a

    new database (’torod') in PostgreSQL. • Access configuration to PostgreSQL detailed in .toropass: host:port:torod:torodb:<password> • You can also check the configuration used by ToroDB Stampede using the -l parameter: torodb-stampede -l Configuration
  22. ToroDB Stampede $TOROHOME/bin/torodb-stampede • And it starts the automagically replication

    MongoDB connector for BI mongosqld --schema schema.drdl --mongo-uri <your.mongohost.com> • You need one schema per database • It enables a MySQL connection Start!
  23. ToroDB Stampede • Creates new columns and tables if it’s

    necessary MongoDB connector for BI • You will surely have to edit manually the mongodrdl schema: ➡ mongodrdl infers the schema by sampling Problem resolution Insert Documents with new fields
  24. ToroDB Stampede • The column names contain a postfix, indicating

    the data type ➡ If one path contains two different data types then two different columns are created. MongoDB connector for BI • You should execute mongodrdl again, but surely you don’t have the new type, so: ➡ You have to edit the schema and add a new field with different sql name. Problem resolution Type Conflicts
  25. Problem resolution Array / Document exploding MongoDB Connector for BI

    database: stampede collection: restaurants { _id: (implicit val), name: “Glorious Food” address: { street:”Wonderland”, number: 42, coord: [ -73.9832324, 40.6563633] } } _id name address.street address.number 01d4a63 dc6944f “Glorious Food” “Wonderland” 42 stampede.restaurants stampede.restaurants_address_coord _id address.coord address.coord_idx 01d4a63 dc6944f -73.9832324 0 01d4a63 dc6944f 40.6563633 1
  26. Problem resolution Array / Document exploding ToroDB Stampede did _id_x

    name_s address_e 0 \x01d4a “Glorious Food” f stampede.restaurants stampede.restaurants_address did rid seq street_s number_d coord_e 0 0 “Wonderland” 42 t database: stampede collection: restaurants { _id: (implicit val), name: “Glorious Food” address: { street:”Wonderland”, number: 42, coord: [ -73.9832324, 40.6563633] } } stampede.restaurants_address_coord did rid pid seq v_d 0 0 0 0 -73,9832324 0 1 0 1 40.6563633
  27. Use Case with… MongoDB BI 100.000 rows 82.412ms ( 80x

    !! ) PostgreSQL 100.000 rows 1.008 ms SELECT * FROM twitter.message INNER JOIN twitter.message_user ON twitter.message.did = twitter.message_user.did LIMIT 100000 SELECT * FROM `twitter.message` LIMIT 100000
  28. Use Case With Tableau Top ten users sending messages We

    ran the following query: SELECT screen_name_s, count(1) FROM (SELECT screen_name_s FROM twitter.message_user LIMIT 1000000) AS users GROUP BY 1 ORDER BY 2 desc LIMIT 10 PostgreSQL 3.791 ms MongoDB 7.107 ms (1,8x)
  29. Use Case With Tableau Top user locations We ran the

    with the following query: SELECT location_s, count(1) FROM (SELECT location_s FROM twitter.message_user LIMIT 10000000) AS locations GROUP BY 1 ORDER BY 2 desc LIMIT 20 PostgreSQL 15.452 ms MongoDB 72.536 ms (4,7x)
  30. Use Case With Tableau Top Hashtags We did the with

    the following query: SELECT text_s, count(1) FROM (SELECT text_s FROM twitter.message_entities_hashtags LIMIT 10000000)) AS hashtags GROUP BY 1 ORDER BY 2 desc LIMIT 20 PostgreSQL 4.819 ms MongoDB 305.021 ms (64x !!!)
  31. Summary MongoDB Connector for BI ToroDB Stampede Installation Easy and

    fast Easy and fast Configuration Need to create schema Easy and fast Querying Indirect (convert to NoSQL) Direct (Keeps data synchronized) Adaptability Manual update schema for new fields or types Adapts automagically to changes in schema Query capabilities Limited, requires many self- joins (flattened schema) Advance, schema looks very natural Speed x Up to 100x faster License MongoDB Enterprise Open Source