Next Generation Java MongoDB Compatible NoSQL & SQL Database

ToroDB: Next Generation Java MongoDB compatible NoSQL & SQL Database
Álvaro Hernández <[email protected]>

ToroDB @NoSQLonSQL
About *8Kdata*
• Research & Development in databases
• Consulting, Training and Support in PostgreSQL
• Java Developers, JavaSpecialists.eu, JCrete.org
• About myself: CTO at 8Kdata: @ahachete http://linkd.in/1jhvzQ3 www.8kdata.com

Say you want…
A database with:
• Great functionality
• Consistency (ACID)
• Reliability
• SQL

… and then also want
A database with:
• NoSQL (MongoDB)
• “Schema-less”
• Scalability

Fear no more!
You can now have both SQL and NoSQL!

DEMO!

ToroDB in one slide
• Document-oriented, JSON, NoSQL db
• Open source (AGPL). Written in Java
• MongoDB compatibility (wire protocol level)
• Uses PostgreSQL as a storage backend

Mapping unstructured data to relational

ToroDB storage internals
{
  "name": "ToroDB",
  "data": {
    "a": 42,
    "b": "hello world!"
  },
  "nested": {
    "j": 42,
    "deeper": {
      "a": 21,
      "b": "hello"
    }
  }
}

ToroDB storage internals
The document is split into the following subdocuments:
{ "name": "ToroDB", "data": {}, "nested": {} }
{ "a": 42, "b": "hello world!" }
{ "j": 42, "deeper": {} }
{ "a": 21, "b": "hello" }
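The split shown on this slide can be sketched in a few lines of Java. This is only an illustration under assumptions: the class `SubdocumentSplitter` and its methods are hypothetical names, documents are modelled as plain `Map<String, Object>`, and arrays are ignored. It is not ToroDB's actual D2R code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not ToroDB's code): splits a nested document into flat subdocuments. */
final class SubdocumentSplitter {

    /** Returns the flat subdocuments, each parent listed before its children. */
    static List<Map<String, Object>> split(Map<String, Object> document) {
        List<Map<String, Object>> result = new ArrayList<>();
        collect(document, result);
        return result;
    }

    @SuppressWarnings("unchecked")
    private static void collect(Map<String, Object> document, List<Map<String, Object>> out) {
        Map<String, Object> flat = new LinkedHashMap<>();
        out.add(flat);
        for (Map.Entry<String, Object> entry : document.entrySet()) {
            Object value = entry.getValue();
            if (value instanceof Map) {
                // The nested object is replaced by an empty placeholder {} ...
                flat.put(entry.getKey(), new LinkedHashMap<String, Object>());
                // ... and becomes a subdocument of its own.
                collect((Map<String, Object>) value, out);
            } else {
                flat.put(entry.getKey(), value);
            }
        }
    }
}
```

Applied to the example document, `split` yields four subdocuments in the same order as listed on the slide: the root, `data`, `nested` and `deeper`.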

ToroDB storage internals
select * from demo.t_3
┌─────┬───────┬────────────────────────────┬────────┐
│ did │ index │ _id                        │ name   │
├─────┼───────┼────────────────────────────┼────────┤
│   0 │     ¤ │ \x5451a07de7032d23a908576d │ ToroDB │
└─────┴───────┴────────────────────────────┴────────┘
select * from demo.t_1
┌─────┬───────┬────┬──────────────┐
│ did │ index │ a  │ b            │
├─────┼───────┼────┼──────────────┤
│   0 │     ¤ │ 42 │ hello world! │
│   0 │     1 │ 21 │ hello        │
└─────┴───────┴────┴──────────────┘
select * from demo.t_2
┌─────┬───────┬────┐
│ did │ index │ j  │
├─────┼───────┼────┤
│   0 │     ¤ │ 42 │
└─────┴───────┴────┘

ToroDB storage internals
select * from demo.structures
┌─────┬────────────────────────────────────────────────────────────────────────────┐
│ sid │ _structure                                                                 │
├─────┼────────────────────────────────────────────────────────────────────────────┤
│   0 │ {"t": 3, "data": {"t": 1}, "nested": {"t": 2, "deeper": {"i": 1, "t": 1}}} │
└─────┴────────────────────────────────────────────────────────────────────────────┘
select * from demo.root;
┌─────┬─────┐
│ did │ sid │
├─────┼─────┤
│   0 │   0 │
└─────┴─────┘

How data is stored in schema-less
Data normalization

This is how we store in ToroDB

Advantages over MongoDB

ToroDB: native SQL

Mix-and-match relational & NoSQL
• Use the same database for both your relational data and ToroDB
• Just use separate schemas (if you will)
• Don't write to ToroDB data or metadata tables
• Query with SQL, do joins, whatever!

Atomic operations
• MongoDB has no support for atomic bulk insert/update/delete operations
• Not even with $isolated:
“Prevents a write operation that affects multiple documents from yielding to other reads or writes […] You can ensure that no client sees the changes until the operation completes or errors out. The $isolated isolation operator does not provide “all-or-nothing” atomicity for write operations.”
http://docs.mongodb.org/manual/reference/operator/update/isolated/

“Clean” reads
Oh really?

“Clean” reads
http://docs.mongodb.org/manual/reference/write-concern/#read-isolation-behavior
“MongoDB will allow clients to read the results of a write operation before the write operation returns.”
“If the mongod terminates before the journal commits, even if a write returns successfully, queries may have read data that will not exist after the mongod restarts.”
Thus, MongoDB suffers from dirty reads. But let's call them just “tainted reads”.

“Clean” reads
What about $snapshot? Nope:
“The snapshot() does not guarantee that the data returned by the query will reflect a single moment in time nor does it provide isolation from insert or delete operations.”
http://docs.mongodb.org/manual/faq/developers/#faq-developers-isolate-cursors
Cursors in ToroDB run in repeatable read, read-only mode:
globalCursorDataSource.setTransactionIsolation("TRANSACTION_REPEATABLE_READ");
globalCursorDataSource.setReadOnly(true);
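The slide configures these settings on a pooled data source. As a rough sketch, the same idea can be expressed with the plain `java.sql.Connection` API. The class `CursorConnections` and the method `prepareForCursor` are hypothetical names, and disabling auto-commit is an assumption (so that every batch of a cursor is read inside one transaction); ToroDB's own code may differ.

```java
import java.sql.Connection;
import java.sql.SQLException;

/** Hypothetical helper (not ToroDB's code): prepares a JDBC connection for cursor reads. */
final class CursorConnections {

    /** Applies repeatable read and read-only mode, as described on the slide. */
    static Connection prepareForCursor(Connection connection) throws SQLException {
        // Later reads in the same transaction do not see rows committed by others in between.
        connection.setTransactionIsolation(Connection.TRANSACTION_REPEATABLE_READ);
        // Cursors only read, so the backend can reject accidental writes.
        connection.setReadOnly(true);
        // Assumption: keep one transaction open for the whole lifetime of the cursor.
        connection.setAutoCommit(false);
        return connection;
    }
}
```

The caller is expected to commit or roll back and close the connection once the cursor is exhausted.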

Replication & Horizontal scalability (aka sharding)

ToroDB v0.4
• ToroDB works as a secondary slave of a MongoDB master (or of another slave, with chained replication)
• Implements the full replication protocol (not as an oplog tailable query)
• Replicates from MongoDB to PostgreSQL

Write scalability (sharding)
• MongoDB's sharding API not implemented yet (roadmap: ToroDB 0.8)
• Will use MongoDB's mongos without modification, as well as config servers
• Currently we implement sharding at the db level, using backends such as Greenplum

ToroDB
The software

The software
Written in Java. v0.40 requires Java 7, 1.0 will require Java 8.
Tested with Oracle and IBM JVMs. Anyone from Azul here today? ;)
Distributed as a JAR file (actually, wrapped with shell executables). Future: also EAR to deploy.

Standing on the shoulders of giants
And also PostgreSQL, Greenplum, JDBC

Very modular source code
• The app: 20+ modules (Maven)
• Some of them are individually reusable
• Several abstraction layers:
➔ D2R (Document 2 Relational)
➔ KVDocument (KV docs abstraction)
➔ Database/backend (relational)

MongoWP
• MongoWP is our implementation of MongoDB's wire protocol
• Based on Netty, an excellent, async and high performance NIO framework
• Callback interface for any MongoDB-based “middleware” implementation (ToroDB, proxy...)

Architecture

Executor engines

Performance

OLTP Benchmark: YCSB
• Amazon c3.8xlarge
➔ 32 virtual CPUs
➔ 60 GB RAM
➔ 2 x 320 GB SSD
• YCSB 0.5.0, only inserts, 10 minutes
• WriteConcern {w:1, fsync: true}
• Batch size 1000, 1 and 4 threads
• MongoDB 3.2 WiredTiger
• ToroDB 0.40, Oracle Java 8, PostgreSQL 9.5 (shared_buffers: 15GB, effective_cache_size: 45GB)

OLTP Benchmark: YCSB

Data Analytics Benchmark
• Amazon reviews dataset
Image-based recommendations on styles and substitutes
J. McAuley, C. Targett, J. Shi, A. van den Hengel
SIGIR, 2015
• AWS c4.xlarge (4vCPU, 8GB RAM) 4KIOPS SSD EBS
• 4x shards, 3x config; 4x segments GP
• 83M records, 65GB plain json

Disk usage
Mongo 3.0, WT, Snappy
GP columnar, zlib level 9
[Chart: Storage requirements, MongoDB vs ToroDB on Greenplum. Table size, index size and total size in bytes for Mongo and ToroDB on GP; axis from 0 to 80,000,000,000 bytes.]

Queries: which one is easier?
SELECT count( distinct( "reviewerID" ) ) FROM reviews;

db.reviews.aggregate([
  { $group: { _id: "$reviewerID" } },
  { $group: { _id: 1, count: { $sum: 1 } } }
])

Queries: which one is easier?
SELECT "reviewerName", count(*) as reviews
FROM reviews
GROUP BY "reviewerName"
ORDER BY reviews DESC
LIMIT 10;

db.reviews.aggregate(
  [
    { $group : { _id : '$reviewerName', r : { $sum : 1 } } },
    { $sort : { r : -1 } },
    { $limit : 10 }
  ],
  {allowDiskUse: true}
)

Query times
3 different queries. Q3 on MongoDB: aggregate fails
Query duration (s), MongoDB vs ToroDB on Greenplum

Query   MongoDB (s)   ToroDB on GP (s)   Speedup
Q1      969           35                 27.95
Q2      1007          13                 74.87
Q3      fails         31                 -

Tips & Tricks
Lessons Learned

Visitor pattern for document manipulation
• Lots of types, and values for each of those types, to manage: strings, integers, arrays
• Lots of cases we have to handle in the code:
➔ transformation from document to table data structure
➔ transformation to internal query lang
➔ ...

Visitor pattern for document manipulation
• Smaller methods
• Somewhere in the deepest class you can see a huge if {} else if {} ... else {}
• Safely add new types
Compiler will tell us if we forget to implement some visitor
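A minimal visitor sketch for the three value types named on the slide (strings, integers, arrays). All type names here (`DocValue`, `DocValueVisitor`, `TextRenderer`, and so on) are hypothetical and chosen for illustration; they are not ToroDB's KVDocument API.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical value model for illustration; not ToroDB's KVDocument API. */
interface DocValue {
    <R> R accept(DocValueVisitor<R> visitor);
}

/** One method per value type: adding a method here breaks every visitor until it handles the new type. */
interface DocValueVisitor<R> {
    R visit(StringValue value);
    R visit(IntValue value);
    R visit(ArrayValue value);
}

final class StringValue implements DocValue {
    final String value;
    StringValue(String value) { this.value = value; }
    @Override public <R> R accept(DocValueVisitor<R> visitor) { return visitor.visit(this); }
}

final class IntValue implements DocValue {
    final int value;
    IntValue(int value) { this.value = value; }
    @Override public <R> R accept(DocValueVisitor<R> visitor) { return visitor.visit(this); }
}

final class ArrayValue implements DocValue {
    final List<DocValue> elements;
    ArrayValue(DocValue... elements) { this.elements = Arrays.asList(elements); }
    @Override public <R> R accept(DocValueVisitor<R> visitor) { return visitor.visit(this); }
}

/** One transformation per visitor: this one renders a value as JSON-like text (no escaping). */
final class TextRenderer implements DocValueVisitor<String> {
    @Override public String visit(StringValue value) { return "\"" + value.value + "\""; }
    @Override public String visit(IntValue value) { return Integer.toString(value.value); }
    @Override public String visit(ArrayValue value) {
        StringBuilder out = new StringBuilder("[");
        for (int i = 0; i < value.elements.size(); i++) {
            if (i > 0) out.append(", ");
            out.append(value.elements.get(i).accept(this)); // dispatch on the element's type
        }
        return out.append("]").toString();
    }
}
```

Each transformation (document to table rows, document to query language) would be another small `DocValueVisitor` implementation, with no type switch anywhere.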

Tools used to monitor performance
Oracle Java Mission Control
• Great tool in general, low impact on perf. Gives A LOT of information on memory allocation, exceptions thrown, etc.
• But quite bad at measuring the time spent in methods, as it ignores time spent in native code and IO
• Very coarse-grained

Tools used to monitor performance
VisualVM
• Very fine-grained
• By default measures time spent in native code and IO
• Impact on performance:
➔ Configurable, but high in general
➔ The performance impact seems to be heterogeneous (some methods are more penalized than others)

Document keys & maps
• ToroDB uses HashMaps. Keys are the JSON keys
• When there is a lookup on a HashMap, equals must be executed.
• Each key is a String, and String#equals is O(1) when both Strings are the same object, but O(n) when both Strings are equal but not the same object.
• As a result, we were spending much more time than expected on HashMap lookups
• We use a pool of keys that guarantees that if two keys are equal, they are the same object.
• Cons: Some time is spent on the pool of keys, as it is basically a map.
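A pool of keys along these lines could look as follows. This is a sketch under assumptions: the class `KeyPool` and the method `canonical` are hypothetical names, and the pool is unbounded, which a real implementation would probably need to reconsider.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical key pool (not ToroDB's code): equal keys are mapped to one shared instance. */
final class KeyPool {
    private final ConcurrentMap<String, String> pool = new ConcurrentHashMap<>();

    /** Returns the pooled instance equal to the given key, adding the key if it is new. */
    String canonical(String key) {
        // This lookup is itself a map access: the cost mentioned under "Cons".
        String existing = pool.putIfAbsent(key, key);
        return existing != null ? existing : key;
    }
}
```

Once every document key has gone through `canonical`, `String#equals` on two equal keys returns at its initial reference comparison instead of comparing characters.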

Dealing with Memory Pressure
• ToroDB has to deal with memory pressure when the MongoDB clients produce requests faster than the SQL backend can handle them.
• This is especially important when the client is using the async drivers
• Ideal solution: Make the backend faster
• Especially by adding async behaviour
• But it requires a new non-JDBC driver => Phoebe
• Practical solution: Use a back pressure mechanism so that the client goes only as fast as the backend can.
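One common way to get back pressure is to cap the number of requests that may be pending on the backend and to block the thread that accepts client requests once the cap is reached. The sketch below does this with a `Semaphore`. The class `BackPressureGate`, its methods and the idea of a fixed `maxInFlight` limit are assumptions for illustration, not a description of ToroDB's mechanism.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.Semaphore;

/** Hypothetical back pressure gate (not ToroDB's code): bounds the requests pending on the backend. */
final class BackPressureGate {
    private final Semaphore permits;
    private final Executor backend;

    BackPressureGate(int maxInFlight, Executor backend) {
        this.permits = new Semaphore(maxInFlight);
        this.backend = backend;
    }

    /** Blocks the caller while maxInFlight requests are already pending, then hands the request over. */
    void submit(final Runnable request) throws InterruptedException {
        permits.acquire(); // the producer waits here instead of piling requests up in memory
        try {
            backend.execute(new Runnable() {
                @Override public void run() {
                    try {
                        request.run();
                    } finally {
                        permits.release(); // capacity is returned only when the backend is done
                    }
                }
            });
        } catch (RuntimeException rejected) {
            permits.release(); // the request never started, so give the permit back
            throw rejected;
        }
    }

    /** Number of requests that could be submitted right now without blocking. */
    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Because the accepting thread blocks, it stops reading from the client socket, and an async client is slowed down by the transport instead of filling the server's heap.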

Chasing performance problems
• It is important to monitor the hotspots
• We found some parts of our code that were correct, but very inefficient.
➔ Some of them were errors (some analyses were executed twice in different parts of the code)
➔ Some operations that we considered fast enough were executed so many times that it was critical to reimplement them in a more performant way

Download, clone, PR, star it!
https://github.com/torodb/torodb
Check our FAQ: https://github.com/torodb/torodb/wiki/FAQ
