Scalar DB: A library that makes non-ACID databases ACID-compliant

Scalar DB is a library that makes non-ACID databases ACID-compliant. It not only supports strongly-consistent ACID transactions but also scales linearly and achieves high availability when it is deployed with distributed databases such as Cassandra.

Scalar, Inc.

June 25, 2020

  1. Scalar DB: A library that makes non-ACID databases ACID-compliant

    25 Jun, 2020 at Database Lounge Tokyo #6. Hiroyuki Yamada, CTO/CEO at Scalar, Inc.
  2. Who am I?

    © 2020 Scalar, inc.

    • Hiroyuki Yamada
      – Passionate about Database Systems and Distributed Systems
      – Ph.D. in Computer Science, the University of Tokyo
      – IIS the University of Tokyo, Yahoo! Japan, IBM Japan
      – https://github.com/feeblefakie
  3. Agenda

    • Overview
    • Transaction Management
    • Scalar DB on X
    • Benchmarking and Verification Results
    • Summary
  4. What is Scalar DB

    • A universal transaction manager
      – A Java library that makes non-ACID databases ACID-compliant
      – The architecture is inspired by Deuteronomy [CIDR’09,11]
    • https://github.com/scalar-labs/scalardb
  5. Background & Motivation

    [Diagram labels: Transaction management, Recovery management, Java API]

    • Many distributed databases (NoSQLs) are non-transactional
      – Cassandra, HBase, Amazon DynamoDB, Azure Cosmos DB
      – Scalability and availability are chosen over safety
    • New scratch-built distributed databases (NewSQLs) are emerging
      – CockroachDB, TiDB, YugaByte, FaunaDB, Google Spanner
      – No sacrifice for safety. ACID is guaranteed
      – Relatively less mature than NoSQLs
    • Scalar DB is yet another approach to solve the issues
  6. Pros and Cons

    Pros:

    • Universal
      – Can make most non-transactional databases transactional
    • Non-invasive
      – No modifications to the underlying databases are required
    • Flexible Scalability
      – Transaction layer and storage layer can be independently scaled

    Cons:

    • Slower than NewSQLs
      – Performance is dependent on the underlying databases
    • Hard to optimize
      – Transaction manager has no information about storage
    • No SQL support
      – A transaction has to be written procedurally with a programming language
  7. Programming Interface and System Architecture

    • CRUD interface
      – put, get, scan, delete
    • Begin and commit semantics
      – Arbitrary number of operations can be handled
    • Client-coordinated
      – Transaction code is run in the library
      – No middleware is managed

    Example code from the slide:

        DistributedTransactionManager manager = …;
        DistributedTransaction transaction = manager.start();
        Get get = createGet();
        Optional<Result> result = transaction.get(get);
        Put put = createPut(result);
        transaction.put(put);
        transaction.commit();

    [Diagram labels: Client programs / Web applications; Scalar DB; Command execution / HTTP; Database]
  8. Data Model

    • Multi-dimensional map [OSDI’06]
      – (partition-key, clustering-key, value-name) -> value-content
      – Assumed to be hash partitioned
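The multi-dimensional map on this slide can be pictured as nested maps. The following is a minimal in-memory sketch, not Scalar DB code: the class `MultiDimMap`, its method names, and the use of `hashCode` as the partitioning function are all assumptions made for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of (partition-key, clustering-key, value-name) -> value-content.
class MultiDimMap {
    // partition key -> clustering key (kept sorted) -> value name -> value content
    private final Map<String, TreeMap<String, Map<String, String>>> partitions = new HashMap<>();

    void put(String partitionKey, String clusteringKey, String valueName, String content) {
        partitions.computeIfAbsent(partitionKey, k -> new TreeMap<>())
                  .computeIfAbsent(clusteringKey, k -> new HashMap<>())
                  .put(valueName, content);
    }

    String get(String partitionKey, String clusteringKey, String valueName) {
        TreeMap<String, Map<String, String>> rows = partitions.get(partitionKey);
        if (rows == null) return null;
        Map<String, String> values = rows.get(clusteringKey);
        return values == null ? null : values.get(valueName);
    }

    // Scan inside one partition, in clustering-key order, both bounds inclusive.
    Map<String, Map<String, String>> scan(String partitionKey, String from, String to) {
        TreeMap<String, Map<String, String>> rows = partitions.get(partitionKey);
        if (rows == null) return Collections.emptyMap();
        return rows.subMap(from, true, to, true);
    }

    // Hash partitioning (assumed): which of `nodes` nodes owns a partition key.
    static int nodeFor(String partitionKey, int nodes) {
        return Math.floorMod(partitionKey.hashCode(), nodes);
    }
}
```

Keeping the clustering keys sorted inside a partition is what makes a range scan cheap within one partition, while the partition key alone decides placement.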
  9. Transaction Management - Overview

    • Based on Cherry Garcia [ICDE’15]
      – Two-phase commit on linearizable operations (for Atomicity)
        – Protocol correction is our extended work
      – Distributed WAL records (for Atomicity and Durability)
      – Single-version optimistic concurrency control (for Isolation)
        – Serializability support is our extended work
    • Requirements in underlying databases/storages
      – Linearizable read and linearizable conditional/CAS write
      – An ability to store metadata for each record
  10. Transaction Commit Protocol (for Atomicity)

    • Two-phase commit protocol on linearizable operations
      – Similar to Paxos Commit [TODS’06] when linearizability is achieved with Paxos
      – Data records are assumed to be distributed
    • The protocol
      – Prepare phase: prepare records
      – Commit phase 1: commit status record
        – This is where a transaction is regarded as committed or aborted
      – Commit phase 2: commit records
    • Lazy recovery
      – Uncommitted records will be rolled forward or rolled back based on the status of a transaction when the records are read
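The three steps of the protocol can be sketched over an in-memory store. This is a simplified model under stated assumptions, not the library's implementation: `CommitSketch`, `Rec`, and the status strings are hypothetical, records are assumed to exist already, and `Map.replace` / `putIfAbsent` on a `ConcurrentHashMap` stand in for the linearizable conditional writes of the underlying database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory model of prepare / commit status / commit records.
class CommitSketch {
    static final class Rec {
        final int value; final String status; final int version; final String txId;
        Rec(int value, String status, int version, String txId) {
            this.value = value; this.status = status; this.version = version; this.txId = txId;
        }
        boolean preparedBy(String tx) { return status.equals("PREPARED") && txId.equals(tx); }
    }

    final Map<String, Rec> records = new ConcurrentHashMap<>();        // user table
    final Map<String, String> coordinator = new ConcurrentHashMap<>(); // TxID -> status

    boolean commit(String txId, Map<String, Rec> readSet, Map<String, Integer> writes) {
        // Prepare phase: each write succeeds only if the record is still the one that was read.
        List<String> prepared = new ArrayList<>();
        for (Map.Entry<String, Integer> w : writes.entrySet()) {
            Rec read = readSet.get(w.getKey());
            Rec next = new Rec(w.getValue(), "PREPARED", read.version + 1, txId);
            if (!records.replace(w.getKey(), read, next)) {
                rollBack(txId, readSet, prepared);
                return false;
            }
            prepared.add(w.getKey());
        }
        // Commit phase 1: the outcome is decided by whoever writes the status record first.
        if (coordinator.putIfAbsent(txId, "COMMITTED") != null) {
            rollBack(txId, readSet, prepared);
            return false;
        }
        // Commit phase 2: flip this transaction's prepared records to committed.
        for (String key : prepared) {
            records.computeIfPresent(key, (k, r) -> r.preparedBy(txId)
                    ? new Rec(r.value, "COMMITTED", r.version, r.txId) : r);
        }
        return true;
    }

    private void rollBack(String txId, Map<String, Rec> readSet, List<String> prepared) {
        coordinator.putIfAbsent(txId, "ABORTED");
        for (String key : prepared) {
            // Restore the record that was read, but only if it is still prepared by this TxID.
            records.computeIfPresent(key, (k, r) -> r.preparedBy(txId) ? readSet.get(k) : r);
        }
    }
}
```

In this sketch, a transaction whose read set went stale fails its first conditional write and aborts without touching any record, which is the single-version optimistic behaviour the deck describes.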
  11. Distributed WAL (for Atomicity and Durability)

    • WAL (Write-Ahead Logging) records are distributed
    • Record in user tables (User/Application)
      – Application data (managed by users)
      – Transaction metadata (managed by Scalar DB)
      – After image: Application data, TxID, Status, Version
      – Before image: Application data (before), TxID (before), Status (before), Version (before)
    • Status record in coordinator table
      – TxID, Status, Other metadata
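One way to read this record layout is that every user record carries its own undo information, so a reader can resolve a prepared record without a central log. The sketch below assumes that interpretation; `WalRecord`, its field names, and the status strings are illustrative and not taken from the Scalar DB source.

```java
// Hypothetical layout of one user-table record: after image plus before image.
class WalRecord {
    // After image
    final String data; final String txId; final String status; final int version;
    // Before image
    final String beforeData; final String beforeTxId; final String beforeStatus; final int beforeVersion;

    WalRecord(String data, String txId, String status, int version,
              String beforeData, String beforeTxId, String beforeStatus, int beforeVersion) {
        this.data = data; this.txId = txId; this.status = status; this.version = version;
        this.beforeData = beforeData; this.beforeTxId = beforeTxId;
        this.beforeStatus = beforeStatus; this.beforeVersion = beforeVersion;
    }

    // Roll forward: keep the after image and mark it committed.
    WalRecord rollForward() {
        return new WalRecord(data, txId, "COMMITTED", version,
                beforeData, beforeTxId, beforeStatus, beforeVersion);
    }

    // Roll back: the before image becomes the current record again.
    WalRecord rollBack() {
        return new WalRecord(beforeData, beforeTxId, beforeStatus, beforeVersion, null, null, null, 0);
    }

    // Lazy recovery on read, given the coordinator status for this record's TxID (null if none).
    static WalRecord recover(WalRecord r, String coordinatorStatus) {
        if (!"PREPARED".equals(r.status)) return r;
        if ("COMMITTED".equals(coordinatorStatus)) return r.rollForward();
        if ("ABORTED".equals(coordinatorStatus)) return r.rollBack();
        return r; // undecided: a reader would first have to settle the status record
    }
}
```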
  12. Concurrency Control (for Isolation)

    • Single-version OCC
      – Simple implementation of Snapshot Isolation
      – Conflicts are detected by linearizable conditional write
      – No clock dependency, no use of HLC (Hybrid Logical Clock)
    • Supported isolation level
      – Read-committed Snapshot Isolation (RCSI)
        – Read-skew, write-skew, read-only, phantom anomalies could happen
      – Serializable
        – No anomalies
        – RCSI-based but non-serializable schedules are aborted
  13. Transaction With Example – Prepare Phase

    Records in the non-transactional database, as read by both Client1 and Client2:

        UserID  Balance  Status  Version  TxID
        1       100      C       5        XXX
        2       100      C       4        YYY

    Client1’s memory space (Tx1: Transfer 20 from 1 to 2):

        UserID  Balance  Status  Version  TxID
        1       80       P       6        Tx1
        2       120      P       5        Tx1

    Client2’s memory space (Tx2: Transfer 10 from 1 to 2):

        UserID  Balance  Status  Version  TxID
        1       90       P       6        Tx2
        2       110      P       5        Tx2

    • Conditional write: update only if the version and the TxID are the ones I read
    • The conflicting write fails due to the condition mismatch
  14. Transaction With Example – Commit Phase 1

    Records in the non-transactional database (prepared by Client1 with Tx1):

        UserID  Balance  Status  Version  TxID
        1       80       P       6        Tx1
        2       120      P       5        Tx1

    Status table:

        TxID  Status
        XXX   C
        YYY   C
        ZZZ   A
        Tx1   C

    • Conditional write: update if the TxID does not exist
  15. Transaction With Example – Commit Phase 2

    Records in Cassandra (committed by Client1 with Tx1):

        UserID  Balance  Status  Version  TxID
        1       80       C       6        Tx1
        2       120      C       5        Tx1

    Status table:

        TxID  Status
        XXX   C
        YYY   C
        ZZZ   A
        Tx1   C

    • Conditional write: update status if the record is prepared by the TxID
  16. Recovery With Example

    • If TX1 crashes before prepare phase
      – Just clear the memory space for TX1
    • If TX1 crashes after prepare phase and before commit phase 1 (no status is written in Status table)
      – Another transaction (TX3) reads the records and notices that the records are prepared and there is no status for it
      – TX3 tries to abort TX1 (TX3 tries to write ABORTED to Status with TX1’s TXID and rolls back the records)
      – TX1 might be on its way to commit status, but only one can win, not both
    • If TX1 crashes (right) after commit phase 1
      – Another transaction (TX3) tries to commit the records (rollforward) on behalf of TX1 when TX3 reads the same records as TX1
      – TX1 might be on its way to commit records, but only one can win, not both
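The "only one can win" property follows from making the status write conditional on the TxID not existing yet. A minimal sketch of that race, with `ConcurrentHashMap.putIfAbsent` standing in for the database's linearizable conditional write; the class `StatusTable` and the returned action names are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical model of the Status table and of what a reader (TX3) does with a prepared record.
class StatusTable {
    private final ConcurrentMap<String, String> status = new ConcurrentHashMap<>();

    // Conditional write "update if the TxID does not exist".
    // Returns the status that is final for txId, whoever wrote it.
    String decide(String txId, String proposed) {
        String existing = status.putIfAbsent(txId, proposed);
        return existing == null ? proposed : existing;
    }

    // A reader that finds a record prepared by txId tries to abort that transaction.
    // If the owner already committed its status, the reader rolls the record forward instead.
    String resolvePrepared(String txId) {
        String outcome = decide(txId, "ABORTED");
        return outcome.equals("COMMITTED") ? "ROLL_FORWARD" : "ROLL_BACK";
    }
}
```

If TX3 writes ABORTED first, a later attempt by TX1 to write COMMITTED gets ABORTED back and TX1 has to give up; if TX1 was first, TX3 sees COMMITTED and completes the commit on TX1's behalf.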
  17. Serializable Strategy

    • Basic strategy
      – Avoid anti-dependency dangerous structure [TODS’05]
      – No use of SSI [SIGMOD’08] or its variant [EuroSys’12]
        – They require many linearizable operations for managing in/outConflicts, or a correct clock
    • Two implementations
      – Extra-write
        – Convert read into write
        – Extra care is taken if a record doesn’t exist (delete the record)
      – Extra-read
        – Check the read-set after prepare to see if it is the same as before
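The Extra-read variant can be sketched as a validation step that runs after the prepare phase. The code below is an assumption about what "the same as before" could mean, namely that version and TxID of every record in the read set are unchanged; `ExtraRead` and `Seen` are hypothetical names, and a `null` entry models a key that was read but not found.

```java
import java.util.Map;

// Hypothetical read-set validation for the Extra-read strategy.
class ExtraRead {
    static final class Seen {
        final int version; final String txId;
        Seen(int version, String txId) { this.version = version; this.txId = txId; }
    }

    // True if every record in the read set still has the version and TxID seen earlier.
    // `current` is assumed to hold the latest metadata per key, including prepared records.
    static boolean readSetUnchanged(Map<String, Seen> readSet, Map<String, Seen> current) {
        for (Map.Entry<String, Seen> e : readSet.entrySet()) {
            Seen then = e.getValue();
            Seen now = current.get(e.getKey());
            if (then == null) {
                if (now != null) return false; // the key did not exist when read, but does now
                continue;
            }
            if (now == null || now.version != then.version || !now.txId.equals(then.txId)) {
                return false;
            }
        }
        return true;
    }
}
```

A transaction that fails this check would abort instead of moving on to commit phase 1. In a write-skew schedule, each transaction finds that a record it only read has since been prepared by the other, so at least one of them aborts.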
  18. Optimization

    • Prepare in deterministic order to avoid starvation
      – First prepare always wins
      – TX1: updating K1, K2 and K3
      – TX2: updating K2, K3 and K4
      – H: Consistent hashing
      – [Diagram: keys placed by hash in the order H(K2), H(K1), H(K3), H(K4)]
      – Always prepare in this order (otherwise, all Txs might abort, e.g. TX1 prepares K1, K2 and TX2 prepares K4, K3 in this order)
    • Parallel commit
      – Commit records in parallel for better latency
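Deterministic ordering can be sketched as sorting a transaction's write set by a shared hash function before preparing. In the sketch below `String.hashCode` is only a stand-in for the slide's consistent-hashing function H, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper that fixes the order in which a transaction prepares its records.
class PrepareOrder {
    // Stand-in for H on the slide; a real deployment would reuse the store's partitioner.
    static int h(String key) {
        return key.hashCode();
    }

    // Order keys by hash, breaking ties by the key itself so the order is total.
    static List<String> prepareOrder(Collection<String> keys) {
        List<String> ordered = new ArrayList<>(keys);
        Comparator<String> byHash = Comparator.comparingInt(PrepareOrder::h);
        ordered.sort(byHash.thenComparing(Comparator.<String>naturalOrder()));
        return ordered;
    }
}
```

Because every transaction uses the same total order, two transactions that share keys reach their first shared key before any other shared key. Whichever prepares it first can proceed through the rest, while the other fails immediately rather than both holding a partial set of prepared records.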
  19. Scalar DB on Cassandra

    • Use LWT for linearizable operations
      – LWT: Paxos implementation of C*
    • Implemented with DataStax Java Driver
      – Other CQL compatible storages can also be supported
    • Highly available and scalable ACID transactions can be achieved
      – C* characteristics are fully exploited
      – The resulting commit protocol is similar to Paxos Commit [TODS’06]
  20. Scalar DB on Other Databases

    • ScyllaDB
      – Works without any modifications (since ScyllaDB talks CQL)
      – Benchmarked but wasn’t stable at the time (tested with ScyllaDB 3.2.0)
    • Phoenix/HBase
      – PoC version is publicly available
      – https://github.com/scalar-labs/scalardb-phoenix
      – Note: HBase doesn’t support transactions by itself
    • Azure Cosmos DB
      – Under development (plan to release in Sep)
      – Note: Cosmos DB can’t do a multi-partition transaction
    • Amazon DynamoDB
      – Under planning (plan to release in Dec)
      – Note: DynamoDB can’t mix read and write in a transaction
  21. Benchmark Results with Scalar DB on Cassandra

    [Figure panels: Workload1 (Payment), Workload2 (Evidence)]

    • Each node: i3.4xlarge (16 vCPUs, 122 GB RAM, 1900 GB NVMe SSD * 2), RF: 3
    • Achieved 90 % scalability in a 100-node cluster (compared to the ideal TPS based on the performance of a 3-node cluster)
  22. Verification Results

    • Scalar DB has been heavily tested with Jepsen and our destructive tools
      – Jepsen tests are created and conducted by Scalar
      – See https://github.com/scalar-labs/scalar-jepsen for more detail
    • Transaction commit protocol is verified with TLA+
      – See https://github.com/scalar-labs/scalardb/tree/master/tla%2B/consensus-commit
    • Jepsen: Passed. TLA+: Passed
  23. Summary and Future Work

    • Scalar DB is a universal transaction manager
      – Can make non-ACID databases ACID-compliant
      – Based on previous research works such as Deuteronomy, Cherry Garcia, Paxos Commit, BigTable
    • Scalability and Serializability are intensively verified
      – 90 % scalability is achieved in a 100-node cluster
      – Jepsen and TLA+ are passed
    • Future work
      – More extensive verification around Serializability
      – Integration with databases other than C*