Transaction Management on Cassandra

Scalar, Inc.
September 22, 2019

Scalar DB is an open source library released under Apache 2 which realizes ACID-compliant transactions on Cassandra, without requiring any modifications to Cassandra itself. It achieves strongly-consistent, linearly scalable, and highly available transactions. This talk will present the theory and practice behind Scalar DB, as well as providing some benchmark results and use cases.




  1. Transaction Management on Cassandra

    10 Sep, 2019 at ApacheCon NA 2019
    Hiroyuki Yamada, CTO/CEO at Scalar, Inc.
  2. © 2019 Scalar, inc. Who am I?

    • Hiroyuki Yamada
      – Passionate about Database Systems and Distributed Systems
      – Ph.D. in Computer Science, the University of Tokyo
      – IIS the University of Tokyo, Yahoo! Japan, IBM Japan
  3. Agenda

    • What is Scalar DB
    • Transaction Management on Cassandra
    • Benchmark and Verification Results
  4. Maybe You Don’t Need ACID Transactions

    • ACID transactions are heavy
      – Especially when data is distributed
    • One alternative:
      – Make operations idempotent and retry them if atomicity is not required
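The "idempotent and retry" alternative can be illustrated with a minimal in-memory sketch. Everything here (`IdempotentAccount`, the request-ID set, `creditWithBlindRetry`) is a hypothetical illustration and not part of Scalar DB or Cassandra; it only shows why a retried request is harmless when the operation is keyed by a request ID.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical account whose credit operation is idempotent per request ID. */
final class IdempotentAccount {
    private final Set<String> appliedRequestIds = new HashSet<>();
    private long balance;

    /** Applies the credit at most once per requestId; returns false for a replay. */
    synchronized boolean credit(String requestId, long amount) {
        if (!appliedRequestIds.add(requestId)) {
            return false; // replayed request: no effect
        }
        balance += amount;
        return true;
    }

    synchronized long balance() {
        return balance;
    }

    /**
     * Simulates a client that cannot tell whether its first attempt succeeded
     * (for example, the reply was lost) and therefore sends the request again.
     */
    static long creditWithBlindRetry(IdempotentAccount account, String requestId, long amount) {
        account.credit(requestId, amount); // first attempt; assume the reply was lost
        account.credit(requestId, amount); // retry of the same request
        return account.balance();
    }
}
```

Because the second call with the same request ID is a no-op, the client can retry freely without coordinating with other writers.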
  5. What is Scalar DB

    • A library that makes non-ACID distributed databases ACID-compliant
      – Cassandra is the first supported distributed database
    • https://github.com/scalar-labs/scalardb
    • Diagram labels: Transaction management, Recovery management, Java API
  6. System Architecture with Scalar DB and Cassandra

    • Achieves one-copy Serializable Transactions
    • Diagram labels: End Users, HTTP, Web Applications, Command execution, Client programs, Scalar DB, DataStax Java Driver, Cassandra nodes
    • Code shown on the slide:
        Key key = new Key(new TextValue("id", "1"));
        Result result = db.get(new Get(key));
        // do something with result
        Put put = new Put(key).(…);
        db.put(put);
  7. Key Characteristics

    • Non-invasive approach
      – No modifications to the underlying database are required
    • High availability
      – Available as long as a quorum of replicas is up
      – C* high availability is fully sustained by the client-coordinated approach
    • Horizontal scalability
      – Throughput scales linearly
      – C* high scalability is fully sustained by the client-coordinated approach
    • Strong consistency
      – Replicas updated by transactions are always consistent and up-to-date
  8. Transaction Management on Cassandra - Introduction

    • Based on the Cherry Garcia protocol [ICDE’15]
      – Requires a minimum set of features, such as linearizable conditional updates and the ability to store metadata
    • Scalar DB is one of the applications of the protocol
      – Uses LWT for linearizability
      – Manages transaction metadata in user record space
    • Implements enhancements
      – Protocol correction
      – No use of TrueTime API
      – Serializable support (SI is the default isolation level)
  9. Transaction Metadata Management

    • WAL (Write-Ahead Logging) records are distributed
    • Record in user tables (User/Application)
      – After image: Application data (managed by users); Transaction metadata (managed by Scalar DB): TxID, Status, Version
      – Before image: Application data (before); Transaction metadata (before): TxID (before), Status (before), Version (before)
    • Record in coordinator table (Status)
      – TxID, Status, Other metadata
  10. Transaction Protocol - Overview

    • Optimistic concurrency control
    • Similar to the two-phase commit protocol
      – Prepare phase: prepare records
      – Commit phase 1: commit status
        – This is where a transaction is regarded as committed or aborted in normal cases
      – (Commit phase 2: commit records)
    • Lazy recovery
      – Uncommitted records are rolled forward or rolled back, based on the status of the transaction, when the records are read
  11. Transaction Protocol By Examples – Prepare Phase

    • Tx1 (Client1): Transfer 20 from 1 to 2
    • Tx2 (Client2): Transfer 10 from 1 to 2
    • Read: each client reads the records in Cassandra into its memory space
        UserID | Balance | Status | Version | TxID
        1      | 100     | C      | 5       | XXX
        2      | 100     | C      | 4       | YYY
    • Prepared records in Client1’s memory space
        UserID | Balance | Status | Version | TxID
        1      | 80      | P      | 6       | Tx1
        2      | 120     | P      | 5       | Tx1
    • Prepared records in Client2’s memory space
        UserID | Balance | Status | Version | TxID
        1      | 90      | P      | 6       | Tx2
        2      | 110     | P      | 5       | Tx2
    • Atomic conditional update (LWT): Update only if the version is the version I read
    • The conditional update that arrives second fails due to the condition mismatch
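The version-conditioned prepare can be sketched in memory as follows. This is only an illustration under stated assumptions: `PrepareSketch`, `Rec`, and `prepare` are hypothetical names, and `ConcurrentHashMap.computeIfPresent` stands in for Cassandra's LWT as the atomic compare-and-update primitive. It is not the Scalar DB API.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** In-memory stand-in for a user table. Not the Scalar DB or Cassandra API. */
final class PrepareSketch {

    /** One record image: application data plus transaction metadata. */
    static final class Rec {
        final int balance;
        final char status; // 'C' = committed, 'P' = prepared
        final int version;
        final String txId;

        Rec(int balance, char status, int version, String txId) {
            this.balance = balance;
            this.status = status;
            this.version = version;
            this.txId = txId;
        }
    }

    final ConcurrentMap<Integer, Rec> table = new ConcurrentHashMap<>();

    /**
     * Writes a prepared image only if the stored version still equals the
     * version this transaction read. Returns true if the update was applied.
     */
    boolean prepare(int userId, int readVersion, int newBalance, String txId) {
        boolean[] applied = {false};
        table.computeIfPresent(userId, (id, current) -> {
            if (current.version != readVersion) {
                return current; // condition mismatch: leave the record untouched
            }
            applied[0] = true;
            return new Rec(newBalance, 'P', current.version + 1, txId);
        });
        return applied[0];
    }
}
```

With the slide's data, Tx1 preparing user 1 against version 5 succeeds and moves the record to version 6, so Tx2's prepare against version 5 is rejected.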
  12. Transaction Protocol By Examples – Commit Phase 1

    • Client1 with Tx1: Atomic conditional update (LWT), Update if the TxID does not exist
    • User records in Cassandra
        UserID | Balance | Status | Version | TxID
        1      | 80      | P      | 6       | Tx1
        2      | 120     | P      | 5       | Tx1
    • Status records in Cassandra
        TxID | Status
        XXX  | C
        YYY  | C
        ZZZ  | A
        Tx1  | C
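"Update if the TxID does not exist" is an insert-if-absent on the status table. A minimal sketch, assuming an in-memory map in place of the coordinator table and `putIfAbsent` in place of the LWT (`CoordinatorSketch` and `decide` are hypothetical names):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** In-memory stand-in for the coordinator (status) table. */
final class CoordinatorSketch {
    enum Status { COMMITTED, ABORTED }

    private final ConcurrentMap<String, Status> statusByTxId = new ConcurrentHashMap<>();

    /**
     * Writes the proposed status only if the TxID has no status yet.
     * Returns the status that holds for the TxID after the call.
     */
    Status decide(String txId, Status proposed) {
        Status existing = statusByTxId.putIfAbsent(txId, proposed);
        return existing == null ? proposed : existing;
    }
}
```

Whoever writes first fixes the outcome: a later attempt to write a different status for the same TxID gets the existing status back instead of overwriting it.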
  13. Transaction Protocol By Examples – Commit Phase 2

    • Client1 with Tx1: Atomic conditional update (LWT), Update status if the record is prepared by the TxID
    • User records in Cassandra
        UserID | Balance | Status | Version | TxID
        1      | 80      | C      | 6       | Tx1
        2      | 120     | C      | 5       | Tx1
    • Status records in Cassandra
        TxID | Status
        XXX  | C
        YYY  | C
        ZZZ  | A
        Tx1  | C
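The condition "update status if the record is prepared by the TxID" can be sketched the same way. The names `CommitRecordsSketch`, `Rec`, and `commitRecord` are hypothetical, and `computeIfPresent` again stands in for the LWT.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** In-memory stand-in for a user table during commit phase 2. */
final class CommitRecordsSketch {

    static final class Rec {
        final int balance;
        final char status; // 'C' = committed, 'P' = prepared
        final int version;
        final String txId;

        Rec(int balance, char status, int version, String txId) {
            this.balance = balance;
            this.status = status;
            this.version = version;
            this.txId = txId;
        }
    }

    final ConcurrentMap<Integer, Rec> table = new ConcurrentHashMap<>();

    /** Flips the record from prepared to committed only if txId still owns it. */
    boolean commitRecord(int userId, String txId) {
        boolean[] applied = {false};
        table.computeIfPresent(userId, (id, current) -> {
            if (current.status != 'P' || !current.txId.equals(txId)) {
                return current; // already committed, or prepared by another transaction
            }
            applied[0] = true;
            return new Rec(current.balance, 'C', current.version, current.txId);
        });
        return applied[0];
    }
}
```

The condition matters when a slow committer arrives late: if another transaction has meanwhile prepared the same record, the late `commitRecord` call for the old TxID is rejected and the newer prepared image is left intact.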
  14. Failure Handling by Examples

    • If TX1 fails before prepare phase
      – Just clear the memory space for TX1
    • If TX1 fails after prepare phase and before commit phase 1 (no status is written in Status table)
      – Another transaction (TX3) reads the records and notices that the records are prepared and there is no status for TX1
      – TX3 tries to abort TX1 (TX3 tries to write ABORTED to Status with TX1’s TXID and rolls back the records)
      – TX1 might be on its way to commit status, but only one can win, not both
    • If TX1 fails (right) after commit phase 1
      – Another transaction (TX3) tries to commit the records (rollforward) on behalf of TX1 when TX3 reads the same records as TX1
      – TX1 might be on its way to commit records, but only one can win, not both
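The recovery cases above can be sketched as a read path that resolves a prepared record before returning it. This is a simplified illustration, not Scalar DB's implementation: `LazyRecoverySketch`, `Rec`, and `readBalance` are hypothetical, the record keeps only a balance and TxID in its before image, the reader aborts immediately rather than waiting, and in-memory maps replace the Cassandra tables.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** In-memory sketch of lazy recovery performed by a reader (TX3 on the slide). */
final class LazyRecoverySketch {
    enum Status { COMMITTED, ABORTED }

    static final class Rec {
        final int balance;
        final char status; // 'C' = committed, 'P' = prepared
        final String txId;
        final int beforeBalance;  // before image: application data
        final String beforeTxId;  // before image: transaction metadata

        Rec(int balance, char status, String txId, int beforeBalance, String beforeTxId) {
            this.balance = balance;
            this.status = status;
            this.txId = txId;
            this.beforeBalance = beforeBalance;
            this.beforeTxId = beforeTxId;
        }
    }

    final ConcurrentMap<String, Status> coordinator = new ConcurrentHashMap<>();
    final ConcurrentMap<Integer, Rec> table = new ConcurrentHashMap<>();

    /** Reads a balance; assumes the record exists. Prepared records are resolved first. */
    int readBalance(int userId) {
        Rec rec = table.get(userId);
        if (rec.status == 'P') {
            // Try to abort the owner. If a status already exists, putIfAbsent
            // leaves it in place and returns it, so only one decision can win.
            Status existing = coordinator.putIfAbsent(rec.txId, Status.ABORTED);
            Status decided = existing == null ? Status.ABORTED : existing;
            Rec resolved = decided == Status.COMMITTED
                    ? new Rec(rec.balance, 'C', rec.txId, rec.beforeBalance, rec.beforeTxId)
                    : new Rec(rec.beforeBalance, 'C', rec.beforeTxId, rec.beforeBalance, rec.beforeTxId);
            // Conditional write: applied only if the record is still the image we read.
            table.replace(userId, rec, resolved);
            rec = table.get(userId);
        }
        return rec.balance;
    }
}
```

A record prepared by a transaction with no status is rolled back to its before image, while a record whose transaction is already COMMITTED in the coordinator map is rolled forward.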
  15. Benchmark Results

    • Charts: Workload1 (Payment), Workload2 (Evidence)
    • Each node: i3.4xlarge (16 vCPUs, 122 GB RAM, 1900 GB NVMe SSD * 2), RF: 3
    • Achieved 90% scalability in a 100-node cluster (compared to the ideal TPS based on the performance of a 3-node cluster)
  16. Verification Results

    • Scalar DB has been heavily tested with Jepsen and our destructive tools
      – Note that Jepsen tests are created and conducted by Scalar
    • It has passed both tests for a long time
    • See https://github.com/scalar-labs/scalar-jepsen for more detail
    • Badge: Jepsen Passed
  17. Other Contributions for Apache Cassandra from Scalar

    • GroupCommitlogService (Yuji Ito)
      – Group multiple commitlog writes at once
      – CASSANDRA-13530
    • Jepsen tests for Cassandra (Yuji Ito, Craig Pastro)
      – Maintain with the latest Jepsen
      – Rewrite with Alia clojure driver
      – https://github.com/scalar-labs/scalar-jepsen
    • Cassy
      – A simple and integrated backup tool
      – Just released under Apache 2
      – https://github.com/scalar-labs/cassy
  18. Cassy: A simple and integrated backup tool

    • Required to take a transactionally consistent backup
  19. Future Work

    • DataStax driver 4.x support
    • Cassandra 4.x support
      – Hopefully nothing needs to be done
    • Integration with other C*-compatible databases
      – Scylla DB (waiting for LWT)
      – Cosmos DB (waiting for LWT)
    • (HBase adapter)
  20. Optimization

    • Prepare in deterministic order
      – => First prepare always wins
    • TX1: updating K1 and K2 and K3
    • TX2: updating K2 and K3 and K4
    • H: Consistent hashing (diagram labels: H(K2), H(K1), H(K3), H(K4))
    • Always prepare in this order!!! (Otherwise, all Txs might abort. Ex. TX1 prepares K1, K2 and TX2 prepares K4, K3 in this order)
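Deterministic prepare ordering amounts to sorting each transaction's write set by the same key function before preparing. The sketch below assumes `String.hashCode` as a stand-in for the hash H on the slide (the actual hash is not specified here) and breaks ties by the key itself; `PrepareOrderSketch` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

/** Sketch: order a write set by hash so every transaction prepares keys in the same order. */
final class PrepareOrderSketch {

    /** Stand-in for H on the slide (assumption: any fixed hash gives a fixed order). */
    static int h(String key) {
        return key.hashCode();
    }

    /** Returns the keys sorted by hash, then by key to break hash ties. */
    static List<String> prepareOrder(Collection<String> writeSet) {
        List<String> ordered = new ArrayList<>(writeSet);
        ordered.sort((a, b) -> {
            int byHash = Integer.compare(h(a), h(b));
            return byHash != 0 ? byHash : a.compareTo(b);
        });
        return ordered;
    }
}
```

Because both transactions sort with the same function, the keys they share (K2 and K3 in the slide's example) are always prepared in the same relative order, so whichever transaction prepares the first shared key wins and the other fails early instead of both aborting.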
  21. Snapshot Isolation

    • Strong isolation level, but weaker than Serializable
      – Similar to “MVCC”
      – Oracle’s strictest isolation level (it’s called “Serializable”)
    • Reads only see a snapshot (=> non-blocking reads)
    • Mostly strong enough, but some anomalies remain
  22. Anomalies in Snapshot Isolation

    • Write Skew, Read-Only Transaction
    • Write skew example:
      – Account balances: X and Y (assume family account)
      – Initial state: X=70, Y=80
      – Constraint: X + Y > 0
      – TX1: X = X - 100, TX2: Y = Y - 100
      – H: R1(X0, 70) R2(X0, 70) R1(Y0, 80) R2(Y0, 80) W1(X1, -30) C1 W2(Y2, -20) C2
    • TX1 sees X = 70 - 100 = -30 and Y = 80: OK for the constraint
    • TX2 sees X = 70 and Y = 80 - 100 = -20: OK for the constraint
    • Both updates succeed without conflict in Snapshot Isolation, leaving X = -30 and Y = -20, so X + Y = -50 violates the constraint
  23. Serializable Support

    • Convert all reads into writes (writing the same value) in a transaction
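A small simulation can show why converting reads into writes removes write skew. It replays the deck's family-account example (X=70, Y=80, constraint X + Y > 0) against a toy versioned store. All names (`WriteSkewSketch`, `Tx`, `runWriteSkew`) are hypothetical, conflict detection is a plain version check rather than Scalar DB's protocol, and the sketch assumes every written key was read first.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy versioned store replaying the write skew example with and without reads-as-writes. */
final class WriteSkewSketch {
    private final Map<String, Integer> values = new HashMap<>();
    private final Map<String, Integer> versions = new HashMap<>();

    void put(String key, int value) {
        values.put(key, value);
        versions.put(key, 0);
    }

    int get(String key) {
        return values.get(key);
    }

    Tx begin() {
        return new Tx();
    }

    final class Tx {
        private final Map<String, Integer> readVersions = new HashMap<>();
        private final Map<String, Integer> readValues = new HashMap<>();
        private final Map<String, Integer> writes = new HashMap<>();

        int read(String key) {
            readVersions.put(key, versions.get(key));
            readValues.put(key, values.get(key));
            return readValues.get(key);
        }

        void write(String key, int value) {
            writes.put(key, value);
        }

        /** Commits if no key in the (possibly extended) write set changed since it was read. */
        boolean commit(boolean readsAsWrites) {
            if (readsAsWrites) {
                // Convert reads into writes: write back the value that was read.
                for (String key : readValues.keySet()) {
                    writes.putIfAbsent(key, readValues.get(key));
                }
            }
            for (String key : writes.keySet()) {
                if (!versions.get(key).equals(readVersions.get(key))) {
                    return false; // the key changed after this transaction read it
                }
            }
            for (Map.Entry<String, Integer> w : writes.entrySet()) {
                values.put(w.getKey(), w.getValue());
                versions.merge(w.getKey(), 1, Integer::sum);
            }
            return true;
        }
    }

    /** Runs history R1(X) R2(X) R1(Y) R2(Y) W1(X) C1 W2(Y) C2 and returns the final X + Y. */
    static int runWriteSkew(boolean readsAsWrites) {
        WriteSkewSketch db = new WriteSkewSketch();
        db.put("X", 70);
        db.put("Y", 80);
        Tx tx1 = db.begin();
        Tx tx2 = db.begin();
        int x1 = tx1.read("X");
        int x2 = tx2.read("X");
        int y1 = tx1.read("Y");
        int y2 = tx2.read("Y");
        if ((x1 - 100) + y1 > 0) {
            tx1.write("X", x1 - 100);
        }
        tx1.commit(readsAsWrites);
        if (x2 + (y2 - 100) > 0) {
            tx2.write("Y", y2 - 100);
        }
        tx2.commit(readsAsWrites);
        return db.get("X") + db.get("Y");
    }
}
```

With `readsAsWrites` set to false, both transactions commit because their write sets are disjoint, and the final sum is negative. With it set to true, TX1's commit also bumps the version of Y, so TX2's check on X fails and TX2 aborts, keeping the constraint intact.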
  24. Protocol Correction

    • Commit records non-atomically (without a condition)
      – => NO!!!
    • Someone else may have already done it, and may have started and prepared a new transaction
    • Timeline on the slide:
      – TX1: Start and Prepare A; Commit Status
      – TX2: Read A and commit A on behalf of TX1; Start and Prepare A; Commit Status; Commit A
      – TX1: Commit A (without knowing that A with TX1 is already committed and that A has been overwritten by the new TX2)