Slide 1

Slide 1 text

Transaction Management on Cassandra
10 Sep, 2019 at ApacheCon NA 2019
Hiroyuki Yamada, CTO/CEO at Scalar, Inc.

Slide 2

Slide 2 text

© 2019 Scalar, inc.
Who am I?
• Hiroyuki Yamada
  – Passionate about Database Systems and Distributed Systems
  – Ph.D. in Computer Science, the University of Tokyo
  – IIS the University of Tokyo, Yahoo! Japan, IBM Japan

Slide 3

Slide 3 text

Agenda
• What is Scalar DB
• Transaction Management on Cassandra
• Benchmark and Verification Results

Slide 4

Slide 4 text

Maybe You Don’t Need ACID Transactions
• ACID transactions are heavyweight
  – Especially when data is distributed
• One alternative:
  – Make operations idempotent and retry them, if atomicity is not required
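The idempotent-retry alternative can be sketched as follows. This is a minimal illustration, not Scalar DB or Cassandra code: `IdempotentRetrySketch`, its in-memory map, and the simulated "lost acknowledgement" failure are all hypothetical. The point is that an absolute assignment can be retried safely, because applying it twice leaves the same state as applying it once.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory store used only to illustrate idempotent retries.
class IdempotentRetrySketch {
    private final Map<String, Integer> store = new HashMap<>();
    private int lostAcks; // number of upcoming writes whose acknowledgement is "lost"

    IdempotentRetrySketch(int lostAcks) {
        this.lostAcks = lostAcks;
    }

    // Absolute assignment: applying it twice leaves the same state as applying it once.
    void put(String key, int value) {
        store.put(key, value);
        if (lostAcks > 0) {
            lostAcks--;
            throw new IllegalStateException("write applied, but acknowledgement lost");
        }
    }

    Integer get(String key) {
        return store.get(key);
    }

    // Retries the operation up to maxAttempts times and returns the number of attempts used.
    static int retry(int maxAttempts, Runnable op) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                op.run();
                return attempt;
            } catch (RuntimeException e) {
                if (attempt == maxAttempts) {
                    throw e; // give up after the last attempt
                }
            }
        }
    }
}
```

An increment such as "add 20 to the balance" would not be safe to retry this way, which is where transactions come back into the picture.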

Slide 5

Slide 5 text

What is Scalar DB
• A library that makes non-ACID distributed databases ACID-compliant
  – Cassandra is the first supported distributed database
• https://github.com/scalar-labs/scalardb
• Figure labels: Transaction management, Recovery management, Java API

Slide 6

Slide 6 text

System Architecture with Scalar DB and Cassandra
• Achieves one-copy Serializable Transactions
• Figure labels: End Users; HTTP; Command execution; Web Applications; Client programs; Scalar DB; DataStax Java Driver; Cassandra nodes
• Example client code using Scalar DB:

    Key key = new Key(new TextValue("id", "1"));
    Result result = db.get(new Get(key));
    // do something with result
    Put put = new Put(key).(…);
    db.put(put);

Slide 7

Slide 7 text

Key Characteristics
• Non-invasive approach
  – No modifications to the underlying database are required
• High availability
  – Available as long as a quorum of replicas is up
  – C*'s high availability is fully sustained by the client-coordinated approach
• Horizontal scalability
  – Throughput scales linearly
  – C*'s high scalability is fully sustained by the client-coordinated approach
• Strong Consistency
  – Replicas updated by transactions are always consistent and up-to-date

Slide 8

Slide 8 text

Transaction Management on Cassandra - Introduction
• Based on the Cherry Garcia protocol [ICDE’15]
  – Requires only a minimal set of features, such as linearizable conditional updates and the ability to store metadata
• Scalar DB is one application of the protocol
  – Uses LWT for linearizability
  – Manages transaction metadata in the user record space
• Implements enhancements
  – Protocol correction
  – No use of the TrueTime API
  – Serializable support (SI is the default isolation level)

Slide 9

Slide 9 text

Transaction Metadata Management
• WAL (Write-Ahead Logging) records are distributed
• User/Application Record in user tables
  – After image: Application data (managed by users); Transaction metadata (managed by Scalar DB): TxID, Status, Version
  – Before image: Application data (before); Transaction metadata (before): TxID (before), Status (before), Version (before)
• Status Record in coordinator table
  – TxID, Status, Other metadata
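The record layout above can be sketched as a small Java class. This is a rough illustration under stated assumptions: the class name `TxRecordSketch`, the single `balance` field standing in for application data, and the nested `before` object are all hypothetical (the slide shows the before image as extra columns on the same record, and Scalar DB's real column names may differ).

```java
// Hypothetical model of a user record carrying both an after image and a before image.
final class TxRecordSketch {
    enum Status { PREPARED, COMMITTED }

    // After image: application data plus transaction metadata.
    final int balance;
    final String txId;
    final Status status;
    final int version;
    // Before image: the previous image, kept so the record can be rolled back (null if none).
    final TxRecordSketch before;

    TxRecordSketch(int balance, String txId, Status status, int version, TxRecordSketch before) {
        this.balance = balance;
        this.txId = txId;
        this.status = status;
        this.version = version;
        this.before = before;
    }

    // Prepare: write the new value and keep the current image as the before image.
    TxRecordSketch prepare(int newBalance, String newTxId) {
        TxRecordSketch beforeImage = new TxRecordSketch(balance, txId, status, version, null);
        return new TxRecordSketch(newBalance, newTxId, Status.PREPARED, version + 1, beforeImage);
    }

    // Roll forward: the owning transaction committed, so mark the after image committed.
    TxRecordSketch rollForward() {
        return new TxRecordSketch(balance, txId, Status.COMMITTED, version, null);
    }

    // Roll back: the owning transaction aborted, so restore the before image.
    TxRecordSketch rollBack() {
        return before;
    }
}
```

Keeping the before image next to the data is what lets any reader undo or finish another client's work without a separate log.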

Slide 10

Slide 10 text

Transaction Protocol - Overview
• Optimistic concurrency control
• Similar to the two-phase commit protocol
  – Prepare phase: prepare records
  – Commit phase 1: commit status
    – This is where a transaction is regarded as committed or aborted in normal cases
  – (Commit phase 2: commit records)
• Lazy recovery
  – Uncommitted records are rolled forward or rolled back, based on the status of the transaction, when the records are read

Slide 11

Slide 11 text

Transaction Protocol By Examples – Prepare Phase
• Tx1 (Client1): Transfer 20 from 1 to 2
• Tx2 (Client2): Transfer 10 from 1 to 2
• Steps: Read, then Atomic conditional update (LWT): update only if the version is the version I read; otherwise fail due to the condition mismatch
• Status: C = committed, P = prepared

Client1's memory space
UserID  Balance  Status  Version  TxID
1       100      C       5        XXX
2       100      C       4        YYY
1       80       P       6        Tx1
2       120      P       5        Tx1

Client2's memory space
UserID  Balance  Status  Version  TxID
1       100      C       5        XXX
2       100      C       4        YYY
1       90       P       6        Tx2
2       110      P       5        Tx2

Cassandra
UserID  Balance  Status  Version  TxID
1       100      C       5        XXX
2       100      C       4        YYY
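The prepare phase in this example can be sketched with an in-memory table. This is only an illustration: `PrepareSketch`, `Row`, and `prepareIfVersion` are hypothetical names, and a synchronized Java method stands in for the atomic conditional update that an LWT would provide in Cassandra.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for a user table; not the Scalar DB or Cassandra API.
class PrepareSketch {
    static final class Row {
        final int balance;
        final char status; // 'C' = committed, 'P' = prepared
        final int version;
        final String txId;

        Row(int balance, char status, int version, String txId) {
            this.balance = balance;
            this.status = status;
            this.version = version;
            this.txId = txId;
        }
    }

    private final Map<Integer, Row> table = new HashMap<>();

    synchronized void load(int userId, Row row) {
        table.put(userId, row);
    }

    synchronized Row read(int userId) {
        return table.get(userId);
    }

    // Writes the prepared row only if the stored version still equals the version the client read.
    synchronized boolean prepareIfVersion(int userId, int readVersion, int newBalance, String txId) {
        Row current = table.get(userId);
        if (current == null || current.version != readVersion) {
            return false; // condition mismatch: someone else updated the record first
        }
        table.put(userId, new Row(newBalance, 'P', readVersion + 1, txId));
        return true;
    }
}
```

With the rows from the slide, Tx1's prepare of user 1 (read version 5) succeeds and moves the record to version 6, so Tx2's prepare against version 5 then fails.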

Slide 12

Slide 12 text

Transaction Protocol By Examples – Commit Phase 1
• Client1 with Tx1: Atomic conditional update (LWT) on the status table: update if the TxID does not exist

Cassandra (user records)
UserID  Balance  Status  Version  TxID
1       80       P       6        Tx1
2       120      P       5        Tx1

Cassandra (status records)
Status  TxID
C       XXX
C       YYY
A       ZZZ
C       Tx1
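Commit phase 1 amounts to an insert-if-absent on the status table. A minimal sketch, assuming a hypothetical `CoordinatorSketch` class in which `ConcurrentMap.putIfAbsent` plays the role of the LWT:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// In-memory stand-in for the coordinator (status) table; not the Scalar DB API.
class CoordinatorSketch {
    enum TxStatus { COMMITTED, ABORTED }

    private final ConcurrentMap<String, TxStatus> statusTable = new ConcurrentHashMap<>();

    // Insert-if-absent: returns true only for the first writer of this TxID.
    boolean putStatusIfAbsent(String txId, TxStatus status) {
        return statusTable.putIfAbsent(txId, status) == null;
    }

    // Returns the recorded status, or null if no status has been written yet.
    TxStatus status(String txId) {
        return statusTable.get(txId);
    }
}
```

Because only the first write for a given TxID succeeds, the outcome of a transaction is decided exactly once, whether the writer is the transaction itself or another client trying to abort it.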

Slide 13

Slide 13 text

Transaction Protocol By Examples – Commit Phase 2
• Client1 with Tx1: Atomic conditional update (LWT) on the user records: update the status if the record is prepared by the TxID

Cassandra (user records)
UserID  Balance  Status  Version  TxID
1       80       C       6        Tx1
2       120      C       5        Tx1

Cassandra (status records)
Status  TxID
C       XXX
C       YYY
A       ZZZ
C       Tx1
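The conditional commit of records can be sketched in the same in-memory style. The names `CommitRecordsSketch` and `commitIfPreparedBy` are hypothetical, and a synchronized method again stands in for the LWT.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for a user table; not the Scalar DB or Cassandra API.
class CommitRecordsSketch {
    static final class Row {
        final int balance;
        final char status; // 'C' = committed, 'P' = prepared
        final int version;
        final String txId;

        Row(int balance, char status, int version, String txId) {
            this.balance = balance;
            this.status = status;
            this.version = version;
            this.txId = txId;
        }
    }

    private final Map<Integer, Row> table = new HashMap<>();

    synchronized void load(int userId, Row row) {
        table.put(userId, row);
    }

    synchronized Row read(int userId) {
        return table.get(userId);
    }

    // Flips the status to committed only if the record is still prepared by the given TxID.
    synchronized boolean commitIfPreparedBy(int userId, String txId) {
        Row r = table.get(userId);
        if (r == null || r.status != 'P' || !r.txId.equals(txId)) {
            return false; // already committed, or now owned by another transaction
        }
        table.put(userId, new Row(r.balance, 'C', r.version, r.txId));
        return true;
    }
}
```

The condition matters when a slow client commits late: if another client has already rolled the record forward and a newer transaction has prepared it, the late commit simply fails instead of overwriting the newer prepared record.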

Slide 14

Slide 14 text

Failure Handling by Examples
• If TX1 fails before the prepare phase
  – Just clear the memory space for TX1
• If TX1 fails after the prepare phase and before commit phase 1 (no status is written in the Status table)
  – Another transaction (TX3) reads the records and notices that they are prepared and that no status exists for TX1
  – TX3 tries to abort TX1 (TX3 tries to write ABORTED to the Status table with TX1’s TxID, then rolls back the records)
  – TX1 might be on its way to committing its status, but only one of the two can win
• If TX1 fails (right) after commit phase 1
  – Another transaction (TX3) commits the records (rolls them forward) on behalf of TX1 when TX3 reads the same records as TX1
  – TX1 might be on its way to committing the records, but only one of the two can win
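The recovery decision a reader such as TX3 makes can be sketched as follows. This is an illustration only: `LazyRecoverySketch` and its `Action` enum are hypothetical, the status table is an in-memory map, and the sketch tries to abort immediately, whereas a real implementation may wait for some time before aborting another transaction (an assumption not covered by the slide).

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch of lazy recovery driven by the status table.
class LazyRecoverySketch {
    enum TxStatus { COMMITTED, ABORTED }
    enum Action { NONE, ROLL_FORWARD, ROLL_BACK }

    private final ConcurrentMap<String, TxStatus> statusTable = new ConcurrentHashMap<>();

    // Insert-if-absent on the status table; true only for the first writer of this TxID.
    boolean putStatusIfAbsent(String txId, TxStatus status) {
        return statusTable.putIfAbsent(txId, status) == null;
    }

    // Called by a reader that finds a record with the given status ('P' or 'C') and owner TxID.
    Action recover(char recordStatus, String ownerTxId) {
        if (recordStatus != 'P') {
            return Action.NONE; // committed records need no recovery
        }
        // Try to abort the owner; this has no effect if a status already exists.
        statusTable.putIfAbsent(ownerTxId, TxStatus.ABORTED);
        // Whoever wrote the status first decides the outcome.
        return statusTable.get(ownerTxId) == TxStatus.COMMITTED
                ? Action.ROLL_FORWARD
                : Action.ROLL_BACK;
    }
}
```

If the reader's ABORTED write wins, the owner's later attempt to write COMMITTED fails and the records are rolled back; if the owner had already written COMMITTED, the reader rolls the records forward on its behalf.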

Slide 15

Slide 15 text

Benchmark Results
• Figure panels: Workload1 (Payment), Workload2 (Evidence)
• Each node: i3.4xlarge (16 vCPUs, 122 GB RAM, 2 × 1900 GB NVMe SSD), RF: 3
• Achieved 90% scalability in a 100-node cluster (compared to the ideal TPS based on the performance of a 3-node cluster)

Slide 16

Slide 16 text

Verification Results
• Scalar DB has been heavily tested with Jepsen and our own destructive tools
  – Note that the Jepsen tests are created and conducted by Scalar
• It has passed both kinds of tests for a long time
• See https://github.com/scalar-labs/scalar-jepsen for more details

Slide 17

Slide 17 text

Other Contributions for Apache Cassandra from Scalar
• GroupCommitlogService
  – Yuji Ito
  – Group multiple commitlog writes at once
  – CASSANDRA-13530
• Jepsen tests for Cassandra
  – Yuji Ito, Craig Pastro
  – Maintain with the latest Jepsen
  – Rewrite with Alia Clojure driver
  – https://github.com/scalar-labs/scalar-jepsen
• Cassy
  – A simple and integrated backup tool
  – Just released under Apache 2
  – https://github.com/scalar-labs/cassy

Slide 18

Slide 18 text

Cassy: A simple and integrated backup tool
• Required to take a transactionally consistent backup

Slide 19

Slide 19 text

Future Work
• DataStax driver 4.x support
• Cassandra 4.x support
  – Hopefully nothing needs to be done
• Integration with other C*-compatible databases
  – Scylla DB (waiting for LWT)
  – Cosmos DB (waiting for LWT)
• (HBase adapter)

Slide 20

Slide 20 text

Questions?

Slide 21

Slide 21 text

Optimization
• Prepare in deterministic order
  – => First prepare always wins
• Example (H: consistent hashing)
  – TX1: updating K1, K2, and K3
  – TX2: updating K2, K3, and K4
  – Figure labels: H(K2), H(K1), H(K3), H(K4); K1 K2 K3 (TX1); K2 K3 K4 (TX2)
  – Always prepare in this order! Otherwise, all transactions might abort (e.g., TX1 prepares K1, K2 and TX2 prepares K4, K3, in this order)
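Deterministic prepare ordering can be sketched by sorting each transaction's write set with one fixed comparator. This is an assumption-laden illustration: `PrepareOrderSketch` is a hypothetical class, and `String.hashCode` merely stands in for the consistent hash H from the slide; any total order shared by all clients would serve the sketch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper that puts a write set into a globally agreed prepare order.
class PrepareOrderSketch {
    // Stand-in for the consistent hash H; not the hash Cassandra or Scalar DB actually uses.
    static int h(String key) {
        return key.hashCode();
    }

    // Returns the keys sorted by hash, with the key itself as a tie-breaker.
    static List<String> prepareOrder(List<String> writeSet) {
        List<String> ordered = new ArrayList<>(writeSet);
        Comparator<String> byHash = Comparator.comparingInt(PrepareOrderSketch::h);
        ordered.sort(byHash.thenComparing(Comparator.<String>naturalOrder()));
        return ordered;
    }
}
```

Since every client sorts with the same comparator, two transactions that share keys reach their first shared key in the same relative order, so whichever prepares it first can go on to prepare the rest instead of both aborting each other.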

Slide 22

Slide 22 text

Snapshot Isolation
• Strong isolation level, but weaker than Serializable
  – Similar to “MVCC”
  – Oracle’s strictest isolation level (which Oracle calls “Serializable”)
• Reads only see a snapshot (=> non-blocking reads)
• Mostly strong enough, but some anomalies remain

Slide 23

Slide 23 text

Anomalies in Snapshot Isolation
• Write Skew, Read-Only Transaction
• Write skew example:
  – Account balances: X and Y (assume a family account)
  – Initial state: X=70, Y=80
  – Constraint: X + Y > 0
  – TX1: X = X - 100, TX2: Y = Y - 100
  – H: R1(X0, 70) R2(X0, 70) R1(Y0, 80) R2(Y0, 80) W1(X1, -30) C1 W2(Y2, -20) C2
• Figure: starting from X0=70, Y0=80, TX1 sees X = 70-100 -> -30 and Y = 80 (OK for the constraint), while TX2 sees X = 70 and Y = 80-100 -> -20 (OK for the constraint). Both updates succeed without conflict in Snapshot Isolation, leaving X=-30 and Y=-20, which violates the constraint.
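The arithmetic of the write skew example can be replayed in a few lines. The class `WriteSkewSketch` is hypothetical and models no real database; it only shows that each transaction validates the constraint against its own snapshot while the combined result breaks it.

```java
// Replays the write skew history from the slide with plain arithmetic.
class WriteSkewSketch {
    // Constraint from the slide: X + Y > 0.
    static boolean constraintHolds(int x, int y) {
        return x + y > 0;
    }

    // Runs TX1 and TX2 against the same snapshot and returns the final {X, Y}.
    static int[] runUnderSnapshotIsolation() {
        final int x0 = 70;
        final int y0 = 80; // snapshot seen by both transactions

        // TX1: X = X - 100, checked against TX1's own view (new X, old Y).
        int x1 = x0 - 100;
        if (!constraintHolds(x1, y0)) {
            throw new IllegalStateException("TX1 would violate the constraint");
        }

        // TX2: Y = Y - 100, checked against TX2's own view (old X, new Y).
        int y2 = y0 - 100;
        if (!constraintHolds(x0, y2)) {
            throw new IllegalStateException("TX2 would violate the constraint");
        }

        // The write sets {X} and {Y} are disjoint, so neither commit sees a conflict.
        return new int[] { x1, y2 };
    }
}
```

Both checks pass (50 > 0 in each view), yet the final state sums to -50.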

Slide 24

Slide 24 text

Serializable Support
• Convert all reads in a transaction into writes (writing the same value)
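Why converting reads into writes removes write skew can be sketched with a version check per key. Everything here is hypothetical (`ReadsAsWritesSketch`, the `readsAsWrites` flag, per-key integer versions); it is a simplified model of optimistic validation, not Scalar DB's implementation.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Simplified optimistic validation: a commit succeeds only if every key it writes is unchanged.
class ReadsAsWritesSketch {
    private final Map<String, Integer> versions = new HashMap<>();

    ReadsAsWritesSketch(String... keys) {
        for (String k : keys) {
            versions.put(k, 0);
        }
    }

    static Set<String> keys(String... ks) {
        return new HashSet<>(Arrays.asList(ks));
    }

    // Copy of the current versions, taken when a transaction starts.
    Map<String, Integer> snapshot() {
        return new HashMap<>(versions);
    }

    // With readsAsWrites, the read set is validated and rewritten like the write set.
    boolean commit(Map<String, Integer> snapshot, Set<String> reads, Set<String> writes,
                   boolean readsAsWrites) {
        Set<String> effective = new HashSet<>(writes);
        if (readsAsWrites) {
            effective.addAll(reads);
        }
        for (String k : effective) {
            if (!Objects.equals(versions.get(k), snapshot.get(k))) {
                return false; // another transaction wrote this key after the snapshot
            }
        }
        for (String k : effective) {
            versions.merge(k, 1, Integer::sum);
        }
        return true;
    }
}
```

In the write skew example, TX1 reads X and Y and writes X, while TX2 reads X and Y and writes Y. Without the conversion both commits pass; with it, TX1's commit also bumps Y's version, so TX2's validation of Y fails and TX2 must retry.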

Slide 25

Slide 25 text

Protocol Correction
• Commit records non-atomically?
  – => NO!!!
• Someone else may have already committed them and then started and prepared a new transaction
• Figure (timeline):
  – TX1: Start and Prepare A; Commit Status; Commit A (without knowing that A was already committed for TX1 and has since been overwritten by the new TX2)
  – TX2: Read A and commit A on behalf of TX1; Start and Prepare A; Commit Status; Commit A