Slide 1

Slide 1 text

© 2020 Scalar, Inc.
Transaction Management on Cassandra
2 Feb, 2021 at the Cassandra Study Group (Cassandra勉強会)
Hiroyuki Yamada, CTO/CEO at Scalar, Inc.

Slide 2

Slide 2 text

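Cassandra @ Scalar
• Scalar: a company building next-generation databases with a focus on reliability in the broad sense
  – Vision: World’s Reliable Database
• As one of our initiatives, we extend and improve Cassandra
  – New functionality: developed Scalar DB, a library that makes Cassandra ACID-compliant
  – Performance: developed a new commit log mode (Group CommitLog)
  – Stability: improved and ran the Jepsen tests for LWT
• Presented at ApacheCon 2019 and 2020 (Cassandra track)
  – “Transaction Management on Cassandra”
  – “Making Cassandra more capable, faster, and more reliable”
• Today's talk introduces Scalar DB, which makes Cassandra ACID-compliant, as presented at ApacheCon
  – The ApacheCon slides are presented as-is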

Slide 3

Slide 3 text

What is Scalar DB
• A universal transaction manager
  – A Java library that makes non-ACID databases ACID-compliant
  – The architecture is inspired by Deuteronomy [CIDR’09,11]
• Cassandra is the first supported database
https://github.com/scalar-labs/scalardb

Slide 4

Slide 4 text

Why ACID Transactions with Cassandra? Why with Scalar DB?
• ACID is a must-have feature in some mission-critical applications
  – C* is increasingly used for such applications
  – C* is one of the major open-source distributed databases
• Modifying C* carries significant risk and burden
  – Scalar DB enables ACID transactions without modifying C* at all, since it depends only on the exposed APIs
  – No risk of breaking the existing code

Slide 5

Slide 5 text

Pros and Cons in Scalar DB on Cassandra
Pros
• Non-invasive
  – No modifications in C*
• High availability and scalability
  – C* properties are fully sustained by the client-coordinated approach
• Flexible deployment
  – Transaction layer and storage layer can be scaled independently
Cons
• Slower than NewSQLs
  – More abstraction layers and a storage-oblivious transaction manager
• Hard to optimize
  – The transaction manager has little information about the storage
• No CQL support
  – A transaction has to be written procedurally in a programming language

Slide 6

Slide 6 text

Programming Interface and System Architecture
• CRUD interface
  – put, get, scan, delete
• Begin and commit semantics
  – An arbitrary number of operations can be handled
• Client-coordinated
  – Transaction code runs in the library
  – No middleware needs to be managed

    DistributedTransactionManager manager = …;
    DistributedTransaction transaction = manager.start();
    Get get = createGet();
    Optional<Result> result = transaction.get(get);
    Put put = createPut(result);
    transaction.put(put);
    transaction.commit();

Architecture diagram: Client programs / Web applications (with Scalar DB) -> Command execution / HTTP -> Cassandra

Slide 7

Slide 7 text

Data Model
• Multi-dimensional map [OSDI’06]
  – (partition-key, clustering-key, value-name) -> value-content
  – Assumed to be hash partitioned
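To make the mapping concrete, here is a minimal in-memory sketch in Java. The class and method names are hypothetical and not part of Scalar DB. It assumes string keys and values, and that clustering keys are kept sorted within a partition.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

/** Hypothetical model of (partition-key, clustering-key, value-name) -> value-content. */
final class MultiDimensionalMap {
    // Outer map: one entry per partition key (stands in for hash partitioning).
    // Middle map: clustering keys, kept sorted within the partition (assumption).
    // Inner map: value-name -> value-content.
    private final Map<String, TreeMap<String, Map<String, String>>> partitions = new HashMap<>();

    void put(String partitionKey, String clusteringKey, String valueName, String valueContent) {
        partitions
            .computeIfAbsent(partitionKey, k -> new TreeMap<>())
            .computeIfAbsent(clusteringKey, k -> new HashMap<>())
            .put(valueName, valueContent);
    }

    Optional<String> get(String partitionKey, String clusteringKey, String valueName) {
        TreeMap<String, Map<String, String>> partition = partitions.get(partitionKey);
        if (partition == null) {
            return Optional.empty();
        }
        Map<String, String> values = partition.get(clusteringKey);
        if (values == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(values.get(valueName));
    }

    /** Returns the clustering keys of one partition in sorted order. */
    List<String> scan(String partitionKey) {
        List<String> keys = new ArrayList<>();
        TreeMap<String, Map<String, String>> partition = partitions.get(partitionKey);
        if (partition != null) {
            keys.addAll(partition.keySet());
        }
        return keys;
    }
}
```

A get needs all three coordinates, while a scan stays inside a single partition, which is why the partition key is the unit of distribution.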

Slide 8

Slide 8 text

Transaction Management - Overview
• Based on Cherry Garcia [ICDE’15]
  – Two-phase commit on linearizable operations (for Atomicity)
    – Protocol correction and TrueTime avoidance are our extended work
  – Distributed WAL records (for Atomicity and Durability)
  – Single-version optimistic concurrency control (for Isolation)
    – Serializability support is our extended work
• Requirements for the underlying databases/storage
  – Linearizable read and linearizable conditional/CAS write
  – An ability to store metadata for each record

Slide 9

Slide 9 text

Transaction Commit Protocol (for Atomicity)
• Two-phase commit protocol on linearizable operations
  – Similar to Paxos Commit [TODS’06]
• The protocol
  – Prepare phase: prepare records
  – Commit phase 1: commit status record
    – This is where a transaction is regarded as committed or aborted
  – Commit phase 2: commit records
• Lazy recovery
  – When uncommitted records are read, they are rolled forward or rolled back based on the status of the transaction

Slide 10

Slide 10 text

Distributed WAL (for Atomicity and Durability)
• WAL (Write-Ahead Logging) is distributed into records

Record in user tables (user/application record)
  – After image
    – Application data (managed by users)
    – Transaction metadata (managed by Scalar DB): Status, Version, TxID
  – Before image
    – Application data (before)
    – Transaction metadata (before): Status (before), Version (before), TxID (before)

Status record in coordinator table
  – TxID, Status, Other metadata
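The record layout above can be sketched as a small Java class. This is only an illustration under stated assumptions: the class and method names are hypothetical, application data is reduced to a single `balance` column, and the before image is held as a nested object rather than as flat "(before)" columns.

```java
/** Hypothetical sketch of a user-table record that carries its own WAL entry. */
final class WalRecord {
    // After image: current application data plus transaction metadata.
    final int balance;
    final char status;      // 'P' = prepared, 'C' = committed
    final int version;
    final String txId;

    // Before image: the state this record had before the latest prepare (null if none).
    final WalRecord before;

    WalRecord(int balance, char status, int version, String txId, WalRecord before) {
        this.balance = balance;
        this.status = status;
        this.version = version;
        this.txId = txId;
        this.before = before;
    }

    /** Prepare: the current state becomes the before image; the new state is marked 'P'. */
    WalRecord prepare(int newBalance, String newTxId) {
        WalRecord beforeImage = new WalRecord(balance, status, version, txId, null);
        return new WalRecord(newBalance, 'P', version + 1, newTxId, beforeImage);
    }
}
```

Because each prepared record keeps its own before image, undo information travels with the data and no separate central log is needed.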

Slide 11

Slide 11 text

Concurrency Control (for Isolation)
• Single-version OCC
  – Simple implementation of Snapshot Isolation
  – Conflicts are detected by linearizable conditional write (LWT)
  – No clock dependency, no use of HLC (Hybrid Logical Clock)
• Supported isolation levels
  – Read-committed Snapshot Isolation (RCSI)
    – Read-skew, write-skew, and read-only anomalies could happen
  – Serializable
    – No anomalies (Strict Serializability)
    – RCSI-based, but non-serializable schedules are aborted

Slide 12

Slide 12 text

Transaction With Example – Prepare Phase
Actors: Client1 (with Client1’s memory space) and Cassandra

Cassandra:
UserID  Balance  Status  Version  TxID
1       100      C       5        XXX
2       100      C       4        YYY

Slide 13

Slide 13 text

Transaction With Example – Prepare Phase
Step: Client1 reads the records from Cassandra (Read)

Cassandra:
UserID  Balance  Status  Version  TxID
1       100      C       5        XXX
2       100      C       4        YYY

Slide 14

Slide 14 text

Transaction With Example – Prepare Phase
Step: Client1 reads the records from Cassandra (Read); the records are now in Client1’s memory space

Client1’s memory space:
UserID  Balance  Status  Version  TxID
1       100      C       5        XXX
2       100      C       4        YYY

Cassandra:
UserID  Balance  Status  Version  TxID
1       100      C       5        XXX
2       100      C       4        YYY

Slide 15

Slide 15 text

Transaction With Example – Prepare Phase
Tx1: Transfer 20 from 1 to 2

Client1’s memory space (records read, followed by the records prepared by Tx1):
UserID  Balance  Status  Version  TxID
1       100      C       5        XXX
2       100      C       4        YYY
1       80       P       6        Tx1
2       120      P       5        Tx1

Cassandra:
UserID  Balance  Status  Version  TxID
1       100      C       5        XXX
2       100      C       4        YYY

Slide 16

Slide 16 text

Transaction With Example – Prepare Phase
Tx1: Transfer 20 from 1 to 2
Step: Conditional write (LWT): update only if the versions and the TxIDs are the same as the ones Client1 read

Client1’s memory space (records read, followed by the records prepared by Tx1):
UserID  Balance  Status  Version  TxID
1       100      C       5        XXX
2       100      C       4        YYY
1       80       P       6        Tx1
2       120      P       5        Tx1

Cassandra:
UserID  Balance  Status  Version  TxID
1       100      C       5        XXX
2       100      C       4        YYY

Slide 17

Slide 17 text

Transaction With Example – Prepare Phase
Tx1: Transfer 20 from 1 to 2
Step: Conditional write (LWT): update only if the versions and the TxIDs are the same as the ones Client1 read; the write succeeds

Client1’s memory space (records read, followed by the records prepared by Tx1):
UserID  Balance  Status  Version  TxID
1       100      C       5        XXX
2       100      C       4        YYY
1       80       P       6        Tx1
2       120      P       5        Tx1

Cassandra (the slide overlays only the metadata columns written by Tx1):
UserID  Balance  Status  Version  TxID
1       100      C -> P  5 -> 6   XXX -> Tx1
2       100      C -> P  4 -> 5   YYY -> Tx1

Slide 18

Slide 18 text

Transaction With Example – Prepare Phase
Tx1 (Client1): Transfer 20 from 1 to 2
Tx2 (Client2): Transfer 10 from 1 to 2
Step: Conditional write (LWT): update only if the versions and the TxIDs are the same as the ones the client read

Client1’s memory space (records read, followed by the records prepared by Tx1):
UserID  Balance  Status  Version  TxID
1       100      C       5        XXX
2       100      C       4        YYY
1       80       P       6        Tx1
2       120      P       5        Tx1

Client2’s memory space (records read, followed by the records prepared by Tx2):
UserID  Balance  Status  Version  TxID
1       100      C       5        XXX
2       100      C       4        YYY
1       90       P       6        Tx2
2       110      P       5        Tx2

Cassandra (the slide overlays only the metadata columns written by Tx1):
UserID  Balance  Status  Version  TxID
1       100      C -> P  5 -> 6   XXX -> Tx1
2       100      C -> P  4 -> 5   YYY -> Tx1

Slide 19

Slide 19 text

Transaction With Example – Prepare Phase
Tx1 (Client1): Transfer 20 from 1 to 2
Tx2 (Client2): Transfer 10 from 1 to 2
Step: Conditional write (LWT): update only if the versions and the TxIDs are the same as the ones the client read
Result: Tx2's conditional write fails due to the condition mismatch

Client1’s memory space (records read, followed by the records prepared by Tx1):
UserID  Balance  Status  Version  TxID
1       100      C       5        XXX
2       100      C       4        YYY
1       80       P       6        Tx1
2       120      P       5        Tx1

Client2’s memory space (records read, followed by the records prepared by Tx2):
UserID  Balance  Status  Version  TxID
1       100      C       5        XXX
2       100      C       4        YYY
1       90       P       6        Tx2
2       110      P       5        Tx2

Cassandra (the slide overlays only the metadata columns written by Tx1):
UserID  Balance  Status  Version  TxID
1       100      C -> P  5 -> 6   XXX -> Tx1
2       100      C -> P  4 -> 5   YYY -> Tx1
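The conflict in this prepare-phase example can be replayed with a small in-memory sketch. Everything here is hypothetical: a synchronized method stands in for Cassandra's LWT, and only the record for user 1 is modeled.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for a table with a linearizable conditional write. */
final class PrepareConflictDemo {
    static final class Row {
        final int balance;
        final char status;   // 'C' = committed, 'P' = prepared
        final int version;
        final String txId;

        Row(int balance, char status, int version, String txId) {
            this.balance = balance;
            this.status = status;
            this.version = version;
            this.txId = txId;
        }
    }

    private final Map<Integer, Row> table = new HashMap<>();

    synchronized void load(int userId, Row row) {
        table.put(userId, row);
    }

    synchronized Row read(int userId) {
        return table.get(userId);
    }

    /** Emulates an LWT: write only if version and TxID still match what the caller read. */
    synchronized boolean writeIfUnchanged(int userId, Row asRead, Row prepared) {
        Row current = table.get(userId);
        boolean unchanged = current != null
                && current.version == asRead.version
                && current.txId.equals(asRead.txId);
        if (unchanged) {
            table.put(userId, prepared);
        }
        return unchanged;
    }

    /** Replays the example for user 1: both clients read, then Tx1 and Tx2 try to prepare. */
    static boolean[] replay() {
        PrepareConflictDemo cassandra = new PrepareConflictDemo();
        cassandra.load(1, new Row(100, 'C', 5, "XXX"));

        Row readByClient1 = cassandra.read(1);
        Row readByClient2 = cassandra.read(1);

        boolean tx1 = cassandra.writeIfUnchanged(1, readByClient1, new Row(80, 'P', 6, "Tx1"));
        boolean tx2 = cassandra.writeIfUnchanged(1, readByClient2, new Row(90, 'P', 6, "Tx2"));
        return new boolean[] {tx1, tx2};
    }
}
```

Tx1's write succeeds because the stored version and TxID (5, XXX) match what Client1 read. Tx2's write then fails because the record now carries (6, Tx1).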

Slide 20

Slide 20 text

Transaction With Example – Commit Phase 1
Actors: Client1 with Tx1, and Cassandra

User table:
UserID  Balance  Status  Version  TxID
1       80       P       6        Tx1
2       120      P       5        Tx1

Coordinator table:
Status  TxID
C       XXX
C       YYY
A       ZZZ

Slide 21

Slide 21 text

Transaction With Example – Commit Phase 1
Actors: Client1 with Tx1, and Cassandra
Step: Conditional write (LWT): write the status record only if the TxID does not exist

User table:
UserID  Balance  Status  Version  TxID
1       80       P       6        Tx1
2       120      P       5        Tx1

Coordinator table (after the write):
Status  TxID
C       XXX
C       YYY
A       ZZZ
C       Tx1
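The "write only if the TxID does not exist" step behaves like an atomic put-if-absent. A minimal sketch follows, using `ConcurrentHashMap.putIfAbsent` as a stand-in for the LWT; the class and method names are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical in-memory coordinator table: the first status written for a TxID wins. */
final class CoordinatorTable {
    enum Status { COMMITTED, ABORTED }

    private final ConcurrentMap<String, Status> statusByTxId = new ConcurrentHashMap<>();

    /**
     * Tries to record the proposed status for the transaction.
     * Returns the status that is actually in effect: the proposed one if the TxID
     * was absent, otherwise the one that was already stored.
     */
    Status decide(String txId, Status proposed) {
        Status existing = statusByTxId.putIfAbsent(txId, proposed);
        return existing == null ? proposed : existing;
    }
}
```

Because only the first write for a given TxID takes effect, every participant sees the same outcome for Tx1, no matter who writes the status record or how often the write is retried.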

Slide 22

Slide 22 text

Transaction With Example – Commit Phase 2
Actors: Client1 with Tx1, and Cassandra
Step: Conditional write (LWT): update the status if the record is prepared by the TxID

User table (after the write):
UserID  Balance  Status  Version  TxID
1       80       C       6        Tx1
2       120      C       5        Tx1

Coordinator table:
Status  TxID
C       XXX
C       YYY
A       ZZZ
C       Tx1

Slide 23

Slide 23 text

Recovery
• Recovery is done lazily when a record is read
• Recovery process, depending on where TX1 crashes (timeline: Prepare Phase -> Commit Phase 1 -> Commit Phase 2)
  – Before the prepare phase: nothing is needed (the local memory space is automatically cleared)
  – After the prepare phase has started: rolled back lazily by another TX using the before image
  – After commit phase 1: rolled forward lazily by another TX, which updates the status to C
  – After commit phase 2: no need for recovery
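The decision a reader makes on a prepared record can be sketched as follows. This is an illustration, not Scalar DB's code: the names are hypothetical, the row is mutated in place instead of through conditional writes, and a record whose transaction has no status record yet is simply left alone. A full implementation would also have to settle that last case.

```java
import java.util.Map;

/** Hypothetical sketch of lazy recovery performed by a reader. */
final class LazyRecoverySketch {
    enum TxStatus { COMMITTED, ABORTED }

    static final class Row {
        // After image.
        int balance;
        char status;   // 'P' = prepared, 'C' = committed
        int version;
        String txId;
        // Before image.
        int beforeBalance;
        int beforeVersion;
        String beforeTxId;

        Row(int balance, char status, int version, String txId,
            int beforeBalance, int beforeVersion, String beforeTxId) {
            this.balance = balance;
            this.status = status;
            this.version = version;
            this.txId = txId;
            this.beforeBalance = beforeBalance;
            this.beforeVersion = beforeVersion;
            this.beforeTxId = beforeTxId;
        }
    }

    /** Called when a reader encounters a row; the coordinator map is TxID -> status. */
    static void recoverOnRead(Row row, Map<String, TxStatus> coordinator) {
        if (row.status != 'P') {
            return; // already committed: no recovery needed
        }
        TxStatus decided = coordinator.get(row.txId);
        if (decided == TxStatus.COMMITTED) {
            row.status = 'C'; // roll forward
        } else if (decided == TxStatus.ABORTED) {
            // Roll back by restoring the before image.
            row.balance = row.beforeBalance;
            row.version = row.beforeVersion;
            row.txId = row.beforeTxId;
            row.status = 'C';
        }
        // No status record: left untouched in this sketch.
    }
}
```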

Slide 24

Slide 24 text

Serializable Strategy
• Basic strategy
  – Avoid the anti-dependency (rw-dependency) dangerous structure [TODS’05]
  – No use of SSI [SIGMOD’08] or its variant [EuroSys’12]
    – They require many linearizable operations for managing in/outConflicts, or a correct clock
• Two implementations
  – Extra-write
    – Convert reads into writes
    – Extra care is taken if a record doesn’t exist (delete the record)
  – Extra-read
    – After the prepare phase, check the read set to see that it has not been updated by other transactions
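The extra-read check can be sketched as a comparison of the read set against the current state. This is a simplified illustration with hypothetical names. It assumes that a change to a record is visible as a change of its TxID, and that a record absent at read time is tracked with a null TxID.

```java
import java.util.Map;
import java.util.Objects;

/** Hypothetical sketch of the extra-read validation step. */
final class ExtraReadCheck {
    /**
     * txIdAtRead: record key -> TxID observed when the transaction read the record.
     * txIdNow:    record key -> TxID observed when re-reading after the prepare phase.
     * Returns true only if no record in the read set was changed by another transaction.
     */
    static boolean readSetUnchanged(Map<String, String> txIdAtRead, Map<String, String> txIdNow) {
        for (Map.Entry<String, String> entry : txIdAtRead.entrySet()) {
            if (!Objects.equals(entry.getValue(), txIdNow.get(entry.getKey()))) {
                return false; // updated by another transaction: abort
            }
        }
        return true;
    }
}
```

A transaction would go on to commit phase 1 only when the check returns true, and abort otherwise.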

Slide 25

Slide 25 text

Benchmark Results with Scalar DB on Cassandra
[Figure: two panels, Workload1 (Payment) and Workload2 (Evidence)]
Each node: i3.4xlarge (16 vCPUs, 122 GB RAM, 1900 GB NVMe SSD * 2), RF: 3
• Achieved 90% scalability in a 100-node cluster (compared to the ideal TPS based on the performance of a 3-node cluster)
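One plausible reading of the scalability figure, assuming the ideal TPS is a linear extrapolation from the 3-node result (the slide does not spell out the formula):

```latex
\text{scalability}(n) = \frac{\mathrm{TPS}(n)}{\mathrm{TPS}(3) \times \dfrac{n}{3}},
\qquad
\text{scalability}(100) \approx 0.9
```

Under this reading, the 100-node cluster delivers about 90% of the throughput that perfectly linear scaling from 3 nodes would give.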

Slide 26

Slide 26 text

Verification Results for Scalar DB on Cassandra
• Scalar DB on Cassandra has been heavily tested with Jepsen and our destructive tools
  – Jepsen tests are created and conducted by Scalar
  – See https://github.com/scalar-labs/scalar-jepsen for more details
• The transaction commit protocol is verified with TLA+
  – See https://github.com/scalar-labs/scalardb/tree/master/tla%2B/consensus-commit
Results: Jepsen: Passed; TLA+: Passed

Slide 27

Slide 27 text

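Summary
• Scalar DB is a universal transaction manager
  – It makes C* ACID-compliant without any changes to the C* code
  – High scalability and correctness have been verified
  – Databases other than C* are also supported
    – Azure Cosmos DB
    – Amazon DynamoDB
    – Relational databases (MySQL, PostgreSQL, Oracle, SQL Server)
      – Coming pretty soon!
• Scalar is hiring (feel free to contact us via AngelList or similar)
  – Engineering Manager: to help build a world-class engineering team together
  – SRE: someone who loves building highly available and scalable distributed systems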