Transaction Management on Cassandra

CassandraCommunityJP

February 12, 2021

Transcript

  1. © 2020 Scalar, inc. Transaction Management on Cassandra

    2 Feb, 2021 at Cassandra Study Meetup (Cassandra勉強会)
    Hiroyuki Yamada, CTO/CEO at Scalar, Inc.
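  2. Cassandra @ Scalar

    • Scalar: a company building next-generation databases from the standpoint of reliability in a broad sense
      – Vision: World’s Reliable Database
    • As one of our efforts, we extend and improve Cassandra
      – New functionality: development of Scalar DB, a library that makes Cassandra ACID
      – Performance: development of a new commit log mode (Group CommitLog)
      – Stability: improvement and execution of Jepsen tests for LWT
    • Presented at ApacheCon 2019 and 2020 (Cassandra track)
      – “Transaction Management on Cassandra”
      – “Making Cassandra more capable, faster, and more reliable”
    • Today: an introduction to Scalar DB, which makes Cassandra ACID, as presented at ApacheCon
      – The original slides from that talk are used as they are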
  3. What is Scalar DB

    • A universal transaction manager
      – A Java library that makes non-ACID databases ACID-compliant
      – The architecture is inspired by Deuteronomy [CIDR’09,11]
    • Cassandra is the first supported database
    • https://github.com/scalar-labs/scalardb
  4. Why ACID Transactions with Cassandra? Why with Scalar DB?

    • ACID is a must-have feature in some mission-critical applications
      – C* is increasingly used for such applications
      – C* is one of the major open-source distributed databases
    • Modifying C* carries many risks and burdens
      – Scalar DB enables ACID transactions without modifying C* at all, since it depends only on the exposed APIs
      – No risk of breaking the existing code
  5. Pros and Cons in Scalar DB on Cassandra

    Pros:
    • Non-invasive
      – No modifications in C*
    • High availability and scalability
      – C* properties are fully sustained by the client-coordinated approach
    • Flexible deployment
      – Transaction layer and storage layer can be scaled independently

    Cons:
    • Slower than NewSQLs
      – More abstraction layers and a storage-oblivious transaction manager
    • Hard to optimize
      – The transaction manager has little information about the storage
    • No CQL support
      – A transaction has to be written procedurally in a programming language
  6. Programming Interface and System Architecture

    • CRUD interface
      – put, get, scan, delete
    • Begin and commit semantics
      – An arbitrary number of operations can be handled
    • Client-coordinated
      – Transaction code is run in the library
      – No middleware needs to be managed

    Example (createGet and createPut are helpers left undefined on the slide):

      DistributedTransactionManager manager = …;
      // Begin a transaction
      DistributedTransaction transaction = manager.start();
      // Read a record, then write a new value derived from it
      Get get = createGet();
      Optional<Result> result = transaction.get(get);
      Put put = createPut(result);
      transaction.put(put);
      // Commit all operations atomically
      transaction.commit();

    [Figure: architecture stack showing command execution / HTTP, client programs / Web applications, Scalar DB, and Cassandra]
  7. Data Model

    • Multi-dimensional map [OSDI’06]
      – (partition-key, clustering-key, value-name) -> value-content
      – Assumed to be hash partitioned
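The mapping on this slide can be pictured as nested maps. The following is a minimal in-memory sketch, not Scalar DB code: the class name `MultiDimensionalMap` and its methods are hypothetical, and the use of a hash map for partitions and a sorted map for clustering keys is an assumption made only to mirror "hash partitioned".

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory model of (partition-key, clustering-key, value-name) -> value-content. */
final class MultiDimensionalMap {
    // Assumption: partitions are located by hash (HashMap), and clustering keys
    // are kept sorted inside each partition (TreeMap).
    private final Map<String, TreeMap<String, Map<String, Object>>> partitions = new HashMap<>();

    void put(String partitionKey, String clusteringKey, String valueName, Object valueContent) {
        partitions
            .computeIfAbsent(partitionKey, k -> new TreeMap<>())
            .computeIfAbsent(clusteringKey, k -> new HashMap<>())
            .put(valueName, valueContent);
    }

    /** Returns the value content, or null if any of the three coordinates is missing. */
    Object get(String partitionKey, String clusteringKey, String valueName) {
        TreeMap<String, Map<String, Object>> partition = partitions.get(partitionKey);
        if (partition == null) {
            return null;
        }
        Map<String, Object> row = partition.get(clusteringKey);
        return row == null ? null : row.get(valueName);
    }

    public static void main(String[] args) {
        MultiDimensionalMap table = new MultiDimensionalMap();
        table.put("user-1", "2021-02-12", "balance", 100);
        System.out.println(table.get("user-1", "2021-02-12", "balance")); // prints 100
    }
}
```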
  8. Transaction Management - Overview

    • Based on Cherry Garcia [ICDE’15]
      – Two-phase commit on linearizable operations (for Atomicity)
        – Protocol correction and TrueTime avoidance are our extended work
      – Distributed WAL records (for Atomicity and Durability)
      – Single-version optimistic concurrency control (for Isolation)
        – Serializability support is our extended work
    • Requirements in underlying databases/storages
      – Linearizable read and linearizable conditional/CAS write
      – An ability to store metadata for each record
  9. Transaction Commit Protocol (for Atomicity)

    • Two-phase commit protocol on linearizable operations
      – Similar to Paxos Commit [TODS’06]
    • The protocol
      – Prepare phase: prepare records
      – Commit phase 1: commit status record
        – This is where a transaction is regarded as committed or aborted
      – Commit phase 2: commit records
    • Lazy recovery
      – Uncommitted records will be rolled forward or rolled back, based on the status of the transaction, when the records are read
  10. Distributed WAL (for Atomicity and Durability)

    • WAL (Write-Ahead Logging) is distributed into records
    • User/application record in user tables
      – After image: application data; transaction metadata (TxID, Status, Version)
      – Before image: application data (before); transaction metadata (TxID (before), Status (before), Version (before))
    • Status record in coordinator table
      – TxID, Status, other metadata
    • Legend: application data is managed by users; transaction metadata is managed by Scalar DB
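The record layout on this slide can be sketched as a small Java type. This is an illustration under stated assumptions, not Scalar DB's actual schema: the names `WalRecordSketch`, `Image`, and `Record` are hypothetical, application data is reduced to a single `int`, and whether the before image is cleared on commit is not stated on the slide, so the sketch simply keeps it.

```java
/** Hypothetical sketch of a user-table record that carries its own before/after images. */
final class WalRecordSketch {
    /** Application data plus the transaction metadata stored beside it. */
    static final class Image {
        final int data;      // application data (managed by users)
        final char status;   // 'P' = prepared, 'C' = committed (managed by the transaction manager)
        final int version;
        final String txId;

        Image(int data, char status, int version, String txId) {
            this.data = data;
            this.status = status;
            this.version = version;
            this.txId = txId;
        }
    }

    /** One record: the after image is the current state, the before image is the log. */
    static final class Record {
        final Image after;
        final Image before; // null when no earlier version exists

        Record(Image after, Image before) {
            this.after = after;
            this.before = before;
        }

        /** Prepare: the current after image becomes the before image of the new record. */
        Record prepare(int newData, String txId) {
            return new Record(new Image(newData, 'P', after.version + 1, txId), after);
        }

        /** Roll forward: keep the new data and mark it committed. */
        Record rollForward() {
            return new Record(new Image(after.data, 'C', after.version, after.txId), before);
        }

        /** Roll back: restore the before image (assumes one exists). */
        Record rollBack() {
            return new Record(before, null);
        }
    }

    public static void main(String[] args) {
        Record committed = new Record(new Image(100, 'C', 5, "XXX"), null);
        Record prepared = committed.prepare(80, "Tx1");
        System.out.println(prepared.after.data + " " + prepared.after.status);   // prints 80 P
        System.out.println(prepared.rollBack().after.data);                      // prints 100
    }
}
```

Because each record holds its own before image, a reader that finds a prepared record has everything it needs to undo or complete the write without consulting a separate log.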
  11. Concurrency Control (for Isolation)

    • Single-version OCC
      – Simple implementation of Snapshot Isolation
      – Conflicts are detected by linearizable conditional write (LWT)
      – No clock dependency, no use of HLC (Hybrid Logical Clock)
    • Supported isolation levels
      – Read-committed Snapshot Isolation (RCSI)
        – Read-skew, write-skew, and read-only anomalies could happen
      – Serializable
        – No anomalies (Strict Serializability)
        – RCSI-based, but non-serializable schedules are aborted
  12. Transaction With Example – Prepare Phase

    Records in Cassandra:

      UserID  Balance  Status  Version  TxID
      1       100      C       5        XXX
      2       100      C       4        YYY
  13. Transaction With Example – Prepare Phase

    Client1 reads the records from Cassandra.
  14. Transaction With Example – Prepare Phase

    Client1’s memory space now holds the records it read:

      UserID  Balance  Status  Version  TxID
      1       100      C       5        XXX
      2       100      C       4        YYY
  15. Transaction With Example – Prepare Phase

    Tx1: Transfer 20 from 1 to 2. Client1’s memory space also holds the new records:

      UserID  Balance  Status  Version  TxID
      1       80       P       6        Tx1
      2       120      P       5        Tx1
  16. Transaction With Example – Prepare Phase

    Client1 issues conditional writes (LWT): update only if the versions and the TxIDs are the same as the ones it read.
  17. Transaction With Example – Prepare Phase

    The conditional writes are applied; the records in Cassandra become:

      UserID  Balance  Status  Version  TxID
      1       80       P       6        Tx1
      2       120      P       5        Tx1
  18. Transaction With Example – Prepare Phase

    Client2 runs Tx2: Transfer 10 from 1 to 2. Client2’s memory space holds the records it read and the new records:

      UserID  Balance  Status  Version  TxID
      1       100      C       5        XXX
      2       100      C       4        YYY
      1       90       P       6        Tx2
      2       110      P       5        Tx2
  19. Transaction With Example – Prepare Phase

    Client2’s conditional write fails due to the condition mismatch: the versions and TxIDs in Cassandra are no longer the ones Client2 read.
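The race between Tx1 and Tx2 in the prepare-phase slides can be replayed with an in-memory stand-in for Cassandra. This is a sketch under assumptions: `PrepareSketch`, `Row`, `Store`, and `prepareTransfer` are hypothetical names, and a `synchronized` method stands in for the linearizable conditional write (LWT).

```java
import java.util.HashMap;
import java.util.Map;

final class PrepareSketch {
    /** One user-table row: application data (balance) plus transaction metadata. */
    static final class Row {
        final int balance;
        final char status;   // 'C' = committed, 'P' = prepared
        final int version;
        final String txId;

        Row(int balance, char status, int version, String txId) {
            this.balance = balance;
            this.status = status;
            this.version = version;
            this.txId = txId;
        }
    }

    /** In-memory stand-in for Cassandra; conditionalWrite mimics a conditional write (LWT). */
    static final class Store {
        private final Map<Integer, Row> rows = new HashMap<>();

        synchronized void load(int userId, Row row) { rows.put(userId, row); }

        synchronized Row read(int userId) { return rows.get(userId); }

        /** Writes next only if the stored version and TxID still match the row that was read. */
        synchronized boolean conditionalWrite(int userId, Row readRow, Row next) {
            Row current = rows.get(userId);
            boolean unchanged = current != null
                && current.version == readRow.version
                && current.txId.equals(readRow.txId);
            if (unchanged) {
                rows.put(userId, next);
            }
            return unchanged;
        }
    }

    /** Prepares "transfer amount from user 1 to user 2"; returns false on a conflict. */
    static boolean prepareTransfer(Store store, String txId, Row read1, Row read2, int amount) {
        Row next1 = new Row(read1.balance - amount, 'P', read1.version + 1, txId);
        Row next2 = new Row(read2.balance + amount, 'P', read2.version + 1, txId);
        // A real client would abort and roll back any partially prepared rows on failure.
        return store.conditionalWrite(1, read1, next1) && store.conditionalWrite(2, read2, next2);
    }

    public static void main(String[] args) {
        Store store = new Store();
        store.load(1, new Row(100, 'C', 5, "XXX"));
        store.load(2, new Row(100, 'C', 4, "YYY"));

        // Both clients read the same committed rows before either one prepares.
        Row c1r1 = store.read(1), c1r2 = store.read(2);
        Row c2r1 = store.read(1), c2r2 = store.read(2);

        System.out.println(prepareTransfer(store, "Tx1", c1r1, c1r2, 20)); // prints true
        System.out.println(prepareTransfer(store, "Tx2", c2r1, c2r2, 10)); // prints false
    }
}
```

Tx2 fails on its first conditional write because row 1 already carries version 6 and TxID Tx1, which is the condition mismatch shown on the last prepare-phase slide.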
  20. Transaction With Example – Commit Phase 1

    Client1 with Tx1. Records in Cassandra (user table):

      UserID  Balance  Status  Version  TxID
      1       80       P       6        Tx1
      2       120      P       5        Tx1

    Status records in Cassandra (coordinator table):

      Status  TxID
      C       XXX
      C       YYY
      A       ZZZ
  21. Transaction With Example – Commit Phase 1

    Client1 issues a conditional write (LWT) to the coordinator table: update if the TxID does not exist. The coordinator table becomes:

      Status  TxID
      C       XXX
      C       YYY
      A       ZZZ
      C       Tx1
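The "update if the TxID does not exist" write is what makes the outcome of a transaction a single decision. A minimal sketch, assuming a `ConcurrentHashMap` as a stand-in for the coordinator table and its LWT; `CoordinatorSketch` and `decide` are hypothetical names.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical stand-in for the coordinator table. */
final class CoordinatorSketch {
    enum TxStatus { COMMITTED, ABORTED }

    private final ConcurrentMap<String, TxStatus> statusTable = new ConcurrentHashMap<>();

    /**
     * Insert-if-not-exists on the status record.
     * Returns the status that is actually decided for txId, which may differ from the proposal.
     */
    TxStatus decide(String txId, TxStatus proposed) {
        TxStatus existing = statusTable.putIfAbsent(txId, proposed);
        return existing == null ? proposed : existing;
    }

    public static void main(String[] args) {
        CoordinatorSketch coordinator = new CoordinatorSketch();
        // Client1 commits Tx1 first, so a later attempt to abort Tx1 has no effect.
        System.out.println(coordinator.decide("Tx1", TxStatus.COMMITTED)); // prints COMMITTED
        System.out.println(coordinator.decide("Tx1", TxStatus.ABORTED));   // prints COMMITTED
    }
}
```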
  22. Transaction With Example – Commit Phase 2

    Client1 with Tx1 issues conditional writes (LWT) to the user table: update status if the record is prepared by the TxID. Records in Cassandra (user table):

      UserID  Balance  Status  Version  TxID
      1       80       C       6        Tx1
      2       120      C       5        Tx1

    Status records in Cassandra (coordinator table):

      Status  TxID
      C       XXX
      C       YYY
      A       ZZZ
      C       Tx1
  23. Recovery

    • Recovery is lazily done when a record is read
    • Recovery process, by the point at which TX1 crashes:
      – Before the Prepare Phase: nothing is needed (local memory space is automatically cleared)
      – After the Prepare Phase: rolled back by another TX lazily using the before image
      – After Commit Phase 1: rolled forward by another TX lazily updating status to C
      – After Commit Phase 2: no need for recovery
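The decision a reader makes when it meets a record during lazy recovery can be sketched as follows. `LazyRecoverySketch`, `Action`, and `onRead` are hypothetical names. One point is an assumption not stated on the slide: when no status record exists yet, the reader here first tries to write an aborted status with the same insert-if-not-exists operation and then follows whatever status was decided. A real implementation would also need some rule, such as a timeout, before aborting a transaction that may still be running.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class LazyRecoverySketch {
    enum TxStatus { COMMITTED, ABORTED }
    enum Action { NONE, ROLL_FORWARD, ROLL_BACK }

    /** Decides what a reader does with a record whose status is 'C' (committed) or 'P' (prepared). */
    static Action onRead(char recordStatus, String txId, ConcurrentMap<String, TxStatus> coordinator) {
        if (recordStatus == 'C') {
            return Action.NONE; // Commit phase 2 already finished for this record.
        }
        // Assumption: with no status record, the reader tries to abort the owner first.
        TxStatus existing = coordinator.putIfAbsent(txId, TxStatus.ABORTED);
        TxStatus decided = existing == null ? TxStatus.ABORTED : existing;
        // Committed status: finish the commit. Aborted status: restore the before image.
        return decided == TxStatus.COMMITTED ? Action.ROLL_FORWARD : Action.ROLL_BACK;
    }

    public static void main(String[] args) {
        ConcurrentMap<String, TxStatus> coordinator = new ConcurrentHashMap<>();
        coordinator.put("Tx1", TxStatus.COMMITTED);

        System.out.println(onRead('P', "Tx1", coordinator)); // prints ROLL_FORWARD
        System.out.println(onRead('P', "Tx9", coordinator)); // prints ROLL_BACK
        System.out.println(onRead('C', "Tx1", coordinator)); // prints NONE
    }
}
```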
  24. Serializable Strategy

    • Basic strategy
      – Avoid the dangerous structure of anti-dependencies (rw-dependencies) [TODS’05]
      – No use of SSI [SIGMOD’08] or its variant [EuroSys’12]
        – They require many linearizable operations for managing in/outConflicts, or correct clocks
    • Two implementations
      – Extra-write
        – Convert read into write
        – Extra care is taken if a record doesn’t exist (delete the record)
      – Extra-read
        – Check the read set after prepare to confirm that it has not been updated by other transactions
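The extra-read check can be sketched as a re-read of the read set after the prepare phase. This is an illustration only: `ExtraReadSketch`, `Stamp`, and `validateReadSet` are hypothetical names, plain maps stand in for storage, and the read set is assumed to hold only records the transaction read but did not write (written records are already guarded by the conditional writes of the prepare phase).

```java
import java.util.HashMap;
import java.util.Map;

final class ExtraReadSketch {
    /** The version and TxID observed for one record. */
    static final class Stamp {
        final int version;
        final String txId;

        Stamp(int version, String txId) {
            this.version = version;
            this.txId = txId;
        }

        boolean sameAs(Stamp other) {
            return other != null && version == other.version && txId.equals(other.txId);
        }
    }

    /** Returns true if every record in the read set still carries the stamp that was read. */
    static boolean validateReadSet(Map<String, Stamp> readSet, Map<String, Stamp> storage) {
        for (Map.Entry<String, Stamp> entry : readSet.entrySet()) {
            if (!entry.getValue().sameAs(storage.get(entry.getKey()))) {
                return false; // Another transaction updated this record: abort.
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Stamp> storage = new HashMap<>();
        storage.put("x", new Stamp(5, "XXX"));

        Map<String, Stamp> readSet = new HashMap<>();
        readSet.put("x", storage.get("x"));
        System.out.println(validateReadSet(readSet, storage)); // prints true

        storage.put("x", new Stamp(6, "Tx2")); // A concurrent transaction updates x.
        System.out.println(validateReadSet(readSet, storage)); // prints false
    }
}
```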
  25. Benchmark Results with Scalar DB on Cassandra

    [Figure: benchmark results for Workload1 (Payment) and Workload2 (Evidence)]

    • Each node: i3.4xlarge (16 vCPUs, 122 GB RAM, 1900 GB NVMe SSD * 2), RF: 3
    • Achieved 90 % scalability in a 100-node cluster (compared to the ideal TPS based on the performance of a 3-node cluster)
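One way to read the scalability figure, assuming the ideal TPS is a linear extrapolation from the 3-node measurement (the slide does not spell out the formula, and the symbol names below are introduced only for illustration):

```latex
\mathrm{TPS}_{\mathrm{ideal}}(100) = \mathrm{TPS}_{\mathrm{measured}}(3) \times \frac{100}{3},
\qquad
\mathrm{scalability} = \frac{\mathrm{TPS}_{\mathrm{measured}}(100)}{\mathrm{TPS}_{\mathrm{ideal}}(100)} \approx 0.9
```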
  26. Verification Results for Scalar DB on Cassandra

    • Scalar DB on Cassandra has been heavily tested with Jepsen and our destructive tools
      – Jepsen tests are created and conducted by Scalar
      – See https://github.com/scalar-labs/scalar-jepsen for more details
    • The transaction commit protocol is verified with TLA+
      – See https://github.com/scalar-labs/scalardb/tree/master/tla%2B/consensus-commit
    • Results: Jepsen: Passed; TLA+: Passed
  27. Summary

    • Scalar DB is a universal transaction manager
      – Makes C* ACID without any changes to the C* code
      – High scalability and correctness have been verified
      – Databases other than C* are also supported
        – Azure Cosmos DB
        – Amazon DynamoDB
        – Relational databases (MySQL, PostgreSQL, Oracle, SQL Server)
          – Coming pretty soon!
    • Scalar is Hiring (feel free to contact us via AngelList or similar)
      – An Engineering Manager to help build a world-class engineering team with us
      – SREs who love building highly available and scalable distributed systems