Making Cassandra more capable, faster, and more reliable (at ApacheCon@Home 2020)

Cassandra is widely adopted by large and sometimes mission-critical applications because of its high performance, high availability, and high scalability. However, there is still room for improvement to take Cassandra to the next level. We have been contributing to Cassandra to make it more capable, faster, and more reliable, for example by proposing a non-invasive ACID transaction library, adding GroupCommitLogService, and maintaining and conducting Jepsen testing for lightweight transactions. This talk presents these contributions in detail, including the latest updates, and the reasons why we made them. It should be a good starting point for discussing the next generation of Cassandra.

Scalar, Inc.

October 01, 2020

Transcript

  1. Making Cassandra more capable, faster, and more reliable Hiroyuki Yamada

    – CTO/CEO at Scalar, Inc. Yuji Ito – Architect at Scalar, Inc. APACHECON @HOME Sep. 29th – Oct. 1st, 2020
  2. © 2020 Scalar, inc. Speakers • Hiroyuki Yamada – CTO

    at Scalar, Inc. – Passionate about Database Systems and Distributed Systems – Ph.D. in Computer Science, the University of Tokyo – Formerly at IIS (the University of Tokyo), Yahoo! Japan, and IBM Japan 2 • Yuji Ito – Architect at Scalar, Inc. – Improves the performance and the reliability of Scalar DLT – Loves failure analysis – Formerly an SSD firmware engineer at Fixstars and Hitachi
  3. © 2020 Scalar, inc. Cassandra @ Scalar • Scalar tries

    to take Cassandra to the next level – More capable: ACID transactions with Scalar DB – Faster: Group CommitLog Sync – More reliable: Jepsen tests for LWT • This talk will present why we do them and what we do 3
  4. © 2020 Scalar, inc. What is Scalar DB • A

    universal transaction manager – A Java library that makes non-ACID databases ACID-compliant – The architecture is inspired by Deuteronomy [CIDR’09,11] • Cassandra is the first supported database 5 https://github.com/scalar-labs/scalardb
  5. © 2020 Scalar, inc. Why ACID Transactions with Cassandra? Why

    with Scalar DB? • ACID is a must-have feature in some mission-critical applications – C* has been getting widely used for such applications – C* is one of the major open-source distributed databases • Lots of risks and burden in modifying C* – Scalar DB enables ACID transactions without modifying C* at all since it depends only on the exposed APIs – No risk of breaking the existing code 6
  6. © 2020 Scalar, inc. Pros and Cons of Scalar DB

    on Cassandra • Non-invasive – No modifications in C* • High availability and scalability – C* properties are fully sustained by the client-coordinated approach • Flexible deployment – Transaction layer and storage layer can be scaled independently 7 • Slower than NewSQL systems – More abstraction layers and a storage-oblivious transaction manager • Hard to optimize – The transaction manager has little information about the storage • No CQL support – A transaction has to be written procedurally in a programming language
  7. © 2020 Scalar, inc. Programming Interface and System Architecture •

    CRUD interface – put, get, scan, delete • Begin and commit semantics – An arbitrary number of operations can be handled • Client-coordinated – Transaction code is run in the library – No middleware is managed 8 DistributedTransactionManager manager = …; DistributedTransaction transaction = manager.start(); Get get = createGet(); Optional<Result> result = transaction.get(get); Put put = createPut(result); transaction.put(put); transaction.commit(); [Diagram: Client programs / Web applications → (command execution / HTTP) → Scalar DB → Cassandra]
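
The snippet on the slide is compressed; as a rough, runnable sketch of the same begin/commit flow (the namespace, table, and column names here are hypothetical, and the exact value/builder classes vary between Scalar DB versions):

```java
import com.scalar.db.api.DistributedTransaction;
import com.scalar.db.api.DistributedTransactionManager;
import com.scalar.db.api.Get;
import com.scalar.db.api.Put;
import com.scalar.db.api.Result;
import com.scalar.db.io.IntValue;
import com.scalar.db.io.Key;
import java.util.Optional;

// Hedged sketch of a transactional update with the Scalar DB CRUD interface.
// "demo", "account", "user_id", and "balance" are made-up names.
public class TransferSketch {
  public static void updateBalance(DistributedTransactionManager manager) throws Exception {
    DistributedTransaction tx = manager.start();  // begin
    try {
      Get get = new Get(new Key(new IntValue("user_id", 1)));
      get.forNamespace("demo").forTable("account");
      Optional<Result> result = tx.get(get);      // read within the transaction

      Put put = new Put(new Key(new IntValue("user_id", 1)))
          .withValue(new IntValue("balance", 80));
      put.forNamespace("demo").forTable("account");
      tx.put(put);                                // buffered write

      tx.commit();                                // all operations commit atomically
    } catch (Exception e) {
      tx.abort();                                 // roll back on any failure
      throw e;
    }
  }
}
```
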
  8. © 2020 Scalar, inc. Data Model • Multi-dimensional map [OSDI’06]

    – (partition-key, clustering-key, value-name) -> value-content – Assumed to be hash partitioned 9
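
As a toy illustration (all names made up), the model behaves like a nested map: the partition key decides placement via hashing, and records within a partition are ordered by clustering key.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Toy model of (partition-key, clustering-key, value-name) -> value-content.
// The outer HashMap stands in for hash partitioning across nodes; the TreeMap
// keeps records within a partition ordered by clustering key.
public class MultiDimensionalMapSketch {
  static Map<String, TreeMap<String, Map<String, String>>> store = new HashMap<>();

  public static void main(String[] args) {
    store.computeIfAbsent("user1", k -> new TreeMap<>())
         .computeIfAbsent("2020-10-01", k -> new HashMap<>())
         .put("balance", "100");
    System.out.println(store.get("user1").get("2020-10-01").get("balance"));  // 100
  }
}
```
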
  9. © 2020 Scalar, inc. Transaction Management - Overview • Based

    on Cherry Garcia [ICDE’15] – Two-phase commit on linearizable operations (for Atomicity) – Protocol correction is our extended work – Distributed WAL records (for Atomicity and Durability) – Single-version optimistic concurrency control (for Isolation) – Serializability support is our extended work • Requirements on the underlying databases/storages – Linearizable read and linearizable conditional/CAS write – The ability to store metadata for each record 10
  10. © 2020 Scalar, inc. Transaction Commit Protocol (for Atomicity) •

    Two-phase commit protocol on linearizable operations – Similar to Paxos Commit [TODS’06] – Data records are assumed to be distributed • The protocol – Prepare phase: prepare records – Commit phase 1: commit the status record – This is where a transaction is regarded as committed or aborted – Commit phase 2: commit records • Lazy recovery – Uncommitted records will be rolled forward or rolled back based on the status of the transaction when the records are read 11
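
In outline, a client-side commit follows the three steps above. This is a schematic sketch, not Scalar DB's actual code; the Store interface and its CAS helpers are invented stand-ins for linearizable (LWT) operations.

```java
import java.util.List;

// Schematic sketch of the commit protocol; all helpers are hypothetical.
public class CommitProtocolSketch {
  interface Store {
    boolean prepareRecord(String key, String txId);  // CAS: C -> P, guarded by read version
    boolean commitStatusRecord(String txId);         // CAS: create status record with C
    boolean commitRecord(String key, String txId);   // CAS: P -> C, guarded by txId
  }

  static void commit(Store store, String txId, List<String> writeSet) throws Exception {
    // Prepare phase: a conflicting transaction fails its CAS here and aborts.
    for (String key : writeSet) {
      if (!store.prepareRecord(key, txId)) throw new Exception("conflict: abort");
    }
    // Commit phase 1: this single linearizable write is the commit point.
    if (!store.commitStatusRecord(txId)) throw new Exception("already decided");
    // Commit phase 2: best effort; if the client crashes here, later readers
    // roll the prepared records forward lazily based on the status record.
    for (String key : writeSet) {
      store.commitRecord(key, txId);
    }
  }
}
```
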
  11. © 2020 Scalar, inc. Distributed WAL (for Atomicity and Durability)

    • WAL (Write-Ahead Logging) is distributed into records 12 [Diagram: each record in a user table holds application data (managed by users) plus transaction metadata (managed by Scalar DB) — an after image (Status, Version, TxID) and a before image (Application data (before), Status (before), Version (before), TxID (before)); the status record in the coordinator table holds TxID, Status, and other metadata]
  12. © 2020 Scalar, inc. Concurrency Control (for Isolation) • Single

    version OCC – Simple implementation of Snapshot Isolation – Conflicts are detected by linearizable conditional write (LWT) – No clock dependency, no use of HLC (Hybrid Logical Clock) • Supported isolation level – Read-committed Snapshot Isolation (RCSI) – Read-skew, write-skew, read-only, phantom anomalies could happen – Serializable – No anomalies (Strict Serializability) – RCSI-based but non-serializable schedules are aborted 13
  13. © 2020 Scalar, inc. Transaction With Example – Prepare Phase

    14 Client1 Client1’s memory space Cassandra UserID Balance Status Version 1 100 C 5 TxID XXX 2 100 C 4 YYY
  14. © 2020 Scalar, inc. Transaction With Example – Prepare Phase

    14 Client1 Client1’s memory space Cassandra Read UserID Balance Status Version 1 100 C 5 TxID XXX 2 100 C 4 YYY
  15. © 2020 Scalar, inc. Transaction With Example – Prepare Phase

    14 Client1 Client1’s memory space Cassandra Read UserID Balance Status Version 1 100 C 5 TxID XXX 2 100 C 4 YYY UserID Balance Status Version 1 100 C 5 TxID XXX 2 100 C 4 YYY
  16. © 2020 Scalar, inc. Transaction With Example – Prepare Phase

    14 Client1 Client1’s memory space Cassandra Read UserID Balance Status Version 1 100 C 5 TxID XXX 2 100 C 4 YYY 1 80 P 6 Tx1 2 120 P 5 Tx1 Tx1: Transfer 20 from 1 to 2 UserID Balance Status Version 1 100 C 5 TxID XXX 2 100 C 4 YYY
  17. © 2020 Scalar, inc. Transaction With Example – Prepare Phase

    14 Client1 Client1’s memory space Cassandra Read Conditional write (LWT) Update only if the versions and the TxIDs are the same as the ones it read UserID Balance Status Version 1 100 C 5 TxID XXX 2 100 C 4 YYY 1 80 P 6 Tx1 2 120 P 5 Tx1 Tx1: Transfer 20 from 1 to 2 UserID Balance Status Version 1 100 C 5 TxID XXX 2 100 C 4 YYY
  18. © 2020 Scalar, inc. Transaction With Example – Prepare Phase

    14 Client1 Client1’s memory space Cassandra Read Conditional write (LWT) Update only if the versions and the TxIDs are the same as the ones it read UserID Balance Status Version 1 100 C 5 TxID XXX 2 100 C 4 YYY 1 80 P 6 Tx1 2 120 P 5 Tx1 Tx1: Transfer 20 from 1 to 2 UserID Balance Status Version 1 100 C 5 TxID XXX 2 100 C 4 YYY P 6 Tx1 P 5 Tx1
  19. © 2020 Scalar, inc. Transaction With Example – Prepare Phase

    14 Client1 Client1’s memory space Cassandra Read Conditional write (LWT) Update only if the versions and the TxIDs are the same as the ones it read UserID Balance Status Version 1 100 C 5 TxID XXX 2 100 C 4 YYY 1 80 P 6 Tx1 2 120 P 5 Tx1 Tx1: Transfer 20 from 1 to 2 Client2 UserID Balance Status Version 1 100 C 5 Client2’s memory space Tx2: Transfer 10 from 1 to 2 TxID XXX 2 100 C 4 YYY 1 90 P 6 Tx2 2 110 P 5 Tx2 UserID Balance Status Version 1 100 C 5 TxID XXX 2 100 C 4 YYY P 6 Tx1 P 5 Tx1
  20. © 2020 Scalar, inc. Transaction With Example – Prepare Phase

    14 Client1 Client1’s memory space Cassandra Read Conditional write (LWT) Update only if the versions and the TxIDs are the same as the ones it read Fail due to the condition mismatch UserID Balance Status Version 1 100 C 5 TxID XXX 2 100 C 4 YYY 1 80 P 6 Tx1 2 120 P 5 Tx1 Tx1: Transfer 20 from 1 to 2 Client2 UserID Balance Status Version 1 100 C 5 Client2’s memory space Tx2: Transfer 10 from 1 to 2 TxID XXX 2 100 C 4 YYY 1 90 P 6 Tx2 2 110 P 5 Tx2 UserID Balance Status Version 1 100 C 5 TxID XXX 2 100 C 4 YYY P 6 Tx1 P 5 Tx1
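
In CQL terms, Tx1's prepare-phase write for account 1 in the example might look roughly like the following; the schema simply mirrors the example table, and Scalar DB's actual generated metadata columns differ.

```java
import com.datastax.oss.driver.api.core.CqlSession;

// Hypothetical CQL rendering of the prepare-phase LWT from the example.
public class PreparePhaseSketch {
  public static void main(String[] args) {
    try (CqlSession session = CqlSession.builder().build()) {
      boolean applied = session.execute(
          "UPDATE demo.account SET balance = 80, tx_status = 'P', "
              + "tx_version = 6, tx_id = 'Tx1' "
              + "WHERE user_id = 1 IF tx_version = 5 AND tx_id = 'XXX'")
          .wasApplied();
      // Tx2 issues the same kind of update with the same IF condition;
      // whichever lands second sees wasApplied() == false and aborts.
      System.out.println("prepared: " + applied);
    }
  }
}
```
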
  21. © 2020 Scalar, inc. Transaction With Example – Commit Phase

    1 15 UserID Balance Status Version 1 80 P 6 TxID Tx1 2 120 P 5 Tx1 Status C TxID XXX C YYY A ZZZ Client1 with Tx1 Cassandra
  22. © 2020 Scalar, inc. Transaction With Example – Commit Phase

    1 15 UserID Balance Status Version 1 80 P 6 TxID Tx1 2 120 P 5 Tx1 Status C TxID XXX C YYY A ZZZ C Tx1 Conditional write (LWT) Update if the TxID does not exist Client1 with Tx1 Cassandra
  23. © 2020 Scalar, inc. Transaction With Example – Commit Phase

    2 16 Cassandra UserID Balance Status Version 1 80 C 6 TxID Tx1 2 120 C 5 Tx1 Status C TxID XXX C YYY A ZZZ C Tx1 Conditional write (LWT) Update status if the record is prepared by the TxID Client1 with Tx1
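
The two commit steps in the example can likewise be pictured as conditional writes. Again, this is hypothetical CQL with illustrative names, not Scalar DB's actual statements; both steps are LWTs matching the conditions on the slides.

```java
// Hedged CQL sketch of the example's commit steps.
public class CommitPhaseSketch {
  // Phase 1: decide the transaction by creating its status record exactly once.
  static final String COMMIT_STATUS =
      "INSERT INTO demo.coordinator (tx_id, status) VALUES ('Tx1', 'C') IF NOT EXISTS";

  // Phase 2: move each record from prepared to committed, guarded by the TxID
  // that prepared it.
  static final String COMMIT_RECORD =
      "UPDATE demo.account SET tx_status = 'C' "
          + "WHERE user_id = 1 IF tx_status = 'P' AND tx_id = 'Tx1'";
}
```
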
  24. © 2020 Scalar, inc. Recovery 17 Prepare Phase Commit Phase1

    Commit Phase2 TX1 • Recovery is lazily done when a record is read Nothing is needed (local memory space is automatically cleared) Recovery process Rolled back by another TX lazily using the before image Rolled forward by another TX lazily updating the status to C No need for recovery Crash
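
A sketch of the lazy-recovery decision a reader makes when it encounters a prepared record (the types are invented stand-ins; the real logic also has to abort a still-undecided transaction with a CAS on the status record before rolling back):

```java
// Hedged sketch of lazy recovery on read; Record and Coordinator are
// hypothetical stand-ins for Scalar DB internals.
public class LazyRecoverySketch {
  interface Record { String status(); String txId(); }
  interface Coordinator { String statusOf(String txId); }  // "C", "A", or null

  static String recoveryAction(Record r, Coordinator coordinator) {
    if (!"P".equals(r.status())) return "none";        // committed: nothing to do
    String txStatus = coordinator.statusOf(r.txId());
    if ("C".equals(txStatus)) return "roll-forward";   // crashed after commit phase 1
    if ("A".equals(txStatus)) return "rollback";       // transaction aborted
    return "abort-then-rollback";                      // undecided: write A first, then roll back
  }
}
```
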
  25. © 2020 Scalar, inc. Serializable Strategy • Basic strategy –

    Avoid anti/rw-dependency dangerous structures [TODS’05] – No use of SSI [SIGMOD’08] or its variants [EuroSys’12] – They would require many linearizable operations for managing in/outConflicts, or correct clocks • Two implementations – Extra-write – Converts reads into writes – Extra care is needed if a record doesn’t exist (delete the record) – Extra-read – Checks the read set after prepare to see if it has not been updated by other transactions 18
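
A sketch of the Extra-read validation step (helper names invented): after the prepare phase, the read set is re-read, and the transaction aborts if any version moved in the meantime.

```java
import java.util.Map;

// Hedged sketch of Extra-read validation; Store is a hypothetical stand-in.
public class ExtraReadSketch {
  interface Store { long currentVersion(String key); }

  // readVersions: key -> version observed when the transaction first read it.
  static boolean validate(Store store, Map<String, Long> readVersions) {
    for (Map.Entry<String, Long> e : readVersions.entrySet()) {
      if (store.currentVersion(e.getKey()) != e.getValue()) {
        return false;  // updated by a concurrent transaction: abort
      }
    }
    return true;  // read set unchanged: safe to commit
  }
}
```
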
  26. © 2020 Scalar, inc. Benchmark Results with Scalar DB on

    Cassandra 19 Workload2 (Evidence) Workload1 (Payment) Each node: i3.4xlarge (16 vCPUs, 122 GB RAM, 1900 GB NVMe SSD * 2), RF: 3 • Achieved 90% scalability in a 100-node cluster (compared to the ideal TPS extrapolated from the performance of a 3-node cluster)
  27. © 2020 Scalar, inc. Verification Results for Scalar DB on

    Cassandra • Scalar DB on Cassandra has been heavily tested with Jepsen and our destructive tools – Jepsen tests are created and conducted by Scalar – See https://github.com/scalar-labs/scalar-jepsen for more detail • Transaction commit protocol is verified with TLA+ – See https://github.com/scalar-labs/scalardb/tree/master/tla%2B/consensus-commit 20 Jepsen Passed TLA+ Passed
  28. © 2020 Scalar, inc. Speakers • Hiroyuki Yamada – CTO

    at Scalar, Inc. – Passionate about Database Systems and Distributed Systems – Ph.D. in Computer Science, the University of Tokyo – Formerly at IIS (the University of Tokyo), Yahoo! Japan, and IBM Japan 21 • Yuji Ito – Architect at Scalar, Inc. – Improves the performance and the reliability of Scalar DLT – Loves failure analysis – Formerly an SSD firmware engineer at Fixstars and Hitachi
  29. © 2020 Scalar, inc. Why do we need a new mode?

    • Scalar DB transaction relies on Cassandra’s – Durability – Performance • Synchronous commitlog sync is required for durability – Periodic mode might lose commitlogs • Commitlog sync performance is the key factor – Batch mode tends to issue lots of IOs 23
  30. © 2020 Scalar, inc. Group CommitLog Sync • New commitlog

    sync mode on 4.0 – https://issues.apache.org/jira/browse/CASSANDRA-13530 • The mode syncs multiple commitlogs at once periodically 24
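
Group mode is configured in cassandra.yaml. A minimal example, assuming Cassandra 4.0 (the window value below is illustrative; the evaluation later in this talk uses 10 ms and 15 ms):

```yaml
# cassandra.yaml (Cassandra 4.0): sync multiple commitlogs at once periodically
commitlog_sync: group
commitlog_sync_group_window_in_ms: 15   # example window; tune to your disk IOPS
```
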
  31. © 2020 Scalar, inc. Commitlog • Logs of all mutations

    to a Cassandra node – All writes append to the commitlog and the mutations are written to the memtable • Recover write data from commitlogs on startup – Data in the memtable are lost on a crash 25 Commitlog disk memtable Write Commitlog
  32. © 2020 Scalar, inc. Commitlog • Logs of all mutations

    to a Cassandra node – All writes append to the commitlog and the mutations are written to the memtable • Recover write data from commitlogs on startup – Data in the memtable are lost on a crash 26 Commitlog disk memtable Recover
  33. © 2020 Scalar, inc. Commitlog • Logs of all mutations

    to a Cassandra node – All writes append to the commitlog and the mutations are written to the memtable • Recover write data from commitlogs on startup – Data in the memtable are lost on a crash 27 Commitlog disk memtable Write Commitlog
  34. © 2020 Scalar, inc. • Sync commitlogs periodically • NOT

    wait for the completion of the sync (asynchronous sync) Existing mode: Periodic (default mode) 28 [Diagram: request threads are acked immediately; a commitlog sync thread flushes the commitlog to disk every commitlog_sync_period_in_ms]
  35. © 2020 Scalar, inc. • Sync commitlogs periodically • NOT

    wait for the completion of the sync (asynchronous sync) ⇒ commitlogs (write data) might be lost on a crash Existing mode: Periodic (default mode) 29 [Diagram: commitlogs acked within the last commitlog_sync_period_in_ms window are lost when the node crashes before the sync]
  36. © 2020 Scalar, inc. Existing mode: Batch • Sync commitlogs

    immediately – Wait for the completion of the sync (synchronous sync) – Commitlogs issued at about the same time can be synced together ⇒ Throughput is degraded due to many small IOs 30 [Diagram: each request triggers its own sync before being acked] “commitlog_sync_batch_window_in_ms” is only the maximum length of a window; in practice it always syncs immediately
  37. © 2020 Scalar, inc. Issues in the existing modes •

    Periodic – Commitlogs might be lost when Cassandra crashes • Batch – Performance could be degraded due to many small IOs – Batch doesn’t work as users would expect from the name 31
  38. © 2020 Scalar, inc. Grouping commitlogs • Sync multiple commitlogs

    at once periodically (synchronous sync) – Reduce IOs by grouping syncs 32 [Diagram: requests arriving within one commitlog_sync_group_window_in_ms window are synced together and then acked at once]
  39. © 2020 Scalar, inc. Evaluation • Workload – Small (<<

    1KB) update operations with IF EXISTS (LWT) and without IF EXISTS (non-LWT) • Environment 33 – Instance type: AWS EC2 m4.large – Disk type: AWS EBS io1, 200 IOPS – # of nodes: 3 – Replication factor: 3 – Window time: Batch 2 ms (default) and 10 ms; Group 10 ms and 15 ms
  40. © 2020 Scalar, inc. Evaluation result • Results with 2

    ms and 10 ms batch windows are almost the same • Group mode is a bit better than Batch mode – The difference becomes smaller with a faster disk 34 [Charts: Throughput - UPDATE (throughput vs. number of threads) and Latency of UPDATE (average latency vs. throughput) for Batch 2 ms, Batch 10 ms, Group 10 ms, and Group 15 ms]
  41. © 2020 Scalar, inc. Evaluation result • Between 8 and

    32 threads, the throughput of Group mode is better than that of Batch mode by up to 75% – With LWT, many commitlogs are issued, which affects performance 35 [Charts: Latency of UPDATE and Throughput - UPDATE (low concurrency) for Batch 2 ms, Batch 10 ms, Group 10 ms, and Group 15 ms]
  42. © 2020 Scalar, inc. Evaluation result • Without LWT, the

    latency of Batch mode is better than that of Group mode at small request rates 36 [Chart: Latency of UPDATE without LWT (average latency vs. throughput) for Batch 2 ms and Group 15 ms]
  43. © 2020 Scalar, inc. When to use Group mode? •

    When durability is required • When commitlog disk IOPS is lower than the request arrival rate – Group mode can remedy the latency increase due to IO saturation 37
  44. © 2020 Scalar, inc. Why do we run Jepsen tests for

    LWT? • Scalar DB transactions rely on the “correctness” of LWT – Jepsen can check the correctness (linearizability) • The existing Jepsen test for Cassandra has not been maintained • https://github.com/riptano/jepsen • Last commit: Feb 3, 2016 39
  45. © 2020 Scalar, inc. Jepsen tests for Cassandra • Our

    tests cover LWT, Batch, Set, Map, and Counter with various faults 40 [Diagrams: five-node clusters under Join/Leave/Rejoin, network faults (Bridge, Isolation, Halves), node crash, and clock drift]
  46. © 2020 Scalar, inc. Our contributions to Jepsen testing for

    Cassandra • Replaced Cassaforte with Alia (a Clojure wrapper for Cassandra) – Cassaforte has not been maintained – There seems to be a bug in getting results • Rewrote the tests with the latest Jepsen – The previous LWT test failed due to OOM – The new Jepsen can check the logs by dividing a test into parts 41
  47. © 2020 Scalar, inc. Our contributions to Jepsen testing for

    Cassandra • Report the results of short tests when a new version is released – 1 minute per test – Without fault injection • Run tests with fault injection against the 4.0 beta every week – Sometimes, a node cannot join the cluster before testing – This issue didn’t happen with the 4.0 alpha 42 jepsen@node0:~$ sudo /root/cassandra/bin/nodetool status Datacenter: datacenter1 ======================= Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns (effective) Host ID Rack UN 10.0.1.7 978.53 KiB 256 ? b7713da3-2ac6-4f10-bea0-6374f23b907a rack1 UN 10.0.1.9 1003.29 KiB 256 ? c5c961fa-b585-41a0-ad19-1c51590ccfb0 rack1 UN 10.0.1.8 975.07 KiB 256 ? 981dd1aa-fd12-472e-9fb6-41d24470716e rack1 UJ 10.0.1.4 182.66 KiB 256 ? 9cc222d5-ba45-4e61-ac2d-b42a31cb74b1 rack1
  48. © 2020 Scalar, inc. [Discussion] Jepsen tests migration • Jepsen

    test is now maintained in https://github.com/scalar-labs/scalar-jepsen • It would probably be more beneficial to many developers if it were migrated into the official Cassandra repo – Thoughts? 43
  49. © 2020 Scalar, inc. Summary • Scalar has enhanced Cassandra

    from various perspectives – More capable: ACID transactions with Scalar DB – Faster: Group CommitLog Sync – More reliable: Jepsen tests for LWT • They are mainly done without updating the core of C* – Making C* more loosely coupled would make such contributions much easier 44