Raft in Scylla

Duarte Nunes
February 02, 2019

This talk will cover the characteristics and requirements of Scylla's Raft implementation, how it enables strongly consistent updates, and how it improves the reliability and safety of internal processes, such as schema changes, node membership, and range movements.

Eventually consistent databases choose to remain available under failure, allowing for conflicting data to be stored in different replicas (later repaired by background processes). Weakening the consistency guarantees improves not only availability, but also performance, as the number of replicas involved in a given operation can be minimized. There are, however, use-cases that require the opposite trade-off. Indeed, Apache Cassandra and Scylla provide Lightweight Transactions (LWT), which allow single-key linearizable updates. The mechanism underlying LWT is asynchronous consensus in the form of the Raft algorithm. In this talk, we'll describe the characteristics and requirements of Scylla's consensus implementation, and how it enables strongly consistent updates. We will also cover how consensus can be applied to other aspects of the system, such as schema changes, node membership, and range movements, in order to improve their reliability and safety. We will thus show that an eventually consistent database can leverage consensus without compromising either availability or performance.

Transcript

  1. Presenter bio Duarte is a Software Engineer working on Scylla.

    He has a background in concurrent programming, distributed systems and low-latency software. Prior to ScyllaDB, he worked on distributed network virtualization.
  2. ScyllaDB ▪ Clustered NoSQL database compatible with Apache Cassandra ▪

    ~10X throughput on the same hardware ▪ Low latency, esp. at the higher percentiles ▪ Self-tuning ▪ Mechanically sympathetic C++17
  3. Data placement ▪ Masterless ▪ Data is replicated across a

    set of replicas ▪ Data is partitioned across all nodes ▪ An operation can specify a Consistency Level
  4. Data model [table diagram: a partition (Partition Key1) holding rows

    with Clustering Key1, Clustering Key2, ..., sorted by primary key] CREATE TABLE playlists (id int, song_id int, title text, PRIMARY KEY (id, song_id)); INSERT INTO playlists (id, song_id, title) VALUES (62, 209466, 'Ænima');
  5. Seastar - http://seastar.io ▪ Thread-per-core design • No blocking, ever.

    ▪ Asynchronous networking, file I/O, multicore • Future/promise based APIs
  6. Thread-per-core ▪ Request path is replicated per CPU ▪ Internal

    sharding by partition key • Like having multiple nodes on the same machine ▪ Uses middle bits to avoid aliasing a node with a shard
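
    A minimal sketch of the idea (illustrative only, not Scylla's actual sharding code): node placement is driven by the topmost token bits, so deriving the shard from bits further down keeps the two assignments independent. The constant ignore_msb_bits and the function shard_of are hypothetical names; the code assumes a compiler with the __int128 extension (GCC/Clang).

        #include <cstdint>
        #include <cstdio>

        // Hypothetical: how many high bits to skip because they already
        // drive node placement (vnode ranges).
        constexpr unsigned ignore_msb_bits = 12;

        // Map a 64-bit partition token to one of shard_count shards using
        // the bits below the ones that determine node ownership.
        uint32_t shard_of(uint64_t token, uint32_t shard_count) {
            uint64_t middle = token << ignore_msb_bits;   // drop the high bits
            // Scale the remaining bits onto [0, shard_count) with a multiply-shift.
            return static_cast<uint32_t>(
                (static_cast<unsigned __int128>(middle) * shard_count) >> 64);
        }

        int main() {
            // Two tokens sharing their high bits (same owner in this toy model)
            // still land on different shards: prints "2 7".
            std::printf("%u %u\n", shard_of(0x1234'5678'9abc'def0ULL, 8),
                                   shard_of(0x123F'0000'0000'0000ULL, 8));
        }
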
  7. Strong consistency ▪ Strong guarantees enable more use cases •

    Uniqueness constraints • Read-modify-write accesses
  8. Strong consistency ▪ Strong guarantees enable more use cases •

    Uniqueness constraints • Read-modify-write accesses • All-or-nothing writes
  9. Strong consistency ▪ Strong guarantees enable more use cases •

    Uniqueness constraints • Read-modify-write accesses • All-or-nothing writes ▪ Opt-in, due to inherent performance costs
  10. Lightweight Transactions (LWT) ▪ Per-partition strong consistency ▪ Essentially, a

    distributed compare-and-swap • INSERT ... IF NOT EXISTS • DELETE ... IF EXISTS • DELETE ... IF col_a = ? AND col_b = ? • UPDATE ... IF col_a = ? AND col_b = ?
  11. Lightweight Transactions (LWT) ▪ Per-partition strong consistency ▪ Essentially, a

    distributed compare-and-swap • INSERT ... IF NOT EXISTS • DELETE ... IF EXISTS • DELETE ... IF col_a = ? AND col_b = ? • UPDATE ... IF col_a = ? AND col_b = ? ▪ Fast-path operation warranting high performance and availability
  12. Lightweight Transactions (LWT) ▪ Per-partition strong consistency ▪ Essentially, a

    distributed compare-and-swap • INSERT ... IF NOT EXISTS • DELETE ... IF EXISTS • DELETE ... IF col_a = ? AND col_b = ? • UPDATE ... IF col_a = ? AND col_b = ? ▪ Fast-path operation warranting high performance and availability ▪ Requires internal read-before-write
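
    A toy, single-node stand-in for the compare-and-swap shape of an LWT such as UPDATE ... IF col = ? (names and layout are illustrative, not Scylla's API). In the real system the same read-check-apply sequence runs as a consensus round across the partition's replicas; here it only shows why a read-before-write is unavoidable.

        #include <iostream>
        #include <map>
        #include <string>

        std::map<int, std::string> titles;   // partition key -> title

        bool conditional_update(int key, const std::string& expected,
                                const std::string& desired) {
            auto it = titles.find(key);              // read-before-write
            if (it == titles.end() || it->second != expected) {
                return false;                        // condition failed: not applied
            }
            it->second = desired;                    // condition held: apply
            return true;
        }

        int main() {
            titles[62] = "Ænima";
            std::cout << conditional_update(62, "Lateralus", "Eulogy") << '\n';  // prints 0
            std::cout << conditional_update(62, "Ænima", "Eulogy") << '\n';      // prints 1
        }
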
  13. Definition “The process by which we reach agreement over system

    state between unreliable machines connected by asynchronous networks”
  14. Consensus Protocols ▪ Strong consistency guarantees about the underlying data

    ▪ Leveraged to implement Replicated State Machines • A set of replicas working together as a coherent unit
  15. Consensus Protocols ▪ Strong consistency guarantees about the underlying data

    ▪ Leveraged to implement Replicated State Machines • A set of replicas working together as a coherent unit ▪ Tolerate non-Byzantine failures • 2F + 1 nodes to tolerate F failures
  16. Consensus Protocols ▪ Strong consistency guarantees about the underlying data

    ▪ Leveraged to implement Replicated State Machines • A set of replicas working together as a coherent unit ▪ Tolerate non-Byzantine failures • 2F + 1 nodes to tolerate F failures ▪ A consensus protocol round advances the underlying state
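
    For example, a five-node group (2F + 1 with F = 2) keeps making progress with any two nodes down, since the remaining three still form a majority; tolerating the same two failures with fewer than five nodes would leave no guaranteed majority.
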
  17. ▪ Stability • If a value is decided at a

    replica p, it remains decided forever ▪ Agreement • No two replicas should decide differently ▪ Validity • If a value is decided, this value must have been proposed by at least one of the replicas ▪ Termination • Eventually, a decision is reached on all correct replicas Guarantees
  18. Paxos Made Live “There are significant gaps between the description

    of the Paxos algorithm and the needs of a real-world system. In order to build a real-world system, an expert needs to use numerous ideas scattered in the literature and make several relatively small protocol extensions. The cumulative effort will be substantial and the final system will be based on an unproven protocol.” By Google, when building Chubby using Multi-Paxos and SMR
  19. Design Space ▪ Understandability ▪ Proven in successful, well-known usages

    ▪ Latency, RTTs to agreement ▪ General performance (e.g., batching opportunities)
  20. Leader vs RTTs ▪ Any node can decide a value

    • At least 2 RTTs • Classical Paxos, CASPaxos
  21. Leader vs RTTs ▪ Any node can decide a value

    • At least 2 RTTs • Classical Paxos, CASPaxos ▪ Leader election • 1 RTT, but leader can limit throughput • Multi-Paxos, Raft, Zab
  22. Challenges ▪ Dealing with limited storage capacity ▪ Effectively handling

    read-only requests ▪ Dynamic membership and cluster reconfiguration ▪ Multi-key transaction support ▪ Acceptable performance over the WAN ▪ Formal and empirical validation of its safety
  23. Raft ▪ Focused on understandability ▪ Widely used ▪ Amenable

    to nice optimizations ▪ Strong leadership ▪ For LWT, leader does read-before-write ▪ Log easily compacted in our case
  24. Raft guarantees (1/2) ▪ Election Safety • At most one

    leader can be elected in a given term ▪ Leader Append-Only • A leader only appends new entries to its log ▪ Log Matching • If two logs contain an entry with the same index and term, then the logs are identical in all entries up through the given index
  25. Raft guarantees (2/2) ▪ Leader Completeness • If a log

    entry is committed in a given term, then that entry will be present in the logs of the leaders for all higher-numbered terms ▪ State Machine Safety • If a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index
  26. Design ▪ A Scylla node participates in more than one

    group • A natural consequence of how data is partitioned
  27. Design ▪ A Scylla node participates in more than one

    group • A natural consequence of how data is partitioned • Increased concurrency
  28. Design ▪ A Scylla node participates in more than one

    group • A natural consequence of how data is partitioned • Increased concurrency • Leader failures affect only a subset of operations*
  29. Design ▪ A Scylla node participates in more than one

    group • A natural consequence of how data is partitioned • Increased concurrency • Leader failures affect only a subset of operations* ▪ Each group on a node is itself sharded
  30. Design ▪ A Scylla node participates in more than one

    group • A natural consequence of how data is partitioned • Increased concurrency • Leader failures affect only a subset of operations* ▪ Each group on a node is itself sharded • A shard handles a subset of the operations of the group
  31. Design ▪ A Scylla node participates in more than one

    group • A natural consequence of how data is partitioned • Increased concurrency • Leader failures affect only a subset of operations* ▪ Each group on a node is itself sharded • A shard handles a subset of the operations of the group • Impacts how logs are organized
  32. Organization [diagram: Group 0 through Group N - 1, each with its own Log, Core Raft, Database State, and RPC components]
  33. Organization [diagram: the same per-group stack (Log, Core Raft, Database State, RPC), shown for each shard, Shard 0 through Shard N - 1]
  34. Organization [diagram: the same per-shard, per-group organization, with heartbeats added]
  35. Write path (simplified), on Node N, Shard S: 1. lock() the affected

    cells via cell_locker, obtaining locked_cell[] 2. query() the database for the current values 3. Match the restrictions 4. Append the mutation to the Log if matched 5. Replicate to a majority over RPC 6. apply() the mutation to the database 7. Release the locks
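
    A pseudocode-style rendering of that sequence in plain C++; every function below is a stub standing in for the corresponding numbered step, not an actual Scylla or Seastar API (which is asynchronous and future-based):

        #include <stdexcept>
        #include <vector>

        struct mutation {};  struct restriction {};  struct locked_cell {};

        // Stubs for the numbered steps above.
        std::vector<locked_cell> lock_cells(const mutation&) { return {}; }   // 1. cell_locker
        mutation query_current(const mutation&) { return {}; }                // 2. read-before-write
        bool matches(const mutation&, const std::vector<restriction>&) { return true; }  // 3. match
        void append_to_log(const mutation&) {}                                // 4. append if matched
        bool replicate_to_majority(const mutation&) { return true; }          // 5. Raft replication
        void apply_to_database(const mutation&) {}                            // 6. apply()

        bool lwt_write(const mutation& m, const std::vector<restriction>& rs) {
            [[maybe_unused]] auto locks = lock_cells(m);   // held for the whole operation
            auto current = query_current(m);
            if (!matches(current, rs)) {
                return false;                  // condition failed; nothing is replicated
            }
            append_to_log(m);
            if (!replicate_to_majority(m)) {
                throw std::runtime_error("lost quorum");   // caller retries or times out
            }
            apply_to_database(m);
            return true;                       // 7. locks released on return
        }

        int main() { return lwt_write(mutation{}, {}) ? 0 : 1; }
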
  36. Sharding ▪ State explosion if groups are per-shard ▪ (term,

    leader shard, index) tuple ▪ Heterogeneous cluster ▪ Resharding
  37. Homogeneous nodes ▪ All data is correctly partitioned across the cluster [diagram: leader shard 0 replicating over RPC to follower shard 0]
  38. Heterogeneous nodes (1) ▪ Entries are not totally ordered at the follower [diagram: a leader with shards 0 and 1 replicating over RPC to a follower with only shard 0]
  39. Heterogeneous nodes (2) ▪ Follower logs will contain gaps [diagram: a leader with shard 0 replicating over RPC to a follower with shards 0 and 1]
  40. Sharding solutions ▪ Organize logs by term and leader shard

    • For each term, the leader shard count is stable
  41. Sharding solutions ▪ Organize logs by term and leader shard

    • For each term, the leader shard count is stable ▪ Leader restart ends the term
  42. Sharding solutions ▪ Organize logs by term and leader shard

    • For each term, the leader shard count is stable ▪ Leader restart ends the term ▪ A log is stored for each leader shard • May require synchronization at followers with different shard count
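
    A sketch of how such per-(term, leader shard) logs might be keyed; the types and names are illustrative, not Scylla's actual data layout:

        #include <cstdint>
        #include <map>
        #include <string>
        #include <tuple>
        #include <utility>
        #include <vector>

        // Entries are addressed by (term, leader shard, index); a follower
        // keeps one log per (term, leader shard) pair, so it never has to
        // impose a single total order on entries from different leader shards.
        struct entry_id {
            uint64_t term;            // a leader restart ends the term
            uint32_t leader_shard;    // shard on the leader that produced the entry
            uint64_t index;           // position within that (term, shard) log

            bool operator<(const entry_id& o) const {
                return std::tie(term, leader_shard, index) <
                       std::tie(o.term, o.leader_shard, o.index);
            }
        };

        using shard_log = std::vector<std::string>;   // entry payloads
        using follower_logs = std::map<std::pair<uint64_t, uint32_t>, shard_log>;

        int main() {
            follower_logs logs;
            logs[{7, 0}].push_back("entry from term 7, leader shard 0");
            logs[{7, 1}].push_back("entry from term 7, leader shard 1");
            return logs.size() == 2 ? 0 : 1;
        }
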
  43. Log compaction ▪ Log entries are applied to the state

    machine when committed • A log entry is committed when replicated to a majority of nodes • The commit index of an entry is per shard
  44. Log compaction ▪ Log entries are applied to the state

    machine when committed • A log entry is committed when replicated to a majority of nodes • The commit index of an entry is per shard ▪ Committed entries can be discarded • Responsibility for a prefix of the log is transferred to the database
  45. Log compaction ▪ Log entries are applied to the state

    machine when committed • A log entry is committed when replicated to a majority of nodes • The commit index of an entry is per shard ▪ Committed entries can be discarded • Responsibility for a prefix of the log is transferred to the database ▪ Leverage repair • Benign overlap between log entries and database state • How many committed log segments to keep around?
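
    A small sketch of that compaction rule (illustrative only; the keep_committed knob and the container choice are assumptions, not Scylla's implementation):

        #include <cstdint>
        #include <deque>
        #include <string>

        struct log_entry { uint64_t index; std::string payload; };

        struct raft_log {
            std::deque<log_entry> entries;   // oldest entry at the front
            uint64_t applied_index = 0;      // last index durably applied to the database

            // Drop the applied prefix, but keep the last keep_committed applied
            // entries around so log replication and repair overlap benignly.
            void compact(uint64_t keep_committed) {
                while (!entries.empty() &&
                       entries.front().index + keep_committed <= applied_index) {
                    entries.pop_front();     // this prefix now lives only in the database
                }
            }
        };

        int main() {
            raft_log log;
            for (uint64_t i = 1; i <= 10; ++i) log.entries.push_back({i, "m"});
            log.applied_index = 8;
            log.compact(2);                  // keeps indexes 7..10
            return log.entries.front().index == 7 ? 0 : 1;
        }
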
  46. Membership changes ▪ For the set of keyspaces K, each

    node will be a member of Σ_{ks ∈ K} RF(ks) × vnodes groups • TL;DR: potentially many groups ▪ Raft restricts the types of changes that are allowed • Only one server can be added or removed from a group at a time • Complex changes implemented as a series of single-server changes • Unless the RF is being changed, group member count is constant
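
    To put a number on "potentially many groups" (illustrative figures, not from the talk): with 256 vnodes per node and two keyspaces at RF = 3, each node would sit in roughly 2 × 3 × 256 = 1,536 Raft groups.
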
  47. Bootstrap + Raft (1) ▪ The ranges for the Raft

    group of which B is the primary replica have changed ▪ Node D introduces a new Raft group, of which D, B and C are members ▪ Node D joins the Raft group of which A is the primary replica ▪ Node C leaves the Raft group of which A is the primary replica
  48. Bootstrap + Raft (2) ▪ Need to wait until the

    new node has joined all groups • Then the other nodes can exit groups ▪ Doing a single operation at a time ensures majorities overlap ▪ Configurations are special log entries ▪ New nodes become non-voting members until they are caught up
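
    Why one-at-a-time changes keep majorities overlapping: adding a single server to a group of N means any majority of the old configuration has at least ⌊N/2⌋ + 1 members and any majority of the new one has at least ⌊(N+1)/2⌋ + 1, and (⌊N/2⌋ + 1) + (⌊(N+1)/2⌋ + 1) = N + 2 > N + 1, so the two majorities always share a member and cannot make conflicting decisions.
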
  49. Another LWT constraint ▪ LOCAL_SERIAL consistency level • Precludes cross-DC

    groups • Each DC has its own group for a token range • Need agreement between two groups ▪ More up-front work, but worth it later • Leveraged for multi-partition transactions
  50. Concurrent schema changes ▪ Schema changes are carried out locally

    and then propagated throughout the cluster ▪ No protection against concurrent schema changes • Different IDs even if the schema is the same • Cluster-wide order of changes is not enforced
  51. Distributed schema tables CREATE TABLE ks.t ( p int, x

    my_type, PRIMARY KEY (p)); DROP TYPE ks.my_type;
  52. Distributed schema tables CREATE TABLE ks.t ( p int, x

    my_type, PRIMARY KEY (p)); DROP TYPE ks.my_type;
  53. Range movements ▪ Concurrent range operations can select overlapping token

    ranges ▪ Token selection is not optimal ▪ Can’t use a partitioning approach
  54. Range movements ▪ Concurrent range operations can select overlapping token

    ranges ▪ Token selection is not optimal ▪ Can’t use a partitioning approach • Need to centralize token selection
  55. Consistent materialized views ▪ Views are updated asynchronously • Eventually

    consistent • Preserves base replica availability • Issues with consistency, flow control
  56. Consistent materialized views ▪ Views are updated asynchronously • Eventually

    consistent • Preserves base replica availability • Issues with consistency, flow control ▪ Can leverage multi-key transactions • Base and view are both updated, or neither is • Strongly consistent