Slide 1

Slide 1 text

Raft in Scylla: Consensus in an eventually consistent database @duarte_nunes

Slide 2

Slide 2 text

Presenter bio Duarte is a Software Engineer working on Scylla. He has a background in concurrent programming, distributed systems and low-latency software. Prior to ScyllaDB, he worked on distributed network virtualization.

Slide 3

Slide 3 text

Introduction

Slide 4

Slide 4 text

ScyllaDB ▪ Clustered NoSQL database compatible with Apache Cassandra ▪ ~10X throughput on the same hardware ▪ Low latency, especially at the higher percentiles ▪ Self-tuning ▪ Mechanically sympathetic C++17

Slide 5

Slide 5 text

Data placement ▪ Masterless ▪ Data is replicated across a set of replicas ▪ Data is partitioned across all nodes ▪ An operation can specify a Consistency Level

Slide 6

Slide 6 text

Data model (table diagram: rows for Partition Key1, ordered by Clustering Key1, Clustering Key2, ...) CREATE TABLE playlists (id int, song_id int, title text, PRIMARY KEY (id, song_id)); INSERT INTO playlists (id, song_id, title) VALUES (62, 209466, 'Ænima'); Sorted by Primary Key
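A minimal sketch (standard C++, not Scylla's storage engine) of the layout the diagram describes: rows grouped by partition key and kept sorted by clustering key within each partition. The Row type and the second sample row are made up for illustration.

#include <iostream>
#include <map>
#include <string>

struct Row { std::string title; };

using Partition = std::map<int, Row>;     // clustering key (song_id) -> row
using Table = std::map<int, Partition>;   // partition key (id) -> partition

int main() {
    Table playlists;
    playlists[62][209466] = Row{"Ænima"};      // the INSERT from the slide
    playlists[62][123456] = Row{"Lateralus"};  // made-up second row in the same partition
    for (const auto& [song_id, row] : playlists[62])   // iterates in clustering-key order
        std::cout << song_id << ' ' << row.title << '\n';
}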

Slide 7

Slide 7 text

Consistency

Slide 8

Slide 8 text

Consistency ✓ ✓

Slide 9

Slide 9 text

Consistency ✓ ✓ ✓

Slide 10

Slide 10 text

Consistency ✓ ✓ ✓

Slide 11

Slide 11 text

Consistency ✓

Slide 12

Slide 12 text

Consistency ✓ ✓ ✓

Slide 13

Slide 13 text

Concurrent Updates 1: c = ‘a’ 1: c = ‘b’

Slide 14

Slide 14 text

Seastar - http://seastar.io ▪ Thread-per-core design • No blocking, ever.

Slide 15

Slide 15 text

Seastar - http://seastar.io ▪ Thread-per-core design • No blocking, ever. ▪ Asynchronous networking, file I/O, multicore • Future/promise based APIs

Slide 16

Slide 16 text

Thread-per-core ▪ Request path is replicated per CPU ▪ Internal sharding by partition key • Like having multiple nodes on the same machine ▪ Uses the token's middle bits to avoid aliasing a node with a shard
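A rough sketch of the idea, not Scylla's actual token-to-shard mapping: the owning node is derived from the high bits of the partition token and the owning shard from bits further down, so a node and a shard never key off the same bits. The shift amounts and the modulo placement are illustrative assumptions.

#include <cstdint>
#include <iostream>

// Illustrative only: node ownership keyed off the top bits of the token,
// shard ownership keyed off "middle" bits, so the two never alias.
uint32_t node_for(uint64_t token, uint32_t node_count) {
    return static_cast<uint32_t>(token >> 52) % node_count;               // high bits
}

uint32_t shard_for(uint64_t token, uint32_t shard_count) {
    return static_cast<uint32_t>((token >> 20) & 0xffffffff) % shard_count;  // middle bits
}

int main() {
    uint64_t token = 0x1234'5678'9abc'def0;
    std::cout << "node " << node_for(token, 3)
              << ", shard " << shard_for(token, 8) << '\n';
}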

Slide 17

Slide 17 text

Motivation

Slide 18

Slide 18 text

Strong consistency ▪ Strong guarantees enable more use cases

Slide 19

Slide 19 text

Strong consistency ▪ Strong guarantees enable more use cases • Uniqueness constraints

Slide 20

Slide 20 text

Strong consistency ▪ Strong guarantees enable more use cases • Uniqueness constraints • Read-modify-write accesses

Slide 21

Slide 21 text

Strong consistency ▪ Strong guarantees enable more use cases • Uniqueness constraints • Read-modify-write accesses • All-or-nothing writes

Slide 22

Slide 22 text

Strong consistency ▪ Strong guarantees enable more use cases • Uniqueness constraints • Read-modify-write accesses • All-or-nothing writes ▪ Opt-in, due to inherent performance costs

Slide 23

Slide 23 text

Lightweight Transactions (LWT) ▪ Per-partition strong consistency

Slide 24

Slide 24 text

Lightweight Transactions (LWT) ▪ Per-partition strong consistency ▪ Essentially, a distributed compare-and-swap • INSERT ... IF NOT EXISTS DELETE ... IF EXISTS DELETE ... IF col_a = ? AND col_b = ? UPDATE ... IF col_a = ? AND col_b = ?

Slide 25

Slide 25 text

Lightweight Transactions (LWT) ▪ Per-partition strong consistency ▪ Essentially, a distributed compare-and-swap • INSERT ... IF NOT EXISTS DELETE ... IF EXISTS DELETE ... IF col_a = ? AND col_b = ? UPDATE ... IF col_a = ? AND col_b = ? ▪ Fast-path operation warranting high performance and availability

Slide 26

Slide 26 text

Lightweight Transactions (LWT) ▪ Per-partition strong consistency ▪ Essentially, a distributed compare-and-swap • INSERT ... IF NOT EXISTS DELETE ... IF EXISTS DELETE ... IF col_a = ? AND col_b = ? UPDATE ... IF col_a = ? AND col_b = ? ▪ Fast-path operation warranting high performance and availability ▪ Requires internal read-before-write
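A simplified sketch of the compare-and-swap semantics behind UPDATE ... IF, using hypothetical types: the replica reads the current row, checks the condition, and applies the write only if it matches. That read is the internal read-before-write mentioned above.

#include <iostream>
#include <map>
#include <string>

struct Row { std::string col_a; std::string col_b; };

using Partition = std::map<int, Row>;   // clustering key -> row

// UPDATE ... SET col_a = new_a IF col_a = expected_a
bool conditional_update(Partition& p, int key,
                        const std::string& expected_a,
                        const std::string& new_a) {
    auto it = p.find(key);                                // the internal read-before-write
    if (it == p.end() || it->second.col_a != expected_a)
        return false;                                     // condition not met, write not applied
    it->second.col_a = new_a;
    return true;
}

int main() {
    Partition p{{1, {"old", "x"}}};
    std::cout << conditional_update(p, 1, "old", "new") << '\n';   // 1: applied
    std::cout << conditional_update(p, 1, "old", "new") << '\n';   // 0: col_a is now "new"
}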

Slide 27

Slide 27 text

Consensus

Slide 28

Slide 28 text

Definition “The process by which we reach agreement over system state between unreliable machines connected by asynchronous networks”

Slide 29

Slide 29 text

Consensus Protocols ▪ Strong consistency guarantees about the underlying data

Slide 30

Slide 30 text

Consensus Protocols ▪ Strong consistency guarantees about the underlying data ▪ Leveraged to implement Replicated State Machines • A set of replicas working together as a coherent unit

Slide 31

Slide 31 text

Consensus Protocols ▪ Strong consistency guarantees about the underlying data ▪ Leveraged to implement Replicated State Machines • A set of replicas working together as a coherent unit ▪ Tolerate non-Byzantine failures • 2F + 1 nodes to tolerate F failures

Slide 32

Slide 32 text

Consensus Protocols ▪ Strong consistency guarantees about the underlying data ▪ Leveraged to implement Replicated State Machines • A set of replicas working together as a coherent unit ▪ Tolerate non-Byzantine failures • 2F + 1 nodes to tolerate F failures ▪ A consensus protocol round advances the underlying state
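A tiny illustration of the 2F + 1 sizing rule: with n = 2F + 1 replicas, F failures still leave a majority quorum alive.

#include <cassert>
#include <cstddef>

// Majority quorum of an n-replica group.
constexpr std::size_t quorum(std::size_t n) { return n / 2 + 1; }

int main() {
    // With 2F + 1 replicas, F failures leave exactly F + 1 live nodes, which is a quorum.
    for (std::size_t f = 0; f < 10; ++f) {
        std::size_t n = 2 * f + 1;
        assert(n - f >= quorum(n));
    }
}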

Slide 33

Slide 33 text

Guarantees ▪ Stability • If a value is decided at a replica p, it remains decided forever ▪ Agreement • No two replicas should decide differently ▪ Validity • If a value is decided, this value must have been proposed by at least one of the replicas ▪ Termination • Eventually, a decision is reached on all correct replicas

Slide 34

Slide 34 text

Choosing an Algorithm

Slide 35

Slide 35 text

Design Space ▪ Understandability

Slide 36

Slide 36 text

Paxos Made Live “There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system. In order to build a real-world system, an expert needs to use numerous ideas scattered in the literature and make several relatively small protocol extensions. The cumulative effort will be substantial and the final system will be based on an unproven protocol.” By Google, when building Chubby using Multi-Paxos and SMR

Slide 37

Slide 37 text

Design Space ▪ Understandability ▪ Experienced & successful, well-known usages

Slide 38

Slide 38 text

Design Space ▪ Understandability ▪ Experienced & successful, well-known usages ▪ Latency, RTTs to agreement

Slide 39

Slide 39 text

Design Space ▪ Understandability ▪ Experienced & successful, well-known usages ▪ Latency, RTTs to agreement ▪ General performance (e.g., batching opportunities)

Slide 40

Slide 40 text

Leader vs RTTs ▪ Any node can decide a value • At least 2 RTTs • Classical Paxos, CASPaxos

Slide 41

Slide 41 text

Leader vs RTTs ▪ Any node can decide a value • At least 2 RTTs • Classical Paxos, CASPaxos ▪ Leader election • 1 RTT, but the leader can limit throughput • Multi-Paxos, Raft, Zab

Slide 42

Slide 42 text

Challenges ▪ Dealing with limited storage capacity ▪ Effectively handling read-only requests ▪ Dynamic membership and cluster reconfiguration ▪ Multi-key transaction support ▪ Acceptable performance over the WAN ▪ Formal and empirical validation of its safety

Slide 43

Slide 43 text

Raft ▪ Focused on understandability ▪ Widely used ▪ Amenable to nice optimizations ▪ Strong leadership ▪ For LWT, leader does read-before-write ▪ Log easily compacted in our case

Slide 44

Slide 44 text

Raft states (diagram: Follower → Candidate → Leader)
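A sketch of these three roles and the standard Raft transitions between them (election timeout, winning an election, observing a higher term). This mirrors the Raft paper's state diagram rather than Scylla's implementation.

#include <cassert>

enum class State { Follower, Candidate, Leader };

enum class Event {
    ElectionTimeout,   // no heartbeat from a leader within the timeout
    WonElection,       // received votes from a majority
    SawHigherTerm      // saw a message with a higher term (or a current leader)
};

// Standard Raft role transitions.
State step(State s, Event e) {
    switch (s) {
    case State::Follower:
        return e == Event::ElectionTimeout ? State::Candidate : s;
    case State::Candidate:
        if (e == Event::WonElection)   return State::Leader;
        if (e == Event::SawHigherTerm) return State::Follower;
        return s;   // another timeout: start a new election, stay Candidate
    case State::Leader:
        return e == Event::SawHigherTerm ? State::Follower : s;
    }
    return s;
}

int main() {
    State s = State::Follower;
    s = step(s, Event::ElectionTimeout);   // Follower -> Candidate
    s = step(s, Event::WonElection);       // Candidate -> Leader
    assert(s == State::Leader);
}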

Slide 45

Slide 45 text

Raft components (diagram: Client, Consensus Module, Log, State Machine)

Slide 46

Slide 46 text

Raft guarantees (1/2) ▪ Election Safety • At most one leader can be elected in a given term ▪ Leader Append-Only • A leader only appends new entries to its log ▪ Log Matching • If two logs contain an entry with the same index and term, then the logs are identical in all entries up through the given index
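A sketch of the follower-side check that enforces the Log Matching property during replication: the entry just before the new ones must exist locally with the same term, otherwise the leader backs up and retries. The LogEntry layout and 1-based indexing are illustrative, not Scylla's log format.

#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct LogEntry {
    uint64_t term;
    std::string command;   // the replicated command, e.g. a mutation
};

// AppendEntries consistency check: the entry at prev_index (1-based) must
// exist and carry prev_term for the new entries to be accepted.
bool log_matches(const std::vector<LogEntry>& log,
                 uint64_t prev_index, uint64_t prev_term) {
    if (prev_index == 0) return true;              // appending from the start of the log
    if (prev_index > log.size()) return false;     // missing entries
    return log[prev_index - 1].term == prev_term;
}

int main() {
    std::vector<LogEntry> log{{1, "a"}, {1, "b"}, {2, "c"}};
    assert(log_matches(log, 3, 2));    // matches: entry 3 has term 2
    assert(!log_matches(log, 3, 1));   // conflict: terms differ
}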

Slide 47

Slide 47 text

Raft guarantees (2/2) ▪ Leader Completeness • If a log entry is committed in a given term, then that entry will be present in the logs of the leaders for all higher-numbered terms ▪ State Machine Safety • If a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index

Slide 48

Slide 48 text

Time in Raft (diagram: time divided into terms T1, T2, T3, T4; each term begins with a leader election)

Slide 49

Slide 49 text

Scylla Raft

Slide 50

Slide 50 text

Design (diagram: how Nodes, Groups, Keyspaces and Vnodes relate)

Slide 51

Slide 51 text

Design ▪ A Scylla node participates in more than one group • A natural consequence of how data is partitioned

Slide 52

Slide 52 text

Design ▪ A Scylla node participates in more than one group • A natural consequence of how data is partitioned • Increased concurrency

Slide 53

Slide 53 text

Design ▪ A Scylla node participates in more than one group • A natural consequence of how data is partitioned • Increased concurrency • Leader failures affect only a subset of operations*

Slide 54

Slide 54 text

Design ▪ A Scylla node participates in more than one group • A natural consequence of how data is partitioned • Increased concurrency • Leader failures affect only a subset of operations* ▪ Each group on a node is itself sharded

Slide 55

Slide 55 text

Design ▪ A Scylla node participates in more than one group • A natural consequence of how data is partitioned • Increased concurrency • Leader failures affect only a subset of operations* ▪ Each group on a node is itself sharded • A shard handles a subset of the operations of the group

Slide 56

Slide 56 text

Design ▪ A Scylla node participates in more than one group • A natural consequence of how data is partitioned • Increased concurrency • Leader failures affect only a subset of operations* ▪ Each group on a node is itself sharded • A shard handles a subset of the operations of the group • Impacts how logs are organized

Slide 57

Slide 57 text

Organization (diagram: each group, Group 0 through Group N - 1, has its own Log, Core Raft, RPC, and Database State)

Slide 58

Slide 58 text

Organization (diagram: Shard 0 through Shard N - 1, each hosting the Log, Core Raft, RPC, and Database State for Group 0 through Group N - 1)

Slide 59

Slide 59 text

Organization (diagram: as above, with Heartbeats; Shard 0 through Shard N - 1 host the Log, Core Raft, RPC, and Database State for Group 0 through Group N - 1)

Slide 60

Slide 60 text

Write path (simplified), on Node N, Shard S, given a mutation and its restrictions: 1. lock() the affected cells via the cell_locker, obtaining locked_cell[] 2. query() the Database for the current values 3. Match the restrictions against the result 4. Append the mutation to the Log if matched 5. Replicate to a majority over RPC 6. apply() the mutation to the Database 7. Release the locks
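A condensed sketch of the numbered steps above. The types and helpers (Mutation, Restrictions, lock_cells, replicate_to_majority) are hypothetical stand-ins for Scylla's actual cell_locker, RPC, and database interfaces.

#include <string>

// Illustrative stand-ins, not the real interfaces.
struct Mutation { std::string partition_key; std::string update; };
struct Restrictions { std::string expected; };
struct LockedCells { /* holds the partition's cell locks until destroyed */ };

struct Database {
    std::string read(const std::string& key) { return "current"; }   // backing 2. query()
    void apply(const Mutation& m) { /* write the mutation */ }       // backing 6. apply()
};

LockedCells lock_cells(const std::string& key) { return {}; }        // backing 1. lock()
bool replicate_to_majority(const Mutation& m) { return true; }       // backing 5.

// Simplified LWT write path on node N, shard S.
bool lwt_write(Database& db, const Mutation& m, const Restrictions& r) {
    LockedCells guard = lock_cells(m.partition_key);   // 1. lock()
    std::string current = db.read(m.partition_key);    // 2. query()
    if (current != r.expected)                         // 3. match restrictions
        return false;                                  //    not matched: nothing appended
    // 4. append to the log if matched, then
    if (!replicate_to_majority(m))                     // 5. replicate to a majority
        return false;
    db.apply(m);                                       // 6. apply()
    return true;                                       // 7. locks released with guard
}

int main() {
    Database db;
    lwt_write(db, {"pk", "new"}, {"current"});
}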

Slide 61

Slide 61 text

Sharding ▪ State explosion if groups are per-shard ▪ (term, leader shard, index) tuple ▪ Heterogeneous cluster ▪ Resharding

Slide 62

Slide 62 text

Homogeneous nodes ▪ All data is correctly partitioned across the cluster (diagram: leader shard 0 replicates to follower shard 0 over RPC)

Slide 63

Slide 63 text

Heterogeneous nodes (1) ▪ Entries not totally ordered at the follower (diagram: leader shards 0 and 1 both replicate to follower shard 0 over RPC)

Slide 64

Slide 64 text

Heterogeneous nodes (2) ▪ Follower logs will contain gaps (diagram: leader shard 0 replicates to follower shards 0 and 1 over RPC)

Slide 65

Slide 65 text

Sharding solutions ▪ Organize logs by term and leader shard • For each term, the leader shard count is stable

Slide 66

Slide 66 text

Sharding solutions ▪ Organize logs by term and leader shard • For each term, the leader shard count is stable ▪ Leader restart ends the term

Slide 67

Slide 67 text

Sharding solutions ▪ Organize logs by term and leader shard • For each term, the leader shard count is stable ▪ Leader restart ends the term ▪ A log is stored for each leader shard • May require synchronization at followers with different shard count
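A minimal sketch of this log organization, assuming illustrative LogKey and LogEntry types: one log per (term, leader shard) pair, so entries produced by different leader shards never interleave within a term.

#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct LogEntry { uint64_t index; std::string command; };

// Key a log by the term and the leader shard that produced its entries.
struct LogKey {
    uint64_t term;
    uint32_t leader_shard;
    bool operator<(const LogKey& o) const {
        return term != o.term ? term < o.term : leader_shard < o.leader_shard;
    }
};

using ShardedLogs = std::map<LogKey, std::vector<LogEntry>>;

int main() {
    ShardedLogs logs;
    logs[{7, 0}].push_back({1, "update A"});   // entries from leader shard 0, term 7
    logs[{7, 1}].push_back({1, "update B"});   // entries from leader shard 1, term 7
}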

Slide 68

Slide 68 text

Log compaction ▪ Log entries are applied to the state machine when committed • A log entry is committed when replicated to a majority of nodes • The commit index of an entry is per shard

Slide 69

Slide 69 text

Log compaction ▪ Log entries are applied to the state machine when committed • A log entry is committed when replicated to a majority of nodes • The commit index of an entry is per shard ▪ Committed entries can be discarded • Responsibility for a prefix of the log is transferred to the database

Slide 70

Slide 70 text

Log compaction ▪ Log entries are applied to the state machine when committed • A log entry is committed when replicated to a majority of nodes • The commit index of an entry is per shard ▪ Committed entries can be discarded • Responsibility for a prefix of the log is transferred to the database ▪ Leverage repair • Benign overlap between log entries and database state • How many committed log segments to keep around?
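A minimal sketch of discarding the committed, already-applied prefix of a shard's log. The Log type and compact_up_to helper are hypothetical, and how many committed segments to retain for repair remains a policy knob.

#include <cstdint>
#include <deque>
#include <string>

struct LogEntry { uint64_t index; std::string command; };

struct Log {
    std::deque<LogEntry> entries;     // oldest first
    uint64_t commit_index = 0;        // highest replicated-to-majority index (per shard)

    // Drop entries up to 'applied', for which the database now holds responsibility.
    void compact_up_to(uint64_t applied) {
        while (!entries.empty() && entries.front().index <= applied &&
               entries.front().index <= commit_index)
            entries.pop_front();
    }
};

int main() {
    Log log;
    for (uint64_t i = 1; i <= 5; ++i) log.entries.push_back({i, "cmd"});
    log.commit_index = 3;
    log.compact_up_to(3);   // entries 1..3 applied to the database and discarded
}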

Slide 71

Slide 71 text

Membership changes ▪ For the set of keyspaces K, each node will be a member of Σ_{ks ∈ K} RF(ks) × vnodes groups • TL;DR: potentially many groups ▪ Raft restricts the types of changes that are allowed • Only one server can be added or removed from a group at a time • Complex changes implemented as a series of single-server changes • Unless the RF is being changed, group member count is constant
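A back-of-the-envelope illustration of that formula, with made-up replication factors and a made-up vnode count.

#include <cstddef>
#include <iostream>
#include <map>
#include <string>

int main() {
    // Hypothetical cluster: RF per keyspace and vnodes per node.
    std::map<std::string, std::size_t> rf = {{"ks1", 3}, {"ks2", 5}};
    std::size_t vnodes = 256;

    // groups per node = sum over keyspaces of RF(ks) * vnodes
    std::size_t groups = 0;
    for (const auto& [ks, factor] : rf) groups += factor * vnodes;

    std::cout << groups << " groups per node\n";   // (3 + 5) * 256 = 2048
}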

Slide 72

Slide 72 text

Adding a node (diagram: a three-node cluster A, B, C)

Slide 73

Slide 73 text

Adding a node (diagram: node D bootstraps and joins the cluster A, B, C)

Slide 74

Slide 74 text

Bootstrap + Raft (1) ▪ The ranges for the Raft group of which B is the primary replica have changed ▪ Node D introduces a new Raft group, of which D, B and C are members ▪ Node D joins the Raft group of which A is the primary replica ▪ Node C leaves the Raft group of which A is the primary replica

Slide 75

Slide 75 text

Bootstrap + Raft (2) ▪ Need to wait until the new node has joined all groups • Then the other nodes can exit groups ▪ Doing a single operation at a time ensures majorities overlap ▪ Configurations are special log entries ▪ New nodes become non-voting members until they are caught up
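A sketch of serializing membership changes one server at a time, with new nodes joining as non-voting members until caught up. The Configuration type and its operations are hypothetical, not Raft's or Scylla's concrete configuration-entry format; in the real protocol each change is itself a special log entry.

#include <set>
#include <stdexcept>
#include <string>

struct Configuration {
    std::set<std::string> voters;
    std::set<std::string> non_voting;   // learners catching up on the log

    // Only one server is added or removed per change, so consecutive
    // configurations always share a majority.
    void add_learner(const std::string& node) { non_voting.insert(node); }

    void promote(const std::string& node) {
        if (!non_voting.erase(node))
            throw std::runtime_error("node is not a learner");
        voters.insert(node);            // now counts toward the quorum
    }

    void remove(const std::string& node) { voters.erase(node); }
};

int main() {
    Configuration cfg{{"A", "B", "C"}, {}};
    cfg.add_learner("D");   // D replicates the log but does not vote yet
    cfg.promote("D");       // single-server change: D becomes a voter
    cfg.remove("C");        // another single-server change
}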

Slide 76

Slide 76 text

Transactions

Slide 77

Slide 77 text

Another LWT constraint ▪ LOCAL_SERIAL consistency level • Precludes cross-DC groups • Each DC has its own group for a token range • Need agreement between two groups ▪ More up-front work, but worth it later • Leveraged for multi-partition transactions

Slide 78

Slide 78 text

Agreement (diagram: Node, Group, Keyspace, Vnode, with agreement added between groups)

Slide 79

Slide 79 text

Replicated 2PC (diagram: Coordinator, Resource Managers, Transaction Status)

Slide 80

Slide 80 text

Internal Use Cases

Slide 81

Slide 81 text

Concurrent schema changes ▪ Schema changes are carried out locally and then propagated throughout the cluster

Slide 82

Slide 82 text

Concurrent schema changes ▪ Schema changes are carried out locally and then propagated throughout the cluster ▪ No protection against concurrent schema changes • Different IDs even if the schema is the same • Cluster-wide order of changes is not enforced

Slide 83

Slide 83 text

Distributed schema tables CREATE TABLE ks.t (p int, x my_type, PRIMARY KEY (p)); DROP TYPE ks.my_type;

Slide 84

Slide 84 text

Distributed schema tables CREATE TABLE ks.t (p int, x my_type, PRIMARY KEY (p)); DROP TYPE ks.my_type;

Slide 85

Slide 85 text

Range movements ▪ Concurrent range operations can select overlapping token ranges

Slide 86

Slide 86 text

Range movements ▪ Concurrent range operations can select overlapping token ranges ▪ Token selection is not optimal

Slide 87

Slide 87 text

Range movements ▪ Concurrent range operations can select overlapping token ranges ▪ Token selection is not optimal ▪ Can’t use a partitioning approach

Slide 88

Slide 88 text

Range movements ▪ Concurrent range operations can select overlapping token ranges ▪ Token selection is not optimal ▪ Can’t use a partitioning approach • Need to centralize token selection

Slide 89

Slide 89 text

Global Group

Slide 90

Slide 90 text

Specific Group

Slide 91

Slide 91 text

Consistent materialized views ▪ Views are updated asynchronously • Eventually consistent • Preserves base replica availability • Issues with consistency, flow control

Slide 92

Slide 92 text

Consistent materialized views ▪ Views are updated asynchronously • Eventually consistent • Preserves base replica availability • Issues with consistency, flow control ▪ Can leverage multi-key transactions • Base and view are both updated, or none are • Strongly consistent

Slide 93

Slide 93 text

Thank You! Any Questions? Please stay in touch: duarte@scylladb.com @duarte_nunes