Raft in Scylla

Duarte Nunes
February 02, 2019

This talk will cover the characteristics and requirements of Scylla's Raft implementation, how it enables strongly consistent updates, and how it improves the reliability and safety of internal processes, such as schema changes, node membership, and range movements.

Eventually consistent databases choose to remain available under failure, allowing for conflicting data to be stored in different replicas (later repaired by background processes). Weakening the consistency guarantees improves not only availability, but also performance, as the number of replicas involved in a given operation can be minimized. There are, however, use-cases that require the opposite trade-off. Indeed, Apache Cassandra and Scylla provide Lightweight Transactions (LWT), which allow single-key linearizable updates. The mechanism underlying LWT is asynchronous consensus in the form of the Raft algorithm. In this talk, we'll describe the characteristics and requirements of Scylla's consensus implementation, and how it enables strongly consistent updates. We will also cover how consensus can be applied to other aspects of the system, such as schema changes, node membership, and range movements, in order to improve their reliability and safety. We will thus show that an eventually consistent database can leverage consensus without compromising either availability or performance.

Transcript

  1. Presenter bio Duarte is a Software Engineer working on Scylla.

    He has a background in concurrent programming, distributed systems and low-latency software. Prior to ScyllaDB, he worked on distributed network virtualization.
  2. ScyllaDB ▪ Clustered NoSQL database compatible with Apache Cassandra ▪

    ~10X throughput on the same hardware ▪ Low latency, esp. at the higher percentiles ▪ Self-tuning ▪ Mechanically sympathetic C++17
  3. Data placement ▪ Masterless ▪ Data is replicated across a

    set of replicas ▪ Data is partitioned across all nodes ▪ An operation can specify a Consistency Level
  4. Data model [table diagram: a partition (Partition Key1) holding rows

    with Clustering Key1, Clustering Key2, ..., sorted by primary key] CREATE TABLE playlists (id int, song_id int, title text, PRIMARY KEY (id, song_id)); INSERT INTO playlists (id, song_id, title) VALUES (62, 209466, 'Ænima');
  5. Seastar - http://seastar.io ▪ Thread-per-core design • No blocking, ever.

    ▪ Asynchronous networking, file I/O, multicore • Future/promise based APIs
  6. Thread-per-core ▪ Request path is replicated per CPU ▪ Internal

    sharding by partition key • Like having multiple nodes on the same machine ▪ Uses middle bits to avoid aliasing a node with a shard
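
    A minimal sketch of the idea (illustrative only, not Scylla's actual sharding code): node placement is driven by the topmost token bits, so deriving the shard from bits further down keeps the two assignments independent. The constant ignore_msb_bits and the function shard_of are hypothetical names; the code assumes a compiler with the __int128 extension (GCC/Clang).

        #include <cstdint>
        #include <cstdio>

        // Hypothetical: how many high bits to skip because they already
        // drive node placement (vnode ranges).
        constexpr unsigned ignore_msb_bits = 12;

        // Map a 64-bit partition token to one of shard_count shards using
        // the bits below the ones that determine node ownership.
        uint32_t shard_of(uint64_t token, uint32_t shard_count) {
            uint64_t middle = token << ignore_msb_bits;   // drop the high bits
            // Scale the remaining bits onto [0, shard_count) with a multiply-shift.
            return static_cast<uint32_t>(
                (static_cast<unsigned __int128>(middle) * shard_count) >> 64);
        }

        int main() {
            // Two tokens sharing their high bits (same owner in this toy model)
            // still land on different shards: prints "2 7".
            std::printf("%u %u\n", shard_of(0x1234'5678'9abc'def0ULL, 8),
                                   shard_of(0x123F'0000'0000'0000ULL, 8));
        }
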
  7. Strong consistency ▪ Strong guarantees enable more use cases •

    Uniqueness constraints • Read-modify-write accesses
  8. Strong consistency ▪ Strong guarantees enable more use cases •

    Uniqueness constraints • Read-modify-write accesses • All-or-nothing writes
  9. Strong consistency ▪ Strong guarantees enable more use cases •

    Uniqueness constraints • Read-modify-write accesses • All-or-nothing writes ▪ Opt-in, due to inherent performance costs
  10. Lightweight Transactions (LWT) ▪ Per-partition strong consistency ▪ Essentially, a

    distributed compare-and-swap • INSERT ... IF NOT EXISTS • DELETE ... IF EXISTS • DELETE ... IF col_a = ? AND col_b = ? • UPDATE ... IF col_a = ? AND col_b = ?
  11. Lightweight Transactions (LWT) ▪ Per-partition strong consistency ▪ Essentially, a

    distributed compare-and-swap • INSERT ... IF NOT EXISTS • DELETE ... IF EXISTS • DELETE ... IF col_a = ? AND col_b = ? • UPDATE ... IF col_a = ? AND col_b = ? ▪ Fast-path operation warranting high performance and availability
  12. Lightweight Transactions (LWT) ▪ Per-partition strong consistency ▪ Essentially, a

    distributed compare-and-swap • INSERT ... IF NOT EXISTS • DELETE ... IF EXISTS • DELETE ... IF col_a = ? AND col_b = ? • UPDATE ... IF col_a = ? AND col_b = ? ▪ Fast-path operation warranting high performance and availability ▪ Requires internal read-before-write
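
    A toy, single-node stand-in for the compare-and-swap shape of an LWT such as UPDATE ... IF col = ? (names and layout are illustrative, not Scylla's API). In the real system the same read-check-apply sequence runs as a consensus round across the partition's replicas; here it only shows why a read-before-write is unavoidable.

        #include <iostream>
        #include <map>
        #include <string>

        std::map<int, std::string> titles;   // partition key -> title

        bool conditional_update(int key, const std::string& expected,
                                const std::string& desired) {
            auto it = titles.find(key);              // read-before-write
            if (it == titles.end() || it->second != expected) {
                return false;                        // condition failed: not applied
            }
            it->second = desired;                    // condition held: apply
            return true;
        }

        int main() {
            titles[62] = "Ænima";
            std::cout << conditional_update(62, "Lateralus", "Eulogy") << '\n';  // prints 0
            std::cout << conditional_update(62, "Ænima", "Eulogy") << '\n';      // prints 1
        }
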
  13. Definition “The process by which we reach agreement over system

    state between unreliable machines connected by asynchronous networks”
  14. Consensus Protocols ▪ Strong consistency guarantees about the underlying data

    ▪ Leveraged to implement Replicated State Machines • A set of replicas working together as a coherent unit
  15. Consensus Protocols ▪ Strong consistency guarantees about the underlying data

    ▪ Leveraged to implement Replicated State Machines • A set of replicas working together as a coherent unit ▪ Tolerate non-Byzantine failures • 2F + 1 nodes to tolerate F failures
  16. Consensus Protocols ▪ Strong consistency guarantees about the underlying data

    ▪ Leveraged to implement Replicated State Machines • A set of replicas working together as a coherent unit ▪ Tolerate non-Byzantine failures • 2F + 1 nodes to tolerate F failures ▪ A consensus protocol round advances the underlying state
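
    For example, a five-node group (2F + 1 with F = 2) keeps making progress with any two nodes down, since the remaining three still form a majority; tolerating the same two failures with fewer than five nodes would leave no guaranteed majority.
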
  17. ▪ Stability • If a value is decided at a

    replica p, it remains decided forever ▪ Agreement • No two replicas should decide differently ▪ Validity • If a value is decided, this value must have been proposed by at least one of the replicas ▪ Termination • Eventually, a decision is reached on all correct replicas Guarantees
  18. Paxos Made Live “There are significant gaps between the description

    of the Paxos algorithm and the needs of a real-world system. In order to build a real-world system, an expert needs to use numerous ideas scattered in the literature and make several relatively small protocol extensions. The cumulative effort will be substantial and the final system will be based on an unproven protocol.” By Google, when building Chubby using Multi-Paxos and SMR
  19. Design Space ▪ Understandability ▪ Proven in successful, well-known usages

    ▪ Latency, RTTs to agreement ▪ General performance (e.g., batching opportunities)
  20. Leader vs RTTs ▪ Any node can decide a value

    • At least 2 RTTs • Classical Paxos, CASPaxos
  21. Leader vs RTTs ▪ Any node can decide a value

    • At least 2 RTTs • Classical Paxos, CASPaxos ▪ Leader election • 1 RTT, but leader can limit throughput • Multi-Paxos, Raft, Zab
  22. Challenges ▪ Dealing with limited storage capacity ▪ Effectively handling

    read-only requests ▪ Dynamic membership and cluster reconfiguration ▪ Multi-key transaction support ▪ Acceptable performance over the WAN ▪ Formal and empirical validation of its safety
  23. Raft ▪ Focused on understandability ▪ Widely used ▪ Amenable

    to nice optimizations ▪ Strong leadership ▪ For LWT, leader does read-before-write ▪ Log easily compacted in our case
  24. Raft guarantees (1/2) ▪ Election Safety • At most one

    leader can be elected in a given term ▪ Leader Append-Only • A leader only appends new entries to its log ▪ Log Matching • If two logs contain an entry with the same index and term, then the logs are identical in all entries up through the given index
  25. Raft guarantees (2/2) ▪ Leader Completeness • If a log

    entry is committed in a given term, then that entry will be present in the logs of the leaders for all higher-numbered terms ▪ State Machine Safety • If a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index
  26. Design ▪ A Scylla node participates in more than one

    group • A natural consequence of how data is partitioned
  27. Design ▪ A Scylla node participates in more than one

    group • A natural consequence of how data is partitioned • Increased concurrency
  28. Design ▪ A Scylla node participates in more than one

    group • A natural consequence of how data is partitioned • Increased concurrency • Leader failures affect only a subset of operations*
  29. Design ▪ A Scylla node participates in more than one

    group • A natural consequence of how data is partitioned • Increased concurrency • Leader failures affect only a subset of operations* ▪ Each group on a node is itself sharded
  30. Design ▪ A Scylla node participates in more than one

    group • A natural consequence of how data is partitioned • Increased concurrency • Leader failures affect only a subset of operations* ▪ Each group on a node is itself sharded • A shard handles a subset of the operations of the group
  31. Design ▪ A Scylla node participates in more than one

    group • A natural consequence of how data is partitioned • Increased concurrency • Leader failures affect only a subset of operations* ▪ Each group on a node is itself sharded • A shard handles a subset of the operations of the group • Impacts how logs are organized
  32. Organization [diagram: Group 0 through Group N - 1, each with its own Log, Core Raft, Database State, and RPC components]
  33. Organization [diagram: the same per-group stack (Log, Core Raft, Database State, RPC), shown for each shard, Shard 0 through Shard N - 1]
  34. Organization [diagram: the same per-shard, per-group organization, with heartbeats added]
  35. Write path (simplified), on Node N, Shard S: 1. lock() the affected

    cells via cell_locker, obtaining locked_cell[] 2. query() the database for the current values 3. Match the restrictions 4. Append the mutation to the Log if matched 5. Replicate to a majority over RPC 6. apply() the mutation to the database 7. Release the locks
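
    A pseudocode-style rendering of that sequence in plain C++; every function below is a stub standing in for the corresponding numbered step, not an actual Scylla or Seastar API (which is asynchronous and future-based):

        #include <stdexcept>
        #include <vector>

        struct mutation {};  struct restriction {};  struct locked_cell {};

        // Stubs for the numbered steps above.
        std::vector<locked_cell> lock_cells(const mutation&) { return {}; }   // 1. cell_locker
        mutation query_current(const mutation&) { return {}; }                // 2. read-before-write
        bool matches(const mutation&, const std::vector<restriction>&) { return true; }  // 3. match
        void append_to_log(const mutation&) {}                                // 4. append if matched
        bool replicate_to_majority(const mutation&) { return true; }          // 5. Raft replication
        void apply_to_database(const mutation&) {}                            // 6. apply()

        bool lwt_write(const mutation& m, const std::vector<restriction>& rs) {
            [[maybe_unused]] auto locks = lock_cells(m);   // held for the whole operation
            auto current = query_current(m);
            if (!matches(current, rs)) {
                return false;                  // condition failed; nothing is replicated
            }
            append_to_log(m);
            if (!replicate_to_majority(m)) {
                throw std::runtime_error("lost quorum");   // caller retries or times out
            }
            apply_to_database(m);
            return true;                       // 7. locks released on return
        }

        int main() { return lwt_write(mutation{}, {}) ? 0 : 1; }
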
  36. Sharding ▪ State explosion if groups are per-shard ▪ (term,

    leader shard, index) tuple ▪ Heterogeneous cluster ▪ Resharding
  37. Homogeneous nodes ▪ All data is correctly partitioned across the cluster [diagram: leader shard 0 replicating over RPC to follower shard 0]
  38. Heterogeneous nodes (1) ▪ Entries are not totally ordered at the follower [diagram: a leader with shards 0 and 1 replicating over RPC to a follower with only shard 0]
  39. Heterogeneous nodes (2) ▪ Follower logs will contain gaps [diagram: a leader with shard 0 replicating over RPC to a follower with shards 0 and 1]
  40. Sharding solutions ▪ Organize logs by term and leader shard

    • For each term, the leader shard count is stable
  41. Sharding solutions ▪ Organize logs by term and leader shard

    • For each term, the leader shard count is stable ▪ Leader restart ends the term
  42. Sharding solutions ▪ Organize logs by term and leader shard

    • For each term, the leader shard count is stable ▪ Leader restart ends the term ▪ A log is stored for each leader shard • May require synchronization at followers with different shard count
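
    A sketch of how such per-(term, leader shard) logs might be keyed; the types and names are illustrative, not Scylla's actual data layout:

        #include <cstdint>
        #include <map>
        #include <string>
        #include <tuple>
        #include <utility>
        #include <vector>

        // Entries are addressed by (term, leader shard, index); a follower
        // keeps one log per (term, leader shard) pair, so it never has to
        // impose a single total order on entries from different leader shards.
        struct entry_id {
            uint64_t term;            // a leader restart ends the term
            uint32_t leader_shard;    // shard on the leader that produced the entry
            uint64_t index;           // position within that (term, shard) log

            bool operator<(const entry_id& o) const {
                return std::tie(term, leader_shard, index) <
                       std::tie(o.term, o.leader_shard, o.index);
            }
        };

        using shard_log = std::vector<std::string>;   // entry payloads
        using follower_logs = std::map<std::pair<uint64_t, uint32_t>, shard_log>;

        int main() {
            follower_logs logs;
            logs[{7, 0}].push_back("entry from term 7, leader shard 0");
            logs[{7, 1}].push_back("entry from term 7, leader shard 1");
            return logs.size() == 2 ? 0 : 1;
        }
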
  43. Log compaction ▪ Log entries are applied to the state

    machine when committed • A log entry is committed when replicated to a majority of nodes • The commit index of an entry is per shard
  44. Log compaction ▪ Log entries are applied to the state

    machine when committed • A log entry is committed when replicated to a majority of nodes • The commit index of an entry is per shard ▪ Committed entries can be discarded • Responsibility for a prefix of the log is transferred to the database
  45. Log compaction ▪ Log entries are applied to the state

    machine when committed • A log entry is committed when replicated to a majority of nodes • The commit index of an entry is per shard ▪ Committed entries can be discarded • Responsibility for a prefix of the log is transferred to the database ▪ Leverage repair • Benign overlap between log entries and database state • How many committed log segments to keep around?
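
    A small sketch of that compaction rule (illustrative only; the keep_committed knob and the container choice are assumptions, not Scylla's implementation):

        #include <cstdint>
        #include <deque>
        #include <string>

        struct log_entry { uint64_t index; std::string payload; };

        struct raft_log {
            std::deque<log_entry> entries;   // oldest entry at the front
            uint64_t applied_index = 0;      // last index durably applied to the database

            // Drop the applied prefix, but keep the last keep_committed applied
            // entries around so log replication and repair overlap benignly.
            void compact(uint64_t keep_committed) {
                while (!entries.empty() &&
                       entries.front().index + keep_committed <= applied_index) {
                    entries.pop_front();     // this prefix now lives only in the database
                }
            }
        };

        int main() {
            raft_log log;
            for (uint64_t i = 1; i <= 10; ++i) log.entries.push_back({i, "m"});
            log.applied_index = 8;
            log.compact(2);                  // keeps indexes 7..10
            return log.entries.front().index == 7 ? 0 : 1;
        }
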
  46. Membership changes ▪ For the set of keyspaces K, each

    node will be a member of Σ_{ks ∈ K} RF(ks) × vnodes groups • TL;DR: potentially many groups ▪ Raft restricts the types of changes that are allowed • Only one server can be added or removed from a group at a time • Complex changes implemented as a series of single-server changes • Unless the RF is being changed, group member count is constant
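
    To put a number on "potentially many groups" (illustrative figures, not from the talk): with 256 vnodes per node and two keyspaces at RF = 3, each node would sit in roughly 2 × 3 × 256 = 1,536 Raft groups.
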
  47. Bootstrap + Raft (1) ▪ The ranges for the Raft

    group of which B is the primary replica have changed ▪ Node D introduces a new Raft group, of which D, B and C are members ▪ Node D joins the Raft group of which A is the primary replica ▪ Node C leaves the Raft group of which A is the primary replica
  48. Bootstrap + Raft (2) ▪ Need to wait until the

    new node has joined all groups • Then the other nodes can exit groups ▪ Doing a single operation at a time ensures majorities overlap ▪ Configurations are special log entries ▪ New nodes become non-voting members until they are caught up
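
    Why one-at-a-time changes keep majorities overlapping: adding a single server to a group of N means any majority of the old configuration has at least ⌊N/2⌋ + 1 members and any majority of the new one has at least ⌊(N+1)/2⌋ + 1, and (⌊N/2⌋ + 1) + (⌊(N+1)/2⌋ + 1) = N + 2 > N + 1, so the two majorities always share a member and cannot make conflicting decisions.
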
  49. Another LWT constraint ▪ LOCAL_SERIAL consistency level • Precludes cross-DC

    groups • Each DC has its own group for a token range • Need agreement between two groups ▪ More up-front work, but worth it later • Leveraged for multi-partition transactions
  50. Concurrent schema changes ▪ Schema changes are carried out locally

    and then propagated throughout the cluster ▪ No protection against concurrent schema changes • Different IDs even if the schema is the same • Cluster-wide order of changes is not enforced
  51. Distributed schema tables CREATE TABLE ks.t ( p int, x

    my_type, PRIMARY KEY (p)); DROP TYPE ks.my_type;
  52. Distributed schema tables CREATE TABLE ks.t ( p int, x

    my_type, PRIMARY KEY (p)); DROP TYPE ks.my_type;
  53. Range movements ▪ Concurrent range operations can select overlapping token

    ranges ▪ Token selection is not optimal ▪ Can’t use a partitioning approach
  54. Range movements ▪ Concurrent range operations can select overlapping token

    ranges ▪ Token selection is not optimal ▪ Can’t use a partitioning approach • Need to centralize token selection
  55. Consistent materialized views ▪ Views are updated asynchronously • Eventually

    consistent • Preserves base replica availability • Issues with consistency, flow control
  56. Consistent materialized views ▪ Views are updated asynchronously • Eventually

    consistent • Preserves base replica availability • Issues with consistency, flow control ▪ Can leverage multi-key transactions • Base and view are both updated, or neither is • Strongly consistent