Slide 1

Slide 1 text

Consistency and Riak Christopher Meiklejohn Riak Meetup, Paris, February 2015 @cmeik

Slide 2

Slide 2 text

History

Slide 3

Slide 3 text

Published SOSP 2007; key-value storage system Amazon Dynamo

Slide 4

Slide 4 text

Focused on high-availability and low-latency Amazon Dynamo

Slide 5

Slide 5 text

Collection of distributed systems techniques Amazon Dynamo

Slide 6

Slide 6 text

LinkedIn Voldemort, Facebook Cassandra Amazon Dynamo

Slide 7

Slide 7 text

Released 2009; Apache2 licensed Dynamo clone Basho Riak

Slide 8

Slide 8 text

Riak Architecture

Slide 9

Slide 9 text

Consistent Hashing hash(bucket/key)

Slide 10

Slide 10 text

hash ring

Slide 11

Slide 11 text

tokenize it

Slide 12

Slide 12 text

node 0 node 1 node 2 hash(key)

Slide 13

Slide 13 text

node 0 node 1 node 2 Replicas are stored to the N - 1 contiguous partitions

Slide 14

Slide 14 text

node 0 node 1 node 2 hash(companies/cisco) Replicas are stored to the N - 1 contiguous partitions

Slide 15

Slide 15 text

node 0 node 1 node 2 hash(companies/cisco) Replicas are stored to the N - 1 contiguous partitions
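
A minimal sketch of the placement described on the preceding slides, assuming a ring of 16 partitions claimed round-robin by three nodes. The names `NUM_PARTITIONS`, `partition_for`, `owner`, and `preference_list` are illustrative, and hashing a plain `bucket/key` string is a simplification rather than Riak's actual encoding.

```python
import hashlib

RING_SIZE = 2 ** 160   # SHA-1 output space
NUM_PARTITIONS = 16    # hypothetical ring size
NODES = ["node0", "node1", "node2"]


def partition_for(bucket, key):
    """Map hash(bucket/key) to the index of the partition covering that point."""
    digest = hashlib.sha1(f"{bucket}/{key}".encode()).digest()
    point = int.from_bytes(digest, "big")
    return point * NUM_PARTITIONS // RING_SIZE


def owner(partition):
    """Assumed round-robin claim: partition i belongs to node i mod 3."""
    return NODES[partition % len(NODES)]


def preference_list(bucket, key, n=3):
    """The first partition plus the next N - 1 contiguous partitions."""
    first = partition_for(bucket, key)
    return [(first + i) % NUM_PARTITIONS for i in range(n)]
```

For example, `preference_list("companies", "cisco")` returns three consecutive partition indexes, wrapping around the end of the ring when needed.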

Slide 16

Slide 16 text

node 0 node 1 node 2

Slide 17

Slide 17 text

Scaling out node 0 node 1 node 2 node 3 +

Slide 18

Slide 18 text

Quorum requests N R W PR/PW DW
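
As a rough illustration of the quorum parameters: N is the number of replicas, R and W are the replies required for a read or write to succeed, PR/PW count only primary replicas, and DW counts durable writes. The helper names below are hypothetical; the sketch shows only the counting rule, not Riak's request path.

```python
def quorum_met(replies, required):
    """A request succeeds once `required` replicas have answered.

    `replies` holds one entry per replica; None means no answer yet.
    """
    return sum(reply is not None for reply in replies) >= required


def quorums_overlap(n, r, w):
    """Read and write quorums share at least one replica when R + W > N."""
    return r + w > n
```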

Slide 19

Slide 19 text

Vector Clocks establish temporality
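
A small sketch of how vector clocks order events, assuming a clock is a dictionary from actor name to counter. The function names are illustrative and not taken from Riak's code.

```python
def increment(clock, actor):
    """Return a copy of the clock with the actor's counter advanced by one."""
    new = dict(clock)
    new[actor] = new.get(actor, 0) + 1
    return new


def descends(a, b):
    """True if clock a has seen every event recorded in clock b."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())


def compare(a, b):
    """Classify a relative to b: equal, after, before, or concurrent."""
    a_sees_b, b_sees_a = descends(a, b), descends(b, a)
    if a_sees_b and b_sees_a:
        return "equal"
    if a_sees_b:
        return "after"
    if b_sees_a:
        return "before"
    return "concurrent"


def merge(a, b):
    """Pointwise maximum: the smallest clock that descends both inputs."""
    return {actor: max(a.get(actor, 0), b.get(actor, 0)) for actor in a.keys() | b.keys()}
```

Two clocks that compare as "concurrent" are the case in which neither write can be ordered before the other.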

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

Anatomy of a Request get(users/clay-davis)

Slide 23

Slide 23 text

Anatomy of a Request get(users/clay-davis) client Riak

Slide 24

Slide 24 text

Anatomy of a Request get(users/clay-davis) Get Handler (FSM) client Riak

Slide 25

Slide 25 text

Anatomy of a Request get(users/clay-davis) Get Handler (FSM) client Riak hash(users/clay-davis) == 10, 11, 12

Slide 26

Slide 26 text

Anatomy of a Request get(users/clay-davis) Get Handler (FSM) client Riak hash(users/clay-davis) == 10, 11, 12 Coordinating node Cluster 6 7 8 9 10 11 12 13 14 15 16 The Ring

Slide 27

Slide 27 text

Anatomy of a Request get(users/clay-davis) Get Handler (FSM) client Riak get(users/clay-davis) Coordinating node Cluster 6 7 8 9 10 11 12 13 14 15 16 The Ring

Slide 28

Slide 28 text

Anatomy of a Request get(users/clay-davis) Get Handler (FSM) client Riak Coordinating node Cluster 6 7 8 9 10 11 12 13 14 15 16 The Ring R=2

Slide 29

Slide 29 text

Anatomy of a Request get(users/clay-davis) Get Handler (FSM) client Riak Coordinating node Cluster 6 7 8 9 10 11 12 13 14 15 16 The Ring R=2 obj

Slide 30

Slide 30 text

Anatomy of a Request get(users/clay-davis) Get Handler (FSM) client Riak R=2 obj obj

Slide 31

Slide 31 text

Anatomy of a Request get(users/clay-davis) Get Handler (FSM) client Riak R=2 obj obj

Slide 32

Slide 32 text

Anatomy of a Request get(users/clay-davis) obj

Slide 33

Slide 33 text

Read Repair (Anti-Entropy)

Slide 34

Slide 34 text

replica replica replica

Slide 35

Slide 35 text

replica replica replica X

Slide 36

Slide 36 text

replica replica replica replica replica replica

Slide 37

Slide 37 text

Active Anti-Entropy (self healing clusters)

Slide 38

Slide 38 text

real-time updates persistent non-blocking disk-based

Slide 39

Slide 39 text

Merkle tree to track changes; coordinated at the vnode level; runs as a background process; exchange with neighbor vnodes for inconsistencies; resolution semantics: trigger read-repair
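
The exchange step can be sketched as a two-level hash tree: compare root hashes first, then descend only into segments that differ, and collect the keys that need read-repair. This is a simplified in-memory model with a hypothetical `SEGMENTS` count; it is not how Riak's persistent, disk-based trees are laid out.

```python
import hashlib

SEGMENTS = 4  # hypothetical number of leaf segments


def h(data):
    return hashlib.sha256(data).hexdigest()


def segment_of(key):
    """Assign each key to a leaf segment by hashing the key."""
    return int(h(key.encode()), 16) % SEGMENTS


def build_tree(store):
    """Hash each value, group the hashes by segment, then hash the segments."""
    segments = [dict() for _ in range(SEGMENTS)]
    for key, value in store.items():
        segments[segment_of(key)][key] = h(value.encode())
    leaves = [h(repr(sorted(seg.items())).encode()) for seg in segments]
    root = h("".join(leaves).encode())
    return {"root": root, "leaves": leaves, "segments": segments}


def keys_to_repair(tree_a, tree_b):
    """Return keys whose hashes differ, visiting only mismatched segments."""
    if tree_a["root"] == tree_b["root"]:
        return set()
    differing = set()
    for i in range(SEGMENTS):
        if tree_a["leaves"][i] == tree_b["leaves"][i]:
            continue
        seg_a, seg_b = tree_a["segments"][i], tree_b["segments"][i]
        for key in seg_a.keys() | seg_b.keys():
            if seg_a.get(key) != seg_b.get(key):
                differing.add(key)
    return differing
```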

Slide 40

Slide 40 text

= hashes marked dirty

Slide 41

Slide 41 text

No content

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

No content

Slide 45

Slide 45 text

= keys to read-repair

Slide 46

Slide 46 text

Riak and Consistency

Slide 47

Slide 47 text

Riak Object

Slide 48

Slide 48 text

BKey Value

Slide 49

Slide 49 text

Consistent hashing; dynamic membership Data Placement

Slide 50

Slide 50 text

No content

Slide 51

Slide 51 text

No content

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

Replication per-value across ring Data Placement

Slide 54

Slide 54 text

Replica Replica Replica

Slide 55

Slide 55 text

Take the form: {Writer, Value, Time} Concurrent writes

Slide 56

Slide 56 text

[{a, v1, t1}] [{b, v1, t2}] [{a, v1, t1}] Concurrent writes

Slide 57

Slide 57 text

[{a, v1, t1}] [{b, v1, t2}] [{a, v1, t1}] [{b, v1, t2}] [{b, v1, t2}] [{b, v1, t2}] Last Writer Wins

Slide 58

Slide 58 text

[{a, v1, t1}] [{b, v1, t2}] [{a, v1, t1}] [{a, v1, t1}, {b, v1, t2}] [{a, v1, t1}, {b, v1, t2}] [{a, v1, t1}, {b, v1, t2}] Allow Mult

Slide 59

Slide 59 text

User-specified Merge
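
The three outcomes for concurrent writes can be compared side by side. This is a sketch that models each write as a `{Writer, Value, Time}` triple; the set-valued payloads and the union-based `user_merge` are assumptions chosen for illustration, since the merge is whatever the application supplies.

```python
from typing import NamedTuple


class Write(NamedTuple):
    writer: str
    value: frozenset
    time: int


def last_writer_wins(writes):
    """Keep only the write with the highest timestamp; the others are dropped."""
    return [max(writes, key=lambda w: w.time)]


def allow_mult(writes):
    """Keep every distinct concurrent write as a sibling, in a stable order."""
    return sorted(set(writes), key=lambda w: (w.time, w.writer))


def user_merge(siblings):
    """Hypothetical application merge: union the sibling values."""
    return frozenset().union(*(w.value for w in siblings))


a = Write("a", frozenset({"milk"}), 1)
b = Write("b", frozenset({"eggs"}), 2)
```

With these two writes, last-writer-wins keeps only `b`, while allow-mult returns both siblings and leaves the union to the application.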

Slide 60

Slide 60 text

Two Approaches

Slide 61

Slide 61 text

Strong Eventual Consistency

Slide 62

Slide 62 text

Designed for convergence; allows divergence Conflict-free Replicated Data Types

Slide 63

Slide 63 text

Solves the Dynamo concurrency anomaly Conflict-free Replicated Data Types

Slide 64

Slide 64 text

The Theory

Slide 65

Slide 65 text

No content

Slide 66

Slide 66 text

No content

Slide 67

Slide 67 text

No content

Slide 68

Slide 68 text

No content

Slide 69

Slide 69 text

Two flavors: state-based and operation-based Conflict-free Replicated Data Types

Slide 70

Slide 70 text

Counters, Flags, Registers, Sets, Maps, Graphs Conflict-free Replicated Data Types

Slide 71

Slide 71 text

Broadcast update operation Operation-Based CRDTs

Slide 72

Slide 72 text

Commutative; relies on unique delivery Operation-Based CRDTs
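
A sketch of an operation-based counter, assuming each broadcast operation carries a unique identifier. The `seen` set stands in for the delivery layer's guarantee that each operation is applied exactly once; the class and field names are illustrative.

```python
class OpCounter:
    def __init__(self):
        self.value = 0
        self.seen = set()  # ids of operations already applied

    def apply(self, op_id, delta):
        """Apply an increment or decrement; duplicates of an op are ignored."""
        if op_id in self.seen:
            return
        self.seen.add(op_id)
        self.value += delta
```

Because addition commutes, replicas that receive the same operations in different orders reach the same value.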

Slide 73

Slide 73 text

Apply change locally; propagate entire state State-Based CRDTs

Slide 74

Slide 74 text

State is merged between replicas State-Based CRDTs

Slide 75

Slide 75 text

Set of all states form a bounded join-semilattice State-Based CRDTs

Slide 76

Slide 76 text

Partially ordered set; join operation Bounded Join-Semilattice

Slide 77

Slide 77 text

Associativity: (X · Y) · Z = X · (Y · Z) Bounded Join-Semilattice

Slide 78

Slide 78 text

Commutativity: X · Y = Y · X Bounded Join-Semilattice

Slide 79

Slide 79 text

Idempotence: X · X = X Bounded Join-Semilattice

Slide 80

Slide 80 text

Examples Bounded Join-Semilattice

Slide 81

Slide 81 text

b a c a, b a, c a, b, c Set; merge function: union. b, c

Slide 82

Slide 82 text

3 5 7 5 7 7 Increasing natural; merge function: max.

Slide 83

Slide 83 text

F F T F T T Booleans; merge function: or.
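
The three laws can be checked mechanically for each of the example merge functions. This is a brute-force sketch over small sample sets, not a proof; `is_join` is a hypothetical helper name.

```python
from itertools import product


def is_join(join, samples):
    """Check associativity, commutativity, and idempotence over the samples."""
    for x, y, z in product(samples, repeat=3):
        if join(join(x, y), z) != join(x, join(y, z)):
            return False
        if join(x, y) != join(y, x):
            return False
    return all(join(x, x) == x for x in samples)


sets = [frozenset(), frozenset("a"), frozenset("ab"), frozenset("c")]
union_ok = is_join(lambda x, y: x | y, sets)
max_ok = is_join(max, [3, 5, 7])
or_ok = is_join(lambda x, y: x or y, [False, True])
sum_ok = is_join(lambda x, y: x + y, [1, 2, 3])  # addition is not idempotent
```

Addition fails the check because x + x differs from x, which is why a state-based counter cannot simply sum replica states.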

Slide 84

Slide 84 text

Monotone: x <= y implies f(x) <= f(y)

Slide 85

Slide 85 text

Examples State-Based Observed-Remove Set

Slide 86

Slide 86 text

[ [{1, a}], [] ] [ [{1, a}], [] ]

Slide 87

Slide 87 text

[ [{1, a}], [] ] [ [{1, a}], [] ] [ [{1, a}], [{1, a}] ] [ [{1, a}], [{1, a}] ]

Slide 88

Slide 88 text

[ [{1, a}], [] ] [ [{1, a}], [] ] [ [{1, a}], [{1, a}] ] [ [{1, a}], [{1, a}] ] [ [{1, a}, {2, a}], [{1, a}] ]

Slide 89

Slide 89 text

[ [{1, a}], [] ] [ [{1, a}], [] ] [ [{1, a}], [{1, a}] ] [ [{1, a}], [{1, a}] ] [ [{1, a}, {2, a}], [{1, a}] ] [ [{1, a}, {2, a}], [{1, a}] ]

Slide 90

Slide 90 text

[ [{1, a}], [] ] [ [{1, a}], [] ]

Slide 91

Slide 91 text

[ [{1, a}], [] ] [ [{1, a}], [] ] [ [{1, a}, {2, b}], [] ]

Slide 92

Slide 92 text

[ [{1, a}], [] ] [ [{1, a}], [] ] [ [{1, a}, {2, b}], [] ] [ [{1, a}], [{1, a}] ]

Slide 93

Slide 93 text

[ [{1, a}], [] ] [ [{1, a}], [] ] [ [{1, a}, {2, b}], [] ] [ [{1, a}], [{1, a}] ] [ [{1, a}, {2, b}], [{1, a}] ]
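
The two walkthroughs above can be replayed with a small state-based observed-remove set. This is a sketch that mirrors the `[Adds, Removes]` layout on the slides using pairs of `(tag, element)`; it assumes the caller supplies globally unique tags, and the class is illustrative rather than Riak's implementation.

```python
class ORSet:
    def __init__(self, adds=frozenset(), removes=frozenset()):
        self.adds = frozenset(adds)
        self.removes = frozenset(removes)

    def add(self, tag, element):
        """Record the element under a fresh unique tag."""
        return ORSet(self.adds | {(tag, element)}, self.removes)

    def remove(self, element):
        """Remove only the tagged adds of this element observed so far."""
        observed = {(t, e) for (t, e) in self.adds if e == element}
        return ORSet(self.adds, self.removes | observed)

    def merge(self, other):
        """Join of two states: union of adds and union of removes."""
        return ORSet(self.adds | other.adds, self.removes | other.removes)

    def value(self):
        """Elements with at least one add that has not been removed."""
        return {e for (t, e) in self.adds - self.removes}


base = ORSet().add(1, "a")
left = base.add(2, "b")    # one replica adds b
right = base.remove("a")   # the other replica removes a
merged = left.merge(right)
```

Merging in either order gives the same state, and re-adding a removed element under a new tag makes it visible again.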

Slide 94

Slide 94 text

Strong Consistency

Slide 95

Slide 95 text

Provides atomicity and recency Strong Consistency

Slide 96

Slide 96 text

Prohibits partial writes Strong Consistency

Slide 97

Slide 97 text

A A A

Slide 98

Slide 98 text

A A A Val = B

Slide 99

Slide 99 text

A A A Val = B

Slide 100

Slide 100 text

B A A

Slide 101

Slide 101 text

B A A Get Operation with Read Repair

Slide 102

Slide 102 text

B A A Get Operation with Read Repair

Slide 103

Slide 103 text

B A A Get Operation with Read Repair B B

Slide 104

Slide 104 text

Single key atomic operations Strong Consistency

Slide 105

Slide 105 text

Requires read/modify/write cycle (CAS) Strong Consistency
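
A sketch of the read/modify/write cycle against a hypothetical in-memory store that versions each key. `VersionedStore`, `put_if`, and `update` are made-up names for illustration and are not part of any Riak client API.

```python
class VersionedStore:
    def __init__(self):
        self._data = {}

    def get(self, key):
        """Return (version, value); a missing key reads as version 0."""
        return self._data.get(key, (0, None))

    def put_if(self, key, expected_version, value):
        """Compare-and-swap: write only if the version is still the one read."""
        version, _ = self.get(key)
        if version != expected_version:
            return False
        self._data[key] = (version + 1, value)
        return True


def update(store, key, modify, max_retries=10):
    """Read, modify, and conditionally write, retrying if another writer won."""
    for _ in range(max_retries):
        version, value = store.get(key)
        if store.put_if(key, version, modify(value)):
            return True
    return False
```

A write based on a stale version is rejected, so the caller has to re-read before trying again.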

Slide 106

Slide 106 text

Consensus

Slide 107

Slide 107 text

Distributed Consensus The problem of reaching agreement among remote processes is one of the most fundamental problems in distributed computing and is at the core of many algorithms for distributed data processing, distributed file management, and fault-tolerant distributed applications. Fischer, Lynch, Paterson

Slide 108

Slide 108 text

Termination, agreement, validity The Consensus Problem

Slide 109

Slide 109 text

All processes eventually decide on a value Termination

Slide 110

Slide 110 text

All processes decide on the same value Agreement

Slide 111

Slide 111 text

Value decided on had to have been proposed Validity

Slide 112

Slide 112 text

Consensus Algorithms

Slide 113

Slide 113 text

Paxos, ZAB, Raft, etc. Consensus Algorithms

Slide 114

Slide 114 text

Coordinated requests with a chosen leader The Paxos Algorithm

Slide 115

Slide 115 text

Node 1 Node 2 Node 3 N++ prepare(N) promise(N, Vb) promise(N, Vc) Vn = f(Va, Vb, Vc) commit(N, Vn) accept(N)
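
The message flow on this slide can be sketched as single-decree Paxos with in-process acceptors. This assumes synchronous calls in place of network messages and reuses the slide's `prepare` and `commit` names; the value-selection step plays the role of `f(Va, Vb, Vc)`. It is a teaching sketch, not a fault-tolerant implementation.

```python
class Acceptor:
    def __init__(self):
        self.promised = 0       # highest ballot promised so far
        self.accepted_n = 0     # ballot of the accepted value, 0 if none
        self.accepted_v = None

    def prepare(self, n):
        """Promise to ignore lower ballots; report any value already accepted."""
        if n > self.promised:
            self.promised = n
            return (self.accepted_n, self.accepted_v)
        return None

    def commit(self, n, value):
        """Accept the value unless a higher ballot has been promised."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_v = value
            return True
        return False


def propose(acceptors, n, value):
    """Run both phases for ballot n; return the chosen value or None."""
    majority = len(acceptors) // 2 + 1
    promises = [p for p in (a.prepare(n) for a in acceptors) if p is not None]
    if len(promises) < majority:
        return None
    # Adopt the value of the highest-numbered prior acceptance, if any.
    prior_n, prior_v = max(promises, key=lambda p: p[0])
    chosen = prior_v if prior_n > 0 else value
    acks = sum(1 for a in acceptors if a.commit(n, chosen))
    return chosen if acks >= majority else None
```

Once a value has been chosen, a later proposer with a higher ballot learns it during the prepare phase and proposes it again instead of its own.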

Slide 116

Slide 116 text

First request Multi-Paxos

Slide 117

Slide 117 text

Node 1 Node 2 Node 3 N++; I = 0 prepare(N, I) promise(N, I, Vb) promise(N, I, Vc) Vn = f(Va, Vb, Vc) commit(N, I, Vn) accept(N, I)

Slide 118

Slide 118 text

Each additional request Multi-Paxos

Slide 119

Slide 119 text

Node 1 Node 2 Node 3 I++ commit(N, I, V) accept(N, I)

Slide 120

Slide 120 text

Ship entire state! Multi-Paxos

Slide 121

Slide 121 text

Riak

Slide 122

Slide 122 text

Key-value store; keys are independent state Riak

Slide 123

Slide 123 text

Multi-Paxos per key; CAS on isolated state Riak

Slide 124

Slide 124 text

Consensus Groups

Slide 125

Slide 125 text

Participants in decisioning; ensembles Consensus Groups

Slide 126

Slide 126 text

Use the preference list! Consensus Groups

Slide 127

Slide 127 text

preflist

Slide 128

Slide 128 text

No content

Slide 129

Slide 129 text

No content

Slide 130

Slide 130 text

No content

Slide 131

Slide 131 text

No content

Slide 132

Slide 132 text

One ensemble per preference list; ring size Consensus Groups

Slide 133

Slide 133 text

Ensembles

Slide 134

Slide 134 text

election of leader; get/put operations Riak Ensembles

Slide 135

Slide 135 text

read local; refresh, if old Get Operations

Slide 136

Slide 136 text

Node 1 Node 2 Node 3 obj.epoch < epoch get(key) reply(Epochb, Seqb, Valb) Val = latest(Vala, Valb, Valc) Val.epoch = epoch write(Epoch, ++Seq, Val) ack(Epoch, Seq) reply(Epochc, Seqc, Valc)

Slide 137

Slide 137 text

Node 1 Node 2 Node 3 obj.epoch == epoch Reply = local_get(Key)

Slide 138

Slide 138 text

Worst Case: 2 roundtrips / read Get Operations Best Case: 0 roundtrips / read
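
The two get paths can be sketched with a leader that holds a local copy and a list of follower stores. This is a simplified model: followers are plain dictionaries, every follower answers, and quorum counting and failures are left out. `Obj` and `Leader` are illustrative names, not riak_ensemble's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    epoch: int
    seq: int
    value: str


class Leader:
    def __init__(self, epoch, seq, local, followers):
        self.epoch = epoch
        self.seq = seq
        self.local = local          # this node's key -> Obj store
        self.followers = followers  # one key -> Obj dict per follower

    def get(self, key):
        """Return (value, roundtrips used)."""
        obj = self.local[key]
        if obj.epoch == self.epoch:
            return obj.value, 0     # best case: answer from local state
        # Roundtrip 1: fetch the followers' copies and pick the latest.
        replies = [follower[key] for follower in self.followers]
        latest = max([obj] + replies, key=lambda o: (o.epoch, o.seq))
        # Roundtrip 2: rewrite the value under the current epoch.
        self.seq += 1
        fresh = Obj(self.epoch, self.seq, latest.value)
        self.local[key] = fresh
        for follower in self.followers:
            follower[key] = fresh
        return fresh.value, 2
```

After the refresh, a second get for the same key is served locally because the stored epoch now matches the leader's.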

Slide 139

Slide 139 text

read local; refresh, modify and commit if old Put Operations

Slide 140

Slide 140 text

Node 1 Node 2 Node 3 obj.epoch < epoch get(key) reply(Epochb, Seqb, Valb) Latest = latest(Vala, Valb, Valc) Val = modify(Latest) write(Epoch, ++Seq, Val) ack(Epoch, Seq) reply(Epochc, Seqc, Valc)

Slide 141

Slide 141 text

Node 1 Node 2 Node 3 obj.epoch == epoch Latest = local_get(Key) Val = modify(Latest) write(Epoch, ++Seq, Val) ack(Epoch, Seq)

Slide 142

Slide 142 text

Worst Case: 2 roundtrips / write Put Operations Best Case: 1 roundtrip / write

Slide 143

Slide 143 text

Elect a new leader; start a new epoch Failed Quorums

Slide 144

Slide 144 text

Cluster Membership

Slide 145

Slide 145 text

Use joint consensus from Multi-Paxos Dynamic Membership

Slide 146

Slide 146 text

Existing Ensemble Joining Ensemble riak_01 riak_02 riak_03 riak_07 riak_08 riak_09 [{riak_01}, {riak_02}, {riak_03}] [{riak_07}, {riak_08}, {riak_09}]

Slide 147

Slide 147 text

Joint-Consensus Ensemble [{riak_01}, {riak_02}, {riak_03}, {riak_07}, {riak_08}, {riak_09}]

Slide 148

Slide 148 text

Joint-Consensus Ensemble [{riak_01}, {riak_02}, {riak_03}, {riak_07}, {riak_08}, {riak_09}]
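
During the joint phase, a decision is usually required to gather a majority in both the existing and the joining membership. The sketch below assumes that rule and uses the node names from the slides; `has_majority` and `joint_quorum` are hypothetical helpers.

```python
def has_majority(acks, members):
    """True if more than half of `members` appear in `acks`."""
    return len(acks & members) > len(members) // 2


def joint_quorum(acks, old, new):
    """A joint decision needs separate majorities in the old and new ensembles."""
    return has_majority(acks, old) and has_majority(acks, new)


OLD = {"riak_01", "riak_02", "riak_03"}
NEW = {"riak_07", "riak_08", "riak_09"}
```

Under this rule, four acknowledgements split two and two across the ensembles succeed, while all three nodes of only one ensemble do not.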

Slide 149

Slide 149 text

New Ensemble riak_07 riak_08 riak_09 [{riak_07}, {riak_08}, {riak_09}]

Slide 150

Slide 150 text

Single-key linearizability; reduced availability Strong Consistency

Slide 151

Slide 151 text

$ riak-admin bucket-type create strongly_consistent \
    '{"props":{"consistent":true}}'
$ riak-admin bucket-type status strongly_consistent
$ riak-admin bucket-type activate strongly_consistent
Enable strong consistency; http://docs.basho.com/riak/latest/dev/advanced/strong-consistency/

Slide 152

Slide 152 text

Conflict-Free Replicated Data Types Strong Eventual Consistency

Slide 153

Slide 153 text

$ riak-admin bucket-type create maps \
    '{"props":{"datatype":"map"}}'
$ riak-admin bucket-type create sets \
    '{"props":{"datatype":"set"}}'
$ riak-admin bucket-type create counters \
    '{"props":{"datatype":"counter"}}'
$ riak-admin bucket-type status maps
$ riak-admin bucket-type activate maps
Create bucket type for data types; http://docs.basho.com/riak/latest/dev/using/data-types/

Slide 154

Slide 154 text

$ curl -XPOST http://localhost:10018/types/counters/buckets/counters/datatypes/traffic_tickets \
    -H "Content-Type: application/json" \
    -d '{"increment": 1}'
$ curl http://localhost:10018/types/counters/buckets/counters/datatypes/traffic_tickets
Operate on counters; http://docs.basho.com/riak/latest/dev/using/data-types/

Slide 155

Slide 155 text

$ curl -XPOST http://localhost:10018/types/sets/buckets/travel/datatypes/cities \
    -H "Content-Type: application/json" \
    -d '{"add_all":["Toronto", "Montreal"]}'
$ curl -XPOST http://localhost:10018/types/sets/buckets/travel/datatypes/cities \
    -H "Content-Type: application/json" \
    -d '{"remove": "Montreal"}'
$ curl http://localhost:10018/types/sets/buckets/travel/datatypes/cities
Operate on sets; http://docs.basho.com/riak/latest/dev/using/data-types/

Slide 156

Slide 156 text

$ curl -XPOST http://localhost:10018/types/maps/buckets/customers/datatypes/ahmed_info \
    -H "Content-Type: application/json" \
    -d '
    {
      "update": {
        "first_name_register": "Ahmed",
        "phone_number_register": "5551234567"
      }
    }'
$ curl -XPOST http://localhost:8098/types/maps/buckets/customers/datatypes/ahmed_info \
    -H "Content-Type: application/json" \
    -d '
    {
      "update": {
        "annika_info_map": {
          "update": {
            "interests_set": {
              "add": "tango dancing"
            }
          }
        }
      }
    }'
Operate on maps; http://docs.basho.com/riak/latest/dev/using/data-types/

Slide 157

Slide 157 text

Questions?