Slide 1

Slide 1 text

Choose your own Consistency

Slide 2

Slide 2 text

A concurrent talk by: @cmeik & @tsantero

Slide 3

Slide 3 text

Leveraging Riak’s New Data Types Conflict-Free Replicated Datatypes Chris Meiklejohn @cmeik EFL Toronto 2013

Slide 4

Slide 4 text

cmeiklejohn

Slide 5

Slide 5 text

Riak Made Consistent Strong Consistency in Riak 2.0 Tom Santero @tsantero EFL Toronto 2013

Slide 6

Slide 6 text

tsantero@basho.com

Slide 7

Slide 7 text

Riak Overview

Slide 8

Slide 8 text

Riak Overview Erlang implementation of Dynamo

Slide 9

Slide 9 text

Riak Overview Erlang implementation of Dynamo

Slide 10

Slide 10 text

Riak Object

Slide 11

Slide 11 text

Key Value

Slide 12

Slide 12 text

Riak Overview Consistent hashing

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

Riak Overview Dynamic membership

Slide 17

Slide 17 text

Riak Overview Replication factor

Slide 18

Slide 18 text

Replica Replica Replica
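
A minimal sketch of how consistent hashing and a replication factor can fit together, assuming a ring whose size is a power of two and a SHA-1 hash space. The module and function names (`ring_sketch`, `partition/2`, `preflist/3`) are hypothetical, and Riak's real partition-claim and preference-list logic is more involved than this.

```erlang
-module(ring_sketch).
-export([partition/2, preflist/3]).

%% Size of the SHA-1 hash space (2^160).
-define(RING_TOP, (1 bsl 160)).

%% Map a {Bucket, Key} pair to a partition number in 0..RingSize-1.
%% Assumes RingSize is a power of two so the ring divides evenly.
partition(BKey, RingSize) ->
    <<Hash:160/integer>> = crypto:hash(sha, term_to_binary(BKey)),
    Hash div (?RING_TOP div RingSize).

%% The N consecutive partitions, starting at the key's own partition
%% and wrapping around the ring, that would hold the N replicas.
preflist(BKey, RingSize, N) ->
    First = partition(BKey, RingSize),
    [(First + I) rem RingSize || I <- lists:seq(0, N - 1)].
```

With a replication factor of 3, `ring_sketch:preflist({<<"Bucket">>, <<"Key">>}, 64, 3)` returns three adjacent partition numbers, one per replica.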

Slide 19

Slide 19 text

High Availability “...any non-failing node can respond to any request” --Gilbert & Lynch

Slide 20

Slide 20 text

Eventual Consistency “Eventual consistency is a consistency model used in distributed computing that informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value.” --Wikipedia

Slide 21

Slide 21 text

Riak Overview Two Writes: {Writer, Value, Time}

Slide 22

Slide 22 text

[{a, v1, t1}] [{b, v1, t2}] [{a, v1, t1}]

Slide 23

Slide 23 text

Riak Overview Last Writer Wins Allow Mult

Slide 24

Slide 24 text

Riak Overview Last Writer Wins [{b, v1, t2}] [{b, v1, t2}] [{b, v1, t2}]

Slide 25

Slide 25 text

Riak Overview Allow Mult [{a, v1, t1}, {b, v1, t2}] [{a, v1, t1}, {b, v1, t2}] [{a, v1, t1}, {b, v1, t2}]

Slide 26

Slide 26 text

User-specified merge
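
The two resolution strategies can be sketched over the `{Writer, Value, Time}` tuples used on the earlier slides. This is an illustration only: the module name is hypothetical, timestamps are assumed to be integers, and for the user-specified merge each value is assumed to be a list of items.

```erlang
-module(sibling_sketch).
-export([last_writer_wins/1, union_merge/1]).

%% Last writer wins: keep only the sibling with the greatest timestamp.
%% On a tie the sibling seen first is kept.
last_writer_wins([First | Rest]) ->
    lists:foldl(fun({_, _, T} = S, {_, _, BestT}) when T > BestT -> S;
                   (_, Best) -> Best
                end, First, Rest).

%% One possible user-specified merge for allow_mult siblings:
%% take the union of every sibling's items (sorted, deduplicated).
union_merge(Siblings) ->
    lists:usort(lists:append([Items || {_Writer, Items, _Time} <- Siblings])).
```

Last writer wins silently drops the losing write, while the union merge keeps every item that any writer saw.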

Slide 27

Slide 27 text

CRDTs

Slide 28

Slide 28 text

CRDTs Convergent Replicated Data Types

Slide 29

Slide 29 text

CRDTs Commutative Replicated Data Types

Slide 30

Slide 30 text

CRDTs Synchronization-free data structures

Slide 31

Slide 31 text

CRDTs Monotonic and confluent; convergent

Slide 32

Slide 32 text

CRDTs Create siblings; resolve via merge.

Slide 33

Slide 33 text

The Theory

Slide 34

Slide 34 text

No content

Slide 35

Slide 35 text

No content

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

Bounded Join Semilattices

Slide 38

Slide 38 text

Bounded Join Semilattices Partially ordered set; least upper bound; ACI.

Slide 39

Slide 39 text

Bounded Join Semilattices Associativity: (X · Y) · Z = X · (Y · Z)

Slide 40

Slide 40 text

Bounded Join Semilattices Commutativity: X · Y = Y · X

Slide 41

Slide 41 text

Bounded Join Semilattices Idempotence: X · X = X

Slide 42

Slide 42 text

Bounded Join Semilattices Objects grow over time; merge computes LUB

Slide 43

Slide 43 text

Bounded Join Semilattices Monotonic and confluent; convergent

Slide 44

Slide 44 text

Bounded Join Semilattices Map into another via monotone functions

Slide 45

Slide 45 text

Bounded Join Semilattices Examples

Slide 46

Slide 46 text

Set; merge function: union. Diagram (inputs; pairwise merges; final merge): {a}, {b}, {c}; {a, b}, {a, c}, {b, c}; {a, b, c}

Slide 47

Slide 47 text

Increasing natural; merge function: max. Diagram (inputs; pairwise merges; final merge): 3, 5, 7; 5, 7; 7

Slide 48

Slide 48 text

Booleans; merge function: or. Diagram (inputs; pairwise merges; final merge): F, F, T; F, T; T
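
The three example merge functions, and a brute-force check of the ACI laws over a handful of sample values, can be sketched as follows. The module and function names are hypothetical, and the check only covers the samples it is given; it is not a proof.

```erlang
-module(lattice_sketch).
-export([merge_set/2, merge_max/2, merge_or/2, is_aci/2]).

%% Sets (as ordsets) merge by union.
merge_set(A, B) -> ordsets:union(A, B).

%% Increasing naturals merge by max.
merge_max(A, B) -> erlang:max(A, B).

%% Booleans merge by or.
merge_or(A, B) -> A orelse B.

%% Check idempotence, commutativity and associativity of Merge over
%% every combination of the sample values.
is_aci(Merge, Samples) ->
    lists:all(fun(X) -> Merge(X, X) =:= X end, Samples)
    andalso
    lists:all(fun({X, Y}) -> Merge(X, Y) =:= Merge(Y, X) end,
              [{X, Y} || X <- Samples, Y <- Samples])
    andalso
    lists:all(fun({X, Y, Z}) ->
                      Merge(Merge(X, Y), Z) =:= Merge(X, Merge(Y, Z))
              end,
              [{X, Y, Z} || X <- Samples, Y <- Samples, Z <- Samples]).
```

Plain addition fails the check because it is not idempotent (1 + 1 is not 1), which is why a counter cannot simply merge by adding replica values.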

Slide 49

Slide 49 text

Monotone function f: x <= y implies f(x) <= f(y)
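
As a small illustration, set size maps the set lattice (ordered by subset) into the increasing naturals (ordered by `=<`), and the monotone property can be checked over sample sets. The module name and the choice of set size as the example function are assumptions made for this sketch.

```erlang
-module(monotone_sketch).
-export([is_monotone/2]).

%% F is monotone on the samples if, whenever X is a subset of Y,
%% F(X) =< F(Y). Sets are ordsets; F returns a number.
is_monotone(F, Sets) ->
    lists:all(fun({X, Y}) -> F(X) =< F(Y) end,
              [{X, Y} || X <- Sets, Y <- Sets, ordsets:is_subset(X, Y)]).
```

`monotone_sketch:is_monotone(fun ordsets:size/1, [[], [a], [a, b]])` holds, while negating the size breaks the property.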

Slide 50

Slide 50 text

CvRDT Examples OR-SET

Slide 51

Slide 51 text

[ [{1, a}], [] ] [ [{1, a}], [] ]

Slide 52

Slide 52 text

[ [{1, a}], [] ] [ [{1, a}], [] ] [ [{1, a}], [{1, a}] ] [ [{1, a}], [{1, a}] ]

Slide 53

Slide 53 text

[ [{1, a}], [] ] [ [{1, a}], [] ] [ [{1, a}], [{1, a}] ] [ [{1, a}], [{1, a}] ] [ [{1, a}, {2, a}], [{1, a}] ]

Slide 54

Slide 54 text

[ [{1, a}], [] ] [ [{1, a}], [] ] [ [{1, a}], [{1, a}] ] [ [{1, a}], [{1, a}] ] [ [{1, a}, {2, a}], [{1, a}] ] [ [{1, a}, {2, a}], [{1, a}] ]

Slide 55

Slide 55 text

[ [{1, a}], [] ] [ [{1, a}], [] ]

Slide 56

Slide 56 text

[ [{1, a}], [] ] [ [{1, a}], [] ] [ [{1, a}, {2, b}], [] ]

Slide 57

Slide 57 text

[ [{1, a}], [] ] [ [{1, a}], [] ] [ [{1, a}, {2, b}], [] ] [ [{1, a}], [{1, a}] ]

Slide 58

Slide 58 text

[ [{1, a}], [] ] [ [{1, a}], [] ] [ [{1, a}, {2, b}], [] ] [ [{1, a}], [{1, a}] ] [ [{1, a}, {2, b}], [{1, a}] ]
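
The states on the OR-Set slides can be reproduced with a small sketch that keeps the same `[Adds, Removes]` shape, where each entry is a `{Tag, Element}` pair. This is not the riak_dt implementation; the module name is hypothetical and unique tags are assumed to be supplied by the caller.

```erlang
-module(orset_sketch).
-export([new/0, add/3, remove/2, merge/2, value/1]).

%% State is [Adds, Removes], both ordsets of {Tag, Elem} pairs.
new() -> [[], []].

%% Add Elem under a caller-supplied unique Tag.
add(Tag, Elem, [Adds, Removes]) ->
    [ordsets:add_element({Tag, Elem}, Adds), Removes].

%% Remove only the {Tag, Elem} pairs this replica has observed.
remove(Elem, [Adds, Removes]) ->
    Observed = [P || {_, E} = P <- Adds, E =:= Elem],
    [Adds, ordsets:union(Removes, ordsets:from_list(Observed))].

%% Merge is the pairwise union of both components.
merge([AddsA, RemsA], [AddsB, RemsB]) ->
    [ordsets:union(AddsA, AddsB), ordsets:union(RemsA, RemsB)].

%% An element is present if at least one of its tags is not removed.
value([Adds, Removes]) ->
    lists:usort([E || {_, E} <- ordsets:subtract(Adds, Removes)]).
```

Re-adding `a` under a new tag after a remove makes it visible again, and a concurrent add of `b` survives a merge with a replica that removed `a`.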

Slide 59

Slide 59 text

riak_dt git clone git@github.com:basho/riak_dt.git

Slide 60

Slide 60 text

-type crdt() :: term().
-type operation() :: term().
-type actor() :: term().
-type value() :: term().
-type error() :: term().
-callback new() -> crdt().
-callback value(crdt()) -> term().
-callback value(term(), crdt()) -> value().
-callback update(operation(), actor(), crdt()) ->
    {ok, crdt()} | {error, error()}.
-callback merge(crdt(), crdt()) -> crdt().
-callback equal(crdt(), crdt()) -> boolean().
-callback to_binary(crdt()) -> binary().
-callback from_binary(binary()) -> crdt().
riak_dt/src/riak_dt.erl

Slide 61

Slide 61 text

Riak 1.4

Slide 62

Slide 62 text

Riak 1.4 Counters: G-Counter, PN-Counter

Slide 63

Slide 63 text

Riak 1.4 Counters: non-idempotent; O(Actors)

Slide 64

Slide 64 text

riak_dt/src/riak_dt_gcounter.erl
-module(riak_dt_gcounter).
-export([new/0, new/2, value/1, value/2, update/3, merge/2,
         equal/2, to_binary/1, from_binary/1]).
-export_type([gcounter/0, gcounter_op/0]).
-opaque gcounter() :: orddict:orddict().
-type gcounter_op() :: increment | {increment, pos_integer()}.

Slide 65

Slide 65 text

riak_dt/src/riak_dt_pncounter.erl
-module(riak_dt_pncounter).
-export([new/0, new/2, value/1, value/2,
         update/3, merge/2, equal/2, to_binary/1, from_binary/1]).
-export_type([pncounter/0, pncounter_op/0]).
-opaque pncounter() :: [{Actor::riak_dt:actor(), Inc::pos_integer(),
                         Dec::pos_integer()}].
-type pncounter_op() :: riak_dt_gcounter:gcounter_op() | decrement_op().
-type decrement_op() :: decrement | {decrement, pos_integer()}.
-type pncounter_q() :: positive | negative.
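
A minimal sketch of both counters, assuming one entry per actor and a per-actor max as the merge. It is not the riak_dt code: the module and function names are hypothetical, and the PN-Counter is modelled here as a pair of G-Counters rather than the `{Actor, Inc, Dec}` list shown in the type above.

```erlang
-module(counter_sketch).
-export([g_new/0, g_increment/2, g_value/1, g_merge/2,
         pn_new/0, pn_increment/2, pn_decrement/2, pn_value/1, pn_merge/2]).

%% G-Counter: an orddict of Actor -> count, so state is O(Actors).
g_new() -> orddict:new().

g_increment(Actor, G) -> orddict:update_counter(Actor, 1, G).

g_value(G) -> orddict:fold(fun(_Actor, N, Sum) -> Sum + N end, 0, G).

%% Merge takes the per-actor maximum, which is idempotent even though
%% the increment operation itself is not.
g_merge(A, B) ->
    orddict:merge(fun(_Actor, N1, N2) -> erlang:max(N1, N2) end, A, B).

%% PN-Counter: one G-Counter for increments, one for decrements.
pn_new() -> {g_new(), g_new()}.

pn_increment(Actor, {P, N}) -> {g_increment(Actor, P), N}.

pn_decrement(Actor, {P, N}) -> {P, g_increment(Actor, N)}.

pn_value({P, N}) -> g_value(P) - g_value(N).

pn_merge({P1, N1}, {P2, N2}) -> {g_merge(P1, P2), g_merge(N1, N2)}.
```

Merging the same replica state twice leaves the value unchanged, which is what makes the structure safe under retried or duplicated replication.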

Slide 66

Slide 66 text

Riak 2.0

Slide 67

Slide 67 text

Riak 2.0 Requires bucket types; HTTP and Protocol Buffers API

Slide 68

Slide 68 text

Riak 2.0 Caveats: 2i, MapReduce, Yokozuna

Slide 69

Slide 69 text

Riak 2.0 Sets: Add, Remove, Membership; Idempotent

Slide 70

Slide 70 text

Riak 2.0 Sets: Add wins; O(Actors + Elements)

Slide 71

Slide 71 text

riak_dt/src/riak_dt_gset.erl
-module(riak_dt_gset).
-behaviour(riak_dt).
%% API
-export([new/0, value/1, update/3, merge/2, equal/2,
         to_binary/1, from_binary/1, value/2]).
-export_type([gset/0, binary_gset/0, gset_op/0]).
-opaque gset() :: members().
-type binary_gset() :: binary().
-type gset_op() :: {add, member()}.

Slide 72

Slide 72 text

riak_dt/src/riak_dt_orset.erl
-module(riak_dt_orset).
-behaviour(riak_dt).
%% API
-export([new/0, value/1, update/3, merge/2, equal/2,
         to_binary/1, from_binary/1, value/2, precondition_context/1]).
-export_type([orset/0, binary_orset/0, orset_op/0]).
-opaque orset() :: orddict:orddict().
-type binary_orset() :: binary(). %% A binary that from_binary/1 will operate on.
-type orset_op() :: {add, member()} | {remove, member()} |
                    {add_all, [member()]} | {remove_all, [member()]} |
                    {update, [orset_op()]}.
-type actor() :: riak_dt:actor().
-type member() :: term().

Slide 73

Slide 73 text

riak_dt/src/riak_dt_orswot.erl
-module(riak_dt_orswot).
-behaviour(riak_dt).
-export_type([orswot/0, orswot_op/0, binary_orswot/0]).
-opaque orswot() :: {riak_dt_vclock:vclock(), entries()}.
-type binary_orswot() :: binary(). %% A binary that from_binary/1 will operate on.
-type orswot_op() :: {add, member()} | {remove, member()} |
                     {add_all, [member()]} | {remove_all, [member()]} |
                     {update, [orswot_op()]}.
-type orswot_q() :: size | {contains, term()}.
-type actor() :: riak_dt:actor().
-type entries() :: [{member(), minimal_clock()}].
-type minimal_clock() :: [dot()].
-type dot() :: {actor(), Count::pos_integer()}.
-type member() :: term().

Slide 74

Slide 74 text

Riak 2.0 Maps: Recursive; Associative Array; Nestable

Slide 75

Slide 75 text

Riak 2.0 Maps: Update wins; O(Actors + Elements)

Slide 76

Slide 76 text

riak_dt/src/riak_dt_map.erl
-module(riak_dt_map).
-behaviour(riak_dt).
%% API
-export([new/0, value/1, value/2, update/3, merge/2,
         equal/2, to_binary/1, from_binary/1, precondition_context/1]).
-export_type([map/0, binary_map/0, map_op/0]).
-type binary_map() :: binary(). %% A binary that from_binary/1 will accept
-type map() :: {riak_dt_vclock:vclock(), valuelist()}.
-type field() :: {Name::term(), Type::crdt_mod()}.
-type crdt_mod() :: riak_dt_pncounter | riak_dt_lwwreg |
                    riak_dt_od_flag |
                    riak_dt_map | riak_dt_orswot.
-type valuelist() :: [{field(), entry()}].
-type entry() :: {minimal_clock(), crdt()}.
-type crdt() :: riak_dt_pncounter:pncounter() | riak_dt_od_flag:od_flag() |
                riak_dt_lwwreg:lwwreg() |
                riak_dt_orswot:orswot() |
                riak_dt_map:map().
-type map_op() :: {update, [map_field_update() | map_field_op()]}.

Slide 77

Slide 77 text

Riak 2.0 Maps: LWW-Register, Booleans, Sets, and Maps

Slide 78

Slide 78 text

Riak 2.0 LWW-Register: last writer wins

Slide 79

Slide 79 text

-module(riak_dt_lwwreg).
-export([new/0, value/1, value/2, update/3, merge/2,
         equal/2, to_binary/1, from_binary/1]).
-export_type([lwwreg/0, lwwreg_op/0]).
-opaque lwwreg() :: {term(), non_neg_integer()}.
-type lwwreg_op() :: {assign, term(), non_neg_integer()} | {assign, term()}.
-type lww_q() :: timestamp.
riak_dt/src/riak_dt_lwwreg.erl
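
A last-writer-wins register over the `{Value, Timestamp}` shape above can be sketched as follows. The module name is hypothetical, and the tie-break on equal timestamps (taking the larger term) is an assumption chosen so that merge stays commutative; it is not taken from the slides.

```erlang
-module(lwwreg_sketch).
-export([new/0, assign/3, value/1, merge/2]).

%% State is {Value, Timestamp}.
new() -> {undefined, 0}.

%% Keep the new value only if its timestamp is strictly newer.
assign(Value, TS, {_Old, OldTS}) when TS > OldTS -> {Value, TS};
assign(_Value, _TS, Reg) -> Reg.

value({Value, _TS}) -> Value.

%% The register with the greater timestamp wins. On equal timestamps
%% the larger term is taken so both replicas pick the same result.
merge({_, TS1} = R1, {_, TS2}) when TS1 > TS2 -> R1;
merge({_, TS1}, {_, TS2} = R2) when TS2 > TS1 -> R2;
merge(R1, R2) -> erlang:max(R1, R2).
```

Because the outcome depends only on timestamps, the earlier write is discarded rather than preserved as a sibling.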

Slide 80

Slide 80 text

Riak 2.0 Boolean: Enabled, Disabled; O(Actors)

Slide 81

Slide 81 text

-module(riak_dt_enable_flag).
-behaviour(riak_dt).
-export([new/0, value/1, value/2, update/3, merge/2, equal/2,
         from_binary/1, to_binary/1]).
riak_dt/src/riak_dt_enable_flag.erl

Slide 82

Slide 82 text

This project is funded by the European Union, 7th Research Framework Programme, ICT call 10, grant agreement n°609551.

Slide 83

Slide 83 text

No content

Slide 84

Slide 84 text

Strong Consistency

Slide 85

Slide 85 text

Strong Consistency Why?

Slide 86

Slide 86 text

Strong Consistency Why? atomicity

Slide 87

Slide 87 text

Strong Consistency Why? recency

Slide 88

Slide 88 text

Strong Consistency Why? partial writes

Slide 89

Slide 89 text

A A A

Slide 90

Slide 90 text

A A A Val = <<"B">>.

Slide 91

Slide 91 text

A A A Val = <<"B">>.

Slide 92

Slide 92 text

B A A

Slide 93

Slide 93 text

B A A riakc_pb_socket:get(PBC, <<"Bucket">>, <<"Key">>).

Slide 94

Slide 94 text

B A A riakc_pb_socket:get(PBC, <<"Bucket">>, <<"Key">>).

Slide 95

Slide 95 text

B A A riakc_pb_socket:get(PBC, <<"Bucket">>, <<"Key">>). B B

Slide 96

Slide 96 text

Strong Consistency Single key atomic operations

Slide 97

Slide 97 text

Strong Consistency any get sees most recent put

Slide 98

Slide 98 text

Strong Consistency get/modify/put cycle fails if object is changed

Slide 99

Slide 99 text

Strong Consistency put w/o vclock fails if object exists
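
The last two rules can be modelled with a toy in-memory store. This is a sketch of the semantics only, not Riak's implementation or client API: the module and function names are hypothetical, an integer version stands in for the vclock, and `undefined` stands in for a put without a vclock.

```erlang
-module(cas_sketch).
-export([new/0, read/2, write/4]).

%% Store maps Key -> {Version, Value}.
new() -> #{}.

read(Key, Store) ->
    case maps:find(Key, Store) of
        {ok, {Version, Value}} -> {ok, Version, Value};
        error -> notfound
    end.

%% A write without a version succeeds only if the key does not exist.
write(Key, undefined, Value, Store) ->
    case maps:is_key(Key, Store) of
        true -> {error, failed};
        false -> {ok, Store#{Key => {1, Value}}}
    end;
%% A write with a version succeeds only if that version is still
%% current, so a get/modify/put cycle fails if the object changed.
write(Key, Version, Value, Store) ->
    case maps:find(Key, Store) of
        {ok, {Version, _}} -> {ok, Store#{Key => {Version + 1, Value}}};
        _ -> {error, failed}
    end.
```

A client whose write fails would be expected to read again and retry with the fresh version.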

Slide 100

Slide 100 text

Consensus

Slide 101

Slide 101 text

Distributed Consensus “The problem of reaching agreement among remote processes is one of the most fundamental problems in distributed computing and is at the core of many algorithms for distributed data processing, distributed file management, and fault-tolerant distributed applications.” --Fischer, Lynch, Paterson

Slide 102

Slide 102 text

Consensus Guarantees: termination, agreement, validity

Slide 103

Slide 103 text

Termination processes eventually decide on a value

Slide 104

Slide 104 text

Agreement processes that decide do so on the same value

Slide 105

Slide 105 text

Validity values must have been proposed

Slide 106

Slide 106 text

Consensus Algorithms

Slide 107

Slide 107 text

Consensus Algorithms Paxos, ZAB, Raft

Slide 108

Slide 108 text

Paxos Lamport 1990: “The Part-Time Parliament”

Slide 109

Slide 109 text

No content

Slide 110

Slide 110 text

ZAB ZooKeeper Atomic Broadcast

Slide 111

Slide 111 text

Raft Ousterhout, Ongaro 2013: In search of an understandable consensus algorithm

Slide 112

Slide 112 text

Paxos coordinated requests; leaders

Slide 113

Slide 113 text

Paxos 2 round trips / request

Slide 114

Slide 114 text

Node 1: N++
Node 1 → Node 2, Node 3: prepare(N)
Node 2 → Node 1: promise(N, V_b)
Node 3 → Node 1: promise(N, V_c)
Node 1: V_N = f(V_a, V_b, V_c)
Node 1 → Node 2, Node 3: commit(N, V_N)
Node 2, Node 3 → Node 1: accept(N)
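
The receiving side of this exchange (Node 2 and Node 3) can be sketched as a pure state machine, using the slide's message names. The module name, the state shape, and the `nack` reply are assumptions made for this sketch; a real implementation also needs durable storage and a network layer.

```erlang
-module(paxos_acceptor_sketch).
-export([new/0, prepare/2, commit/3]).

%% State is {PromisedN, Accepted}, where Accepted is none | {N, Value}.
new() -> {0, none}.

%% prepare(N): promise to ignore lower-numbered proposals and report
%% any value already accepted, so the proposer can take it into account.
prepare(N, {Promised, Accepted}) when N > Promised ->
    {{promise, N, Accepted}, {N, Accepted}};
prepare(_N, State) ->
    {nack, State}.

%% commit(N, V): accept the value unless a higher N has been promised.
commit(N, V, {Promised, _Accepted}) when N >= Promised ->
    {{accept, N}, {N, {N, V}}};
commit(_N, _V, State) ->
    {nack, State}.
```

Each function returns `{Reply, NewState}`, so the two round trips on the slide correspond to one `prepare/2` call and one `commit/3` call per acceptor.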

Slide 115

Slide 115 text

Multi-Paxos First Request

Slide 116

Slide 116 text

Node 1: N++; I = 0
Node 1 → Node 2, Node 3: prepare(N, I)
Node 2 → Node 1: promise(N, I, V_b)
Node 3 → Node 1: promise(N, I, V_c)
Node 1: V_N = f(V_a, V_b, V_c)
Node 1 → Node 2, Node 3: commit(N, I, V_N)
Node 2, Node 3 → Node 1: accept(N, I)

Slide 117

Slide 117 text

Multi-Paxos Each Additional Request

Slide 118

Slide 118 text

Node 1: I++
Node 1 → Node 2, Node 3: commit(N, I, V)
Node 2, Node 3 → Node 1: accept(N, I)

Slide 119

Slide 119 text

Multi-Paxos Each Request: ship entire state

Slide 120

Slide 120 text

Riak

Slide 121

Slide 121 text

Riak Key/Value

Slide 122

Slide 122 text

Riak Keys are Independent

Slide 123

Slide 123 text

Riak read repair, active anti-entropy

Slide 124

Slide 124 text

Riak individual key = isolated state

Slide 125

Slide 125 text

Riak multi-paxos per key

Slide 126

Slide 126 text

Consensus Groups

Slide 127

Slide 127 text

Consensus Groups participants in decisioning; ensembles

Slide 128

Slide 128 text

Consensus Groups preference lists (preflists) in Riak

Slide 129

Slide 129 text

preflist

Slide 130

Slide 130 text

No content

Slide 131

Slide 131 text

No content

Slide 132

Slide 132 text

No content

Slide 133

Slide 133 text

No content

Slide 134

Slide 134 text

Consensus Groups one group/ensemble per preflist

Slide 135

Slide 135 text

Consensus Groups ring_size = 256, 256 ensembles

Slide 136

Slide 136 text

Ensembles

Slide 137

Slide 137 text

Ensembles leader election; Epochs; get/put operations

Slide 138

Slide 138 text

Get Operations leader reads local object; if Epoch old: refresh

Slide 139

Slide 139 text

Node 1: obj.epoch < epoch
Node 1 → Node 2, Node 3: get(key)
Node 2 → Node 1: reply(Epoch_b, Seq_b, Val_b)
Node 3 → Node 1: reply(Epoch_c, Seq_c, Val_c)
Node 1: Val = latest(Val_a, Val_b, Val_c); Val.epoch = epoch
Node 1 → Node 2, Node 3: write(Epoch, ++Seq, Val)
Node 2, Node 3 → Node 1: ack(Epoch, Seq)

Slide 140

Slide 140 text

Node 1: obj.epoch == epoch; Reply = local_get(Key) (no messages to Node 2 or Node 3)

Slide 141

Slide 141 text

Get Operations Worst Case: 2 roundtrips / request Best Case: 0 roundtrips / request
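
The leader's decision between the two cases can be sketched as a pure function over objects modelled as `{Epoch, Seq, Value}` tuples. The names are hypothetical and the handling of `Seq` is a simplification (the refreshed object takes the latest sequence number plus one); the messaging and quorum handling are left out entirely.

```erlang
-module(ensemble_get_sketch).
-export([latest/1, leader_get/3]).

%% Objects are {Epoch, Seq, Value}; tuple ordering compares Epoch
%% first and then Seq, so the maximum is the most recent object.
latest(Objs) -> lists:max(Objs).

%% Best case: the local object already belongs to the current epoch,
%% so the leader answers from local state with no round trips.
leader_get(Epoch, {Epoch, _, _} = Local, _Replies) ->
    {local, Local};
%% Worst case: the local object is from an older epoch. Pick the
%% latest copy among the replies and rewrite it under the current
%% epoch (one round trip to fetch, one to write).
leader_get(Epoch, Local, Replies) ->
    {_, Seq, Val} = latest([Local | Replies]),
    {refreshed, {Epoch, Seq + 1, Val}}.
```

The `refreshed` result is the object the leader would then send out in the `write(Epoch, ++Seq, Val)` step.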

Slide 142

Slide 142 text

Put Operations leader reads local object; if Epoch old: refresh; modify object; commit modified object

Slide 143

Slide 143 text

Node 1: obj.epoch < epoch
Node 1 → Node 2, Node 3: get(key)
Node 2 → Node 1: reply(Epoch_b, Seq_b, Val_b)
Node 3 → Node 1: reply(Epoch_c, Seq_c, Val_c)
Node 1: Latest = latest(Val_a, Val_b, Val_c); Val = modify(Latest)
Node 1 → Node 2, Node 3: write(Epoch, ++Seq, Val)
Node 2, Node 3 → Node 1: ack(Epoch, Seq)

Slide 144

Slide 144 text

Node 1: obj.epoch == epoch; Latest = local_get(Key); Val = modify(Latest)
Node 1 → Node 2, Node 3: write(Epoch, ++Seq, Val)
Node 2, Node 3 → Node 1: ack(Epoch, Seq)

Slide 145

Slide 145 text

Put Operations Worst Case: 2 roundtrips / write Best Case: 1 roundtrip / write

Slide 146

Slide 146 text

Failed Quorums Leader Re-election ; new Epoch

Slide 147

Slide 147 text

Cluster Membership

Slide 148

Slide 148 text

Cluster Membership add/remove nodes

Slide 149

Slide 149 text

Cluster Membership consensus state

Slide 150

Slide 150 text

Cluster Membership multi-paxos joint consensus

Slide 151

Slide 151 text

Existing Ensemble Joining Ensemble riak_01 riak_02 riak_03 riak_07 riak_08 riak_09 [{riak_01}, {riak_02}, {riak_03}] [{riak_07}, {riak_08}, {riak_09}]

Slide 152

Slide 152 text

Joint-Consensus Ensemble [{riak_01}, {riak_02}, {riak_03}, {riak_07}, {riak_08}, {riak_09}]

Slide 153

Slide 153 text

Joint-Consensus Ensemble [{riak_01}, {riak_02}, {riak_03}, {riak_07}, {riak_08}, {riak_09}]
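
One common reading of joint consensus, as in Raft-style membership changes, is that a decision made during the transition needs a majority of the old members and a majority of the new members separately. The slides do not spell out the exact rule Riak uses, so the sketch below is an assumption, with hypothetical module and function names.

```erlang
-module(joint_quorum_sketch).
-export([has_quorum/2, joint_quorum/3]).

%% True if a strict majority of Members appear in Acks.
has_quorum(Acks, Members) ->
    Heard = [M || M <- Members, lists:member(M, Acks)],
    length(Heard) * 2 > length(Members).

%% During the transition, require a majority in both ensembles.
joint_quorum(Acks, Old, New) ->
    has_quorum(Acks, Old) andalso has_quorum(Acks, New).
```

With the ensembles from the slides, acks from riak_01, riak_02, riak_07 and riak_08 satisfy the joint quorum, while acks from riak_01, riak_02, riak_03 and riak_07 do not, even though both are four of six nodes.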

Slide 154

Slide 154 text

New Ensemble riak_07 riak_08 riak_09 [{riak_07}, {riak_08}, {riak_09}]

Slide 155

Slide 155 text

Q & A