Slide 1

Slide 1 text

Data Structures in Riak
Russell Brown, Sean Cribbs

Slide 2

Slide 2 text

Riak is Eventually-Consistent

Slide 3

Slide 3 text

Eventual Consistency
1. Replicated
2. Loose coordination
3. Convergence

Slide 4

Slide 4 text

Eventual is Good
✔ Fault-tolerant
✔ Highly available
✔ Low-latency

Slide 5

Slide 5 text

Consistency?
[Diagram: nodes 1, 2, 3 and conflicting values A, B]
• No clear winner!
• Throw one out?
• Keep both?

Slide 6

Slide 6 text

Consistency?
[Diagram: nodes 1, 2, 3 and conflicting values A, B]
• No clear winner!
• Throw one out? (Cassandra)
• Keep both?

Slide 7

Slide 7 text

Consistency?
[Diagram: nodes 1, 2, 3 and conflicting values A, B]
• No clear winner!
• Throw one out? (Cassandra)
• Keep both? (Riak & Voldemort)

Slide 8

Slide 8 text

Conflicts! A! B!

Slide 9

Slide 9 text

Siblings in Riak

HTTP/1.1 300 Multiple Choices
X-Riak-Vclock: a85hYGDgyGDKBVIszMk55zKYEhnzWBlKIniO8kGF2TyvHYIKf0cIszUnMTBzHYVKbIhEUl+VK4spDFTPxhHzFyqhEoVQz7wkSAGLMGuz6FSocFIUijE3pt5HlsgCAA==
Vary: Accept, Accept-Encoding
Server: MochiWeb/1.1 WebMachine/1.9.0 (participate in the frantic)
Date: Fri, 30 Sep 2011 15:24:35 GMT
Content-Type: text/plain
Content-Length: 102

Siblings:
16vic4eU9ny46o4KPiDz1f
4v5xOg4bVwUYZdMkqf0d6I
6nr5tDTmhxnwuAFJDd2s6G
6zRSZFUJlHXZ15o9CG0BYl

Slide 10

Slide 10 text

Siblings in Riak

HTTP/1.1 300 Multiple Choices
X-Riak-Vclock: a85hYGDgyGDKBVIszMk55zKYEhnzWBlKIniO8kGF2TyvHYIKf0cIszUnMTBzHYVKbIhEUl+VK4spDFTPxhHzFyqhEoVQz7wkSAGLMGuz6FSocFIUijE3pt5HlsgCAA==
Vary: Accept, Accept-Encoding
Server: MochiWeb/1.1 WebMachine/1.9.0 (participate in the frantic)
Date: Fri, 30 Sep 2011 15:24:35 GMT
Content-Type: text/plain
Content-Length: 102

Siblings:
16vic4eU9ny46o4KPiDz1f
4v5xOg4bVwUYZdMkqf0d6I
6nr5tDTmhxnwuAFJDd2s6G
6zRSZFUJlHXZ15o9CG0BYl

(list of siblings)

Slide 11

Slide 11 text

Siblings in Riak

HTTP/1.1 300 Multiple Choices
X-Riak-Vclock: a85hYGDgyGDKBVIszMk55zKYEhnzWBlKIniO8kGF2TyvHYIKf0cIszUnMTBzHYVKbIhEUl+VK4spDFTPxhHzFyqhEoVQz7wkSAGLMGuz6FSocFIUijE3pt5HlsgCAA==
Vary: Accept, Accept-Encoding
Server: MochiWeb/1.1 WebMachine/1.9.0 (participate in the frantic)
Date: Fri, 30 Sep 2011 15:24:35 GMT
Content-Type: multipart/mixed; boundary=YinLMzyUR9feB17okMytgKsylvh
Content-Length: 766

--YinLMzyUR9feB17okMytgKsylvh
Content-Type: application/x-www-form-urlencoded
Link: ; rel="up"
Etag: 16vic4eU9ny46o4KPiDz1f
Last-Modified: Wed, 10 Mar 2010 18:01:06 GMT

{"bar":"baz"}
--YinLMzyUR9feB17okMytgKsylvh
Content-Type: application/json
Link: ; rel="up"
Etag: 4v5xOg4bVwUYZdMkqf0d6I
Last-Modified: Wed, 10 Mar 2010 18:00:04 GMT

{"bar":"baz"}
--YinLMzyUR9feB17okMytgKsylvh
Content-Type: application/json
Link: ; rel="up"

Slide 12

Slide 12 text

Siblings in Riak

HTTP/1.1 300 Multiple Choices
X-Riak-Vclock: a85hYGDgyGDKBVIszMk55zKYEhnzWBlKIniO8kGF2TyvHYIKf0cIszUnMTBzHYVKbIhEUl+VK4spDFTPxhHzFyqhEoVQz7wkSAGLMGuz6FSocFIUijE3pt5HlsgCAA==
Vary: Accept, Accept-Encoding
Server: MochiWeb/1.1 WebMachine/1.9.0 (participate in the frantic)
Date: Fri, 30 Sep 2011 15:24:35 GMT
Content-Type: multipart/mixed; boundary=YinLMzyUR9feB17okMytgKsylvh
Content-Length: 766

--YinLMzyUR9feB17okMytgKsylvh
Content-Type: application/x-www-form-urlencoded
Link: ; rel="up"
Etag: 16vic4eU9ny46o4KPiDz1f
Last-Modified: Wed, 10 Mar 2010 18:01:06 GMT

{"bar":"baz"}
--YinLMzyUR9feB17okMytgKsylvh
Content-Type: application/json
Link: ; rel="up"
Etag: 4v5xOg4bVwUYZdMkqf0d6I
Last-Modified: Wed, 10 Mar 2010 18:00:04 GMT

{"bar":"baz"}
--YinLMzyUR9feB17okMytgKsylvh
Content-Type: application/json
Link: ; rel="up"

(all the values)

Slide 13

Slide 13 text

Semantic Resolution
• Your app knows the domain - use business rules to resolve
• Amazon Dynamo’s shopping cart

Slide 14

Slide 14 text

Semantic Resolution
• Your app knows the domain - use business rules to resolve
• Amazon Dynamo’s shopping cart
BAD

Slide 15

Slide 15 text

Semantic Resolution
• Your app knows the domain - use business rules to resolve
• Amazon Dynamo’s shopping cart
BAD
“Ad hoc approaches have proven brittle and error-prone”
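As an illustration of why ad hoc semantic resolution is brittle, the sketch below resolves shopping-cart siblings with a "keep everything" union rule, in the spirit of the Dynamo cart. It is written in Python for illustration; the function name `resolve_cart` and the sample carts are hypothetical. Because a plain union cannot tell "never added" apart from "removed", a deleted item resurfaces after the merge.

```python
def resolve_cart(siblings):
    """Resolve conflicting cart versions with a business rule: keep everything."""
    merged = set()
    for cart in siblings:
        merged |= cart
    return merged


# Both clients start from the same cart.
original = {"book", "lamp"}

# Client 1 removes the lamp; client 2 concurrently adds a pen.
sibling_1 = original - {"lamp"}
sibling_2 = original | {"pen"}

# The union rule brings the removed lamp back into the cart.
resolved = resolve_cart([sibling_1, sibling_2])
```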

Slide 16

Slide 16 text

Goals
✔ Meaningful values
✔ Automatic resolution
✔ Transparent to user

Slide 17

Slide 17 text

WARNING This is a lot of math. Side effects may include dry mouth, itchy rash, and a desire to go back for a PhD.

Slide 18

Slide 18 text

Monotonic Functions
• Change in only one direction (never reverse)
• Consecutive values may be equal
• Monotonic: Linear, Exponential
• Non-monotonic: Quadratic, Sinusoidal
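The classification above can be spot-checked numerically. This is a minimal Python sketch; the helper name `is_monotonic` and the sample range are assumptions chosen for illustration, and a finite sample can only suggest monotonicity, not prove it.

```python
import math


def is_monotonic(values):
    """True if the sequence never decreases or never increases (ties allowed)."""
    pairs = list(zip(values, values[1:]))
    non_decreasing = all(a <= b for a, b in pairs)
    non_increasing = all(a >= b for a, b in pairs)
    return non_decreasing or non_increasing


# Sample points from -3.0 to 3.0 in steps of 0.5.
xs = [x / 2 for x in range(-6, 7)]

linear = [2 * x + 1 for x in xs]
exponential = [math.exp(x) for x in xs]
quadratic = [x * x for x in xs]
sinusoidal = [math.sin(x) for x in xs]
```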

Slide 20

Slide 20 text

Monotonic Logic
• Existing facts are never refuted
• New facts can be added
• “Knowledge only grows”

Slide 21

Slide 21 text

Monotonic Logic
• Existing facts are never refuted
• New facts can be added
• “Knowledge only grows”
“monotonicity of entailment”

Slide 22

Slide 22 text

http://db.cs.berkeley.edu/papers/UCB-lattice-tr.pdf

Slide 23

Slide 23 text

Bounded Join Semi-Lattice
⟨S, ⊔, ⊥⟩

Slide 24

Slide 24 text

Bounded Join Semi-Lattice
⟨S, ⊔, ⊥⟩
S is a set

Slide 25

Slide 25 text

Bounded Join Semi-Lattice
⟨S, ⊔, ⊥⟩
S is a set
⊥ ∈ S (minimal element)

Slide 26

Slide 26 text

Bounded Join Semi-Lattice
⟨S, ⊔, ⊥⟩
⊔ is a least-upper-bound function
∀x, y ∈ S, ∃z ∈ S: x ⊔ y = z

Slide 27

Slide 27 text

Bounded Join Semi-Lattice
⟨S, ⊔, ⊥⟩
⊔ is a least-upper-bound function
∀x, y ∈ S, ∃z ∈ S: x ⊔ y = z
∀x, y ∈ S: x ≤_S y ⇔ x ⊔ y = y (“partial order”)

Slide 28

Slide 28 text

Bounded Join Semi-Lattice
⟨S, ⊔, ⊥⟩
⊔ is a least-upper-bound function
∀x, y ∈ S, ∃z ∈ S: x ⊔ y = z
∀x, y ∈ S: x ≤_S y ⇔ x ⊔ y = y (“partial order”)
∀x ∈ S: x ⊔ ⊥ = x (“identity”)
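The laws above can be spot-checked on a finite sample. The Python sketch below is illustrative only: `leq` and `holds_on_sample` are hypothetical helper names, and the idempotence, commutativity and associativity checks are standard properties of a least upper bound rather than something stated on the slide. Passing on a sample does not prove the laws for the whole set.

```python
from itertools import product


def leq(join, x, y):
    """Partial order induced by the join: x <= y iff join(x, y) == y."""
    return join(x, y) == y


def holds_on_sample(samples, join, bottom):
    """Check the bounded join semi-lattice laws on a finite sample."""
    for x in samples:
        if join(x, bottom) != x:   # identity: x joined with bottom is x
            return False
        if join(x, x) != x:        # idempotence
            return False
    for x, y in product(samples, repeat=2):
        z = join(x, y)
        if z != join(y, x):        # commutativity
            return False
        if not (leq(join, x, z) and leq(join, y, z)):  # z is an upper bound
            return False
    for x, y, z in product(samples, repeat=3):
        if join(join(x, y), z) != join(x, join(y, z)):  # associativity
            return False
    return True


# Non-negative integers under max, with 0 as bottom.
max_ok = holds_on_sample(range(5), max, 0)

# Finite sets under union, with the empty set as bottom.
sets = [frozenset(), frozenset("a"), frozenset("b"), frozenset("ab")]
union_ok = holds_on_sample(sets, frozenset.union, frozenset())

# Subtraction is not a join: it is not idempotent.
minus_ok = holds_on_sample(range(5), lambda x, y: x - y, 0)
```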

Slide 29

Slide 29 text

“Set” Lattice
S = all finite sets
⊔ = set-union
⊥ = {}

Slide 30

Slide 30 text

“Set” Lattice
S = all finite sets
⊔ = set-union
⊥ = {}
[Diagram: sets growing over time by union, from {a} {b} {c} {d} {e} through {a,b} {b,c} {c,d} {d,e}, {a,b,c} {b,c,d} {c,d,e}, {a,b,c,d} {b,c,d,e} up to {a,b,c,d,e}]
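A small Python sketch of why the set lattice converges: joining the same updates in any order, starting from ⊥, ends at the same set. The update values are made up for illustration.

```python
from functools import reduce
from itertools import permutations

BOTTOM = frozenset()  # the empty set plays the role of bottom


def join(x, y):
    """Least upper bound of two sets: their union."""
    return x | y


updates = [frozenset("a"), frozenset("bc"), frozenset("c"), frozenset("de")]

# Apply the updates in every possible delivery order.
results = {reduce(join, order, BOTTOM) for order in permutations(updates)}

# Delivering an update twice changes nothing (the join is idempotent).
once = reduce(join, updates, BOTTOM)
twice = reduce(join, updates + updates, BOTTOM)
```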

Slide 31

Slide 31 text

Vector Clock

Slide 32

Slide 32 text

Vector Clock
• Vector clock is a lattice...

Slide 33

Slide 33 text

Vector Clock
• Vector clock is a lattice...
S = all vectors of (Actor, Count) pairs
⊔ = All Actors, each with their max Count
⊥ = [] (empty vector)

Slide 34

Slide 34 text

Vector Clock
• Vector clock is a lattice...
• ...but the associated Riak value is non-monotonic,
S = all vectors of (Actor, Count) pairs
⊔ = All Actors, each with their max Count
⊥ = [] (empty vector)

Slide 35

Slide 35 text

Vector Clock
• Vector clock is a lattice...
• ...but the associated Riak value is non-monotonic,
• ...and the vclock is not meaningful to the client.
S = all vectors of (Actor, Count) pairs
⊔ = All Actors, each with their max Count
⊥ = [] (empty vector)
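The vector clock join can be sketched in a few lines of Python. This is not Riak's implementation: the clock is modelled as a dict from actor to count rather than a list of pairs, and `descends` is a hypothetical helper that expresses the lattice's partial order.

```python
def merge(a, b):
    """Join of two vector clocks: all actors, each with its max count."""
    return {actor: max(a.get(actor, 0), b.get(actor, 0))
            for actor in a.keys() | b.keys()}


def descends(a, b):
    """True if clock a has seen everything clock b has (b <= a)."""
    return merge(a, b) == a


BOTTOM = {}  # the empty vector

clock_a = {"x": 2, "y": 1}
clock_b = {"x": 1, "y": 2}

# Neither clock descends the other, so the two writes are concurrent.
concurrent = not descends(clock_a, clock_b) and not descends(clock_b, clock_a)

# The join dominates both inputs.
joined = merge(clock_a, clock_b)
```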

Slide 36

Slide 36 text

http://hal.inria.fr/docs/00/55/55/88/PDF/techreport.pdf

Slide 37

Slide 37 text

CRDT Flavors
• Convergent (state-based)
  • One replica updates, then forwards its entire state; downstream replicas merge
• Commutative (operation-based)
  • Only mutations (ops) are communicated
  • Needs a reliable broadcast channel
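The difference between the two flavors shows up when a message is delivered twice. In the Python sketch below (function names and data are hypothetical), a state-based counter absorbs a duplicate because its merge is idempotent, while a naive operation-based counter double-counts, which is one reason the op-based flavor leans on a reliable broadcast channel.

```python
def state_merge(a, b):
    """State-based: merge full states by taking the max count per actor."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def state_value(state):
    return sum(state.values())


def op_apply(total, op):
    """Operation-based: apply a single ("increment", amount) operation."""
    kind, amount = op
    if kind != "increment":
        raise ValueError("unsupported operation: %r" % (kind,))
    return total + amount


# State-based: replica B receives replica A's whole state, possibly twice.
state_a = {"a": 2}
state_b = {"b": 1}
merged_once = state_merge(state_b, state_a)
merged_twice = state_merge(merged_once, state_a)

# Operation-based: replica B (currently at 1) receives A's op, possibly twice.
op = ("increment", 2)
applied_once = op_apply(1, op)
applied_twice = op_apply(applied_once, op)
```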

Slide 38

Slide 38 text

CRDT Types
• Registers: LWW, MV
• Counters: Positive, P/N
• Sets: Grow only, Two-Phase, Observed-Remove
• Graphs: 2P-2P
• Lists: Growable-array
• Collaborative editing: Treedoc

Slide 39

Slide 39 text

Theory Into Practice

Slide 40

Slide 40 text

Riak DT
• Riak Core Application
• Runs alongside Riak KV
• Own Storage

Slide 41

Slide 41 text

Riak DT
• HTTP API
• -behaviour(riak_dt).
• State-based

Slide 42

Slide 42 text

CRDT Behaviour
• new/0: empty CRDT
• value/1: the resolved value
• update/3: mutate CRDT
• merge/2: converge two CRDTs
• equal/2: compare internal value
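To make the shape of the behaviour concrete, here is a Python rendering of the five callbacks as an abstract interface, with a grow-only set as a sample implementation. This is a sketch, not the Erlang behaviour itself: the argument order of `update` (operation, actor, state) and the `("add", element)` operation format are assumptions, since the slide only gives the arities.

```python
from abc import ABC, abstractmethod


class CRDT(ABC):
    """Python stand-in for the five behaviour callbacks."""

    @staticmethod
    @abstractmethod
    def new():
        """Return an empty CRDT."""

    @staticmethod
    @abstractmethod
    def value(state):
        """Return the resolved, client-facing value."""

    @staticmethod
    @abstractmethod
    def update(op, actor, state):
        """Return a new state with the operation applied on behalf of actor."""

    @staticmethod
    @abstractmethod
    def merge(a, b):
        """Converge two states into their least upper bound."""

    @staticmethod
    @abstractmethod
    def equal(a, b):
        """Compare the internal representation of two states."""


class GSet(CRDT):
    """Grow-only set: elements can be added but never removed."""

    @staticmethod
    def new():
        return frozenset()

    @staticmethod
    def value(state):
        return set(state)

    @staticmethod
    def update(op, actor, state):
        kind, element = op
        if kind != "add":
            raise ValueError("a G-Set only supports add")
        return state | {element}  # the actor is not needed for a G-Set

    @staticmethod
    def merge(a, b):
        return a | b

    @staticmethod
    def equal(a, b):
        return a == b


left = GSet.update(("add", "x"), "actor1", GSet.new())
right = GSet.update(("add", "y"), "actor2", GSet.new())
merged = GSet.merge(left, right)
```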

Slide 43

Slide 43 text

CRDTs implemented
• Counters
  • G-Counter
  • PN-Counter
• Sets
  • G-Set
  • OR-Set
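Of the types listed, the OR-Set (observed-remove set) is the least obvious, so a minimal state-based sketch in Python may help. It is not Riak DT's implementation: the state layout (a set of tagged adds plus a set of tombstoned tags) and the caller-supplied unique tags are assumptions made for illustration. A remove only tombstones the tags it has observed, so a concurrent add survives the merge.

```python
def new():
    """Empty OR-Set: (tagged adds, tombstoned tags)."""
    return (frozenset(), frozenset())


def add(state, element, tag):
    """Add an element under a unique tag, e.g. (actor, counter)."""
    adds, removes = state
    return (adds | {(element, tag)}, removes)


def remove(state, element):
    """Tombstone every tag currently observed for the element."""
    adds, removes = state
    observed = {(e, t) for (e, t) in adds if e == element}
    return (adds, removes | observed)


def value(state):
    """Elements that still have at least one live (non-tombstoned) tag."""
    adds, removes = state
    return {e for (e, _tag) in adds - removes}


def merge(a, b):
    """Union both components; each is a grow-only set."""
    return (a[0] | b[0], a[1] | b[1])


base = add(new(), "x", ("a", 1))

# Replica A removes x while replica B concurrently adds it again.
replica_a = remove(base, "x")
replica_b = add(base, "x", ("b", 1))

merged = merge(replica_a, replica_b)
```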

Slide 44

Slide 44 text

G-Counter
• Simple version vector (28 LoC)
  [{ActorId,Count}]
• Update: increment actor’s count
• Merge: greatest value per Actor
• Value: sum of Counts

Slide 45

Slide 45 text

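G-Counter

%% An empty counter: no actors yet.
new() ->
    [].

%% The value is the sum of every actor's count.
value(GCnt) ->
    lists:sum([Cnt || {_Act, Cnt} <- GCnt]).

%% Equal if both hold the same {Actor, Count} pairs.
equal(VA, VB) ->
    lists:sort(VA) =:= lists:sort(VB).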
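The slide shows only new, value and equal. The following Python sketch adds the update and merge steps described for the G-Counter (increment the actor's count, take the greatest value per actor). It is an illustration, not the Riak DT source; a dict stands in for the `[{ActorId,Count}]` list, and the function signatures are assumptions.

```python
def new():
    """An empty G-Counter."""
    return {}


def update(actor, gcnt, by=1):
    """Increment the actor's own count; a G-Counter can only grow."""
    if by < 0:
        raise ValueError("a G-Counter cannot be decremented")
    updated = dict(gcnt)
    updated[actor] = updated.get(actor, 0) + by
    return updated


def merge(a, b):
    """Greatest count per actor."""
    return {actor: max(a.get(actor, 0), b.get(actor, 0))
            for actor in a.keys() | b.keys()}


def value(gcnt):
    """Sum of all counts."""
    return sum(gcnt.values())


def equal(a, b):
    return a == b


# Two replicas increment independently, then exchange state.
replica_a = update("a", update("a", new()))
replica_b = update("b", new())
merged = merge(replica_a, replica_b)
```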

Slide 46

Slide 46 text

PN-Counter
• 2 x G-Counter
• P - N = value

{ P = [{a,10},{b,2}],
  N = [{a,1},{c,5}] }

(10 + 2) - (1 + 5)
  = 12 - 6
  = 6
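The same arithmetic as a runnable Python sketch. A PN-Counter is modelled here as a pair of G-Counter dicts; the operation names "increment" and "decrement" and the function signatures are assumptions for illustration.

```python
def g_merge(a, b):
    """G-Counter merge: greatest count per actor."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def new():
    """An empty PN-Counter: (P, N)."""
    return ({}, {})


def update(op, actor, pn):
    """Increments grow P, decrements grow N; neither side ever shrinks."""
    p, n = pn
    if op == "increment":
        return ({**p, actor: p.get(actor, 0) + 1}, n)
    if op == "decrement":
        return (p, {**n, actor: n.get(actor, 0) + 1})
    raise ValueError("unsupported operation: %r" % (op,))


def merge(a, b):
    """Merge the P sides and the N sides independently."""
    return (g_merge(a[0], b[0]), g_merge(a[1], b[1]))


def value(pn):
    """P - N."""
    p, n = pn
    return sum(p.values()) - sum(n.values())


# The state from the slide: P = [{a,10},{b,2}], N = [{a,1},{c,5}].
slide_example = ({"a": 10, "b": 2}, {"a": 1, "c": 5})
after_decrement = update("decrement", "b", slide_example)
```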

Slide 47

Slide 47 text

Riak DT In Action
• Bitcask storage per vnode
• Value / Update FSM per request
• Webmachine resource(s), e.g. GET /counters/key

Slide 48

Slide 48 text

Update FSM
• Sync call update on vnode
• Read, Local Update, Reply
• Async send merge to replicas
• Await W responses
• Reply to client
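The update path can be walked through with a toy, single-process simulation in Python. Everything here is hypothetical: the `Vnode` class, the use of the vnode name as the counter's actor, and the choice to count only replica acknowledgements toward W are assumptions, and the "async" sends are plain function calls. It shows the order of steps, not how Riak DT implements them.

```python
def merge(a, b):
    """G-Counter merge: greatest count per actor."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


class Vnode:
    """Hypothetical stand-in for a vnode storing one G-Counter per key."""

    def __init__(self, name):
        self.name = name
        self.store = {}

    def update(self, key):
        state = dict(self.store.get(key, {}))            # read
        state[self.name] = state.get(self.name, 0) + 1   # local update
        self.store[key] = state
        return state                                      # reply

    def merge(self, key, incoming):
        self.store[key] = merge(self.store.get(key, {}), incoming)
        return True  # acknowledgement


def coordinate_update(key, primary, replicas, w):
    """Update on the primary, push a merge to replicas, require W acks."""
    state = primary.update(key)       # sync call on the primary vnode
    acks = 0
    for replica in replicas:          # sent asynchronously in a real system
        if replica.merge(key, state):
            acks += 1
    if acks < w:
        raise RuntimeError("fewer than W responses")
    return "ok"                       # reply to client


a, b, c = Vnode("a"), Vnode("b"), Vnode("c")
first = coordinate_update("hits", a, [b, c], w=2)
second = coordinate_update("hits", b, [a, c], w=2)
```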

Slide 49

Slide 49 text

Value FSM (Read)
• Async call value on all replicas
• Await R replies
• Merge all replies with merge/2
• Return merged value to client
• Read Repair

Slide 50

Slide 50 text

Read Repair
• Compare answers to merged result using equal/2
• Send merge to stale replicas
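Read and read repair together, as a small Python sketch over G-Counters. The replicas are plain dicts and `read_with_repair` is a hypothetical name; the sketch collects every reply synchronously instead of stopping after R, so it only illustrates the merge-compare-repair logic.

```python
from functools import reduce


def merge(a, b):
    """G-Counter merge: greatest count per actor."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def read_with_repair(key, replicas, r):
    """Merge the replies, return the value, and repair stale replicas."""
    replies = [(replica, replica.get(key, {})) for replica in replicas]
    if len(replies) < r:
        raise RuntimeError("fewer than R replies")
    merged = reduce(merge, (state for _replica, state in replies), {})
    repaired = 0
    for replica, state in replies:
        if state != merged:                     # the equal/2 comparison
            replica[key] = merge(state, merged)  # send merge to stale replica
            repaired += 1
    return sum(merged.values()), repaired


# Three replicas that have drifted apart; the third has never seen the key.
replicas = [{"hits": {"a": 2}}, {"hits": {"a": 1, "b": 1}}, {}]

first_read = read_with_repair("hits", replicas, r=2)
second_read = read_with_repair("hits", replicas, r=2)
```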

Slide 51

Slide 51 text

Multi-Datacenter
• Behaviour addition
  • rollup/2: collapsed local view
• Counters
  • Roll up all actors in cluster: [{ClusterId,Count}]
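For counters, the rollup idea can be sketched as follows in Python. The argument order `rollup(cluster_id, gcnt)` is an assumption (the slide gives only the arity), as are the cluster names. The local per-actor counts collapse into one entry keyed by the cluster, and that collapsed view still merges like a G-Counter because the cluster's total only grows.

```python
def merge(a, b):
    """G-Counter merge: greatest count per actor (here, per cluster)."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def rollup(cluster_id, gcnt):
    """Collapse all local actors into a single {ClusterId: Count} entry."""
    return {cluster_id: sum(gcnt.values())}


local_dc1 = {"a": 3, "b": 4}
view_dc1 = rollup("dc1", local_dc1)

# A remote cluster merges dc1's collapsed view with its own.
view_dc2 = rollup("dc2", {"x": 5})
global_view = merge(view_dc1, view_dc2)

# Later dc1 counts only go up, so a newer rollup supersedes the old one.
newer_dc1 = rollup("dc1", {"a": 5, "b": 4})
refreshed = merge(global_view, newer_dc1)
```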

Slide 52

Slide 52 text

Trade-Offs
• Update: Primary only
• Secondary/Fallbacks may Merge
• Read-before-Write in the request path
• PW=DW=1 by default

Slide 53

Slide 53 text

Demo!!!!11!

Slide 54

Slide 54 text

Garbage!
• Counters
  • Dead actors
• Sets
  • Tombstones

Slide 55

Slide 55 text

Elegance = Punt
• GC is non-monotonic!
• Needs consensus to collect
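A short Python sketch of why collecting garbage without consensus goes wrong. The collection strategy shown (folding a dead actor's count into a live actor on one replica only) is a hypothetical example, not something the talk proposes. The collected state is no longer above the old state in the lattice, so merging with a replica that has not collected double-counts.

```python
def merge(a, b):
    """G-Counter merge: greatest count per actor."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def value(gcnt):
    return sum(gcnt.values())


# Both replicas agree; the actor "dead" will never increment again.
replica_1 = {"a": 3, "dead": 4}
replica_2 = {"a": 3, "dead": 4}

# Replica 1 alone folds the dead actor's count into actor "a".
collected = {"a": 7}

# The collected state does not dominate the old one: the step was not monotonic.
dominates_old = merge(collected, replica_2) == collected

# Merging with the uncollected replica resurrects "dead" and double-counts.
after_merge = merge(collected, replica_2)
```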

Slide 56

Slide 56 text

And then?
• Stats/Metrics & Polish
• Multi-Datacenter Replication
• Active Anti-Entropy

Slide 57

Slide 57 text

And then?
• KV as storage
• GC / low garbage datatypes
• Op based / hybrid

Slide 58

Slide 58 text

Open Source Today Insert screenshot here

Slide 59

Slide 59 text

Questions? @russeldb @seancribbs