Data Structures in Riak

Data Structures in Riak

Since the beginning, Riak has supported high write-availability using Dynamo-style multi-valued keys – also known as conflicts or siblings. The tradeoff for this type of availability is that the application must include logic to resolve conflicting updates. While it is convenient to say that the application can reason best about conflicts, ad hoc resolution is error-prone and can result in surprising anomalies, like the reappearing item problem in Dynamo’s shopping cart.

What is needed is a more formal and general approach to the problem of conflict resolution for complex data structures. Luckily, there are some formal strategies in recent literature, including Conflict-Free Replicated Data Types (CRDTs) and BloomL lattices. We’ll review these strategies and cover some recent work we’ve done toward adding automatically-convergent data structures to Riak.

E0f4dbccf64a1d37a92e224b070ee84f?s=128

Basho Technologies

October 11, 2012
Tweet

Transcript

  1. Data Structures in Russell Brown Sean Cribbs

  2. Riak is Eventually-Consistent

  3. Eventual Consistency Replicated Loose coordination Convergence 1 2 3

  4. ✔Fault-tolerant ✔Highly available ✔Low-latency Eventual is Good

  5. No clear winner! Throw one out? Keep both? Consistency? 1

    2 3 B A
  6. No clear winner! Throw one out? Keep both? Consistency? 1

    2 3 B A Cassandra
  7. No clear winner! Throw one out? Keep both? Consistency? 1

    2 3 B A Cassandra Riak & Voldemort
  8. Conflicts! A! B!

  9. Siblings in Riak HTTP/1.1  300  Multiple  Choices X-­‐Riak-­‐Vclock:   a85hYGDgyGDKBVIszMk55zKYEhnzWBlKIniO8kGF2TyvHYIKf0cIszUnMTBzH

    YVKbIhEUl +VK4spDFTPxhHzFyqhEoVQz7wkSAGLMGuz6FSocFIUijE3pt5HlsgCAA== Vary:  Accept,  Accept-­‐Encoding Server:  MochiWeb/1.1  WebMachine/1.9.0  (participate  in  the   frantic) Date:  Fri,  30  Sep  2011  15:24:35  GMT Content-­‐Type:  text/plain Content-­‐Length:  102 Siblings: 16vic4eU9ny46o4KPiDz1f 4v5xOg4bVwUYZdMkqf0d6I 6nr5tDTmhxnwuAFJDd2s6G 6zRSZFUJlHXZ15o9CG0BYl
  10. Siblings in Riak HTTP/1.1  300  Multiple  Choices X-­‐Riak-­‐Vclock:   a85hYGDgyGDKBVIszMk55zKYEhnzWBlKIniO8kGF2TyvHYIKf0cIszUnMTBzH

    YVKbIhEUl +VK4spDFTPxhHzFyqhEoVQz7wkSAGLMGuz6FSocFIUijE3pt5HlsgCAA== Vary:  Accept,  Accept-­‐Encoding Server:  MochiWeb/1.1  WebMachine/1.9.0  (participate  in  the   frantic) Date:  Fri,  30  Sep  2011  15:24:35  GMT Content-­‐Type:  text/plain Content-­‐Length:  102 Siblings: 16vic4eU9ny46o4KPiDz1f 4v5xOg4bVwUYZdMkqf0d6I 6nr5tDTmhxnwuAFJDd2s6G 6zRSZFUJlHXZ15o9CG0BYl list of siblings
  11. Siblings in Riak HTTP/1.1  300  Multiple  Choices X-­‐Riak-­‐Vclock:   a85hYGDgyGDKBVIszMk55zKYEhnzWBlKIniO8kGF2TyvHYIKf0cIszUnMTBzHYVKbIhEUl

    +VK4spDFTPxhHzFyqhEoVQz7wkSAGLMGuz6FSocFIUijE3pt5HlsgCAA== Vary:  Accept,  Accept-­‐Encoding Server:  MochiWeb/1.1  WebMachine/1.9.0  (participate  in  the  frantic) Date:  Fri,  30  Sep  2011  15:24:35  GMT Content-­‐Type:  multipart/mixed;  boundary=YinLMzyUR9feB17okMytgKsylvh Content-­‐Length:  766 -­‐-­‐YinLMzyUR9feB17okMytgKsylvh Content-­‐Type:  application/x-­‐www-­‐form-­‐urlencoded Link:  </riak/test>;  rel="up" Etag:  16vic4eU9ny46o4KPiDz1f Last-­‐Modified:  Wed,  10  Mar  2010  18:01:06  GMT {"bar":"baz"} -­‐-­‐YinLMzyUR9feB17okMytgKsylvh Content-­‐Type:  application/json Link:  </riak/test>;  rel="up" Etag:  4v5xOg4bVwUYZdMkqf0d6I Last-­‐Modified:  Wed,  10  Mar  2010  18:00:04  GMT {"bar":"baz"} -­‐-­‐YinLMzyUR9feB17okMytgKsylvh Content-­‐Type:  application/json Link:  </riak/test>;  rel="up"
  12. Siblings in Riak HTTP/1.1  300  Multiple  Choices X-­‐Riak-­‐Vclock:   a85hYGDgyGDKBVIszMk55zKYEhnzWBlKIniO8kGF2TyvHYIKf0cIszUnMTBzHYVKbIhEUl

    +VK4spDFTPxhHzFyqhEoVQz7wkSAGLMGuz6FSocFIUijE3pt5HlsgCAA== Vary:  Accept,  Accept-­‐Encoding Server:  MochiWeb/1.1  WebMachine/1.9.0  (participate  in  the  frantic) Date:  Fri,  30  Sep  2011  15:24:35  GMT Content-­‐Type:  multipart/mixed;  boundary=YinLMzyUR9feB17okMytgKsylvh Content-­‐Length:  766 -­‐-­‐YinLMzyUR9feB17okMytgKsylvh Content-­‐Type:  application/x-­‐www-­‐form-­‐urlencoded Link:  </riak/test>;  rel="up" Etag:  16vic4eU9ny46o4KPiDz1f Last-­‐Modified:  Wed,  10  Mar  2010  18:01:06  GMT {"bar":"baz"} -­‐-­‐YinLMzyUR9feB17okMytgKsylvh Content-­‐Type:  application/json Link:  </riak/test>;  rel="up" Etag:  4v5xOg4bVwUYZdMkqf0d6I Last-­‐Modified:  Wed,  10  Mar  2010  18:00:04  GMT {"bar":"baz"} -­‐-­‐YinLMzyUR9feB17okMytgKsylvh Content-­‐Type:  application/json Link:  </riak/test>;  rel="up" all the values
  13. Semantic Resolution • Your app knows the domain - use

    business rules to resolve • Amazon Dynamo’s shopping cart
  14. Semantic Resolution • Your app knows the domain - use

    business rules to resolve • Amazon Dynamo’s shopping cart BAD
  15. Semantic Resolution • Your app knows the domain - use

    business rules to resolve • Amazon Dynamo’s shopping cart BAD “Ad hoc approaches have proven brittle and error-prone”
  16. Goals ✔Meaningful values ✔Automatic resolution ✔Transparent to user

  17. WARNING This is a lot of math. Side effects may

    include dry mouth, itchy rash, and a desire to go back for a PhD.
  18. Monotonic Functions • Change in strictly a single direction •

    Consecutive values may be equal • Monotonic: Linear, Exponential • Non-monotonic: Quadratic, Sinusoidal
  19. Monotonic Functions • Change in strictly a single direction •

    Consecutive values may be equal • Monotonic: Linear, Exponential • Non-monotonic: Quadratic, Sinusoidal
  20. Monotonic Logic •Existing facts are never refuted •New facts can

    be added •“Knowledge only grows”
  21. Monotonic Logic •Existing facts are never refuted •New facts can

    be added •“Knowledge only grows” “monotonicity of entailment”
  22. http://db.cs.berkeley.edu/papers/UCB-lattice-tr.pdf

  23. Bounded Join Semi-Lattice ʪS, ⊔, ⊥ʫ

  24. Bounded Join Semi-Lattice ʪS, ⊔, ⊥ʫ S is a set

  25. Bounded Join Semi-Lattice ⊥ ∈ S (minimal element) ʪS, ⊔,

    ⊥ʫ S is a set
  26. Bounded Join Semi-Lattice ʪS, ⊔, ⊥ʫ ⊔ is a least-upper

    bound function ∀x, y ∈ S, ∃z ∈ S: x ⊔ y = z
  27. Bounded Join Semi-Lattice ∀x, y ∈ S: x ≤S y

    㱻 x ⊔ y = y “partial order” ʪS, ⊔, ⊥ʫ ⊔ is a least-upper bound function ∀x, y ∈ S, ∃z ∈ S: x ⊔ y = z
  28. Bounded Join Semi-Lattice ∀x, y ∈ S: x ≤S y

    㱻 x ⊔ y = y “partial order” ∀x ∈ S: x ⊔ ⊥ = x “identity” ʪS, ⊔, ⊥ʫ ⊔ is a least-upper bound function ∀x, y ∈ S, ∃z ∈ S: x ⊔ y = z
  29. “Set” Lattice S = all finite sets ⊔ = set-union

    ⊥ = {}
  30. “Set” Lattice {a} {b} {c} {d} {e} {a,b} {b,c} {c,d}

    {d,e} {a,b,c} {c,d,e} {b,c,d,e} {a,b,c,d} {b,c,d} {a,b,c,d,e} Time S = all finite sets ⊔ = set-union ⊥ = {}
  31. Vector Clock

  32. • Vector clock is a lattice... Vector Clock

  33. • Vector clock is a lattice... Vector Clock S =

    all vectors of (Actor, Count) pairs ⊔ = All Actors, each with their max Count ⊥ = [] (empty vector)
  34. • Vector clock is a lattice... • ...but the associated

    Riak value is non- monotonic, Vector Clock S = all vectors of (Actor, Count) pairs ⊔ = All Actors, each with their max Count ⊥ = [] (empty vector)
  35. • Vector clock is a lattice... • ...but the associated

    Riak value is non- monotonic, • ...and the vclock is not meaningful to the client. Vector Clock S = all vectors of (Actor, Count) pairs ⊔ = All Actors, each with their max Count ⊥ = [] (empty vector)
  36. http://hal.inria.fr/docs/00/55/55/88/PDF/techreport.pdf

  37. CRDT Flavors • Convergent (state-based) • One replica updates, then

    forwards entire state, downstream merges • Commutative (operation-based) • Only mutations (ops) communicated • Needs a reliable broadcast channel
  38. CRDT Types Registers LWW, MV Counters Positive, P/N Sets Grow

    only, Two-Phase, Observed-Remove Graphs 2P-2P Lists Growable-array Collaborative editing Treedoc
  39. Theory Into Practice

  40. Riak DT •Riak Core Application •Runs alongside Riak KV •Own

    Storage
  41. •HTTP API •-­‐behaviour(riak_dt). •State-based Riak DT

  42. •new/0 empty CRDT •value/1 the resolved value •update/3 mutate CRDT

    •merge/2 converge two CRDTs •equal/2 compare internal value CRDT Behaviour
  43. •Counters •G-Counter •PN-Counter •Sets •G-Set •OR-Set CRDTs implemented

  44. G-Counter •Simple version vector (28 LoC) [{ActorId,Count}] •Update: increment actor’s

    count •Merge: greatest value per Actor •Value: sum of Counts
  45. G-Counter new()  -­‐>        []. value(GCnt)  -­‐>  

         sum([Cnt  ||  {_Act,  Cnt}  <-­‐  GCnt]). equal(VA,VB)  -­‐>        lists:sort(VA)  =:=  lists:sort(VB).
  46. PN-Counter •2 x G-Counter •P - N = value {

       P  =  [{a,10},{b,2}],    N  =  [{a,1},{c,5}] } (10  +  2)  -­‐  (1  +  5)      =  12  -­‐  6      =  6
  47. Riak DT In Action •Bitcask storage per vnode •Value /

    Update FSM per request •Webmachine resource(s) e.g. GET  /counters/key
  48. Update FSM •Sync call update on vnode •Read, Local Update,

    Reply •Async send merge to replicas •Await W responses •Reply to client
  49. Value FSM (Read) •Async call value on all replicas •Await

    R replies •Merge all replies with merge/2 •Return merged value to client •Read Repair
  50. Read Repair •Compare answers to merged result using equal/2 •Send

    merge to stale replicas
  51. Multi-Datacenter •Behaviour addition •rollup/2 collapsed local view •Counters •Roll up

    all actors in cluster: [{ClusterId,Count}]
  52. Trade-Offs •Update: Primary only •Secondary/Fallbacks may Merge •Read-before-Write in the

    request path •PW=DW=1 by default
  53. Demo!!!!11!

  54. Garbage! •Counters •Dead actors •Sets •Tombstones

  55. Elegance = Punt •GC is non- monotonic! •Needs consensus to

    collect
  56. And then? •Stats/Metrics & Polish •Multi-Datacenter Replication •Active Anti-Entropy

  57. And then? •KV as storage •GC / low garbage datatypes

    •Op based / hybrid
  58. Open Source Today Insert screenshot here

  59. Questions? @russeldb @seancribbs