Data Structures in Riak

Since the beginning, Riak has supported high write-availability using Dynamo-style multi-valued keys – also known as conflicts or siblings. The tradeoff for this type of availability is that the application must include logic to resolve conflicting updates. While it is convenient to say that the application can reason best about conflicts, ad hoc resolution is error-prone and can result in surprising anomalies, like the reappearing item problem in Dynamo’s shopping cart.

What is needed is a more formal and general approach to the problem of conflict resolution for complex data structures. Luckily, there are some formal strategies in recent literature, including Conflict-Free Replicated Data Types (CRDTs) and BloomL lattices. We’ll review these strategies and cover some recent work we’ve done toward adding automatically-convergent data structures to Riak.

Basho Technologies

October 11, 2012

Transcript

  1. Data Structures in Riak
    Russell Brown
    Sean Cribbs

  2. Riak is
    Eventually-Consistent

  3. Eventual Consistency
    Replicated
    Loose coordination
    Convergence
    [Diagram: three nodes labelled 1, 2, 3]

  4. Eventual is Good
    ✔Fault-tolerant
    ✔Highly available
    ✔Low-latency

  5. No clear winner!
    Throw one out?
    Keep both?
    Consistency?
    [Diagram: nodes 1, 2, 3 holding conflicting values A and B]

  6. No clear winner!
    Throw one out? (Cassandra)
    Keep both?
    Consistency?
    [Diagram: nodes 1, 2, 3 holding conflicting values A and B]

  7. No clear winner!
    Throw one out? (Cassandra)
    Keep both? (Riak & Voldemort)
    Consistency?
    [Diagram: nodes 1, 2, 3 holding conflicting values A and B]

  8. Conflicts!
    A!
    B!

  9. Siblings in Riak
    HTTP/1.1 300 Multiple Choices
    X-Riak-Vclock: a85hYGDgyGDKBVIszMk55zKYEhnzWBlKIniO8kGF2TyvHYIKf0cIszUnMTBzHYVKbIhEUl+VK4spDFTPxhHzFyqhEoVQz7wkSAGLMGuz6FSocFIUijE3pt5HlsgCAA==
    Vary: Accept, Accept-Encoding
    Server: MochiWeb/1.1 WebMachine/1.9.0 (participate in the frantic)
    Date: Fri, 30 Sep 2011 15:24:35 GMT
    Content-Type: text/plain
    Content-Length: 102
    Siblings:
    16vic4eU9ny46o4KPiDz1f
    4v5xOg4bVwUYZdMkqf0d6I
    6nr5tDTmhxnwuAFJDd2s6G
    6zRSZFUJlHXZ15o9CG0BYl

  10. Siblings in Riak
    HTTP/1.1 300 Multiple Choices
    X-Riak-Vclock: a85hYGDgyGDKBVIszMk55zKYEhnzWBlKIniO8kGF2TyvHYIKf0cIszUnMTBzHYVKbIhEUl+VK4spDFTPxhHzFyqhEoVQz7wkSAGLMGuz6FSocFIUijE3pt5HlsgCAA==
    Vary: Accept, Accept-Encoding
    Server: MochiWeb/1.1 WebMachine/1.9.0 (participate in the frantic)
    Date: Fri, 30 Sep 2011 15:24:35 GMT
    Content-Type: text/plain
    Content-Length: 102
    Siblings:
    16vic4eU9ny46o4KPiDz1f
    4v5xOg4bVwUYZdMkqf0d6I
    6nr5tDTmhxnwuAFJDd2s6G
    6zRSZFUJlHXZ15o9CG0BYl
    [Callout: list of siblings]

  11. Siblings in Riak
    HTTP/1.1 300 Multiple Choices
    X-Riak-Vclock: a85hYGDgyGDKBVIszMk55zKYEhnzWBlKIniO8kGF2TyvHYIKf0cIszUnMTBzHYVKbIhEUl+VK4spDFTPxhHzFyqhEoVQz7wkSAGLMGuz6FSocFIUijE3pt5HlsgCAA==
    Vary: Accept, Accept-Encoding
    Server: MochiWeb/1.1 WebMachine/1.9.0 (participate in the frantic)
    Date: Fri, 30 Sep 2011 15:24:35 GMT
    Content-Type: multipart/mixed; boundary=YinLMzyUR9feB17okMytgKsylvh
    Content-Length: 766
    --YinLMzyUR9feB17okMytgKsylvh
    Content-Type: application/x-www-form-urlencoded
    Link: ; rel="up"
    Etag: 16vic4eU9ny46o4KPiDz1f
    Last-Modified: Wed, 10 Mar 2010 18:01:06 GMT
    {"bar":"baz"}
    --YinLMzyUR9feB17okMytgKsylvh
    Content-Type: application/json
    Link: ; rel="up"
    Etag: 4v5xOg4bVwUYZdMkqf0d6I
    Last-Modified: Wed, 10 Mar 2010 18:00:04 GMT
    {"bar":"baz"}
    --YinLMzyUR9feB17okMytgKsylvh
    Content-Type: application/json
    Link: ; rel="up"

  12. Siblings in Riak
    HTTP/1.1 300 Multiple Choices
    X-Riak-Vclock: a85hYGDgyGDKBVIszMk55zKYEhnzWBlKIniO8kGF2TyvHYIKf0cIszUnMTBzHYVKbIhEUl+VK4spDFTPxhHzFyqhEoVQz7wkSAGLMGuz6FSocFIUijE3pt5HlsgCAA==
    Vary: Accept, Accept-Encoding
    Server: MochiWeb/1.1 WebMachine/1.9.0 (participate in the frantic)
    Date: Fri, 30 Sep 2011 15:24:35 GMT
    Content-Type: multipart/mixed; boundary=YinLMzyUR9feB17okMytgKsylvh
    Content-Length: 766
    --YinLMzyUR9feB17okMytgKsylvh
    Content-Type: application/x-www-form-urlencoded
    Link: ; rel="up"
    Etag: 16vic4eU9ny46o4KPiDz1f
    Last-Modified: Wed, 10 Mar 2010 18:01:06 GMT
    {"bar":"baz"}
    --YinLMzyUR9feB17okMytgKsylvh
    Content-Type: application/json
    Link: ; rel="up"
    Etag: 4v5xOg4bVwUYZdMkqf0d6I
    Last-Modified: Wed, 10 Mar 2010 18:00:04 GMT
    {"bar":"baz"}
    --YinLMzyUR9feB17okMytgKsylvh
    Content-Type: application/json
    Link: ; rel="up"
    [Callout: all the values]

  13. Semantic Resolution
    • Your app knows the domain - use business rules to resolve
    • Amazon Dynamo’s shopping cart

  14. Semantic Resolution
    • Your app knows the domain - use business rules to resolve
    • Amazon Dynamo’s shopping cart
    BAD

  15. Semantic Resolution
    • Your app knows the domain - use business rules to resolve
    • Amazon Dynamo’s shopping cart
    BAD
    “Ad hoc approaches
    have proven brittle
    and error-prone”
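
The shopping-cart anomaly behind the "BAD" label can be reproduced in a few lines. The sketch below is in Python purely for illustration (Riak itself is Erlang), and `resolve_by_union` is a hypothetical ad hoc resolver, not code taken from Dynamo or Riak. It merges sibling carts by set union, which looks reasonable until one of the siblings represents a removal.

```python
def resolve_by_union(siblings):
    """Hypothetical ad hoc resolver: union the items of every sibling cart."""
    merged = set()
    for cart in siblings:
        merged |= cart
    return merged


# Both clients start from the same cart.
base = {"book", "socks"}

# Client A removes "socks" while client B concurrently adds "hat".
sibling_a = base - {"socks"}
sibling_b = base | {"hat"}

# Union cannot tell "never saw the removal" from "item was re-added",
# so the removed item comes back.
resolved = resolve_by_union([sibling_a, sibling_b])
print(sorted(resolved))  # ['book', 'hat', 'socks']
```

The removal by client A is lost because a plain set carries no record that "socks" was deliberately deleted.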

  16. Goals
    ✔Meaningful values
    ✔Automatic resolution
    ✔Transparent to user

  17. WARNING
    This is a lot of math.
    Side effects may include dry mouth, itchy
    rash, and a desire to go back for a PhD.

  18. Monotonic Functions
    • Change in strictly a
    single direction
    • Consecutive values
    may be equal
    • Monotonic: Linear,
    Exponential
    • Non-monotonic:
    Quadratic, Sinusoidal

  20. Monotonic Logic
    •Existing facts are never refuted
    •New facts can be added
    •“Knowledge only grows”

  21. Monotonic Logic
    •Existing facts are never refuted
    •New facts can be added
    •“Knowledge only grows”
    “monotonicity of entailment”

  22. http://db.cs.berkeley.edu/papers/UCB-lattice-tr.pdf

  23. Bounded Join Semi-Lattice
    ⟨S, ⊔, ⊥⟩

  24. Bounded Join Semi-Lattice
    ⟨S, ⊔, ⊥⟩
    S is a set

  25. Bounded Join Semi-Lattice
    ⟨S, ⊔, ⊥⟩
    S is a set
    ⊥ ∈ S (minimal element)

  26. Bounded Join Semi-Lattice
    ⟨S, ⊔, ⊥⟩
    ⊔ is a least-upper bound function
    ∀x, y ∈ S, ∃z ∈ S: x ⊔ y = z

  27. Bounded Join Semi-Lattice
    ⟨S, ⊔, ⊥⟩
    ⊔ is a least-upper bound function
    ∀x, y ∈ S, ∃z ∈ S: x ⊔ y = z
    ∀x, y ∈ S: x ≤_S y ⇔ x ⊔ y = y “partial order”

  28. Bounded Join Semi-Lattice
    ⟨S, ⊔, ⊥⟩
    ⊔ is a least-upper bound function
    ∀x, y ∈ S, ∃z ∈ S: x ⊔ y = z
    ∀x, y ∈ S: x ≤_S y ⇔ x ⊔ y = y “partial order”
    ∀x ∈ S: x ⊔ ⊥ = x “identity”

  29. “Set” Lattice
    S = all finite sets
    ⊔ = set-union
    ⊥ = {}

  30. “Set” Lattice
    [Diagram: sets ordered by inclusion, growing over time]
    {a} {b} {c} {d} {e}
    {a,b} {b,c} {c,d} {d,e}
    {a,b,c} {b,c,d} {c,d,e}
    {a,b,c,d} {b,c,d,e}
    {a,b,c,d,e}
    S = all finite sets
    ⊔ = set-union
    ⊥ = {}
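
The lattice laws from the preceding slides can be checked mechanically for the set lattice. This is a minimal sketch in Python (chosen for illustration only); the names `join`, `leq`, and `BOTTOM` are assumptions introduced here, and the check runs over a small sample of S rather than all finite sets.

```python
from itertools import combinations

BOTTOM = frozenset()  # the minimal element, {}


def join(x, y):
    """Least upper bound for the set lattice: plain set union."""
    return x | y


def leq(x, y):
    """Partial order induced by join: x <= y exactly when join(x, y) == y."""
    return join(x, y) == y


# A small sample of S: every subset of {a, b, c}.
universe = ["a", "b", "c"]
samples = [frozenset(c)
           for r in range(len(universe) + 1)
           for c in combinations(universe, r)]

for x in samples:
    assert join(x, BOTTOM) == x            # identity
    assert join(x, x) == x                 # idempotence
    for y in samples:
        assert join(x, y) == join(y, x)    # commutativity
        assert leq(x, join(x, y))          # the join is an upper bound
        for z in samples:
            assert join(join(x, y), z) == join(x, join(y, z))  # associativity

print("all lattice laws hold on", len(samples), "sample sets")
```

Idempotence, commutativity, and associativity are what allow replicas to merge states in any order, any number of times, and still converge.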

  31. Vector Clock

  32. Vector Clock
    • Vector clock is a lattice...

  33. Vector Clock
    • Vector clock is a lattice...
    S = all vectors of (Actor, Count) pairs
    ⊔ = All Actors, each with their max Count
    ⊥ = [] (empty vector)

  34. Vector Clock
    • Vector clock is a lattice...
    • ...but the associated Riak value is non-monotonic,
    S = all vectors of (Actor, Count) pairs
    ⊔ = All Actors, each with their max Count
    ⊥ = [] (empty vector)

  35. Vector Clock
    • Vector clock is a lattice...
    • ...but the associated Riak value is non-monotonic,
    • ...and the vclock is not meaningful to the client.
    S = all vectors of (Actor, Count) pairs
    ⊔ = All Actors, each with their max Count
    ⊥ = [] (empty vector)
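
The vector clock join described above ("all Actors, each with their max Count") is short enough to sketch. The Python below is an illustration, not Riak's vclock module: it uses a dict from actor to count instead of a list of (Actor, Count) pairs, and `vclock_join` and `descends` are hypothetical names.

```python
def vclock_join(a, b):
    """Join of two vector clocks: every actor, each with its max count."""
    return {actor: max(a.get(actor, 0), b.get(actor, 0))
            for actor in a.keys() | b.keys()}


def descends(a, b):
    """True when clock a has seen everything in b (b <= a in the lattice)."""
    return vclock_join(a, b) == a


left = {"x": 2, "y": 1}
right = {"x": 1, "z": 3}
merged = vclock_join(left, right)

print(merged == {"x": 2, "y": 1, "z": 3})  # True
# Neither clock descends the other, so the two writes are concurrent
# and Riak would keep both values as siblings.
print(descends(left, right), descends(right, left))  # False False
print(descends(merged, left), descends(merged, right))  # True True
```

The clock itself converges, but it only orders the writes; it says nothing about how to combine the two values, which is the gap CRDTs fill.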

  36. http://hal.inria.fr/docs/00/55/55/88/PDF/techreport.pdf

  37. CRDT Flavors
    • Convergent (state-based)
      • One replica updates, then forwards entire state, downstream merges
    • Commutative (operation-based)
      • Only mutations (ops) communicated
      • Needs a reliable broadcast channel
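
The difference between the two flavors shows up when a message is delivered twice. The following Python sketch uses two deliberately tiny, hypothetical examples (a grow-only set for the state-based case, an add-only counter for the operation-based case); the function names are assumptions made for this illustration.

```python
def merge_state(local, remote):
    """State-based: merge full states with set union (a grow-only set)."""
    return local | remote


def apply_op(total, op):
    """Operation-based: apply a single ("add", n) mutation to a counter."""
    kind, amount = op
    if kind != "add":
        raise ValueError(f"unknown op: {kind}")
    return total + amount


# State-based: a duplicated delivery is harmless because merge is idempotent.
replica = {"a"}
incoming = {"a", "b"}
once = merge_state(replica, incoming)
twice = merge_state(once, incoming)
print(once == twice)  # True

# Operation-based: ops commute, so order does not matter...
ops = [("add", 2), ("add", 5)]
in_order = 0
for op in ops:
    in_order = apply_op(in_order, op)
reordered = 0
for op in reversed(ops):
    reordered = apply_op(reordered, op)

# ...but a duplicated delivery is counted twice, which is why this flavor
# needs a reliable broadcast channel.
duplicated = apply_op(in_order, ops[0])
print(in_order, reordered, duplicated)  # 7 7 9
```

State-based designs pay for that robustness by shipping the entire state on every update.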

  38. CRDT Types
    Registers: LWW, MV
    Counters: Positive, P/N
    Sets: Grow only, Two-Phase, Observed-Remove
    Graphs: 2P-2P
    Lists: Growable-array
    Collaborative editing: Treedoc

  39. Theory Into Practice

  40. Riak DT
    •Riak Core Application
    •Runs alongside Riak KV
    •Own Storage

  41. Riak DT
    •HTTP API
    •-behaviour(riak_dt).
    •State-based

  42. CRDT Behaviour
    •new/0: empty CRDT
    •value/1: the resolved value
    •update/3: mutate CRDT
    •merge/2: converge two CRDTs
    •equal/2: compare internal value
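
To make the five callbacks concrete, here is a hypothetical Python mirror of the behaviour, with a grow-only set as the simplest possible implementation. It is a sketch only: the real behaviour is an Erlang `-behaviour(riak_dt)` module, and the class names, the `("add", element)` operation shape, and the argument order are assumptions introduced here.

```python
from abc import ABC, abstractmethod


class StateCRDT(ABC):
    """Hypothetical Python counterpart of the five callbacks listed above."""

    @classmethod
    @abstractmethod
    def new(cls):
        """Return an empty CRDT."""

    @abstractmethod
    def value(self):
        """Return the resolved, client-facing value."""

    @abstractmethod
    def update(self, op, actor):
        """Return a new CRDT with op applied on behalf of actor."""

    @abstractmethod
    def merge(self, other):
        """Return the join of two CRDTs."""

    @abstractmethod
    def equal(self, other):
        """Compare internal state, not just the resolved value."""


class GSet(StateCRDT):
    """Grow-only set: update adds an element, merge is set union."""

    def __init__(self, items=frozenset()):
        self.items = frozenset(items)

    @classmethod
    def new(cls):
        return cls()

    def value(self):
        return set(self.items)

    def update(self, op, actor):
        kind, element = op
        if kind != "add":
            raise ValueError("a grow-only set only supports add")
        return GSet(self.items | {element})

    def merge(self, other):
        return GSet(self.items | other.items)

    def equal(self, other):
        return self.items == other.items


a = GSet.new().update(("add", "x"), actor="node1")
b = GSet.new().update(("add", "y"), actor="node2")
print(sorted(a.merge(b).value()))  # ['x', 'y']
```

A grow-only set ignores the actor argument; counters, by contrast, need it to know whose entry to increment.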

  43. CRDTs implemented
    •Counters
      •G-Counter
      •PN-Counter
    •Sets
      •G-Set
      •OR-Set

  44. G-Counter
    •Simple version vector (28 LoC)
    [{ActorId,Count}]
    •Update: increment actor’s count
    •Merge: greatest value per Actor
    •Value: sum of Counts

  45. G-Counter
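    %% A G-Counter is a list of {Actor, Count} pairs.
    new() ->
        [].
    %% The value is the sum of every actor's count.
    value(GCnt) ->
        lists:sum([Cnt || {_Act, Cnt} <- GCnt]).
    %% Two counters are equal when they hold the same pairs, in any order.
    equal(VA, VB) ->
        lists:sort(VA) =:= lists:sort(VB).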
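
The slide shows only `new`, `value`, and `equal`. The update and merge rules from the previous slide (increment the actor's count, take the greatest value per actor) can be sketched as follows. This is Python for illustration, using a dict in place of the `[{ActorId,Count}]` list, and it is not the riak_dt source.

```python
def new():
    """An empty G-Counter: no actor has counted anything yet."""
    return {}


def update(gcnt, actor, amount=1):
    """Increment only the given actor's entry; a G-Counter never decreases."""
    if amount < 0:
        raise ValueError("a G-Counter can only grow")
    out = dict(gcnt)
    out[actor] = out.get(actor, 0) + amount
    return out


def merge(a, b):
    """Greatest count per actor across both replicas."""
    return {actor: max(a.get(actor, 0), b.get(actor, 0))
            for actor in a.keys() | b.keys()}


def value(gcnt):
    """Sum of all actors' counts."""
    return sum(gcnt.values())


# Two replicas count independently, then exchange state.
replica_a = update(update(new(), "a"), "a")   # {"a": 2}
replica_b = update(new(), "b", 3)             # {"b": 3}
merged = merge(replica_a, replica_b)

print(value(merged))                          # 5
print(merge(merged, replica_a) == merged)     # True: re-merging is a no-op
```

Because each actor only ever writes its own entry, taking the per-actor maximum never loses an increment.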

  46. PN-Counter
    •2 x G-Counter
    •P - N = value
    {
      P = [{a,10},{b,2}],
      N = [{a,1},{c,5}]
    }
    (10 + 2) - (1 + 5)
      = 12 - 6
      = 6
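
The worked example above can be reproduced with a small sketch that treats a PN-Counter as two G-Counters. This is illustrative Python, with hypothetical function names and a dict-of-dicts layout that is an assumption, not the riak_dt representation.

```python
def _max_merge(a, b):
    """G-Counter merge: greatest count per actor."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def pn_value(pn):
    """P - N: total increments minus total decrements."""
    return sum(pn["P"].values()) - sum(pn["N"].values())


def pn_update(pn, actor, delta):
    """Record a positive delta in P and a negative delta in N."""
    side = "P" if delta >= 0 else "N"
    out = {"P": dict(pn["P"]), "N": dict(pn["N"])}
    out[side][actor] = out[side].get(actor, 0) + abs(delta)
    return out


def pn_merge(a, b):
    """Merge each G-Counter half independently."""
    return {side: _max_merge(a[side], b[side]) for side in ("P", "N")}


counter = {"P": {"a": 10, "b": 2}, "N": {"a": 1, "c": 5}}
print(pn_value(counter))            # 6, matching the slide

decremented = pn_update(counter, "b", -4)
print(pn_value(decremented))        # 2
print(pn_value(pn_merge(counter, decremented)))  # 2
```

Decrements never shrink anything internally: both halves only grow, which keeps the merge monotonic.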

  47. Riak DT In Action
    •Bitcask storage per vnode
    •Value / Update FSM per request
    •Webmachine resource(s)
    e.g. GET /counters/key

  48. Update FSM
    •Sync call update on vnode
    •Read, Local Update, Reply
    •Async send merge to replicas
    •Await W responses
    •Reply to client
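
The update path above can be simulated in a single process. The sketch below is a simplification written in Python: the real FSM sends merges asynchronously and replies as soon as W acknowledgements arrive, whereas this version loops over replicas in order. `VNode`, `coordinate_update`, and the choice to count the coordinator's own write toward W are all assumptions made for the illustration.

```python
def gc_merge(a, b):
    """G-Counter merge: greatest count per actor."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


class VNode:
    """Hypothetical stand-in for a vnode holding one G-Counter."""

    def __init__(self, name):
        self.name = name
        self.state = {}
        self.up = True

    def update(self, amount):
        # Read, local update, reply: the vnode increments its own entry.
        self.state = {**self.state,
                      self.name: self.state.get(self.name, 0) + amount}
        return dict(self.state)

    def merge(self, remote):
        if not self.up:
            return False  # an unreachable replica never acknowledges
        self.state = gc_merge(self.state, remote)
        return True


def coordinate_update(coordinator, replicas, amount, w):
    """Update locally, push the new state to replicas, require w acks."""
    new_state = coordinator.update(amount)
    acks = 1  # assumption: the coordinator's own write counts toward W
    for replica in replicas:
        if replica.merge(new_state):
            acks += 1
    return "ok" if acks >= w else "timeout"


n1, n2, n3 = VNode("n1"), VNode("n2"), VNode("n3")
n3.up = False  # simulate one failed replica

print(coordinate_update(n1, [n2, n3], amount=1, w=2))  # ok
print(coordinate_update(n1, [n2, n3], amount=1, w=3))  # timeout
print(n2.state, n3.state)                              # {'n1': 2} {}
```

Even the request that timed out left its increment on n1 and n2; a later read with read repair would carry it to n3 once it recovers.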

  49. Value FSM (Read)
    •Async call value on all replicas
    •Await R replies
    •Merge all replies with merge/2
    •Return merged value to client
    •Read Repair

  50. Read Repair
    •Compare answers to merged result using equal/2
    •Send merge to stale replicas
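
The read path and read repair fit together as one fold over the replies. The Python sketch below assumes G-Counter states represented as dicts and a hypothetical `read_with_repair` helper; it returns the names of stale replicas instead of actually sending them a merge.

```python
def gc_merge(a, b):
    """G-Counter merge: greatest count per actor."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def gc_equal(a, b):
    """Compare internal state (dict equality ignores ordering)."""
    return a == b


def read_with_repair(replies):
    """Merge the R replies, then list replicas whose state lags the merge.

    replies maps a replica name to the state it returned.
    """
    merged = {}
    for state in replies.values():
        merged = gc_merge(merged, state)
    stale = sorted(name for name, state in replies.items()
                   if not gc_equal(state, merged))
    return sum(merged.values()), merged, stale


replies = {
    "n1": {"a": 2},
    "n2": {"a": 2, "b": 1},
    "n3": {"a": 1, "b": 1},
}
value, merged, stale = read_with_repair(replies)
print(value)   # 3
print(stale)   # ['n1', 'n3']
```

Repair is safe to repeat: sending the merged state to a replica that is already up to date changes nothing.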

  51. Multi-Datacenter
    •Behaviour addition
    •rollup/2 collapsed local view
    •Counters
    •Roll up all actors in cluster:
    [{ClusterId,Count}]
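
One plausible reading of the counter roll-up is sketched below in Python. The slide names `rollup/2` but does not give its arguments or exact semantics, so the signature `rollup(cluster_id, gcnt)` and the cluster names are assumptions.

```python
def rollup(cluster_id, gcnt):
    """Collapse every local actor into a single entry for the whole cluster."""
    return {cluster_id: sum(gcnt.values())}


def gc_merge(a, b):
    """G-Counter merge: greatest count per actor."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


# Local view: one entry per actor inside this cluster.
local = {"vnode1": 3, "vnode2": 4}
collapsed = rollup("dc-east", local)
print(collapsed)  # {'dc-east': 7}

# A remote cluster ships its own collapsed view; the usual merge applies.
remote = {"dc-west": 5}
global_view = gc_merge(collapsed, remote)
print(sum(global_view.values()))  # 12
```

The collapsed view treats each cluster as a single actor, so remote clusters never need to learn about individual vnodes.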

  52. Trade-Offs
    •Update: Primary only
    •Secondary/Fallbacks may Merge
    •Read-before-Write in the request path
    •PW=DW=1 by default

  53. Garbage!
    •Counters
      •Dead actors
    •Sets
      •Tombstones
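
The tombstone problem is easy to see in a textbook observed-remove set. The Python below is a sketch of that textbook form, not riak_dt's OR-Set: tags are supplied by the caller, and `garbage` is a hypothetical helper that counts tombstoned tags.

```python
class ORSet:
    """Textbook observed-remove set: removes tombstone the tags they saw."""

    def __init__(self):
        self.adds = {}      # element -> set of unique add tags
        self.removes = {}   # element -> set of tombstoned tags

    def add(self, element, tag):
        self.adds.setdefault(element, set()).add(tag)

    def remove(self, element):
        # Tombstone only the add tags this replica has observed.
        observed = self.adds.get(element, set())
        self.removes.setdefault(element, set()).update(observed)

    def value(self):
        return {e for e, tags in self.adds.items()
                if tags - self.removes.get(e, set())}

    def merge(self, other):
        merged = ORSet()
        for src in (self, other):
            for e, tags in src.adds.items():
                merged.adds.setdefault(e, set()).update(tags)
            for e, tags in src.removes.items():
                merged.removes.setdefault(e, set()).update(tags)
        return merged

    def garbage(self):
        """Number of tombstoned tags still held in memory."""
        return sum(len(tags) for tags in self.removes.values())


s = ORSet()
for i in range(3):
    s.add("x", ("node1", i))
    s.remove("x")
print(s.value(), s.garbage())  # set() 3

# A concurrent add survives a remove that never observed it.
r1 = ORSet()
r1.add("y", ("node1", 0))
r2 = r1.merge(ORSet())          # copy of r1's state
r1.remove("y")
r2.add("y", ("node2", 0))
print(r1.merge(r2).value())     # {'y'}
```

The set is empty after three add/remove cycles, yet it still stores three tags in each half; discarding them safely requires knowing that every replica has seen the removes.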

  54. Elegance = Punt
    •GC is non-monotonic!
    •Needs consensus to collect

  55. And then?
    •Stats/Metrics & Polish
    •Multi-Datacenter Replication
    •Active Anti-Entropy

  56. And then?
    •KV as storage
    •GC / low garbage datatypes
    •Op based / hybrid

  57. Open Source Today
    Insert screenshot here

  58. Questions?
    @russeldb @seancribbs
