An audiovisual précis of 'A comprehensive study of Convergent and Commutative Replicated Data Types'

Presented at Papers We Love NYC #4: http://www.meetup.com/papers-we-love/events/175964662/

Chas Emerick

May 15, 2014

Transcript

  1. Chas Emerick
    @cemerick @QuiltProject
    Papers We Love NYC #4
    May 15, 2014
    for your edification and entertainment, an audiovisual précis of:
    A comprehensive study of
    Convergent and Commutative
    Replicated Data Types
    by Shapiro, Preguiça, Baquero, and Zawirski (2011)

  2. @papers_we_love NYC #4, 2014-05-15
    @cemerick / @QuiltProject
    Preface
    ● “Who is this guy?”
    ● Not going to follow the order of topics from the
    paper exactly
    ● Key topics will be introduced along with section
    numbers from the paper (e.g. §4.3)
    ● I'll be drawing upon materials used in the authors'
    presentations related to this paper (see references
    for links)

  3. Itinerary
    ● The paper
    – Motivating problem
    – Theoretically- and algebraically-sound solution
    – Specification of practical data type designs
    – Challenges
    – Related work, further reading
    ● Impact outside of academia
    ● Questions & discussion

  4. Actors within distributed systems
    exchange and share state
    [Diagram: timelines (t) for replicas A and B, each holding x = a, then x = b, then x = c]

  5. The problem
    Conflicting concurrent modifications
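    [Diagram: timelines (t) for replicas A and B; A applies Δa and B applies Δb to x concurrently; after the updates are exchanged, x_Δa ∥ x_Δb and the resulting value is unknown (???)]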

  6. The problem
    Conflicting concurrent modifications require
    consensus
    [Diagram: timelines (t) for replicas A and B; updates Δa and Δb are exchanged and both replicas end at x_Δab]

  7. The problem
    Conflicting concurrent modifications require
    consensus, which is partition-intolerant and affects
    availability:
    [Diagram: timelines (t) for replicas A and B; updates Δa and Δb are exchanged and both replicas end at x_Δab]

  8. Linearizability✝
    ● Transactional databases, Redis, consensus services (Paxos/Raft)
    ● Global consensus → consistency
    ● Total order of all events
    ● Very expensive & constrains availability
    [Diagram: timelines (t) for replicas A and B; updates Δa and Δb are exchanged and both replicas end at x_Δab]
    ✝As well as strict serializability, which has even stronger guarantees.

  9. Thus, eventual consistency
    ● No guarantee of ordering of events
    ● Maximal availability & performance
    ● How to reconcile results from concurrent
    operations?
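    [Diagram: timelines (t) for replicas A and B; A applies Δa and B applies Δb to x concurrently; after the updates are exchanged, x_Δa ∥ x_Δb and the resulting value is unknown (???)]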

  10. How to reconcile results of
    concurrent operations?
    ● “Background” (deferred) consensus
    – Post-hoc resolution or rollback of conflicting updates
    ● This is what we do today, all the time!
    – Resolving CouchDB conflicts and Riak siblings within
    applications
    – Merging (semi-)textual content via diffs
    ● Very difficult to implement correctly, and no
    guiding formalisms to indicate correctness or warn
    against problems

  11. Proposed solution
    Conflict-free Replicated Data Types (CRDTs)
    deterministically reconcile concurrent updates
    such that no conflicts arise
    – Performance, availability, scale of eventual consistency
    + reliable reconciliation as if you were using a
    consensus mechanism
    – Provably sound
    – Limitations:
    ● No consensus → limitations on what can be stored,
    replicated, and reconciled (i.e. no global invariants)
    ● Unbounded growth → “garbage collection”

  12. Chocolate and vanilla CRDTs
    ● What is replicated?
    – Entire state of the datatype? State-based, a.k.a.
    convergent replicated data type, a.k.a. CvRDT
    – Individual operations (+ arguments)? Operation-based,
    a.k.a. commutative replicated data type, a.k.a. CmRDT
    – Options correspond to the two strategies for
    implementing optimistic replication✝
    ● These are formally equivalent §2.4
    – Strategy: understand state-based constructions,
    move on to operation-based as optimization
    ✝http://research.microsoft.com/apps/pubs/default.aspx?id=66979

  13. State-based (CvRDT)
    §2.2.1, §2.3.1
    All possible states within a
    CvRDT form a semilattice
    – Partially-ordered set
    established by a least upper
    bound (join, ⊔) or greatest
    lower bound (meet, ⊓)
    – Both relations are
    definitionally commutative,
    associative, and idempotent
    – Each application of join or
    meet yields a monotonically
    increasing or decreasing value
    [Diagram: semilattice of set states over time t, with states ø, {a}, {b}, and {a,b}]

  14. State-based (CvRDT)
    ● join and meet are formally equivalent, join
    presumed throughout the literature
    ● Update locally, propagate the resulting state to other
    replicas, where it must converge (the 'v' in “CvRDT”)
    ● Requires only the weakest eventual consistency guarantees
    to yield convergence among all replicas, since join
    is associative, commutative, and idempotent
    – “infinitely often” transmission of state
    – Insensitive to reordered/dropped/repeated messages
    – Very expensive worst case, but easier to reason about
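
The state-based model above can be sketched with a grow-only counter. This is a minimal illustration, not code from the talk or the paper: the class name `GCounter` and its methods are assumptions chosen for this example. Each replica bumps only its own entry, and join is a pointwise maximum, which is what makes it commutative, associative, and idempotent.

```python
from dataclasses import dataclass, field


@dataclass
class GCounter:
    """Grow-only counter (hypothetical sketch): one count per replica id."""
    counts: dict = field(default_factory=dict)

    def increment(self, replica_id, amount=1):
        # Local update: a replica only ever raises its own entry.
        if amount < 0:
            raise ValueError("a grow-only counter cannot decrease")
        self.counts[replica_id] = self.counts.get(replica_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def join(self, other):
        # Least upper bound of two states: pointwise maximum.
        keys = self.counts.keys() | other.counts.keys()
        return GCounter(
            {k: max(self.counts.get(k, 0), other.counts.get(k, 0)) for k in keys}
        )


a, b = GCounter(), GCounter()
a.increment("A")
a.increment("A")
b.increment("B")

merged = a.join(b)
```

Because join is idempotent, re-sending the same state (a repeated message) leaves `merged` unchanged, and because it is commutative and associative, the order in which states arrive does not matter.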

  15. Language for specifying
    asynchronous replication
    ● More than boxes, arrows
    ● Better than (most) pseudocode: explicit about
    preconditions, where things happen, (a)synchrony, etc.

  16. Language for specifying
    asynchronous replication

  17. A “Portfolio of basic CRDTs” §3
    ● Counters
    ● Registers
    ● Sets
    ● Sequences
    ● CRDTs compose and retain their characteristics
    – Sets → maps, multimaps, graphs

  18. Designing a register CRDT✝
    ● Sequential specification:
    – R.set(v) → R.get() == v
    ● Join relation over R.set(v_a) || R.set(v_b)
    – Linearizable?
    – Error state?
    – Last writer wins?
    ✝Framework from http://bit.ly/shapiro-msr-talk

  19. LWW (last writer wins) register §3.2.1
    ● Ensures only a single value in register
    ● Semilattice is ordered by timestamps
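    [Diagram: timelines (t) for replicas A and B, starting from ∅; x is assigned at t1 and y at t2; after Δt1 and Δt2 are exchanged, both replicas hold y (t2)]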
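
A last-writer-wins register can be sketched as a state whose join keeps the assignment with the larger timestamp. This is a hedged illustration only: the name `LWWRegister`, the integer timestamps, and the use of a replica id to break timestamp ties are assumptions made for this example. It also assumes a replica never issues two different values with the same timestamp.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class LWWRegister:
    timestamp: int = 0
    replica_id: str = ""
    value: Any = None  # None stands in for the empty register (∅)

    def assign(self, value, timestamp, replica_id):
        # Assignment is a join with a freshly stamped state.
        return self.join(LWWRegister(timestamp, replica_id, value))

    def join(self, other):
        # The semilattice is ordered by (timestamp, replica_id); keep the larger.
        mine = (self.timestamp, self.replica_id)
        theirs = (other.timestamp, other.replica_id)
        return self if mine >= theirs else other


empty = LWWRegister()
at_a = empty.assign("x", timestamp=1, replica_id="A")
at_b = empty.assign("y", timestamp=2, replica_id="B")
converged = at_a.join(at_b)
```

Both replicas end up holding `y`, regardless of which side performs the join, and the earlier write of `x` is silently discarded.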

  20. Mapping LWW-register to its semilattice
    [Diagram: the LWW-register exchange between replicas A and B (x at t1, y at t2, both converging on y), shown alongside its semilattice over time t: ø → [t1,x] → [t2,y]]

  21. MV (multi-value) register §3.2.2
    ● Assignments carry causal history (e.g. version
    vector) which defines semilattice's partial order
    ● join retains all values assigned concurrently;
    some client can later assign a single value
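    [Diagram: timelines (t) for replicas A and B, starting from ∅; x is assigned with Δa and y with Δb concurrently; after the exchange both replicas hold x_Δa ∥ y_Δb; a client then assigns z with history [Δa,Δb,Δc], which both replicas adopt]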
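
The multi-value behaviour can be sketched with version vectors held as plain dicts. The helper names (`dominates`, `join`, `assign`) and the list-of-pairs representation are assumptions for illustration, not the paper's specification: join keeps every value whose vector is not dominated by another, and a later assignment carries a vector that dominates everything it has seen.

```python
def dominates(v1, v2):
    """True if version vector v1 has seen everything v2 has."""
    return all(v1.get(k, 0) >= n for k, n in v2.items())


def join(s1, s2):
    """Keep every (value, vector) pair not strictly dominated by another pair."""
    pairs = s1 + [p for p in s2 if p not in s1]
    return [
        (val, vv)
        for val, vv in pairs
        if not any(dominates(other, vv) and other != vv for _, other in pairs)
    ]


def assign(state, replica_id, value):
    """Assign a value whose vector dominates all currently visible values."""
    merged = {}
    for _, vv in state:
        for k, n in vv.items():
            merged[k] = max(merged.get(k, 0), n)
    merged[replica_id] = merged.get(replica_id, 0) + 1
    return [(value, merged)]


at_a = assign([], "A", "x")         # concurrent assignment at A
at_b = assign([], "B", "y")         # concurrent assignment at B
both = join(at_a, at_b)             # neither dominates: both values are kept
resolved = assign(both, "C", "z")   # a client later assigns a single value
```

Joining `resolved` back into a replica that still holds both `x` and `y` leaves only `z`, since its vector dominates the other two.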

  22. Mapping MV-register to its semilattice
    [Diagram: the MV-register exchange between replicas A and B and a client, shown alongside its semilattice over time t: ø → [Δa,x] and [Δb,y] → #{[Δa,x] [Δb,y]} → [[Δa,Δb,Δc],z]]

  23. Designing a set CRDT✝
    ● Sequential specification:
    – S.add(e) → S.contains(e) == true
    – S.remove(e) → S.contains(e) == false
    ● Join relation over S.add(e) || S.remove(e)
    – Linearizable?
    – Disallow removals?
    – Error state?
    – Last writer wins?
    – Add wins?
    – Remove wins?
    ✝Framework from http://bit.ly/shapiro-msr-talk

  24. Sets §3.3
    ● Counterintuitive convergent characterizations
    – G-Set (“grow-only”): can add, cannot remove
    – 2P-Set (“two phase”): once removed, cannot
    add an element back
    ● Composition of two G-Sets
    – LWW-Set
    – PN-Set (positive & negative counters track
    membership): addition may not yield
    membership
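
The counterintuitive 2P-Set behaviour can be sketched as a pair of grow-only sets, one for additions and one for removals. The class name `TwoPSet` and its methods are assumptions for this illustration; join is simply the union of each component.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoPSet:
    added: frozenset = frozenset()    # grow-only set of added elements
    removed: frozenset = frozenset()  # grow-only set of removed elements

    def add(self, e):
        return TwoPSet(self.added | {e}, self.removed)

    def remove(self, e):
        # Only an element that is currently present can be removed.
        if e not in self:
            return self
        return TwoPSet(self.added, self.removed | {e})

    def __contains__(self, e):
        return e in self.added and e not in self.removed

    def join(self, other):
        return TwoPSet(self.added | other.added, self.removed | other.removed)


s = TwoPSet().add("e")
s = s.remove("e")
s = s.add("e")          # the removal is permanent, so this has no visible effect
readded_visible = "e" in s
```

Since neither component ever shrinks, a removal recorded anywhere eventually wins everywhere, which is why an element cannot be added back.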

  25. Observed-Remove Set §3.3.5
    ● Set CRDT with intuitive semantics
    – Given S.add(e) || S.remove(e), add wins
    ● Strategy
    – tag each element uniquely (per actor or per
    operation), e_τ
    – an operation removing e must include the set of all
    τ for e that the removing replica has observed and
    that have not already been removed
    – Set.contains(e) == true iff an e_τ exists where
    τ has not been implicated in a removal of e
    ● Tags are not exposed in userland API
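
The tagging strategy above can be sketched in state-based form. This is an illustrative assumption-laden sketch, not the paper's specification: the class name `ORSet`, the use of `uuid4` for tags, and the tombstone set `removes` are choices made for this example. A remove only covers the tags its replica has observed, so a concurrent add with a fresh tag survives.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class ORSet:
    adds: frozenset = frozenset()     # (element, tag) pairs ever added
    removes: frozenset = frozenset()  # (element, tag) pairs removed (tombstones)

    def add(self, e):
        tag = uuid.uuid4().hex  # unique per add operation
        return ORSet(self.adds | {(e, tag)}, self.removes)

    def remove(self, e):
        # Remove only the tags for e that this replica has observed.
        observed = {(x, t) for (x, t) in self.adds if x == e}
        return ORSet(self.adds, self.removes | observed)

    def __contains__(self, e):
        # Present iff some tag for e has not been implicated in a removal.
        return any(x == e for (x, _) in self.adds - self.removes)

    def join(self, other):
        return ORSet(self.adds | other.adds, self.removes | other.removes)


a = ORSet().add("e")             # replica A adds e with its own tag
b = ORSet().add("e")             # replica B concurrently adds e with another tag
a_removed = a.remove("e")        # A removes only the tag it has seen
c = b.join(a_removed)            # a third replica that already knew B's add
a_final = a_removed.join(b)      # B's add eventually reaches A
```

The tags never leave the implementation: callers only use `add`, `remove`, and `in`.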

  26. Observed-Remove Set
    Progression & Semilattice
    [Diagram: timelines (t) for replicas A, B, and C, showing the tagged state at each step (e_a; e_b; e_a,b; e_a,-a; e_a,-a,b; e_b,a,-a) and the resulting visible set (∅ or {e})]

    A and B concurrently add e_a and e_b

    A removes e_a; this has no effect on e's membership in
    C's view of the set because of its knowledge of e_b

    e_b is eventually replicated to A, yielding consistency

  27. Graphs §3.4
    ● Two sets, vertices + edges
    ● Many different possible constructions given the
    local invariants one might want to preserve
    between edges and vertices
    ● Global invariants cannot be guaranteed because of
    concurrent operations
    – e.g. cannot prevent cycles

  28. Sequences §3.5.2
    ● Set of (identifier, value) where identifiers are
    selected from a dense, totally-ordered set
    ● Explored deeply in papers on Logoot and Treedoc
    CRDTs
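
The idea of a dense, totally-ordered identifier space can be sketched with rational numbers: between any two identifiers there is always room for another. This is only a stand-in for illustration. It is not the identifier scheme used by Logoot or Treedoc, and the helper `between` is a hypothetical name. A real design would also need to disambiguate two replicas that concurrently pick the same position.

```python
from fractions import Fraction


def between(left, right):
    """Return an identifier strictly between two existing identifiers."""
    return (left + right) / 2


START, END = Fraction(0), Fraction(1)

seq = set()  # the sequence is a set of (identifier, value) pairs

id_a = between(START, END)
seq.add((id_a, "a"))

id_c = between(id_a, END)
seq.add((id_c, "c"))

id_b = between(id_a, id_c)  # insert between two existing elements
seq.add((id_b, "b"))

# Reading the sequence means sorting the set by identifier.
text = "".join(value for _, value in sorted(seq))
```

Because the underlying structure is a set, replicas can merge by union and still agree on the order of elements.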

  29. Operation-based CmRDT
    §2.2.2, §2.3.2
    ● Requires “reliable broadcast channel”
    – Operations delivered to each replica in causal order <_d
    – All concurrent operations that are unordered with
    respect to <_d must commute (the 'm' in CmRDT)
    ● Far more efficient than worst-case state-based
    specification
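
The commutativity requirement can be sketched with an operation-based counter, where replicas ship small operations instead of full state. The names `OpCounter` and `replay` and the tuple encoding of operations are assumptions for this illustration. Every delivery order gives the same result, but only if each operation is applied exactly once, which is the job of the reliable broadcast channel.

```python
from itertools import permutations


class OpCounter:
    """Operation-based counter (hypothetical sketch): applies inc/dec operations."""

    def __init__(self):
        self.value = 0

    def apply(self, op):
        kind, amount = op
        if kind == "inc":
            self.value += amount
        elif kind == "dec":
            self.value -= amount
        else:
            raise ValueError(f"unknown operation: {kind}")


def replay(ops):
    """Apply a sequence of operations to a fresh replica and return its value."""
    replica = OpCounter()
    for op in ops:
        replica.apply(op)
    return replica.value


ops = [("inc", 5), ("dec", 2), ("inc", 1)]

# Concurrent operations commute: every delivery order converges on one value.
results = {replay(order) for order in permutations(ops)}
```

Unlike a state-based join, these operations are not idempotent: replaying `("inc", 5)` twice would change the result, so duplicate delivery must be prevented by the channel.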

  30. Operation-based CmRDT: tradeoffs
    ● More complex, more difficult to reason about
    ● More challenging to implement
    – Causal relationships between operations must
    be identified + maintained
    – Generally requires tracking “group membership”

  31. Garbage Collection §4
    ● “Garbage”: additional overhead that accumulates in order
    to satisfy CRDT semantics
    – “tombstones” (e.g. remove tags in an OR-Set)
    – Unbalanced trees of identifiers in sequences
    ● Optimistically collecting garbage and rolling back as
    necessary is an option in some cases
    ● Others appear to require various levels of consensus to
    achieve
    ● “Garbage” is not always waste
    – The right kind of tombstones is what makes
    consistent snapshots possible

  32. Prior & related work
    Lots of prior work had portions of CRDTs'
    semantics, before “CRDT” was identified as a
    concept:
    – Wuu and Bernstein, 'Efficient solutions to the
    replicated log and dictionary problems' (1984!)
    – Operational transforms
    – Any Dynamo-style system uses registers for values
    ● LWW-registers: S3
    ● MV-registers: CouchDB conflicts, Riak siblings

  33. Prior & related work
    “Consistency as Logical Monotonicity” (CALM theorem)
    – s/semilattices/monotonic logic
    ● Stricter semantics than semilattices; no way to
    characterize non-monotonic operations (remove,
    etc) without consensus
    – Implemented at the language level by Bloom
    ● Nearly all data structures are monotonic or lattices
    ● Allows for static analysis that identifies parts of
    your program that aren't monotonic (require
    synchronization/consensus mechanism to ensure
    safety)

  34. Resources
    ● Meetup page for this talk: http://bit.ly/pwl-nyc-4
    ● Shapiro et al. paper: http://bit.ly/shapiro-crdt-pdf
    ● Shapiro talk @ MSR: http://bit.ly/shapiro-msr-talk
    ● Chris Meiklejohn's 'Readings in Distributed Systems':
    http://bit.ly/cmeik-dist-sys-readings
    ● CRDTs offered in v2.0 of Riak: http://bit.ly/riak-crdts
    ● Bloom, a Ruby DSL for “disorderly programming”, an
    implementation of CALM: http://www.bloom-lang.net
    ● The Quilt Project: http://quilt.org
