Upgrade to PRO for Only $50/Year—Limited-Time Offer! 🔥

Just-Right Consistency - Closing the CAP Gap

Just-Right Consistency - Closing the CAP Gap

J on the Beach 2017

Christopher Meiklejohn

May 18, 2017
Tweet

More Decks by Christopher Meiklejohn

Other Decks in Research

Transcript

  1. Outline: Closing the CAP Gap • Just-Right Consistency
 Available as

    possible, and consistent when necessary • AntidoteDB
 The first database that provides transactions with strong semantics, targeted at the JRC approach 2
  2. Outline: Closing the CAP Gap • Just-Right Consistency
 Available as

    possible, and consistent when necessary • AntidoteDB
 The first database that provides transactions with strong semantics, targeted at the JRC approach • Moving forward
 Antidote’s path forward from research to company and product 2
  3. A Clients read and write against the primary copy. [Photo:

    http://vignette3.wikia.nocookie.net/the-titans-rp-and-information/images/f/f5/Blank-World-map2.gif/revision/latest/scale-to-width-down/1280?cb=20141016203452]
  4. A B C Geo-replicated for both fault-tolerance and high-availability. [Photo:

    http://vignette3.wikia.nocookie.net/the-titans-rp-and-information/images/f/f5/Blank-World-map2.gif/revision/latest/scale-to-width-down/1280?cb=20141016203452]
  5. A B C Clients read and write locally for low-latency.

    [Photo: http://vignette3.wikia.nocookie.net/the-titans-rp-and-information/images/f/f5/Blank-World-map2.gif/revision/latest/scale-to-width-down/1280?cb=20141016203452]
  6. A B C What happens if C can’t communicate with

    other replicas? [Photo: http://vignette3.wikia.nocookie.net/the-titans-rp-and-information/images/f/f5/Blank-World-map2.gif/revision/latest/scale-to-width-down/1280?cb=20141016203452]
  7. A B C Choice 1: Consistent-Under-Partition (CP) • Synchronize each

    operation
 Maintains “single system image” [Photo: http://vignette3.wikia.nocookie.net/the-titans-rp-and-information/images/f/f5/Blank-World-map2.gif/revision/latest/scale-to-width-down/1280?cb=20141016203452]
  8. A B C Choice 1: Consistent-Under-Partition (CP) • Synchronize each

    operation
 Maintains “single system image” • Spanner/F1, serializability model
 Coordination is expensive; Spanner typically has to wait 100ms to commit an update transaction [Photo: http://vignette3.wikia.nocookie.net/the-titans-rp-and-information/images/f/f5/Blank-World-map2.gif/revision/latest/scale-to-width-down/1280?cb=20141016203452]
  9. A B C Choice 1: Consistent-Under-Partition (CP) • Synchronize each

    operation
 Maintains “single system image” • Spanner/F1, serializability model
 Coordination is expensive; Spanner typically has to wait 100ms to commit an update transaction Over-conservative, but easy to program! [Photo: http://vignette3.wikia.nocookie.net/the-titans-rp-and-information/images/f/f5/Blank-World-map2.gif/revision/latest/scale-to-width-down/1280?cb=20141016203452]
  10. A B C Choice 2: Available-Under-Partition (AP) • Riak, Cassandra,

    Dynamo
 Operations issued against local copy, and across the cluster in parallel [Photo: http://vignette3.wikia.nocookie.net/the-titans-rp-and-information/images/f/f5/Blank-World-map2.gif/revision/latest/scale-to-width-down/1280?cb=20141016203452]
  11. A B C Choice 2: Available-Under-Partition (AP) • Riak, Cassandra,

    Dynamo
 Operations issued against local copy, and across the cluster in parallel • Local operation only, asynchronous propagation
 Stale reads and write conflicts will occur without synchronization [Photo: http://vignette3.wikia.nocookie.net/the-titans-rp-and-information/images/f/f5/Blank-World-map2.gif/revision/latest/scale-to-width-down/1280?cb=20141016203452]
  12. A B C Choice 2: Available-Under-Partition (AP) • Riak, Cassandra,

    Dynamo
 Operations issued against local copy, and across the cluster in parallel • Local operation only, asynchronous propagation
 Stale reads and write conflicts will occur without synchronization Available, but difficult to program! [Photo: http://vignette3.wikia.nocookie.net/the-titans-rp-and-information/images/f/f5/Blank-World-map2.gif/revision/latest/scale-to-width-down/1280?cb=20141016203452]
  13. A B C CAP Theorem High cost CP AP [Photo:

    http://vignette3.wikia.nocookie.net/the-titans-rp-and-information/images/f/f5/Blank-World-map2.gif/revision/latest/scale-to-width-down/1280?cb=20141016203452]
  14. A B C CAP Theorem High cost Low availability CP

    AP [Photo: http://vignette3.wikia.nocookie.net/the-titans-rp-and-information/images/f/f5/Blank-World-map2.gif/revision/latest/scale-to-width-down/1280?cb=20141016203452]
  15. A B C CAP Theorem High cost Low availability Synchronization

    CP AP [Photo: http://vignette3.wikia.nocookie.net/the-titans-rp-and-information/images/f/f5/Blank-World-map2.gif/revision/latest/scale-to-width-down/1280?cb=20141016203452]
  16. A B C CAP Theorem High cost Low availability Synchronization

    Low cost CP AP [Photo: http://vignette3.wikia.nocookie.net/the-titans-rp-and-information/images/f/f5/Blank-World-map2.gif/revision/latest/scale-to-width-down/1280?cb=20141016203452]
  17. A B C CAP Theorem High cost Low availability Synchronization

    Low cost High availability CP AP [Photo: http://vignette3.wikia.nocookie.net/the-titans-rp-and-information/images/f/f5/Blank-World-map2.gif/revision/latest/scale-to-width-down/1280?cb=20141016203452]
  18. A B C CAP Theorem High cost Low availability Synchronization

    Low cost High availability Anomalies CP AP [Photo: http://vignette3.wikia.nocookie.net/the-titans-rp-and-information/images/f/f5/Blank-World-map2.gif/revision/latest/scale-to-width-down/1280?cb=20141016203452]
  19. A B C CAP Theorem High cost Low availability Synchronization

    Low cost High availability Anomalies CP AP False dichotomy! [Photo: http://vignette3.wikia.nocookie.net/the-titans-rp-and-information/images/f/f5/Blank-World-map2.gif/revision/latest/scale-to-width-down/1280?cb=20141016203452]
  20. A B C CAP Theorem High cost Low availability Synchronization

    Low cost High availability Anomalies CP AP False dichotomy! [Photo: http://vignette3.wikia.nocookie.net/the-titans-rp-and-information/images/f/f5/Blank-World-map2.gif/revision/latest/scale-to-width-down/1280?cb=20141016203452] • No “one-size-fits-all” consistency model
 Choosing either model will either be over-conservative or risk anomalies
  21. A B C CAP Theorem High cost Low availability Synchronization

    Low cost High availability Anomalies CP AP False dichotomy! [Photo: http://vignette3.wikia.nocookie.net/the-titans-rp-and-information/images/f/f5/Blank-World-map2.gif/revision/latest/scale-to-width-down/1280?cb=20141016203452] • No “one-size-fits-all” consistency model
 Choosing either model will either be over-conservative or risk anomalies • Application-level invariants
 Instead, tailor consistency choices based on application- level invariants for each operation
  22. Just Right Consistency • Preserve sequential patterns
 Applications written sequentially

    that are correct should maintain correctness under concurrency 13
  23. Just Right Consistency • Preserve sequential patterns
 Applications written sequentially

    that are correct should maintain correctness under concurrency • AP-compatible invariants
 Strongest AP model; invariants that only require “one way” communications 13
  24. Just Right Consistency • Preserve sequential patterns
 Applications written sequentially

    that are correct should maintain correctness under concurrency • AP-compatible invariants
 Strongest AP model; invariants that only require “one way” communications • CAP-sensitive invariants
 Transactions that require coordination; “two way” communication invariants 13
  25. Just Right Consistency • Preserve sequential patterns
 Applications written sequentially

    that are correct should maintain correctness under concurrency • AP-compatible invariants
 Strongest AP model; invariants that only require “one way” communications • CAP-sensitive invariants
 Transactions that require coordination; “two way” communication invariants • Tools for analysis and verification
 Identify and verify application has sufficient synchronization to ensure application invariants 13
  26. Fælles Medicinkort • FMK [production] / FMKe [synthetic workload]
 Danish

    National Joint Medicine Card; operating 24x7 since 2013 for 6 million Danish citizens 15
  27. Fælles Medicinkort • FMK [production] / FMKe [synthetic workload]
 Danish

    National Joint Medicine Card; operating 24x7 since 2013 for 6 million Danish citizens • Lifecycle management for prescriptions
 Involves patient, pharmacy, and doctor management around active prescriptions in Denmark 15
  28. Fælles Medicinkort • FMK [production] / FMKe [synthetic workload]
 Danish

    National Joint Medicine Card; operating 24x7 since 2013 for 6 million Danish citizens • Lifecycle management for prescriptions
 Involves patient, pharmacy, and doctor management around active prescriptions in Denmark • Assumed correct in isolation
 “Correct-Individually”, C in ACID, each operation ensures application-level invariants 15
  29. Fælles Medicinkort • FMK [production] / FMKe [synthetic workload]
 Danish

    National Joint Medicine Card; operating 24x7 since 2013 for 6 million Danish citizens • Lifecycle management for prescriptions
 Involves patient, pharmacy, and doctor management around active prescriptions in Denmark • Assumed correct in isolation
 “Correct-Individually”, C in ACID, each operation ensures application-level invariants 15 • create-prescription
 Create prescription for patient, doctor, pharmacy • update-prescription-medication
 Add or increase medication to prescription • process-prescription
 Deliver a medication by a pharmacy • get-*-prescriptions
 Query functions to return information about prescriptions
  30. FMKe Invariants • Relative order [secondary indexes]
 Create a prescription

    and reference it by a patient • Joint update [atomicity]
 Create prescription, then update doctor, patient, and pharmacy 16
  31. FMKe Invariants • Relative order [secondary indexes]
 Create a prescription

    and reference it by a patient • Joint update [atomicity]
 Create prescription, then update doctor, patient, and pharmacy • Precondition check [if, then]
 Medication should not be over delivered 16
  32. AP-compatible • No synchronization
 Updates occur locally without blocking, no

    synchronization in the critical path • Asynchronous operation
 Updates are fast, available, and exploit concurrency 18
  33. AP-compatible • No synchronization
 Updates occur locally without blocking, no

    synchronization in the critical path • Asynchronous operation
 Updates are fast, available, and exploit concurrency • Compatible invariants
 Relative order and joint update invariants can be preserved 18
  34. RA RB 1 set(1) 3 2 set(2) set(3) 2 3

    Concurrent assignments don’t commute!
  35. RA RB 1 set(1) 3 2 set(2) set(3) 2 3

    Concurrent assignments don’t commute! Assignment requires CP.
  36. RA RB 1 set(1) 3 2 set(2) set(3) ? ?

    How do we deterministically pick a value to keep?
  37. RA RB 1 set(1) 3 2 set(2) set(3) ? ?

    How do we deterministically pick a value to keep? Do we use a timestamp? (like Cassandra, and drop a value?)
  38. RA RB 1 set(1) 3 2 set(2) set(3) ? ?

    How do we deterministically pick a value to keep? Do we use a timestamp? (like Cassandra, and drop a value?) Timestamps make concurrent operations commute but fail to capture intent.
  39. RA RB 1 set(1) 3 2 set(2) set(3) 3 3

    max(2,3) max(2,3) Deterministic conflict resolution function.
  40. RA RB 1 set(1) 3 2 set(2) set(3) 3 3

    max(2,3) max(2,3) Deterministic conflict resolution function. CRDTs generalize this framework.
  41. Conflict-Free 
 Replicated Data Types • Replicated abstract data types


    Extension of sequential data type that encapsulates deterministic merge function 28
  42. Conflict-Free 
 Replicated Data Types • Replicated abstract data types


    Extension of sequential data type that encapsulates deterministic merge function • Many existing designs
 Sets, counters, registers, flags, maps 28
  43. Causal Consistency • Respect causality
 Ensure updates are delivered in

    the causal order • Strongest available AP model
 Always able to return some compatible version for an object 44
  44. Causal Consistency • Respect causality
 Ensure updates are delivered in

    the causal order • Strongest available AP model
 Always able to return some compatible version for an object • Secondary indexing
 Causal consistency is sufficient for providing secondary indexing in an AP database 44
  45. RA RB C1 Rx create Rx Dr update Dr(Rx) Add

    reference in doctor record.
  46. RA RB C1 Rx create Rx Dr update Dr(Rx) Pt

    update Pt(Rx) Add reference in patient record.
  47. RA RB C1 Rx create Rx Dr update Dr(Rx) Pt

    update Pt(Rx) Ph update Ph(Rx) Add reference in pharmacy record.
  48. RA RB C1 Rx create Rx Dr update Dr(Rx) Pt

    update Pt(Rx) Ph update Ph(Rx) Updates are causally consistent.
  49. RA RB C1 Rx create Rx Dr update Dr(Rx) Pt

    update Pt(Rx) Ph update Ph(Rx) Client can read inconsistent state.
  50. RA RB C1 Rx create Rx Dr update Dr(Rx) Pt

    update Pt(Rx) Ph update Ph(Rx) Client is missing update to pharmacy.
  51. RA RB C1 T1 create Rx update Dr(Rx) update Pt(Rx)

    update Ph(Rx) Group updates into an atomic transaction.
  52. RA RB C1 T1 create Rx update Dr(Rx) update Pt(Rx)

    update Ph(Rx) Updates reflect “All-Or-Nothing” property through snapshots.
  53. RA RB C1 T1 create Rx update Dr(Rx) update Pt(Rx)

    update Ph(Rx) T2 Transactions are delivered in causal order.
  54. RA RB C1 T1 create Rx update Dr(Rx) update Pt(Rx)

    update Ph(Rx) T2 Therefore, snapshots are causally consistent.
  55. RA(2) RB(2) 1 1 1 pp(1) RC(2) 1 Replica A

    checks precondition and delivers medication.
  56. RA(2) RB(2) 1 1 1 pp(1) RC(2) 1 Correct outcome

    where one medication remains.
  57. RA(2) RB(2) 4 4 1 pp(1) RC(2) 4 4 add(3)

    Replica A checks precondition and delivers medication.
  58. RA(2) RB(2) 4 4 1 pp(1) RC(2) 4 4 add(3)

    Replica C adds three medications to the prescription.
  59. RA(2) RB(2) 4 4 1 pp(1) RC(2) 4 4 add(3)

    Correct outcome with four remaining medications.
  60. RA(2) RB(2) 4 4 1 pp(1) RC(2) 4 4 add(3)

    Correct outcome with four remaining medications. Precondition is stable under concurrent addition.
  61. RA(2) RB(2) -1 -1 1 pp(1) RC(2) -1 0 pp(2)

    Replica A checks precondition and delivers medication.
  62. RA(2) RB(2) -1 -1 1 pp(1) RC(2) -1 0 pp(2)

    Replica C concurrently checks precondition and delivers two medications.
  63. RA(2) RB(2) -1 -1 1 pp(1) RC(2) -1 0 pp(2)

    Incorrect outcome violating non-negative invariant.
  64. RA(2) RB(2) -1 -1 1 pp(1) RC(2) -1 0 pp(2)

    Incorrect outcome violating non-negative invariant. Precondition is NOT stable under concurrent fulfillment.
  65. RA(2) RB(2) -1 -1 1 pp(1) RC(2) -1 0 pp(2)

    Incorrect outcome violating non-negative invariant. Precondition is NOT stable under concurrent fulfillment. • Forbid concurrency
 Prevent operations from proceeding without synchronization to enforce invariant • Allow concurrency and remove invariant
 Allow operation to proceed, knowing that the invariant may be violated under concurrent operations
  66. RA RB I? I? ? Upre? RC I? ? Vpre?

    Analyze possible pairs of concurrent operations…
  67. RA RB I? I? ? Upre? RC I? ? Vpre?

    …to identify operations where the invariant can be violated.
  68. CISE Analysis • Individually correct
 Individual operations never violate the

    invariant • Convergence
 Concurrent effects commute 81
  69. CISE Analysis • Individually correct
 Individual operations never violate the

    invariant • Convergence
 Concurrent effects commute • Precondition stability
 Preconditions are stable under every pair of concurrent operations 81
  70. CISE Analysis • Individually correct
 Individual operations never violate the

    invariant • Convergence
 Concurrent effects commute • Precondition stability
 Preconditions are stable under every pair of concurrent operations 81 If satisfied, invariant is guaranteed with concurrency.
  71. AntidoteDB • Open-source Erlang database
 Developed in Erlang, on top

    of the Riak Core distributed systems framework 83
  72. AntidoteDB • Open-source Erlang database
 Developed in Erlang, on top

    of the Riak Core distributed systems framework • Transactional Causal Consistency
 Only industrial-grade database providing both causal consistency and all-or-nothing transactions 83
  73. AntidoteDB • Open-source Erlang database
 Developed in Erlang, on top

    of the Riak Core distributed systems framework • Transactional Causal Consistency
 Only industrial-grade database providing both causal consistency and all-or-nothing transactions • Alpha release available
 Currently under development, but an alpha release of the product is available on GitHub 83
  74. A B N1 N2 TxnMgr Materializer Log InterDC-Repl …each operating

    a transaction manager, materializers, log.
  75. A B N1 N2 TxnMgr Materializer Log InterDC-Repl …with a

    causal consistency protocol running in the wide area.
  76. Data Model 89 Register • Last-Writer Wins • Multi-Value Set

    • Grow-Only • Add-Wins • Remove-Wins Map Counter • Unlimited • Restricted ≥ 0 Graph • Directed • Monotonic DAG • Edit graph Sequence
  77. Object API 90 User1 = {michel, antidote_crdt_mvreg, user_bucket}, {ok, Time2}

    = antidote:update_objects(ignore, [], [{User1, assign, {["Michel", “[email protected]”], ClientIdentifier}}]), {ok, Result, Time2} = antidote:read_objects( ignore, [], [User1]).
  78. Object API 91 User1 = {michel, antidote_crdt_mvreg, user_bucket}, {ok, Time2}

    = antidote:update_objects(ignore, [], [{User1, assign, {["Michel", “[email protected]”], ClientIdentifier}}]), {ok, Result, Time2} = antidote:read_objects( ignore, [], [User1]). Identify an object by object identifier.
  79. Object API 92 User1 = {michel, antidote_crdt_mvreg, user_bucket}, {ok, Time2}

    = antidote:update_objects(ignore, [], [{User1, assign, {["Michel", “[email protected]”], ClientIdentifier}}]), {ok, Result, Time2} = antidote:read_objects( ignore, [], [User1]). Use the update API to assign a value to this register.
  80. Object API 93 User1 = {michel, antidote_crdt_mvreg, user_bucket}, {ok, Time2}

    = antidote:update_objects(ignore, [], [{User1, assign, {["Michel", “[email protected]”], ClientIdentifier}}]), {ok, Result, Time2} = antidote:read_objects( ignore, [], [User1]). Read the object, providing a minimum snapshot time.
  81. Object API 93 User1 = {michel, antidote_crdt_mvreg, user_bucket}, {ok, Time2}

    = antidote:update_objects(ignore, [], [{User1, assign, {["Michel", “[email protected]”], ClientIdentifier}}]), {ok, Result, Time2} = antidote:read_objects( ignore, [], [User1]). Read the object, providing a minimum snapshot time. Simple, operation-based API. (think Redis, Riak CRDTs)
  82. Object API 93 User1 = {michel, antidote_crdt_mvreg, user_bucket}, {ok, Time2}

    = antidote:update_objects(ignore, [], [{User1, assign, {["Michel", “[email protected]”], ClientIdentifier}}]), {ok, Result, Time2} = antidote:read_objects( ignore, [], [User1]). Read the object, providing a minimum snapshot time. Simple, operation-based API. (think Redis, Riak CRDTs) Causal dependencies are automatically captured by execution order.
  83. Transaction API 94 {ok, TxId} = antidote:start_transaction(Timestamp, []), {ok, _}

    = antidote:read_objects([Set], TxId), ok = antidote:update_objects([{Set, add, "Java"}], TxId), {ok, _} = antidote:commit_transaction(TxId).
  84. Transaction API 95 Start a transaction with the transaction API,

    with a given snapshot time and return a transaction identifier. {ok, TxId} = antidote:start_transaction(Timestamp, []), {ok, _} = antidote:read_objects([Set], TxId), ok = antidote:update_objects([{Set, add, "Java"}], TxId), {ok, _} = antidote:commit_transaction(TxId).
  85. {ok, TxId} = antidote:start_transaction(Timestamp, []), {ok, _} = antidote:read_objects([Set], TxId),

    ok = antidote:update_objects([{Set, add, "Java"}], TxId), {ok, _} = antidote:commit_transaction(TxId). Transaction API 96 Read objects using the interactive transaction API.
  86. {ok, TxId} = antidote:start_transaction(Timestamp, []), {ok, _} = antidote:read_objects([Set], TxId),

    ok = antidote:update_objects([{Set, add, "Java"}], TxId), {ok, _} = antidote:commit_transaction(TxId). Transaction API 97 Update objects using the interactive transaction API.
  87. {ok, TxId} = antidote:start_transaction(Timestamp, []), {ok, _} = antidote:read_objects([Set], TxId),

    ok = antidote:update_objects([{Set, add, "Java"}], TxId), {ok, _} = antidote:commit_transaction(TxId). Transaction API 98 Once finished updating, commit the transaction.
  88. {ok, TxId} = antidote:start_transaction(Timestamp, []), {ok, _} = antidote:read_objects([Set], TxId),

    ok = antidote:update_objects([{Set, add, "Java"}], TxId), {ok, _} = antidote:commit_transaction(TxId). Transaction API 98 Once finished updating, commit the transaction. Transactions read causally consistent snapshots and updates are applied atomically.
  89. Scalability 99 Kops / s 100 200 300 400 500

    600 700 800 1 x 5 1 x 10 1 x 25 2 x 25 3 x 25 1 x 5 1 x 10 1 x 25 2 x 25 3 x 25 1 x 5 1 x 10 1 x 25 2 x 25 3 x 25 1 x 5 1 x 10 1 x 25 2 x 25 3 x 25 99(1) 90(10) 75(25) 50(50) read(update) ratio DCs × Servers LWW registers 100k keys/partition power law distribution
  90. Cure vs. SOA 100 Kops / s 0 100 200

    300 400 500 600 700 800 900 1000 1100 Eiger GR Cure EC Eiger GR Cure EC Eiger GR Cure EC Eiger GR Cure EC 99(1) 90(10) 75(25) 50(50) read(update) ratio 3 DCs × 25 Servers LWW registers
  91. Cure vs. EC 101 Kops / s 100 200 300

    400 500 600 700 800 900 1000 1100 1200 Cure, 1KB EC, 1KB Cure, 10KB EC, 10KB Cure, 1KB EC, 1KB Cure, 10KB EC, 10KB Cure, 1KB EC, 1KB Cure, 10KB EC, 10KB Cure, 1KB EC, 1KB Cure, 10KB EC, 10KB 99(1) 90(10) 75(25) 50(50) read(update) ratio 3 DCs x 25 Servers CRDT sets
  92. Future Features • Intra-DC replication
 Antidote provides no replication within

    the datacenter and assumes only geo- replication at the moment 102
  93. Future Features • Intra-DC replication
 Antidote provides no replication within

    the datacenter and assumes only geo- replication at the moment • ACID transactions
 For Antidote to provide all of JRC, it needs ACID transaction support: no research needed, only implementation 102
  94. Moving Forward • Research prototype
 Originally a research prototype to

    build a database requiring reduced synchronization (SyncFree FP7) with Basho, Rovio, and Trifork 103
  95. Moving Forward • Research prototype
 Originally a research prototype to

    build a database requiring reduced synchronization (SyncFree FP7) with Basho, Rovio, and Trifork • Research ahead
 LightKone (H2020) will investigate moving AntidoteDB close to the edge to provide DDN services 103
  96. Moving Forward • Research prototype
 Originally a research prototype to

    build a database requiring reduced synchronization (SyncFree FP7) with Basho, Rovio, and Trifork • Research ahead
 LightKone (H2020) will investigate moving AntidoteDB close to the edge to provide DDN services • Industrialization
 Obtaining seed funding to start a company to industrialize AntidoteDB 103