A Brief History of Chain Replication

Papers We Love, SF

Christopher Meiklejohn

December 10, 2015

Transcript

  1. A Brief History of Chain Replication. Christopher Meiklejohn // @cmeik
     Papers We Love Too, December 10th, 2015
  6-7. Famous Computer Scientists Agree: Chain Replication is Confusing

  8-15. The Overview
     1. Chain Replication for High Throughput and Availability. van Renesse & Schneider, OSDI 2004
     2. Object Storage on CRAQ. Terrace & Freedman, USENIX 2009
     3. FAWN: A Fast Array of Wimpy Nodes. Andersen et al., SOSP 2009
     4. Chain Replication in Theory and in Practice. Fritchie, Erlang Workshop 2010
     5. HyperDex: A Distributed, Searchable Key-Value Store. Escriva et al., SIGCOMM 2012
     6. ChainReaction: a Causal+ Consistent Datastore based on Chain Replication. Almeida, Leitão, Rodrigues, EuroSys 2013
     7. Leveraging Sharding in the Design of Scalable Replication Protocols. Abu-Libdeh, van Renesse, Vigfusson, SoCC 2013
  16-18. The Themes
     • Failure Detection
     • Centralized Configuration Manager

  19. Chain Replication for High Throughput and Availability
     van Renesse & Schneider, OSDI 2004
  20-21. Storage Service API
     • V <- read(objId)
       Read the value for an object in the system
     • write(objId, V)
       Write an object to the system
  22-23. Primary-Backup Replication
     • Primary-Backup
       The primary sequences all write operations and forwards them to the non-faulty replicas
     • Centralized Configuration Manager
       Promotes a backup replica to a primary replica in the event of a failure
  24-26. Quorum Intersection Replication
     • Quorum Intersection
       Read and write quorums are used to perform requests against a replica set; the quorums must overlap
     • Increased Performance
       Performance increases because operations do not touch every replica in the replica set
     • Centralized Configuration Manager
       Establishes replicas, replica sets, and quorums
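
To make the overlap requirement concrete, here is a minimal sketch (not from the talk) of the standard intersection condition: a read quorum of size R and a write quorum of size W over N replicas must share at least one replica, which holds exactly when R + W > N.

    # Minimal sketch (not from the talk): checking that read/write quorum sizes
    # guarantee intersection over a replica set of size n.
    def quorums_intersect(n: int, r: int, w: int) -> bool:
        """Any read quorum of size r and write quorum of size w over n replicas
        share at least one replica iff r + w > n."""
        return r + w > n

    # Example: 5 replicas, reads from 2, writes to 4 -> quorums always overlap.
    assert quorums_intersect(5, 2, 4)
    # Reads from 2, writes to 3 -> a read may miss the latest write.
    assert not quorums_intersect(5, 2, 3)
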
  27-29. Chain Replication Contributions
     • High Throughput
       Nodes process updates serially; the responsibility of the “primary” is divided between the head and tail nodes
     • High Availability
       Objects tolerate f failures with only f + 1 nodes
     • Linearizability
       Total order over all read and write operations
  31-33. Chain Replication Algorithm
     • Head applies the update and ships the state change
       The head performs the write operation and sends the result down the chain, where it is stored in each replica’s history
     • Tail “acknowledges” the request
       The tail node “acknowledges” the write to the client and services read operations
     • “Update Propagation Invariant”
       With reliable FIFO links delivering messages, each server in the chain has a history at least as long as that of its successor
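
A minimal, single-process sketch of the write and read path just described, assuming in-memory nodes and synchronous forwarding (the names ChainNode and Chain are illustrative, not from the paper): writes enter at the head, propagate down reliable FIFO links, and the tail acknowledges the client and serves reads.

    # Minimal sketch of the chain replication write/read path described above.
    # Names (ChainNode, Chain) are illustrative, not from the talk or the paper.
    class ChainNode:
        def __init__(self):
            self.store = {}       # objId -> latest value
            self.history = []     # sequence of applied updates
            self.successor = None

        def apply(self, obj_id, value):
            self.store[obj_id] = value
            self.history.append((obj_id, value))
            if self.successor is not None:
                # Reliable FIFO forwarding down the chain.
                return self.successor.apply(obj_id, value)
            return "ack"          # the tail acknowledges the client

    class Chain:
        def __init__(self, length=3):
            self.nodes = [ChainNode() for _ in range(length)]
            for a, b in zip(self.nodes, self.nodes[1:]):
                a.successor = b
            self.head, self.tail = self.nodes[0], self.nodes[-1]

        def write(self, obj_id, value):
            return self.head.apply(obj_id, value)   # writes enter at the head

        def read(self, obj_id):
            return self.tail.store.get(obj_id)      # reads are served by the tail

    chain = Chain()
    assert chain.write("x", 1) == "ack"
    assert chain.read("x") == 1
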
  35-36. Failures? Reconfigure Chains

  37-38. Chain Replication Failure Detection
     • Centralized Configuration Manager
       Responsible for managing the “chain” and performing failure detection
     • “Fail-stop” failure model
       Processors fail by halting, do not perform an erroneous state transition, and can be reliably detected
  39-40. Chain Replication Reconfiguration
     • Failure of the head node
       Remove H and replace it with the successor of H
     • Failure of the tail node
       Remove T and replace it with the predecessor of T
  41-42. Chain Replication Reconfiguration
     • Failure of a “middle” node
       Introduce acknowledgements and track “in-flight” updates between members of a chain
     • “Inprocess Request Invariant”
       The history of a given node is the history of its successor plus the “in-flight” updates
  43. (Diagram: chain 1 → 2 → 3 → 4 reconfigured to 1 → 2 → 4 after the middle node 3 fails)

  44. Object Storage on CRAQ
     Terrace & Freedman, USENIX 2009

  45-46. CRAQ Motivation
     • CRAQ
       “Chain Replication with Apportioned Queries”
     • Motivation
       In chain replication, read operations can only be serviced by the tail
  47-49. CRAQ Contributions
     • Read Operations
       Any node can service read operations for the cluster, removing hotspots
     • Partitioning
       During network partitions: “eventually consistent” reads
     • Multi-Datacenter Load Balancing
       Provides a mechanism for performing multi-datacenter load balancing
  50-53. CRAQ Consistency Models
     • Strong Consistency
       Per-key linearizability
     • Eventual Consistency
       Read the newest available version
     • “Session Guarantee”
       Monotonic read consistency for reads at a node
     • Restricted Eventual Consistency
       Eventual consistency with a maximum bound on inconsistency, based on versioning or physical time
  54-57. CRAQ Algorithm
     • Replicas store multiple versions of each object
       Each version carries a version number and a dirty/clean status
     • Tail nodes mark objects “clean”
       Through acknowledgements, tail nodes mark an object “clean” and remove the other versions
     • Read operations only serve “clean” values
       Any replica can accept reads and “query” the tail for the identifier of the latest “clean” version
     • “Interesting Observation”
       We can no longer provide a total order over reads alone, only over writes and reads or writes and writes
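
A minimal sketch of the dirty/clean read path described above, assuming in-memory replicas (the CraqReplica class and its method names are illustrative, not CRAQ's actual code): a replica answers from its latest clean version, or asks the tail which version number is committed.

    # Minimal sketch (illustrative, not the paper's code) of the CRAQ read path:
    # any replica may serve a read, but if its latest local version is still
    # dirty it first asks the tail which version number is committed ("clean").
    class CraqReplica:
        def __init__(self, tail=None):
            self.versions = {}   # objId -> list of [version, value, clean?]
            self.tail = tail or self

        def write_from_predecessor(self, obj_id, version, value):
            # New versions arrive down the chain marked dirty.
            self.versions.setdefault(obj_id, []).append([version, value, False])

        def commit(self, obj_id, version):
            # Acks travel back up the chain; mark clean, drop older versions.
            kept = [v for v in self.versions[obj_id] if v[0] >= version]
            for v in kept:
                if v[0] == version:
                    v[2] = True
            self.versions[obj_id] = kept

        def committed_version(self, obj_id):
            clean = [v for v in self.versions.get(obj_id, []) if v[2]]
            return max(v[0] for v in clean) if clean else None

        def read(self, obj_id):
            local = self.versions.get(obj_id, [])
            if local and local[-1][2]:
                return local[-1][1]              # latest local version is clean
            committed = self.tail.committed_version(obj_id)
            for version, value, _ in local:
                if version == committed:
                    return value                 # serve the tail-committed version
            return None

    tail = CraqReplica()
    mid = CraqReplica(tail=tail)
    # A write of version 1 for "x" arrives at both replicas as dirty ...
    for node in (mid, tail):
        node.write_from_predecessor("x", 1, "v1")
    tail.commit("x", 1)            # the tail commits (marks clean) on arrival
    assert mid.read("x") == "v1"   # mid is still dirty, so it asks the tail
    mid.commit("x", 1)             # the ack later marks mid clean as well
    assert mid.read("x") == "v1"
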
  60-62. CRAQ Single-Key API
     • Prepend or append to a given object
       Apply a transformation to a given object in the data store
     • Increment/decrement
       Increment or decrement the value of an object in the data store
     • Test-and-set
       Compare and swap a value in the data store
  63-64. CRAQ Multi-Key API
     • Single-Chain
       Single-chain atomicity for objects located in the same chain
     • Multi-Chain
       Multi-chain updates use a 2PC protocol to ensure objects are committed across chains
  65-69. CRAQ Chain Placement
     • Multiple Chain Placement Strategies
     • “Implicit Datacenters and Global Chain Size”
       Specify the number of datacenters and the chain size at creation
     • “Explicit Datacenters and Global Chain Size”
       Specify the datacenters explicitly, with a single global chain size
     • “Explicit Datacenter Chain Sizes”
       Specify the datacenters and a chain size per datacenter
     • “Lower Latency”
       The ability to read from local nodes reduces read latency under geo-distribution
  70-71. (Diagram: nodes 1-9 placed into chains across datacenters)

  72-73. CRAQ TCP Multicast
     • Multicast can be used for disseminating updates
       The chain is used only for signaling messages about how to sequence the update messages
     • Acknowledgements
       Can be multicast as well, as long as we ensure a downward-closed set of message identifiers
  74. (Diagram: chain 1 → 2 → 3 → 4; the sequencing message travels down the chain while the payload is delivered by TCP multicast)
  75. FAWN: A Fast Array of Wimpy Nodes
     Andersen et al., SOSP 2009
  76-79. FAWN-KV & FAWN-DS
     • “Low-power, data-intensive computing”
       Massively parallel, low-power, mostly random-access computing
     • Solution: the FAWN architecture
       Close the I/O-CPU gap; optimize for low-power processors
     • Low-power embedded CPUs
     • Satisfy the same latency, capacity, and processing requirements
  80-81. FAWN-KV
     • Multi-node system named FAWN-KV
       Horizontal partitioning across FAWN-DS instances: log-structured data stores
     • Similar to Riak or Chord
       Consistent hashing across the cluster with hash-space partitioning
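
A minimal sketch of hash-space partitioning in this style, assuming SHA-1 positions on a ring with virtual nodes (illustrative only, not FAWN-KV's implementation): each key is owned by the first node position clockwise from its hash.

    # Minimal sketch of consistent hashing / hash-space partitioning in the
    # style described above (illustrative only, not FAWN-KV's actual code).
    import bisect
    import hashlib

    def _hash(value: str) -> int:
        return int(hashlib.sha1(value.encode()).hexdigest(), 16)

    class HashRing:
        def __init__(self, nodes, vnodes=8):
            # Each physical node owns several positions ("virtual nodes") on the ring.
            self.ring = sorted((_hash(f"{n}-{i}"), n) for n in nodes for i in range(vnodes))
            self.points = [p for p, _ in self.ring]

        def owner(self, key: str) -> str:
            # The owner is the first ring position clockwise from the key's hash.
            idx = bisect.bisect(self.points, _hash(key)) % len(self.ring)
            return self.ring[idx][1]

    ring = HashRing(["fawn-a", "fawn-b", "fawn-c"])
    assert ring.owner("user:42") in {"fawn-a", "fawn-b", "fawn-c"}
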
  83-85. FAWN-KV Optimizations
     • In-memory lookup by key
       Keep an in-memory index mapping each key to its location in a log-structured data store
     • Update operations
       Remove the reference to the old log entry; garbage-collect dangling entries during log compaction
     • Buffer and log cache
       Front-end nodes that proxy requests cache the requests and their results
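
A minimal sketch of the in-memory index over an append-only log described above (illustrative only; FAWN-DS keeps a compact hash index over on-flash log files): puts append to the log and update the index, and compaction drops entries the index no longer references.

    # Minimal sketch of the "in-memory index over an append-only log" idea above.
    class LogStore:
        def __init__(self):
            self.log = []      # append-only log of (key, value) records
            self.index = {}    # key -> offset of the latest record in the log

        def put(self, key, value):
            self.index[key] = len(self.log)   # new write supersedes the old offset
            self.log.append((key, value))

        def get(self, key):
            offset = self.index.get(key)
            return None if offset is None else self.log[offset][1]

        def compact(self):
            # Keep only records still referenced by the index (drop dangling entries).
            live = [(k, v) for off, (k, v) in enumerate(self.log) if self.index.get(k) == off]
            self.log = live
            self.index = {k: off for off, (k, _) in enumerate(live)}

    store = LogStore()
    store.put("a", 1)
    store.put("a", 2)
    store.compact()
    assert store.get("a") == 2 and len(store.log) == 1
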
  86-88. FAWN-KV Operations
     • Join/Leave operations
       Two-phase operations: pre-copy and log flush
     • Pre-copy
       Ensures that a joining node gets a copy of the state
     • Flush
       Ensures that operations performed after the copy snapshot are flushed to the joining node
  89-91. FAWN-KV Failure Model
     • Fail-Stop
       Nodes are assumed to be fail-stop, and failures are detected using front-end-to-back-end timeouts
     • Naive failure model
       It is assumed and acknowledged that back-ends only become fully partitioned: partitioned back-ends cannot talk to each other at all
  92. Chain Replication in Theory and in Practice
     Fritchie, Erlang Workshop 2010
  93-96. Hibari Overview
     • Physical and Logical Bricks
       Logical bricks live on physical bricks and make up chains striped across the physical bricks
     • “Table” Abstraction
       Exposes a SQL-like “table” with rows made up of keys and values; each key belongs to one table
     • Consistent Hashing
       Multiple chains; keys are hashed to determine which chain in the cluster values are written to
     • “Smart Clients”
       Clients know where to route requests, given metadata information
  98-99. Hibari “Read Priming”
     • “Priming” Processes
       To prevent blocking in logical bricks, processes are spawned to pre-read data from files and fill the OS page cache
     • Double Reads
       This reads the same data twice, but is faster than blocking the entire process to perform a read operation
  100-101. Hibari Rate Control
     • Load Shedding
       Messages are tagged with a timestamp and dropped if they sit too long in the Erlang mailbox
     • Routing Loops
       Monotonic hop counters are used to ensure that routing loops do not occur during key migration
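
A minimal sketch of the timestamp-based load shedding described above, assuming a queue of timestamped requests and an arbitrary 500 ms staleness bound (both made up for illustration; Hibari does this against Erlang process mailboxes): stale requests are dropped instead of served.

    # Minimal sketch of timestamp-based load shedding as described above.
    import time

    MAX_QUEUE_AGE = 0.5   # assumed bound: seconds a request may wait before it is shed

    def handle_requests(queue):
        """queue: iterable of (enqueue_time, request) pairs, oldest first."""
        served, shed = [], []
        for enqueued_at, request in queue:
            if time.monotonic() - enqueued_at > MAX_QUEUE_AGE:
                shed.append(request)      # too stale: drop instead of serving
            else:
                served.append(request)
        return served, shed

    now = time.monotonic()
    served, shed = handle_requests([(now - 1.0, "stale get"), (now, "fresh get")])
    assert served == ["fresh get"] and shed == ["stale get"]
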
  102-103. Hibari Admin Server
     • Single configuration agent
       Failure of the admin server only prevents cluster reconfiguration
     • Replicated state
       Its state is stored in the logical bricks of the cluster, but replicated using quorums
  104-105. Hibari “Fail Stop”
     • “Send and Pray”
       Erlang message passing can drop messages; it only makes limited guarantees about ordering, and none about delivery
  106-107. Hibari Partition Detector
     • Monitor two physical networks
       An application sends heartbeat messages over two physical networks in an attempt to increase failure-detection accuracy
     • Still problematic
       Bugs in the Erlang runtime system, backed-up distribution ports, VM pauses, etc.
  108-109. Hibari “Fail Stop” Violations
     • Fast chain churn
       Incorrect detection of failures results in frequent chain reconfiguration
     • Zero-length chains
       This can result in zero-length chains if churn occurs too frequently
  110. HyperDex: A Distributed, Searchable Key-Value Store
     Escriva et al., SIGCOMM 2012
  111-112. HyperDex Motivation
     • Scalable systems with restricted APIs
       The only mechanism for querying is by “primary key”
     • Secondary attributes and search
       Can we provide efficient secondary indexes and search functionality in these systems?
  113-114. HyperDex Contributions
     • “Hyperspace Hashing”
       Uses all attributes of an object to map it into a multi-dimensional Euclidean space
     • “Value-Dependent Chaining”
       A fault-tolerant replication protocol ensuring linearizability
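
A minimal sketch of the hyperspace-hashing idea, assuming a three-attribute schema and 16 buckets per dimension (both made up for illustration; not HyperDex's actual mapping): each attribute hashes to one coordinate, so an object occupies a point in a k-dimensional space, and a query on any subset of attributes constrains the matching region.

    # Minimal sketch of hyperspace hashing as described above (illustrative only).
    import hashlib

    ATTRIBUTES = ("first_name", "last_name", "phone")   # assumed example schema
    DIMENSION_SIZE = 16                                  # assumed buckets per dimension

    def _h(value) -> int:
        return int(hashlib.sha1(str(value).encode()).hexdigest(), 16) % DIMENSION_SIZE

    def coordinate(obj: dict) -> tuple:
        # One hashed coordinate per attribute: the object's point in hyperspace.
        return tuple(_h(obj[a]) for a in ATTRIBUTES)

    def matches(point: tuple, query: dict) -> bool:
        # A partial query fixes some coordinates; unspecified attributes match anything.
        return all(point[i] == _h(query[a]) for i, a in enumerate(ATTRIBUTES) if a in query)

    obj = {"first_name": "John", "last_name": "Smith", "phone": "555-1234"}
    point = coordinate(obj)
    assert matches(point, {"last_name": "Smith"})       # search by a secondary attribute
    assert matches(point, {"first_name": "John", "phone": "555-1234"})
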
  116-117. HyperDex Consistency and Replication
     • “Point leader”
       Determined through hashing; used to sequence all updates for an object
     • Attribute hashing
       The chain for an object is determined by hashing the object’s secondary attributes
  119-120. HyperDex Consistency and Replication
     • Updates “relocate” values
       On relocation, the chain contains both the old and new locations, preserving the ordering
     • Acknowledgements purge state
       Once a write is acknowledged back through the chain, old state is purged from the old locations
  122-123. HyperDex Consistency and Replication
     • “Point leader” includes sequencing information
       To resolve out-of-order delivery across chains of different lengths, sequencing information is included in the messages
     • Each “node” can be a chain itself
       Fault tolerance is achieved by making each node in the hyperspace mapping an instance of chain replication
  125-126. HyperDex Consistency and Replication
     • Per-key Linearizability
       Linearizable for all operations; all clients see the same order of events
     • Search Consistency
       Search results are guaranteed to include all objects committed at the time of the request
  127. Failures, tho?

  129. ChainReaction: a Causal+ Consistent Datastore based on Chain Replication
     Almeida, Leitão, Rodrigues, EuroSys 2013
  130-133. ChainReaction: Motivation and Contributions
     • Per-Key Linearizability
       Too expensive in the geo-replicated scenario
     • Causal+ Consistency
       Causal consistency with guaranteed convergence
     • Low Metadata Overhead
       Ensure that metadata does not grow explosively
     • Geo-Replication
       Define an optimal strategy for geo-replication of data
  134-136. ChainReaction: Conflict Resolution
     • “Last Writer Wins”
       Convergent, given a “synchronized” physical clock
     • Antidote, etc.
       Show that CRDTs can be used in practice to make this more deterministic
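
A minimal sketch of last-writer-wins convergence (illustrative, not from the talk): each write carries a timestamp and a replica identifier, and replicas keep the write with the largest pair, so merges commute and all replicas converge.

    # Minimal sketch of last-writer-wins conflict resolution as described above:
    # replicas converge by keeping the write with the highest (timestamp,
    # replica_id) pair, so ties break deterministically.
    def lww_merge(a, b):
        """a, b: (timestamp, replica_id, value) triples; return the winner."""
        return a if (a[0], a[1]) >= (b[0], b[1]) else b

    r1 = (10, "dc-a", "hello")
    r2 = (12, "dc-b", "world")
    # Both replicas converge to the same value regardless of merge order.
    assert lww_merge(r1, r2) == lww_merge(r2, r1) == r2
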
  137-139. ChainReaction: Single-Datacenter Operation
     • Causal Reads from K Nodes
       Given the Update Propagation Invariant, reads from the first K-1 nodes observe causal consistency for keys
     • Explicit Causality (not Potential Causality)
       An explicit list of operations that are causally related to the submitted update; multiple objects, across chains
     • “Datacenter Stability”
       The update is stable within a particular datacenter, and no previous update will ever be observed again
  140-142. ChainReaction: Multi-Datacenter Operation
     • Tracking with a DC-based “version vector”
       A “remote proxy” is used to establish a datacenter-based version vector
     • Explicit Causality (not Potential Causality)
       Apply only updates whose causal dependencies are satisfied within the DC, based on a local version vector
     • “Global Stability”
       The update is stable within all datacenters, and no previous update will ever be observed again
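
A minimal sketch of the dependency check implied above, assuming causal dependencies are summarized as a per-datacenter version vector (illustrative only): a remote update is applied only once the local vector dominates its dependency vector, and is buffered otherwise.

    # Minimal sketch of the version-vector dependency check described above.
    def dependencies_satisfied(local_vv: dict, deps_vv: dict) -> bool:
        return all(local_vv.get(dc, 0) >= n for dc, n in deps_vv.items())

    local = {"dc-a": 7, "dc-b": 3}
    update_deps = {"dc-a": 7, "dc-b": 4}
    assert not dependencies_satisfied(local, update_deps)   # buffer the update
    local["dc-b"] = 4
    assert dependencies_satisfied(local, update_deps)       # now apply it
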
  143. (Diagram: chain 1 → 2 → 3 → 4; a read operation is serviced by node 2, and the UPI guarantees for this chain that node 1 is causally consistent with those operations)
  144. Leveraging Sharding in the Design of Scalable Replication Protocols
     Abu-Libdeh, van Renesse, Vigfusson; SOSP 2011 poster session, SoCC 2013
  145-147. Elastic Replication: Motivation and Contributions
     • Customizable Consistency
       Decrease latency in exchange for weaker consistency guarantees
     • Robust Consistency
       Consistency does not require accurate failure detection
     • Smooth Reconfiguration
       Reconfiguration can occur without a central configuration service
  148-150. Fail-Stop: Challenges
     • Primary-Backup
       False suspicion can lead to promotion of a backup while concurrent writes on the non-failed primary can still be read
     • Quorum Intersection
       Under reconfiguration, quorums may not intersect for all clients
  151-154. Elastic Replication: Algorithm
     • Replicas contain a history of commands
       Commands are sequenced by the head of the chain
     • Stable prefix
       As commands are acknowledged, each replica reports the length of its stable prefix
     • Greatest common prefix is “learned”
       The sequencer promotes the greatest common prefix across the replicas
  155. (Diagram: chain 1 → 2 → 3 → 4 shown three times, for the addOp, adoptHistory / ack gcp, and learnPersistence messages)
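
A minimal sketch of the learning step above (illustrative, not the paper's pseudocode): given each replica's reported history, the stable, learned prefix is the longest prefix shared by all replicas.

    # Minimal sketch of "greatest common prefix is learned" as described above.
    def greatest_common_prefix(histories):
        stable = []
        for commands in zip(*histories):            # column-wise across replicas
            if all(c == commands[0] for c in commands):
                stable.append(commands[0])
            else:
                break
        return stable

    h1 = ["set x=1", "set y=2", "set z=3"]
    h2 = ["set x=1", "set y=2"]
    h3 = ["set x=1", "set y=2", "set z=9"]
    assert greatest_common_prefix([h1, h2, h3]) == ["set x=1", "set y=2"]
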
  156-160. Elastic Replication: Algorithm
     • Safety
       When nodes suspect a failure in the network, they “wedge” and no further operations can be applied
     • Only updates in the history may become stable
     • Liveness
       Replicas and chains are reconfigured to ensure progress
     • The history is inherited from the replicas and reconfigured in a way that preserves the UPI
  161. (Diagram repeated: addOp, adoptHistory / ack gcp, and learnPersistence across the chain 1 → 2 → 3 → 4)
  162-165. Elastic Replication: Elastic Bands
     • Horizontal partitioning
       Requests are sharded across elastic bands for scalability
     • Shards configure neighboring shards
       Shards are responsible for sequencing the configurations of neighboring shards
     • Requires external configuration
       Even with this, band configuration must be managed by an external configuration service
  167-169. Elastic Replication: Read Operations
     • Read requests must be sent down the chain
       Read operations must be sequenced for the system to properly determine whether a configuration has been wedged
     • Reads can be serviced by other nodes
       Read from the stabilized history for a weaker form of consistency
  170-172. In Summary
     • “Fail-Stop” Assumption
       In practice, fail-stop can be a difficult model to provide, given the imperfections in VMs, networks, and programming abstractions
     • Consensus
       Consensus is still required for configuration, however much we attempt to remove it from the system
     • Chain Replication
       A strong technique for providing linearizability that requires only f + 1 nodes to tolerate f failures
  173. Thanks! Christopher Meiklejohn // @cmeik