Slide 1

Slide 1 text

A Brief History of Chain Replication
Christopher Meiklejohn // @cmeik
Papers We Love Too, December 10th, 2015

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

Famous Computer Scientists Agree Chain Replication is Confusing

Slide 7

Slide 7 text

Famous Computer Scientists Agree Chain Replication is Confusing

Slide 8

Slide 8 text

The Overview 3

Slide 9

Slide 9 text

The Overview 1. Chain Replication for High Throughput and Availability
 van Renesse & Schneider, OSDI 2004 3

Slide 10

Slide 10 text

The Overview 1. Chain Replication for High Throughput and Availability
 van Renesse & Schneider, OSDI 2004 2. Object Storage on CRAQ
 Terrace & Freedman, USENIX 2009 3

Slide 11

Slide 11 text

The Overview 1. Chain Replication for High Throughput and Availability
 van Renesse & Schneider, OSDI 2004 2. Object Storage on CRAQ
 Terrace & Freedman, USENIX 2009 3. FAWN: A Fast Array of Wimpy Nodes
 Andersen et al., SOSP 2009 3

Slide 12

Slide 12 text

The Overview 1. Chain Replication for High Throughput and Availability
 van Renesse & Schneider, OSDI 2004 2. Object Storage on CRAQ
 Terrace & Freedman, USENIX 2009 3. FAWN: A Fast Array of Wimpy Nodes
 Andersen et al., SOSP 2009 4. Chain Replication in Theory and in Practice
 Fritchie, Erlang 2010 3

Slide 13

Slide 13 text

The Overview 1. Chain Replication for High Throughput and Availability
 van Renesse & Schneider, OSDI 2004 2. Object Storage on CRAQ
 Terrace & Freedman, USENIX 2009 3. FAWN: A Fast Array of Wimpy Nodes
 Andersen et al., SOSP 2009 4. Chain Replication in Theory and in Practice
 Fritchie, Erlang 2010 5. HyperDex: A Distributed, Searchable Key-Value Store
 Escriva et al., SIGCOMM 2012 3

Slide 14

Slide 14 text

The Overview 1. Chain Replication for High Throughput and Availability
 van Renesse & Schneider, OSDI 2004 2. Object Storage on CRAQ
 Terrace & Freedman, USENIX 2009 3. FAWN: A Fast Array of Wimpy Nodes
 Andersen et al., SOSP 2009 4. Chain Replication in Theory and in Practice
 Fritchie, Erlang 2010 5. HyperDex: A Distributed, Searchable Key-Value Store
 Escriva et al., SIGCOMM 2012 6. ChainReaction: a Causal+ Consistent Datastore based on Chain Replication
 Almeida, Leitão, Rodrigues, Eurosys 2013 3

Slide 15

Slide 15 text

The Overview
1. Chain Replication for High Throughput and Availability (van Renesse & Schneider, OSDI 2004)
2. Object Storage on CRAQ (Terrace & Freedman, USENIX 2009)
3. FAWN: A Fast Array of Wimpy Nodes (Andersen et al., SOSP 2009)
4. Chain Replication in Theory and in Practice (Fritchie, Erlang Workshop 2010)
5. HyperDex: A Distributed, Searchable Key-Value Store (Escriva et al., SIGCOMM 2012)
6. ChainReaction: a Causal+ Consistent Datastore based on Chain Replication (Almeida, Leitão, Rodrigues, EuroSys 2013)
7. Leveraging Sharding in the Design of Scalable Replication Protocols (Abu-Libdeh, van Renesse, Vigfusson, SoCC 2013)

Slide 16

Slide 16 text

The Themes 4

Slide 17

Slide 17 text

The Themes • Failure Detection 4

Slide 18

Slide 18 text

The Themes • Failure Detection • Centralized Configuration Manager 4

Slide 19

Slide 19 text

Chain Replication for High Throughput and Availability
van Renesse & Schneider, OSDI 2004

Slide 20

Slide 20 text

Storage Service API • V <- read(objId)
 Read the value for an object in the system 6

Slide 21

Slide 21 text

Storage Service API • V <- read(objId)
 Read the value for an object in the system • write(objId, V)
 Write an object to the system 6

Slide 22

Slide 22 text

Primary-Backup Replication • Primary-Backup
 Primary sequences all write operations and forwards them to a non-faulty replica 7

Slide 23

Slide 23 text

Primary-Backup Replication • Primary-Backup
 Primary sequences all write operations and forwards them to a non-faulty replica • Centralized Configuration Manager
 Promotes a backup replica to a primary replica in the event of a failure 7

Slide 24

Slide 24 text

Quorum Intersection Replication • Quorum Intersection
 Read and write quorums used to perform requests against a replica set, ensure overlapping quorums 8

Slide 25

Slide 25 text

Quorum Intersection Replication • Quorum Intersection
 Read and write quorums used to perform requests against a replica set, ensure overlapping quorums • Increased performance
 Increased performance when you do not perform operations against every replica in the replica set 8

Slide 26

Slide 26 text

Quorum Intersection Replication
• Quorum Intersection: read and write quorums are used to perform requests against a replica set, ensuring that the quorums overlap (see the sketch below)
• Increased performance: operations do not need to touch every replica in the replica set
• Centralized Configuration Manager: establishes replicas, replica sets, and quorums
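
As a reminder of the overlap requirement (my illustration, not from the slides): with N replicas, a read quorum of size R and a write quorum of size W intersect whenever R + W > N. A minimal sketch:

```python
# Illustrative only: the standard overlap condition for quorum replication.
def quorums_intersect(n: int, r: int, w: int) -> bool:
    """Any read quorum of size r and write quorum of size w chosen from n
    replicas must share at least one replica, i.e. r + w > n."""
    return r + w > n

# With 5 replicas, reading 3 and writing 3 always overlaps...
assert quorums_intersect(5, 3, 3)
# ...but reading 2 against a write quorum of 3 can miss the latest write.
assert not quorums_intersect(5, 2, 3)
```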

Slide 27

Slide 27 text

Chain Replication Contributions • High-throughput
 Nodes process updates in serial, responsibility of “primary” divided between the head and the tail nodes 9

Slide 28

Slide 28 text

Chain Replication Contributions • High-throughput
 Nodes process updates in serial, responsibility of “primary” divided between the head and the tail nodes • High-availability
 Objects are tolerant to f failures with only f + 1 nodes 9

Slide 29

Slide 29 text

Chain Replication Contributions
• High throughput: nodes process updates serially; the responsibility of the “primary” is divided between the head and the tail nodes
• High availability: objects tolerate f failures with only f + 1 nodes
• Linearizability: a total order over all read and write operations

Slide 30

Slide 30 text

No content

Slide 31

Slide 31 text

Chain Replication Algorithm • Head applies update and ships state change
 Head performs the write operation and sends the result down the chain, where it is stored in each replica's history 11

Slide 32

Slide 32 text

Chain Replication Algorithm • Head applies update and ships state change
 Head performs the write operation and sends the result down the chain, where it is stored in each replica's history • Tail “acknowledges” the request
 Tail node “acknowledges” the user and services read operations 11

Slide 33

Slide 33 text

Chain Replication Algorithm
• Head applies update and ships state change: the head performs the write operation and sends the result down the chain, where it is stored in each replica's history (see the sketch below)
• Tail “acknowledges” the request: the tail node acknowledges the client and services read operations
• “Update Propagation Invariant”: with reliable FIFO links for delivering messages, the history of each server in the chain is a superset of (potentially greater than) the history of its successor
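
A minimal sketch of the head-to-tail write path described above (my illustration, not the paper's code; the class and method names are hypothetical):

```python
# Hypothetical model of the write path: the head applies an update, ships the
# state change down the chain, and the tail acknowledges the client and serves
# reads. Class and method names are illustrative, not the paper's.

class ChainNode:
    def __init__(self, name, successor=None):
        self.name = name
        self.successor = successor
        self.store = {}      # objId -> current value
        self.history = {}    # objId -> list of updates applied at this node

    def handle_update(self, obj_id, value):
        # Every node applies the update and records it in its history.
        self.store[obj_id] = value
        self.history.setdefault(obj_id, []).append(value)
        if self.successor is not None:
            # Not the tail yet: ship the state change down the chain.
            return self.successor.handle_update(obj_id, value)
        # Tail: acknowledge the client.
        return ("ack", obj_id, value)

    def read(self, obj_id):
        # In basic chain replication, only the tail services reads.
        assert self.successor is None, "reads must go to the tail"
        return self.store.get(obj_id)

# Build a three-node chain: head -> middle -> tail.
tail = ChainNode("tail")
head = ChainNode("head", successor=ChainNode("middle", successor=tail))

print(head.handle_update("x", 1))   # ('ack', 'x', 1)
print(tail.read("x"))               # 1
```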

Slide 34

Slide 34 text

No content

Slide 35

Slide 35 text

Failures? 13

Slide 36

Slide 36 text

Failures? Reconfigure Chains

Slide 37

Slide 37 text

Chain Replication Failure Detection • Centralized Configuration Manager
 Responsible for managing the “chain” and performing failure detection 14

Slide 38

Slide 38 text

Chain Replication Failure Detection • Centralized Configuration Manager
 Responsible for managing the “chain” and performing failure detection • “Fail-stop” failure model
 Processors fail by halting, do not perform an erroneous state transition, and can be reliably detected 14

Slide 39

Slide 39 text

Chain Replication Reconfiguration • Failure of the head node
 Remove H and replace it with the successor of H 15

Slide 40

Slide 40 text

Chain Replication Reconfiguration
• Failure of the head node: remove H and replace it with the successor of H
• Failure of the tail node: remove T and replace it with the predecessor of T

Slide 41

Slide 41 text

Chain Replication Reconfiguration • Failure of a “middle” node
 Introduce acknowledgements, and track “in-flight” updates between members of a chain 16

Slide 42

Slide 42 text

Chain Replication Reconfiguration
• Failure of a “middle” node: introduce acknowledgements and track “in-flight” updates between members of a chain (see the sketch below)
• “Inprocess Requests Invariant”: the history of a given node equals the history of its successor plus the “in-flight” updates between them
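
One way to picture the bookkeeping this requires (my illustration, not the paper's code): each node remembers updates it has forwarded but not yet seen acknowledged, so that when its successor fails it can replay them to the new successor.

```python
# Hypothetical sketch of the bookkeeping for middle-node failure: each node
# keeps the updates it has forwarded but not yet seen acknowledged, so it can
# replay them to a new successor. Deduplication is a membership test here;
# the real protocol uses sequence numbers.

class TrackingNode:
    def __init__(self, name, successor=None):
        self.name = name
        self.successor = successor
        self.history = []     # updates applied at this node
        self.in_flight = []   # forwarded but not yet acknowledged

    def handle_update(self, update):
        if update in self.history:          # a replayed update: ignore it
            return
        self.history.append(update)
        if self.successor is not None:
            self.in_flight.append(update)
            self.successor.handle_update(update)

    def handle_ack(self, update):
        # Acknowledgement flowing back from the tail: no longer in flight.
        self.in_flight.remove(update)

    def replace_successor(self, new_successor):
        # The old successor failed: splice it out and resend in-flight updates.
        self.successor = new_successor
        for update in self.in_flight:
            new_successor.handle_update(update)

# Chain a -> b -> c; the ack for "w1" never makes it back before b fails.
c = TrackingNode("c")
b = TrackingNode("b", successor=c)
a = TrackingNode("a", successor=b)
a.handle_update("w1")
a.replace_successor(c)           # splice out b; replay anything unacknowledged
print(a.in_flight, c.history)    # ['w1'] ['w1']
```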

Slide 43

Slide 43 text

(Diagram: a chain of nodes 1-2-3-4; after node 3 fails, the chain is reconfigured to 1-2-4.)

Slide 44

Slide 44 text

Object Storage on CRAQ
Terrace & Freedman, USENIX 2009

Slide 45

Slide 45 text

CRAQ Motivation • CRAQ
 “Chain Replication with Apportioned Queries” 19

Slide 46

Slide 46 text

CRAQ Motivation • CRAQ
 “Chain Replication with Apportioned Queries” • Motivation
 Read operations can only be serviced by the tail 19

Slide 47

Slide 47 text

CRAQ Contributions • Read Operations
 Any node can service read operations for the cluster, removing hotspots 20

Slide 48

Slide 48 text

CRAQ Contributions • Read Operations
 Any node can service read operations for the cluster, removing hotspots • Partitioning
 During network partitions: “eventually consistent” reads 20

Slide 49

Slide 49 text

CRAQ Contributions
• Read Operations: any node can service read operations for the cluster, removing hotspots
• Partitioning: during network partitions, “eventually consistent” reads
• Multi-Datacenter Load Balancing: a mechanism for performing multi-datacenter load balancing

Slide 50

Slide 50 text

CRAQ Consistency Models • Strong Consistency
 Per-key linearizability 21

Slide 51

Slide 51 text

CRAQ Consistency Models • Strong Consistency
 Per-key linearizability • Eventual Consistency
 Read newest available version 21

Slide 52

Slide 52 text

CRAQ Consistency Models • Strong Consistency
 Per-key linearizability • Eventual Consistency
 Read newest available version • “Session Guarantee”
 Monotonic read consistency for reads at a node 21

Slide 53

Slide 53 text

CRAQ Consistency Models
• Strong Consistency: per-key linearizability
• Eventual Consistency: read the newest available version
• “Session Guarantee”: monotonic read consistency for reads at a single node
• Restricted Eventual Consistency: inconsistency bounded by a maximum number of versions or amount of physical time

Slide 54

Slide 54 text

CRAQ Algorithm • Replicas store multiple versions for each object
 Each object contains version number and a dirty/clean status 22

Slide 55

Slide 55 text

CRAQ Algorithm • Replicas store multiple versions for each object
 Each object contains version number and a dirty/clean status • Tail nodes mark objects “clean”
 Through acknowledgements, tail nodes mark an object “clean” and remove other versions 22

Slide 56

Slide 56 text

CRAQ Algorithm • Replicas store multiple versions for each object
 Each object contains version number and a dirty/clean status • Tail nodes mark objects “clean”
 Through acknowledgements, tail nodes mark an object “clean” and remove other versions • Read operations only serve “clean” values
 Any replica can accept a read and “query” the tail for the identifier of the latest “clean” version 22

Slide 57

Slide 57 text

CRAQ Algorithm
• Replicas store multiple versions for each object: each version carries a version number and a dirty/clean status
• Tail nodes mark objects “clean”: through acknowledgements, tail nodes mark a version “clean” and remove the other versions
• Read operations only serve “clean” values: any replica can accept a read and “query” the tail for the identifier of the latest “clean” version (see the sketch below)
• “Interesting Observation”: we can no longer provide a total order over reads; only write/read and write/write pairs are totally ordered
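
A minimal sketch of that read path (my illustration, not the authors' code; class and method names are made up):

```python
# Hypothetical sketch of CRAQ's apportioned reads, not the authors' code: each
# replica keeps clean/dirty versions of an object; a clean object is served
# locally, a dirty one triggers a version query to the tail.

class CraqReplica:
    def __init__(self, tail=None):
        self.tail = tail if tail is not None else self   # the tail is its own tail
        self.versions = {}   # objId -> {version number: value}
        self.clean = {}      # objId -> latest committed ("clean") version number
        self.latest = {}     # objId -> latest received (possibly dirty) version

    def apply_write(self, obj_id, version, value):
        # Called as a write propagates down the chain; dirty until acknowledged.
        self.versions.setdefault(obj_id, {})[version] = value
        self.latest[obj_id] = version

    def mark_clean(self, obj_id, version):
        # Acknowledgement flowing back from the tail: keep only the clean version.
        self.clean[obj_id] = version
        self.versions[obj_id] = {version: self.versions[obj_id][version]}

    def committed_version(self, obj_id):
        return self.clean.get(obj_id)

    def read(self, obj_id):
        if obj_id not in self.versions:
            return None
        if self.latest.get(obj_id) == self.clean.get(obj_id):
            return self.versions[obj_id][self.clean[obj_id]]   # clean: serve locally
        version = self.tail.committed_version(obj_id)          # dirty: ask the tail
        return self.versions[obj_id][version]

# Version 1 of "x" propagates down a two-node chain and is acknowledged.
tail = CraqReplica()
replica = CraqReplica(tail=tail)
for node in (replica, tail):
    node.apply_write("x", 1, "hello")
for node in (tail, replica):
    node.mark_clean("x", 1)
print(replica.read("x"))   # "hello", served without contacting the tail
```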

Slide 58

Slide 58 text

No content

Slide 59

Slide 59 text

No content

Slide 60

Slide 60 text

CRAQ Single-Key API • Prepend or append to a given object
 Apply a transformation for a given object in the data store 25

Slide 61

Slide 61 text

CRAQ Single-Key API • Prepend or append to a given object
 Apply a transformation for a given object in the data store • Increment/decrement
 Increment or decrement a value for an object in the data store 25

Slide 62

Slide 62 text

CRAQ Single-Key API
• Prepend or append to a given object: apply a transformation to a given object in the data store
• Increment/decrement: increment or decrement a value for an object in the data store
• Test-and-set: compare and swap a value in the data store (see the sketch below)
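
A minimal sketch of what a test-and-set (compare-and-swap) call against such a key-value store looks like (my illustration; the store and method names are hypothetical, not CRAQ's API):

```python
# Hypothetical in-memory stand-in for the single-key operations listed above.
class KVStore:
    def __init__(self):
        self.data = {}

    def append(self, key, suffix):
        self.data[key] = self.data.get(key, "") + suffix

    def increment(self, key, delta=1):
        self.data[key] = self.data.get(key, 0) + delta

    def test_and_set(self, key, expected, new_value):
        # Succeeds only if the current value matches what the caller expects,
        # which is how concurrent writers avoid clobbering each other.
        if self.data.get(key) != expected:
            return False
        self.data[key] = new_value
        return True

store = KVStore()
store.increment("counter")                    # counter == 1
print(store.test_and_set("counter", 1, 10))   # True: value was 1, now 10
print(store.test_and_set("counter", 1, 99))   # False: value is 10, not 1
```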

Slide 63

Slide 63 text

CRAQ Multi-Key API • Single-Chain
 Single-chain atomicity for objects located in the same chain 26

Slide 64

Slide 64 text

CRAQ Multi-Key API • Single-Chain
 Single-chain atomicity for objects located in the same chain • Multi-Chain
 Multi-Chain update use a 2PC protocol to ensure objects are committed across chains 26

Slide 65

Slide 65 text

CRAQ Chain Placement • Multiple Chain Placement Strategies 27

Slide 66

Slide 66 text

CRAQ Chain Placement • Multiple Chain Placement Strategies • “Implicit Datacenters and Global Chain Size”
 Specify number of DC’s and chain size during creation 27

Slide 67

Slide 67 text

CRAQ Chain Placement • Multiple Chain Placement Strategies • “Implicit Datacenters and Global Chain Size”
 Specify number of DC’s and chain size during creation • “Explicit Datacenters and Global Chain Size”
 Specify datacenters and chain size per datacenter 27

Slide 68

Slide 68 text

CRAQ Chain Placement • Multiple Chain Placement Strategies • “Implicit Datacenters and Global Chain Size”
 Specify number of DC’s and chain size during creation • “Explicit Datacenters and Global Chain Size”
 Specify datacenters and chain size per datacenter • “Explicit Datacenters Chain Size”
 Specify datacenters and chains size per datacenter 27

Slide 69

Slide 69 text

CRAQ Chain Placement
• Multiple chain placement strategies
• “Implicit Datacenters and Global Chain Size”: specify the number of datacenters and a chain size at creation
• “Explicit Datacenters and Global Chain Size”: specify the datacenters explicitly, with a single global chain size
• “Explicit Datacenters Chain Size”: specify the datacenters and a chain size per datacenter
• “Lower Latency”: the ability to read from local nodes reduces read latency under geo-distribution

Slide 70

Slide 70 text


Slide 71

Slide 71 text


Slide 72

Slide 72 text

CRAQ TCP Multicast • Can be used for disseminating updates
 Chain used only for signaling messages about how to sequence update messages 30

Slide 73

Slide 73 text

CRAQ TCP Multicast • Can be used for disseminating updates
 Chain used only for signaling messages about how to sequence update messages • Acknowledgements
 Can be multicast as well, as long as we ensure a downward closed set on message identifiers 30

Slide 74

Slide 74 text

(Diagram: chain nodes 1-4; sequencing messages travel down the chain while the update payload is delivered via TCP multicast.)

Slide 75

Slide 75 text

FAWN: A Fast Array of Wimpy Nodes
Andersen et al., SOSP 2009

Slide 76

Slide 76 text

FAWN-KV & FAWN-DS • “Low-power, data-intensive computing”
 Massively powerful, low-power, mostly random- access computing 33

Slide 77

Slide 77 text

FAWN-KV & FAWN-DS • “Low-power, data-intensive computing”
 Massively powerful, low-power, mostly random- access computing • Solution: FAWN architecture
 Close the IO/CPU gap, optimize for low-power processors 33

Slide 78

Slide 78 text

FAWN-KV & FAWN-DS • “Low-power, data-intensive computing”
 Massively powerful, low-power, mostly random- access computing • Solution: FAWN architecture
 Close the IO/CPU gap, optimize for low-power processors • Low-power embedded CPUs 33

Slide 79

Slide 79 text

FAWN-KV & FAWN-DS
• “Low-power, data-intensive computing”: massively powerful, low-power, mostly random-access computing
• Solution, the FAWN architecture: close the I/O/CPU gap and optimize for low-power processors
• Low-power embedded CPUs
• Satisfy the same latency, capacity, and processing requirements

Slide 80

Slide 80 text

FAWN-KV • Multi-node system named FAWN-KV
 Horizontal partitioning across FAWN-DS instances: log-structured data stores 34

Slide 81

Slide 81 text

FAWN-KV
• Multi-node system named FAWN-KV: horizontal partitioning across FAWN-DS instances (log-structured data stores)
• Similar to Riak or Chord: consistent hashing across the cluster with hash-space partitioning (see the sketch below)
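
A minimal sketch of hash-space partitioning with consistent hashing (my illustration, not FAWN's code): each node owns a point on a ring, and a key is stored at the first node clockwise from the key's hash.

```python
# Hypothetical sketch of consistent hashing over a ring of nodes.

import bisect
import hashlib

def ring_hash(value: str) -> int:
    # Stable position on a 2^32-sized ring.
    return int.from_bytes(hashlib.sha1(value.encode()).digest()[:4], "big")

class ConsistentHashRing:
    def __init__(self, nodes):
        self.points = sorted((ring_hash(n), n) for n in nodes)

    def owner(self, key: str) -> str:
        # First node at or after the key's position, wrapping around the ring.
        idx = bisect.bisect_left(self.points, (ring_hash(key), ""))
        return self.points[idx % len(self.points)][1]

ring = ConsistentHashRing(["fawn-node-1", "fawn-node-2", "fawn-node-3"])
print(ring.owner("user:42"))   # the same node every time for the same key
```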

Slide 82

Slide 82 text

No content

Slide 83

Slide 83 text

FAWN-KV Optimizations • In-memory lookup by key
 Store an in-memory location to a key in a log- structured data structure 36

Slide 84

Slide 84 text

FAWN-KV Optimizations • In-memory lookup by key
 Store an in-memory location to a key in a log- structured data structure • Update operations
 Remove reference in the log; garbage collect dangling references during compaction of the log 36

Slide 85

Slide 85 text

FAWN-KV Optimizations
• In-memory lookup by key: keep an in-memory index from each key to its location in the log-structured data store (see the sketch below)
• Update operations: repoint the in-memory reference; dangling entries are garbage collected during compaction of the log
• Buffer and log cache: front-end nodes that proxy requests cache both the requests and their results
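
A minimal sketch of the log-plus-index layout (my illustration, not FAWN-DS itself; an in-memory BytesIO stands in for the append-only on-disk log):

```python
# Hypothetical sketch of a FAWN-DS-style layout: an append-only log plus an
# in-memory index mapping each key to the offset of its newest entry.

import io
import json

class LogStructuredStore:
    def __init__(self):
        self.log = io.BytesIO()   # append-only log of records
        self.index = {}           # key -> byte offset of the latest record

    def put(self, key, value):
        offset = self.log.seek(0, io.SEEK_END)
        self.log.write(json.dumps({"k": key, "v": value}).encode() + b"\n")
        self.index[key] = offset  # older records for this key become garbage

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        self.log.seek(offset)
        return json.loads(self.log.readline())["v"]

    def compact(self):
        # Rewrite only live records, reclaiming the space of superseded ones.
        live = {key: self.get(key) for key in self.index}
        self.log, self.index = io.BytesIO(), {}
        for key, value in live.items():
            self.put(key, value)

store = LogStructuredStore()
store.put("a", 1)
store.put("a", 2)
print(store.get("a"))   # 2: the index points at the newest record
store.compact()         # the stale record for "a" is dropped
```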

Slide 86

Slide 86 text

FAWN-KV Operations • Join/Leave operations
 Two phase operations: pre-copy and log flush 37

Slide 87

Slide 87 text

FAWN-KV Operations • Join/Leave operations
 Two phase operations: pre-copy and log flush • Pre-copy
 Ensures that joining nodes get copy of state 37

Slide 88

Slide 88 text

FAWN-KV Operations
• Join/Leave operations: two-phase operations consisting of pre-copy and log flush
• Pre-copy: ensures that a joining node gets a copy of the existing state
• Log flush: ensures that operations performed after the pre-copy snapshot are also flushed to the joining node

Slide 89

Slide 89 text

FAWN-KV Failure Model 38

Slide 90

Slide 90 text

FAWN-KV Failure Model • Fail-Stop
 Nodes are assumed to be fail stop, and failures are detected using front-end to back-end timeouts 38

Slide 91

Slide 91 text

FAWN-KV Failure Model
• Fail-Stop: nodes are assumed to be fail-stop, and failures are detected using front-end to back-end timeouts
• Naive failure model: the authors assume, and acknowledge, that partitions are total; partitioned back-ends cannot talk to each other at all

Slide 92

Slide 92 text

Chain Replication in Theory and in Practice
Fritchie, Erlang Workshop 2010

Slide 93

Slide 93 text

Hibari Overview • Physical and Logical Bricks
 Logical bricks exist on physical and make up striped chains across physical bricks 40

Slide 94

Slide 94 text

Hibari Overview • Physical and Logical Bricks
 Logical bricks exist on physical and make up striped chains across physical bricks • “Table” Abstraction
 Exposes itself as a SQL-like “table” with rows made up of keys and values, one table per key 40

Slide 95

Slide 95 text

Hibari Overview • Physical and Logical Bricks
 Logical bricks exist on physical and make up striped chains across physical bricks • “Table” Abstraction
 Exposes itself as a SQL-like “table” with rows made up of keys and values, one table per key • Consistent Hashing
 Multiple chains; hashed to determine what chain to write values to in the cluster 40

Slide 96

Slide 96 text

Hibari Overview
• Physical and Logical Bricks: logical bricks live on physical bricks and make up chains striped across the physical bricks
• “Table” Abstraction: exposes itself as a SQL-like “table” with rows made up of keys and values; each key belongs to one table
• Consistent Hashing: multiple chains; keys are hashed to determine which chain in the cluster a value is written to
• “Smart Clients”: clients know where to route requests given metadata information

Slide 97

Slide 97 text

No content

Slide 98

Slide 98 text

Hibari “Read Priming” • “Priming” Processes
 In order to prevent blocking in logical bricks, processes are spawned to pre-read data from files and fill the OS page cache 42

Slide 99

Slide 99 text

Hibari “Read Priming” • “Priming” Processes
 In order to prevent blocking in logical bricks, processes are spawned to pre-read data from files and fill the OS page cache • Double Reads
 Results in reading the same data twice, but is faster than blocking the entire process to perform a read operation 42

Slide 100

Slide 100 text

Hibari Rate Control • Load Shedding
 Processes are tagged with a temporal time and dropped if events sit too long in the Erlang mailbox 43

Slide 101

Slide 101 text

Hibari Rate Control
• Load Shedding: messages are tagged with a timestamp and dropped if they sit too long in the Erlang mailbox
• Routing Loops: monotonic hop counters are used to ensure that routing loops do not occur during key migration

Slide 102

Slide 102 text

Hibari Admin Server • Single configuration agent
 Failure of this only prevents cluster reconfiguration 44

Slide 103

Slide 103 text

Hibari Admin Server • Single configuration agent
 Failure of this only prevents cluster reconfiguration • Replicated state
 State is stored in the logical bricks of the cluster, but replicated using quorums 44

Slide 104

Slide 104 text

Hibari “Fail Stop” 45

Slide 105

Slide 105 text

Hibari “Fail Stop” • “Send and Pray”
 Erlang message passing can drop messages and only makes particular guarantees about ordering, but not delivery 45

Slide 106

Slide 106 text

Hibari Partition Detector • Monitor two physical networks
 Application which sends heartbeat messages over two physical networks in an attempt to increase failure detection accuracy 46

Slide 107

Slide 107 text

Hibari Partition Detector
• Monitor two physical networks: an application sends heartbeat messages over two physical networks in an attempt to increase failure detection accuracy
• Still problematic: bugs in the Erlang runtime system, backed-up distribution ports, VM pauses, etc.

Slide 108

Slide 108 text

Hibari “Fail Stop” Violations • Fast chain churn
 Incorrect detection of failures results in frequent chain reconfiguration 47

Slide 109

Slide 109 text

Hibari “Fail Stop” Violations
• Fast chain churn: incorrect detection of failures results in frequent chain reconfiguration
• Zero-length chains: churn that occurs too frequently can leave chains with zero length

Slide 110

Slide 110 text

HyperDex: A Distributed, Searchable Key-Value Store
Escriva et al., SIGCOMM 2012

Slide 111

Slide 111 text

HyperDex Motivation • Scalable systems with restricted APIs
 Only mechanism for querying is by “primary key” 49

Slide 112

Slide 112 text

HyperDex Motivation • Scalable systems with restricted APIs
 Only mechanism for querying is by “primary key” • Secondary attributes and search
 Can we provide efficient secondary indexes and search functionality in these systems? 49

Slide 113

Slide 113 text

HyperDex Contribution • “Hyperspace Hashing”
 Uses all attributes of an object to map into multi-dimensional Euclidean space 50

Slide 114

Slide 114 text

HyperDex Contribution
• “Hyperspace Hashing”: uses all attributes of an object to map it into a multi-dimensional Euclidean space (see the sketch below)
• “Value-Dependent Chaining”: a fault-tolerant replication protocol ensuring linearizability
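
A minimal sketch of the idea (my illustration, not HyperDex's code; the attribute schema and bucket counts are made up): each attribute hashes onto one axis, so an object becomes a point, and a search that fixes only some attributes visits only the regions intersecting those coordinates.

```python
# Hypothetical sketch of hyperspace hashing over a made-up three-attribute schema.

import hashlib
from itertools import product

AXES = ("first_name", "last_name", "phone")
BUCKETS_PER_AXIS = 4   # each axis is split into 4 regions

def axis_hash(value) -> int:
    digest = hashlib.sha1(str(value).encode()).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS_PER_AXIS

def coordinates(obj: dict) -> tuple:
    """Map an object to one coordinate per attribute axis."""
    return tuple(axis_hash(obj[a]) for a in AXES)

def candidate_regions(query: dict):
    """Regions that may hold matches: a specified attribute pins its
    coordinate, an unspecified one ranges over every bucket on its axis."""
    choices = [[axis_hash(query[a])] if a in query else range(BUCKETS_PER_AXIS)
               for a in AXES]
    return list(product(*choices))

obj = {"first_name": "John", "last_name": "Smith", "phone": "555-8000"}
print(coordinates(obj))                                  # e.g. (2, 1, 3)
print(len(candidate_regions({"last_name": "Smith"})))    # 16 of the 64 regions
```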

Slide 115

Slide 115 text

No content

Slide 116

Slide 116 text

HyperDex 
 Consistency and Replication • “Point leader”
 Determined through hashing, used to sequence all updates for an object 52

Slide 117

Slide 117 text

HyperDex Consistency and Replication
• “Point leader”: determined through hashing; used to sequence all updates for an object
• Attribute hashing: the chain for the object is determined by hashing the object's secondary attributes

Slide 118

Slide 118 text

No content

Slide 119

Slide 119 text

HyperDex 
 Consistency and Replication • Updates “relocate” values
 On relocation, chain contains old and new locations, ensuring they preserve the ordering 54

Slide 120

Slide 120 text

HyperDex Consistency and Replication
• Updates “relocate” values: on relocation, the chain contains both the old and new locations, preserving the ordering
• Acknowledgements purge state: once a write is acknowledged back through the chain, old state is purged from the old locations

Slide 121

Slide 121 text

No content

Slide 122

Slide 122 text

HyperDex 
 Consistency and Replication • “Point leader” includes sequencing information
 To resolve out of order delivery for different length chains, sequencing information is included in the messages 56

Slide 123

Slide 123 text

HyperDex Consistency and Replication
• “Point leader” includes sequencing information: to resolve out-of-order delivery across different-length chains, sequencing information is included in the messages
• Each “node” can be a chain itself: fault tolerance is achieved by making each node in the hyperspace mapping an instance of chain replication

Slide 124

Slide 124 text

No content

Slide 125

Slide 125 text

HyperDex 
 Consistency and Replication • Per-key Linearizability
 Linearizable for all operations, all clients see the same order of events 58

Slide 126

Slide 126 text

HyperDex Consistency and Replication
• Per-key Linearizability: linearizable for all operations; all clients see the same order of events
• Search Consistency: search results are guaranteed to return all objects committed at the time of the request

Slide 127

Slide 127 text

Failures, tho? 59

Slide 128

Slide 128 text

No content

Slide 129

Slide 129 text

ChainReaction: a Causal+ Consistent Datastore based on Chain Replication
Almeida, Leitão, Rodrigues, EuroSys 2013

Slide 130

Slide 130 text

ChainReaction: Motivation and Contributions • Per-Key Linearizability
 Too expensive in the geo-replicated scenario 62

Slide 131

Slide 131 text

ChainReaction: Motivation and Contributions • Per-Key Linearizability
 Too expensive in the geo-replicated scenario • Causal+ Consistency
 Causal consistency with guaranteed convergence 62

Slide 132

Slide 132 text

ChainReaction: Motivation and Contributions • Per-Key Linearizability
 Too expensive in the geo-replicated scenario • Causal+ Consistency
 Causal consistency with guaranteed convergence • Low Metadata Overhead
 Ensure metadata does not cause explosive growth 62

Slide 133

Slide 133 text

ChainReaction: Motivation and Contributions
• Per-Key Linearizability: too expensive in the geo-replicated scenario
• Causal+ Consistency: causal consistency with guaranteed convergence
• Low Metadata Overhead: ensure metadata does not grow explosively
• Geo-Replication: define an optimal strategy for geo-replication of data

Slide 134

Slide 134 text

ChainReaction: Conflict Resolution 63

Slide 135

Slide 135 text

ChainReaction: Conflict Resolution • “Last Writer Wins”
 Convergent given a “synchronized” physical clock 63

Slide 136

Slide 136 text

ChainReaction: Conflict Resolution
• “Last Writer Wins”: convergent given a “synchronized” physical clock
• Antidote, etc.: show that CRDTs can be used in practice to make this more deterministic

Slide 137

Slide 137 text

ChainReaction: Single Datacenter Operation • Causal Reads from K Nodes
 Given UPI, assume reads from K-1 nodes observe causal consistency for keys 64

Slide 138

Slide 138 text

ChainReaction: Single Datacenter Operation • Causal Reads from K Nodes
 Given UPI, assume reads from K-1 nodes observe causal consistency for keys • Explicit Causality (not Potential)
 Explicit list of operations that are causally related to submitted update; multiple objects, cross chain 64

Slide 139

Slide 139 text

ChainReaction: Single Datacenter Operation
• Causal Reads from K Nodes: given the UPI, reads from K-1 nodes observe causal consistency for keys
• Explicit Causality (not Potential): an explicit list of operations that are causally related to the submitted update; multiple objects, across chains
• “Datacenter Stability”: the update is stable within a particular datacenter and no previous update will ever be observed

Slide 140

Slide 140 text

ChainReaction: Multi Datacenter Operation • Tracking with DC-based “version vector”
 “Remote proxy” used to establish a DC-based version vector 65

Slide 141

Slide 141 text

ChainReaction: Multi Datacenter Operation • Tracking with DC-based “version vector”
 “Remote proxy” used to establish a DC-based version vector • Explicit Causality (not Potential)
 Apply only updates where causal dependencies are satisfied within the DC based on a local version vector 65

Slide 142

Slide 142 text

ChainReaction: Multi Datacenter Operation
• Tracking with a DC-based “version vector”: a “remote proxy” is used to establish a DC-based version vector
• Explicit Causality (not Potential): apply only updates whose causal dependencies are satisfied within the DC, based on a local version vector (see the sketch below)
• “Global Stability”: the update is stable within all datacenters and no previous update will ever be observed
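
A minimal sketch of that dependency check (my illustration of the standard version-vector condition, not ChainReaction's code): a remote update is applied only once the local datacenter's vector already covers every dependency recorded with the update.

```python
# Hypothetical sketch: hold back a remote update until the local DC's version
# vector dominates the update's causal dependencies, then apply and advance.

def covers(local: dict, deps: dict) -> bool:
    """True if the local vector has seen at least everything in deps."""
    return all(local.get(dc, 0) >= n for dc, n in deps.items())

def try_apply(local: dict, update: dict) -> bool:
    """update = {'origin': DC name, 'seq': counter at origin, 'deps': vector}."""
    deps = dict(update["deps"])
    # The update must also be the next one expected from its origin DC.
    deps[update["origin"]] = max(deps.get(update["origin"], 0), update["seq"] - 1)
    if not covers(local, deps):
        return False                      # buffer it; dependencies still missing
    local[update["origin"]] = update["seq"]
    return True

local = {"dc-eu": 3, "dc-us": 5}
u = {"origin": "dc-us", "seq": 6, "deps": {"dc-eu": 4}}
print(try_apply(local, u))   # False: we have not yet seen dc-eu's update 4
local["dc-eu"] = 4
print(try_apply(local, u))   # True: dependencies satisfied, vector advances
```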

Slide 143

Slide 143 text

(Diagram: chain nodes 1-4; a read operation is serviced by node 2. The UPI guarantees for this chain that node 1 is causally consistent with those operations.)

Slide 144

Slide 144 text

Leveraging Sharding in the Design of Scalable Replication Protocols
Abu-Libdeh, van Renesse, Vigfusson
SOSP 2011 Poster Session; SoCC 2013

Slide 145

Slide 145 text

Elastic Replication: Motivation and Contributions • Customizable Consistency
 Decrease latency for weaker guarantees regarding consistency 68

Slide 146

Slide 146 text

Elastic Replication: Motivation and Contributions • Customizable Consistency
 Decrease latency for weaker guarantees regarding consistency • Robust Consistency
 Consistency does not require accurate failure detection 68

Slide 147

Slide 147 text

Elastic Replication: Motivation and Contributions • Customizable Consistency
 Decrease latency for weaker guarantees regarding consistency • Robust Consistency
 Consistency does not require accurate failure detection • Smooth Reconfiguration
 Reconfiguration can occur without a central configuration service 68

Slide 148

Slide 148 text

Fail-Stop: Challenges 69

Slide 149

Slide 149 text

Fail-Stop: Challenges • Primary-Backup
 False suspicion can lead to promotion of a backup while concurrent writes on the non-failed primary can be read 69

Slide 150

Slide 150 text

Fail-Stop: Challenges • Primary-Backup
 False suspicion can lead to promotion of a backup while concurrent writes on the non-failed primary can be read • Quorum Intersection
 Under reconfiguration, quorums may not intersect for all clients 69

Slide 151

Slide 151 text

Elastic Replication: Algorithm 70

Slide 152

Slide 152 text

Elastic Replication: Algorithm • Replicas contain a history of commands
 Commands are sequenced by the head of the chain 70

Slide 153

Slide 153 text

Elastic Replication: Algorithm • Replicas contain a history of commands
 Commands are sequenced by the head of the chain • Stable prefix
 As commands are acknowledged, each replica reports the length of its stable prefix 70

Slide 154

Slide 154 text

Elastic Replication: Algorithm
• Replicas contain a history of commands: commands are sequenced by the head of the chain
• Stable prefix: as commands are acknowledged, each replica reports the length of its stable prefix
• Greatest common prefix is “learned”: the sequencer promotes the greatest common prefix across the replicas (see the sketch below)
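
A minimal sketch of this learning rule (my illustration, not the paper's code; class and method names are made up):

```python
# Hypothetical sketch of the stability rule: the head (sequencer) appends
# commands to a history, pushes it to the replicas, and the greatest common
# prefix reported by all replicas is what becomes stable.

class ElasticReplica:
    def __init__(self):
        self.history = []            # sequenced commands, in order

    def adopt(self, history):
        # Adopt the history pushed by the sequencer (in a real, asynchronous
        # system only a prefix may have arrived so far).
        self.history = list(history)

    def reported_length(self):
        return len(self.history)

class ElasticShard:
    def __init__(self, replicas):
        self.replicas = replicas
        self.sequencer = replicas[0]     # the head sequences all commands
        self.stable = 0                  # length of the learned, stable prefix

    def add_op(self, op):
        self.sequencer.history.append(op)
        for replica in self.replicas[1:]:
            replica.adopt(self.sequencer.history)
        # The greatest common prefix across all replicas becomes stable.
        self.stable = min(r.reported_length() for r in self.replicas)
        return self.stable

shard = ElasticShard([ElasticReplica() for _ in range(3)])
print(shard.add_op("put x 1"))   # 1: stable once every replica has adopted it
```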

Slide 155

Slide 155 text

(Diagram: a chain of replicas 1-4; an addOp is sequenced, replicas adoptHistory and acknowledge their greatest common prefix (gcp), and learnPersistence marks that prefix stable.)

Slide 156

Slide 156 text

Elastic Replication: Algorithm 72

Slide 157

Slide 157 text

Elastic Replication: Algorithm • Safety
 When nodes suspect a failure in the network, nodes “wedge” where no operations can be applied 72

Slide 158

Slide 158 text

Elastic Replication: Algorithm • Safety
 When nodes suspect a failure in the network, nodes “wedge” where no operations can be applied • Only updates in the history may become stable 72

Slide 159

Slide 159 text

Elastic Replication: Algorithm • Safety
 When nodes suspect a failure in the network, nodes “wedge” where no operations can be applied • Only updates in the history may become stable • Liveness
 Replicas and chains are reconfigured to ensure progress 72

Slide 160

Slide 160 text

Elastic Replication: Algorithm
• Safety: when nodes suspect a failure in the network, they “wedge”, after which no further operations can be applied
• Only updates already in the history may become stable
• Liveness: replicas and chains are reconfigured to ensure progress
• History is inherited from existing replicas and reconfigured to preserve the UPI

Slide 161

Slide 161 text

(Diagram: a chain of replicas 1-4; an addOp is sequenced, replicas adoptHistory and acknowledge their greatest common prefix (gcp), and learnPersistence marks that prefix stable.)

Slide 162

Slide 162 text

Elastic Replication: Elastic Bands 74

Slide 163

Slide 163 text

Elastic Replication: Elastic Bands • Horizontal partitioning
 Requests are sharded across elastic bands for scalability 74

Slide 164

Slide 164 text

Elastic Replication: Elastic Bands • Horizontal partitioning
 Requests are sharded across elastic bands for scalability • Shards configure neighboring shards
 Shards are responsible for sequencing configurations of neighboring shards 74

Slide 165

Slide 165 text

Elastic Replication: Elastic Bands • Horizontal partitioning
 Requests are sharded across elastic bands for scalability • Shards configure neighboring shards
 Shards are responsible for sequencing configurations of neighboring shards • Requires external configuration
 Even with this, band configuration must be managed by an external configuration service 74

Slide 166

Slide 166 text

No content

Slide 167

Slide 167 text

Elastic Replication: 
 Read Operations 76

Slide 168

Slide 168 text

Elastic Replication: 
 Read Operations • Read requests must be sent down chain
 Read operations must be sequenced for the system to properly determine if a configuration has been wedged 76

Slide 169

Slide 169 text

Elastic Replication: Read Operations
• Read requests must be sent down the chain: read operations must be sequenced for the system to properly determine whether a configuration has been wedged
• Reads can be serviced by other nodes: reading from the stabilized prefix gives a weaker form of consistency

Slide 170

Slide 170 text

In Summary • “Fail-Stop” Assumption
 In practice, fail-stop can be a difficult model to provide given the imperfections in VMs, networks, and programming abstractions 77

Slide 171

Slide 171 text

In Summary • “Fail-Stop” Assumption
 In practice, fail-stop can be a difficult model to provide given the imperfections in VMs, networks, and programming abstractions • Consensus
 Consensus still required for configuration, as much as we attempt to remove it from the system 77

Slide 172

Slide 172 text

In Summary • “Fail-Stop” Assumption
 In practice, fail-stop can be a difficult model to provide given the imperfections in VMs, networks, and programming abstractions • Consensus
 Consensus still required for configuration, as much as we attempt to remove it from the system • Chain Replication
 Strong technique for providing linearizability, which requires only f + 1 nodes for failure tolerance 77

Slide 173

Slide 173 text

Thanks!
Christopher Meiklejohn @cmeik