Slide 1

Slide 1 text

Peter Bailis, Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica PBS

Slide 2

Slide 2 text

PBS: Probabilistically Bounded Staleness for Practical Partial Quorums
Peter Bailis, Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica
UC Berkeley, VLDB 2012

Slide 3

Slide 3 text

R+W

Slide 4

Slide 4 text

R+W strong consistency

Slide 5

Slide 5 text

R+W strong consistency eventual consistency

Slide 6

Slide 6 text

R+W strong consistency higher latency eventual consistency

Slide 7

Slide 7 text

R+W strong consistency higher latency eventual consistency lower latency

Slide 8

Slide 8 text

R+W strong consistency higher latency eventual consistency lower latency

Slide 9

Slide 9 text

consistency is a binary choice

Slide 10

Slide 10 text

consistency is a binary choice: strong, eventual

Slide 11

Slide 11 text

consistency is a continuum, from strong to eventual

Slide 12

Slide 12 text

consistency is a continuum, from strong to eventual

Slide 13

Slide 13 text

consistency is a continuum, from strong to eventual

Slide 14

Slide 14 text

our focus: latency vs. consistency

Slide 15

Slide 15 text

our focus: latency vs. consistency

Slide 16

Slide 16 text

our focus: latency vs. consistency, informed by practice

Slide 17

Slide 17 text

our focus: latency vs. consistency, informed by practice
not in this talk: availability, partitions, failures

Slide 18

Slide 18 text

our contributions:
quantify eventual consistency: wall-clock time (“how eventual?”), versions (“how consistent?”)
analyze real-world systems: EC is often strongly consistent
describe when and why

Slide 19

Slide 19 text

intro system model practice metrics insights integration

Slide 20

Slide 20 text

Dynamo: Amazon’s Highly Available Key-value Store SOSP 2007

Slide 21

Slide 21 text

Apache, DataStax Project Voldemort Dynamo: Amazon’s Highly Available Key-value Store SOSP 2007

Slide 22

Slide 22 text

Adobe Cisco Digg Gowalla IBM Morningstar Netflix Palantir Rackspace Reddit Rhapsody Shazam Spotify Soundcloud Twitter Mozilla Ask.com Yammer Aol GitHub JoyentCloud Best Buy LinkedIn Boeing Comcast Cassandra Riak Voldemort Gilt Groupe

Slide 23

Slide 23 text

N replicas/key read: wait for R replies write: wait for W acks

Slide 24

Slide 24 text

N replicas/key read: wait for R replies write: wait for W acks

Slide 25

Slide 25 text

N replicas/key read: wait for R replies write: wait for W acks N=3

Slide 26

Slide 26 text

N replicas/key read: wait for R replies write: wait for W acks N=3

Slide 27

Slide 27 text

N replicas/key read: wait for R replies write: wait for W acks N=3

Slide 28

Slide 28 text

N replicas/key read: wait for R replies write: wait for W acks N=3 R=2

Slide 29

Slide 29 text

N replicas/key read: wait for R replies write: wait for W acks N=3 R=2
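One way to read the R and W knobs: if every replica answers after some delay, the operation finishes when the R-th (or W-th) fastest reply arrives. A minimal Python sketch of that idea follows; the function name quorum_latency and the latency values are illustrative assumptions, not from the talk.

```python
def quorum_latency(replica_latencies_ms, k):
    """Return when the coordinator has heard from k replicas.

    replica_latencies_ms holds one round-trip time per replica (N values);
    waiting for k replies costs the k-th smallest of them.
    """
    if not 1 <= k <= len(replica_latencies_ms):
        raise ValueError("k must be between 1 and N")
    return sorted(replica_latencies_ms)[k - 1]


# Hypothetical round-trip times for N=3 replicas, in milliseconds.
latencies_ms = [4.0, 9.5, 31.0]
for r in (1, 2, 3):
    print(r, quorum_latency(latencies_ms, r))
```

Waiting for more replies can only move the answer toward the slowest replica, which is why larger R or W raises tail latency.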

Slide 30

Slide 30 text

if: R+W > N
then: “strong” consistency
else: eventual consistency
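The R+W > N rule can be written as a one-line check. The sketch below is only an illustration; the function name consistency_mode is an assumption, not part of any of the systems discussed.

```python
def consistency_mode(n, r, w):
    """Classify an (N, R, W) setting by the R+W > N rule."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("need 1 <= R <= N and 1 <= W <= N")
    return "strong" if r + w > n else "eventual"


print(consistency_mode(3, 2, 2))  # strong
print(consistency_mode(3, 1, 1))  # eventual
```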

Slide 31

Slide 31 text

“strong” consistency (R+W > N): reads return the last acknowledged write or an in-flight write (per-key); regular register

Slide 32

Slide 32 text

Latency LinkedIn disk-based model N=3

Slide 33

Slide 33 text

Latency, LinkedIn disk-based model, N=3
R    99th     99.9th
1    1x       1x
2    1.59x    2.35x
3    4.8x     6.13x

Slide 34

Slide 34 text

Latency, LinkedIn disk-based model, N=3
R    99th     99.9th
1    1x       1x
2    1.59x    2.35x
3    4.8x     6.13x
W    99th     99.9th
1    1x       1x
2    2.01x    1.9x
3    4.96x    14.96x

Slide 35

Slide 35 text

⇧ consistency, ⇧ latency: wait for more replicas, read more recent data
⇩ consistency, ⇩ latency: wait for fewer replicas, read less recent data

Slide 36

Slide 36 text

⇧ consistency, ⇧ latency: wait for more replicas, read more recent data
⇩ consistency, ⇩ latency: wait for fewer replicas, read less recent data

Slide 37

Slide 37 text

eventual consistency (R+W ≤ N): “if no new updates are made to the object, eventually all accesses will return the last updated value” (W. Vogels, CACM 2008)

Slide 38

Slide 38 text

How eventual? How long do I have to wait?

Slide 39

Slide 39 text

How consistent? What happens if I don’t wait?

Slide 40

Slide 40 text

R+W strong consistency higher latency eventual consistency lower latency

Slide 41

Slide 41 text

R+W strong consistency higher latency eventual consistency lower latency

Slide 42

Slide 42 text

R+W strong consistency higher latency eventual consistency lower latency

Slide 43

Slide 43 text

intro system model practice metrics insights integration

Slide 44

Slide 44 text

Cassandra: R=W=1, N=3 by default (1+1 ≯ 3)

Slide 45

Slide 45 text

eventual consistency in the wild: “maximum performance”, “very low latency”, okay for “most data”, “general case”

Slide 46

Slide 46 text

anecdotally, EC “good enough” for many kinds of data

Slide 47

Slide 47 text

anecdotally, EC “good enough” for many kinds of data How eventual? How consistent?

Slide 48

Slide 48 text

anecdotally, EC “good enough” for many kinds of data How eventual? How consistent? “eventual and consistent enough”

Slide 49

Slide 49 text

Can we do better?

Slide 50

Slide 50 text

Can we do better? can’t make promises, can give expectations

Slide 51

Slide 51 text

Can we do better? can’t make promises, can give expectations: Probabilistically Bounded Staleness

Slide 52

Slide 52 text

intro system model practice metrics insights integration

Slide 53

Slide 53 text

How eventual? How long do I have to wait?

Slide 54

Slide 54 text

How eventual?

Slide 55

Slide 55 text

How eventual? t-visibility: probability p of consistent reads after t seconds (e.g., 10ms after write, 99.9% of reads consistent)

Slide 56

Slide 56 text

t-visibility depends on messaging and processing delays

Slide 57

Slide 57 text

Coordinator Replica once per replica Time

Slide 58

Slide 58 text

Coordinator Replica write once per replica Time

Slide 59

Slide 59 text

Coordinator Replica write ack once per replica Time

Slide 60

Slide 60 text

Coordinator Replica write ack wait for W responses once per replica Time

Slide 61

Slide 61 text

Coordinator Replica write ack wait for W responses t seconds elapse once per replica Time

Slide 62

Slide 62 text

Coordinator Replica write ack read wait for W responses t seconds elapse once per replica Time

Slide 63

Slide 63 text

Coordinator Replica write ack read response wait for W responses t seconds elapse once per replica Time

Slide 64

Slide 64 text

Coordinator Replica write ack read response wait for W responses t seconds elapse wait for R responses once per replica Time

Slide 65

Slide 65 text

Coordinator Replica write ack read response wait for W responses t seconds elapse wait for R responses response is stale if read arrives before write once per replica Time

Slide 66

Slide 66 text

Coordinator Replica write ack read response wait for W responses t seconds elapse wait for R responses response is stale if read arrives before write once per replica Time

Slide 67

Slide 67 text

Coordinator Replica write ack read response wait for W responses t seconds elapse wait for R responses response is stale if read arrives before write once per replica Time

Slide 68

Slide 68 text

Coordinator Replica write ack read response wait for W responses t seconds elapse wait for R responses response is stale if read arrives before write once per replica Time

Slide 69

Slide 69 text

R1 N=2 Time Alice R2

Slide 70

Slide 70 text

R1 write N=2 Time Alice R2

Slide 71

Slide 71 text

write ack N=2 Time Alice R2 R1

Slide 72

Slide 72 text

write ack W=1 N=2 Time Alice R2 R1

Slide 73

Slide 73 text

write ack W=1 N=2 Time Alice R2 R1

Slide 74

Slide 74 text

write ack W=1 N=2 Time Alice Bob R2 R1

Slide 75

Slide 75 text

write ack read W=1 N=2 Time Alice Bob R2 R1

Slide 76

Slide 76 text

R2 write ack read W=1 N=2 Time Alice Bob R2 R1

Slide 77

Slide 77 text

R2 write ack read W=1 N=2 response Time Alice Bob R2 R1

Slide 78

Slide 78 text

R2 write ack read W=1 R=1 N=2 response Time Alice Bob R2 R1

Slide 79

Slide 79 text

R2 write ack read W=1 R=1 N=2 response Time Alice Bob R2 R1 inconsistent

Slide 80

Slide 80 text

R2 write ack read W=1 R=1 N=2 response Time Alice Bob R2 R1 inconsistent

Slide 81

Slide 81 text

R2 write ack read W=1 R=1 N=2 response Time Alice Bob R2 R1 R2 inconsistent

Slide 82

Slide 82 text

Coordinator Replica write ack read response wait for W responses t seconds elapse wait for R responses response is stale if read arrives before write once per replica Time

Slide 83

Slide 83 text

Coordinator Replica write (W) ack read response wait for W responses t seconds elapse wait for R responses response is stale if read arrives before write once per replica Time

Slide 84

Slide 84 text

Coordinator Replica write (W) ack (A) read response wait for W responses t seconds elapse wait for R responses response is stale if read arrives before write once per replica Time

Slide 85

Slide 85 text

Coordinator Replica write (W) ack (A) read (R) response wait for W responses t seconds elapse wait for R responses response is stale if read arrives before write once per replica Time

Slide 86

Slide 86 text

Coordinator Replica write (W) ack (A) read (R) response (S) wait for W responses t seconds elapse wait for R responses response is stale if read arrives before write once per replica Time

Slide 87

Slide 87 text

solving WARS analytically: order statistics over dependent variables
Instead: Monte Carlo methods

Slide 88

Slide 88 text

to use WARS:
gather latency data
W: 53.2 44.5 101.1 ...
A: 10.3 8.2 11.3 ...
R: 15.3 22.4 19.8 ...
S: 9.6 14.2 6.7 ...
run simulation (Monte Carlo, sampling)

Slide 89

Slide 89 text

to use WARS:
gather latency data
W: 53.2 44.5 101.1 ...
A: 10.3 8.2 11.3 ...
R: 15.3 22.4 19.8 ...
S: 9.6 14.2 6.7 ...
run simulation (Monte Carlo, sampling)
sampled: W=44.5

Slide 90

Slide 90 text

to use WARS:
gather latency data
W: 53.2 44.5 101.1 ...
A: 10.3 8.2 11.3 ...
R: 15.3 22.4 19.8 ...
S: 9.6 14.2 6.7 ...
run simulation (Monte Carlo, sampling)
sampled: W=44.5, A=11.3

Slide 91

Slide 91 text

to use WARS:
gather latency data
W: 53.2 44.5 101.1 ...
A: 10.3 8.2 11.3 ...
R: 15.3 22.4 19.8 ...
S: 9.6 14.2 6.7 ...
run simulation (Monte Carlo, sampling)
sampled: W=44.5, A=11.3, R=15.3

Slide 92

Slide 92 text

to use WARS:
gather latency data
W: 53.2 44.5 101.1 ...
A: 10.3 8.2 11.3 ...
R: 15.3 22.4 19.8 ...
S: 9.6 14.2 6.7 ...
run simulation (Monte Carlo, sampling)
sampled: W=44.5, A=11.3, R=15.3, S=14.2
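The sampling procedure can be sketched in a few lines of Python. This is a simplified reading of the WARS diagram, not the authors' implementation: it assumes the coordinator sends every write and read to all N replicas, that the four delays are sampled independently per replica, and that a reply is fresh when the write reached that replica no later than the read did. The name estimate_consistency is hypothetical, and the sample lists contain only the three values per column that are visible on the slide.

```python
import random

# One-way delays in ms. Only the first three samples per column appear on the
# slide; a real trace would contain many more values.
WARS_SAMPLES = {
    "W": [53.2, 44.5, 101.1],  # coordinator -> replica: write
    "A": [10.3, 8.2, 11.3],    # replica -> coordinator: ack
    "R": [15.3, 22.4, 19.8],   # coordinator -> replica: read
    "S": [9.6, 14.2, 6.7],     # replica -> coordinator: response
}


def estimate_consistency(n, r, w, t, samples=WARS_SAMPLES, trials=20000, seed=0):
    """Estimate P(consistent read) for a read issued t ms after the write commits."""
    rng = random.Random(seed)
    consistent = 0
    for _ in range(trials):
        # Sample one W, A, R and S delay for each replica.
        wd = [rng.choice(samples["W"]) for _ in range(n)]
        ad = [rng.choice(samples["A"]) for _ in range(n)]
        rd = [rng.choice(samples["R"]) for _ in range(n)]
        sd = [rng.choice(samples["S"]) for _ in range(n)]
        # The write (sent at time 0) commits when the w-th ack arrives.
        commit = sorted(wd[i] + ad[i] for i in range(n))[w - 1]
        read_sent = commit + t
        # The coordinator answers after the first r read responses.
        responders = sorted(range(n), key=lambda i: rd[i] + sd[i])[:r]
        # A reply is fresh if the write arrived at that replica before the read.
        if any(wd[i] <= read_sent + rd[i] for i in responders):
            consistent += 1
    return consistent / trials


for t_ms in (0.0, 10.0, 100.0):
    print(t_ms, estimate_consistency(n=3, r=1, w=1, t=t_ms))
```

Sweeping t and reading off the smallest value that reaches a target probability gives a t-visibility estimate for the sampled latencies.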

Slide 93

Slide 93 text

WARS accuracy (real Cassandra cluster, varying latencies):
t-visibility RMSE: 0.28%
latency N-RMSE: 0.48%

Slide 94

Slide 94 text

How eventual? t-visibility: consistent reads with probability p after t seconds
key: WARS model
need: latencies

Slide 95

Slide 95 text

intro system model practice metrics insights integration

Slide 96

Slide 96 text

Yammer: 100K+ companies, uses Riak
LinkedIn: 175M+ users, built and uses Voldemort
production latencies fit Gaussian mixtures

Slide 97

Slide 97 text

N=3

Slide 98

Slide 98 text

10 ms N=3

Slide 99

Slide 99 text

LNKD-DISK, N=3
100% consistent: R=3, W=1, Latency: 15.01 ms
99.9% consistent: R=2, W=1, t = 13.6 ms, Latency: 12.53 ms
Latency is combined read and write latency at 99.9th percentile

Slide 100

Slide 100 text

LNKD-DISK, N=3
100% consistent: R=3, W=1, Latency: 15.01 ms
99.9% consistent: R=2, W=1, t = 13.6 ms, Latency: 12.53 ms (16.5% faster)
Latency is combined read and write latency at 99.9th percentile

Slide 101

Slide 101 text

LNKD-DISK, N=3
100% consistent: R=3, W=1, Latency: 15.01 ms
99.9% consistent: R=2, W=1, t = 13.6 ms, Latency: 12.53 ms (16.5% faster)
worthwhile?
Latency is combined read and write latency at 99.9th percentile
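The 16.5% figure for LNKD-DISK follows from the two latencies on the slide, taken as a reduction relative to the slower configuration. A small check in Python (the function name percent_faster is an illustrative assumption):

```python
def percent_faster(baseline_ms, faster_ms):
    """Latency reduction relative to the baseline, in percent."""
    return 100.0 * (baseline_ms - faster_ms) / baseline_ms


# LNKD-DISK: 15.01 ms (R=3, W=1) versus 12.53 ms (R=2, W=1).
print(round(percent_faster(15.01, 12.53), 1))  # 16.5
```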

Slide 102

Slide 102 text

N=3

Slide 103

Slide 103 text

N=3

Slide 104

Slide 104 text

N=3

Slide 105

Slide 105 text

LNKD-SSD, N=3
100% consistent: R=3, W=1, Latency: 4.20 ms
99.9% consistent: R=1, W=1, t = 1.85 ms, Latency: 1.32 ms
Latency is combined read and write latency at 99.9th percentile

Slide 106

Slide 106 text

LNKD-SSD, N=3
100% consistent: R=3, W=1, Latency: 4.20 ms
99.9% consistent: R=1, W=1, t = 1.85 ms, Latency: 1.32 ms (59.5% faster)
Latency is combined read and write latency at 99.9th percentile

Slide 107

Slide 107 text

[Figure: CDFs of write latency (ms) for W=1, W=2, W=3; series LNKD-SSD and LNKD-DISK; N=3]

Slide 108

Slide 108 text

[Figure: CDFs of write latency (ms) for W=1, W=2, W=3; series LNKD-SSD and LNKD-DISK; N=3]

Slide 109

Slide 109 text

[Figure: CDFs of write latency (ms) for W=1, W=2, W=3; series LNKD-SSD and LNKD-DISK; N=3]

Slide 110

Slide 110 text

Coordinator Replica write ack (A) (W) response (S) (R) wait for W responses t seconds elapse wait for R responses response is stale if read arrives before write once per replica SSDs reduce variance compared to disks! read

Slide 111

Slide 111 text

Yammer, N=3, 99.9th: latency ⇩ 81.1% (187 ms), t-visibility 202 ms

Slide 112

Slide 112 text

How consistent? in the paper: k-staleness (versions), monotonic reads, quorum load

Slide 113

Slide 113 text

in the paper: ⟨k, t⟩-staleness: versions and time

Slide 114

Slide 114 text

in the paper: latency distributions, WAN model, varying quorum sizes, staleness detection

Slide 115

Slide 115 text

intro system model practice metrics insights integration

Slide 116

Slide 116 text

Integration: 1. Tracing 2. Simulation 3. Tune N, R, W
Project Voldemort

Slide 117

Slide 117 text

https://issues.apache.org/jira/browse/CASSANDRA-4261

Slide 118

Slide 118 text

No content

Slide 119

Slide 119 text

No content

Slide 120

Slide 120 text

Related Work
Quorum Systems
• probabilistic quorums [PODC ’97]
• deterministic k-quorums [DISC ’05, ’06]
Consistency Verification
• Golab et al. [PODC ’11]
• Bermbach and Tai [M4WSOC ’11]
• Wada et al. [CIDR ’11]
• Anderson et al. [HotDep ’10]
• Transactional consistency: Zellag and Kemme [ICDE ’11], Fekete et al. [VLDB ’09]
Latency-Consistency
• Daniel Abadi [Computer ’12]
• Kraska et al. [VLDB ’09]
Bounded Staleness Guarantees
• TACT [OSDI ’00]
• FRACS [ICDCS ’03]
• AQuA [IEEE TPDS ’03]

Slide 121

Slide 121 text

R+W strong consistency higher latency eventual consistency lower latency

Slide 122

Slide 122 text

consistency is a

Slide 123

Slide 123 text

consistency is a continuum

Slide 124

Slide 124 text

consistency is a continuum, from strong to eventual

Slide 125

Slide 125 text

consistency is a continuum, from strong to eventual

Slide 126

Slide 126 text

PBS
quantify eventual consistency: model staleness in time, versions
latency-consistency trade-offs: analyze real systems and hardware

Slide 127

Slide 127 text

PBS
quantify eventual consistency: model staleness in time, versions
latency-consistency trade-offs: analyze real systems and hardware
quantify which choice is best and explain why EC is often strongly consistent

Slide 128

Slide 128 text

PBS
quantify eventual consistency: model staleness in time, versions
latency-consistency trade-offs: analyze real systems and hardware
quantify which choice is best and explain why EC is often strongly consistent
pbs.cs.berkeley.edu

Slide 129

Slide 129 text

Extra Slides

Slide 130

Slide 130 text

Non-expanding Quorum Systems: e.g., probabilistic quorums (PODC ’97), deterministic k-quorums (DISC ’05, ’06)
Bounded Staleness Guarantees: e.g., TACT (OSDI ’00), FRACS (ICDCS ’03)

Slide 131

Slide 131 text

Consistency Verification: e.g., Golab et al. (PODC ’11), Bermbach and Tai (M4WSOC ’11), Wada et al. (CIDR ’11)
Latency-Consistency: Daniel Abadi (IEEE Computer ’12)

Slide 132

Slide 132 text

PBS and apps

Slide 133

Slide 133 text

staleness requires either:
staleness-tolerant data structures: timelines, logs (cf. commutative data structures, logical monotonicity)
asynchronous compensation code: detect violations after data is returned (see paper); write code to fix any errors (cf. “Building on Quicksand”: memories, guesses, apologies)

Slide 134

Slide 134 text

asynchronous compensation
minimize: (compensation cost) × (# of expected anomalies)
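The quantity to minimize can be estimated once a consistency probability is known. The Python sketch below assumes that every stale read counts as one anomaly and that each anomaly has the same fixed cost; the function name and all numbers are hypothetical.

```python
def expected_compensation_cost(reads, p_consistent, cost_per_anomaly):
    """(compensation cost) x (# of expected anomalies) for a batch of reads."""
    expected_anomalies = reads * (1.0 - p_consistent)
    return cost_per_anomaly * expected_anomalies


# Hypothetical workload: 1,000,000 reads, 99.9% consistent, 0.05 cost units
# to compensate for each stale read.
print(expected_compensation_cost(1_000_000, 0.999, 0.05))
```

Comparing this figure across candidate (N, R, W) settings, alongside their latencies, is one way to pick a configuration.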

Slide 135

Slide 135 text

Read only newer data? (monotonic reads session guarantee)
tolerable staleness (# versions, for a given key) = global write rate / client’s read rate

Slide 136

Slide 136 text

Failure?

Slide 137

Slide 137 text

Treat failures as latency spikes

Slide 138

Slide 138 text

How long do partitions last?

Slide 139

Slide 139 text

what time interval? 99.9% uptime/yr ⇒ 8.76 hours downtime/yr
8.76 consecutive hours down ⇒ bad 8-hour rolling average

Slide 140

Slide 140 text

what time interval? 99.9% uptime/yr ⇒ 8.76 hours downtime/yr
8.76 consecutive hours down ⇒ bad 8-hour rolling average
hide in tail of distribution OR continuously evaluate SLA, adjust
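The downtime figure is plain arithmetic over a 365-day year. A short Python check (the function name is an illustrative assumption):

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(uptime_fraction):
    """Hours of allowed downtime per year for a given uptime fraction."""
    return (1.0 - uptime_fraction) * HOURS_PER_YEAR


print(round(downtime_hours_per_year(0.999), 2))  # 8.76
```

The same yearly budget looks very different over a short window: if all 8.76 hours fall inside one 8-hour window, that window has 0% uptime.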

Slide 141

Slide 141 text

[Figure: CDFs of write latency (ms) for W=1, W=2, W=3; series LNKD-SSD, LNKD-DISK, YMMR, WAN; N=3]

Slide 142

Slide 142 text

[Figure: latency CDFs; panels labeled R=3, W=1, W=2; x-axis labeled Write Latency (ms); series LNKD-SSD, LNKD-DISK, YMMR, WAN; N=3]
(LNKD-SSD and LNKD-DISK identical for reads)

Slide 143

Slide 143 text

⟨k, t⟩-staleness: versions and time

Slide 144

Slide 144 text

⟨k, t⟩-staleness: versions and time
approximation: exponentiate t-staleness by k
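One reading of this approximation, sketched in Python: take the probability that a read at time t misses the latest version and raise it to the k-th power to approximate the probability of missing all of the last k versions. This treats staleness with respect to successive versions as independent, which is an assumption of the sketch; the function name is hypothetical.

```python
def prob_within_k_versions(p_stale_at_t, k):
    """Approximate P(read returns one of the last k versions) at time t."""
    if not 0.0 <= p_stale_at_t <= 1.0:
        raise ValueError("p_stale_at_t must be a probability")
    if k < 1:
        raise ValueError("k must be at least 1")
    # Missing all k recent versions is approximated as k independent misses.
    return 1.0 - p_stale_at_t ** k


for k in (1, 2, 3):
    print(k, prob_within_k_versions(0.1, k))
```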

Slide 145

Slide 145 text

“strong” consistency: reads return the last written value or newer (defined w.r.t. real time, when the read started)

Slide 146

Slide 146 text

N = 3 replicas: R1, R2, R3
Write to W, read from R replicas

Slide 147

Slide 147 text

N = 3 replicas: R1, R2, R3
Write to W, read from R replicas
R=W=3 replicas: {R1, R2, R3}
R=W=2 replicas: {R1, R2}, {R2, R3}, {R1, R3}
quorum system: guaranteed intersection

Slide 148

Slide 148 text

N = 3 replicas: R1, R2, R3
Write to W, read from R replicas
R=W=3 replicas: {R1, R2, R3}
R=W=2 replicas: {R1, R2}, {R2, R3}, {R1, R3}
quorum system: guaranteed intersection
R=W=1 replicas: {R1}, {R2}, {R3}
partial quorum system: may not intersect

Slide 149

Slide 149 text

Synthetic, Exponential Distributions N=3, W=1, R=1

Slide 150

Slide 150 text

Synthetic, Exponential Distributions W 1/4x ARS N=3, W=1, R=1

Slide 151

Slide 151 text

Synthetic, Exponential Distributions W 1/4x ARS W 10x ARS N=3, W=1, R=1

Slide 152

Slide 152 text

concurrent writes: deterministically choose Coordinator R=2 (“key”, 1) (“key”, 2)
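A deterministic choice among differing replies can be as simple as "highest version wins". The Python sketch below assumes versions are totally ordered integers, as in the (“key”, 1) and (“key”, 2) pairs on the slide; real systems typically rely on timestamps or vector clocks instead, and the function name resolve is hypothetical.

```python
def resolve(responses):
    """Pick one (key, version) reply deterministically: highest version wins."""
    if not responses:
        raise ValueError("need at least one response")
    return max(responses, key=lambda reply: reply[1])


# R=2: the coordinator received two replies that disagree.
print(resolve([("key", 1), ("key", 2)]))  # ('key', 2)
```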

Slide 153

Slide 153 text

No content

Slide 154

Slide 154 text

No content

Slide 155

Slide 155 text

No content

Slide 156

Slide 156 text

No content

Slide 157

Slide 157 text

N = 3 replicas Coordinator client read R=3 R1 R2 R3 (“key”, 1) (“key”, 1) (“key”, 1)

Slide 158

Slide 158 text

N = 3 replicas Coordinator client read(“key”) read R=3 R1 R2 R3 (“key”, 1) (“key”, 1) (“key”, 1)

Slide 159

Slide 159 text

N = 3 replicas Coordinator read(“key”) client read R=3 R1 R2 R3 (“key”, 1) (“key”, 1) (“key”, 1)

Slide 160

Slide 160 text

N = 3 replicas Coordinator (“key”, 1) (“key”, 1) (“key”, 1) client read R=3 R1 R2 R3 (“key”, 1) (“key”, 1) (“key”, 1)

Slide 161

Slide 161 text

N = 3 replicas Coordinator (“key”, 1) (“key”, 1) (“key”, 1) client (“key”, 1) read R=3 R1 R2 R3 (“key”, 1) (“key”, 1) (“key”, 1)

Slide 162

Slide 162 text

N = 3 replicas Coordinator read R=3 R1 R2 R3 (“key”, 1) (“key”, 1) (“key”, 1) client

Slide 163

Slide 163 text

N = 3 replicas Coordinator read(“key”) read R=3 R1 R2 R3 (“key”, 1) (“key”, 1) (“key”, 1) client

Slide 164

Slide 164 text

N = 3 replicas Coordinator read(“key”) read R=3 R1 R2 R3 (“key”, 1) (“key”, 1) (“key”, 1) client

Slide 165

Slide 165 text

N = 3 replicas Coordinator (“key”, 1) read R=3 R1 R2 R3 (“key”, 1) (“key”, 1) (“key”, 1) client

Slide 166

Slide 166 text

N = 3 replicas Coordinator (“key”, 1) (“key”, 1) read R=3 R1 R2 R3 (“key”, 1) (“key”, 1) (“key”, 1) client

Slide 167

Slide 167 text

N = 3 replicas Coordinator (“key”, 1) (“key”, 1) (“key”, 1) read R=3 R1 R2 R3 (“key”, 1) (“key”, 1) (“key”, 1) client

Slide 168

Slide 168 text

N = 3 replicas Coordinator (“key”, 1) (“key”, 1) (“key”, 1) (“key”, 1) read R=3 R1 R2 R3 (“key”, 1) (“key”, 1) (“key”, 1) client

Slide 169

Slide 169 text

N = 3 replicas Coordinator read R=1 R1 R2 R3 (“key”, 1) (“key”, 1) (“key”, 1) client

Slide 170

Slide 170 text

N = 3 replicas Coordinator read(“key”) read R=1 R1 R2 R3 (“key”, 1) (“key”, 1) (“key”, 1) client

Slide 171

Slide 171 text

N = 3 replicas Coordinator read(“key”) send read to all read R=1 R1 R2 R3 (“key”, 1) (“key”, 1) (“key”, 1) client

Slide 172

Slide 172 text

N = 3 replicas Coordinator (“key”, 1) send read to all read R=1 R1 R2 R3 (“key”, 1) (“key”, 1) (“key”, 1) client

Slide 173

Slide 173 text

N = 3 replicas Coordinator (“key”, 1) (“key”, 1) send read to all read R=1 R1 R2 R3 (“key”, 1) (“key”, 1) (“key”, 1) client

Slide 174

Slide 174 text

N = 3 replicas Coordinator (“key”, 1) (“key”, 1) (“key”, 1) send read to all read R=1 R1 R2 R3 (“key”, 1) (“key”, 1) (“key”, 1) client

Slide 175

Slide 175 text

N = 3 replicas Coordinator (“key”, 1) (“key”, 1) (“key”, 1) (“key”, 1) send read to all read R=1 R1 R2 R3 (“key”, 1) (“key”, 1) (“key”, 1) client

Slide 176

Slide 176 text

Coordinator W=1 R1(“key”, 1) R2(“key”, 1) R3(“key”, 1)

Slide 177

Slide 177 text

Coordinator write(“key”, 2) W=1 R1(“key”, 1) R2(“key”, 1) R3(“key”, 1)

Slide 178

Slide 178 text

Coordinator write(“key”, 2) W=1 R1(“key”, 1) R2(“key”, 1) R3(“key”, 1)

Slide 179

Slide 179 text

Coordinator ack(“key”, 2) W=1 R1 R2(“key”, 1) R3(“key”, 1) (“key”, 2)

Slide 180

Slide 180 text

Coordinator ack(“key”, 2) ack(“key”, 2) W=1 R1 R2(“key”, 1) R3(“key”, 1) (“key”, 2)

Slide 181

Slide 181 text

Coordinator Coordinator read(“key”) ack(“key”, 2) W=1 R1 R2(“key”, 1) R3(“key”, 1) (“key”, 2) R=1

Slide 182

Slide 182 text

Coordinator Coordinator ack(“key”, 2) W=1 R1 R2(“key”, 1) R3(“key”, 1) (“key”, 2) read(“key”) R=1

Slide 183

Slide 183 text

Coordinator Coordinator ack(“key”, 2) W=1 R1 R2(“key”, 1) R3(“key”, 1) (“key”, 2) (“key”, 1) R=1

Slide 184

Slide 184 text

Coordinator Coordinator ack(“key”, 2) W=1 R1 R2(“key”, 1) R3(“key”, 1) (“key”, 2) (“key”,1) R=1

Slide 185

Slide 185 text

Coordinator Coordinator ack(“key”, 2) W=1 R1 R2(“key”, 1) R3(“key”, 1) (“key”, 2) (“key”,1) R=1

Slide 186

Slide 186 text

Coordinator Coordinator ack(“key”, 2) W=1 R1 R2(“key”, 1) R3(“key”, 1) (“key”, 2) (“key”,1) R=1

Slide 187

Slide 187 text

Coordinator Coordinator ack(“key”, 2) W=1 R1 R2(“key”, 1) R3(“key”, 1) (“key”, 2) (“key”,1) R=1

Slide 188

Slide 188 text

(“key”, 2) Coordinator Coordinator ack(“key”, 2) W=1 R1 R2 R3(“key”, 1) (“key”, 2) (“key”,1) ack(“key”, 2) R=1

Slide 189

Slide 189 text

(“key”, 2) Coordinator Coordinator ack(“key”, 2) W=1 R1 R2 R3 (“key”, 2) (“key”,1) ack(“key”, 2) ack(“key”, 2) (“key”, 2) R=1

Slide 190

Slide 190 text

(“key”, 2) Coordinator Coordinator ack(“key”, 2) W=1 R1 R2 R3 (“key”, 2) (“key”,1) (“key”, 2) R=1

Slide 191

Slide 191 text

(“key”, 2) Coordinator Coordinator ack(“key”, 2) W=1 R1 R2 R3 (“key”, 2) (“key”,1) (“key”, 2) (“key”, 2) R=1

Slide 192

Slide 192 text

(“key”, 2) Coordinator Coordinator ack(“key”, 2) W=1 R1 R2 R3 (“key”, 2) (“key”,1) (“key”, 2) (“key”, 2) (“key”, 2) R=1

Slide 193

Slide 193 text

(“key”, 2) Coordinator Coordinator ack(“key”, 2) W=1 R1 R2 R3 (“key”, 2) (“key”,1) (“key”, 2) R=1

Slide 194

Slide 194 text

No content

Slide 195

Slide 195 text

keep replicas in sync

Slide 196

Slide 196 text

keep replicas in sync

Slide 197

Slide 197 text

keep replicas in sync

Slide 198

Slide 198 text

keep replicas in sync

Slide 199

Slide 199 text

keep replicas in sync

Slide 200

Slide 200 text

keep replicas in sync

Slide 201

Slide 201 text

keep replicas in sync

Slide 202

Slide 202 text

keep replicas in sync slow

Slide 203

Slide 203 text

keep replicas in sync slow alternative: sync later

Slide 204

Slide 204 text

keep replicas in sync slow alternative: sync later

Slide 205

Slide 205 text

keep replicas in sync slow alternative: sync later

Slide 206

Slide 206 text

keep replicas in sync slow alternative: sync later inconsistent

Slide 207

Slide 207 text

keep replicas in sync slow alternative: sync later inconsistent

Slide 208

Slide 208 text

keep replicas in sync slow alternative: sync later inconsistent

Slide 209

Slide 209 text

keep replicas in sync slow alternative: sync later inconsistent

Slide 210

Slide 210 text

http://ria101.wordpress.com/2010/02/24/hbase-vs-cassandra-why-we-moved/ "In the general case, we typically use [Cassandra’s] consistency level of [R=W=1], which provides maximum performance. Nice!" --D. Williams, “HBase vs Cassandra: why we moved” February 2010

Slide 211

Slide 211 text

http://www.reddit.com/r/programming/comments/bcqhi/reddits_now_running_on_cassandra/c0m3wh6

Slide 212

Slide 212 text

http://www.reddit.com/r/programming/comments/bcqhi/reddits_now_running_on_cassandra/c0m3wh6

Slide 213

Slide 213 text

How consistent? What happens if I don’t wait?

Slide 214

Slide 214 text

Probability of reading data older than k versions is exponentially reduced by k
Pr(reading latest write) = 99%
Pr(reading one of last two writes) = 99.9%
Pr(reading one of last three writes) = 99.99%

Slide 215

Slide 215 text

VLDB 2012 early print: tinyurl.com/pbsvldb
Cassandra patch: tinyurl.com/pbspatch

Slide 216

Slide 216 text

“strong” consistency: reads return the last written value or newer (defined w.r.t. real time, when the read started)

Slide 217

Slide 217 text

Coordinator Coordinator ack(“key”, 2) W=1 R1 R2(“key”, 1) R3(“key”, 1) (“key”, 2) R=1

Slide 218

Slide 218 text

Coordinator Coordinator ack(“key”, 2) W=1 R1 R2(“key”, 1) R3(“key”, 1) (“key”, 2) (“key”, 1) (“key”,1) R=1

Slide 219

Slide 219 text

Coordinator Coordinator write(“key”, 2) ack(“key”, 2) W=1 R1 R2(“key”, 1) R3(“key”, 1) (“key”, 2) (“key”, 1) (“key”,1) R=1 R3 replied before last write arrived!

Slide 220

Slide 220 text

99.9% consistent reads: R=1, W=1 t = 1.85 ms Latency: 1.32 ms Latency is combined read and write latency at 99.9th percentile 100% consistent reads: R=3, W=1 Latency: 4.20 ms LNKD-SSD N=3

Slide 221

Slide 221 text

99.9% consistent reads: R=1, W=1 t = 1.85 ms Latency: 1.32 ms Latency is combined read and write latency at 99.9th percentile 100% consistent reads: R=3, W=1 Latency: 4.20 ms LNKD-SSD N=3 59.5% faster

Slide 222

Slide 222 text

Workflow: 1. Tracing 2. Simulation 3. Tune N, R, W 4. Profit

Slide 223

Slide 223 text

No content

Slide 224

Slide 224 text

No content

Slide 225

Slide 225 text

99.9% consistent reads: R=1, W=1 t = 202.0 ms Latency: 43.3 ms Latency is combined read and write latency at 99.9th percentile 100% consistent reads: R=3, W=1 Latency: 230.06 ms YMMR N=3

Slide 226

Slide 226 text

99.9% consistent reads: R=1, W=1 t = 202.0 ms Latency: 43.3 ms Latency is combined read and write latency at 99.9th percentile 100% consistent reads: R=3, W=1 Latency: 230.06 ms YMMR N=3 81.1% faster

Slide 227

Slide 227 text

R+W

Slide 228

Slide 228 text

N=3

Slide 229

Slide 229 text

N=3

Slide 230

Slide 230 text

N=3

Slide 231

Slide 231 text

focus on: steady state
with failures: unavailable or sloppy

Slide 232

Slide 232 text

N = 3 replicas: R1, R2, R3
Write to W, read from R replicas

Slide 233

Slide 233 text

N = 3 replicas: R1, R2, R3
Write to W, read from R replicas
R=W=3 replicas: {R1, R2, R3}
R=W=2 replicas: {R1, R2}, {R2, R3}, {R1, R3}
quorum system: guaranteed intersection

Slide 234

Slide 234 text

N = 3 replicas: R1, R2, R3
Write to W, read from R replicas
R=W=3 replicas: {R1, R2, R3}
R=W=2 replicas: {R1, R2}, {R2, R3}, {R1, R3}
quorum system: guaranteed intersection
R=W=1 replicas: {R1}, {R2}, {R3}
partial quorum system: may not intersect

Slide 235

Slide 235 text

Coordinator Replica once per replica Time

Slide 236

Slide 236 text

Coordinator Replica write once per replica Time

Slide 237

Slide 237 text

Coordinator Replica write ack once per replica Time

Slide 238

Slide 238 text

Coordinator Replica write ack wait for W responses once per replica Time

Slide 239

Slide 239 text

Coordinator Replica write ack wait for W responses t seconds elapse once per replica Time

Slide 240

Slide 240 text

Coordinator Replica write ack read wait for W responses t seconds elapse once per replica Time

Slide 241

Slide 241 text

Coordinator Replica write ack read response wait for W responses t seconds elapse once per replica Time

Slide 242

Slide 242 text

Coordinator Replica write ack read response wait for W responses t seconds elapse wait for R responses once per replica Time

Slide 243

Slide 243 text

Coordinator Replica write ack read response wait for W responses t seconds elapse wait for R responses once per replica Time

Slide 244

Slide 244 text

Coordinator Replica write ack read response wait for W responses t seconds elapse wait for R responses once per replica Time

Slide 245

Slide 245 text

Coordinator Replica write ack read response wait for W responses t seconds elapse wait for R responses response is stale if read arrives before write once per replica Time

Slide 246

Slide 246 text

Coordinator Replica write ack read response wait for W responses t seconds elapse wait for R responses response is stale if read arrives before write once per replica Time

Slide 247

Slide 247 text

N=2 Time

Slide 248

Slide 248 text

write write N=2 Time

Slide 249

Slide 249 text

write ack write ack N=2 Time

Slide 250

Slide 250 text

write ack write ack W=1 N=2 Time

Slide 251

Slide 251 text

write ack write ack W=1 N=2 Time

Slide 252

Slide 252 text

write ack read write ack W=1 N=2 read Time

Slide 253

Slide 253 text

write ack read response write ack W=1 N=2 read response Time

Slide 254

Slide 254 text

write ack read response write ack W=1 R=1 N=2 read response Time

Slide 255

Slide 255 text

write ack read response write ack W=1 R=1 N=2 read response Time

Slide 256

Slide 256 text

write ack read response write ack W=1 R=1 N=2 read response Time inconsistent

Slide 257

Slide 257 text

N=3 R=W=2 quorum system

Slide 258

Slide 258 text

N=3 R=W=2 quorum system

Slide 259

Slide 259 text

N=3 R=W=2 quorum system

Slide 260

Slide 260 text

Y N=3 R=W=2 quorum system

Slide 261

Slide 261 text

Y N=3 R=W=2 quorum system

Slide 262

Slide 262 text

Y N=3 R=W=2 quorum system

Slide 263

Slide 263 text

Y Y N=3 R=W=2 quorum system

Slide 264

Slide 264 text

Y Y N=3 R=W=2 quorum system

Slide 265

Slide 265 text

Y Y N=3 R=W=2 quorum system

Slide 266

Slide 266 text

Y Y Y N=3 R=W=2 quorum system

Slide 267

Slide 267 text

Y Y Y N=3 R=W=2 quorum system

Slide 268

Slide 268 text

Y Y Y Y Y Y N=3 R=W=2 quorum system

Slide 269

Slide 269 text

Y Y Y Y Y Y N=3 R=W=2 quorum system

Slide 270

Slide 270 text

Y Y Y Y Y Y Y Y Y N=3 R=W=2 quorum system

Slide 271

Slide 271 text

Y Y Y Y Y Y Y Y Y N=3 R=W=2 quorum system

Slide 272

Slide 272 text

Y Y Y Y Y Y Y Y Y guaranteed intersection N=3 R=W=2 quorum system

Slide 273

Slide 273 text

N=3 R=W=1 partial quorum system

Slide 274

Slide 274 text

N=3 R=W=1 partial quorum system

Slide 275

Slide 275 text

N=3 R=W=1 partial quorum system

Slide 276

Slide 276 text

Y N=3 R=W=1 partial quorum system

Slide 277

Slide 277 text

Y N=3 R=W=1 partial quorum system

Slide 278

Slide 278 text

Y N=3 R=W=1 partial quorum system

Slide 279

Slide 279 text

Y N N=3 R=W=1 partial quorum system

Slide 280

Slide 280 text

Y N N N=3 R=W=1 partial quorum system

Slide 281

Slide 281 text

Y N N N=3 R=W=1 partial quorum system

Slide 282

Slide 282 text

Y N N N Y N N=3 R=W=1 partial quorum system

Slide 283

Slide 283 text

Y N N N Y N N=3 R=W=1 partial quorum system

Slide 284

Slide 284 text

Y N N N Y N N N Y N=3 R=W=1 partial quorum system

Slide 285

Slide 285 text

Y N N N Y N N N Y N=3 R=W=1 partial quorum system

Slide 286

Slide 286 text

Y N N N Y N N N Y N=3 R=W=1 partial quorum system

Slide 287

Slide 287 text

Y N N N Y N N N Y N=3 R=W=1 partial quorum system

Slide 288

Slide 288 text

Y N N N Y N N N Y probabilistic intersection N=3 R=W=1 partial quorum system
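The Y/N grid can be reproduced by enumerating every read quorum against every write quorum. The Python sketch below assumes each quorum is chosen uniformly at random from the N replicas, which is a simplification; the function name intersection_probability is hypothetical.

```python
from itertools import combinations


def intersection_probability(n, r, w):
    """Fraction of (read quorum, write quorum) pairs that share a replica."""
    replicas = range(n)
    pairs = [
        (set(read_q), set(write_q))
        for read_q in combinations(replicas, r)
        for write_q in combinations(replicas, w)
    ]
    hits = sum(1 for read_q, write_q in pairs if read_q & write_q)
    return hits / len(pairs)


print(intersection_probability(3, 2, 2))  # 1.0: every pair intersects
print(intersection_probability(3, 1, 1))  # 3 of the 9 pairs intersect
```

For N=3, R=W=1 only the three diagonal pairs overlap, matching the three Y entries in the grid; for R=W=2 every pair overlaps, which is the guaranteed-intersection case.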

Slide 289

Slide 289 text

N N Y N=3 R=W=1

Slide 290

Slide 290 text

N N Y expanding quorums grow over time N=3 R=W=1

Slide 291

Slide 291 text

N Y Y expanding quorums grow over time N=3 R=W=1

Slide 292

Slide 292 text

Y Y Y expanding quorums grow over time N=3 R=W=1

Slide 293

Slide 293 text

No content

Slide 294

Slide 294 text

Werner Vogels

Slide 295

Slide 295 text

1994-2004 Werner Vogels

Slide 296

Slide 296 text

1994-2004 2004- Werner Vogels

Slide 297

Slide 297 text

N=3, R=W=2 quorum system

Slide 298

Slide 298 text

N=3, R=W=2 quorum system

Slide 299

Slide 299 text

N=3, R=W=2 quorum system

Slide 300

Slide 300 text

N=3, R=W=2 quorum system

Slide 301

Slide 301 text

N=3, R=W=2 quorum system

Slide 302

Slide 302 text

N=3, R=W=2 quorum system

Slide 303

Slide 303 text

N=3, R=W=2 quorum system

Slide 304

Slide 304 text

N=3, R=W=2 quorum system

Slide 305

Slide 305 text

N=3, R=W=2 quorum system

Slide 306

Slide 306 text

N=3, R=W=2 quorum system

Slide 307

Slide 307 text

guaranteed intersection N=3, R=W=2 quorum system

Slide 308

Slide 308 text

N=3, R=W=1 partial quorum system

Slide 309

Slide 309 text

N=3, R=W=1 partial quorum system

Slide 310

Slide 310 text

N=3, R=W=1 partial quorum system

Slide 311

Slide 311 text

N=3, R=W=1 partial quorum system

Slide 312

Slide 312 text

N=3, R=W=1 partial quorum system

Slide 313

Slide 313 text

N=3, R=W=1 partial quorum system

Slide 314

Slide 314 text

N=3, R=W=1 partial quorum system

Slide 315

Slide 315 text

N=3, R=W=1 partial quorum system

Slide 316

Slide 316 text

N=3, R=W=1 partial quorum system probabilistic intersection

Slide 317

Slide 317 text

expanding quorums grow over time N=3, R=W=1

Slide 318

Slide 318 text

expanding quorums grow over time N=3, R=W=1

Slide 319

Slide 319 text

expanding quorums grow over time N=3, R=W=1

Slide 320

Slide 320 text

expanding quorums grow over time N=3, R=W=1

Slide 321

Slide 321 text

expanding quorums grow over time N=3, R=W=1

Slide 322

Slide 322 text

Solving WARS: hard Monte Carlo methods: easier

Slide 323

Slide 323 text

PBS
observation: no guarantees with eventual consistency
remedy: consistency prediction
technique: measure latencies, use WARS model

Slide 324

Slide 324 text

PBS allows us to quantify latency-consistency trade-offs what’s the latency cost of consistency? what’s the consistency cost of latency?

Slide 325

Slide 325 text

PBS allows us to quantify latency-consistency trade-offs what’s the latency cost of consistency? what’s the consistency cost of latency? an “SLA” for consistency