Probabilistically Bounded Staleness for Practical Partial Quorums

pbailis

August 28, 2012
Transcript

  1. Peter Bailis, Shivaram Venkataraman,
    Mike Franklin, Joe Hellerstein, Ion Stoica
    PBS

    View Slide

  2. Peter Bailis, Shivaram Venkataraman,
    Mike Franklin, Joe Hellerstein, Ion Stoica
    VLDB
    2012
    UC Berkeley
    Probabilistically Bounded Staleness
    for Practical Partial Quorums
    PBS

    View Slide

  3. R+W

    View Slide

  4. R+W
    strong
    consistency

    View Slide

  5. R+W
    strong
    consistency
    eventual
    consistency

    View Slide

  6. R+W
    strong
    consistency
    higher
    latency
    eventual
    consistency

    View Slide

  7. R+W
    strong
    consistency
    higher
    latency
    eventual
    consistency
    lower
    latency

    View Slide

  8. R+W
    strong
    consistency
    higher
    latency
    eventual
    consistency
    lower
    latency

    View Slide

  9. consistency
    is a choice
    binary

    View Slide

  10. consistency
    is a choice
    binary
    strong eventual

    View Slide

  11. consistency
    continuum
    is a
    strong eventual

    View Slide

  12. consistency
    continuum
    is a
    strong eventual

    View Slide

  13. consistency
    continuum
    is a
    strong eventual

    View Slide

  14. latency vs.
    consistency
    our focus:

    View Slide

  15. latency vs.
    consistency
    our focus:

    View Slide

  16. latency vs.
    consistency
    informed by practice
    our focus:

    View Slide

  17. latency vs.
    consistency
    informed by practice
    our focus:
    availability, partitions,
    failures
    not in this talk:

    View Slide

  18. quantify eventual consistency:
    wall-clock time (“how eventual?”)
    versions (“how consistent?”)
    analyze real-world systems:
    EC is often strongly consistent
    describe when and why
    our contributions

    View Slide

  19. intro
    system model
    practice
    metrics
    insights
    integration

    View Slide

  20. Dynamo:
    Amazon’s Highly Available Key-value Store
    SOSP 2007

    View Slide

  21. Apache, DataStax
    Project Voldemort
    Dynamo:
    Amazon’s Highly Available Key-value Store
    SOSP 2007

    View Slide

  22. Adobe
    Cisco
    Digg
    Gowalla
    IBM
    Morningstar
    Netflix
    Palantir
    Rackspace
    Reddit
    Rhapsody
    Shazam
    Spotify
    Soundcloud
    Twitter
    Mozilla
    Ask.com
    Yammer
    Aol
    GitHub
    JoyentCloud
    Best Buy
    LinkedIn
    Boeing
    Comcast
    Cassandra
    Riak
    Voldemort
    Gilt Groupe

    View Slide

  23. N replicas/key
    read: wait for R replies
    write: wait for W acks

    View Slide

  24. N replicas/key
    read: wait for R replies
    write: wait for W acks

    View Slide

  25. N replicas/key
    read: wait for R replies
    write: wait for W acks
    N=3

    View Slide

  26. N replicas/key
    read: wait for R replies
    write: wait for W acks
    N=3

    View Slide

  27. N replicas/key
    read: wait for R replies
    write: wait for W acks
    N=3

    View Slide

  28. N replicas/key
    read: wait for R replies
    write: wait for W acks
    N=3
    R=2

    View Slide

  29. N replicas/key
    read: wait for R replies
    write: wait for W acks
    N=3
    R=2

    View Slide

  30. if:
    R+W > N
    then:
    “strong”
    consistency
    else:
    eventual
    consistency

    View Slide

  31. “strong” consistency
    (regular register)
    R+W > N:
    reads return the last
    acknowledged write or an
    in-flight write (per-key)

    View Slide
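A minimal sketch of the R+W > N overlap check (illustrative Python; the function name is ours, not from the deck):

    def guarantees_intersection(n, r, w):
        # Any write quorum of size W and read quorum of size R drawn from N
        # replicas must share at least one replica when R + W > N (pigeonhole).
        return r + w > n

    assert guarantees_intersection(3, 2, 2)       # quorum system: "strong" reads
    assert not guarantees_intersection(3, 1, 1)   # partial quorums: eventual consistency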

  32. Latency
    LinkedIn
    disk-based
    model
    N=3

    View Slide

  33. Latency: LinkedIn disk-based model, N=3
    R    99th    99.9th
    1    1x      1x
    2    1.59x   2.35x
    3    4.8x    6.13x

    View Slide

  34. Latency: LinkedIn disk-based model, N=3
    R    99th    99.9th
    1    1x      1x
    2    1.59x   2.35x
    3    4.8x    6.13x
    W    99th    99.9th
    1    1x      1x
    2    2.01x   1.9x
    3    4.96x   14.96x

    View Slide

  35. ⇧ consistency, ⇧ latency:
    wait for more replicas,
    read more recent data
    ⇩ consistency, ⇩ latency:
    wait for fewer replicas,
    read less recent data

    View Slide

  36. ⇧ consistency, ⇧ latency:
    wait for more replicas,
    read more recent data
    ⇩ consistency, ⇩ latency:
    wait for fewer replicas,
    read less recent data

    View Slide

  37. eventual
    consistency
    “if no new updates are
    made to the object,
    eventually all accesses
    will return the last
    updated value”
    W. Vogels, CACM 2008
    R+W ≤ N

    View Slide

  38. How eventual?
    How long do I have to wait?

    View Slide

  39. How consistent?
    What happens if I don’t wait?

    View Slide

  40. R+W
    strong
    consistency
    higher
    latency
    eventual
    consistency
    lower
    latency

    View Slide

  41. R+W
    strong
    consistency
    higher
    latency
    eventual
    consistency
    lower
    latency

    View Slide

  42. R+W
    strong
    consistency
    higher
    latency
    eventual
    consistency
    lower
    latency

    View Slide

  43. intro
    system model
    practice
    metrics
    insights
    integration

    View Slide

  44. Cassandra:
    R=W=1, N=3
    by default
    (1+1 ≯ 3)

    View Slide

  45. eventual consistency
    “maximum
    performance”
    “very low
    latency”
    okay for
    “most data”
    “general
    case”
    in the wild

    View Slide

  46. anecdotally, EC
    “good enough” for
    many kinds of data

    View Slide

  47. anecdotally, EC
    “good enough” for
    many kinds of data
    How eventual?
    How consistent?

    View Slide

  48. anecdotally, EC
    “good enough” for
    many kinds of data
    How eventual?
    How consistent?
    “eventual and consistent enough”

    View Slide

  49. Can we do better?

    View Slide

  50. can’t make promises
    can give expectations
    Can we do better?

    View Slide

  51. Probabilistically
    Bounded Staleness
    can’t make promises
    can give expectations
    Can we do better?

    View Slide

  52. intro
    system model
    practice
    metrics
    insights
    integration

    View Slide

  53. How eventual?
    How long do I have to wait?

    View Slide

  54. How eventual?

    View Slide

  55. t-visibility: probability p
    of consistent reads after
    t seconds
    (e.g., 10ms after write, 99.9% of reads consistent)
    How eventual?

    View Slide

  56. t-visibility depends on
    messaging and
    processing delays

    View Slide

  57. Coordinator Replica
    once per replica T
    i
    m
    e

    View Slide

  58. Coordinator Replica
    write
    once per replica T
    i
    m
    e

    View Slide

  59. Coordinator Replica
    write
    ack
    once per replica T
    i
    m
    e

    View Slide

  60. Coordinator Replica
    write
    ack
    wait for W
    responses
    once per replica T
    i
    m
    e

    View Slide

  61. Coordinator Replica
    write
    ack
    wait for W
    responses
    t seconds elapse
    once per replica T
    i
    m
    e

    View Slide

  62. Coordinator Replica
    write
    ack
    read
    wait for W
    responses
    t seconds elapse
    once per replica T
    i
    m
    e

    View Slide

  63. Coordinator Replica
    write
    ack
    read
    response
    wait for W
    responses
    t seconds elapse
    once per replica T
    i
    m
    e

    View Slide

  64. Coordinator Replica
    write
    ack
    read
    response
    wait for W
    responses
    t seconds elapse
    wait for R
    responses
    once per replica T
    i
    m
    e

    View Slide

  65. Coordinator Replica
    write
    ack
    read
    response
    wait for W
    responses
    t seconds elapse
    wait for R
    responses
    response is
    stale
    if read arrives
    before write
    once per replica T
    i
    m
    e

    View Slide

  66. Coordinator Replica
    write
    ack
    read
    response
    wait for W
    responses
    t seconds elapse
    wait for R
    responses
    response is
    stale
    if read arrives
    before write
    once per replica T
    i
    m
    e

    View Slide

  67. Coordinator Replica
    write
    ack
    read
    response
    wait for W
    responses
    t seconds elapse
    wait for R
    responses
    response is
    stale
    if read arrives
    before write
    once per replica T
    i
    m
    e

    View Slide

  68. Coordinator Replica
    write
    ack
    read
    response
    wait for W
    responses
    t seconds elapse
    wait for R
    responses
    response is
    stale
    if read arrives
    before write
    once per replica T
    i
    m
    e

    View Slide

  69. R1
    N=2 T
    i
    m
    e
    Alice
    R2

    View Slide

  70. R1
    write
    N=2 T
    i
    m
    e
    Alice
    R2

    View Slide

  71. write
    ack
    N=2 T
    i
    m
    e
    Alice
    R2
    R1

    View Slide

  72. write
    ack
    W=1
    N=2 T
    i
    m
    e
    Alice
    R2
    R1

    View Slide

  73. write
    ack
    W=1
    N=2 T
    i
    m
    e
    Alice
    R2
    R1

    View Slide

  74. write
    ack
    W=1
    N=2 T
    i
    m
    e
    Alice
    Bob
    R2
    R1

    View Slide

  75. write
    ack
    read
    W=1
    N=2 T
    i
    m
    e
    Alice
    Bob
    R2
    R1

    View Slide

  76. R2
    write
    ack
    read
    W=1
    N=2 T
    i
    m
    e
    Alice
    Bob
    R2
    R1

    View Slide

  77. R2
    write
    ack
    read
    W=1
    N=2
    response
    T
    i
    m
    e
    Alice
    Bob
    R2
    R1

    View Slide

  78. R2
    write
    ack
    read
    W=1
    R=1
    N=2
    response
    T
    i
    m
    e
    Alice
    Bob
    R2
    R1

    View Slide

  79. R2
    write
    ack
    read
    W=1
    R=1
    N=2
    response
    T
    i
    m
    e
    Alice
    Bob
    R2
    R1
    inconsistent

    View Slide

  80. R2
    write
    ack
    read
    W=1
    R=1
    N=2
    response
    T
    i
    m
    e
    Alice
    Bob
    R2
    R1
    inconsistent

    View Slide

  81. R2
    write
    ack
    read
    W=1
    R=1
    N=2
    response
    T
    i
    m
    e
    Alice
    Bob
    R2
    R1
    R2
    inconsistent

    View Slide

  82. write
    ack
    read
    response
    wait for W
    responses
    t seconds elapse
    wait for R
    responses
    response is
    stale
    if read arrives
    before write
    once per replica
    Coordinator Replica T
    i
    m
    e

    View Slide

  83. (W)
    write
    ack
    read
    response
    wait for W
    responses
    t seconds elapse
    wait for R
    responses
    response is
    stale
    if read arrives
    before write
    once per replica
    Coordinator Replica T
    i
    m
    e

    View Slide

  84. (W)
    write
    ack
    read
    response
    wait for W
    responses
    t seconds elapse
    wait for R
    responses
    response is
    stale
    if read arrives
    before write
    once per replica
    (A)
    Coordinator Replica T
    i
    m
    e

    View Slide

  85. (R)
    (W)
    write
    ack
    read
    response
    wait for W
    responses
    t seconds elapse
    wait for R
    responses
    response is
    stale
    if read arrives
    before write
    once per replica
    (A)
    Coordinator Replica T
    i
    m
    e

    View Slide

  86. (R)
    (W)
    write
    ack
    read
    response
    wait for W
    responses
    t seconds elapse
    wait for R
    responses
    response is
    stale
    if read arrives
    before write
    once per replica
    (A)
    (S)
    Coordinator Replica T
    i
    m
    e

    View Slide

  87. solving WARS analytically:
    order statistics over
    dependent variables
    instead:
    Monte Carlo methods

    View Slide

  88. to use WARS:
    W
    53.2
    44.5
    101.1
    ...
    A
    10.3
    8.2
    11.3
    ...
    R
    15.3
    22.4
    19.8
    ...
    S
    9.6
    14.2
    6.7
    ...
    run simulation
    Monte Carlo, sampling
    gather latency data

    View Slide

  89. to use WARS:
    W
    53.2
    44.5
    101.1
    ...
    A
    10.3
    8.2
    11.3
    ...
    R
    15.3
    22.4
    19.8
    ...
    S
    9.6
    14.2
    6.7
    ...
    run simulation
    Monte Carlo, sampling
    gather latency data
    44.5

    View Slide

  90. to use WARS:
    W
    53.2
    44.5
    101.1
    ...
    A
    10.3
    8.2
    11.3
    ...
    R
    15.3
    22.4
    19.8
    ...
    S
    9.6
    14.2
    6.7
    ...
    run simulation
    Monte Carlo, sampling
    gather latency data
    44.5
    11.3

    View Slide

  91. to use WARS:
    W
    53.2
    44.5
    101.1
    ...
    A
    10.3
    8.2
    11.3
    ...
    R
    15.3
    22.4
    19.8
    ...
    S
    9.6
    14.2
    6.7
    ...
    run simulation
    Monte Carlo, sampling
    gather latency data
    44.5
    11.3
    15.3

    View Slide

  92. to use WARS:
    W
    53.2
    44.5
    101.1
    ...
    A
    10.3
    8.2
    11.3
    ...
    R
    15.3
    22.4
    19.8
    ...
    S
    9.6
    14.2
    6.7
    ...
    run simulation
    Monte Carlo, sampling
    gather latency data
    44.5
    11.3
    15.3
    14.2

    View Slide
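An illustrative Monte Carlo sketch of the WARS sampling above (plain Python; the function name, exponential latency distribution, and units are assumptions for illustration, not the paper's code):

    import random

    def simulate_pbs(n, r, w, t, iterations=100_000,
                     sample=lambda: random.expovariate(1 / 5.0)):
        # W_i: write coordinator->replica, A_i: ack replica->coordinator,
        # R_i: read coordinator->replica,  S_i: response replica->coordinator.
        consistent = 0
        for _ in range(iterations):
            Ws = [sample() for _ in range(n)]
            As = [sample() for _ in range(n)]
            Rs = [sample() for _ in range(n)]
            Ss = [sample() for _ in range(n)]
            # write commits once the W-th ack arrives
            commit = sorted(Ws[i] + As[i] for i in range(n))[w - 1]
            # the read, issued t ms later, waits for the R fastest responders
            responders = sorted(range(n), key=lambda i: Rs[i] + Ss[i])[:r]
            # consistent if any responder received the write before the read arrived
            if any(Ws[i] <= commit + t + Rs[i] for i in responders):
                consistent += 1
        return consistent / iterations

    # e.g., estimated probability of a consistent read 10 ms after a write, N=3, R=W=1
    print(simulate_pbs(n=3, r=1, w=1, t=10.0))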

  93. real Cassandra cluster
    varying latencies:
    t-visibility RMSE: 0.28%
    latency N-RMSE: 0.48%
    WARS accuracy

    View Slide

  94. How eventual?
    key: WARS model
    need: latencies
    t-visibility: consistent
    reads with probability p
    after t seconds

    View Slide

  95. intro
    system model
    practice
    metrics
    insights
    integration

    View Slide

  96. Yammer
    100K+ companies
    uses Riak
    LinkedIn
    175M+ users
    built and uses Voldemort
    production latencies
    fit Gaussian mixtures

    View Slide
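As a sketch of the "fit Gaussian mixtures" step (scikit-learn assumed; the file name and component count are hypothetical):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # hypothetical trace of per-message write latencies, one value (ms) per line
    latencies = np.loadtxt("write_latencies_ms.txt").reshape(-1, 1)
    gmm = GaussianMixture(n_components=2).fit(latencies)
    samples, _ = gmm.sample(100_000)   # resampled latencies to feed the WARS simulation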

  97. N=3

    View Slide

  98. 10 ms
    N=3

    View Slide

  99. Latency is combined read and write latency at 99.9th percentile
    R=3, W=1
    100% consistent:
    Latency: 15.01 ms
    LNKD-DISK
    N=3
    R=2, W=1, t =13.6 ms
    99.9% consistent:
    Latency: 12.53 ms

    View Slide

  100. LNKD-DISK, N=3
    100% consistent: R=3, W=1, latency 15.01 ms
    99.9% consistent: R=2, W=1, t=13.6 ms, latency 12.53 ms
    (16.5% faster)
    Latency is combined read and write latency at the 99.9th percentile

    View Slide

  101. LNKD-DISK, N=3
    100% consistent: R=3, W=1, latency 15.01 ms
    99.9% consistent: R=2, W=1, t=13.6 ms, latency 12.53 ms
    (16.5% faster; worthwhile?)
    Latency is combined read and write latency at the 99.9th percentile

    View Slide

  102. N=3

    View Slide

  103. N=3

    View Slide

  104. N=3

    View Slide

  105. Latency is combined read and write latency at 99.9th percentile
    R=3, W=1
    100% consistent:
    Latency: 4.20 ms
    LNKD-SSD
    N=3
    R=1, W=1, t = 1.85 ms
    99.9% consistent:
    Latency: 1.32 ms

    View Slide

  106. LNKD-SSD, N=3
    100% consistent: R=3, W=1, latency 4.20 ms
    99.9% consistent: R=1, W=1, t=1.85 ms, latency 1.32 ms
    (59.5% faster)
    Latency is combined read and write latency at the 99.9th percentile

    View Slide

  107. [Figure: write-latency CDFs (ms, log scale) for W=1, W=2, W=3; LNKD-SSD vs. LNKD-DISK; N=3]

    View Slide


  110. Coordinator Replica
    write
    ack
    (A)
    (W)
    response
    (S)
    (R)
    wait for W
    responses
    t seconds elapse
    wait for R
    responses
    response is
    stale
    if read arrives
    before write
    once per replica
    SSDs reduce
    variance
    compared to
    disks!
    read

    View Slide

  111. Yammer
    latency: 81.1% lower (187 ms)
    t-visibility: 202 ms
    99.9th percentile
    N=3

    View Slide

  112. k-staleness (versions)
    How consistent?
    monotonic reads
    quorum load
    in the paper

    View Slide

  113. in the paper
    ⟨k, t⟩-staleness:
    versions and time

    View Slide

  114. latency distributions
    WAN model
    varying quorum sizes
    staleness detection
    in the paper

    View Slide

  115. intro
    system model
    practice
    metrics
    insights
    integration

    View Slide

  116. 1. Tracing
    2. Simulation
    3. Tune N,R,W
    Integration
    Project Voldemort

    View Slide
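A hypothetical tuning sweep on top of the simulate_pbs sketch from earlier (quorum sizes only; it ignores the latency side of the trade-off):

    target, t = 0.999, 10.0            # 99.9% consistent reads, 10 ms after write
    candidates = [(r, w) for r in (1, 2, 3) for w in (1, 2, 3)]
    ok = [(r, w) for (r, w) in candidates if simulate_pbs(3, r, w, t) >= target]
    print(min(ok, key=lambda rw: rw[0] + rw[1]))   # smallest quorums meeting the target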

  117. https://issues.apache.org/jira/browse/CASSANDRA-4261

    View Slide


  120. Related Work
    Quorum Systems
    • probabilistic quorums [PODC ’97]
    • deterministic k-quorums [DISC ’05, ’06]
    Consistency Verification
    • Golab et al. [PODC ’11]
    • Bermbach and Tai [M4WSOC ’11]
    • Wada et al. [CIDR ’11]
    • Anderson et al. [HotDep ’10]
    • Transactional consistency:
    Zellag and Kemme [ICDE ’11],
    Fekete et al. [VLDB ’09]
    Latency-Consistency
    • Daniel Abadi [Computer ’12]
    • Kraska et al. [VLDB ’09]
    Bounded Staleness
    Guarantees
    • TACT [OSDI ’00]
    • FRACS [ICDCS ’03]
    • AQuA [IEEE TPDS ’03]

    View Slide

  121. R+W
    strong
    consistency
    higher
    latency
    eventual
    consistency
    lower
    latency

    View Slide

  122. consistency
    is a

    View Slide

  123. consistency
    continuum
    is a

    View Slide

  124. consistency
    continuum
    is a
    strong eventual

    View Slide

  125. consistency
    continuum
    is a
    strong eventual

    View Slide

  126. quantify eventual consistency
    model staleness in time, versions
    latency-consistency trade-offs
    analyze real systems and hardware
    PBS

    View Slide

  127. quantify eventual consistency
    model staleness in time, versions
    latency-consistency trade-offs
    analyze real systems and hardware
    PBS
    quantify which choice is best and explain
    why EC is often strongly consistent

    View Slide

  128. quantify eventual consistency
    model staleness in time, versions
    latency-consistency trade-offs
    analyze real systems and hardware
    pbs.cs.berkeley.edu
    PBS
    quantify which choice is best and explain
    why EC is often strongly consistent

    View Slide

  129. Extra Slides

    View Slide

  130. Non-expanding Quorum Systems
    e.g., probabilistic quorums (PODC ’97)
    deterministic k-quorums (DISC ’05, ’06)
    Bounded Staleness Guarantees
    e.g., TACT (OSDI ’00), FRACS (ICDCS ’03)

    View Slide

  131. Consistency Verification
    e.g., Golab et al. (PODC ’11),
    Bermbach and Tai (M4WSOC ’11),
    Wada et al. (CIDR ’11)
    Latency-Consistency
    Daniel Abadi (IEEE Computer ’12)

    View Slide

  132. PBS
    and
    apps

    View Slide

  133. tolerating staleness requires
    either:
    staleness-tolerant data structures
    timelines, logs
    cf. commutative data structures
    logical monotonicity
    asynchronous compensation code
    detect violations after data is returned; see paper
    cf. “Building on Quicksand”
    memories, guesses, apologies
    write code to fix any errors

    View Slide

  134. minimize:
    (compensation cost)×(# of expected anomalies)
    asynchronous
    compensation

    View Slide

  135. Read only newer data?
    (monotonic reads session guarantee)
    # versions tolerable staleness =
    client’s read rate / global write rate
    (for a given key)

    View Slide

  136. Failure?

    View Slide

  137. Treat failures as
    latency spikes

    View Slide

  138. How long
    do partitions last?

    View Slide

  139. what time interval?
    99.9% uptime/yr
    ⇒ 8.76 hours downtime/yr
    8.76 consecutive hours down
    ⇒ bad 8-hour rolling average

    View Slide

  140. what time interval?
    99.9% uptime/yr
    ⇒ 8.76 hours downtime/yr
    8.76 consecutive hours down
    ⇒ bad 8-hour rolling average
    hide in tail of distribution OR
    continuously evaluate SLA, adjust

    View Slide
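The downtime figure is plain arithmetic; for reference (sketch):

    hours_per_year = 365 * 24                      # 8760 hours
    downtime_hours = hours_per_year * (1 - 0.999)  # allowed downtime at 99.9% uptime
    print(downtime_hours)                          # 8.76 hours/year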

  141. [Figure: write-latency CDFs (ms, log scale) for W=1, W=2, W=3; LNKD-SSD, LNKD-DISK, YMMR, WAN; N=3]

    View Slide

  142. [Figure: read-latency CDFs (ms, log scale) for R=1, R=2, R=3; LNKD-SSD, LNKD-DISK, YMMR, WAN; N=3. LNKD-SSD and LNKD-DISK identical for reads]

    View Slide

  143. ⟨k, t⟩-staleness:
    versions and time

    View Slide

  144. ⟨k, t⟩-staleness:
    versions and time
    approximation:
    exponentiate
    t-staleness by k

    View Slide
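One way to read "exponentiate t-staleness by k" (our interpretation; see the paper for the exact form): treat each of the last k versions as an independent t-visibility trial, so the miss probability shrinks geometrically with k.

    def kt_consistency(p_t, k):
        # approximate probability that a read t seconds after a write returns one of
        # the last k versions, assuming independent per-version t-visibility p_t
        return 1 - (1 - p_t) ** k

    print(kt_consistency(0.99, 2))   # tolerating up to 2 versions of staleness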

  145. “strong” consistency:
    reads return the last
    written value or newer
    (defined w.r.t. real time,
    when the read started)

    View Slide

  146. N = 3 replicas: R1, R2, R3
    Write to W, read from R replicas

    View Slide

  147. N = 3 replicas: R1, R2, R3
    Write to W, read from R replicas
    R=W=3 replicas: {R1, R2, R3}
    R=W=2 replicas: {R1, R2} {R2, R3} {R1, R3}
    quorum system:
    guaranteed intersection

    View Slide

  148. N = 3 replicas: R1, R2, R3
    Write to W, read from R replicas
    R=W=3 replicas: {R1, R2, R3}
    R=W=2 replicas: {R1, R2} {R2, R3} {R1, R3}
    R=W=1 replicas: {R1} {R2} {R3}
    quorum system:
    guaranteed intersection
    partial quorum system:
    may not intersect

    View Slide
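A small sketch of the intersection property on N = 3 replicas (illustrative Python, replica names as above):

    from itertools import combinations

    replicas = ["R1", "R2", "R3"]
    def quorums(size):
        return [set(q) for q in combinations(replicas, size)]

    # R=W=2: every read quorum overlaps every write quorum (guaranteed intersection)
    print(all(r & w for r in quorums(2) for w in quorums(2)))   # True
    # R=W=1: some read/write quorum pairs miss each other (probabilistic intersection)
    print(all(r & w for r in quorums(1) for w in quorums(1)))   # False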

  149. Synthetic,
    Exponential Distributions
    N=3, W=1, R=1

    View Slide

  150. Synthetic,
    Exponential Distributions
    W 1/4x ARS
    N=3, W=1, R=1

    View Slide

  151. Synthetic,
    Exponential Distributions
    W 1/4x ARS
    W 10x ARS
    N=3, W=1, R=1

    View Slide

  152. concurrent writes:
    deterministically choose
    Coordinator R=2
    (“key”, 1) (“key”, 2)

    View Slide


  157. N = 3 replicas
    Coordinator
    client
    read
    R=3
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)

    View Slide

  158. N = 3 replicas
    Coordinator
    client
    read(“key”)
    read
    R=3
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)

    View Slide

  159. N = 3 replicas
    Coordinator
    read(“key”)
    client
    read
    R=3
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)

    View Slide

  160. N = 3 replicas
    Coordinator
    (“key”, 1)
    (“key”, 1)
    (“key”, 1)
    client
    read
    R=3
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)

    View Slide

  161. N = 3 replicas
    Coordinator
    (“key”, 1)
    (“key”, 1)
    (“key”, 1)
    client
    (“key”, 1)
    read
    R=3
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)

    View Slide

  162. N = 3 replicas
    Coordinator
    read
    R=3
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)
    client

    View Slide

  163. N = 3 replicas
    Coordinator
    read(“key”)
    read
    R=3
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)
    client

    View Slide

  164. N = 3 replicas
    Coordinator
    read(“key”)
    read
    R=3
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)
    client

    View Slide

  165. N = 3 replicas
    Coordinator
    (“key”, 1)
    read
    R=3
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)
    client

    View Slide

  166. N = 3 replicas
    Coordinator
    (“key”, 1)
    (“key”, 1)
    read
    R=3
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)
    client

    View Slide

  167. N = 3 replicas
    Coordinator
    (“key”, 1)
    (“key”, 1)
    (“key”, 1)
    read
    R=3
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)
    client

    View Slide

  168. N = 3 replicas
    Coordinator
    (“key”, 1)
    (“key”, 1)
    (“key”, 1)
    (“key”, 1)
    read
    R=3
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)
    client

    View Slide

  169. N = 3 replicas
    Coordinator
    read
    R=1
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)
    client

    View Slide

  170. N = 3 replicas
    Coordinator
    read(“key”)
    read
    R=1
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)
    client

    View Slide

  171. N = 3 replicas
    Coordinator
    read(“key”)
    send
    read
    to all
    read
    R=1
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)
    client

    View Slide

  172. N = 3 replicas
    Coordinator
    (“key”, 1)
    send
    read
    to all
    read
    R=1
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)
    client

    View Slide

  173. N = 3 replicas
    Coordinator
    (“key”, 1)
    (“key”, 1)
    send
    read
    to all
    read
    R=1
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)
    client

    View Slide

  174. N = 3 replicas
    Coordinator
    (“key”, 1)
    (“key”, 1)
    (“key”, 1)
    send
    read
    to all
    read
    R=1
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)
    client

    View Slide

  175. N = 3 replicas
    Coordinator
    (“key”, 1)
    (“key”, 1)
    (“key”, 1)
    (“key”, 1)
    send
    read
    to all
    read
    R=1
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)
    client

    View Slide

  176. Coordinator W=1
    R1(“key”, 1) R2(“key”, 1) R3(“key”, 1)

    View Slide

  177. Coordinator
    write(“key”, 2)
    W=1
    R1(“key”, 1) R2(“key”, 1) R3(“key”, 1)

    View Slide

  178. Coordinator
    write(“key”, 2)
    W=1
    R1(“key”, 1) R2(“key”, 1) R3(“key”, 1)

    View Slide

  179. Coordinator
    ack(“key”, 2)
    W=1
    R1 R2(“key”, 1) R3(“key”, 1)
    (“key”, 2)

    View Slide

  180. Coordinator
    ack(“key”, 2)
    ack(“key”, 2)
    W=1
    R1 R2(“key”, 1) R3(“key”, 1)
    (“key”, 2)

    View Slide

  181. Coordinator Coordinator
    read(“key”)
    ack(“key”, 2)
    W=1
    R1 R2(“key”, 1) R3(“key”, 1)
    (“key”, 2)
    R=1

    View Slide

  182. Coordinator Coordinator
    ack(“key”, 2)
    W=1
    R1 R2(“key”, 1) R3(“key”, 1)
    (“key”, 2)
    read(“key”)
    R=1

    View Slide

  183. Coordinator Coordinator
    ack(“key”, 2)
    W=1
    R1 R2(“key”, 1) R3(“key”, 1)
    (“key”, 2)
    (“key”, 1)
    R=1

    View Slide

  184. Coordinator Coordinator
    ack(“key”, 2)
    W=1
    R1 R2(“key”, 1) R3(“key”, 1)
    (“key”, 2)
    (“key”,1)
    R=1

    View Slide

  185. Coordinator Coordinator
    ack(“key”, 2)
    W=1
    R1 R2(“key”, 1) R3(“key”, 1)
    (“key”, 2)
    (“key”,1)
    R=1

    View Slide

  186. Coordinator Coordinator
    ack(“key”, 2)
    W=1
    R1 R2(“key”, 1) R3(“key”, 1)
    (“key”, 2)
    (“key”,1)
    R=1

    View Slide

  187. Coordinator Coordinator
    ack(“key”, 2)
    W=1
    R1 R2(“key”, 1) R3(“key”, 1)
    (“key”, 2)
    (“key”,1)
    R=1

    View Slide

  188. (“key”, 2)
    Coordinator Coordinator
    ack(“key”, 2)
    W=1
    R1 R2 R3(“key”, 1)
    (“key”, 2)
    (“key”,1)
    ack(“key”, 2)
    R=1

    View Slide

  189. (“key”, 2)
    Coordinator Coordinator
    ack(“key”, 2)
    W=1
    R1 R2 R3
    (“key”, 2)
    (“key”,1)
    ack(“key”, 2) ack(“key”, 2)
    (“key”, 2)
    R=1

    View Slide

  190. (“key”, 2)
    Coordinator Coordinator
    ack(“key”, 2)
    W=1
    R1 R2 R3
    (“key”, 2)
    (“key”,1)
    (“key”, 2)
    R=1

    View Slide

  191. (“key”, 2)
    Coordinator Coordinator
    ack(“key”, 2)
    W=1
    R1 R2 R3
    (“key”, 2)
    (“key”,1)
    (“key”, 2)
    (“key”, 2)
    R=1

    View Slide

  192. (“key”, 2)
    Coordinator Coordinator
    ack(“key”, 2)
    W=1
    R1 R2 R3
    (“key”, 2)
    (“key”,1)
    (“key”, 2)
    (“key”, 2) (“key”, 2)
    R=1

    View Slide

  193. (“key”, 2)
    Coordinator Coordinator
    ack(“key”, 2)
    W=1
    R1 R2 R3
    (“key”, 2)
    (“key”,1)
    (“key”, 2)
    R=1

    View Slide


  195. keep replicas in sync

    View Slide

  196. keep replicas in sync

    View Slide

  197. keep replicas in sync

    View Slide

  198. keep replicas in sync

    View Slide

  199. keep replicas in sync

    View Slide

  200. keep replicas in sync

    View Slide

  201. keep replicas in sync

    View Slide

  202. keep replicas in sync
    slow

    View Slide

  203. keep replicas in sync
    slow
    alternative: sync later

    View Slide

  204. keep replicas in sync
    slow
    alternative: sync later

    View Slide

  205. keep replicas in sync
    slow
    alternative: sync later

    View Slide

  206. keep replicas in sync
    slow
    alternative: sync later
    inconsistent

    View Slide

  207. keep replicas in sync
    slow
    alternative: sync later
    inconsistent

    View Slide

  208. keep replicas in sync
    slow
    alternative: sync later
    inconsistent

    View Slide

  209. keep replicas in sync
    slow
    alternative: sync later
    inconsistent

    View Slide

  210. http://ria101.wordpress.com/2010/02/24/hbase-vs-cassandra-why-we-moved/
    "In the general case, we typically
    use [Cassandra’s] consistency level
    of [R=W=1], which provides
    maximum performance. Nice!"

    --D. Williams,

    “HBase vs Cassandra: why we moved”

    February 2010

    View Slide

  211. http://www.reddit.com/r/programming/comments/bcqhi/reddits_now_running_on_cassandra/c0m3wh6

    View Slide

  212. http://www.reddit.com/r/programming/comments/bcqhi/reddits_now_running_on_cassandra/c0m3wh6

    View Slide

  213. How consistent?
    What happens if I don’t wait?

    View Slide

  214. Probability of reading a value older than k
    versions is exponentially reduced by k
    Pr(reading latest write) = 99%
    Pr(reading one of last two writes) = 99.9%
    Pr(reading one of last three writes) = 99.99%

    View Slide

  215. cassandra patch
    VLDB 2012 early print
    tinyurl.com/pbsvldb
    tinyurl.com/pbspatch

    View Slide

  216. “strong” consistency:
    reads return the last
    written value or newer
    (defined w.r.t. real time,
    when the read started)

    View Slide

  217. Coordinator Coordinator
    ack(“key”, 2)
    W=1
    R1 R2(“key”, 1) R3(“key”, 1)
    (“key”, 2)
    R=1

    View Slide

  218. Coordinator Coordinator
    ack(“key”, 2)
    W=1
    R1 R2(“key”, 1) R3(“key”, 1)
    (“key”, 2)
    (“key”, 1)
    (“key”,1)
    R=1

    View Slide

  219. Coordinator Coordinator
    write(“key”, 2)
    ack(“key”, 2)
    W=1
    R1 R2(“key”, 1) R3(“key”, 1)
    (“key”, 2)
    (“key”, 1)
    (“key”,1)
    R=1
    R3 replied before
    last write arrived!

    View Slide

  220. 99.9% consistent reads:
    R=1, W=1
    t = 1.85 ms
    Latency: 1.32 ms
    Latency is combined read and write latency at 99.9th percentile
    100% consistent reads:
    R=3, W=1
    Latency: 4.20 ms
    LNKD-SSD
    N=3

    View Slide

  221. 99.9% consistent reads:
    R=1, W=1
    t = 1.85 ms
    Latency: 1.32 ms
    Latency is combined read and write latency at 99.9th percentile
    100% consistent reads:
    R=3, W=1
    Latency: 4.20 ms
    LNKD-SSD
    N=3
    59.5%
    faster

    View Slide

  222. 1. Tracing
    2. Simulation
    3. Tune N, R, W
    4. Profit
    Workflow

    View Slide


  225. 99.9% consistent reads:
    R=1, W=1
    t = 202.0 ms
    Latency: 43.3 ms
    Latency is combined read and write latency at 99.9th percentile
    100% consistent reads:
    R=3, W=1
    Latency: 230.06 ms
    YMMR
    N=3

    View Slide

  226. YMMR, N=3
    100% consistent reads: R=3, W=1, latency 230.06 ms
    99.9% consistent reads: R=1, W=1, t=202.0 ms, latency 43.3 ms
    (81.1% faster)
    Latency is combined read and write latency at the 99.9th percentile

    View Slide

  227. R+W

    View Slide

  228. N=3

    View Slide

  229. N=3

    View Slide

  230. N=3

    View Slide

  231. focus on
    steady state
    with failures:
    unavailable
    or sloppy

    View Slide

  232. N = 3 replicas: R1, R2, R3
    Write to W, read from R replicas

    View Slide

  233. N = 3 replicas: R1, R2, R3
    Write to W, read from R replicas
    R=W=3 replicas: {R1, R2, R3}
    R=W=2 replicas: {R1, R2} {R2, R3} {R1, R3}
    quorum system:
    guaranteed intersection

    View Slide

  234. N = 3 replicas: R1, R2, R3
    Write to W, read from R replicas
    R=W=3 replicas: {R1, R2, R3}
    R=W=2 replicas: {R1, R2} {R2, R3} {R1, R3}
    R=W=1 replicas: {R1} {R2} {R3}
    quorum system:
    guaranteed intersection
    partial quorum system:
    may not intersect

    View Slide

  235. Coordinator Replica
    once per replica T
    i
    m
    e

    View Slide

  236. Coordinator Replica
    write
    once per replica T
    i
    m
    e

    View Slide

  237. Coordinator Replica
    write
    ack
    once per replica T
    i
    m
    e

    View Slide

  238. Coordinator Replica
    write
    ack
    wait for W
    responses
    once per replica T
    i
    m
    e

    View Slide

  239. Coordinator Replica
    write
    ack
    wait for W
    responses
    t seconds elapse
    once per replica T
    i
    m
    e

    View Slide

  240. Coordinator Replica
    write
    ack
    read
    wait for W
    responses
    t seconds elapse
    once per replica T
    i
    m
    e

    View Slide

  241. Coordinator Replica
    write
    ack
    read
    response
    wait for W
    responses
    t seconds elapse
    once per replica T
    i
    m
    e

    View Slide

  242. Coordinator Replica
    write
    ack
    read
    response
    wait for W
    responses
    t seconds elapse
    wait for R
    responses
    once per replica T
    i
    m
    e

    View Slide

  243. Coordinator Replica
    write
    ack
    read
    response
    wait for W
    responses
    t seconds elapse
    wait for R
    responses
    once per replica T
    i
    m
    e

    View Slide

  244. Coordinator Replica
    write
    ack
    read
    response
    wait for W
    responses
    t seconds elapse
    wait for R
    responses
    once per replica T
    i
    m
    e

    View Slide

  245. Coordinator Replica
    write
    ack
    read
    response
    wait for W
    responses
    t seconds elapse
    wait for R
    responses
    response is
    stale
    if read arrives
    before write
    once per replica T
    i
    m
    e

    View Slide

  246. Coordinator Replica
    write
    ack
    read
    response
    wait for W
    responses
    t seconds elapse
    wait for R
    responses
    response is
    stale
    if read arrives
    before write
    once per replica T
    i
    m
    e

    View Slide

  247. N=2
    T
    i
    m
    e

    View Slide

  248. write
    write
    N=2
    T
    i
    m
    e

    View Slide

  249. write
    ack
    write
    ack
    N=2
    T
    i
    m
    e

    View Slide

  250. write
    ack
    write
    ack
    W=1
    N=2
    T
    i
    m
    e

    View Slide

  251. write
    ack
    write
    ack
    W=1
    N=2
    T
    i
    m
    e

    View Slide

  252. write
    ack
    read
    write
    ack
    W=1
    N=2
    read
    T
    i
    m
    e

    View Slide

  253. write
    ack
    read
    response
    write
    ack
    W=1
    N=2
    read
    response
    T
    i
    m
    e

    View Slide

  254. write
    ack
    read
    response
    write
    ack
    W=1
    R=1
    N=2
    read
    response
    T
    i
    m
    e

    View Slide

  255. write
    ack
    read
    response
    write
    ack
    W=1
    R=1
    N=2
    read
    response
    T
    i
    m
    e

    View Slide

  256. write
    ack
    read
    response
    write
    ack
    W=1
    R=1
    N=2
    read
    response
    T
    i
    m
    e
    inconsistent

    View Slide

  257. N=3
    R=W=2
    quorum
    system

    View Slide

  258. N=3
    R=W=2
    quorum
    system

    View Slide

  259. N=3
    R=W=2
    quorum
    system

    View Slide

  260. Y N=3
    R=W=2
    quorum
    system

    View Slide

  261. Y N=3
    R=W=2
    quorum
    system

    View Slide

  262. Y N=3
    R=W=2
    quorum
    system

    View Slide

  263. Y
    Y
    N=3
    R=W=2
    quorum
    system

    View Slide

  264. Y
    Y
    N=3
    R=W=2
    quorum
    system

    View Slide

  265. Y
    Y
    N=3
    R=W=2
    quorum
    system

    View Slide

  266. Y
    Y
    Y
    N=3
    R=W=2
    quorum
    system

    View Slide

  267. Y
    Y
    Y
    N=3
    R=W=2
    quorum
    system

    View Slide

  268. Y
    Y
    Y
    Y
    Y
    Y
    N=3
    R=W=2
    quorum
    system

    View Slide

  269. Y
    Y
    Y
    Y
    Y
    Y
    N=3
    R=W=2
    quorum
    system

    View Slide

  270. Y
    Y
    Y
    Y
    Y
    Y
    Y
    Y
    Y
    N=3
    R=W=2
    quorum
    system

    View Slide

  271. Y
    Y
    Y
    Y
    Y
    Y
    Y
    Y
    Y
    N=3
    R=W=2
    quorum
    system

    View Slide

  272. Y
    Y
    Y
    Y
    Y
    Y
    Y
    Y
    Y
    guaranteed
    intersection
    N=3
    R=W=2
    quorum
    system

    View Slide

  273. N=3
    R=W=1
    partial
    quorum
    system

    View Slide

  274. N=3
    R=W=1
    partial
    quorum
    system

    View Slide

  275. N=3
    R=W=1
    partial
    quorum
    system

    View Slide

  276. Y N=3
    R=W=1
    partial
    quorum
    system

    View Slide

  277. Y N=3
    R=W=1
    partial
    quorum
    system

    View Slide

  278. Y N=3
    R=W=1
    partial
    quorum
    system

    View Slide

  279. Y
    N
    N=3
    R=W=1
    partial
    quorum
    system

    View Slide

  280. Y
    N
    N
    N=3
    R=W=1
    partial
    quorum
    system

    View Slide

  281. Y
    N
    N
    N=3
    R=W=1
    partial
    quorum
    system

    View Slide

  282. Y
    N
    N
    N
    Y
    N
    N=3
    R=W=1
    partial
    quorum
    system

    View Slide

  283. Y
    N
    N
    N
    Y
    N
    N=3
    R=W=1
    partial
    quorum
    system

    View Slide

  284. Y
    N
    N
    N
    Y
    N
    N
    N
    Y
    N=3
    R=W=1
    partial
    quorum
    system

    View Slide

  285. Y
    N
    N
    N
    Y
    N
    N
    N
    Y
    N=3
    R=W=1
    partial
    quorum
    system

    View Slide

  286. Y
    N
    N
    N
    Y
    N
    N
    N
    Y
    N=3
    R=W=1
    partial
    quorum
    system

    View Slide

  287. Y
    N
    N
    N
    Y
    N
    N
    N
    Y
    N=3
    R=W=1
    partial
    quorum
    system

    View Slide

  288. Y
    N
    N
    N
    Y
    N
    N
    N
    Y
    probabilistic
    intersection
    N=3
    R=W=1
    partial
    quorum
    system

    View Slide

  289. N
    N
    Y N=3
    R=W=1

    View Slide

  290. N
    N
    Y
    expanding
    quorums
    grow over time
    N=3
    R=W=1

    View Slide

  291. N
    Y
    Y
    expanding
    quorums
    grow over time
    N=3
    R=W=1

    View Slide

  292. Y
    Y
    Y
    expanding
    quorums
    grow over time
    N=3
    R=W=1

    View Slide


  294. Werner Vogels

    View Slide

  295. 1994-2004
    Werner Vogels

    View Slide

  296. 1994-2004
    2004-
    Werner Vogels

    View Slide

  297. N=3, R=W=2
    quorum
    system

    View Slide

  298. N=3, R=W=2
    quorum
    system

    View Slide

  299. N=3, R=W=2
    quorum
    system

    View Slide

  300. N=3, R=W=2
    quorum
    system

    View Slide

  301. N=3, R=W=2
    quorum
    system

    View Slide

  302. N=3, R=W=2
    quorum
    system

    View Slide

  303. N=3, R=W=2
    quorum
    system

    View Slide

  304. N=3, R=W=2
    quorum
    system

    View Slide

  305. N=3, R=W=2
    quorum
    system

    View Slide

  306. N=3, R=W=2
    quorum
    system

    View Slide

  307. guaranteed
    intersection
    N=3, R=W=2
    quorum
    system

    View Slide

  308. N=3, R=W=1
    partial
    quorum
    system

    View Slide

  309. N=3, R=W=1
    partial
    quorum
    system

    View Slide

  310. N=3, R=W=1
    partial
    quorum
    system

    View Slide

  311. N=3, R=W=1
    partial
    quorum
    system

    View Slide

  312. N=3, R=W=1
    partial
    quorum
    system

    View Slide

  313. N=3, R=W=1
    partial
    quorum
    system

    View Slide

  314. N=3, R=W=1
    partial
    quorum
    system

    View Slide

  315. N=3, R=W=1
    partial
    quorum
    system

    View Slide

  316. N=3, R=W=1
    partial
    quorum
    system
    probabilistic
    intersection

    View Slide

  317. expanding
    quorums
    N=3, R=W=1
    grow over
    time

    View Slide

  318. expanding
    quorums
    N=3, R=W=1
    grow over
    time

    View Slide

  319. expanding
    quorums
    N=3, R=W=1
    grow over
    time

    View Slide

  320. expanding
    quorums
    N=3, R=W=1
    grow over
    time

    View Slide

  321. expanding
    quorums
    N=3, R=W=1
    grow over
    time

    View Slide

  322. Solving WARS: hard
    Monte Carlo methods: easier

    View Slide

  323. PBS
    observation:
    no guarantees with
    eventual consistency
    remedy:
    consistency prediction
    technique:
    measure latencies,
    use WARS model

    View Slide

  324. PBS
    allows us to quantify
    latency-consistency
    trade-offs
    what’s the latency cost of consistency?
    what’s the consistency cost of latency?

    View Slide

  325. PBS
    allows us to quantify
    latency-consistency
    trade-offs
    what’s the latency cost of consistency?
    what’s the consistency cost of latency?
    an “SLA” for consistency

    View Slide