
Probabilistically Bounded Staleness for Practical Partial Quorums

pbailis
August 28, 2012

Transcript

  1. Peter Bailis, Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica

    PBS
  2. Peter Bailis, Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica

    VLDB 2012 UC Berkeley Probabilistically Bounded Staleness for Practical Partial Quorums PBS
  3. R+W

  4. R+W strong consistency

  5. R+W strong consistency eventual consistency

  6. R+W strong consistency higher latency eventual consistency

  7. R+W strong consistency higher latency eventual consistency lower latency

  8. R+W strong consistency higher latency eventual consistency lower latency

  9. consistency is a binary choice

  10. consistency is a binary choice: strong / eventual

  11. consistency is a continuum: strong … eventual

  12. consistency is a continuum: strong … eventual

  13. consistency is a continuum: strong … eventual

  14. our focus: latency vs. consistency

  15. our focus: latency vs. consistency

  16. our focus: latency vs. consistency, informed by practice

  17. our focus: latency vs. consistency, informed by practice; not in this talk: availability, partitions, failures

  18. our contributions: quantify eventual consistency (wall-clock time: “how eventual?”; versions: “how consistent?”); analyze real-world systems: EC is often strongly consistent, and we describe when and why
  19. intro system model practice metrics insights integration

  20. Dynamo: Amazon’s Highly Available Key-value Store SOSP 2007

  21. Apache, DataStax Project Voldemort Dynamo: Amazon’s Highly Available Key-value Store

    SOSP 2007
  22. Adobe Cisco Digg Gowalla IBM Morningstar Netflix Palantir Rackspace Reddit

    Rhapsody Shazam Spotify Soundcloud Twitter Mozilla Ask.com Yammer Aol GitHub JoyentCloud Best Buy LinkedIn Boeing Comcast Cassandra Riak Voldemort Gilt Groupe
  23. N replicas/key read: wait for R replies write: wait for

    W acks
  24. N replicas/key read: wait for R replies write: wait for

    W acks
  25. N replicas/key read: wait for R replies write: wait for

    W acks N=3
  26. N replicas/key read: wait for R replies write: wait for

    W acks N=3
  27. N replicas/key read: wait for R replies write: wait for

    W acks N=3
  28. N replicas/key read: wait for R replies write: wait for

    W acks N=3 R=2
  29. N replicas/key read: wait for R replies write: wait for

    W acks N=3 R=2
  30. if R+W > N then: “strong” consistency; else: eventual consistency

  31. R+W > N: “strong” consistency (regular register): reads return the last acknowledged write or an in-flight write (per-key)
  32. Latency LinkedIn disk-based model N=3

  33. Latency, LinkedIn disk-based model, N=3:
      R    99th    99.9th
      1    1x      1x
      2    1.59x   2.35x
      3    4.8x    6.13x

  34. Latency, LinkedIn disk-based model, N=3:
      R    99th    99.9th        W    99th    99.9th
      1    1x      1x            1    1x      1x
      2    1.59x   2.35x         2    2.01x   1.9x
      3    4.8x    6.13x         3    4.96x   14.96x
  35. ⇧ consistency, ⇧ latency: wait for more replicas, read more recent data; ⇩ consistency, ⇩ latency: wait for fewer replicas, read less recent data

  36. ⇧ consistency, ⇧ latency: wait for more replicas, read more recent data; ⇩ consistency, ⇩ latency: wait for fewer replicas, read less recent data
  37. eventual consistency “if no new updates are made to the

    object, eventually all accesses will return the last updated value” W. Vogels, CACM 2008 R+W ≤ N
  38. How eventual? How long do I have to wait?

  39. How consistent? What happens if I don’t wait?

  40. R+W strong consistency higher latency eventual consistency lower latency

  41. R+W strong consistency higher latency eventual consistency lower latency

  42. R+W strong consistency higher latency eventual consistency lower latency

  43. intro system model practice metrics insights integration

  44. Cassandra: R=W=1, N=3 by default (1+1 ≯ 3)

  45. eventual consistency in the wild: “maximum performance”, “very low latency”, okay for “most data”, the “general case”
  46. anecdotally, EC “good enough” for many kinds of data

  47. anecdotally, EC “good enough” for many kinds of data; How eventual? How consistent?

  48. anecdotally, EC “good enough” for many kinds of data; How eventual? How consistent? “eventual and consistent enough”
  49. Can we do better?

  50. Can we do better? can’t make promises, but can give expectations

  51. Can we do better? can’t make promises, but can give expectations: Probabilistically Bounded Staleness
  52. intro system model practice metrics insights integration

  53. How eventual? How long do I have to wait?

  54. How eventual?

  55. How eventual? t-visibility: probability p of consistent reads after t seconds (e.g., 10 ms after write, 99.9% of reads consistent); restated below
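Restated compactly (notation mine, not from the deck): for a write acknowledged at time t_0,

    \[
      P(t) \;=\; \Pr\bigl[\text{a read issued at time } t_0 + t \text{ returns that write or a newer value}\bigr],
    \]

so the slide’s example says P(10 ms) = 0.999.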
  56. t-visibility depends on messaging and processing delays

  57. Coordinator / Replica, once per replica

  58. Coordinator / Replica, once per replica: write

  59. Coordinator / Replica, once per replica: write, ack

  60. Coordinator / Replica, once per replica: write, ack; wait for W responses

  61. Coordinator / Replica, once per replica: write, ack; wait for W responses; t seconds elapse

  62. Coordinator / Replica, once per replica: write, ack; wait for W responses; t seconds elapse; read

  63. Coordinator / Replica, once per replica: write, ack; wait for W responses; t seconds elapse; read, response

  64. Coordinator / Replica, once per replica: write, ack; wait for W responses; t seconds elapse; read, response; wait for R responses

  65. Coordinator / Replica, once per replica: write, ack; wait for W responses; t seconds elapse; read, response; wait for R responses; response is stale if read arrives before write

  66. Coordinator / Replica, once per replica: write, ack; wait for W responses; t seconds elapse; read, response; wait for R responses; response is stale if read arrives before write

  67. Coordinator / Replica, once per replica: write, ack; wait for W responses; t seconds elapse; read, response; wait for R responses; response is stale if read arrives before write

  68. Coordinator / Replica, once per replica: write, ack; wait for W responses; t seconds elapse; read, response; wait for R responses; response is stale if read arrives before write
  69. Alice; replicas R1, R2; N=2

  70. Alice: write; replicas R1, R2; N=2

  71. Alice: write, ack; replicas R1, R2; N=2

  72. Alice: write, ack; replicas R1, R2; W=1, N=2

  73. Alice: write, ack; replicas R1, R2; W=1, N=2

  74. Alice, Bob: write, ack; replicas R1, R2; W=1, N=2

  75. Alice, Bob: write, ack, read; replicas R1, R2; W=1, N=2

  76. Alice, Bob: write, ack, read (Bob reads R2); replicas R1, R2; W=1, N=2

  77. Alice, Bob: write, ack, read (Bob reads R2), response; replicas R1, R2; W=1, N=2

  78. Alice, Bob: write, ack, read (Bob reads R2), response; replicas R1, R2; W=1, R=1, N=2

  79. Alice, Bob: write, ack, read (Bob reads R2), response; replicas R1, R2; W=1, R=1, N=2; inconsistent

  80. Alice, Bob: write, ack, read (Bob reads R2), response; replicas R1, R2; W=1, R=1, N=2; inconsistent

  81. Alice, Bob: write, ack, read (Bob reads R2), response; replicas R1, R2; W=1, R=1, N=2; inconsistent
  82. Coordinator / Replica, once per replica: write, ack, read, response; wait for W responses; t seconds elapse; wait for R responses; response is stale if read arrives before write

  83. Coordinator / Replica, once per replica: write (W), ack, read, response; wait for W responses; t seconds elapse; wait for R responses; response is stale if read arrives before write

  84. Coordinator / Replica, once per replica: write (W), ack (A), read, response; wait for W responses; t seconds elapse; wait for R responses; response is stale if read arrives before write

  85. Coordinator / Replica, once per replica: write (W), ack (A), read (R), response; wait for W responses; t seconds elapse; wait for R responses; response is stale if read arrives before write

  86. Coordinator / Replica, once per replica: write (W), ack (A), read (R), response (S); wait for W responses; t seconds elapse; wait for R responses; response is stale if read arrives before write
  87. solving WARS analytically means order statistics over dependent variables; instead: Monte Carlo methods

  88. to use WARS: gather latency data (W: 53.2, 44.5, 101.1, ...; A: 10.3, 8.2, 11.3, ...; R: 15.3, 22.4, 19.8, ...; S: 9.6, 14.2, 6.7, ... ms), then run simulation (Monte Carlo, sampling)

  89. to use WARS: gather latency data, run simulation (Monte Carlo, sampling); sampled: W=44.5

  90. to use WARS: gather latency data, run simulation (Monte Carlo, sampling); sampled: W=44.5, A=11.3

  91. to use WARS: gather latency data, run simulation (Monte Carlo, sampling); sampled: W=44.5, A=11.3, R=15.3

  92. to use WARS: gather latency data, run simulation (Monte Carlo, sampling); sampled: W=44.5, A=11.3, R=15.3, S=14.2 (see the simulation sketch below)
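The procedure the last few slides build up fits in a few lines. Below is a minimal Python sketch of that Monte Carlo simulation, assuming latencies are sampled independently per replica from the W/A/R/S traces and that the coordinator returns the freshest of the first R responses; the function name and trial count are illustrative, not from the deck.

    import random

    def estimate_staleness(w, a, r, s, N, W, R, t, trials=100_000):
        """Estimate Pr[a read issued t ms after a write commits returns stale data]."""
        stale = 0
        for _ in range(trials):
            # When the write reaches each replica, and when each ack returns.
            write_at = [random.choice(w) for _ in range(N)]
            acks = sorted(wa + random.choice(a) for wa in write_at)
            commit = acks[W - 1]  # the write returns to the client after W acks
            # Each replica answers the read after its own R and S delays.
            replies = []
            for i in range(N):
                read_at = commit + t + random.choice(r)
                replies.append((read_at + random.choice(s),   # when the reply arrives
                                write_at[i] <= read_at))      # replica saw the write?
            # The coordinator keeps the R fastest replies; stale iff none is fresh.
            if not any(fresh for _, fresh in sorted(replies)[:R]):
                stale += 1
        return stale / trials

    # e.g., with the latency samples shown on the slide (ms):
    # estimate_staleness([53.2, 44.5, 101.1], [10.3, 8.2, 11.3],
    #                    [15.3, 22.4, 19.8], [9.6, 14.2, 6.7],
    #                    N=3, W=1, R=1, t=10.0)

Sweeping t in this sketch traces out the t-visibility curve for a given N, R, W.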
  93. WARS accuracy: real Cassandra cluster, varying latencies; t-visibility RMSE: 0.28%, latency N-RMSE: 0.48%
  94. How eventual? t-visibility: consistent reads with probability p after t seconds; key: the WARS model; need: latencies
  95. intro system model practice metrics insights integration

  96. Yammer (100K+ companies) uses Riak; LinkedIn (175M+ users) built and uses Voldemort; production latencies fit Gaussian mixtures
  97. N=3

  98. 10 ms N=3

  99. LNKD-DISK, N=3 (latency is combined read and write latency at the 99.9th percentile): R=3, W=1: 100% consistent, latency 15.01 ms; R=2, W=1, t=13.6 ms: 99.9% consistent, latency 12.53 ms

  100. LNKD-DISK, N=3 (latency is combined read and write latency at the 99.9th percentile): R=3, W=1: 100% consistent, latency 15.01 ms; R=2, W=1, t=13.6 ms: 99.9% consistent, latency 12.53 ms (16.5% faster)

  101. LNKD-DISK, N=3 (latency is combined read and write latency at the 99.9th percentile): R=3, W=1: 100% consistent, latency 15.01 ms; R=2, W=1, t=13.6 ms: 99.9% consistent, latency 12.53 ms (16.5% faster); worthwhile? (arithmetic below)
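For reference, the 16.5% figure is just the relative reduction in combined 99.9th-percentile latency:

    \[
      \frac{15.01 - 12.53}{15.01} \approx 0.165 = 16.5\%.
    \]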
  102. N=3

  103. N=3

  104. N=3

  105. LNKD-SSD, N=3 (latency is combined read and write latency at the 99.9th percentile): R=3, W=1: 100% consistent, latency 4.20 ms; R=1, W=1, t=1.85 ms: 99.9% consistent, latency 1.32 ms

  106. LNKD-SSD, N=3 (latency is combined read and write latency at the 99.9th percentile): R=3, W=1: 100% consistent, latency 4.20 ms; R=1, W=1, t=1.85 ms: 99.9% consistent, latency 1.32 ms (59.5% faster)
  107. [CDF plots: write latency (ms) for W=1, W=2, W=3; LNKD-SSD vs. LNKD-DISK; N=3]

  108. [CDF plots: write latency (ms) for W=1, W=2, W=3; LNKD-SSD vs. LNKD-DISK; N=3]

  109. [CDF plots: write latency (ms) for W=1, W=2, W=3; LNKD-SSD vs. LNKD-DISK; N=3]
  110. WARS recap (Coordinator / Replica, once per replica): write (W), ack (A), read (R), response (S); wait for W responses; t seconds elapse; wait for R responses; response is stale if read arrives before write. SSDs reduce variance compared to disks!
  111. Yammer: 81.1% (187 ms) latency reduction (99.9th percentile) at a t-visibility of 202 ms, N=3

  112. How consistent? k-staleness (versions); monotonic reads and quorum load: in the paper
  113. in the paper <k,t>-staleness: versions and time

  114. in the paper: latency distributions, WAN model, varying quorum sizes, staleness detection
  115. intro system model practice metrics insights integration

  116. Integration (Project Voldemort): 1. Tracing 2. Simulation 3. Tune N, R, W
  117. https://issues.apache.org/jira/browse/CASSANDRA-4261

  118. None
  119. None
  120. Related Work Quorum Systems • probabilistic quorums [PODC ’97] •

    deterministic k-quorums [DISC ’05, ’06] Consistency Verification • Golab et al. [PODC ’11] • Bermbach and Tai [M4WSOC ’11] • Wada et al. [CIDR ’11] • Anderson et al. [HotDep ’10] • Transactional consistency: Zellag and Kemme [ICDE ’11], Fekete et al. [VLDB ’09] Latency-Consistency • Daniel Abadi [Computer ’12] • Kraska et al. [VLDB ’09] Bounded Staleness Guarantees • TACT [OSDI ’00] • FRACS [ICDCS ’03] • AQuA [IEEE TPDS ’03]
  121. R+W strong consistency higher latency eventual consistency lower latency

  122. consistency is a …

  123. consistency is a continuum

  124. consistency is a continuum: strong … eventual

  125. consistency is a continuum: strong … eventual

  126. PBS: quantify eventual consistency; model staleness in time and versions; latency-consistency trade-offs; analyze real systems and hardware

  127. PBS: quantify eventual consistency; model staleness in time and versions; latency-consistency trade-offs; analyze real systems and hardware; quantify which choice is best and explain why EC is often strongly consistent

  128. PBS: quantify eventual consistency; model staleness in time and versions; latency-consistency trade-offs; analyze real systems and hardware; quantify which choice is best and explain why EC is often strongly consistent; pbs.cs.berkeley.edu
  129. Extra Slides

  130. Non-expanding Quorum Systems e.g., probabilistic quorums (PODC ’97) deterministic k-quorums

    (DISC ’05, ’06) Bounded Staleness Guarantees e.g., TACT (OSDI ’00), FRACS (ICDCS ’03)
  131. Consistency Verification e.g., Golab et al. (PODC ’11), Bermbach and

    Tai (M4WSOC ’11), Wada et al. (CIDR ’11) Latency-Consistency Daniel Abadi (IEEE Computer ’12)
  132. PBS and apps

  133. staleness requires either: staleness-tolerant data structures (timelines, logs; cf. commutative data structures, logical monotonicity) or asynchronous compensation code (detect violations after data is returned, then write code to fix any errors; cf. “Building on Quicksand”: memories, guesses, apologies; see paper)
  134. asynchronous compensation: minimize (compensation cost) × (# of expected anomalies)

  135. Read only newer data? (monotonic reads session guarantee) tolerable staleness in versions, for a given key ≈ global write rate / client’s read rate (worked example below)
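A rough illustration with assumed numbers, on the ratio reading of the slide’s formula: if a key is written 50 times per second globally and a particular client re-reads it 10 times per second, about five new versions arrive between that client’s reads, so

    \[
      k_{\text{tolerable}} \;\approx\; \frac{\text{global write rate}}{\text{client read rate}} = \frac{50\ \text{writes/s}}{10\ \text{reads/s}} = 5\ \text{versions}.
    \]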
  136. Failure?

  137. Treat failures as latency spikes

  138. How long do partitions last?

  139. what time interval? 99.9% uptime/yr ⇒ 8.76 hours downtime/yr; 8.76 consecutive hours down ⇒ bad 8-hour rolling average

  140. what time interval? 99.9% uptime/yr ⇒ 8.76 hours downtime/yr; 8.76 consecutive hours down ⇒ bad 8-hour rolling average; hide in tail of distribution OR continuously evaluate SLA, adjust (arithmetic below)
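The downtime figure is plain arithmetic over a 365-day year:

    \[
      (1 - 0.999) \times 365 \times 24\ \text{h} = 0.001 \times 8760\ \text{h} = 8.76\ \text{hours of downtime per year}.
    \]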
  141. [CDF plots: write latency (ms) for W=1, W=2, W=3; LNKD-SSD, LNKD-DISK, YMMR, WAN; N=3]

  142. [CDF plots for R/W = 1, 2, 3; LNKD-SSD, LNKD-DISK, YMMR, WAN; N=3 (LNKD-SSD and LNKD-DISK identical for reads)]
  143. <k,t>-staleness: versions and time

  144. <k,t>-staleness: versions and time; approximation: exponentiate t-staleness by k (spelled out below)
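Spelled out in my notation: if a single read is t-stale with probability p_t and successive versions are treated as roughly independent, the approximation says

    \[
      \Pr\bigl[\text{read is more than } k \text{ versions (and } t \text{ seconds) stale}\bigr] \;\approx\; p_t^{\,k},
    \]

so the chance of returning a very old version falls off geometrically in k.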

  145. “strong” consistency: reads return the last written value or newer (defined w.r.t. real time, when the read started)
  146. R1 N = 3 replicas R2 R3 Write to W,

    read from R replicas
  147. R1 N = 3 replicas R2 R3 R=W=3 replicas {

    } } { R1 R2 R3 R=W=2 replicas { } R1 { R2 } R2 { R3 } R1 { R3 } Write to W, read from R replicas quorum system: guaranteed intersection
  148. R1 N = 3 replicas R2 R3 R=W=3 replicas R=W=1

    replicas { } } { R1 R2 R3 { } R1 } { R2 } { R3 } { R=W=2 replicas { } R1 { R2 } R2 { R3 } R1 { R3 } Write to W, read from R replicas quorum system: guaranteed intersection partial quorum system: may not intersect
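The intersection claim on the last two slides is easy to check by brute force. A small Python sketch (function name mine) that enumerates every read/write quorum pair:

    from itertools import combinations

    def always_intersect(N, R, W):
        """True iff every size-R read quorum overlaps every size-W write quorum."""
        replicas = range(N)
        return all(set(rq) & set(wq)
                   for rq in combinations(replicas, R)
                   for wq in combinations(replicas, W))

    # N=3: R=W=2 is a quorum system, R=W=1 is only a partial quorum system.
    assert always_intersect(3, 2, 2)        # guaranteed intersection
    assert not always_intersect(3, 1, 1)    # may not intersect

By pigeonhole, the check succeeds exactly when R + W > N, which is the “strong” consistency condition from earlier slides.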
  149. Synthetic, Exponential Distributions N=3, W=1, R=1

  150. Synthetic, exponential distributions; W 1/4× ARS; N=3, W=1, R=1

  151. Synthetic, exponential distributions; W 1/4× ARS and W 10× ARS; N=3, W=1, R=1
  152. concurrent writes: deterministically choose Coordinator R=2 (“key”, 1) (“key”, 2)

  153. None
  154. None
  155. None
  156. None
  157. N = 3 replicas Coordinator client read R=3 R1 R2

    R3 (“key”, 1) (“key”, 1) (“key”, 1)
  158. N = 3 replicas Coordinator client read(“key”) read R=3 R1

    R2 R3 (“key”, 1) (“key”, 1) (“key”, 1)
  159. N = 3 replicas Coordinator read(“key”) client read R=3 R1

    R2 R3 (“key”, 1) (“key”, 1) (“key”, 1)
  160. N = 3 replicas Coordinator (“key”, 1) (“key”, 1) (“key”,

    1) client read R=3 R1 R2 R3 (“key”, 1) (“key”, 1) (“key”, 1)
  161. N = 3 replicas Coordinator (“key”, 1) (“key”, 1) (“key”,

    1) client (“key”, 1) read R=3 R1 R2 R3 (“key”, 1) (“key”, 1) (“key”, 1)
  162. N = 3 replicas Coordinator read R=3 R1 R2 R3

    (“key”, 1) (“key”, 1) (“key”, 1) client
  163. N = 3 replicas Coordinator read(“key”) read R=3 R1 R2

    R3 (“key”, 1) (“key”, 1) (“key”, 1) client
  164. N = 3 replicas Coordinator read(“key”) read R=3 R1 R2

    R3 (“key”, 1) (“key”, 1) (“key”, 1) client
  165. N = 3 replicas Coordinator (“key”, 1) read R=3 R1

    R2 R3 (“key”, 1) (“key”, 1) (“key”, 1) client
  166. N = 3 replicas Coordinator (“key”, 1) (“key”, 1) read

    R=3 R1 R2 R3 (“key”, 1) (“key”, 1) (“key”, 1) client
  167. N = 3 replicas Coordinator (“key”, 1) (“key”, 1) (“key”,

    1) read R=3 R1 R2 R3 (“key”, 1) (“key”, 1) (“key”, 1) client
  168. N = 3 replicas Coordinator (“key”, 1) (“key”, 1) (“key”,

    1) (“key”, 1) read R=3 R1 R2 R3 (“key”, 1) (“key”, 1) (“key”, 1) client
  169. N = 3 replicas Coordinator read R=1 R1 R2 R3

    (“key”, 1) (“key”, 1) (“key”, 1) client
  170. N = 3 replicas Coordinator read(“key”) read R=1 R1 R2

    R3 (“key”, 1) (“key”, 1) (“key”, 1) client
  171. N = 3 replicas Coordinator read(“key”) send read to all

    read R=1 R1 R2 R3 (“key”, 1) (“key”, 1) (“key”, 1) client
  172. N = 3 replicas Coordinator (“key”, 1) send read to

    all read R=1 R1 R2 R3 (“key”, 1) (“key”, 1) (“key”, 1) client
  173. N = 3 replicas Coordinator (“key”, 1) (“key”, 1) send

    read to all read R=1 R1 R2 R3 (“key”, 1) (“key”, 1) (“key”, 1) client
  174. N = 3 replicas Coordinator (“key”, 1) (“key”, 1) (“key”,

    1) send read to all read R=1 R1 R2 R3 (“key”, 1) (“key”, 1) (“key”, 1) client
  175. N = 3 replicas Coordinator (“key”, 1) (“key”, 1) (“key”,

    1) (“key”, 1) send read to all read R=1 R1 R2 R3 (“key”, 1) (“key”, 1) (“key”, 1) client
  176. Coordinator W=1 R1(“key”, 1) R2(“key”, 1) R3(“key”, 1)

  177. Coordinator write(“key”, 2) W=1 R1(“key”, 1) R2(“key”, 1) R3(“key”, 1)

  178. Coordinator write(“key”, 2) W=1 R1(“key”, 1) R2(“key”, 1) R3(“key”, 1)

  179. Coordinator ack(“key”, 2) W=1 R1 R2(“key”, 1) R3(“key”, 1) (“key”,

    2)
  180. Coordinator ack(“key”, 2) ack(“key”, 2) W=1 R1 R2(“key”, 1) R3(“key”,

    1) (“key”, 2)
  181. Coordinator Coordinator read(“key”) ack(“key”, 2) W=1 R1 R2(“key”, 1) R3(“key”,

    1) (“key”, 2) R=1
  182. Coordinator Coordinator ack(“key”, 2) W=1 R1 R2(“key”, 1) R3(“key”, 1)

    (“key”, 2) read(“key”) R=1
  183. Coordinator Coordinator ack(“key”, 2) W=1 R1 R2(“key”, 1) R3(“key”, 1)

    (“key”, 2) (“key”, 1) R=1
  184. Coordinator Coordinator ack(“key”, 2) W=1 R1 R2(“key”, 1) R3(“key”, 1)

    (“key”, 2) (“key”,1) R=1
  185. Coordinator Coordinator ack(“key”, 2) W=1 R1 R2(“key”, 1) R3(“key”, 1)

    (“key”, 2) (“key”,1) R=1
  186. Coordinator Coordinator ack(“key”, 2) W=1 R1 R2(“key”, 1) R3(“key”, 1)

    (“key”, 2) (“key”,1) R=1
  187. Coordinator Coordinator ack(“key”, 2) W=1 R1 R2(“key”, 1) R3(“key”, 1)

    (“key”, 2) (“key”,1) R=1
  188. (“key”, 2) Coordinator Coordinator ack(“key”, 2) W=1 R1 R2 R3(“key”,

    1) (“key”, 2) (“key”,1) ack(“key”, 2) R=1
  189. (“key”, 2) Coordinator Coordinator ack(“key”, 2) W=1 R1 R2 R3

    (“key”, 2) (“key”,1) ack(“key”, 2) ack(“key”, 2) (“key”, 2) R=1
  190. (“key”, 2) Coordinator Coordinator ack(“key”, 2) W=1 R1 R2 R3

    (“key”, 2) (“key”,1) (“key”, 2) R=1
  191. (“key”, 2) Coordinator Coordinator ack(“key”, 2) W=1 R1 R2 R3

    (“key”, 2) (“key”,1) (“key”, 2) (“key”, 2) R=1
  192. (“key”, 2) Coordinator Coordinator ack(“key”, 2) W=1 R1 R2 R3

    (“key”, 2) (“key”,1) (“key”, 2) (“key”, 2) (“key”, 2) R=1
  193. (“key”, 2) Coordinator Coordinator ack(“key”, 2) W=1 R1 R2 R3

    (“key”, 2) (“key”,1) (“key”, 2) R=1
  194. None
  195. keep replicas in sync

  196. keep replicas in sync

  197. keep replicas in sync

  198. keep replicas in sync

  199. keep replicas in sync

  200. keep replicas in sync

  201. keep replicas in sync

  202. keep replicas in sync slow

  203. keep replicas in sync slow alternative: sync later

  204. keep replicas in sync slow alternative: sync later

  205. keep replicas in sync slow alternative: sync later

  206. keep replicas in sync slow alternative: sync later inconsistent

  207. keep replicas in sync slow alternative: sync later inconsistent

  208. keep replicas in sync slow alternative: sync later inconsistent

  209. keep replicas in sync slow alternative: sync later inconsistent

  210. http://ria101.wordpress.com/2010/02/24/hbase-vs-cassandra-why-we-moved/ "In the general case, we typically use [Cassandra’s] consistency

    level of [R=W=1], which provides maximum performance. Nice!" --D. Williams, “HBase vs Cassandra: why we moved” February 2010
  211. http://www.reddit.com/r/programming/comments/bcqhi/reddits_now_running_on_cassandra/c0m3wh6

  212. http://www.reddit.com/r/programming/comments/bcqhi/reddits_now_running_on_cassandra/c0m3wh6

  213. How consistent? What happens if I don’t wait?

  214. the probability of reading data older than k versions is exponentially reduced by k: Pr(reading latest write) = 99%, Pr(reading one of last two writes) = 99.9%, Pr(reading one of last three writes) = 99.99%
  215. VLDB 2012 early print: tinyurl.com/pbsvldb; Cassandra patch: tinyurl.com/pbspatch

  216. “strong” consistency: reads return the last written value or newer (defined w.r.t. real time, when the read started)
  217. Coordinator Coordinator ack(“key”, 2) W=1 R1 R2(“key”, 1) R3(“key”, 1)

    (“key”, 2) R=1
  218. Coordinator Coordinator ack(“key”, 2) W=1 R1 R2(“key”, 1) R3(“key”, 1)

    (“key”, 2) (“key”, 1) (“key”,1) R=1
  219. Coordinator Coordinator write(“key”, 2) ack(“key”, 2) W=1 R1 R2(“key”, 1)

    R3(“key”, 1) (“key”, 2) (“key”, 1) (“key”,1) R=1 R3 replied before last write arrived!
  220. 99.9% consistent reads: R=1, W=1 t = 1.85 ms Latency:

    1.32 ms Latency is combined read and write latency at 99.9th percentile 100% consistent reads: R=3, W=1 Latency: 4.20 ms LNKD-SSD N=3
  221. 99.9% consistent reads: R=1, W=1 t = 1.85 ms Latency:

    1.32 ms Latency is combined read and write latency at 99.9th percentile 100% consistent reads: R=3, W=1 Latency: 4.20 ms LNKD-SSD N=3 59.5% faster
  222. Workflow: 1. Tracing 2. Simulation 3. Tune N, R, W 4. Profit (see the tuning sketch below)
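A minimal sketch of the “Tune N, R, W” step, reusing the estimate_staleness helper sketched earlier; the target numbers are illustrative, not from the deck.

    def pick_quorums(w, a, r, s, N, t, target_p=0.999, trials=100_000):
        """Return the (R, W) pairs whose predicted t-visibility meets the target."""
        ok = []
        for R in range(1, N + 1):
            for W in range(1, N + 1):
                p_stale = estimate_staleness(w, a, r, s, N, W, R, t, trials)
                if 1 - p_stale >= target_p:
                    ok.append((R, W))
        # Smaller R and W mean waiting on fewer replicas, i.e. lower latency.
        return ok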
  223. None
  224. None
  225. 99.9% consistent reads: R=1, W=1 t = 202.0 ms Latency:

    43.3 ms Latency is combined read and write latency at 99.9th percentile 100% consistent reads: R=3, W=1 Latency: 230.06 ms YMMR N=3
  226. 99.9% consistent reads: R=1, W=1 t = 202.0 ms Latency:

    43.3 ms Latency is combined read and write latency at 99.9th percentile 100% consistent reads: R=3, W=1 Latency: 230.06 ms YMMR N=3 81.1% faster
  227. R+W

  228. N=3

  229. N=3

  230. N=3

  231. focus on steady state; with failures: unavailable or sloppy quorums
  232. R1 N = 3 replicas R2 R3 Write to W,

    read from R replicas
  233. R1 N = 3 replicas R2 R3 R=W=3 replicas {

    } } { R1 R2 R3 R=W=2 replicas { } R1 { R2 } R2 { R3 } R1 { R3 } Write to W, read from R replicas quorum system: guaranteed intersection
  234. R1 N = 3 replicas R2 R3 R=W=3 replicas R=W=1

    replicas { } } { R1 R2 R3 { } R1 } { R2 } { R3 } { R=W=2 replicas { } R1 { R2 } R2 { R3 } R1 { R3 } Write to W, read from R replicas quorum system: guaranteed intersection partial quorum system: may not intersect
  235. Coordinator / Replica, once per replica

  236. Coordinator / Replica, once per replica: write

  237. Coordinator / Replica, once per replica: write, ack

  238. Coordinator / Replica, once per replica: write, ack; wait for W responses

  239. Coordinator / Replica, once per replica: write, ack; wait for W responses; t seconds elapse

  240. Coordinator / Replica, once per replica: write, ack; wait for W responses; t seconds elapse; read

  241. Coordinator / Replica, once per replica: write, ack; wait for W responses; t seconds elapse; read, response

  242. Coordinator / Replica, once per replica: write, ack; wait for W responses; t seconds elapse; read, response; wait for R responses

  243. Coordinator / Replica, once per replica: write, ack; wait for W responses; t seconds elapse; read, response; wait for R responses

  244. Coordinator / Replica, once per replica: write, ack; wait for W responses; t seconds elapse; read, response; wait for R responses

  245. Coordinator / Replica, once per replica: write, ack; wait for W responses; t seconds elapse; read, response; wait for R responses; response is stale if read arrives before write

  246. Coordinator / Replica, once per replica: write, ack; wait for W responses; t seconds elapse; read, response; wait for R responses; response is stale if read arrives before write
  247. N=2

  248. write; write; N=2

  249. write, ack; write, ack; N=2

  250. write, ack; write, ack; W=1, N=2

  251. write, ack; write, ack; W=1, N=2

  252. write, ack, read; write, ack, read; W=1, N=2

  253. write, ack, read, response; write, ack, read, response; W=1, N=2

  254. write, ack, read, response; write, ack, read, response; W=1, R=1, N=2

  255. write, ack, read, response; write, ack, read, response; W=1, R=1, N=2

  256. write, ack, read, response; write, ack, read, response; W=1, R=1, N=2; inconsistent
  257. N=3 R=W=2 quorum system

  258. N=3 R=W=2 quorum system

  259. N=3 R=W=2 quorum system

  260. Y N=3 R=W=2 quorum system

  261. Y N=3 R=W=2 quorum system

  262. Y N=3 R=W=2 quorum system

  263. Y Y N=3 R=W=2 quorum system

  264. Y Y N=3 R=W=2 quorum system

  265. Y Y N=3 R=W=2 quorum system

  266. Y Y Y N=3 R=W=2 quorum system

  267. Y Y Y N=3 R=W=2 quorum system

  268. Y Y Y Y Y Y N=3 R=W=2 quorum system

  269. Y Y Y Y Y Y N=3 R=W=2 quorum system

  270. Y Y Y Y Y Y Y Y Y N=3

    R=W=2 quorum system
  271. Y Y Y Y Y Y Y Y Y N=3

    R=W=2 quorum system
  272. Y Y Y Y Y Y Y Y Y guaranteed

    intersection N=3 R=W=2 quorum system
  273. N=3 R=W=1 partial quorum system

  274. N=3 R=W=1 partial quorum system

  275. N=3 R=W=1 partial quorum system

  276. Y N=3 R=W=1 partial quorum system

  277. Y N=3 R=W=1 partial quorum system

  278. Y N=3 R=W=1 partial quorum system

  279. Y N N=3 R=W=1 partial quorum system

  280. Y N N N=3 R=W=1 partial quorum system

  281. Y N N N=3 R=W=1 partial quorum system

  282. Y N N N Y N N=3 R=W=1 partial quorum

    system
  283. Y N N N Y N N=3 R=W=1 partial quorum

    system
  284. Y N N N Y N N N Y N=3

    R=W=1 partial quorum system
  285. Y N N N Y N N N Y N=3

    R=W=1 partial quorum system
  286. Y N N N Y N N N Y N=3

    R=W=1 partial quorum system
  287. Y N N N Y N N N Y N=3

    R=W=1 partial quorum system
  288. Y N N N Y N N N Y probabilistic

    intersection N=3 R=W=1 partial quorum system
  289. N N Y N=3 R=W=1

  290. N N Y expanding quorums grow over time N=3 R=W=1

  291. N Y Y expanding quorums grow over time N=3 R=W=1

  292. Y Y Y expanding quorums grow over time N=3 R=W=1

  293. None
  294. Werner Vogels

  295. 1994-2004 Werner Vogels

  296. 1994-2004 2004- Werner Vogels

  297. N=3, R=W=2 quorum system

  298. N=3, R=W=2 quorum system

  299. N=3, R=W=2 quorum system

  300. N=3, R=W=2 quorum system

  301. N=3, R=W=2 quorum system

  302. N=3, R=W=2 quorum system

  303. N=3, R=W=2 quorum system

  304. N=3, R=W=2 quorum system

  305. N=3, R=W=2 quorum system

  306. N=3, R=W=2 quorum system

  307. guaranteed intersection N=3, R=W=2 quorum system

  308. N=3, R=W=1 partial quorum system

  309. N=3, R=W=1 partial quorum system

  310. N=3, R=W=1 partial quorum system

  311. N=3, R=W=1 partial quorum system

  312. N=3, R=W=1 partial quorum system

  313. N=3, R=W=1 partial quorum system

  314. N=3, R=W=1 partial quorum system

  315. N=3, R=W=1 partial quorum system

  316. N=3, R=W=1 partial quorum system probabilistic intersection

  317. expanding quorums N=3, R=W=1 grow over time

  318. expanding quorums N=3, R=W=1 grow over time

  319. expanding quorums N=3, R=W=1 grow over time

  320. expanding quorums N=3, R=W=1 grow over time

  321. expanding quorums N=3, R=W=1 grow over time

  322. Solving WARS: hard Monte Carlo methods: easier

  323. PBS: observation: no guarantees with eventual consistency; remedy: consistency prediction; technique: measure latencies, use the WARS model
  324. PBS allows us to quantify latency-consistency trade-offs what’s the latency

    cost of consistency? what’s the consistency cost of latency?
  325. PBS allows us to quantify latency-consistency trade-offs what’s the latency

    cost of consistency? what’s the consistency cost of latency? an “SLA” for consistency