Probabilistically Bounded Staleness for Practical Partial Quorums

pbailis
August 28, 2012

Transcript

  1. Peter Bailis, Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica. VLDB 2012, UC Berkeley. Probabilistically Bounded Staleness for Practical Partial Quorums (PBS)
  2. R+W

  3. Our contributions: quantify eventual consistency in wall-clock time (“how eventual?”) and in versions (“how consistent?”); analyze real-world systems, where EC is often strongly consistent, and describe when and why.
  4. Adobe, Cisco, Digg, Gowalla, IBM, Morningstar, Netflix, Palantir, Rackspace, Reddit, Rhapsody, Shazam, Spotify, Soundcloud, Twitter, Mozilla, Ask.com, Yammer, Aol, GitHub, JoyentCloud, Best Buy, LinkedIn, Boeing, Comcast, Gilt Groupe: users of Cassandra, Riak, and Voldemort.
  5. R+W > N: “strong” (per-key) consistency, a regular register: reads return the last acknowledged write or an in-flight write.
  6. Latency, LinkedIn disk-based model, N=3. Read latency multipliers by R:
     R    99th     99.9th
     1    1x       1x
     2    1.59x    2.35x
     3    4.8x     6.13x

  7. Latency, LinkedIn disk-based model, N=3. Read (R) and write (W) latency multipliers:
     R    99th     99.9th
     1    1x       1x
     2    1.59x    2.35x
     3    4.8x     6.13x
     W    99th     99.9th
     1    1x       1x
     2    2.01x    1.9x
     3    4.96x    14.96x
  8. ⇧ consistency, ⇧ latency: wait for more replicas, read more recent data. ⇩ consistency, ⇩ latency: wait for fewer replicas, read less recent data.

  9. (same as slide 8)
  10. eventual consistency: “if no new updates are made to the object, eventually all accesses will return the last updated value” (W. Vogels, CACM 2008). R+W ≤ N
  11. anecdotally, EC is “good enough” for many kinds of data. How eventual? How consistent? “eventual and consistent enough”
  12. How eventual? t-visibility: probability p of consistent reads after t seconds (e.g., 10 ms after a write, 99.9% of reads are consistent).
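A compact way to write the same definition (notation ours, not from the deck): for a write that commits at time $t_0$,

$$\Pr\big(\text{a read started at time } t_0 + t \text{ returns that write or a newer one}\big) \;\ge\; p,$$

which the slide's example instantiates with $t = 10\,\mathrm{ms}$ and $p = 0.999$.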
  13. Coordinator / Replica timeline: write, ack; wait for W responses; t seconds elapse; read (once per replica).

  14. (build: adds the replica’s read response)

  15. (build: adds “wait for R responses”)

  16. (build: adds “a response is stale if the read arrives before the write”)

  17. (same diagram as slide 16)

  18. (same diagram as slide 16)

  19. (same diagram as slide 16)
  20. W=1, R=1, N=2 example: Alice writes (write, ack), Bob reads (read, response); replicas R1 and R2; the read is inconsistent.

  21. (same diagram as slide 20)

  22. (same diagram as slide 20, next animation step)
  23. Coordinator / Replica timeline (as in slide 16): write, ack, read, response; wait for W responses; t seconds elapse; wait for R responses; a response is stale if the read arrives before the write; once per replica.

  24. (build: labels the write-propagation delay (W))

  25. (build: adds the ack delay (A))

  26. (build: adds the read-propagation delay (R))

  27. (build: adds the read-response delay (S))
  28. To use WARS: gather latency data (W: 53.2, 44.5, 101.1, ...; A: 10.3, 8.2, 11.3, ...; R: 15.3, 22.4, 19.8, ...; S: 9.6, 14.2, 6.7, ...), then run a Monte Carlo simulation by sampling.

  29. (build: samples W = 44.5)

  30. (build: samples A = 11.3)

  31. (build: samples R = 15.3)

  32. (build: samples S = 14.2)
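The deck stops at “run simulation, Monte Carlo, sampling,” so here is a rough sketch of that procedure in Python. The function name, arguments, and the exact rule for when the read coordinator sees a fresh value are our assumptions for illustration, not the authors' reference implementation:

```python
import random

def estimate_staleness(w_lat, a_lat, r_lat, s_lat,
                       n=3, w=1, r=1, t=0.0, trials=100_000):
    """Estimate Pr[a read started t ms after a write commits returns stale data]
    under a WARS-style model: per-replica write (W), ack (A), read (R), and
    read-response (S) delays are drawn from measured latency samples."""
    stale = 0
    for _ in range(trials):
        ws = [random.choice(w_lat) for _ in range(n)]    # write reaches replica i at ws[i]
        acks = [random.choice(a_lat) for _ in range(n)]  # ack travels back in acks[i]
        rs = [random.choice(r_lat) for _ in range(n)]    # read reaches replica i at commit + t + rs[i]
        ss = [random.choice(s_lat) for _ in range(n)]    # read response travels back in ss[i]

        # The write commits once the w-th fastest ack has arrived.
        commit = sorted(wi + ai for wi, ai in zip(ws, acks))[w - 1]

        # For each replica: when its read response arrives, and whether it saw the write.
        responses = []
        for wi, ri, si in zip(ws, rs, ss):
            read_arrives = commit + t + ri
            fresh = wi <= read_arrives       # stale if the read beat the write to this replica
            responses.append((read_arrives + si, fresh))

        # The read coordinator waits for the r fastest responses;
        # the read is consistent if any of them is fresh.
        if not any(fresh for _, fresh in sorted(responses)[:r]):
            stale += 1
    return stale / trials
```

For example, feeding it the handful of samples shown on the slide (real runs would use full latency traces): `estimate_staleness([53.2, 44.5, 101.1], [10.3, 8.2, 11.3], [15.3, 22.4, 19.8], [9.6, 14.2, 6.7], t=10.0)`.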
  33. Yammer (100K+ companies) uses Riak; LinkedIn (175M+ users) built and uses Voldemort. Production latencies fit Gaussian mixtures.
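The deck does not say which tooling was used to fit those mixtures; as one illustration of the “latencies fit Gaussian mixtures” point, a latency trace could be fit with scikit-learn roughly like this (the file name, the two-component choice, and the millisecond units are assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# One measured operation latency (ms) per line; hypothetical input file.
latencies_ms = np.loadtxt("latencies_ms.txt").reshape(-1, 1)

# Fit a small Gaussian mixture; in practice the component count
# would be chosen by a criterion such as BIC.
gmm = GaussianMixture(n_components=2, random_state=0).fit(latencies_ms)

print("weights:      ", gmm.weights_)
print("means (ms):   ", gmm.means_.ravel())
print("std devs (ms):", np.sqrt(gmm.covariances_).ravel())
```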
  34. N=3

  35. Latency is combined read and write latency at the 99.9th percentile. LNKD-DISK, N=3. R=3, W=1, 100% consistent: latency 15.01 ms. R=2, W=1, t = 13.6 ms, 99.9% consistent: latency 12.53 ms.

  36. (as slide 35, noting: 16.5% faster)

  37. (as slide 36, asking: worthwhile?)
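The 16.5% figure follows directly from the two latencies above: (15.01 - 12.53) / 15.01 ≈ 0.165, i.e., about 16.5% lower combined latency in exchange for a 0.1% chance of staleness at t = 13.6 ms.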
  38. N=3

  39. N=3

  40. N=3

  41. Latency is combined read and write latency at the 99.9th percentile. LNKD-SSD, N=3. R=3, W=1, 100% consistent: latency 4.20 ms. R=1, W=1, t = 1.85 ms, 99.9% consistent: latency 1.32 ms.

  42. (as slide 41, noting: 59.5% faster)
  43. [CDF plots of write latency (ms) for W=1, W=2, and W=3, comparing LNKD-SSD and LNKD-DISK; N=3]

  44. (same plots as slide 43)

  45. (same plots as slide 43)
  46. The Coordinator / Replica WARS timeline again (as in slide 27), with the takeaway: SSDs reduce variance compared to disks!
  47. Related Work. Quorum Systems: • probabilistic quorums [PODC ’97] • deterministic k-quorums [DISC ’05, ’06]. Consistency Verification: • Golab et al. [PODC ’11] • Bermbach and Tai [M4WSOC ’11] • Wada et al. [CIDR ’11] • Anderson et al. [HotDep ’10] • transactional consistency: Zellag and Kemme [ICDE ’11], Fekete et al. [VLDB ’09]. Latency-Consistency: • Daniel Abadi [IEEE Computer ’12] • Kraska et al. [VLDB ’09]. Bounded Staleness Guarantees: • TACT [OSDI ’00] • FRACS [ICDCS ’03] • AQuA [IEEE TPDS ’03]
  48. PBS: quantify eventual consistency; model staleness in time and in versions; latency-consistency trade-offs; analyze real systems and hardware; quantify which choice is best, and explain why EC is often strongly consistent.

  49. (as slide 48, plus:) pbs.cs.berkeley.edu
  50. Non-expanding Quorum Systems, e.g., probabilistic quorums (PODC ’97), deterministic k-quorums (DISC ’05, ’06). Bounded Staleness Guarantees, e.g., TACT (OSDI ’00), FRACS (ICDCS ’03).

  51. Consistency Verification, e.g., Golab et al. (PODC ’11), Bermbach and Tai (M4WSOC ’11), Wada et al. (CIDR ’11). Latency-Consistency: Daniel Abadi (IEEE Computer ’12).
  52. Tolerating staleness requires either: staleness-tolerant data structures (timelines, logs; cf. commutative data structures, logical monotonicity) or asynchronous compensation code (detect violations after data is returned, see paper; cf. “Building on Quicksand”: memories, guesses, apologies; write code to fix any errors).
  53. Read only newer data? (the monotonic reads session guarantee) Tolerable staleness in versions for a given key = global write rate ÷ client’s read rate.
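A worked example with hypothetical rates: if a key is written 10 times per second globally and a particular client reads it once per second, about 10 writes land between that client’s consecutive reads, so returning anything within the last 10 versions (10 writes/s ÷ 1 read/s) still preserves monotonic reads for that client.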
  54. What time interval? 99.9% uptime/yr ⇒ 8.76 hours downtime/yr; 8.76 consecutive hours down ⇒ bad; an 8-hour rolling average?

  55. (as slide 54, plus:) hide it in the tail of the distribution OR continuously evaluate the SLA and adjust.
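The 8.76-hour figure is straightforward arithmetic on the SLA: a year has 365 × 24 = 8,760 hours, and the 0.1% of it allowed to be unavailable at 99.9% uptime is 8,760 × 0.001 = 8.76 hours.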
  56. [CDF plots of write latency (ms) for W=1, W=2, and W=3, comparing LNKD-SSD, LNKD-DISK, YMMR, and WAN; N=3]

  57. [CDF plots of read latency (ms) for R=1, R=2, and R=3 over the same configurations; N=3; LNKD-SSD and LNKD-DISK are identical for reads]
  58. “Strong” consistency: reads return the last written value or newer (defined w.r.t. real time, when the read started).
  59. N = 3 replicas: write to W replicas, read from R replicas. R=W=3: quorum {R1, R2, R3}. R=W=2: quorums {R1, R2}, {R2, R3}, {R1, R3}. Quorum system: guaranteed intersection.

  60. (as slide 59, plus:) R=W=1: quorums {R1}, {R2}, {R3}. Partial quorum system: may not intersect.
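To make the distinction concrete, a small self-contained check (ours, not from the deck) can enumerate every read/write quorum pair for N = 3 and report whether they always overlap; intersection is guaranteed exactly when R + W > N, the condition from slide 5:

```python
from itertools import combinations

def always_intersect(n: int, r: int, w: int) -> bool:
    """True iff every size-r read quorum overlaps every size-w write quorum."""
    replicas = range(n)
    return all(set(rq) & set(wq)
               for rq in combinations(replicas, r)
               for wq in combinations(replicas, w))

for r, w in [(3, 3), (2, 2), (1, 1)]:
    verdict = "guaranteed intersection" if always_intersect(3, r, w) else "may not intersect"
    print(f"N=3, R={r}, W={w}: {verdict}")
```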
  61. N = 3 replicas, R=3 read: client, coordinator, and replicas R1, R2, R3, each storing (“key”, 1).

  62. (build: the client issues read(“key”))

  63. (build: the read(“key”) request reaches the coordinator)

  64. (build: R1, R2, and R3 each respond with (“key”, 1))

  65. (build: the coordinator returns (“key”, 1) to the client)

  66. N = 3 replicas, R=3 read, second walkthrough: client, coordinator, replicas R1, R2, R3 with (“key”, 1).

  67. (build: the client issues read(“key”))

  68. (same as slide 67)

  69. (build: the first replica response (“key”, 1) arrives at the coordinator)

  70. (build: the second response arrives)

  71. (build: the third response arrives)

  72. (build: the coordinator returns (“key”, 1) to the client)
  73. N = 3 replicas, R=1 read: client, coordinator, replicas R1, R2, R3 with (“key”, 1).

  74. (build: the client issues read(“key”))

  75. (build: the coordinator sends the read to all replicas)

  76. (build: the first replica response (“key”, 1) arrives)

  77. (build: the second response arrives)

  78. (build: the third response arrives)

  79. (build: the coordinator returns (“key”, 1) to the client after the first response, since R=1)
  80. W=1 write racing an R=1 read: a write coordinator sends (“key”, 2) and receives ack(“key”, 2) from one replica, while R3 still holds (“key”, 1) and a read coordinator issues an R=1 read.

  81. (build: further acks and propagation of (“key”, 2))

  82. (build animation step of the same write/read race)

  83. (build animation step of the same write/read race)

  84. (build animation step of the same write/read race)

  85. (build animation step of the same write/read race)
  86. “In the general case, we typically use [Cassandra’s] consistency level of [R=W=1], which provides maximum performance. Nice!” D. Williams, “HBase vs Cassandra: why we moved”, February 2010. http://ria101.wordpress.com/2010/02/24/hbase-vs-cassandra-why-we-moved/
  87. The probability of reading data older than k versions drops exponentially with k: Pr(reading the latest write) = 99%; Pr(reading one of the last two writes) = 99.9%; Pr(reading one of the last three writes) = 99.99%.
  88. “Strong” consistency: reads return the last written value or newer (defined w.r.t. real time, when the read started).
  89. A write coordinator issues write(“key”, 2) with W=1 and receives ack(“key”, 2); R2 and R3 still hold (“key”, 1); an R=1 read coordinator gets (“key”, 1) back: R3 replied before the last write arrived!
  90. Latency is combined read and write latency at the 99.9th percentile. LNKD-SSD, N=3. 100% consistent reads: R=3, W=1, latency 4.20 ms. 99.9% consistent reads: R=1, W=1, t = 1.85 ms, latency 1.32 ms.

  91. (as slide 90, noting: 59.5% faster)
  92. Latency is combined read and write latency at the 99.9th percentile. YMMR, N=3. 100% consistent reads: R=3, W=1, latency 230.06 ms. 99.9% consistent reads: R=1, W=1, t = 202.0 ms, latency 43.3 ms.

  93. (as slide 92, noting: 81.1% faster)
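As with the LNKD-DISK example, the speedup is just the ratio of the two latencies: (230.06 - 43.3) / 230.06 ≈ 0.81, in line with the 81.1% figure.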
  94. R+W

  95. N=3

  96. N=3

  97. N=3

  98. N = 3 replicas: write to W replicas, read from R replicas (as in slide 59). R=W=3: {R1, R2, R3}; R=W=2: {R1, R2}, {R2, R3}, {R1, R3}. Quorum system: guaranteed intersection.

  99. (as slide 98, plus:) R=W=1: {R1}, {R2}, {R3}. Partial quorum system: may not intersect.
  100. Coordinator / Replica timeline (as in slide 13): write, ack; wait for W responses; t seconds elapse; read (once per replica).

  101. (build: adds the replica’s read response)

  102. (build: adds “wait for R responses”)

  103. (same diagram as slide 102)

  104. (same diagram as slide 102)

  105. (build: adds “a response is stale if the read arrives before the write”)

  106. (same diagram as slide 105)
  107. N=3, R=W=2 quorum system: a 3×3 grid of write-quorum / read-quorum pairs, every cell marked Y (all pairs intersect).

  108. (same grid as slide 107)

  109. (as slide 107, labeled: guaranteed intersection)

  110. N=3, R=W=1 partial quorum system: a 3×3 grid with Y on the diagonal and N elsewhere (only matching single-replica quorums intersect).

  111. (same grid as slide 110)

  112. (same grid as slide 110)

  113. (same grid as slide 110)

  114. (as slide 110, labeled: probabilistic intersection)
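For the R=W=1 grid, the chance that independently chosen quorums happen to overlap can be written combinatorially (standard probabilistic-quorum reasoning, not a formula shown on the slide). With quorums chosen uniformly at random and ignoring write propagation,

$$\Pr(\text{quorums intersect}) = 1 - \frac{\binom{N-W}{R}}{\binom{N}{R}},$$

which for N = 3, R = W = 1 gives 1 - 2/3 = 1/3: exactly the three diagonal Y cells out of the nine in the grid.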
  115. PBS allows us to quantify latency-consistency trade-offs: what’s the latency cost of consistency? what’s the consistency cost of latency?

  116. (as slide 115, plus:) an “SLA” for consistency.