Probabilistically Bounded Staleness for Practical Partial Quorums

pbailis

August 28, 2012
Transcript

  1. Peter Bailis, Shivaram Venkataraman,
    Mike Franklin, Joe Hellerstein, Ion Stoica
    PBS

    View Slide

  2. Peter Bailis, Shivaram Venkataraman,
    Mike Franklin, Joe Hellerstein, Ion Stoica
    VLDB
    2012
    UC Berkeley
    Probabilistically Bounded Staleness
    for Practical Partial Quorums
    PBS

    View Slide

  3. R+W

    View Slide

  4. R+W
    strong
    consistency

    View Slide

  5. R+W
    strong
    consistency
    eventual
    consistency

    View Slide

  6. R+W
    strong
    consistency
    higher
    latency
    eventual
    consistency

    View Slide

  7. R+W
    strong
    consistency
    higher
    latency
    eventual
    consistency
    lower
    latency

    View Slide

  8. R+W
    strong
    consistency
    higher
    latency
    eventual
    consistency
    lower
    latency

    View Slide

  9. consistency
    is a choice
    binary

    View Slide

  10. consistency
    is a choice
    binary
    strong eventual

    View Slide

  11. consistency
    continuum
    is a
    strong eventual

    View Slide

  12. consistency
    continuum
    is a
    strong eventual

    View Slide

  13. consistency
    continuum
    is a
    strong eventual

    View Slide

  14. latency vs.
    consistency
    our focus:

    View Slide

  15. latency vs.
    consistency
    our focus:

    View Slide

  16. latency vs.
    consistency
    informed by practice
    our focus:

    View Slide

  17. latency vs.
    consistency
    informed by practice
    our focus:
    availability, partitions,
    failures
    not in this talk:

    View Slide

  18. quantify eventual consistency:
    wall-clock time (“how eventual?”)
    versions (“how consistent?”)
    analyze real-world systems:
    EC is often strongly consistent
    describe when and why
    our contributions

    View Slide

  19. intro
    system model
    practice
    metrics
    insights
    integration

    View Slide

  20. Dynamo:
    Amazon’s Highly Available Key-value Store
    SOSP 2007

    View Slide

  21. Apache, DataStax
    Project Voldemort
    Dynamo:
    Amazon’s Highly Available Key-value Store
    SOSP 2007

    View Slide

  22. Adobe
    Cisco
    Digg
    Gowalla
    IBM
    Morningstar
    Netflix
    Palantir
    Rackspace
    Reddit
    Rhapsody
    Shazam
    Spotify
    Soundcloud
    Twitter
    Mozilla
    Ask.com
    Yammer
    Aol
    GitHub
    JoyentCloud
    Best Buy
    LinkedIn
    Boeing
    Comcast
    Cassandra
    Riak
    Voldemort
    Gilt Groupe

    View Slide

  23. N replicas/key
    read: wait for R replies
    write: wait for W acks

    View Slide

  24. N replicas/key
    read: wait for R replies
    write: wait for W acks

    View Slide

  25. N replicas/key
    read: wait for R replies
    write: wait for W acks
    N=3

    View Slide

  26. N replicas/key
    read: wait for R replies
    write: wait for W acks
    N=3

    View Slide

  27. N replicas/key
    read: wait for R replies
    write: wait for W acks
    N=3

    View Slide

  28. N replicas/key
    read: wait for R replies
    write: wait for W acks
    N=3
    R=2

    View Slide

  29. N replicas/key
    read: wait for R replies
    write: wait for W acks
    N=3
    R=2

    View Slide

  30. if:
    R+W > N
    then:
    “strong”
    consistency
    else:
    eventual
    consistency

    View Slide

  31. “strong” consistency
    (regular register)
    R+W > N:
    reads return the last
    acknowledged write or an
    in-flight write (per-key)

    View Slide
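A minimal sketch of the R+W > N overlap check (illustrative Python; the function name is ours, not from the deck):

    def guarantees_intersection(n, r, w):
        # Any write quorum of size W and read quorum of size R drawn from N
        # replicas must share at least one replica when R + W > N (pigeonhole).
        return r + w > n

    assert guarantees_intersection(3, 2, 2)       # quorum system: "strong" reads
    assert not guarantees_intersection(3, 1, 1)   # partial quorums: eventual consistency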

  32. Latency
    LinkedIn
    disk-based
    model
    N=3

    View Slide

  33. Latency: LinkedIn disk-based model, N=3
    R    99th    99.9th
    1    1x      1x
    2    1.59x   2.35x
    3    4.8x    6.13x

    View Slide

  34. Latency: LinkedIn disk-based model, N=3
    R    99th    99.9th
    1    1x      1x
    2    1.59x   2.35x
    3    4.8x    6.13x
    W    99th    99.9th
    1    1x      1x
    2    2.01x   1.9x
    3    4.96x   14.96x

    View Slide

  35. ⇧ consistency, ⇧ latency:
    wait for more replicas,
    read more recent data
    ⇩ consistency, ⇩ latency:
    wait for fewer replicas,
    read less recent data

    View Slide

  36. ⇧ consistency, ⇧ latency:
    wait for more replicas,
    read more recent data
    ⇩ consistency, ⇩ latency:
    wait for fewer replicas,
    read less recent data

    View Slide

  37. eventual
    consistency
    “if no new updates are
    made to the object,
    eventually all accesses
    will return the last
    updated value”
    W. Vogels, CACM 2008
    R+W ≤ N

    View Slide

  38. How eventual?
    How long do I have to wait?

    View Slide

  39. How consistent?
    What happens if I don’t wait?

    View Slide

  40. R+W
    strong
    consistency
    higher
    latency
    eventual
    consistency
    lower
    latency

    View Slide

  41. R+W
    strong
    consistency
    higher
    latency
    eventual
    consistency
    lower
    latency

    View Slide

  42. R+W
    strong
    consistency
    higher
    latency
    eventual
    consistency
    lower
    latency

    View Slide

  43. intro
    system model
    practice
    metrics
    insights
    integration

    View Slide

  44. Cassandra:
    R=W=1, N=3
    by default
    (1+1 ≯ 3)

    View Slide

  45. eventual consistency
    “maximum
    performance”
    “very low
    latency”
    okay for
    “most data”
    “general
    case”
    in the wild

    View Slide

  46. anecdotally, EC
    “good enough” for
    many kinds of data

    View Slide

  47. anecdotally, EC
    “good enough” for
    many kinds of data
    How eventual?
    How consistent?

    View Slide

  48. anecdotally, EC
    “good enough” for
    many kinds of data
    How eventual?
    How consistent?
    “eventual and consistent enough”

    View Slide

  49. Can we do better?

    View Slide

  50. can’t make promises
    can give expectations
    Can we do better?

    View Slide

  51. Probabilistically
    Bounded Staleness
    can’t make promises
    can give expectations
    Can we do better?

    View Slide

  52. intro
    system model
    practice
    metrics
    insights
    integration

    View Slide

  53. How eventual?
    How long do I have to wait?

    View Slide

  54. How eventual?

    View Slide

  55. t-visibility: probability p
    of consistent reads after
    t seconds
    (e.g., 10ms after write, 99.9% of reads consistent)
    How eventual?

    View Slide

  56. t-visibility depends on
    messaging and
    processing delays

    View Slide

  57. Coordinator Replica
    once per replica T
    i
    m
    e

    View Slide

  58. Coordinator Replica
    write
    once per replica T
    i
    m
    e

    View Slide

  59. Coordinator Replica
    write
    ack
    once per replica T
    i
    m
    e

    View Slide

  60. Coordinator Replica
    write
    ack
    wait for W
    responses
    once per replica T
    i
    m
    e

    View Slide

  61. Coordinator Replica
    write
    ack
    wait for W
    responses
    t seconds elapse
    once per replica T
    i
    m
    e

    View Slide

  62. Coordinator Replica
    write
    ack
    read
    wait for W
    responses
    t seconds elapse
    once per replica T
    i
    m
    e

    View Slide

  63. Coordinator Replica
    write
    ack
    read
    response
    wait for W
    responses
    t seconds elapse
    once per replica T
    i
    m
    e

    View Slide

  64. Coordinator Replica
    write
    ack
    read
    response
    wait for W
    responses
    t seconds elapse
    wait for R
    responses
    once per replica T
    i
    m
    e

    View Slide

  65. Coordinator Replica
    write
    ack
    read
    response
    wait for W
    responses
    t seconds elapse
    wait for R
    responses
    response is
    stale
    if read arrives
    before write
    once per replica T
    i
    m
    e

    View Slide

  66. Coordinator Replica
    write
    ack
    read
    response
    wait for W
    responses
    t seconds elapse
    wait for R
    responses
    response is
    stale
    if read arrives
    before write
    once per replica T
    i
    m
    e

    View Slide

  67. Coordinator Replica
    write
    ack
    read
    response
    wait for W
    responses
    t seconds elapse
    wait for R
    responses
    response is
    stale
    if read arrives
    before write
    once per replica T
    i
    m
    e

    View Slide

  68. Coordinator Replica
    write
    ack
    read
    response
    wait for W
    responses
    t seconds elapse
    wait for R
    responses
    response is
    stale
    if read arrives
    before write
    once per replica T
    i
    m
    e

    View Slide

  69. R1
    N=2 T
    i
    m
    e
    Alice
    R2

    View Slide

  70. R1
    write
    N=2 T
    i
    m
    e
    Alice
    R2

    View Slide

  71. write
    ack
    N=2 T
    i
    m
    e
    Alice
    R2
    R1

    View Slide

  72. write
    ack
    W=1
    N=2 T
    i
    m
    e
    Alice
    R2
    R1

    View Slide

  73. write
    ack
    W=1
    N=2 T
    i
    m
    e
    Alice
    R2
    R1

    View Slide

  74. write
    ack
    W=1
    N=2 T
    i
    m
    e
    Alice
    Bob
    R2
    R1

    View Slide

  75. write
    ack
    read
    W=1
    N=2 T
    i
    m
    e
    Alice
    Bob
    R2
    R1

    View Slide

  76. R2
    write
    ack
    read
    W=1
    N=2 T
    i
    m
    e
    Alice
    Bob
    R2
    R1

    View Slide

  77. R2
    write
    ack
    read
    W=1
    N=2
    response
    T
    i
    m
    e
    Alice
    Bob
    R2
    R1

    View Slide

  78. R2
    write
    ack
    read
    W=1
    R=1
    N=2
    response
    T
    i
    m
    e
    Alice
    Bob
    R2
    R1

    View Slide

  79. R2
    write
    ack
    read
    W=1
    R=1
    N=2
    response
    T
    i
    m
    e
    Alice
    Bob
    R2
    R1
    inconsistent

    View Slide

  80. R2
    write
    ack
    read
    W=1
    R=1
    N=2
    response
    T
    i
    m
    e
    Alice
    Bob
    R2
    R1
    inconsistent

    View Slide

  81. R2
    write
    ack
    read
    W=1
    R=1
    N=2
    response
    T
    i
    m
    e
    Alice
    Bob
    R2
    R1
    R2
    inconsistent

    View Slide

  82. write
    ack
    read
    response
    wait for W
    responses
    t seconds elapse
    wait for R
    responses
    response is
    stale
    if read arrives
    before write
    once per replica
    Coordinator Replica T
    i
    m
    e

    View Slide

  83. (W)
    write
    ack
    read
    response
    wait for W
    responses
    t seconds elapse
    wait for R
    responses
    response is
    stale
    if read arrives
    before write
    once per replica
    Coordinator Replica T
    i
    m
    e

    View Slide

  84. (W)
    write
    ack
    read
    response
    wait for W
    responses
    t seconds elapse
    wait for R
    responses
    response is
    stale
    if read arrives
    before write
    once per replica
    (A)
    Coordinator Replica T
    i
    m
    e

    View Slide

  85. (R)
    (W)
    write
    ack
    read
    response
    wait for W
    responses
    t seconds elapse
    wait for R
    responses
    response is
    stale
    if read arrives
    before write
    once per replica
    (A)
    Coordinator Replica T
    i
    m
    e

    View Slide

  86. (R)
    (W)
    write
    ack
    read
    response
    wait for W
    responses
    t seconds elapse
    wait for R
    responses
    response is
    stale
    if read arrives
    before write
    once per replica
    (A)
    (S)
    Coordinator Replica T
    i
    m
    e

    View Slide

  87. solving WARS analytically:
    order statistics over
    dependent variables
    instead:
    Monte Carlo methods

    View Slide

  88. to use WARS:
    W
    53.2
    44.5
    101.1
    ...
    A
    10.3
    8.2
    11.3
    ...
    R
    15.3
    22.4
    19.8
    ...
    S
    9.6
    14.2
    6.7
    ...
    run simulation
    Monte Carlo, sampling
    gather latency data

    View Slide

  89. to use WARS:
    W
    53.2
    44.5
    101.1
    ...
    A
    10.3
    8.2
    11.3
    ...
    R
    15.3
    22.4
    19.8
    ...
    S
    9.6
    14.2
    6.7
    ...
    run simulation
    Monte Carlo, sampling
    gather latency data
    44.5

    View Slide

  90. to use WARS:
    W
    53.2
    44.5
    101.1
    ...
    A
    10.3
    8.2
    11.3
    ...
    R
    15.3
    22.4
    19.8
    ...
    S
    9.6
    14.2
    6.7
    ...
    run simulation
    Monte Carlo, sampling
    gather latency data
    44.5
    11.3

    View Slide

  91. to use WARS:
    W
    53.2
    44.5
    101.1
    ...
    A
    10.3
    8.2
    11.3
    ...
    R
    15.3
    22.4
    19.8
    ...
    S
    9.6
    14.2
    6.7
    ...
    run simulation
    Monte Carlo, sampling
    gather latency data
    44.5
    11.3
    15.3

    View Slide

  92. to use WARS:
    W
    53.2
    44.5
    101.1
    ...
    A
    10.3
    8.2
    11.3
    ...
    R
    15.3
    22.4
    19.8
    ...
    S
    9.6
    14.2
    6.7
    ...
    run simulation
    Monte Carlo, sampling
    gather latency data
    44.5
    11.3
    15.3
    14.2

    View Slide
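An illustrative Monte Carlo sketch of the WARS sampling above (plain Python; the function name, exponential latency distribution, and units are assumptions for illustration, not the paper's code):

    import random

    def simulate_pbs(n, r, w, t, iterations=100_000,
                     sample=lambda: random.expovariate(1 / 5.0)):
        # W_i: write coordinator->replica, A_i: ack replica->coordinator,
        # R_i: read coordinator->replica,  S_i: response replica->coordinator.
        consistent = 0
        for _ in range(iterations):
            Ws = [sample() for _ in range(n)]
            As = [sample() for _ in range(n)]
            Rs = [sample() for _ in range(n)]
            Ss = [sample() for _ in range(n)]
            # write commits once the W-th ack arrives
            commit = sorted(Ws[i] + As[i] for i in range(n))[w - 1]
            # the read, issued t ms later, waits for the R fastest responders
            responders = sorted(range(n), key=lambda i: Rs[i] + Ss[i])[:r]
            # consistent if any responder received the write before the read arrived
            if any(Ws[i] <= commit + t + Rs[i] for i in responders):
                consistent += 1
        return consistent / iterations

    # e.g., estimated probability of a consistent read 10 ms after a write, N=3, R=W=1
    print(simulate_pbs(n=3, r=1, w=1, t=10.0))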

  93. real Cassandra cluster
    varying latencies:
    t-visibility RMSE: 0.28%
    latency N-RMSE: 0.48%
    WARS accuracy

    View Slide

  94. How eventual?
    key: WARS model
    need: latencies
    t-visibility: consistent
    reads with probability p
    after t seconds

    View Slide

  95. intro
    system model
    practice
    metrics
    insights
    integration

    View Slide

  96. Yammer
    100K+ companies
    uses Riak
    LinkedIn
    175M+ users
    built and uses Voldemort
    production latencies
    fit Gaussian mixtures

    View Slide
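As a sketch of the "fit Gaussian mixtures" step (scikit-learn assumed; the file name and component count are hypothetical):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # hypothetical trace of per-message write latencies, one value (ms) per line
    latencies = np.loadtxt("write_latencies_ms.txt").reshape(-1, 1)
    gmm = GaussianMixture(n_components=2).fit(latencies)
    samples, _ = gmm.sample(100_000)   # resampled latencies to feed the WARS simulation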

  97. N=3

    View Slide

  98. 10 ms
    N=3

    View Slide

  99. Latency is combined read and write latency at 99.9th percentile
    R=3, W=1
    100% consistent:
    Latency: 15.01 ms
    LNKD-DISK
    N=3
    R=2, W=1, t =13.6 ms
    99.9% consistent:
    Latency: 12.53 ms

    View Slide

  100. LNKD-DISK, N=3
    100% consistent: R=3, W=1, latency 15.01 ms
    99.9% consistent: R=2, W=1, t=13.6 ms, latency 12.53 ms
    (16.5% faster)
    Latency is combined read and write latency at the 99.9th percentile

    View Slide

  101. LNKD-DISK, N=3
    100% consistent: R=3, W=1, latency 15.01 ms
    99.9% consistent: R=2, W=1, t=13.6 ms, latency 12.53 ms
    (16.5% faster; worthwhile?)
    Latency is combined read and write latency at the 99.9th percentile

    View Slide

  102. N=3

    View Slide

  103. N=3

    View Slide

  104. N=3

    View Slide

  105. Latency is combined read and write latency at 99.9th percentile
    R=3, W=1
    100% consistent:
    Latency: 4.20 ms
    LNKD-SSD
    N=3
    R=1, W=1, t = 1.85 ms
    99.9% consistent:
    Latency: 1.32 ms

    View Slide

  106. LNKD-SSD, N=3
    100% consistent: R=3, W=1, latency 4.20 ms
    99.9% consistent: R=1, W=1, t=1.85 ms, latency 1.32 ms
    (59.5% faster)
    Latency is combined read and write latency at the 99.9th percentile

    View Slide

  107. [Figure: write-latency CDFs (ms, log scale) for W=1, W=2, W=3; LNKD-SSD vs. LNKD-DISK; N=3]

    View Slide


  110. Coordinator Replica
    write
    ack
    (A)
    (W)
    response
    (S)
    (R)
    wait for W
    responses
    t seconds elapse
    wait for R
    responses
    response is
    stale
    if read arrives
    before write
    once per replica
    SSDs reduce
    variance
    compared to
    disks!
    read

    View Slide

  111. Yammer
    latency: 81.1% lower (187 ms)
    t-visibility: 202 ms
    99.9th percentile
    N=3

    View Slide

  112. k-staleness (versions)
    How consistent?
    monotonic reads
    quorum load
    in the paper

    View Slide

  113. in the paper
    ⟨k, t⟩-staleness:
    versions and time

    View Slide

  114. latency distributions
    WAN model
    varying quorum sizes
    staleness detection
    in the paper

    View Slide

  115. intro
    system model
    practice
    metrics
    insights
    integration

    View Slide

  116. 1. Tracing
    2. Simulation
    3. Tune N,R,W
    Integration
    Project Voldemort

    View Slide
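A hypothetical tuning sweep on top of the simulate_pbs sketch from earlier (quorum sizes only; it ignores the latency side of the trade-off):

    target, t = 0.999, 10.0            # 99.9% consistent reads, 10 ms after write
    candidates = [(r, w) for r in (1, 2, 3) for w in (1, 2, 3)]
    ok = [(r, w) for (r, w) in candidates if simulate_pbs(3, r, w, t) >= target]
    print(min(ok, key=lambda rw: rw[0] + rw[1]))   # smallest quorums meeting the target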

  117. https://issues.apache.org/jira/browse/CASSANDRA-4261

    View Slide


  120. Related Work
    Quorum Systems
    • probabilistic quorums [PODC ’97]
    • deterministic k-quorums [DISC ’05, ’06]
    Consistency Verification
    • Golab et al. [PODC ’11]
    • Bermbach and Tai [M4WSOC ’11]
    • Wada et al. [CIDR ’11]
    • Anderson et al. [HotDep ’10]
    • Transactional consistency:
    Zellag and Kemme [ICDE ’11],
    Fekete et al. [VLDB ’09]
    Latency-Consistency
    • Daniel Abadi [Computer ’12]
    • Kraska et al. [VLDB ’09]
    Bounded Staleness
    Guarantees
    • TACT [OSDI ’00]
    • FRACS [ICDCS ’03]
    • AQuA [IEEE TPDS ’03]

    View Slide

  121. R+W
    strong
    consistency
    higher
    latency
    eventual
    consistency
    lower
    latency

    View Slide

  122. consistency
    is a

    View Slide

  123. consistency
    continuum
    is a

    View Slide

  124. consistency
    continuum
    is a
    strong eventual

    View Slide

  125. consistency
    continuum
    is a
    strong eventual

    View Slide

  126. quantify eventual consistency
    model staleness in time, versions
    latency-consistency trade-offs
    analyze real systems and hardware
    PBS

    View Slide

  127. quantify eventual consistency
    model staleness in time, versions
    latency-consistency trade-offs
    analyze real systems and hardware
    PBS
    quantify which choice is best and explain
    why EC is often strongly consistent

    View Slide

  128. quantify eventual consistency
    model staleness in time, versions
    latency-consistency trade-offs
    analyze real systems and hardware
    pbs.cs.berkeley.edu
    PBS
    quantify which choice is best and explain
    why EC is often strongly consistent

    View Slide

  129. Extra Slides

    View Slide

  130. Non-expanding Quorum Systems
    e.g., probabilistic quorums (PODC ’97)
    deterministic k-quorums (DISC ’05, ’06)
    Bounded Staleness Guarantees
    e.g., TACT (OSDI ’00), FRACS (ICDCS ’03)

    View Slide

  131. Consistency Verification
    e.g., Golab et al. (PODC ’11),
    Bermbach and Tai (M4WSOC ’11),
    Wada et al. (CIDR ’11)
    Latency-Consistency
    Daniel Abadi (IEEE Computer ’12)

    View Slide

  132. PBS
    and
    apps

    View Slide

  133. tolerating staleness requires
    either:
    staleness-tolerant data structures
    timelines, logs
    cf. commutative data structures
    logical monotonicity
    asynchronous compensation code
    detect violations after data is returned; see paper
    cf. “Building on Quicksand”
    memories, guesses, apologies
    write code to fix any errors

    View Slide

  134. minimize:
    (compensation cost)×(# of expected anomalies)
    asynchronous
    compensation

    View Slide

  135. Read only newer data?
    (monotonic reads session guarantee)
    # versions tolerable staleness =
    client’s read rate / global write rate
    (for a given key)

    View Slide

  136. Failure?

    View Slide

  137. Treat failures as
    latency spikes

    View Slide

  138. How long
    do partitions last?

    View Slide

  139. what time interval?
    99.9% uptime/yr
    ⇒ 8.76 hours downtime/yr
    8.76 consecutive hours down
    ⇒ bad 8-hour rolling average

    View Slide

  140. what time interval?
    99.9% uptime/yr
    ⇒ 8.76 hours downtime/yr
    8.76 consecutive hours down
    ⇒ bad 8-hour rolling average
    hide in tail of distribution OR
    continuously evaluate SLA, adjust

    View Slide
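The downtime figure is plain arithmetic; for reference (sketch):

    hours_per_year = 365 * 24                      # 8760 hours
    downtime_hours = hours_per_year * (1 - 0.999)  # allowed downtime at 99.9% uptime
    print(downtime_hours)                          # 8.76 hours/year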

  141. [Figure: write-latency CDFs (ms, log scale) for W=1, W=2, W=3; LNKD-SSD, LNKD-DISK, YMMR, WAN; N=3]

    View Slide

  142. [Figure: read-latency CDFs (ms, log scale) for R=1, R=2, R=3; LNKD-SSD, LNKD-DISK, YMMR, WAN; N=3. LNKD-SSD and LNKD-DISK identical for reads]

    View Slide

  143. ⟨k, t⟩-staleness:
    versions and time

    View Slide

  144. ⟨k, t⟩-staleness:
    versions and time
    approximation:
    exponentiate
    t-staleness by k

    View Slide
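One way to read "exponentiate t-staleness by k" (our interpretation; see the paper for the exact form): treat each of the last k versions as an independent t-visibility trial, so the miss probability shrinks geometrically with k.

    def kt_consistency(p_t, k):
        # approximate probability that a read t seconds after a write returns one of
        # the last k versions, assuming independent per-version t-visibility p_t
        return 1 - (1 - p_t) ** k

    print(kt_consistency(0.99, 2))   # tolerating up to 2 versions of staleness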

  145. “strong” consistency:
    reads return the last
    written value or newer
    (defined w.r.t. real time,
    when the read started)

    View Slide

  146. N = 3 replicas: R1, R2, R3
    Write to W, read from R replicas

    View Slide

  147. N = 3 replicas: R1, R2, R3
    Write to W, read from R replicas
    R=W=3 replicas: {R1, R2, R3}
    R=W=2 replicas: {R1, R2} {R2, R3} {R1, R3}
    quorum system:
    guaranteed intersection

    View Slide

  148. N = 3 replicas: R1, R2, R3
    Write to W, read from R replicas
    R=W=3 replicas: {R1, R2, R3}
    R=W=2 replicas: {R1, R2} {R2, R3} {R1, R3}
    R=W=1 replicas: {R1} {R2} {R3}
    quorum system:
    guaranteed intersection
    partial quorum system:
    may not intersect

    View Slide
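A small sketch of the intersection property on N = 3 replicas (illustrative Python, replica names as above):

    from itertools import combinations

    replicas = ["R1", "R2", "R3"]
    def quorums(size):
        return [set(q) for q in combinations(replicas, size)]

    # R=W=2: every read quorum overlaps every write quorum (guaranteed intersection)
    print(all(r & w for r in quorums(2) for w in quorums(2)))   # True
    # R=W=1: some read/write quorum pairs miss each other (probabilistic intersection)
    print(all(r & w for r in quorums(1) for w in quorums(1)))   # False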

  149. Synthetic,
    Exponential Distributions
    N=3, W=1, R=1

    View Slide

  150. Synthetic,
    Exponential Distributions
    W 1/4x ARS
    N=3, W=1, R=1

    View Slide

  151. Synthetic,
    Exponential Distributions
    W 1/4x ARS
    W 10x ARS
    N=3, W=1, R=1

    View Slide

  152. concurrent writes:
    deterministically choose
    Coordinator R=2
    (“key”, 1) (“key”, 2)

    View Slide


  157. N = 3 replicas
    Coordinator
    client
    read
    R=3
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)

    View Slide

  158. N = 3 replicas
    Coordinator
    client
    read(“key”)
    read
    R=3
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)

    View Slide

  159. N = 3 replicas
    Coordinator
    read(“key”)
    client
    read
    R=3
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)

    View Slide

  160. N = 3 replicas
    Coordinator
    (“key”, 1)
    (“key”, 1)
    (“key”, 1)
    client
    read
    R=3
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)

    View Slide

  161. N = 3 replicas
    Coordinator
    (“key”, 1)
    (“key”, 1)
    (“key”, 1)
    client
    (“key”, 1)
    read
    R=3
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)

    View Slide

  162. N = 3 replicas
    Coordinator
    read
    R=3
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)
    client

    View Slide

  163. N = 3 replicas
    Coordinator
    read(“key”)
    read
    R=3
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)
    client

    View Slide

  164. N = 3 replicas
    Coordinator
    read(“key”)
    read
    R=3
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)
    client

    View Slide

  165. N = 3 replicas
    Coordinator
    (“key”, 1)
    read
    R=3
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)
    client

    View Slide

  166. N = 3 replicas
    Coordinator
    (“key”, 1)
    (“key”, 1)
    read
    R=3
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)
    client

    View Slide

  167. N = 3 replicas
    Coordinator
    (“key”, 1)
    (“key”, 1)
    (“key”, 1)
    read
    R=3
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)
    client

    View Slide

  168. N = 3 replicas
    Coordinator
    (“key”, 1)
    (“key”, 1)
    (“key”, 1)
    (“key”, 1)
    read
    R=3
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)
    client

    View Slide

  169. N = 3 replicas
    Coordinator
    read
    R=1
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)
    client

    View Slide

  170. N = 3 replicas
    Coordinator
    read(“key”)
    read
    R=1
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)
    client

    View Slide

  171. N = 3 replicas
    Coordinator
    read(“key”)
    send
    read
    to all
    read
    R=1
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)
    client

    View Slide

  172. N = 3 replicas
    Coordinator
    (“key”, 1)
    send
    read
    to all
    read
    R=1
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)
    client

    View Slide

  173. N = 3 replicas
    Coordinator
    (“key”, 1)
    (“key”, 1)
    send
    read
    to all
    read
    R=1
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)
    client

    View Slide

  174. N = 3 replicas
    Coordinator
    (“key”, 1)
    (“key”, 1)
    (“key”, 1)
    send
    read
    to all
    read
    R=1
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)
    client

    View Slide

  175. N = 3 replicas
    Coordinator
    (“key”, 1)
    (“key”, 1)
    (“key”, 1)
    (“key”, 1)
    send
    read
    to all
    read
    R=1
    R1 R2 R3
    (“key”, 1) (“key”, 1) (“key”, 1)
    client

    View Slide

  176. Coordinator W=1
    R1(“key”, 1) R2(“key”, 1) R3(“key”, 1)

    View Slide

  177. Coordinator
    write(“key”, 2)
    W=1
    R1(“key”, 1) R2(“key”, 1) R3(“key”, 1)

    View Slide

  178. Coordinator
    write(“key”, 2)
    W=1
    R1(“key”, 1) R2(“key”, 1) R3(“key”, 1)

    View Slide

  179. Coordinator
    ack(“key”, 2)
    W=1
    R1 R2(“key”, 1) R3(“key”, 1)
    (“key”, 2)

    View Slide

  180. Coordinator
    ack(“key”, 2)
    ack(“key”, 2)
    W=1
    R1 R2(“key”, 1) R3(“key”, 1)
    (“key”, 2)

    View Slide

  181. Coordinator Coordinator
    read(“key”)
    ack(“key”, 2)
    W=1
    R1 R2(“key”, 1) R3(“key”, 1)
    (“key”, 2)
    R=1

    View Slide

  182. Coordinator Coordinator
    ack(“key”, 2)
    W=1
    R1 R2(“key”, 1) R3(“key”, 1)
    (“key”, 2)
    read(“key”)
    R=1

    View Slide

  183. Coordinator Coordinator
    ack(“key”, 2)
    W=1
    R1 R2(“key”, 1) R3(“key”, 1)
    (“key”, 2)
    (“key”, 1)
    R=1

    View Slide

  184. Coordinator Coordinator
    ack(“key”, 2)
    W=1
    R1 R2(“key”, 1) R3(“key”, 1)
    (“key”, 2)
    (“key”,1)
    R=1

    View Slide

  185. Coordinator Coordinator
    ack(“key”, 2)
    W=1
    R1 R2(“key”, 1) R3(“key”, 1)
    (“key”, 2)
    (“key”,1)
    R=1

    View Slide

  186. Coordinator Coordinator
    ack(“key”, 2)
    W=1
    R1 R2(“key”, 1) R3(“key”, 1)
    (“key”, 2)
    (“key”,1)
    R=1

    View Slide

  187. Coordinator Coordinator
    ack(“key”, 2)
    W=1
    R1 R2(“key”, 1) R3(“key”, 1)
    (“key”, 2)
    (“key”,1)
    R=1

    View Slide

  188. (“key”, 2)
    Coordinator Coordinator
    ack(“key”, 2)
    W=1
    R1 R2 R3(“key”, 1)
    (“key”, 2)
    (“key”,1)
    ack(“key”, 2)
    R=1

    View Slide

  189. (“key”, 2)
    Coordinator Coordinator
    ack(“key”, 2)
    W=1
    R1 R2 R3
    (“key”, 2)
    (“key”,1)
    ack(“key”, 2) ack(“key”, 2)
    (“key”, 2)
    R=1

    View Slide

  190. (“key”, 2)
    Coordinator Coordinator
    ack(“key”, 2)
    W=1
    R1 R2 R3
    (“key”, 2)
    (“key”,1)
    (“key”, 2)
    R=1

    View Slide

  191. (“key”, 2)
    Coordinator Coordinator
    ack(“key”, 2)
    W=1
    R1 R2 R3
    (“key”, 2)
    (“key”,1)
    (“key”, 2)
    (“key”, 2)
    R=1

    View Slide

  192. (“key”, 2)
    Coordinator Coordinator
    ack(“key”, 2)
    W=1
    R1 R2 R3
    (“key”, 2)
    (“key”,1)
    (“key”, 2)
    (“key”, 2) (“key”, 2)
    R=1

    View Slide

  193. (“key”, 2)
    Coordinator Coordinator
    ack(“key”, 2)
    W=1
    R1 R2 R3
    (“key”, 2)
    (“key”,1)
    (“key”, 2)
    R=1

    View Slide


  195. keep replicas in sync

    View Slide

  196. keep replicas in sync

    View Slide

  197. keep replicas in sync

    View Slide

  198. keep replicas in sync

    View Slide

  199. keep replicas in sync

    View Slide

  200. keep replicas in sync

    View Slide

  201. keep replicas in sync

    View Slide

  202. keep replicas in sync
    slow

    View Slide

  203. keep replicas in sync
    slow
    alternative: sync later

    View Slide

  204. keep replicas in sync
    slow
    alternative: sync later

    View Slide

  205. keep replicas in sync
    slow
    alternative: sync later

    View Slide

  206. keep replicas in sync
    slow
    alternative: sync later
    inconsistent

    View Slide

  207. keep replicas in sync
    slow
    alternative: sync later
    inconsistent

    View Slide

  208. keep replicas in sync
    slow
    alternative: sync later
    inconsistent

    View Slide

  209. keep replicas in sync
    slow
    alternative: sync later
    inconsistent

    View Slide

  210. http://ria101.wordpress.com/2010/02/24/hbase-vs-cassandra-why-we-moved/
    "In the general case, we typically
    use [Cassandra’s] consistency level
    of [R=W=1], which provides
    maximum performance. Nice!"

    --D. Williams,

    “HBase vs Cassandra: why we moved”

    February 2010

    View Slide

  211. http://www.reddit.com/r/programming/comments/bcqhi/reddits_now_running_on_cassandra/c0m3wh6

    View Slide

  212. http://www.reddit.com/r/programming/comments/bcqhi/reddits_now_running_on_cassandra/c0m3wh6

    View Slide

  213. How consistent?
    What happens if I don’t wait?

    View Slide

  214. Probability of reading a value older than k
    versions is exponentially reduced by k
    Pr(reading latest write) = 99%
    Pr(reading one of last two writes) = 99.9%
    Pr(reading one of last three writes) = 99.99%

    View Slide

  215. cassandra patch
    VLDB 2012 early print
    tinyurl.com/pbsvldb
    tinyurl.com/pbspatch

    View Slide

  216. “strong” consistency:
    reads return the last
    written value or newer
    (defined w.r.t. real time,
    when the read started)

    View Slide

  217. Coordinator Coordinator
    ack(“key”, 2)
    W=1
    R1 R2(“key”, 1) R3(“key”, 1)
    (“key”, 2)
    R=1

    View Slide

  218. Coordinator Coordinator
    ack(“key”, 2)
    W=1
    R1 R2(“key”, 1) R3(“key”, 1)
    (“key”, 2)
    (“key”, 1)
    (“key”,1)
    R=1

    View Slide

  219. Coordinator Coordinator
    write(“key”, 2)
    ack(“key”, 2)
    W=1
    R1 R2(“key”, 1) R3(“key”, 1)
    (“key”, 2)
    (“key”, 1)
    (“key”,1)
    R=1
    R3 replied before
    last write arrived!

    View Slide

  220. 99.9% consistent reads:
    R=1, W=1
    t = 1.85 ms
    Latency: 1.32 ms
    Latency is combined read and write latency at 99.9th percentile
    100% consistent reads:
    R=3, W=1
    Latency: 4.20 ms
    LNKD-SSD
    N=3

    View Slide

  221. 99.9% consistent reads:
    R=1, W=1
    t = 1.85 ms
    Latency: 1.32 ms
    Latency is combined read and write latency at 99.9th percentile
    100% consistent reads:
    R=3, W=1
    Latency: 4.20 ms
    LNKD-SSD
    N=3
    59.5%
    faster

    View Slide

  222. 1. Tracing
    2. Simulation
    3. Tune N, R, W
    4. Profit
    Workflow

    View Slide


  225. 99.9% consistent reads:
    R=1, W=1
    t = 202.0 ms
    Latency: 43.3 ms
    Latency is combined read and write latency at 99.9th percentile
    100% consistent reads:
    R=3, W=1
    Latency: 230.06 ms
    YMMR
    N=3

    View Slide

  226. YMMR, N=3
    100% consistent reads: R=3, W=1, latency 230.06 ms
    99.9% consistent reads: R=1, W=1, t=202.0 ms, latency 43.3 ms
    (81.1% faster)
    Latency is combined read and write latency at the 99.9th percentile

    View Slide

  227. R+W

    View Slide

  228. N=3

    View Slide

  229. N=3

    View Slide

  230. N=3

    View Slide

  231. focus on
    steady state
    with failures:
    unavailable
    or sloppy

    View Slide

  232. N = 3 replicas: R1, R2, R3
    Write to W, read from R replicas

    View Slide

  233. N = 3 replicas: R1, R2, R3
    Write to W, read from R replicas
    R=W=3 replicas: {R1, R2, R3}
    R=W=2 replicas: {R1, R2} {R2, R3} {R1, R3}
    quorum system:
    guaranteed intersection

    View Slide

  234. N = 3 replicas: R1, R2, R3
    Write to W, read from R replicas
    R=W=3 replicas: {R1, R2, R3}
    R=W=2 replicas: {R1, R2} {R2, R3} {R1, R3}
    R=W=1 replicas: {R1} {R2} {R3}
    quorum system:
    guaranteed intersection
    partial quorum system:
    may not intersect

    View Slide

  235. Coordinator Replica
    once per replica T
    i
    m
    e

    View Slide

  236. Coordinator Replica
    write
    once per replica T
    i
    m
    e

    View Slide

  237. Coordinator Replica
    write
    ack
    once per replica T
    i
    m
    e

    View Slide

  238. Coordinator Replica
    write
    ack
    wait for W
    responses
    once per replica T
    i
    m
    e

    View Slide

  239. Coordinator Replica
    write
    ack
    wait for W
    responses
    t seconds elapse
    once per replica T
    i
    m
    e

    View Slide

  240. Coordinator Replica
    write
    ack
    read
    wait for W
    responses
    t seconds elapse
    once per replica T
    i
    m
    e

    View Slide

  241. Coordinator Replica
    write
    ack
    read
    response
    wait for W
    responses
    t seconds elapse
    once per replica T
    i
    m
    e

    View Slide

  242. Coordinator Replica
    write
    ack
    read
    response
    wait for W
    responses
    t seconds elapse
    wait for R
    responses
    once per replica T
    i
    m
    e

    View Slide

  243. Coordinator Replica
    write
    ack
    read
    response
    wait for W
    responses
    t seconds elapse
    wait for R
    responses
    once per replica T
    i
    m
    e

    View Slide

  244. Coordinator Replica
    write
    ack
    read
    response
    wait for W
    responses
    t seconds elapse
    wait for R
    responses
    once per replica T
    i
    m
    e

    View Slide

  245. Coordinator Replica
    write
    ack
    read
    response
    wait for W
    responses
    t seconds elapse
    wait for R
    responses
    response is
    stale
    if read arrives
    before write
    once per replica T
    i
    m
    e

    View Slide

  246. Coordinator Replica
    write
    ack
    read
    response
    wait for W
    responses
    t seconds elapse
    wait for R
    responses
    response is
    stale
    if read arrives
    before write
    once per replica T
    i
    m
    e

    View Slide

  247. N=2
    T
    i
    m
    e

    View Slide

  248. write
    write
    N=2
    T
    i
    m
    e

    View Slide

  249. write
    ack
    write
    ack
    N=2
    T
    i
    m
    e

    View Slide

  250. write
    ack
    write
    ack
    W=1
    N=2
    T
    i
    m
    e

    View Slide

  251. write
    ack
    write
    ack
    W=1
    N=2
    T
    i
    m
    e

    View Slide

  252. write
    ack
    read
    write
    ack
    W=1
    N=2
    read
    T
    i
    m
    e

    View Slide

  253. write
    ack
    read
    response
    write
    ack
    W=1
    N=2
    read
    response
    T
    i
    m
    e

    View Slide

  254. write
    ack
    read
    response
    write
    ack
    W=1
    R=1
    N=2
    read
    response
    T
    i
    m
    e

    View Slide

  255. write
    ack
    read
    response
    write
    ack
    W=1
    R=1
    N=2
    read
    response
    T
    i
    m
    e

    View Slide

  256. write
    ack
    read
    response
    write
    ack
    W=1
    R=1
    N=2
    read
    response
    T
    i
    m
    e
    inconsistent

    View Slide

  257. N=3
    R=W=2
    quorum
    system

    View Slide

  258. N=3
    R=W=2
    quorum
    system

    View Slide

  259. N=3
    R=W=2
    quorum
    system

    View Slide

  260. Y N=3
    R=W=2
    quorum
    system

    View Slide

  261. Y N=3
    R=W=2
    quorum
    system

    View Slide

  262. Y N=3
    R=W=2
    quorum
    system

    View Slide

  263. Y
    Y
    N=3
    R=W=2
    quorum
    system

    View Slide

  264. Y
    Y
    N=3
    R=W=2
    quorum
    system

    View Slide

  265. Y
    Y
    N=3
    R=W=2
    quorum
    system

    View Slide

  266. Y
    Y
    Y
    N=3
    R=W=2
    quorum
    system

    View Slide

  267. Y
    Y
    Y
    N=3
    R=W=2
    quorum
    system

    View Slide

  268. Y
    Y
    Y
    Y
    Y
    Y
    N=3
    R=W=2
    quorum
    system

    View Slide

  269. Y
    Y
    Y
    Y
    Y
    Y
    N=3
    R=W=2
    quorum
    system

    View Slide

  270. Y
    Y
    Y
    Y
    Y
    Y
    Y
    Y
    Y
    N=3
    R=W=2
    quorum
    system

    View Slide

  271. Y
    Y
    Y
    Y
    Y
    Y
    Y
    Y
    Y
    N=3
    R=W=2
    quorum
    system

    View Slide

  272. Y
    Y
    Y
    Y
    Y
    Y
    Y
    Y
    Y
    guaranteed
    intersection
    N=3
    R=W=2
    quorum
    system

    View Slide

  273. N=3
    R=W=1
    partial
    quorum
    system

    View Slide

  274. N=3
    R=W=1
    partial
    quorum
    system

    View Slide

  275. N=3
    R=W=1
    partial
    quorum
    system

    View Slide

  276. Y N=3
    R=W=1
    partial
    quorum
    system

    View Slide

  277. Y N=3
    R=W=1
    partial
    quorum
    system

    View Slide

  278. Y N=3
    R=W=1
    partial
    quorum
    system

    View Slide

  279. Y
    N
    N=3
    R=W=1
    partial
    quorum
    system

    View Slide

  280. Y
    N
    N
    N=3
    R=W=1
    partial
    quorum
    system

    View Slide

  281. Y
    N
    N
    N=3
    R=W=1
    partial
    quorum
    system

    View Slide

  282. Y
    N
    N
    N
    Y
    N
    N=3
    R=W=1
    partial
    quorum
    system

    View Slide

  283. Y
    N
    N
    N
    Y
    N
    N=3
    R=W=1
    partial
    quorum
    system

    View Slide

  284. Y
    N
    N
    N
    Y
    N
    N
    N
    Y
    N=3
    R=W=1
    partial
    quorum
    system

    View Slide

  285. Y
    N
    N
    N
    Y
    N
    N
    N
    Y
    N=3
    R=W=1
    partial
    quorum
    system

    View Slide

  286. Y
    N
    N
    N
    Y
    N
    N
    N
    Y
    N=3
    R=W=1
    partial
    quorum
    system

    View Slide

  287. Y
    N
    N
    N
    Y
    N
    N
    N
    Y
    N=3
    R=W=1
    partial
    quorum
    system

    View Slide

  288. Y
    N
    N
    N
    Y
    N
    N
    N
    Y
    probabilistic
    intersection
    N=3
    R=W=1
    partial
    quorum
    system

    View Slide

  289. N
    N
    Y N=3
    R=W=1

    View Slide

  290. N
    N
    Y
    expanding
    quorums
    grow over time
    N=3
    R=W=1

    View Slide

  291. N
    Y
    Y
    expanding
    quorums
    grow over time
    N=3
    R=W=1

    View Slide

  292. Y
    Y
    Y
    expanding
    quorums
    grow over time
    N=3
    R=W=1

    View Slide


  294. Werner Vogels

    View Slide

  295. 1994-2004
    Werner Vogels

    View Slide

  296. 1994-2004
    2004-
    Werner Vogels

    View Slide

  297. N=3, R=W=2
    quorum
    system

    View Slide

  298. N=3, R=W=2
    quorum
    system

    View Slide

  299. N=3, R=W=2
    quorum
    system

    View Slide

  300. N=3, R=W=2
    quorum
    system

    View Slide

  301. N=3, R=W=2
    quorum
    system

    View Slide

  302. N=3, R=W=2
    quorum
    system

    View Slide

  303. N=3, R=W=2
    quorum
    system

    View Slide

  304. N=3, R=W=2
    quorum
    system

    View Slide

  305. N=3, R=W=2
    quorum
    system

    View Slide

  306. N=3, R=W=2
    quorum
    system

    View Slide

  307. guaranteed
    intersection
    N=3, R=W=2
    quorum
    system

    View Slide

  308. N=3, R=W=1
    partial
    quorum
    system

    View Slide

  309. N=3, R=W=1
    partial
    quorum
    system

    View Slide

  310. N=3, R=W=1
    partial
    quorum
    system

    View Slide

  311. N=3, R=W=1
    partial
    quorum
    system

    View Slide

  312. N=3, R=W=1
    partial
    quorum
    system

    View Slide

  313. N=3, R=W=1
    partial
    quorum
    system

    View Slide

  314. N=3, R=W=1
    partial
    quorum
    system

    View Slide

  315. N=3, R=W=1
    partial
    quorum
    system

    View Slide

  316. N=3, R=W=1
    partial
    quorum
    system
    probabilistic
    intersection

    View Slide

  317. expanding
    quorums
    N=3, R=W=1
    grow over
    time

    View Slide

  318. expanding
    quorums
    N=3, R=W=1
    grow over
    time

    View Slide

  319. expanding
    quorums
    N=3, R=W=1
    grow over
    time

    View Slide

  320. expanding
    quorums
    N=3, R=W=1
    grow over
    time

    View Slide

  321. expanding
    quorums
    N=3, R=W=1
    grow over
    time

    View Slide

  322. Solving WARS: hard
    Monte Carlo methods: easier

    View Slide

  323. PBS
    observation:
    no guarantees with
    eventual consistency
    remedy:
    consistency prediction
    technique:
    measure latencies,
    use WARS model

    View Slide

  324. PBS
    allows us to quantify
    latency-consistency
    trade-offs
    what’s the latency cost of consistency?
    what’s the consistency cost of latency?

    View Slide

  325. PBS
    allows us to quantify
    latency-consistency
    trade-offs
    what’s the latency cost of consistency?
    what’s the consistency cost of latency?
    an “SLA” for consistency

    View Slide