Highly Available Transactions: Virtues and Limitations

pbailis
September 04, 2014

Transcript

  1. Highly
    Available
    Transactions
    Peter Bailis
    Aaron Davidson
    Alan Fekete
    Ali Ghodsi
    Joe Hellerstein
    Ion Stoica
    UC Berkeley &
    University of Sydney
    Virtues and Limitations
    VLDB 2014, Hangzhou, China, 4 Sept. 2014

  2. July 2000:
    CAP
    Theorem

  3. High Availability

  4. High Availability
    [Gilbert and Lynch, ACM SIGACT News 2002]
    System guarantees a response, even during
    network partitions (async network)

  12. network partitions

  13. NETWORK PARTITIONS

  14. “Network partitions should be rare but net gear
    continues to cause more issues than it should.”
    --James Hamilton, Amazon Web Services
    [perspectives.mvdirona.com, 2010]
    NETWORK PARTITIONS

  15. MSFT LAN: avg. 40.8 failures/day (95th %ile: 136)
    5 min median time to repair (up to 1 week)
    [SIGCOMM 2011]
    UC WAN: avg. 16.2–302.0 failures/link/year
    avg. downtime of 24–497 minutes/link/year
    [SIGCOMM 2011]
    HP LAN: 67.1% of support tickets are due to network
    median incident duration 114-188 min
    [HP Labs 2012]
    “Network partitions should be rare but net gear
    continues to cause more issues than it should.”
    --James Hamilton, Amazon Web Services
    [perspectives.mvdirona.com, 2010]
    NETWORK PARTITIONS

  16. The Network Is Reliable
    An informal survey of real-world communications failures.
    By Peter Bailis and Kyle Kingsbury
    CACM, September 2014 issue. DOI:10.1145/2643130
    “THE NETWORK IS RELIABLE” tops Peter Deutsch’s classic list of “Eight
    fallacies of distributed computing,” all [of which] “prove to be false in
    the long run and all [of which] cause big trouble and painful learning
    experiences” (https://blogs.oracle.com/jag/resource/Fallacies.html).
    Accounting for and understanding the implications of network behavior is
    key to designing robust distributed programs; in fact, six of Deutsch’s
    “fallacies” directly pertain to limitations on networked communications.
    This should be unsurprising: the ability (and often requirement) to
    communicate over a shared channel [...] possibility and impossibility of
    performing distributed computations under particular sets of network
    conditions. For example, the celebrated FLP impossibility result [9]
    demonstrates the inability to guarantee consensus in an asynchronous
    network (that is, one facing indefinite communication partitions between
    processes) with one faulty process. This means that, in the presence of
    unreliable (untimely) message delivery, basic operations such as modifying
    the set of machines in a cluster (that is, maintaining group membership,
    as systems such as ZooKeeper are tasked with today) are not guaranteed to
    complete in the event of both network asynchrony and individual server
    failures. Related results describe the inability to guarantee the progress
    of serializable transactions [7], linearizable reads/writes [11], and a
    variety of useful, programmer-friendly guarantees under adverse
    conditions [3]. The implications of these results are not simply academic:
    these impossibility results have motivated a proliferation of systems and
    designs offering a range of alternative guarantees in the event of network
    failures [5]. However, under a friendlier, more reliable network that
    guarantees timely message delivery, FLP and many of these related results
    no longer hold [8]: by making stronger guarantees about network behavior,
    we can circumvent the programmability implications of these impossibility
    proofs.
    Therefore, the degree of reliability in deployment environments is
    critical in robust systems design and directly determines the kinds of
    operations that systems can reliably perform without waiting.
    Unfortunately, the degree to which networks are actually reliable in the
    real world is the subject of considerable and evolving debate. Some have
    claimed that networks are reliable (or that partitions are rare enough in
    practice) and that we are too concerned with designing for theoretical
    failure [...]

  17. High Availability
    System guarantees a response, even during
    network partitions (async network)
    [Gilbert and Lynch, ACM SIGACT News 2002]

  18. High Availability
    System guarantees a response, even during
    network partitions (async network)
    [Gilbert and Lynch, ACM SIGACT News 2002]
    [“PACELC,” Abadi, IEEE Computer 2012]
    Corollary: low latency, especially over WAN

  19. low latency

  20. LOW LATENCY
    LAN: 0.5 ms (1x)
    Co-located WAN: 1-3.5 ms (2-7x)
    WAN: 22-360 ms (44-720x)
    average latency from 1 week on EC2
    http://www.bailis.org/blog/communication-costs-in-real-world-networks/
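
A rough back-of-the-envelope reading of the table above (illustrative arithmetic only, not a figure from the talk): an operation that blocks on one network round trip completes at most roughly 1/RTT times per second on a contended item, which is what the 44-720x WAN multiplier translates into.

    # Sequential operations gated on one blocking round trip each can finish
    # at most ~1/RTT per second; RTTs are the EC2 averages from the slide.
    round_trip_ms = {"LAN": 0.5, "co-located WAN": 3.5, "WAN": 360.0}

    for link, rtt in round_trip_ms.items():
        ops_per_sec = 1000.0 / rtt
        print(f"{link:>15}: {rtt:6.1f} ms RTT -> at most {ops_per_sec:,.0f} ops/s per contended item")

At LAN latencies that bound is about 2,000 ops/s; at the worst-case WAN figure it is under 3 ops/s.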

  25. THOSE LIGHT CONES_

  26. July 2000:
    CAP
    Theorem

  27. “AP” is fundamentally about

  28. “AP” is fundamentally about
    avoiding coordination

  29. “AP” is fundamentally about
    Availability
    avoiding coordination

  30. “AP” is fundamentally about
    Availability
    Low Latency
    avoiding coordination

  32. “AP” is fundamentally about
    Availability
    Low Latency
    High Throughput
    avoiding coordination

  33. “AP” is fundamentally about
    Availability
    Low Latency
    High Throughput
    Aggressive Scale-out
    avoiding coordination

  34. “AP” is fundamentally about
    Availability
    Low Latency
    High Throughput
    Aggressive Scale-out
    cf. “Coordination Avoidance in Database Systems”
    to appear in VLDB 2015
    avoiding coordination

  36. CONSISTENCY

    vs
    COORDINATION

  37. CONSISTENCY

    vs
    AVAILABILITY

  38. CONSISTENCY

    vs
    AVAILABILITY
    Linearizability
    “Atomic”
    C in CAP

  39. CONSISTENCY

    vs
    AVAILABILITY
    Eventual
    Linearizability
    “Atomic”
    C in CAP

  41. NoSQL

  42. NoSQL
    Strong consistency is expensive;
    avoid whenever possible!

  43. NoSQL
    Strong consistency is expensive;
    avoid whenever possible!
    Common (mis)conception:
    CAP implies transactions are unavailable

  45. CAP is about
    linearizability
    CAP doesn’t
    mention transactions

  46. Was
    the NoSQL
    movement
    right?

  47. Was
    the NoSQL
    movement
    right?
    Are all
    transactions
    unavailable?

  48. Is serializability achievable with HA?

  53. Serializability is expensive

  54. Serializability is expensive
    Use weaker models instead

  55. HANA

  56. do not support serializability
    HANA

  57. Serializability supported?
    Actian Ingres    YES
    Aerospike        NO
    Persistit        NO
    Clustrix         NO
    Greenplum        YES
    IBM DB2          YES
    IBM Informix     YES
    MySQL            YES
    MemSQL           NO
    MS SQL Server    YES
    NuoDB            NO
    Oracle 11G       NO
    Oracle BDB       YES
    Oracle BDB JE    YES
    Postgres 9.2.2   YES
    SAP HANA         NO
    ScaleDB          NO
    VoltDB           YES
    8/18 databases surveyed did not support serializability
    15/18 used weak models by default

  58. serializability

  60. serializability
    snapshot isolation
    read committed
    repeatable read
    cursor stability
    read uncommitted
    monotonic view
    update serializability

  63. serializability
    snapshot isolation
    read committed
    repeatable read
    cursor stability
    read uncommitted
    monotonic view
    update serializability
    HA?
    HA? HA?
    HA?
    HA?
    HA? HA?

  64. serializability
    snapshot isolation
    read committed
    repeatable read
    cursor stability
    read uncommitted
    monotonic view
    update serializability
    HA?
    HA? HA?
    HA?
    HA?
    HA? HA?
    Highly Available Transactions

  65. serializability
    snapshot isolation
    read committed
    repeatable read
    cursor stability
    read uncommitted
    monotonic view
    update serializability
    HA?
    HA? HA?
    HA?
    HA?
    HA? HA?
    HATs

  66. [Atul Adya, Ph.D. Thesis, MIT 1999]

  68. Challenge: traditional implementations
    are unavailable

  70. [Figure: isolation and consistency models grouped by availability.
    Legend: Unavailable, Sticky Available, Highly Available;
    prevents lost update†, prevents write skew‡,
    requires recency guarantees⊕]

  72. Existing Database Isolation

  73. Existing Database Isolation
    Distributed Registers

  74. Existing Database Isolation
    Session Guarantees
    Distributed Registers

  75. [Figure: isolation and consistency models grouped by availability.
    Legend: Unavailable, Sticky Available, Highly Available;
    prevents lost update†, prevents write skew‡,
    requires recency guarantees⊕]

  79. Read Committed (RC)

  80. Read Committed (RC)
    Replicas never serve dirty or non-final writes

  81. Read Committed (RC)
    Transactions buffer writes
    until commit time
    Replicas never serve dirty or non-final writes
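
A minimal sketch of the write-buffering approach this slide describes; the Replica and Txn classes, the key hashing, and the last-writer-wins merge are illustrative assumptions, not the paper's prototype API:

    import itertools, time

    _txn_ids = itertools.count(1)

    class Replica:
        """Toy replica: stores only committed (key -> (timestamp, value)) pairs."""
        def __init__(self):
            self.committed = {}

        def apply(self, key, ts, value):
            # Last-writer-wins on commit timestamp; uncommitted data never lands here.
            if key not in self.committed or ts > self.committed[key][0]:
                self.committed[key] = (ts, value)

        def read(self, key):
            entry = self.committed.get(key)
            return entry[1] if entry else None

    class Txn:
        """Read Committed, HA style: writes are buffered client-side until commit,
        so replicas never serve dirty or non-final writes."""
        def __init__(self, replicas):
            self.replicas = replicas
            self.buffer = {}                  # key -> value, invisible to other txns
            self.ts = (time.time(), next(_txn_ids))

        def write(self, key, value):
            self.buffer[key] = value          # no replica sees intermediate values

        def read(self, key):
            if key in self.buffer:            # read our own buffered writes
                return self.buffer[key]
            return self.replicas[hash(key) % len(self.replicas)].read(key)

        def commit(self):
            for key, value in self.buffer.items():
                for r in self.replicas:       # ship only final values, at commit
                    r.apply(key, self.ts, value)

Any reachable replica can serve reads and accept commits, so the sketch stays available under partitions; concurrent writers are merged last-writer-wins rather than serialized, which is why this buys Read Committed and not serializability.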

  83. Read Committed (RC)
    Transactions buffer writes
    until commit time
    Replicas never serve dirty or non-final writes
    ANSI Repeatable Read (RR)

  84. Read Committed (RC)
    Transactions buffer writes
    until commit time
    Replicas never serve dirty or non-final writes
    ANSI Repeatable Read (RR)
    Transactions read from a snapshot of DB

  85. Read Committed (RC)
    Transactions buffer writes
    until commit time
    Replicas never serve dirty or non-final writes
    ANSI Repeatable Read (RR)
    Transactions buffer reads
    from replicas
    Transactions read from a snapshot of DB

  87. Read Committed (RC)
    Transactions buffer writes
    until commit time
    Replicas never serve dirty or non-final writes
    ANSI Repeatable Read (RR)
    Transactions buffer reads
    from replicas
    Transactions read from a snapshot of DB
    Unavailable implementations
    ⇏ unavailable semantics
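
A matching sketch for the repeatable-read strategy above, building on the Txn class from the Read Committed sketch (again illustrative, not the paper's code): the transaction caches the first version it reads of each item, so its reads stay stable even if replicas change underneath. A fuller sketch would also cover predicate-based reads.

    class RepeatableReadTxn(Txn):
        """ANSI Repeatable Read, HA style: reads are buffered client-side, so
        re-reading an item within the transaction returns the same value."""
        def __init__(self, replicas):
            super().__init__(replicas)
            self.read_cache = {}              # key -> first value observed

        def read(self, key):
            if key in self.buffer:            # our own buffered write wins
                return self.buffer[key]
            if key not in self.read_cache:
                self.read_cache[key] = super().read(key)
            return self.read_cache[key]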

  88. [Figure: isolation and consistency models grouped by availability.
    Legend: Unavailable, Sticky Available, Highly Available;
    prevents lost update†, prevents write skew‡,
    requires recency guarantees⊕]

  89. + ANSI Repeatable Read
    Snapshot reads of database state
    (database does not change)
    including predicate-based reads
    + Read Atomic Isolation (+TA)
    Observe all or none of another txn’s updates
    + Causal Consistency
    Read your writes
    Time doesn’t go backwards
    Writes follow reads
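
One illustrative way the session pieces listed here can stay highly available, reusing the toy Replica from the Read Committed sketch (the timestamp scheme and names are assumptions, not the paper's design): the client session remembers the newest version it has read or written per key and never accepts anything older, giving read-your-writes and "time doesn't go backwards".

    class Session:
        """Sketch of HA session guarantees: read-your-writes + monotonic reads."""
        def __init__(self, replicas):
            self.replicas = replicas
            self.high_water = {}              # key -> newest (timestamp, value) seen

        def write(self, key, ts, value):
            self.high_water[key] = (ts, value)
            for r in self.replicas:           # best effort; stragglers catch up later
                r.apply(key, ts, value)

        def read(self, key):
            seen = self.high_water.get(key)
            remote = self.replicas[hash(key) % len(self.replicas)].committed.get(key)
            # Prefer whichever version is newer; never move backwards in time.
            candidates = [v for v in (seen, remote) if v is not None]
            if not candidates:
                return None
            newest = max(candidates, key=lambda entry: entry[0])
            self.high_water[key] = newest
            return newest[1]

Writes-follow-reads and the all-or-nothing visibility of Read Atomic need extra metadata (dependencies or the transaction's write set); the RAMP transactions mentioned at the end of the deck take the write-set route.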

  90. Experimental Validation
    https://github.com/pbailis/hat-vldb2014-code
    Thrift-based sharded key-value store
    with LevelDB for persistence
    Focus on “CP” vs. HAT overheads
    cluster A cluster B

  91. 2 clusters in us-east
    5 servers/cluster
    transactions of length 8
    50% reads, 50% writes
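
A hedged sketch of the measurement loop these parameters imply (8 operations per transaction, 50/50 read/write mix, average latency reported); the numbers on the following slides come from the Thrift/LevelDB prototype linked on the previous slide, not from this Python.

    import random, time

    def run_workload(txn_factory, keyspace=10_000, txn_len=8,
                     read_fraction=0.5, num_txns=1_000):
        """Run fixed-length transactions with a 50/50 read/write mix and
        return the average per-transaction latency in milliseconds."""
        latencies = []
        for _ in range(num_txns):
            start = time.perf_counter()
            txn = txn_factory()
            for _ in range(txn_len):
                key = f"key{random.randrange(keyspace)}"
                if random.random() < read_fraction:
                    txn.read(key)
                else:
                    txn.write(key, random.random())
            txn.commit()
            latencies.append((time.perf_counter() - start) * 1000)
        return sum(latencies) / len(latencies)

    # e.g., with the toy classes from the earlier sketches:
    #   replicas = [Replica() for _ in range(5)]
    #   print(run_workload(lambda: Txn(replicas)))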

  92. 2 clusters in us-east
    5 servers/cluster
    transactions of length 8
    50% reads, 50% writes
    [Chart: avg. latency (ms) for Eventual, RC, TA, and Master]

  94. [Chart: avg. latency (ms) for Eventual, RC, TA, and Master]
    Mastered: 2x latency of HATs
    2 clusters in us-east
    5 servers/cluster
    transactions of length 8
    50% reads, 50% writes

  95. [Chart: avg. latency (ms) for Eventual, RC, TA, and Master]
    Mastered: 2x latency of HATs
    2 clusters in us-east
    5 servers/cluster
    transactions of length 8
    50% reads, 50% writes
    128K ops/s

  96. clusters in us-east, us-west
    5 servers/DC
    transactions of length 8
    50% reads, 50% writes

  97. [Chart: avg. latency (ms) for Eventual, RC, TA, and Master]
    clusters in us-east, us-west
    5 servers/DC
    transactions of length 8
    50% reads, 50% writes

  98. [Chart: avg. latency (ms) for Eventual, RC, TA, and Master]
    300ms
    clusters in us-east, us-west
    5 servers/DC
    transactions of length 8
    50% reads, 50% writes

  99. [Chart: avg. latency (ms) for Eventual, RC, TA, and Master]
    Mastered: 2-70x latency of HATs
    300ms
    clusters in us-east, us-west
    5 servers/DC
    transactions of length 8
    50% reads, 50% writes

  100. CA, VA, OR, Ireland, Singapore
    5 servers/DC
    transactions of length 8
    50% reads, 50% writes

  101. [Chart: avg. latency (ms) for Eventual, RC, TA, and Master]
    CA, VA, OR, Ireland, Singapore
    5 servers/DC
    transactions of length 8
    50% reads, 50% writes

  102. [Chart: avg. latency (ms) for Eventual, RC, TA, and Master]
    800ms
    CA, VA, OR, Ireland, Singapore
    5 servers/DC
    transactions of length 8
    50% reads, 50% writes

  103. [Chart: avg. latency (ms) for Eventual, RC, TA, and Master]
    800ms
    Mastered: 8-186x latency of HATs
    CA, VA, OR, Ireland, Singapore
    5 servers/DC
    transactions of length 8
    50% reads, 50% writes

  104. Also in paper
    In-depth discussion of isolation guarantees
    Extending “AP” to transactional context
    Sticky availability and sessions
    Discussion of atomicity and durability
    More evaluation

  105. This paper:
    All about coordination + isolation levels
    (Some surprising results!)
    How else can databases benefit?
    How do we address whole programs?
    Our experience: isolation levels are unintuitive!

  106. RAMP Transactions: new isolation model and
    coordination-free implementation of indexing,
    matviews, multi-put [SIGMOD14]
    I-confluence: which integrity constraints are
    enforceable without coordination? OLTPBench
    suite plus general theory [VLDB 2015]
    Real-world applications: analysis of open-
    source applications for coordination
    requirements; similar results [In preparation]
    Distributed optimization: Numerical convex programs have close
    analogues to transaction-processing techniques [In preparation]

  107. PUNCHLINE:
    Coordination is avoidable surprisingly often
    Need to understand use cases + semantics
    The use cases are staggering in number
    and often in plain sight
    Hint: look to applications, big systems in the wild
    We have a huge opportunity to improve theory
    and practice by understanding what’s possible
