Highly Available Transactions: Virtues and Limitations

pbailis
September 04, 2014
Transcript

  1. Highly Available Transactions: Virtues and Limitations. Peter Bailis, Aaron Davidson, Alan Fekete, Ali Ghodsi, Joe Hellerstein, Ion Stoica. UC Berkeley & University of Sydney. VLDB 2014, Hangzhou, China, 4 Sept. 2014
  2. July 2000: CAP Theorem

  3. High Availability

  4.–11. High Availability [Gilbert and Lynch, ACM SIGACT News 2002]: System guarantees a response, even during network partitions (async network)
  12. network partitions

  13. NETWORK PARTITIONS

  14. “Network partitions should be rare but net gear continues to

    cause more issues than it should.” --James Hamilton, Amazon Web Services [perspectives.mvdirona.com, 2010] NETWORK PARTITIONS
  15. MSFT LAN: avg. 40.8 failures/day (95th %ile: 136); 5 min median time to repair (up to 1 week) [SIGCOMM 2011]. UC WAN: avg. 16.2–302.0 failures/link/year; avg. downtime of 24–497 minutes/link/year [SIGCOMM 2011]. HP LAN: 67.1% of support tickets are due to network; median incident duration 114–188 min [HP Labs 2012]. “Network partitions should be rare but net gear continues to cause more issues than it should.” --James Hamilton, Amazon Web Services [perspectives.mvdirona.com, 2010] NETWORK PARTITIONS
  16. “THE NETWORK IS RELIABLE” tops Peter Deutsch’s classic list of “Eight fallacies of distributed computing,” all [of which] “prove to be false in the long run and all [of which] cause big trouble and painful learning experiences” (https://blogs.oracle.com/jag/resource/Fallacies.html). Accounting for and understanding the implications of network behavior is key to designing robust distributed programs; in fact, six of Deutsch’s “fallacies” directly pertain to limitations on networked communications. This should be unsurprising: the ability (and often requirement) to communicate over a shared channel [...] possibility and impossibility of performing distributed computations under particular sets of network conditions. For example, the celebrated FLP impossibility result [9] demonstrates the inability to guarantee consensus in an asynchronous network (that is, one facing indefinite communication partitions between processes) with one faulty process. This means that, in the presence of unreliable (untimely) message delivery, basic operations such as modifying the set of machines in a cluster (that is, maintaining group membership, as systems such as ZooKeeper are tasked with today) are not guaranteed to complete in the event of both network asynchrony and individual server failures. Related results describe the inability to guarantee the progress of serializable transactions [7], linearizable reads/writes [11], and a variety of useful, programmer-friendly guarantees under adverse conditions [3]. The implications of these results are not simply academic: these impossibility results have motivated a proliferation of systems and designs offering a range of alternative guarantees in the event of network failures [5]. However, under a friendlier, more reliable network that guarantees timely message delivery, FLP and many of these related results no longer hold [8]: by making stronger guarantees about network behavior, we can circumvent the programmability implications of these impossibility proofs. Therefore, the degree of reliability in deployment environments is critical in robust systems design and directly determines the kinds of operations that systems can reliably perform without waiting. Unfortunately, the degree to which networks are actually reliable in the real world is the subject of considerable and evolving debate. Some have claimed that networks are reliable (or that partitions are rare enough in practice) and that we are too concerned with designing for theoretical failure [...] [From “The Network Is Reliable,” Peter Bailis and Kyle Kingsbury, CACM, September 2014; DOI:10.1145/2643130; article development led by queue.acm.org; “An informal survey of real-world communications failures.”]
  17. High Availability System guarantees a response, even during network partitions

    (async network) [Gilbert and Lynch, ACM SIGACT News 2002]
  18. High Availability System guarantees a response, even during network partitions

    (async network) [Gilbert and Lynch, ACM SIGACT News 2002] [“PACELC,” Abadi, IEEE Computer 2012] Corollary: low latency, especially over WAN
  19. low latency

  20.–24. LOW LATENCY: average latency from 1 week on EC2 (http://www.bailis.org/blog/communication-costs-in-real-world-networks/)

    LAN: 0.5 ms (1x); co-located WAN: 1–3.5 ms (2–7x); WAN: 22–360 ms (44–720x)
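
The multipliers on this slide are just ratios against the 0.5 ms LAN baseline. A minimal sketch of that arithmetic, in Python; the figures are the slide's one-week EC2 averages, and the "sequential rounds per second" ceiling is simply the reciprocal of the round-trip time, added here only for illustration:

```python
# Relative cost of a round trip versus the local-LAN baseline, plus how many
# *sequential* coordination rounds fit into one second at each latency.
# Figures are the averages quoted on the slide, not fresh measurements.

BASELINE_MS = 0.5  # LAN round trip

round_trips_ms = {
    "LAN": (0.5, 0.5),
    "co-located WAN": (1.0, 3.5),
    "WAN": (22.0, 360.0),
}

for name, (lo, hi) in round_trips_ms.items():
    mult_lo, mult_hi = lo / BASELINE_MS, hi / BASELINE_MS
    rounds_per_sec = 1000.0 / hi  # sequential rounds/second at the slow end
    print(f"{name:>15}: {mult_lo:.0f}x-{mult_hi:.0f}x baseline, "
          f"<= {rounds_per_sec:.1f} sequential rounds/s at {hi} ms")
```

At roughly 360 ms per round trip, an operation that must coordinate across the WAN before answering gets only a handful of sequential coordination rounds per second; this is the latency side of the "avoiding coordination" argument that the following slides make.
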
  25. THOSE LIGHT CONES_

  26. July 2000: CAP Theorem

  27. “AP” is fundamentally about…

  28. “AP” is fundamentally about avoiding coordination

  29. “AP” is fundamentally about avoiding coordination: Availability

  30.–31. “AP” is fundamentally about avoiding coordination: Availability, Low Latency

  32. “AP” is fundamentally about avoiding coordination: Availability, Low Latency, High Throughput

  33. “AP” is fundamentally about avoiding coordination: Availability, Low Latency, High Throughput, Aggressive Scale-out

  34. “AP” is fundamentally about avoiding coordination: Availability, Low Latency, High Throughput, Aggressive Scale-out (cf. “Coordination Avoidance in Database Systems,” to appear in VLDB 2015)
  35. None
  36. CONSISTENCY vs COORDINATION

  37. CONSISTENCY vs AVAILABILITY

  38. CONSISTENCY vs AVAILABILITY: Linearizability (“Atomic”; the C in CAP)

  39.–40. CONSISTENCY vs AVAILABILITY: Eventual vs. Linearizability (“Atomic”; the C in CAP)

  41. NoSQL

  42. Strong consistency is expensive; avoid whenever possible! NoSQL

  43. Common (mis)conception: CAP implies transactions are unavailable. Strong consistency is expensive; avoid whenever possible! NoSQL
  44. None
  45. CAP is about linearizability; CAP doesn’t mention transactions

  46. Was the NoSQL movement right?

  47. Was the NoSQL movement right? Are all transactions unavailable?

  48.–50. Is serializability achievable with HA?

  51. None
  52. None
  53. Serializability is expensive

  54. Serializability is expensive! Use weaker models instead

  55. HANA

  56. HANA: does not support serializability

  57. Serializability supported? Actian Ingres: YES; Aerospike: NO; Persistit: NO; Clustrix: NO; Greenplum: YES; IBM DB2: YES; IBM Informix: YES; MySQL: YES; MemSQL: NO; MS SQL Server: YES; NuoDB: NO; Oracle 11G: NO; Oracle BDB: YES; Oracle BDB JE: YES; Postgres 9.2.2: YES; SAP HANA: NO; ScaleDB: NO; VoltDB: YES. 8/18 databases surveyed did not support serializability; 15/18 used weak models by default
  58.–59. serializability

  60.–62. serializability, snapshot isolation, read committed, repeatable read, cursor stability, read uncommitted, monotonic view, update serializability

  63.–65. Which of these models are achievable with HA? Highly Available Transactions (HATs)
  66. [Atul Adya Ph.D. Thesis, MIT 2000]

  67. None
  68.–69. Challenge: traditional implementations are unavailable

  70. [Figure legend: Unavailable / Sticky Available / Highly Available; † prevents lost update; ‡ prevents write skew; ⊕ requires recency guarantees]
  71. None
  72. Existing Database Isolation

  73. Existing Database Isolation; Distributed Registers

  74. Existing Database Isolation; Session Guarantees; Distributed Registers

  75.–77. [Figure legend: Unavailable / Sticky Available / Highly Available; † prevents lost update; ‡ prevents write skew; ⊕ requires recency guarantees]
  78. None
  79. Read Committed (RC)

  80. Read Committed (RC): Replicas never serve dirty or non-final writes

  81.–82. Read Committed (RC): Transactions buffer writes until commit time; replicas never serve dirty or non-final writes

  83. + ANSI Repeatable Read (RR)

  84. ANSI Repeatable Read (RR): Transactions read from a snapshot of DB

  85.–86. ANSI Repeatable Read (RR): Transactions read from a snapshot of DB; transactions buffer reads from replicas

  87. Unavailable implementations ⇏ unavailable semantics
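
A minimal sketch of the client-side buffering idea on these slides, assuming nothing beyond an always-reachable key-value replica (the `Replica` and `BufferedTransaction` names are illustrative, not the paper's prototype): buffering writes until commit gives Read Committed, since replicas only ever see final values, and additionally buffering reads keeps reads stable within a transaction, in the spirit of ANSI Repeatable Read.

```python
# Illustrative sketch of HA Read Committed / ANSI Repeatable Read via
# client-side buffering, per the slides: writes are held until commit
# (replicas never serve dirty or non-final writes) and reads are buffered
# (a transaction never observes two versions of the same item).
# `Replica` stands in for any always-available key-value store.

class Replica:
    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def put(self, key, value):
        self.data[key] = value


class BufferedTransaction:
    def __init__(self, replica):
        self.replica = replica
        self.write_buffer = {}   # key -> last value written by this txn
        self.read_buffer = {}    # key -> first value read by this txn

    def read(self, key):
        if key in self.write_buffer:          # read-your-writes
            return self.write_buffer[key]
        if key not in self.read_buffer:       # buffered reads stay stable
            self.read_buffer[key] = self.replica.get(key)
        return self.read_buffer[key]

    def write(self, key, value):
        self.write_buffer[key] = value        # nothing visible until commit

    def commit(self):
        # Only final writes reach the replica, so other transactions can
        # never observe a dirty or intermediate value from this one.
        for key, value in self.write_buffer.items():
            self.replica.put(key, value)
```

Because the transaction only contacts replicas it can reach and never blocks on coordination, the semantics stay available under partitions; what is given up is recency, since a buffered or replica read may return a stale but committed value.
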
  88. [Figure legend: Unavailable / Sticky Available / Highly Available; † prevents lost update; ‡ prevents write skew; ⊕ requires recency guarantees]
  89. + ANSI Repeatable Read: snapshot reads of database state (database does not change), including predicate-based reads. Read Atomic Isolation (+TA): observe all or none of another txn’s updates. + Causal Consistency: read your writes, time doesn’t go backwards, writes follow reads
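
Read Atomic's "all or none" requirement can be phrased as a check over the versions one transaction observed. The sketch below uses hypothetical per-version metadata (the writing transaction's id, its full write set, and a commit timestamp, roughly in the spirit of the metadata the RAMP work attaches); `Version` and `is_read_atomic` are illustrative names, not the paper's code:

```python
# Sketch of the "fractured read" check behind Read Atomic isolation:
# a transaction must observe all or none of another transaction's updates.
# The metadata carried by each version is assumed, RAMP-style.

from dataclasses import dataclass

@dataclass(frozen=True)
class Version:
    value: object
    txn_id: int
    txn_write_set: frozenset   # every key the writing transaction updated
    commit_ts: int             # timestamps increase with commit order

def is_read_atomic(reads: dict) -> bool:
    """reads: key -> Version observed for that key by a single transaction.

    Returns False iff the read set is fractured: some observed version was
    written by a transaction T, T also wrote another key in the read set,
    and the version observed for that key is older than T's write.
    """
    for key, version in reads.items():
        for other_key in version.txn_write_set:
            if other_key == key or other_key not in reads:
                continue
            if reads[other_key].commit_ts < version.commit_ts:
                return False   # missed part of txn version.txn_id's updates
    return True
```

In RAMP-style designs a reader that detects this condition repairs it by fetching the missing versions in a second round rather than blocking writers, which is how the guarantee can remain highly available.
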
  90. Experimental Validation: Thrift-based sharded key-value store with LevelDB for persistence. Focus on “CP” vs. HAT overheads (cluster A, cluster B). Code: https://github.com/pbailis/hat-vldb2014-code
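
The prototype itself is the Thrift/LevelDB code linked above; purely as a reading aid for the numbers on the following slides, here is a hypothetical sketch of the workload they describe (transactions of length 8, a 50/50 read/write mix, average latency per transaction). The `client` object and its `begin`/`read`/`write`/`commit` methods are stand-ins, not the actual benchmark harness:

```python
# Hypothetical sketch of the measured workload, not the real harness:
# 8 operations per transaction, 50% reads / 50% writes, reporting average
# latency per committed transaction. `client` stands in for either a HAT
# configuration (Eventual/RC/TA) or the master-based "CP" baseline.

import random
import time

TXN_LENGTH = 8
READ_FRACTION = 0.5
NUM_KEYS = 10_000

def run_transaction(client):
    client.begin()
    for _ in range(TXN_LENGTH):
        key = "key%d" % random.randrange(NUM_KEYS)
        if random.random() < READ_FRACTION:
            client.read(key)
        else:
            client.write(key, random.getrandbits(64))
    client.commit()

def avg_latency_ms(client, num_txns=1000):
    start = time.perf_counter()
    for _ in range(num_txns):
        run_transaction(client)
    return 1000.0 * (time.perf_counter() - start) / num_txns
```
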
  91. 2 clusters in us-east; 5 servers/cluster; transactions of length 8; 50% reads, 50% writes

  92.–95. [Plot: Avg. Latency (ms) for Eventual, RC, TA, Master under the above workload] Mastered: 2x latency of HATs; 128K ops/s
  96. clusters in us-east, us-west; 5 servers/DC; transactions of length 8; 50% reads, 50% writes

  97.–99. [Plot: Avg. Latency (ms) for Eventual, RC, TA, Master] 300ms; Mastered: 2-70x latency of HATs
  100. CA, VA, OR, Ireland, Singapore; 5 servers/DC; transactions of length 8; 50% reads, 50% writes

  101.–103. [Plot: Avg. Latency (ms) for Eventual, RC, TA, Master] 800ms; Mastered: 8-186x latency of HATs
  104. Also in paper: in-depth discussion of isolation guarantees; extending “AP” to a transactional context; sticky availability and sessions; discussion of atomicity and durability; more evaluation
  105. This paper: all about coordination + isolation levels (some surprising results!). How else can databases benefit? How do we address whole programs? Our experience: isolation levels are unintuitive!
  106. RAMP Transactions: new isolation model and coordination-free implementation of indexing, matviews, multi-put [SIGMOD 2014]. I-confluence: which integrity constraints are enforceable without coordination? OLTPBench suite plus general theory [VLDB 2015]. Real-world applications: analysis of open-source applications for coordination requirements; similar results [in preparation]. Distributed optimization: numerical convex programs have close analogues to transaction-processing techniques [in preparation]
  107. PUNCHLINE: Coordination is avoidable surprisingly often. Need to understand use cases + semantics; the use cases are staggering in number and often in plain sight (hint: look to applications, big systems in the wild). We have a huge opportunity to improve theory and practice by understanding what’s possible