Highly Available Transactions: Virtues and Limitations
Peter Bailis, Aaron Davidson, Alan Fekete, Ali Ghodsi, Joe Hellerstein, Ion Stoica
UC Berkeley & University of Sydney
VLDB 2014, Hangzhou, China, 4 Sept. 2014
NETWORK PARTITIONS

"Network partitions should be rare but net gear continues to cause more issues than it should."
-- James Hamilton, Amazon Web Services [perspectives.mvdirona.com, 2010]

MSFT LAN: avg. 40.8 failures/day (95th %ile: 136); median time to repair 5 min (up to 1 week) [SIGCOMM 2011]
UC WAN: avg. 16.2–302.0 failures/link/year; avg. downtime 24–497 minutes/link/year [SIGCOMM 2011]
HP LAN: 67.1% of support tickets due to the network; median incident duration 114–188 min [HP Labs 2012]
"The Network Is Reliable" (Peter Bailis and Kyle Kingsbury, CACM, September 2014; DOI: 10.1145/2643130) — an informal survey of real-world communications failures:

"'The network is reliable' tops Peter Deutsch's classic list of 'eight fallacies of distributed computing,' all [of which] 'prove to be false in the long run and all [of which] cause big trouble and painful learning experiences' (https://blogs.oracle.com/jag/resource/Fallacies.html). Accounting for and understanding the implications of network behavior is key to designing robust distributed programs; in fact, six of Deutsch's 'fallacies' directly pertain to limitations on networked communications. This should be unsurprising: the ability (and often requirement) to communicate over a shared channel [determines the] possibility and impossibility of performing distributed computations under particular sets of network conditions. For example, the celebrated FLP impossibility result [9] demonstrates the inability to guarantee consensus in an asynchronous network (that is, one facing indefinite communication partitions between processes) with one faulty process. This means that, in the presence of unreliable (untimely) message delivery, basic operations such as modifying the set of machines in a cluster (that is, maintaining group membership, as systems such as ZooKeeper are tasked with today) are not guaranteed to complete in the event of both network asynchrony and individual server failures. Related results describe the inability to guarantee the progress of serializable transactions [7], linearizable reads/writes [11], and a variety of useful, programmer-friendly guarantees under adverse conditions [3]. The implications of these results are not simply academic: these impossibility results have motivated a proliferation of systems and designs offering a range of alternative guarantees in the event of network failures [5]. However, under a friendlier, more reliable network that guarantees timely message delivery, FLP and many of these related results no longer hold [8]: by making stronger guarantees about network behavior, we can circumvent the programmability implications of these impossibility proofs. Therefore, the degree of reliability in deployment environments is critical in robust systems design and directly determines the kinds of operations that systems can reliably perform without waiting. Unfortunately, the degree to which networks are actually reliable in the real world is the subject of considerable and evolving debate. Some have claimed that networks are reliable (or that partitions are rare enough in practice) and that we are too concerned with designing for theoretical failure modes."
HIGH AVAILABILITY

The system guarantees a response, even during network partitions (asynchronous network)
[Gilbert and Lynch, ACM SIGACT News 2002] ["PACELC," Abadi, IEEE Computer 2012]

Corollary: low latency, especially over the WAN
LOW LATENCY

LAN:             0.5 ms      (1x)
Co-located WAN:  1–3.5 ms    (2–7x)
WAN:             22–360 ms   (44–720x)

Average latency from 1 week on EC2
http://www.bailis.org/blog/communication-costs-in-real-world-networks/
Serializability supported?

Actian Ingres:   YES
Aerospike:       NO
Persistit:       NO
Clustrix:        NO
Greenplum:       YES
IBM DB2:         YES
IBM Informix:    YES
MySQL:           YES
MemSQL:          NO
MS SQL Server:   YES
NuoDB:           NO
Oracle 11g:      NO
Oracle BDB:      YES
Oracle BDB JE:   YES
Postgres 9.2.2:  YES
SAP HANA:        NO
ScaleDB:         NO
VoltDB:          YES

8/18 databases surveyed did not support serializability
15/18 used weak models by default
[Figure: taxonomy of isolation and consistency models, colored by availability]
Legend: Highly Available · Sticky Available · Unavailable
† prevents lost update; ‡ prevents write skew; ⊕ requires recency guarantees
Read Committed (RC)
- Transactions buffer writes until commit time
- Replicas never serve dirty or non-final writes

ANSI Repeatable Read (RR)
- Transactions buffer reads from replicas
- Transactions read from a snapshot of the DB

Unavailable implementations ⇏ unavailable semantics
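To make the RC construction concrete: a minimal Python sketch of the write-buffering idea (illustrative only, not from the talk or its codebase; `Replica` and `BufferingClient` are names I introduce). The client holds a transaction's writes locally and ships only the final versions at commit, so no replica ever serves a dirty or intermediate write — and commit needs no cross-replica coordination.

```python
import itertools

class Replica:
    """Stores only committed versions, keyed by item."""
    def __init__(self):
        self.committed = {}  # item -> (txn_id, value)

    def apply_commit(self, txn_id, writes):
        # Install the transaction's final writes; intermediate versions
        # were never sent, so they can never be observed.
        for item, value in writes.items():
            self.committed[item] = (txn_id, value)

    def read(self, item):
        # Serves only committed data (Read Committed), with no blocking.
        return self.committed.get(item)

class BufferingClient:
    """Buffers a transaction's writes until commit (HA Read Committed)."""
    _ids = itertools.count()

    def __init__(self, replicas):
        self.replicas = replicas
        self.txn_id = next(self._ids)
        self.buffer = {}  # uncommitted writes, visible only to this txn

    def write(self, item, value):
        self.buffer[item] = value  # overwrites keep only the final version

    def read(self, item, replica):
        if item in self.buffer:  # read your own buffered writes
            return (self.txn_id, self.buffer[item])
        return replica.read(item)

    def commit(self):
        # Ship final writes to all reachable replicas; partitioned
        # replicas simply catch up later, so availability is preserved.
        for r in self.replicas:
            r.apply_commit(self.txn_id, self.buffer)
        self.buffer = {}
```

Because RC makes no recency promises, a replica that missed an `apply_commit` during a partition is still correct: serving stale committed data is allowed. This is the sense in which an unavailable implementation (lock-based RC) does not imply unavailable semantics.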
Also achievable with high availability:
+ ANSI Repeatable Read: snapshot reads of database state (the snapshot does not change), including predicate-based reads
+ Read Atomic isolation (+TA): observe all or none of another txn's updates
+ Causal consistency: read your writes, time doesn't go backwards, writes follow reads
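As one illustration of the causal session guarantees above, here is a hedged Python sketch (mine, not the talk's) of a session that tracks the freshest version it has written or read per item and refuses to accept anything older. I assume servers assign a monotonically increasing per-item `version`; `Session` and `session_read` are illustrative names.

```python
class Session:
    """Tracks session state for read-your-writes + monotonic reads."""
    def __init__(self):
        self.write_floor = {}  # item -> min version this session must see
        self.read_floor = {}   # item -> highest version already read

    def on_write(self, item, version):
        # After writing version v, later reads of `item` must return >= v
        # (read your writes).
        self.write_floor[item] = max(self.write_floor.get(item, 0), version)

    def on_read(self, item, version):
        self.read_floor[item] = max(self.read_floor.get(item, 0), version)

    def acceptable(self, item, version):
        # Usable only if at least as fresh as everything this session has
        # written or already read ("time doesn't go backwards").
        floor = max(self.write_floor.get(item, 0),
                    self.read_floor.get(item, 0))
        return version >= floor

def session_read(session, item, replicas):
    """Return a sufficiently fresh (version, value) from any reachable
    replica, or None if every reachable replica is too stale."""
    for r in replicas:
        v = r.get(item)  # assumed interface: (version, value) or None
        if v is not None and session.acceptable(item, v[0]):
            session.on_read(item, v[0])
            return v
    return None
```

Note the sticky-availability flavor: if no reachable replica has seen the session's prior writes, the session must stick with (or wait for) one that has, rather than accept an older version.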
Experimental Validation
- Thrift-based sharded key-value store with LevelDB for persistence
- Focus on "CP" vs. HAT overheads
- Code: https://github.com/pbailis/hat-vldb2014-code

[Diagram: two EC2 clusters, cluster A and cluster B]
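The shape of the comparison can be caricatured in a few lines of Python (a sketch under assumed interfaces, not the Thrift codebase): a "CP"-style operation must round-trip to a distinguished master, which may sit in the other cluster, while a HAT operation can be served by whichever replica is local.

```python
import time

def timed(op):
    """Run op() and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = op()
    return result, time.perf_counter() - start

def cp_read(key, master):
    # "CP" path: always contact the distinguished master, which may live
    # in the other EC2 cluster -- a WAN round trip (tens to hundreds of
    # ms, per the latency table earlier in the talk).
    return master.get(key)

def hat_read(key, local_replica):
    # HAT path: any replica may answer, so the read stays in the local
    # cluster and pays only a LAN round trip (~0.5 ms).
    return local_replica.get(key)

# e.g. compare: timed(lambda: cp_read(k, master))
#          vs.  timed(lambda: hat_read(k, local_replica))
```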
Also in the paper:
- In-depth discussion of isolation guarantees
- Extending "AP" to the transactional context
- Sticky availability and sessions
- Discussion of atomicity and durability
- More evaluation
This paper: all about coordination + isolation levels (some surprising results!)
- How else can databases benefit?
- How do we address whole programs?
- Our experience: isolation levels are unintuitive!
- RAMP Transactions: a new isolation model and coordination-free implementation of indexing, materialized views, and multi-put [SIGMOD 2014]
- I-confluence: which integrity constraints are enforceable without coordination? OLTPBench suite plus a general theory [VLDB 2015]
- Real-world applications: analysis of open-source applications for coordination requirements; similar results [in preparation]
- Distributed optimization: numerical convex programs have close analogues to transaction-processing techniques [in preparation]
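For a flavor of how RAMP-style transactions stay coordination-free, here is a compressed Python sketch of the read path — my paraphrase of the RAMP-Fast idea from the SIGMOD 2014 paper, not its code. Writers attach the transaction's timestamp and full write-set to every version; readers take one round of reads, use that metadata to detect a fractured (non-atomic) view, and fetch any missing versions by timestamp in a second round. `latest` and `by_ts` are assumed lookup interfaces.

```python
# Each stored version carries RAMP-Fast metadata:
#   Version = (ts, value, wkeys)
#   ts    -- unique transaction timestamp
#   wkeys -- the full set of keys written by that transaction

def ramp_fast_read(keys, latest, by_ts):
    """latest[k] -> newest committed Version of k;
    by_ts[(k, ts)] -> the Version of k written at timestamp ts."""
    # Round 1: read whatever is newest, with no locks or coordination.
    result = {k: latest[k] for k in keys}

    # Detect fractured reads: if a retrieved version's transaction also
    # wrote key k, but we read an older version of k, we must catch up.
    required = {}
    for ts, _value, wkeys in result.values():
        for k in wkeys:
            if k in result and result[k][0] < ts:
                required[k] = max(required.get(k, 0), ts)

    # Round 2: fetch the exact missing versions by timestamp.
    for k, ts in required.items():
        result[k] = by_ts[(k, ts)]
    return {k: v[1] for k, v in result.items()}
```

Reads thus complete in at most two rounds, never block writers, and still observe all or none of another transaction's updates — Read Atomic isolation without coordination.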
PUNCHLINE
- Coordination is avoidable surprisingly often
- Need to understand use cases + semantics
- The use cases are staggering in number and often in plain sight; hint: look to applications and big systems in the wild
- We have a huge opportunity to improve theory and practice by understanding what's possible