Highly Available Transactions: Virtues and Limitations

pbailis
September 04, 2014

Transcript

  1. Highly
    Available
    Transactions
    Peter Bailis
    Aaron Davidson
    Alan Fekete
    Ali Ghodsi
    Joe Hellerstein
    Ion Stoica
    UC Berkeley &
    University of Sydney
    Virtues and Limitations
    VLDB 2014, Hangzhou, China, 4 Sept. 2014

  2. July 2000:
    CAP
    Theorem

  3. High Availability

  4. High Availability
    [Gilbert and Lynch, ACM SIGACT News 2002]
    System guarantees a response, even during
    network partitions (async network)

  12. network partitions

  13. NETWORK PARTITIONS

  14. “Network partitions should be rare but net gear
    continues to cause more issues than it should.”
    --James Hamilton, Amazon Web Services
    [perspectives.mvdirona.com, 2010]
    NETWORK PARTITIONS

  15. MSFT LAN: avg. 40.8 failures/day (95th %ile: 136)
    5 min median time to repair (up to 1 week)
    [SIGCOMM 2011]
    UC WAN: avg. 16.2–302.0 failures/link/year
    avg. downtime of 24–497 minutes/link/year
    [SIGCOMM 2011]
    HP LAN: 67.1% of support tickets are due to network
    median incident duration 114-188 min
    [HP Labs 2012]
    “Network partitions should be rare but net gear
    continues to cause more issues than it should.”
    --James Hamilton, Amazon Web Services
    [perspectives.mvdirona.com, 2010]
    NETWORK PARTITIONS

  16. The Network Is Reliable
    An informal survey of real-world communications failures.
    By Peter Bailis and Kyle Kingsbury
    CACM, September 2014 issue. DOI:10.1145/2643130
    “THE NETWORK IS RELIABLE” tops Peter Deutsch’s classic list of “Eight
    fallacies of distributed computing,” all [of which] “prove to be false in
    the long run and all [of which] cause big trouble and painful learning
    experiences” (https://blogs.oracle.com/jag/resource/Fallacies.html).
    Accounting for and understanding the implications of network behavior is
    key to designing robust distributed programs; in fact, six of Deutsch’s
    “fallacies” directly pertain to limitations on networked communications.
    This should be unsurprising: the ability (and often requirement) to
    communicate over a shared channel [...] possibility and impossibility of
    performing distributed computations under particular sets of network
    conditions. For example, the celebrated FLP impossibility result [9]
    demonstrates the inability to guarantee consensus in an asynchronous
    network (that is, one facing indefinite communication partitions between
    processes) with one faulty process. This means that, in the presence of
    unreliable (untimely) message delivery, basic operations such as modifying
    the set of machines in a cluster (that is, maintaining group membership,
    as systems such as ZooKeeper are tasked with today) are not guaranteed to
    complete in the event of both network asynchrony and individual server
    failures. Related results describe the inability to guarantee the progress
    of serializable transactions [7], linearizable reads/writes [11], and a
    variety of useful, programmer-friendly guarantees under adverse
    conditions [3]. The implications of these results are not simply academic:
    these impossibility results have motivated a proliferation of systems and
    designs offering a range of alternative guarantees in the event of network
    failures [5]. However, under a friendlier, more reliable network that
    guarantees timely message delivery, FLP and many of these related results
    no longer hold [8]: by making stronger guarantees about network behavior,
    we can circumvent the programmability implications of these impossibility
    proofs.
    Therefore, the degree of reliability in deployment environments is
    critical in robust systems design and directly determines the kinds of
    operations that systems can reliably perform without waiting.
    Unfortunately, the degree to which networks are actually reliable in the
    real world is the subject of considerable and evolving debate. Some have
    claimed that networks are reliable (or that partitions are rare enough in
    practice) and that we are too concerned with designing for theoretical
    failure [...]

  17. High Availability
    System guarantees a response, even during
    network partitions (async network)
    [Gilbert and Lynch, ACM SIGACT News 2002]

  18. High Availability
    System guarantees a response, even during
    network partitions (async network)
    [Gilbert and Lynch, ACM SIGACT News 2002]
    [“PACELC,” Abadi, IEEE Computer 2012]
    Corollary: low latency, especially over WAN

  19. low latency

  20. LOW LATENCY
    LAN: 0.5 ms (1x)
    Co-located WAN: 1-3.5 ms (2-7x)
    WAN: 22-360 ms (44-720x)
    average latency from 1 week on EC2
    http://www.bailis.org/blog/communication-costs-in-real-world-networks/
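
A rough back-of-the-envelope reading of the table above (illustrative arithmetic only, not a figure from the talk): an operation that blocks on one network round trip completes at most roughly 1/RTT times per second on a contended item, which is what the 44-720x WAN multiplier translates into.

    # Sequential operations gated on one blocking round trip each can finish
    # at most ~1/RTT per second; RTTs are the EC2 averages from the slide.
    round_trip_ms = {"LAN": 0.5, "co-located WAN": 3.5, "WAN": 360.0}

    for link, rtt in round_trip_ms.items():
        ops_per_sec = 1000.0 / rtt
        print(f"{link:>15}: {rtt:6.1f} ms RTT -> at most {ops_per_sec:,.0f} ops/s per contended item")

At LAN latencies that bound is about 2,000 ops/s; at the worst-case WAN figure it is under 3 ops/s.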

  25. THOSE LIGHT CONES_

  26. July 2000:
    CAP
    Theorem

  27. “AP” is fundamentally about

  28. “AP” is fundamentally about
    avoiding coordination

  29. “AP” is fundamentally about
    Availability
    avoiding coordination

  30. “AP” is fundamentally about
    Availability
    Low Latency
    avoiding coordination

  32. “AP” is fundamentally about
    Availability
    Low Latency
    High Throughput
    avoiding coordination

  33. “AP” is fundamentally about
    Availability
    Low Latency
    High Throughput
    Aggressive Scale-out
    avoiding coordination

  34. “AP” is fundamentally about
    Availability
    Low Latency
    High Throughput
    Aggressive Scale-out
    cf. “Coordination Avoidance in Database Systems”
    to appear in VLDB 2015
    avoiding coordination

  36. CONSISTENCY

    vs
    COORDINATION

  37. CONSISTENCY

    vs
    AVAILABILITY

  38. CONSISTENCY

    vs
    AVAILABILITY
    Linearizability
    “Atomic”
    C in CAP

  39. CONSISTENCY

    vs
    AVAILABILITY
    Eventual
    Linearizability
    “Atomic”
    C in CAP

  41. NoSQL

  42. NoSQL
    Strong consistency is expensive;
    avoid whenever possible!

  43. NoSQL
    Strong consistency is expensive;
    avoid whenever possible!
    Common (mis)conception:
    CAP implies transactions are unavailable

  45. CAP is about
    linearizability
    CAP doesn’t
    mention transactions

  46. Was
    the NoSQL
    movement
    right?

  47. Was
    the NoSQL
    movement
    right?
    Are all
    transactions
    unavailable?

  48. Is serializability achievable with HA?

  53. Serializability is expensive

  54. Serializability is expensive
    Use weaker models instead

  55. HANA

  56. do not support serializability
    HANA

  57. Serializability supported?
    Actian Ingres    YES
    Aerospike        NO
    Persistit        NO
    Clustrix         NO
    Greenplum        YES
    IBM DB2          YES
    IBM Informix     YES
    MySQL            YES
    MemSQL           NO
    MS SQL Server    YES
    NuoDB            NO
    Oracle 11G       NO
    Oracle BDB       YES
    Oracle BDB JE    YES
    Postgres 9.2.2   YES
    SAP HANA         NO
    ScaleDB          NO
    VoltDB           YES
    8/18 databases surveyed did not support serializability
    15/18 used weak models by default

  58. serializability

  60. serializability
    snapshot isolation
    read committed
    repeatable read
    cursor stability
    read uncommitted
    monotonic view
    update serializability

  63. serializability
    snapshot isolation
    read committed
    repeatable read
    cursor stability
    read uncommitted
    monotonic view
    update serializability
    HA?
    HA? HA?
    HA?
    HA?
    HA? HA?

  64. serializability
    snapshot isolation
    read committed
    repeatable read
    cursor stability
    read uncommitted
    monotonic view
    update serializability
    HA?
    HA? HA?
    HA?
    HA?
    HA? HA?
    Highly Available Transactions

  65. serializability
    snapshot isolation
    read committed
    repeatable read
    cursor stability
    read uncommitted
    monotonic view
    update serializability
    HA?
    HA? HA?
    HA?
    HA?
    HA? HA?
    HATs

  66. [Atul Adya, Ph.D. Thesis, MIT 1999]

  68. Challenge: traditional implementations
    are unavailable

  70. [Figure: isolation and consistency models grouped by availability.
    Legend: Unavailable, Sticky Available, Highly Available;
    prevents lost update†, prevents write skew‡,
    requires recency guarantees⊕]

  72. Existing Database Isolation

  73. Existing Database Isolation
    Distributed Registers

  74. Existing Database Isolation
    Session Guarantees
    Distributed Registers

  75. [Figure: isolation and consistency models grouped by availability.
    Legend: Unavailable, Sticky Available, Highly Available;
    prevents lost update†, prevents write skew‡,
    requires recency guarantees⊕]

  79. Read Committed (RC)

  80. Read Committed (RC)
    Replicas never serve dirty or non-final writes

  81. Read Committed (RC)
    Transactions buffer writes
    until commit time
    Replicas never serve dirty or non-final writes
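
A minimal sketch of the write-buffering approach this slide describes; the Replica and Txn classes, the key hashing, and the last-writer-wins merge are illustrative assumptions, not the paper's prototype API:

    import itertools, time

    _txn_ids = itertools.count(1)

    class Replica:
        """Toy replica: stores only committed (key -> (timestamp, value)) pairs."""
        def __init__(self):
            self.committed = {}

        def apply(self, key, ts, value):
            # Last-writer-wins on commit timestamp; uncommitted data never lands here.
            if key not in self.committed or ts > self.committed[key][0]:
                self.committed[key] = (ts, value)

        def read(self, key):
            entry = self.committed.get(key)
            return entry[1] if entry else None

    class Txn:
        """Read Committed, HA style: writes are buffered client-side until commit,
        so replicas never serve dirty or non-final writes."""
        def __init__(self, replicas):
            self.replicas = replicas
            self.buffer = {}                  # key -> value, invisible to other txns
            self.ts = (time.time(), next(_txn_ids))

        def write(self, key, value):
            self.buffer[key] = value          # no replica sees intermediate values

        def read(self, key):
            if key in self.buffer:            # read our own buffered writes
                return self.buffer[key]
            return self.replicas[hash(key) % len(self.replicas)].read(key)

        def commit(self):
            for key, value in self.buffer.items():
                for r in self.replicas:       # ship only final values, at commit
                    r.apply(key, self.ts, value)

Any reachable replica can serve reads and accept commits, so the sketch stays available under partitions; concurrent writers are merged last-writer-wins rather than serialized, which is why this buys Read Committed and not serializability.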

  83. Read Committed (RC)
    Transactions buffer writes
    until commit time
    Replicas never serve dirty or non-final writes
    ANSI Repeatable Read (RR)

  84. Read Committed (RC)
    Transactions buffer writes
    until commit time
    Replicas never serve dirty or non-final writes
    ANSI Repeatable Read (RR)
    Transactions read from a snapshot of DB

  85. Read Committed (RC)
    Transactions buffer writes
    until commit time
    Replicas never serve dirty or non-final writes
    ANSI Repeatable Read (RR)
    Transactions buffer reads
    from replicas
    Transactions read from a snapshot of DB

  87. Read Committed (RC)
    Transactions buffer writes
    until commit time
    Replicas never serve dirty or non-final writes
    ANSI Repeatable Read (RR)
    Transactions buffer reads
    from replicas
    Transactions read from a snapshot of DB
    Unavailable implementations
    ⇏ unavailable semantics
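
A matching sketch for the repeatable-read strategy above, building on the Txn class from the Read Committed sketch (again illustrative, not the paper's code): the transaction caches the first version it reads of each item, so its reads stay stable even if replicas change underneath. A fuller sketch would also cover predicate-based reads.

    class RepeatableReadTxn(Txn):
        """ANSI Repeatable Read, HA style: reads are buffered client-side, so
        re-reading an item within the transaction returns the same value."""
        def __init__(self, replicas):
            super().__init__(replicas)
            self.read_cache = {}              # key -> first value observed

        def read(self, key):
            if key in self.buffer:            # our own buffered write wins
                return self.buffer[key]
            if key not in self.read_cache:
                self.read_cache[key] = super().read(key)
            return self.read_cache[key]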

  88. [Figure: isolation and consistency models grouped by availability.
    Legend: Unavailable, Sticky Available, Highly Available;
    prevents lost update†, prevents write skew‡,
    requires recency guarantees⊕]

  89. + ANSI Repeatable Read
    Snapshot reads of database state
    (database does not change)
    including predicate-based reads
    + Read Atomic Isolation (+TA)
    Observe all or none of another txn’s updates
    + Causal Consistency
    Read your writes
    Time doesn’t go backwards
    Writes follow reads
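
One illustrative way the session pieces listed here can stay highly available, reusing the toy Replica from the Read Committed sketch (the timestamp scheme and names are assumptions, not the paper's design): the client session remembers the newest version it has read or written per key and never accepts anything older, giving read-your-writes and "time doesn't go backwards".

    class Session:
        """Sketch of HA session guarantees: read-your-writes + monotonic reads."""
        def __init__(self, replicas):
            self.replicas = replicas
            self.high_water = {}              # key -> newest (timestamp, value) seen

        def write(self, key, ts, value):
            self.high_water[key] = (ts, value)
            for r in self.replicas:           # best effort; stragglers catch up later
                r.apply(key, ts, value)

        def read(self, key):
            seen = self.high_water.get(key)
            remote = self.replicas[hash(key) % len(self.replicas)].committed.get(key)
            # Prefer whichever version is newer; never move backwards in time.
            candidates = [v for v in (seen, remote) if v is not None]
            if not candidates:
                return None
            newest = max(candidates, key=lambda entry: entry[0])
            self.high_water[key] = newest
            return newest[1]

Writes-follow-reads and the all-or-nothing visibility of Read Atomic need extra metadata (dependencies or the transaction's write set); the RAMP transactions mentioned at the end of the deck take the write-set route.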

  90. Experimental Validation
    https://github.com/pbailis/hat-vldb2014-code
    Thrift-based sharded key-value store
    with LevelDB for persistence
    Focus on “CP” vs. HAT overheads
    cluster A cluster B

  91. 2 clusters in us-east
    5 servers/cluster
    transactions of length 8
    50% reads, 50% writes
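
A hedged sketch of the measurement loop these parameters imply (8 operations per transaction, 50/50 read/write mix, average latency reported); the numbers on the following slides come from the Thrift/LevelDB prototype linked on the previous slide, not from this Python.

    import random, time

    def run_workload(txn_factory, keyspace=10_000, txn_len=8,
                     read_fraction=0.5, num_txns=1_000):
        """Run fixed-length transactions with a 50/50 read/write mix and
        return the average per-transaction latency in milliseconds."""
        latencies = []
        for _ in range(num_txns):
            start = time.perf_counter()
            txn = txn_factory()
            for _ in range(txn_len):
                key = f"key{random.randrange(keyspace)}"
                if random.random() < read_fraction:
                    txn.read(key)
                else:
                    txn.write(key, random.random())
            txn.commit()
            latencies.append((time.perf_counter() - start) * 1000)
        return sum(latencies) / len(latencies)

    # e.g., with the toy classes from the earlier sketches:
    #   replicas = [Replica() for _ in range(5)]
    #   print(run_workload(lambda: Txn(replicas)))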

  92. 2 clusters in us-east
    5 servers/cluster
    transactions of length 8
    50% reads, 50% writes
    [Chart: avg. latency (ms) for Eventual, RC, TA, and Master]

  94. [Chart: avg. latency (ms) for Eventual, RC, TA, and Master]
    Mastered: 2x latency of HATs
    2 clusters in us-east
    5 servers/cluster
    transactions of length 8
    50% reads, 50% writes

  95. [Chart: avg. latency (ms) for Eventual, RC, TA, and Master]
    Mastered: 2x latency of HATs
    2 clusters in us-east
    5 servers/cluster
    transactions of length 8
    50% reads, 50% writes
    128K ops/s

  96. clusters in us-east, us-west
    5 servers/DC
    transactions of length 8
    50% reads, 50% writes

  97. [Chart: avg. latency (ms) for Eventual, RC, TA, and Master]
    clusters in us-east, us-west
    5 servers/DC
    transactions of length 8
    50% reads, 50% writes

  98. [Chart: avg. latency (ms) for Eventual, RC, TA, and Master]
    300ms
    clusters in us-east, us-west
    5 servers/DC
    transactions of length 8
    50% reads, 50% writes

  99. [Chart: avg. latency (ms) for Eventual, RC, TA, and Master]
    Mastered: 2-70x latency of HATs
    300ms
    clusters in us-east, us-west
    5 servers/DC
    transactions of length 8
    50% reads, 50% writes

  100. CA, VA, OR, Ireland, Singapore
    5 servers/DC
    transactions of length 8
    50% reads, 50% writes

  101. [Chart: avg. latency (ms) for Eventual, RC, TA, and Master]
    CA, VA, OR, Ireland, Singapore
    5 servers/DC
    transactions of length 8
    50% reads, 50% writes

  102. [Chart: avg. latency (ms) for Eventual, RC, TA, and Master]
    800ms
    CA, VA, OR, Ireland, Singapore
    5 servers/DC
    transactions of length 8
    50% reads, 50% writes

  103. [Chart: avg. latency (ms) for Eventual, RC, TA, and Master]
    800ms
    Mastered: 8-186x latency of HATs
    CA, VA, OR, Ireland, Singapore
    5 servers/DC
    transactions of length 8
    50% reads, 50% writes

  104. Also in paper
    In-depth discussion of isolation guarantees
    Extending “AP” to transactional context
    Sticky availability and sessions
    Discussion of atomicity and durability
    More evaluation

  105. This paper:
    All about coordination + isolation levels
    (Some surprising results!)
    How else can databases benefit?
    How do we address whole programs?
    Our experience: isolation levels are unintuitive!

  106. RAMP Transactions: new isolation model and
    coordination-free implementation of indexing,
    matviews, multi-put [SIGMOD14]
    I-confluence: which integrity constraints are
    enforceable without coordination? OLTPBench
    suite plus general theory [VLDB 2015]
    Real-world applications: analysis of open-
    source applications for coordination
    requirements; similar results [In preparation]
    Distributed optimization: Numerical convex programs have close
    analogues to transaction-processing techniques [In preparation]

  107. PUNCHLINE:
    Coordination is avoidable surprisingly often
    Need to understand use cases + semantics
    The use cases are staggering in number
    and often in plain sight
    Hint: look to applications, big systems in the wild
    We have a huge opportunity to improve theory
    and practice by understanding what’s possible
