Slide 1

Slide 1 text

CAP Theorem: 
 not what we thought it was, not what we are looking for
 Shlomi Noach, GitHub
 DevOpsDays TLV 2019

Slide 2

Slide 2 text

About me @github/database-infrastructure Author of orchestrator, gh-ost, freno, ccql and others. Blog at http://openark.org 
 github.com/shlomi-noach
 @ShlomiNoach

Slide 3

Slide 3 text

GitHub
 Built for developers Largest open source hosting 40M+ developers
 2.9M+ organizations
 100M+ repositories, Supplier of octocat T-Shirts and stickers

Slide 4

Slide 4 text

Incentive An important part of our work is to keep the service available, so that users/customers have access to their data and can operate on it.

Slide 5

Slide 5 text

CAP Conjecture
 Suggested by Eric Brewer, 1998-1999
 Terms:
 • [strong] Consistency
 • [high] Availability
 • Partition tolerance
 Strong Consistency, High Availability, Partition-resilience: pick at most 2.
 [diagram: C, A, P]

Slide 6

Slide 6 text

CAP Theorem Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services Seth Gilbert, Nancy Lynch

Slide 7

Slide 7 text

CAP Theorem A mathematical proof. Uses a different definition of Availability than the CAP Conjecture’s definition. Neither Availability definition contains the other (that is, neither is stronger than the other).

Slide 8

Slide 8 text

CAP Theorem In Math, a theorem only holds as long as its terms & conditions are met. CAP Theorem does not prove the CAP Conjecture.

Slide 9

Slide 9 text

CAP Theorem Terms, according to Gilbert & Lynch: - Consistency - Availability - Partition tolerance

Slide 10

Slide 10 text

CAP Theorem: Consistency
 Once a write is successful on a node, any read on any node must reflect that write or any later write.
 Also known as atomic consistency, strong consistency, linear consistency, or linearizability.

Slide 11

Slide 11 text

CAP Theorem: Availability
 “…every request received by a non-failing node in the system must result in a response.”
 There is no constraint on the actual amount of time. Though it is not specified, it is implied that the response must be valid and non-error.
 Contrast with Brewer’s definition of High Availability: data is considered highly available if a given consumer of the data can always reach some replica.

Slide 12

Slide 12 text

CAP Theorem: Partition Tolerance
 The system is able to keep operating during a network partition. Partition tolerance is taken as a given condition, since network partitioning can and does take place regardless of a system’s design.

Slide 13

Slide 13 text

CAP Theorem
 A distributed data store [web service] cannot provide more than two out of the three properties.
 Better illustrated as:
 • If the network is good, you may achieve both Availability and Consistency.
 • If the network is partitioned, you must choose between Availability and Consistency.

Slide 14

Slide 14 text

CAP Theorem The discussion is mostly about CP vs. AP systems. [diagram: C, A, P]

Slide 15

Slide 15 text

Proof of CAP theorem Simplified, not by much. Given two nodes replicating from each other. [diagram: nodes n1 and n2]

Slide 16

Slide 16 text

Proof of CAP theorem We partition the network between the two nodes for an infinite amount of time. [diagram: nodes n1 and n2]

Slide 17

Slide 17 text

Proof of CAP theorem We write data to one node. If the system is Available, the write completes in a finite amount of time. [diagram: nodes n1 and n2]

Slide 18

Slide 18 text

Proof of CAP theorem We read data from the other node. If the system is Available, that read completes in a finite amount of time. [diagram: nodes n1 and n2]

Slide 19

Slide 19 text

Proof of CAP theorem During that finite time the network was partitioned, hence our read could not reflect the changes made by the write. QED [diagram: nodes n1 and n2]
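
The proof steps on slides 15 to 19 can be replayed as a toy simulation. This is only a sketch: the `Node` class, its `partitioned` flag, and the `partition` helper are hypothetical names made up for illustration, and "Available" is modelled simply as "every call returns immediately with a non-error answer".

```python
class Node:
    """A replica that stores a single value and replicates it to a peer."""

    def __init__(self, name):
        self.name = name
        self.value = None
        self.peer = None
        self.partitioned = False  # True while the link to the peer is cut

    def write(self, value):
        # An Available node must answer, so it accepts the write locally
        # even when it cannot reach its peer.
        self.value = value
        if not self.partitioned:
            self.peer.value = value  # replication only works without a partition
        return "ok"

    def read(self):
        # An Available node must answer with whatever it currently holds.
        return self.value


def partition(a, b):
    """Cut the link between two nodes (never healed in this sketch)."""
    a.partitioned = b.partitioned = True


n1, n2 = Node("n1"), Node("n2")
n1.peer, n2.peer = n2, n1

n1.write("v1")
healthy_read = n2.read()   # network is good: n2 sees the write

partition(n1, n2)
n1.write("v2")             # completes in finite time (Available)
stale_read = n2.read()     # also completes, but cannot reflect "v2"
```

After the partition, `n2` still answers, but with the old value: the system stayed Available and lost Consistency, which is the single scenario the proof relies on.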

Slide 20

Slide 20 text

The proof is mathematical Following solid Mathematical principles.

Slide 21

Slide 21 text

The proof is mathematical Reading data after writing is only possible when the write time is finite. It proves impossibility by counterexample: we’ve shown a (single) scenario where we can’t have both Availability and Consistency.

Slide 22

Slide 22 text

CAP Theorem CAP does not say “you cannot have both A and C at the same time”.* It says: “you cannot design a system where you will have both A and C together at all times”.*
 * Assuming P

Slide 23

Slide 23 text

All purple dragons can fly

Slide 24

Slide 24 text

Vacuous truth A statement that asserts that all members of the empty set have a certain property. [wikipedia]

Slide 25

Slide 25 text

Vacuous truth All purple dragons can fly.

Slide 26

Slide 26 text

Vacuous truth All purple dragons can fly faster than the speed of light.

Slide 27

Slide 27 text

Vacuous truth All purple dragons can answer questions. All purple dragons cannot answer questions.

Slide 28

Slide 28 text

CAP Availability “…every request received by a non-failing node in the system must result in a response.”

Slide 29

Slide 29 text

CAP Availability It is vacuously true that if all the nodes in my system have crashed, then my system is Available.
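
The vacuous-truth point can be seen directly in code. The sketch below is an assumption-laden illustration, not a real health check: `is_cap_available` and the `failed` / `responds` fields are hypothetical names, and the function encodes only the quoted definition (every non-failing node must respond).

```python
def is_cap_available(nodes):
    """True if every non-failing node responds (the quoted CAP definition)."""
    return all(node["responds"] for node in nodes if not node["failed"])


# Six crashed servers: no non-failing node exists, so the check passes vacuously.
all_crashed = [{"failed": True, "responds": False} for _ in range(6)]

# One server is up but hangs forever: now the definition is violated.
one_hung = [{"failed": False, "responds": False}] + all_crashed[1:]

crashed_is_available = is_cap_available(all_crashed)
hung_is_available = is_cap_available(one_hung)
```

Python's `all()` returns `True` for an empty sequence, which is the same vacuous truth as the purple dragons.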

Slide 30

Slide 30 text

Proposal for making my service available: Shut down all database servers? [diagram: six database servers]

Slide 31

Slide 31 text

Proposal for making my service available: Shut down all database servers? The CAP Theorem's definition of Availability goes against the practical definition of availability. [diagram: six database servers]

Slide 32

Slide 32 text

Infinite network partitioning

Slide 33

Slide 33 text

Infinite network partitioning Illustrated

Slide 34

Slide 34 text

Meet Anna and Ben, SREs at example.com

Slide 35

Slide 35 text


Slide 36

Slide 36 text

One day, Anna's attention is drawn to what seems to be an unfolding crisis. oh, dear!

Slide 37

Slide 37 text

Ben, are you seeing what I’m seeing?

Slide 38

Slide 38 text

The two investigate It seems like our Virginia router is malfunctioning

Slide 39

Slide 39 text

Confirmed with the datacenter; they saw some smoke, and then it burned. It's literally melted.

Slide 40

Slide 40 text

So our Virginia DC is now network isolated.

Slide 41

Slide 41 text

You know what that means?

Slide 42

Slide 42 text

oh, wow! oh, wow!

Slide 43

Slide 43 text

The router is never coming back. This is an infinite network partitioning!

Slide 44

Slide 44 text

We will never have consistency ever again!

Slide 45

Slide 45 text

There's just no point 
 in anything. CAP 
 Theorem proves this.

Slide 46

Slide 46 text

For eternity!

Slide 47

Slide 47 text

We may as well enjoy our time! We can live a life of adventure!

Slide 48

Slide 48 text

We can go to
 Paris!

Slide 49

Slide 49 text

We can go to
 Rome!

Slide 50

Slide 50 text

We can go to
 DevOpsDays TLV!

Slide 51

Slide 51 text

Woohoo!

Slide 52

Slide 52 text

Enters Carmen, CTO for example.com

Slide 53

Slide 53 text

Hey there! What’s
 going on? Seems like
 we have an outage?

Slide 54

Slide 54 text

Our Virginia router
 burst into flames.
 It is gone.

Slide 55

Slide 55 text

For eternity. We now
 have an infinite 
 network partitioning, 
 and will never again 
 have consistency.

Slide 56

Slide 56 text

There's just no point 
 in anything. CAP 
 Theorem proves this.

Slide 57

Slide 57 text

Can we expense 
 travel to 
 DevOpsDays TLV?

Slide 58

Slide 58 text

Slide 59

Slide 59 text

Slide 60

Slide 60 text

Slide 61

Slide 61 text

Ben?

Slide 62

Slide 62 text

Yeah?

Slide 63

Slide 63 text

Here's my corporate credit card. 
 I'd like you to go to the store downtown, buy a new router, take it to our Virginia datacenter and replace the old router.

Slide 64

Slide 64 text

gulp!

Slide 65

Slide 65 text

Yeah, 
 that also works!

Slide 66

Slide 66 text

Infinite network partitioning
 • Is not a practical situation.
 • Is not something we plan for.
 • Is not something that concerns us as we engineer our systems.

Slide 67

Slide 67 text

Can we trade “infinite” for 
 “long enough”? This would break the proof. If a network partition is capped (pun intended) at t seconds, we can stall any read by t+α, conveniently delaying it beyond the network partitioning and allowing for consistency. Remember: there is no cap (pun again) on query response time. [diagram: nodes n1 and n2]
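
A minimal sketch of the stalling argument, using plain numbers for time instead of a real clock. The names `T`, `ALPHA`, `partition_healed`, and `stalled_read` are hypothetical, and the sketch assumes the partition is already in progress when the read arrives.

```python
T = 30      # assumed upper bound on partition length, in seconds
ALPHA = 1   # assumed extra margin, in seconds


def partition_healed(at, partition_start, t=T):
    """True once a partition that began at partition_start must be over."""
    return at >= partition_start + t


def stalled_read(request_time, partition_start, fresh, stale, t=T, alpha=ALPHA):
    """Answer a read only after stalling for t + alpha seconds.

    Assumes partition_start <= request_time, so the stall always
    outlasts a partition that is capped at t seconds.
    """
    respond_at = request_time + t + alpha
    value = fresh if partition_healed(respond_at, partition_start, t) else stale
    return respond_at, value


# Partition starts at second 5, the read arrives at second 10.
respond_at, value = stalled_read(request_time=10, partition_start=5,
                                 fresh="v2", stale="v1")
```

The read is slow but still finite, so it counts as Available under the theorem's definition, and it returns the fresh value.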

Slide 68

Slide 68 text

Can we rewrite CAP Theorem in terms of finite timeouts?
 Yes:
 • in a system where a network partition can last more than t
 • and where we require queries to respond within time t/2
 • we can illustrate a proof similar to CAP Theorem’s, where given such partitioning we cannot achieve both availability and consistency.
 • Let’s name it “Time Limited CAP”.
 [diagram: nodes n1 and n2]

Slide 69

Slide 69 text

Questioning “time limited CAP” Is t a given constraint? Why would we necessarily require queries to complete in t/2? Is there system logic / customer logic to the above, or have we tailored this to fit a theorem we had?

Slide 70

Slide 70 text

Questioning “time limited CAP” Opinion: this chase is a fallacy. We seem to be trying to prove a theorem while we disagree with its terms, namely Availability.

Slide 71

Slide 71 text

CAP Conjecture Availability: data is considered highly available if a given consumer of the data can always reach some replica. Can we work with that?

Slide 72

Slide 72 text

n == 2?

Slide 73

Slide 73 text

n > 2 Availability: consumer always has access to data via at least one replica. Some node will have our data (we will likely need to detect which one). [diagram: five nodes]

Slide 74

Slide 74 text

CAP is short of CAPacity

Slide 75

Slide 75 text

CAP is short of CAPacity™

Slide 76

Slide 76 text

CAP is short of CAPacity™®

Slide 77

Slide 77 text

CAP is short of CAPacity™® PATENTED

Slide 78

Slide 78 text

CAP Conjecture, capacity
 What if we agreed to a quorum?
 q-Availability: assuming the number of replicas is n, the consumer always has access to data via at least ⌊n/2 + 1⌋ replicas, e.g.
 - in a 5-replica cluster, quorum would be 3
 - in a 7-replica cluster, quorum would be 4
 [diagram: five nodes]
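
The quorum size is simple integer arithmetic. A small sketch follows; the helper name `quorum` is an illustrative choice, not something from the talk.

```python
def quorum(n):
    """Smallest majority of n replicas: floor(n/2) + 1."""
    return n // 2 + 1


sizes = {n: quorum(n) for n in (3, 5, 7)}
# sizes == {3: 2, 5: 3, 7: 4}

# Any two quorums of the same cluster share at least one replica,
# because two majorities together exceed the cluster size.
overlap_guaranteed = all(2 * quorum(n) > n for n in range(1, 100))
```

That overlap is why a read from any quorum can see a write acknowledged by any other quorum.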

Slide 79

Slide 79 text

Consensus algorithms: 
 Paxos, Raft

Slide 80

Slide 80 text

Consensus algorithms: 
 Paxos, Raft
 Depending on implementation, can guarantee:
 • A write is readily Available to read on quorum nodes
 • A write is made durable on quorum nodes
 [diagram: five nodes]
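
The sketch below shows only the majority rule that quorum-based systems rely on. It is not Paxos or Raft: there is no leader election, log, or term handling. The function `can_commit` and the node names are hypothetical.

```python
def can_commit(reachable_nodes, cluster_size):
    """A write commits only if a majority of the cluster can acknowledge it."""
    return len(reachable_nodes) >= cluster_size // 2 + 1


cluster = {"n1", "n2", "n3", "n4", "n5"}

# A partition splits the five nodes into a group of 3 and a group of 2.
majority_side = {"n1", "n2", "n3"}
minority_side = cluster - majority_side

majority_commits = can_commit(majority_side, len(cluster))
minority_commits = can_commit(minority_side, len(cluster))
```

The majority side keeps accepting writes, while the minority side refuses them rather than diverge. Nodes on the minority side are non-failing yet do not serve the request, so this does not meet the theorem's Availability definition, even though a consumer who can reach the majority is still served.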

Slide 81

Slide 81 text

Do consensus algorithms contradict CAP? [diagram: five nodes]

Slide 82

Slide 82 text

CAP network partitioning, n > 2 [diagram: five nodes]

Slide 83

Slide 83 text

CAP network partitioning, n > 2 [diagram: five nodes]

Slide 84

Slide 84 text

CAP network partitioning, n > 2 [diagram: five nodes]

Slide 85

Slide 85 text

Reasonable confidence [diagram: five nodes]

Slide 86

Slide 86 text

CAP: recap Assuming P, you cannot design a system where you will have both A and C together at all times. Proven by giving a [single] counterexample.

Slide 87

Slide 87 text

Absolutism

Slide 88

Slide 88 text

High Availability Commonly measured in 9’s. We agree that absolute availability is unrealistic.
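
For reference, "nines" translate into a yearly downtime budget as follows. This is a small sketch assuming a 365-day year; `yearly_downtime_minutes` is an illustrative name.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, assuming a non-leap year


def yearly_downtime_minutes(nines):
    """Allowed downtime per year for an availability of `nines` nines.

    Three nines means 99.9% availability, i.e. 0.1% unavailability.
    """
    unavailability = 10 ** (-nines)
    return unavailability * MINUTES_PER_YEAR


budget = {n: round(yearly_downtime_minutes(n), 2) for n in (2, 3, 4, 5)}
# roughly: 2 nines -> 5256 min, 3 -> 525.6 min, 4 -> 52.56 min, 5 -> 5.26 min
```

Even five nines leaves a few minutes of downtime per year, which is the sense in which absolute availability is not a realistic target.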

Slide 89

Slide 89 text

CAP 9’s? “In practice, many applications are best described in terms of reduced consistency or availability… … there is a Weak CAP Principle which we have yet to characterize precisely…”

Slide 90

Slide 90 text

CAP is a subset Availability and Consistency are but two (important) properties of modern distributed systems. Geo distributions, transactions, latency, durability, consistent distributed snapshots, DR, failovers, backup options, mean time to recover, observability, operability…(some properties correlate) Which are important to your service?

Slide 91

Slide 91 text

The tradeoff exists We understand this as engineers. Multiple other tradeoffs exist. I believe CAP is not the model we should be looking for.

Slide 92

Slide 92 text

Further reading
 Harvest, Yield, and Scalable Tolerant Systems, Armando Fox, Eric A. Brewer
 https://pdfs.semanticscholar.org/5015/8bc1a8a67295ab7bce0550886a9859000dc2.pdf
 Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services, Seth Gilbert, Nancy Lynch
 https://www.glassbeam.com/sites/all/themes/glassbeam/images/blog/10.1.1.67.6951.pdf
 A Critique of the CAP Theorem, Martin Kleppmann
 https://arxiv.org/pdf/1509.05393.pdf
 CAP Twelve Years Later: How the "Rules" Have Changed, Eric Brewer
 https://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed
 Please stop calling databases CP or AP, Martin Kleppmann
 https://martin.kleppmann.com/2015/05/11/please-stop-calling-databases-cp-or-ap.html

Slide 93

Slide 93 text

Further reading
 NewSQL database systems are failing to guarantee consistency, and I blame Spanner, Daniel Abadi
 http://dbmsmusings.blogspot.com/2018/09/newsql-database-systems-are-failing-to.html
 Problems with CAP, and Yahoo’s little known NoSQL system, Daniel Abadi
 http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html
 PACELC theorem, Wikipedia
 https://en.wikipedia.org/wiki/PACELC_theorem
 “A Critique of the CAP Theorem”, Julia Evans
 https://jvns.ca/blog/2016/11/19/a-critique-of-the-cap-theorem/
 CAP Theorem, FoundationDB
 https://apple.github.io/foundationdb/cap-theorem.html

Slide 94

Slide 94 text

Questions? github.com/shlomi-noach @ShlomiNoach Thank you!