Slide 1

Slide 1 text

Push It to the Limit: Considerations for Building Reliable Systems
Tyler Treat
April 25, 2017

Slide 2

Slide 2 text

The Fundamental Failure-Mode Theorem: Complex systems usually operate in failure mode.

Slide 3

Slide 3 text

Complex System

Slide 4

Slide 4 text

Complex System

Slide 5

Slide 5 text

Complex System

Slide 6

Slide 6 text

Complex System

Slide 7

Slide 7 text

Complex System

Slide 8

Slide 8 text

Complex System (three components: 90% availability, 95% availability, 99% availability)

Slide 9

Slide 9 text

Complex System, full availability: 90% ∧ 95% ∧ 99% = 84.6% (components: 90% availability, 95% availability, 99% availability)

Slide 10

Slide 10 text

Complex System, partial availability: 90% ∨ 95% ∨ 99% = 99.995% (components: 90% availability, 95% availability, 99% availability)
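The figures on slides 9 and 10 can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes the three components fail independently, which the slides do not state; the function names are made up for illustration. "Full availability" means every component is up (the ∧ case), and "partial availability" means at least one component is up (the ∨ case).

```python
from math import prod


def full_availability(components):
    """Probability that ALL components are up (logical AND).

    Assumes independent failures, so the availabilities multiply.
    """
    return prod(components)


def partial_availability(components):
    """Probability that AT LEAST ONE component is up (logical OR).

    Assumes independent failures: 1 minus the chance that all are down.
    """
    return 1 - prod(1 - a for a in components)


components = [0.90, 0.95, 0.99]
print(f"full:    {full_availability(components):.3%}")
print(f"partial: {partial_availability(components):.3%}")
```

Under the independence assumption, requiring everything to be up leaves the system less available than its weakest component, while tolerating partial service makes it more available than its strongest one.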

Slide 11

Slide 11 text

High availability hinges on the ability to be partially available.

Slide 12

Slide 12 text

The larger the system, the greater the probability of unexpected failure.

Slide 13

Slide 13 text

Resilience engineering means designing with failure as the norm.

Slide 14

Slide 14 text

Steps to Resilience Engineering Zen:
1. Anticipate failure
2. Embrace failure
3. Meditate

Slide 15

Slide 15 text

What does it mean to embrace failure?

Slide 16

Slide 16 text

Failing on purpose is better than failing in unpredictable or unexpected ways.

Slide 17

Slide 17 text

Tell the client "no."

Slide 18

Slide 18 text

Backpressure

Slide 19

Slide 19 text

Fundamentally, backpressure is about enforcing limits.

Slide 20

Slide 20 text

queue lengths
bandwidth throttling
traffic shaping
message rate limits
max payload sizes
concurrency limits
file descriptor limits
timeouts
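As one illustration of the first item, a queue length limit, here is a minimal sketch of a bounded queue that tells the producer "no" instead of growing without bound. `BoundedInbox` and `Overloaded` are hypothetical names invented for this example, and the limit of 2 is arbitrary; only `queue.Queue` comes from the Python standard library.

```python
import queue


class Overloaded(Exception):
    """Raised to tell the client "no" when the queue is at its limit."""


class BoundedInbox:
    """A work queue with an explicit maximum length (hypothetical helper)."""

    def __init__(self, max_length):
        if max_length < 1:
            # queue.Queue treats maxsize <= 0 as unbounded, so reject it here.
            raise ValueError("max_length must be at least 1")
        self._queue = queue.Queue(maxsize=max_length)

    def submit(self, item):
        """Enqueue item, or fail fast instead of blocking or buffering more."""
        try:
            self._queue.put_nowait(item)
        except queue.Full:
            raise Overloaded("queue is full, retry later") from None

    def take(self):
        """Remove and return the oldest item."""
        return self._queue.get_nowait()


inbox = BoundedInbox(max_length=2)
inbox.submit("a")
inbox.submit("b")
try:
    inbox.submit("c")
    rejected = False
except Overloaded:
    rejected = True
print("third submit rejected:", rejected)
```

Rejecting at the limit is only one possible policy. Blocking the producer or dropping the oldest item are alternatives, and the right choice depends on the system.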

Slide 21

Slide 21 text

"Sometimes traffic moves faster when there are a few well-placed red lights." —Buddhist SRE proverb

Slide 22

Slide 22 text

Implicit vs. Explicit Limits

Slide 23

Slide 23 text

Relying on implicit limits is like saying you know when to stop drinking because you eventually pass out.

Slide 24

Slide 24 text

Parkinson's law: Work expands so as to fill the time available for its completion.

Slide 25

Slide 25 text

Parkinson's law generalized: The demand upon a resource tends to expand to match the supply of the resource.

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

Parkinson's law suggests that as soon as you increase that timeout to 1 minute, you'll start seeing 2-minute-long requests.

Slide 29

Slide 29 text

Your first instinct should not be to increase timeouts, increase memory, increase CPU, etc.

Slide 30

Slide 30 text

Limits are important because they prevent bad actors or yourself from DoSing your system.
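A message rate limit is one way to enforce this. The sketch below uses a token bucket, a common rate-limiting technique that the slides do not specifically name. The class name, the parameters, and the injectable `clock` argument (there so the behavior can be checked without sleeping) are all assumptions made for illustration.

```python
import time


class TokenBucket:
    """Allow roughly rate_per_second requests, with bursts up to burst."""

    def __init__(self, rate_per_second, burst, clock=time.monotonic):
        self.rate = rate_per_second
        self.capacity = burst
        self.tokens = float(burst)  # start with a full bucket
        self.clock = clock
        self.last = clock()

    def allow(self):
        """Return True and spend a token if the request is within the limit."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Demo with a fake clock so the outcome does not depend on wall time.
fake_now = [0.0]
bucket = TokenBucket(rate_per_second=1, burst=2, clock=lambda: fake_now[0])
first_three = [bucket.allow() for _ in range(3)]  # burst of 2, then refused
fake_now[0] = 1.0                                 # one second later
after_wait = bucket.allow()                       # one token has refilled
print(first_three, after_wait)
```

The same limiter applies to a misbehaving external client and to an internal job that retries in a tight loop.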

Slide 31

Slide 31 text

Limits define operating boundaries.

Slide 32

Slide 32 text

Let's look at an example: message size limits.

Slide 33

Slide 33 text

Eighth fallacy of distributed computing: The network is homogeneous.

Slide 34

Slide 34 text

No content

Slide 35

Slide 35 text

No content

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

No content

Slide 38

Slide 38 text

No content

Slide 39

Slide 39 text

When we choose not to put an upper bound on message sizes, we are making an implicit assumption.

Slide 40

Slide 40 text

Any actor may send a message of arbitrary size...

Slide 41

Slide 41 text

so all downstream consumers must support arbitrarily large messages.

Slide 42

Slide 42 text

Everyone you interact with, directly or indirectly, enters an unspoken, arbitrarily binding contract.

Slide 43

Slide 43 text

How can we test something that is arbitrary?

Slide 44

Slide 44 text

We can't!

Slide 45

Slide 45 text

Two options: 1) Make the limit explicit. 2) Keep the implicit contract.

Slide 46

Slide 46 text

1) Make the limit explicit. Allows us to define our operating boundaries and gives us something to test.
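Once the limit is explicit, it can be enforced at the edge and tested at its boundary. A minimal sketch follows; the 1 MiB value and the names `MAX_MESSAGE_BYTES`, `MessageTooLarge`, `validate_message`, and `accepts` are hypothetical choices for illustration, not taken from any particular system.

```python
MAX_MESSAGE_BYTES = 1024 * 1024  # hypothetical explicit limit: 1 MiB


class MessageTooLarge(ValueError):
    """Raised when a payload exceeds the explicit size limit."""


def validate_message(payload: bytes, limit: int = MAX_MESSAGE_BYTES) -> bytes:
    """Return payload unchanged if it fits, otherwise reject it up front."""
    if len(payload) > limit:
        raise MessageTooLarge(
            f"message is {len(payload)} bytes, limit is {limit} bytes"
        )
    return payload


def accepts(size: int) -> bool:
    """Helper for boundary checks: is a payload of this size accepted?"""
    try:
        validate_message(b"x" * size)
        return True
    except MessageTooLarge:
        return False


# The boundary is now something concrete to test.
print(accepts(MAX_MESSAGE_BYTES), accepts(MAX_MESSAGE_BYTES + 1))
```

Every downstream consumer can then be load-tested with messages of exactly `MAX_MESSAGE_BYTES`, which is not possible when the size is arbitrary.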

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

2) Keep the implicit contract. Gambles reliability for convenience. The limit is still there; it's just hidden.

Slide 49

Slide 49 text

No content

Slide 50

Slide 50 text

Without an explicit limit, a client could easily doom itself by accidentally requesting too much data.
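One common guard against this is a capped page size on read APIs. The sketch below assumes an in-memory list stands in for the data store; `fetch_page`, `DEFAULT_PAGE_SIZE`, and `MAX_PAGE_SIZE` are hypothetical names and values chosen for illustration.

```python
DEFAULT_PAGE_SIZE = 100   # hypothetical default
MAX_PAGE_SIZE = 1000      # hypothetical explicit upper bound


def fetch_page(records, offset=0, limit=DEFAULT_PAGE_SIZE):
    """Return (page, next_offset); next_offset is None on the last page.

    Requests outside the explicit bounds are rejected rather than served.
    """
    if not 1 <= limit <= MAX_PAGE_SIZE:
        raise ValueError(f"limit must be between 1 and {MAX_PAGE_SIZE}")
    if offset < 0:
        raise ValueError("offset must not be negative")
    page = records[offset:offset + limit]
    end = offset + len(page)
    next_offset = end if end < len(records) else None
    return page, next_offset


records = list(range(2500))
total = 0
offset = 0
while offset is not None:
    page, offset = fetch_page(records, offset=offset, limit=MAX_PAGE_SIZE)
    total += len(page)
print("records read:", total)
```

The client still gets all of the data, but in pieces whose maximum size both sides know in advance.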

Slide 51

Slide 51 text

No content

Slide 52

Slide 52 text

US National Highway System

Slide 53

Slide 53 text

Federal Bridge Gross Weight Formula

The federal maximum weight is set at 80,000 pounds. Trucks exceeding the federal weight limit can still operate on the country's highways with an overweight permit, but such permits are only issued before the scheduled trip and expire at the end of the trip. Overweight permits are only issued for loads that cannot be broken down to smaller shipments that fall below the federal weight limit, and if there is no other alternative to moving the cargo by truck.

Slide 54

Slide 54 text

Limits in Practice

SQS: 256KB message size limit, 120,000 in-flight messages (20,000 for FIFO)
Kinesis: 1MB message size limit
Kafka: 1MB message size limit
NATS: 1MB message size limit
GAE pull queues: 1MB message size limit
GAE frontend instances: 60-second request deadline
Lambda: 128KB event size limit, 300-second request deadline, 400 max concurrent executions

Slide 55

Slide 55 text

1) Define your operating boundaries. What are your SLAs, what are your capacity needs, what are your cost requirements?

2) Use your operating boundaries to drive your design. Architecture, APIs, business logic, UX fall out from your operating boundaries.

Slide 56

Slide 56 text

Unbounded anything is a resilience engineering anti-pattern.

Slide 57

Slide 57 text

Explicit limits restrict the failure domain, giving us more predictability.

Slide 58

Slide 58 text

No content