Push It to the Limit: Considerations for Building Reliable Systems

Tyler Treat April 25, 2017

The Fundamental Failure-Mode Theorem: Complex systems usually operate in failure
mode.

Complex system with three components: 90% availability, 95% availability, 99% availability

Complex system, full availability (every component up): 90% ∧ 95% ∧ 99% = 84.6%

Complex system, partial availability (at least one component up): 90% ∨ 95% ∨ 99% = 99.995%
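The two figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes the three components fail independently; the function names are illustrative and do not come from the talk.

```python
from math import prod


def full_availability(components):
    """Probability that every component is up at once (the AND case)."""
    return prod(components)


def partial_availability(components):
    """Probability that at least one component is up (the OR case)."""
    return 1 - prod(1 - a for a in components)


components = [0.90, 0.95, 0.99]
full = full_availability(components)
partial = partial_availability(components)

# Formats both results as percentages for comparison with the slide.
print(f"full: {full:.1%}, partial: {partial:.3%}")
```

Requiring every component multiplies the availabilities together, so each added dependency can only lower the total. Tolerating partial availability multiplies the failure probabilities instead, which is why the second number is so much higher.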

High availability hinges on the ability to be partially available.

The larger the system, the greater the probability of unexpected
failure.

Resilience engineering means designing with failure as the norm.

Steps to Resilience Engineering Zen:
1. Anticipate failure
2. Embrace failure
3. Meditate

What does it mean to embrace failure?

Failing on purpose is better than failing in unpredictable or
unexpected ways.

Tell the client "no."

Backpressure

Fundamentally, backpressure is about enforcing limits.

queue lengths, bandwidth throttling, traffic shaping, message rate limits, max
payload sizes, concurrency limits, file descriptor limits, timeouts
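As one illustration of enforcing a limit, here is a minimal sketch of a queue length limit that tells the caller "no" instead of growing without bound. `BoundedInbox` and `Overloaded` are hypothetical names invented for this example; only `queue.Queue` comes from the Python standard library.

```python
import queue


class Overloaded(Exception):
    """Raised to tell the caller "no" when the queue is at its limit."""


class BoundedInbox:
    def __init__(self, max_length):
        # queue.Queue treats maxsize <= 0 as "infinite", so refuse that here.
        if max_length < 1:
            raise ValueError("max_length must be at least 1")
        self._q = queue.Queue(maxsize=max_length)

    def submit(self, item):
        """Enqueue the item, or fail immediately if the limit is reached."""
        try:
            self._q.put_nowait(item)
        except queue.Full:
            raise Overloaded(f"queue limit of {self._q.maxsize} reached") from None

    def take(self):
        """Remove and return the oldest item."""
        return self._q.get_nowait()


inbox = BoundedInbox(max_length=2)
inbox.submit("a")
inbox.submit("b")
try:
    inbox.submit("c")  # third item exceeds the limit of 2
    rejected = False
except Overloaded:
    rejected = True
```

Rejecting at the door is a deliberate, predictable failure: the producer learns about the overload right away and can retry, slow down, or drop the work.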

"Sometimes traffic moves faster when there are a few well-placed
red lights." —Buddhist SRE proverb

Implicit vs. Explicit Limits

Relying on implicit limits is like saying you know when
to stop drinking because you eventually pass out.

Parkinson's law: Work expands so as to fill the time
available for its completion.

Parkinson's law generalized: The demand upon a resource tends to
expand to match the supply of the resource.

Parkinson's law states that as soon as you increase a
timeout to 1 minute, you'll start seeing 2-minute-long requests.

Your first instinct should not be to increase timeouts, increase
memory, increase CPU, etc.

Limits are important because they prevent bad actors or yourself
from DoSing your system.

Limits define operating boundaries.

Let's look at an example: message size limits.

Eighth fallacy of distributed computing: The network is homogeneous.

When we choose not to put an upper bound on
message sizes, we are making an implicit assumption.

Any actor may send a message of arbitrary size...

so all downstream consumers must support arbitrarily large messages.

Everyone you interact with, directly or indirectly, enters an unspoken,
arbitrarily binding contract.

How can we test something that is arbitrary?

We can't!

Two options: 1) Make the limit explicit. 2) Keep the
implicit contract.

1) Make the limit explicit. Allows us to define our
operating boundaries and gives us something to test.

2) Keep the implicit contract. Gambles reliability for convenience. The
limit is still there, it's just hidden.
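A minimal sketch of option 1, assuming a 1MB cap chosen purely for illustration. `MAX_MESSAGE_BYTES`, `MessageTooLarge`, and `validate_message` are hypothetical names, not part of any library mentioned in this deck.

```python
MAX_MESSAGE_BYTES = 1024 * 1024  # assumed limit for this example: 1MB


class MessageTooLarge(ValueError):
    """Raised when a payload exceeds the explicit size limit."""


def validate_message(payload: bytes, limit: int = MAX_MESSAGE_BYTES) -> bytes:
    """Return the payload unchanged, or reject it if it is over the limit."""
    if len(payload) > limit:
        raise MessageTooLarge(
            f"message is {len(payload)} bytes, limit is {limit} bytes"
        )
    return payload


# The explicit limit gives us a boundary to test on both sides.
at_limit = validate_message(b"x" * MAX_MESSAGE_BYTES)
try:
    validate_message(b"x" * (MAX_MESSAGE_BYTES + 1))
    over_limit_rejected = False
except MessageTooLarge:
    over_limit_rejected = True
```

With the number written down, "arbitrary" becomes two concrete test cases: one payload exactly at the limit and one a single byte over it.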

Without an explicit limit, a client could easily doom itself
by accidentally requesting too much data.
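One way to protect a client from itself is to cap how much it can ask for in a single request. This is a minimal sketch; the default and maximum page sizes are made-up values, and `effective_page_size` is a hypothetical helper.

```python
DEFAULT_PAGE_SIZE = 50   # assumed default when the client gives no size
MAX_PAGE_SIZE = 500      # assumed hard upper bound per request


def effective_page_size(requested=None):
    """Return the number of rows to serve for a requested page size."""
    if requested is None:
        return DEFAULT_PAGE_SIZE
    if requested < 1:
        raise ValueError("page size must be at least 1")
    # Never serve more than the explicit upper bound, whatever was asked.
    return min(requested, MAX_PAGE_SIZE)


small = effective_page_size(20)
huge = effective_page_size(10_000_000)
```

Clamping silently is only one choice. Returning an error for an oversized request is stricter and makes the limit more visible to the client.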

US National Highway System

Federal Bridge Gross Weight Formula

The federal maximum weight is set at 80,000 pounds. Trucks exceeding the federal weight limit can still operate on the country's highways with an overweight permit, but such permits are only issued before the scheduled trip and expire at the end of the trip. Overweight permits are only issued for loads that cannot be broken down to smaller shipments that fall below the federal weight limit, and if there is no other alternative to moving the cargo by truck.

Limits in Practice

SQS: 256KB message size limit, 120,000 in-flight messages (20,000 for FIFO)
Kinesis: 1MB message size limit
Kafka: 1MB message size limit
NATS: 1MB message size limit
GAE pull queues: 1MB message size limit
GAE frontend instances: 60-second request deadline
Lambda: 128KB event size limit, 300-second request deadline, 400 max concurrent executions

1) Define your operating boundaries. What are your SLAs, what
are your capacity needs, what are your cost requirements?

2) Use your operating boundaries to drive your design. Architecture, APIs, business logic, UX fall out from your operating boundaries.

Unbounded anything is a resilience engineering anti-pattern.

Explicit limits restrict the failure domain, giving us more predictability.
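A concurrency limit is another place where "unbounded" can be replaced with a number. The sketch below rejects work when all slots are taken rather than queueing it. `ConcurrencyLimit` and `try_run` are hypothetical names; `threading.BoundedSemaphore` is from the Python standard library.

```python
import threading


class ConcurrencyLimit:
    def __init__(self, max_in_flight):
        if max_in_flight < 1:
            raise ValueError("max_in_flight must be at least 1")
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_run(self, fn, *args):
        """Run fn if a slot is free; return (accepted, result)."""
        # Non-blocking acquire: fail fast instead of waiting for a slot.
        if not self._slots.acquire(blocking=False):
            return False, None
        try:
            return True, fn(*args)
        finally:
            self._slots.release()


limit = ConcurrencyLimit(max_in_flight=1)

# While the outer call holds the only slot, a nested call is rejected.
accepted, nested = limit.try_run(lambda: limit.try_run(lambda: "inner"))
```

The failure domain here is small and known: at most `max_in_flight` calls run at once, and every caller beyond that gets an immediate "no" it can act on.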
