Push It to the Limit: Considerations for Building Reliable Systems

What are implicit and explicit limits? What is resilience engineering? What can we do to build more reliable systems? In this talk, we look at why everything has a limit—sometimes obvious and sometimes hidden—and what this means for you as a developer. We'll look at how we can draw inspiration from other engineering disciplines and how we can apply this knowledge to build more robust products. After listening to this talk, you will walk away with a better understanding of what resilience engineering is and how it applies to your job as a software engineer.

Tyler Treat

April 25, 2017

Transcript

  1. Push It to the Limit:
    Considerations for Building Reliable Systems
    Tyler Treat
    April 25, 2017

  2. The Fundamental Failure-Mode Theorem:
    Complex systems usually operate in failure
    mode.

  3. Complex System

  4. Complex System

  5. Complex System

  6. Complex System

  7. Complex System

  8. Complex System
    Components: 90% availability, 95% availability, 99% availability

  9. Complex System: full availability
    90% ∧ 95% ∧ 99% = 84.6%
    Components: 90% availability, 95% availability, 99% availability

  10. Complex System: partial availability
    90% ∨ 95% ∨ 99% = 99.995%
    Components: 90% availability, 95% availability, 99% availability

  11. High availability hinges on the
    ability to be partially available.
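
The two figures on the preceding slides can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes the three components fail independently (the slides do not state this, but the numbers only work out under that assumption). The function names `all_available` and `any_available` are illustrative, not from the talk.

```python
from math import prod


def all_available(availabilities):
    """Probability that every component is up (full availability)."""
    return prod(availabilities)


def any_available(availabilities):
    """Probability that at least one component is up (partial availability)."""
    # The system is completely down only when every component is down.
    return 1 - prod(1 - a for a in availabilities)


# Component availabilities taken from the slides.
components = [0.90, 0.95, 0.99]

full = all_available(components)
partial = any_available(components)

print(f"full availability:    {full:.1%}")
print(f"partial availability: {partial:.3%}")
```

Requiring every component pushes availability below that of the weakest one, while tolerating partial service pushes it above that of the strongest one.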

  12. The larger the system, the greater
    the probability of unexpected failure.

  13. Resilience engineering means
    designing with failure as the normal.

  14. Steps to Resilience Engineering Zen:
    1. Anticipate failure
    2. Embrace failure
    3. Meditate

  15. What does it mean to
    embrace failure?

  16. Failing on purpose is better than
    failing in unpredictable or
    unexpected ways.

  17. Tell the client "no."

  18. Backpressure

  19. Fundamentally, backpressure is
    about enforcing limits.

  20. queue lengths
    bandwidth throttling
    traffic shaping
    message rate limits
    max payload sizes
    concurrency limits
    file descriptor limits
    timeouts
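
As one illustration of the first item in this list, the sketch below puts an explicit length limit on a queue and tells the client "no" when the limit is reached, instead of letting the backlog grow until memory runs out. It is a sketch under assumptions, not something shown in the talk: `BoundedInbox`, `Overloaded`, and the limit of 2 are hypothetical names and values chosen for the example. Only the standard-library `queue` module is used.

```python
import queue


class Overloaded(Exception):
    """Raised when work is refused because the queue is at its limit."""


class BoundedInbox:
    """A work queue with an explicit maximum length (hypothetical example)."""

    def __init__(self, max_pending):
        self._q = queue.Queue(maxsize=max_pending)

    def submit(self, item):
        # Fail fast and on purpose: reject instead of blocking or buffering.
        try:
            self._q.put_nowait(item)
        except queue.Full:
            raise Overloaded("queue full, retry later") from None

    def take(self):
        return self._q.get_nowait()


inbox = BoundedInbox(max_pending=2)
inbox.submit("a")
inbox.submit("b")

try:
    inbox.submit("c")  # third item exceeds the limit of 2
    rejected = False
except Overloaded:
    rejected = True

print("third submit rejected:", rejected)
```

The caller sees a clear, immediate refusal and can back off or shed load, which is the behavior slide 17 asks for.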

  21. "Sometimes traffic moves faster when there are a few
    well-placed red lights." —Buddhist SRE proverb

  22. Implicit vs. Explicit
    Limits

  23. Relying on implicit limits is
    like saying you know when
    to stop drinking because
    you eventually pass out.

  24. Parkinson's law:
    Work expands so as to fill the time
    available for its completion.

  25. Parkinson's law generalized:
    The demand upon a resource tends
    to expand to match the supply of the
    resource.

  26. Parkinson's law states that as soon as you
    increase that timeout to 1 minute, you'll
    start seeing 2-minute-long requests.

  27. Your first instinct should not be to
    increase timeouts, increase memory,
    increase CPU, etc.

  28. Limits are important because they
    prevent bad actors or yourself from
    DoSing your system.
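
A message rate limit is one common way to keep a single sender from overwhelming a system. The token bucket below is a minimal sketch of that idea, not a technique named in the talk. The class name, the rate of 2 per second, and the burst of 3 are all assumptions made for illustration. The clock is passed in explicitly so the behavior is deterministic.

```python
class TokenBucket:
    """Allows bursts up to `burst`, then `rate_per_sec` requests per second."""

    def __init__(self, rate_per_sec, burst, now=0.0):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)  # start with a full bucket
        self.last = now

    def allow(self, now):
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over the limit: the caller should reject the request


bucket = TokenBucket(rate_per_sec=2, burst=3)

# Five requests arrive at the same instant; only the burst of 3 is admitted.
burst_results = [bucket.allow(now=0.0) for _ in range(5)]

# One second later the bucket has refilled by 2 tokens.
later = bucket.allow(now=1.0)

print(burst_results, later)
```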

  29. Limits define operating
    boundaries.

  30. Let's look at an example:
    message size limits.

  31. Eighth fallacy of distributed computing:
    The network is homogeneous.

  32. When we choose not to put an upper
    bound on message sizes, we are
    making an implicit assumption.

  33. Any actor may send a message of
    arbitrary size...

  34. so all downstream consumers must
    support arbitrarily large messages.

  35. Everyone you interact with, directly
    or indirectly, enters an unspoken,
    arbitrarily binding contract.

  36. How can we test something
    that is arbitrary?

  37. Two options:
    1) Make the limit explicit.
    2) Keep the implicit contract.

  38. 1) Make the limit explicit.
    Allows us to define our operating
    boundaries and gives us
    something to test.
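
To make "something to test" concrete, here is a minimal sketch of an explicit message size limit with checks at the boundary. The 1 MiB value and the names `MAX_MESSAGE_BYTES`, `MessageTooLarge`, `validate_message`, and `accepts` are assumptions for the example, not values or APIs from the talk.

```python
MAX_MESSAGE_BYTES = 1024 * 1024  # hypothetical limit: 1 MiB


class MessageTooLarge(ValueError):
    """Raised when a payload exceeds the explicit size limit."""


def validate_message(payload: bytes, limit: int = MAX_MESSAGE_BYTES) -> bytes:
    """Return the payload unchanged, or raise if it is over the limit."""
    if len(payload) > limit:
        raise MessageTooLarge(
            f"payload is {len(payload)} bytes; limit is {limit} bytes"
        )
    return payload


def accepts(payload: bytes) -> bool:
    """Convenience wrapper used to probe the boundary."""
    try:
        validate_message(payload)
        return True
    except MessageTooLarge:
        return False


# With an explicit limit, the boundary itself becomes a test case.
at_limit = accepts(b"x" * MAX_MESSAGE_BYTES)
over_limit = accepts(b"x" * (MAX_MESSAGE_BYTES + 1))
print(at_limit, over_limit)
```

Once the limit is written down, every downstream consumer only has to be tested up to that size, not against arbitrary input.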

  39. 2) Keep the implicit contract.
    Gambles reliability for
    convenience. The limit is still
    there; it's just hidden.

  40. Without an explicit limit, a client could
    easily doom itself by accidentally
    requesting too much data.
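
One way to protect a client from this, sketched here under assumptions, is to cap how much data a single request can return. The default of 100, the maximum of 1000, and the function names are hypothetical. The in-memory list stands in for whatever data store a real service would query.

```python
DEFAULT_PAGE_SIZE = 100   # hypothetical default
MAX_PAGE_SIZE = 1000      # hypothetical explicit upper bound


def effective_page_size(requested=None):
    """Clamp the requested page size to the explicit maximum."""
    if requested is None:
        return DEFAULT_PAGE_SIZE
    if requested < 1:
        raise ValueError("page size must be at least 1")
    return min(requested, MAX_PAGE_SIZE)


def fetch_page(rows, offset=0, requested=None):
    """Return at most MAX_PAGE_SIZE rows, no matter what was asked for."""
    size = effective_page_size(requested)
    return rows[offset:offset + size]


rows = list(range(5000))

# A client accidentally asks for a million rows and gets a bounded page.
page = fetch_page(rows, requested=1_000_000)
print(len(page))
```

Clamping silently is only one design choice. Rejecting the oversized request with an error is the stricter alternative and makes the limit more visible to the caller.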

  41. US National Highway System

  42. Federal Bridge Gross Weight Formula
    The federal maximum weight is set at 80,000 pounds. Trucks exceeding
    the federal weight limit can still operate on the country's highways with
    an overweight permit, but such permits are only issued before the
    scheduled trip and expire at the end of the trip. Overweight permits are
    only issued for loads that cannot be broken down to smaller shipments
    that fall below the federal weight limit, and if there is no other
    alternative to moving the cargo by truck.

  43. Limits in Practice
    SQS: 256KB message size limit, 120,000 in-flight messages (20,000 for FIFO)
    Kinesis: 1MB message size limit
    Kafka: 1MB message size limit
    NATS: 1MB message size limit
    GAE pull queues: 1MB message size limit
    GAE frontend instances: 60-second request deadline
    Lambda: 128KB event size limit, 300-second request deadline, 400 max concurrent executions

  44. 1) Define your operating boundaries.
       What are your SLAs, what are your capacity needs, what are your cost requirements?
    2) Use your operating boundaries to drive your design.
       Architecture, APIs, business logic, UX fall out from your operating boundaries.

  45. Unbounded anything is a resilience
    engineering anti-pattern.

  46. Explicit limits restrict the failure domain,
    giving us more predictability.
