Push It to the Limit: Considerations for Building Reliable Systems

What are implicit and explicit limits? What is resilience engineering? What can we do to build more reliable systems? In this talk, we look at why everything has a limit—sometimes obvious and sometimes hidden—and what this means for you as a developer. We'll look at how we can draw inspiration from other engineering disciplines and how we can apply this knowledge to build more robust products. After listening to this talk, you will walk away with a better understanding of what resilience engineering is and how it applies to your job as a software engineer.

Tyler Treat

April 25, 2017

Transcript

  1. Push It to the Limit: Considerations for Building Reliable Systems

    Tyler Treat April 25, 2017
  2. The Fundamental Failure-Mode Theorem: Complex systems usually operate in failure mode.
  3. Complex System

  4. Complex System

  5. Complex System

  6. Complex System

  7. Complex System

  8. Complex System (components: 90% availability, 95% availability, 99% availability)

  9. Complex System, full availability: 90% ∧ 95% ∧ 99% = 84.6%

  10. Complex System, partial availability: 90% ∨ 95% ∨ 99% = 99.995%
  11. High availability hinges on the ability to be partially available.
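
The arithmetic on slides 9 and 10 can be reproduced with a short sketch. It assumes the three components fail independently, which the slides do not state explicitly; the function names are hypothetical and chosen only for illustration.

```python
from math import prod


def full_availability(components):
    """Probability that every component is up (the system needs all of them)."""
    return prod(components)


def partial_availability(components):
    """Probability that at least one component is up."""
    return 1 - prod(1 - a for a in components)


# Component availabilities taken from the slides.
components = [0.90, 0.95, 0.99]

print(f"full:    {full_availability(components):.1%}")     # full:    84.6%
print(f"partial: {partial_availability(components):.3%}")  # partial: 99.995%
```

Requiring every component multiplies the availabilities together, so each added dependency lowers the total. Tolerating partial availability multiplies the failure probabilities instead, which is why the second figure is so much higher.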

  12. The larger the system, the greater the probability of unexpected failure.
  13. Resilience engineering means designing with failure as the normal.

  14. Steps to Resilience Engineering Zen: 1. Anticipate failure 2. Embrace failure 3. Meditate
  15. What does it mean to embrace failure?

  16. Failing on purpose is better than failing in unpredictable or unexpected ways.
  17. Tell the client "no."

  18. Backpressure

  19. Fundamentally, backpressure is about enforcing limits.

  20. queue lengths, bandwidth throttling, traffic shaping, message rate limits, max payload sizes, concurrency limits, file descriptor limits, timeouts
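
As one illustration of enforcing a limit and telling the client "no," here is a minimal sketch of a queue with an explicit length limit, built on Python's standard `queue.Queue`. The `BoundedInbox` class and its method names are hypothetical, not taken from the talk.

```python
import queue


class BoundedInbox:
    """Hypothetical work queue with an explicit length limit."""

    def __init__(self, max_pending):
        # maxsize > 0 makes the queue bounded; put_nowait raises queue.Full at the limit.
        self._q = queue.Queue(maxsize=max_pending)

    def submit(self, item):
        """Return True if the item was accepted, False if the queue is full."""
        try:
            self._q.put_nowait(item)
            return True
        except queue.Full:
            # Fail on purpose: reject now instead of queueing without bound.
            return False

    def take(self):
        """Remove and return the oldest pending item."""
        return self._q.get_nowait()


inbox = BoundedInbox(max_pending=2)
print([inbox.submit(n) for n in range(3)])  # [True, True, False]
```

The rejected caller learns immediately that the system is saturated and can back off or retry, rather than waiting on a queue that grows until memory runs out.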
  21. "Sometimes traffic moves faster when there are a few well-placed red lights." - Buddhist SRE proverb
  22. Implicit vs. Explicit Limits

  23. Relying on implicit limits is like saying you know when to stop drinking because you eventually pass out.

  24. Parkinson's law: Work expands so as to fill the time available for its completion.

  25. Parkinson's law generalized: The demand upon a resource tends to expand to match the supply of the resource.
  26. None
  27. None
  28. Parkinson's law states that as soon as you increase that timeout to 1 minute, you'll start seeing 2-minute long requests.

  29. Your first instinct should not be to increase timeouts, increase memory, increase CPU, etc.

  30. Limits are important because they prevent bad actors or yourself from DoSing your system.
  31. Limits define operating boundaries.

  32. Let's look at an example: message size limits.

  33. Eighth fallacy of distributed computing: The network is homogeneous.

  34. None
  35. None
  36. None
  37. None
  38. None
  39. When we choose not to put an upper bound on message sizes, we are making an implicit assumption.
  40. Any actor may send a message of arbitrary size...

  41. so all downstream consumers must support arbitrarily large messages.

  42. Everyone you interact with, directly or indirectly, enters an unspoken, arbitrarily binding contract.
  43. How can we test something that is arbitrary?

  44. We can't!

  45. Two options: 1) Make the limit explicit. 2) Keep the implicit contract.
  46. 1) Make the limit explicit. Allows us to define our operating boundaries and gives us something to test.
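
A minimal sketch of what an explicit message size limit might look like, and how it gives us a boundary to test. The 1 MiB value, the `MessageTooLarge` exception, and the `validate_message` function are all assumptions made for illustration; they do not come from the talk or from any particular messaging system.

```python
MAX_MESSAGE_BYTES = 1024 * 1024  # assumed limit for illustration: 1 MiB


class MessageTooLarge(ValueError):
    """Raised when a payload exceeds the explicit size limit."""


def validate_message(payload: bytes, limit: int = MAX_MESSAGE_BYTES) -> bytes:
    """Return the payload unchanged, or reject it if it is over the limit."""
    if len(payload) > limit:
        raise MessageTooLarge(f"payload is {len(payload)} bytes; limit is {limit}")
    return payload


# With an explicit limit, both sides of the boundary can be exercised.
validate_message(b"x" * MAX_MESSAGE_BYTES)  # exactly at the limit: accepted
try:
    validate_message(b"x" * (MAX_MESSAGE_BYTES + 1))  # one byte over: rejected
except MessageTooLarge as err:
    print("rejected:", err)
```

Downstream consumers can then be sized and tested against a known maximum instead of an arbitrary one.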
  47. None
  48. 2) Keep the implicit contract. Gambles reliability for convenience. The limit is still there; it's just hidden.
  49. None
  50. Without an explicit limit, a client could easily doom itself by accidentally requesting too much data.
  51. None
  52. US National Highway System

  53. Federal Bridge Gross Weight Formula: The federal maximum weight is set at 80,000 pounds. Trucks exceeding the federal weight limit can still operate on the country's highways with an overweight permit, but such permits are only issued before the scheduled trip and expire at the end of the trip. Overweight permits are only issued for loads that cannot be broken down to smaller shipments that fall below the federal weight limit, and if there is no other alternative to moving the cargo by truck.
  54. Limits in Practice
    SQS: 256KB message size limit, 120,000 in-flight messages (20,000 for FIFO)
    Kinesis: 1MB message size limit
    Kafka: 1MB message size limit
    NATS: 1MB message size limit
    GAE pull queues: 1MB message size limit
    GAE frontend instances: 60-second request deadline
    Lambda: 128KB event size limit, 300-second request deadline, 400 max concurrent executions
  55. 1) Define your operating boundaries. What are your SLAs, what are your capacity needs, what are your cost requirements?
    2) Use your operating boundaries to drive your design. Architecture, APIs, business logic, UX fall out from your operating boundaries.
  56. Unbounded anything is a resilience engineering anti-pattern.

  57. Explicit limits restrict the failure domain, giving us more predictability.

  58. None