Push It to the Limit: Considerations for Building Reliable Systems

What are implicit and explicit limits? What is resilience engineering? What can we do to build more reliable systems? In this talk, we look at why everything has a limit—sometimes obvious and sometimes hidden—and what this means for you as a developer. We'll look at how we can draw inspiration from other engineering disciplines and how we can apply this knowledge to build more robust products. After listening to this talk, you will walk away with a better understanding of what resilience engineering is and how it applies to your job as a software engineer.

Tyler Treat

April 25, 2017

Transcript

  1. Complex system, full availability: 90% ∧ 95% ∧ 99% = 84.6% (components with 90%, 95%, and 99% availability)
  2. Complex system, partial availability: 90% ∨ 95% ∨ 99% = 99.995% (components with 90%, 95%, and 99% availability)
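
The two availability figures above can be reproduced with a short calculation. This is a minimal sketch that assumes the three components fail independently, which the slides do not state explicitly; the function names `all_available` and `any_available` are illustrative, not from the talk.

```python
from math import prod

def all_available(availabilities):
    """Probability that every component is up (logical AND), assuming independence."""
    return prod(availabilities)

def any_available(availabilities):
    """Probability that at least one component is up (logical OR), assuming independence."""
    return 1 - prod(1 - a for a in availabilities)

components = [0.90, 0.95, 0.99]

full = all_available(components)      # all three must be up
partial = any_available(components)   # any one being up is enough

print(f"full availability:    {full:.1%}")
print(f"partial availability: {partial:.3%}")
```

Requiring every component lowers availability below the weakest part (90%), while tolerating partial failure raises it above the strongest part (99%).
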
  3. queue lengths, bandwidth throttling, traffic shaping, message rate limits, max payload sizes, concurrency limits, file descriptor limits, timeouts
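
As one illustration of the first item in this list, a queue length limit can be stated explicitly rather than left to available memory. The sketch below uses Python's standard `queue.Queue` with a `maxsize`; the limit value `MAX_QUEUE_LENGTH` and the `submit` helper are hypothetical and chosen only for the example.

```python
import queue

# Hypothetical limit, chosen small so the behavior is easy to see.
MAX_QUEUE_LENGTH = 3

work = queue.Queue(maxsize=MAX_QUEUE_LENGTH)

def submit(item):
    """Enqueue an item, or reject it immediately when the queue is full."""
    try:
        work.put_nowait(item)
        return True
    except queue.Full:
        # The caller learns about overload right away instead of
        # the queue growing until the process runs out of memory.
        return False

# Five submissions against a limit of three: the last two are rejected.
results = [submit(n) for n in range(5)]
print(results)
```

Rejecting work at the boundary is only one possible policy; blocking with a timeout or dropping the oldest item are alternatives with different trade-offs.
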
  4. Relying on implicit limits is like saying you know when to stop drinking because you eventually pass out.
  5. Parkinson's law generalized: The demand upon a resource tends to expand to match the supply of the resource.
  6. Parkinson's law states that as soon as you increase that timeout to 1 minute, you'll start seeing 2-minute long requests.
  7. When we choose not to put an upper bound on message sizes, we are making an implicit assumption.
  8. 1) Make the limit explicit. Allows us to define our operating boundaries and gives us something to test.
  9. Without an explicit limit, a client could easily doom itself by accidentally requesting too much data.
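
The point about explicit limits on message sizes and requested data can be sketched as two small checks. Everything named here is an assumption made for illustration: the constants `MAX_PAYLOAD_BYTES` and `MAX_PAGE_SIZE`, the `LimitExceeded` exception, and both helper functions are hypothetical and not taken from the talk.

```python
# Hypothetical limits; real values would come from your operating boundaries.
MAX_PAYLOAD_BYTES = 1024 * 1024  # 1 MiB cap on a single message
MAX_PAGE_SIZE = 100              # most records one request may ask for

class LimitExceeded(ValueError):
    """Raised when a request crosses an explicit limit."""

def validate_payload(payload: bytes) -> bytes:
    """Reject messages larger than the explicit upper bound."""
    if len(payload) > MAX_PAYLOAD_BYTES:
        raise LimitExceeded(
            f"payload is {len(payload)} bytes; limit is {MAX_PAYLOAD_BYTES}"
        )
    return payload

def clamp_page_size(requested: int) -> int:
    """Bound how much data a single request can pull back."""
    if requested < 1:
        raise ValueError("page size must be at least 1")
    return min(requested, MAX_PAGE_SIZE)

# A client that accidentally asks for far too much gets a bounded page.
print(clamp_page_size(10_000))
```

Because the limits are named constants, they also give a test suite something concrete to assert against, for example that a payload one byte over the cap is rejected.
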
  10. Federal Bridge Gross Weight Formula: The federal maximum weight is set at 80,000 pounds. Trucks exceeding the federal weight limit can still operate on the country's highways with an overweight permit, but such permits are only issued before the scheduled trip and expire at the end of the trip. Overweight permits are only issued for loads that cannot be broken down to smaller shipments that fall below the federal weight limit, and if there is no other alternative to moving the cargo by truck.
  11. Limits in Practice

    SQS: 256KB message size limit, 120,000 in-flight messages (20,000 for FIFO)
    Kinesis: 1MB message size limit
    Kafka: 1MB message size limit
    NATS: 1MB message size limit
    GAE pull queues: 1MB message size limit
    GAE frontend instances: 60-second request deadline
    Lambda: 128KB event size limit, 300-second request deadline, 400 max concurrent executions
  12. 1) Define your operating boundaries. What are your SLAs, what are your capacity needs, what are your cost requirements?

    2) Use your operating boundaries to drive your design. Architecture, APIs, business logic, UX fall out from your operating boundaries.