Applying Chaos Engineering to build resilient serverless applications

Emrah Samdan
November 06, 2019

Transcript

  1. Who am I? • Developer for 6+ years • Product guy for 2 years • VP of Product for Thundra • Organizing committee • Father of a chaos monkey
  2. @emrahsamdan Why chaos engineering? Unit Tests • My function is running properly and meets the expectations. Integration Tests • My system is running properly and meets the expectations. UI/UX Tests • It works like a charm!
  4. @emrahsamdan Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production. http://principlesofchaos.org/
  5. @emrahsamdan Chaos Engineering is not • For breaking things down • For bad surprises • For blaming • For causing outages
  6. @emrahsamdan Ups and downs of the system. If I take this server down, maybe everything will still run smoothly. Maybe? Let me attack my system! Cute! Let me break something else. Oh! I should fix this before it actually happens and then break something else.
  7. @emrahsamdan Don’t break on purpose! • Start experimenting with the first row, the leftmost cell: known-knowns. • Blast radius: keep the effect of the experiment as small as possible. • Put a stop button somewhere! • Plan how you will learn. • You don’t need to do it in production the first time. • Most important: let other people know! Surprise chaos is not funny. No, not at all!
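
One way to read the blast radius and stop button advice in code is a small guard that decides, per request, whether a fault may be injected at all. This is only a minimal sketch and not something shown in the talk: `ChaosExperiment`, `blast_radius`, `stop` and `should_inject` are hypothetical names chosen for illustration.

```python
import random
import threading
from typing import Optional


class ChaosExperiment:
    """Hypothetical guard: limits the blast radius and offers a stop button."""

    def __init__(self, blast_radius: float, rng: Optional[random.Random] = None):
        # blast_radius is the fraction of requests that may be affected (0.0 to 1.0).
        if not 0.0 <= blast_radius <= 1.0:
            raise ValueError("blast_radius must be between 0 and 1")
        self.blast_radius = blast_radius
        self._stopped = threading.Event()
        self._rng = rng or random.Random()

    def stop(self) -> None:
        """The stop button: no fault is injected after this call."""
        self._stopped.set()

    def should_inject(self) -> bool:
        """Return True only while running and only for a sampled fraction of calls."""
        if self._stopped.is_set():
            return False
        return self._rng.random() < self.blast_radius
```

A caller would check `should_inject()` before adding any fault, so pressing stop takes effect on the very next request.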
  8. @emrahsamdan Chaos examples • Your system keeps records in the DB. • The DB responds too slowly for 1% of your customers. Hypothesis: The system won’t experience an outage when the DB is barely accessible.
  9. @emrahsamdan Chaos examples • Your system keeps records in the DB. • The DB responds too slowly for 1% of your customers. Hypothesis: The system won’t experience an outage when the DB is barely accessible. Result: People experience timeouts while waiting for results.
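
The experiment on this slide could be sketched roughly as below: slow down a fraction of DB calls and count how many requests exceed the caller's timeout. The talk does not show any code, so everything here is an assumption: `query_db` is a hypothetical stand-in for the real DB call, `time.sleep` plays the role of the slow database, and the delay and timeout values are made up.

```python
import random
import time


def query_db(customer_id: int, injected_delay_s: float = 0.0) -> str:
    """Hypothetical stand-in for a DB read; the sleep is the injected latency."""
    time.sleep(injected_delay_s)
    return f"record-{customer_id}"


def run_experiment(requests: int, slow_fraction: float,
                   delay_s: float, timeout_s: float, seed: int = 0) -> dict:
    """Slow down `slow_fraction` of the calls and count requests over the timeout."""
    rng = random.Random(seed)  # seeded so a run can be repeated
    outcomes = {"ok": 0, "timeout": 0}
    for customer_id in range(requests):
        delay = delay_s if rng.random() < slow_fraction else 0.0
        started = time.monotonic()
        query_db(customer_id, injected_delay_s=delay)
        elapsed = time.monotonic() - started
        # A request counts as a timeout when it took longer than the caller waits.
        outcomes["timeout" if elapsed > timeout_s else "ok"] += 1
    return outcomes
```

Calling it with `slow_fraction=0.01` mirrors the 1% of customers on the slide; a non-zero `timeout` count is the kind of result that contradicts the hypothesis.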
  10. @emrahsamdan Every service has its own failure mode. Lots of managed intermediate services, each with its own bad-day characteristics. Different throttling, different retry mechanisms for different services.
  11. @emrahsamdan Injecting latency • Don’t attack your system. • You don’t need to do it on prod first. • There is no point in injecting latency into async calls. Hypothesis: The entry point Lambda will degrade gracefully when the downstream Lambda times out or returns really late.
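
To make the hypothesis concrete, here is a minimal sketch of an entry point that waits a bounded time for a synchronous downstream call and falls back to a reduced response when the call is late. It is an illustration under assumptions, not the speaker's setup: `call_downstream` and `entry_handler` are hypothetical functions, the downstream Lambda is simulated locally, and the `injected_delay_s` event field exists only to drive the experiment.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared worker pool used to run the downstream call with a deadline.
_pool = ThreadPoolExecutor(max_workers=4)


def call_downstream(payload: dict, injected_delay_s: float = 0.0) -> dict:
    """Hypothetical stand-in for a synchronous downstream Lambda invocation."""
    time.sleep(injected_delay_s)  # chaos: latency injected into the call
    return {"source": "downstream", "items": [payload["user"]]}


def entry_handler(event: dict, timeout_s: float = 0.2) -> dict:
    """Entry point: wait at most `timeout_s`, then degrade instead of failing."""
    delay = event.get("injected_delay_s", 0.0)
    future = _pool.submit(call_downstream, event, delay)
    try:
        body = future.result(timeout=timeout_s)
    except FutureTimeout:
        # Graceful degradation: answer with an empty fallback body.
        body = {"source": "fallback", "items": []}
    return {"status": 200, "body": body}
```

The experiment passes when the late call yields the fallback body rather than an error or a hung request.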
  12. @emrahsamdan Common fixes • Exponential backoff • Properly tuned timeouts • Circuit breakers • Use async communication when possible
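
Two of these fixes, exponential backoff and a circuit breaker, can be sketched in a few lines. The slide only names the techniques, so this is one possible shape rather than the talk's implementation: `retry_with_backoff`, `CircuitBreaker` and `CircuitOpenError` are hypothetical names, the thresholds and delays are arbitrary, and the backoff uses random jitter as an added assumption.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=4, base_delay_s=0.1,
                       max_delay_s=2.0, sleep=time.sleep):
    """Retry `operation`, doubling the delay cap after every failed attempt."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            cap = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(random.uniform(0, cap))  # jitter spreads out retries


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    """Fail fast after repeated failures; allow a trial call after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit is open; failing fast")
            # Cool-down is over: fall through and let one trial call run.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        self._failures = 0
        self._opened_at = None  # success closes the circuit
        return result
```

The `sleep` and `clock` parameters exist so the behaviour can be exercised in a test without real waiting.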
  13. @emrahsamdan Don’t forget! The aim is • Not to break but to improve • Not to blame people but to give them room to fix • Not to surprise your colleagues but to make your system resilient