Applying Chaos Engineering to build resilient serverless applications

With serverless applications, execution can happen everywhere, so it is hard to predict and design for every possible failure. Chaos engineering can help you build highly resilient systems. I tried to show how crucial chaos engineering is for serverless and how it can be applied to serverless architectures, with examples.

Emrah Samdan

April 25, 2019
Transcript

  1. Who am I? • Developer for 6+ years • Product guy for 2 years • VP of Product for Thundra • Organizing committee • Serverlessdays İstanbul on October 11th!
  2. Agenda • What’s chaos engineering? • Why chaos testing on serverless? • Best practices on chaos testing for serverless • How to apply chaos testing on AWS Lambda • How to apply silence in a world of chaos
  3. Why chaos engineering? Unit Tests • My function is running properly and meets the expectations. Integration Tests • My system is running properly and meets the expectations. UI/UX Tests • It works like a charm!
  5. Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production. http://principlesofchaos.org/
  6. Chaos Engineering is • Like injecting a vaccine into your system to make it more immune. • Improving your system’s resilience by uncovering weaknesses. • Identifying failures before they become outages. • Understanding the steady state of your system and challenging it.
  7. Chaos Engineering is not • Breaking production on purpose. • Blaming a group of people. • Surprising your colleagues with partial outages. • Taking down the whole system at the same time.
  8. Stages of chaos engineering • Define the steady state • Form a hypothesis about the steady state of the system under the designed failure • Run your experiment ◦ Define blast radius ◦ Define halting condition ◦ Have a rollback plan! • Verify & Learn ◦ If your system breaks, you have found an issue before it causes an outage. Go fix it! ◦ If it is resilient, congrats! Now, inject some other failure!
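
The experiment flow on slide 8 can be sketched as a small runner. This is only an illustration under stated assumptions: `Experiment`, `run_experiment`, and the callback names are hypothetical and do not come from the deck or from any chaos tool. The steady-state check doubles as the halting condition, and the rollback always runs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """Hypothetical container for one chaos experiment (names are assumptions)."""
    name: str
    steady_state: Callable[[], bool]   # True while the system looks healthy
    inject: Callable[[], None]         # starts the designed failure
    rollback: Callable[[], None]       # undoes the designed failure
    max_checks: int = 5                # how many times to re-check steady state
    wait: Callable[[], None] = lambda: None  # e.g. a sleep between checks


def run_experiment(exp: Experiment) -> str:
    # Never start an experiment on a system that is already unhealthy.
    if not exp.steady_state():
        return "skipped: system not in steady state"
    exp.inject()
    try:
        for _ in range(exp.max_checks):
            exp.wait()
            if not exp.steady_state():
                # Halting condition: stop as soon as the hypothesis is refuted.
                return "weakness found"
        return "resilient"
    finally:
        # Rollback plan: runs whether the hypothesis held or not.
        exp.rollback()
```

A result of "weakness found" corresponds to the "Go fix it!" branch on the slide, and "resilient" to "inject some other failure".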
  9. Don’t break on purpose! • Start experimenting with the first row, the leftmost cell: Known-knowns. • Blast radius: Keep the effect as small as possible. • Put a stop button somewhere! • Plan how you learn. • You don’t need to do it on production the first time. • Most important: Let other people know! Surprise chaos is not funny. No, not at all!
  10. Chaos examples • Your system keeps records in the DB. • The DB is responding too slowly for 1% of your customers. Hypothesis: The system won’t experience an outage when the DB is barely accessible.
  11. Chaos examples • Your system keeps records in the DB. • The DB is responding too slowly for 1% of your customers. Hypothesis: The system won’t experience an outage when the DB is barely accessible. Result: People experience timeouts while waiting for results.
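
One way to limit a slow-DB experiment to roughly 1% of customers is to hash the customer ID into a stable bucket. The sketch below is an assumption about how this could be done, not the method used in the talk; `in_blast_radius`, `query_with_chaos`, and their parameters are hypothetical names.

```python
import hashlib
import time


def in_blast_radius(customer_id: str, percent: float) -> bool:
    """Deterministically place about `percent` percent of customers in the experiment."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    # Map the ID to one of 10,000 buckets so percent can have two decimals.
    bucket = int.from_bytes(digest[:4], "big") % 10_000
    return bucket < percent * 100


def query_with_chaos(customer_id, query, delay_seconds=2.0, percent=1.0,
                     sleep=time.sleep):
    """Run `query`, delaying it first for customers inside the blast radius."""
    if in_blast_radius(customer_id, percent):
        sleep(delay_seconds)  # simulated slow DB response
    return query()
```

Because the bucket depends only on the ID, the same customers are affected on every call, which makes the timeouts reported on slide 11 easier to trace back to the experiment.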
  12. Every service has its own failure mode: lots of managed intermediate services, each with its own bad-day characteristics. Different services have different throttling and different retry mechanisms.
  13. Injecting latency • Don’t attack your system. • You don’t need to do it on prod first. • There is no point in injecting latency into async calls. Hypothesis: Entry point Lambda will degrade gracefully when the downstream Lambda times out or returns really late.
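
Latency injection into a Python Lambda handler can be sketched as a decorator. This is a minimal illustration, not the tooling shown in the talk: the environment variable names `CHAOS_LATENCY_RATE` and `CHAOS_LATENCY_MS` are assumptions made up for this example, and injection stays off unless both are set.

```python
import functools
import os
import random
import time


def inject_latency(handler):
    """Delay a fraction of invocations; configured through environment variables."""
    @functools.wraps(handler)
    def wrapper(event, context):
        # Hypothetical settings: fraction of calls to delay, and the delay in ms.
        rate = float(os.environ.get("CHAOS_LATENCY_RATE", "0"))
        delay_ms = int(os.environ.get("CHAOS_LATENCY_MS", "0"))
        if delay_ms > 0 and random.random() < rate:
            time.sleep(delay_ms / 1000.0)
        return handler(event, context)
    return wrapper


@inject_latency
def handler(event, context):
    # Stand-in for the downstream Lambda's real work.
    return {"statusCode": 200, "body": "ok"}
```

Reading the settings on every invocation means the experiment can be stopped by changing the function's configuration, which serves as the stop button. To test the slide's hypothesis, the delay would be set longer than the timeout the entry point Lambda uses for this call.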
  14. Injecting errors • Connection errors with third-party services • Cache is down • AWS resource is unreachable
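
The "cache down" case can be sketched by wrapping the cache call so that it raises a connection error at a configurable rate. All names here (`InjectedFailure`, `inject_error`, `read_profile`) are hypothetical and only illustrate the idea; the fallback to the DB is one possible way to degrade, not something the deck prescribes.

```python
import functools
import random


class InjectedFailure(ConnectionError):
    """Marks errors raised by the experiment rather than by the real dependency."""


def inject_error(rate, rng=random.random):
    """Make the wrapped call fail with a connection error for a fraction of calls."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng() < rate:
                raise InjectedFailure(f"chaos: {func.__name__} made unreachable")
            return func(*args, **kwargs)
        return wrapper
    return decorator


def read_profile(user_id, cache_get, db_get):
    """Code under test: falls back to the DB when the cache is unreachable."""
    try:
        return cache_get(user_id)
    except ConnectionError:
        return db_get(user_id)
```

Using a dedicated exception subclass makes it possible to tell injected failures from real ones in logs while the code under test still sees an ordinary `ConnectionError`.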
  15. Common fixes • Exponential backoff • Properly tuned timeouts • Circuit breakers • Use async communication when possible
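
Two of the fixes on slide 15, exponential backoff and a circuit breaker, can be sketched as follows. This is a simplified illustration with assumed names and default values (`retry_with_backoff`, `CircuitBreaker`, `CircuitOpenError`), not a production-ready implementation and not code from the talk.

```python
import random
import time


def retry_with_backoff(call, attempts=4, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep):
    """Retry `call` on connection errors, doubling the delay cap each attempt."""
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))  # jitter avoids synchronized retries


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = func()
        except ConnectionError:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0  # a success closes the circuit fully
        return result
```

In a Lambda function the retry budget also has to fit inside the function's own timeout, which is where the "properly tuned timeouts" item comes in.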
  16. Don’t forget! The aim is • Not to break but to improve • Not to blame people but to give them room to fix • Not to surprise your colleagues but to make your system resilient