Applying Chaos Engineering to build resilient serverless applications

Serverless applications are the epitome of highly distributed, microservices applications. Execution happens everywhere - both inside and outside the serverless compute environment. For example, your functions could be triggered by an external service, then execute some code within AWS Lambda, then send a request over to a database, which *then* requires AWS Lambda to perform an update in a second data store.

You might be able to predict and design for certain troublesome issues but there are many, many more that you probably will not be able to easily plan for. How do you build a resilient system under these highly distributed circumstances? The answer is chaos engineering.

Join us as we walk through:
The unique challenges of building a highly resilient serverless app
Why you need to design for problems you cannot predict and cannot easily test for
How you can use chaos engineering to build a resilient serverless application
How you can take advantage of out-of-the-box and third-party observability solutions to measure the impact of chaos experiments

Emrah Samdan

October 08, 2019
Transcript

  1. Who am I? • Developer for 6+ years • Product guy for 2 years • VP of Product for Thundra • Organizing committee
  2. @emrahsamdan Agenda • What’s chaos engineering? • Why chaos testing on serverless? • Best practices on chaos testing for serverless • How to apply chaos testing on AWS Lambda • How to apply silence in a world of chaos
  3. @emrahsamdan Why chaos engineering? Unit Tests • My function is running properly and meets the expectations. Integration Tests • My system is running properly and meets the expectations. UI/UX Tests • It works like a charm!
  5. @emrahsamdan Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production. http://principlesofchaos.org/
  6. @emrahsamdan Chaos Engineering is not • For breaking down • For bad surprises • For blaming • For causing outages
  7. @emrahsamdan States of chaos engineering • Define the steady state • Form a hypothesis about the steady state of the system under the designed failure • Run your experiment ◦ Define the blast radius ◦ Define a halting condition ◦ Have a rollback plan! • Verify & Learn ◦ If your system breaks, you have found an issue before it caused an outage. Go fix it! ◦ If it is resilient, congrats! Now, inject some other failure!
  8. @emrahsamdan Don’t break on purpose! • Start experimenting with the first row, the leftmost cell: Known-knowns. • Blast radius: keep the effect as small as possible. • Put a stop button somewhere! • Plan how you learn. • You don’t need to do it on production the first time. • Most important: let other people know! Surprise chaos is not funny. Not at all!
  9. @emrahsamdan Chaos examples • Your system keeps records in the DB. • The DB responds too slowly for 1% of your customers. Hypothesis: The system won’t experience an outage when the DB is barely accessible.
  10. @emrahsamdan Chaos examples • Your system keeps records in the DB. • The DB responds too slowly for 1% of your customers. Hypothesis: The system won’t experience an outage when the DB is barely accessible. Result: People experience timeouts while waiting for results.
  11. @emrahsamdan Every service has its own failure mode • Lots of managed intermediate services, each with its own bad-day characteristics. • Different throttling and different retry mechanisms for different services.
  12. @emrahsamdan Injecting latency • Don’t attack your system. • You don’t need to do it on prod first. • There is no point in injecting latency into async calls. Hypothesis: The entry point Lambda will degrade gracefully when the downstream Lambda times out or returns really late.
  13. @emrahsamdan Common fixes • Exponential backoff • Properly tuned timeouts • Circuit breakers • Use async communication when possible
  14. @emrahsamdan Don’t forget! The aim is • Not to break but to improve • Not to blame people but to give them room to fix • Not to surprise your colleagues but to make your system resilient
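
The latency-injection hypothesis on slide 12 and two of the fixes on slide 13 (tuned timeouts, exponential backoff) can be sketched locally in Python. This is only an illustration and not code from the talk: `fetch_record`, `inject_latency` and `entry_point` are hypothetical names, the downstream call is a plain function rather than a real Lambda or database, and the thread-based timeout is a stand-in for whatever timeout your client or platform actually enforces.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared worker pool so that a slow downstream call does not block the caller.
_POOL = ThreadPoolExecutor(max_workers=4)


def fetch_record(key):
    """Hypothetical downstream call (stand-in for a DB or a second Lambda)."""
    return f"record-{key}"


def inject_latency(func, delay_s, blast_radius, rng=random.random):
    """Delay roughly `blast_radius` (0.0 to 1.0) of calls by `delay_s` seconds."""
    def wrapper(*args, **kwargs):
        if rng() < blast_radius:
            time.sleep(delay_s)  # the injected failure: extra latency
        return func(*args, **kwargs)
    return wrapper


def entry_point(downstream, key, timeout_s=0.2, retries=2, base_backoff_s=0.01):
    """Call `downstream` with a timeout and retry with exponential backoff.

    Returns a degraded response instead of raising if every attempt times out.
    """
    for attempt in range(retries + 1):
        future = _POOL.submit(downstream, key)
        try:
            return {"status": "ok", "value": future.result(timeout=timeout_s)}
        except FutureTimeout:
            if attempt < retries:
                # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
                time.sleep(base_backoff_s * 2 ** attempt)
    return {"status": "degraded", "value": None}
```

Under these assumptions, `entry_point(inject_latency(fetch_record, delay_s=1.0, blast_radius=0.01), "a")` would slow down about 1% of calls, mirroring the DB example on slides 9 and 10. Setting `blast_radius=1.0` forces every call to time out, which lets you check that the caller gets a `"degraded"` response rather than an unhandled error.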