Performing chaos in a serverless world - Serverless Architecture Conference October 14 2020

Presented as a keynote at Serverless Architecture Conference, October 14th, 2020.

@gunnargrosch
Serverless Chaos Demo
failure-lambda
aws-lambda-chaos-injection

Chaos engineering is the practice of hypothesis testing through planned experiments to gain a better understanding of a system’s behavior. The principles of chaos engineering have been around for years, and chaos engineering has now gone from being a buzzword and a practice used by a few large organizations in very specific fields to being put into use by companies of all sizes and industries.

Planning and performing chaos experiments on traditional infrastructure with virtual machines, and on microservices using containers, has been battle-tested by many large organizations, but serverless functions and managed services present different failure modes and levels of abstraction.

In this talk we focus on how to apply the principles of chaos engineering to serverless, both for serverless functions and managed services. This covers how hypotheses can be formed to fit serverless, what the experiments can achieve, and how to perform them in practice.

Join as we move from talking about the principles to performing real chaos in a serverless world!

Gunnar Grosch

October 14, 2020

Transcript

  1. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Gunnar Grosch (@gunnargrosch), October 14, 2020. Performing chaos in a serverless world. Serverless Architecture Conference.
  2. Abstract: The principles of chaos engineering have been battle-tested for years using traditional infrastructure and containerized microservices. But how do they work with serverless functions and managed services?
  3. Agenda • What is chaos engineering? • Motivations behind chaos engineering • Running chaos experiments • Serverless chaos experiments
  4. About me • Senior Developer Advocate • Background in development, operations, and management • Community builder • Father of three
  5. What is chaos engineering?
  6. What is chaos engineering? Chaos engineering is not about breaking things
  7. What is chaos engineering? Chaos engineering is not only for production
  8. What is chaos engineering? Chaos engineering is not only for big streaming companies
  9. What is chaos engineering? “Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production” (principlesofchaos.org)
  10. What is chaos engineering? Chaos engineering is about performing controlled experiments to inject failures
  11. What is chaos engineering? Chaos engineering is about finding the weaknesses in a system and fixing them before they break
  12. What is chaos engineering? Chaos engineering is about building confidence in your system and in your organization
  13. Motivations behind chaos engineering
  14. Motivations behind chaos engineering • Are your customers getting the experience they should? • Are downtime or issues costing you money? • Are you confident in your monitoring and alerting? • Is your organization ready to handle outages? • Are you learning from incidents?
  15. Motivations behind chaos engineering: Don’t ask what happens if a system fails; ask what happens when it fails
  16. Motivations behind chaos engineering: “Chaos engineering should be done regularly” (Reliability Pillar, AWS Well-Architected Framework)
  17. Running chaos experiments
  18. Step 1: Define steady state • The normal behavior of a system over time • System metrics and business metrics • Business metrics are usually more useful • Steady state is not necessarily continuous
  19. Step 2: Form your hypothesis • Use what-ifs to find it • Chaos can be injected at any layer of the stack • Scientific “If… then…” method • Always fix known problems first
  20. Step 3: Plan and run your experiment • Whiteboard the experiment in detail • Contain the blast radius • Notify the organization • Have a “stop” button ready
  21. Step 4: Measure and learn • Use metrics to prove or disprove the hypothesis • Was the system resilient to the injected failure? • Did anything unexpected happen? • Share your progress and success
  22. Running chaos experiments • Define steady state • Form your hypothesis • Plan and run your experiment • Measure and learn
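The four steps above can be sketched as a small experiment runner. This is only an illustration under stated assumptions, not code from the talk: `evaluate`, `runExperiment`, and the `measure`, `inject`, `stop`, and `tolerance` parameters are hypothetical names, and the steady state is assumed to be a single numeric business metric.

```javascript
// Minimal sketch of the four steps; every name here is a hypothetical placeholder.

// Step 4 helper: the hypothesis holds if the metric stays within the tolerance.
// Assumes a non-zero baseline (for example, orders per minute).
function evaluate(baseline, during, tolerance) {
  const deviation = Math.abs(during - baseline) / baseline;
  return { baseline, during, deviation, holds: deviation <= tolerance };
}

// Step 2 is the `hypothesis` text passed in by the caller ("If ... then ...").
async function runExperiment({ hypothesis, measure, inject, stop, tolerance }) {
  // Step 1: record the steady state before anything is injected.
  const baseline = await measure();
  let during;
  try {
    // Step 3: inject the failure and measure while it is active.
    await inject();
    during = await measure();
  } finally {
    // The "stop" button: always remove the failure, even if measuring throws.
    await stop();
  }
  // Step 4: compare against the steady state to support or refute the hypothesis.
  return { hypothesis, ...evaluate(baseline, during, tolerance) };
}
```

Keeping the comparison in its own function makes the pass/fail rule easy to review before the experiment is run, and the `finally` block keeps the blast radius contained if something unexpected happens.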
  23. Serverless chaos experiments
  24. Serverless chaos experiments • Errors • Failovers • Fallbacks • Timeouts • Events
  25. Serverless chaos experiments • Inject errors into your code • Remove downstream services • Alter the concurrency of functions • Restrict the capacity of tables [Diagram: Client, Amazon S3, Amazon API Gateway, AWS Lambda, Amazon DynamoDB, AWS Lambda, Amazon S3]
  26. Serverless chaos experiments • Security policy errors • CORS configuration errors • Service configuration errors • Function disk space failure [Diagram: Client, Amazon S3, Amazon API Gateway, AWS Lambda, Amazon DynamoDB, AWS Lambda, Amazon S3]
  27. Serverless chaos experiments: Add latency to your functions • Cold starts • Cloud provider issues • Runtime or code issues • Integration issues • Timeouts [Diagram: Client, Amazon S3, Amazon API Gateway, AWS Lambda, Amazon DynamoDB, AWS Lambda, Amazon S3]
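Adding latency to a function can be sketched in a few lines. The code below is an illustration only and is not how any particular library implements it: `shouldInject`, `pickDelay`, and `withLatency` are hypothetical helpers, and the delay is assumed to be drawn uniformly between a minimum and a maximum in milliseconds.

```javascript
// Decide whether this invocation is affected; `rate` is a fraction from 0 to 1.
function shouldInject(rate, random = Math.random) {
  return random() < rate;
}

// Pick a delay in milliseconds, at least minLatency and below maxLatency
// (exactly minLatency when both values are equal).
function pickDelay(minLatency, maxLatency, random = Math.random) {
  return Math.floor(minLatency + random() * (maxLatency - minLatency));
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Wrap a handler so that a share of invocations is delayed before it runs.
function withLatency(handler, { rate, minLatency, maxLatency }) {
  return async (event, context) => {
    if (shouldInject(rate)) {
      await sleep(pickDelay(minLatency, maxLatency));
    }
    return handler(event, context);
  };
}
```

When choosing the delay range, keep the function timeout in mind: a delay longer than the configured timeout turns a latency experiment into a timeout experiment.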
  28. Failure-lambda • Node.js npm package for Node.js Lambdas • https://github.com/gunnargrosch/failure-lambda • Configuration using Parameter Store • Several failure modes: Latency, Status code, Exception, Disk space, Denylist

        const failureLambda = require('failure-lambda')
        exports.handler = failureLambda(async (event, context) => {
          ...
        })

        {
          "isEnabled": false,
          "failureMode": "latency",
          "rate": 1,
          "minLatency": 100,
          "maxLatency": 400,
          "exceptionMsg": "Exception message!",
          "statusCode": 404,
          "diskSpace": 100,
          "denylist": [
            "s3.*.amazonaws.com",
            "dynamodb.*.amazonaws.com"
          ]
        }
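The denylist entries in the configuration above read like regular expressions for hostnames. The sketch below shows how such patterns could be matched, assuming regular-expression semantics. `isDenied` is a hypothetical helper written for this example and is not failure-lambda's own implementation.

```javascript
// Hypothetical helper: treat each denylist entry as a regular expression
// and report whether the hostname matches any of them.
function isDenied(hostname, denylist) {
  return denylist.some((pattern) => new RegExp(pattern).test(hostname));
}

const denylist = ["s3.*.amazonaws.com", "dynamodb.*.amazonaws.com"];

console.log(isDenied("dynamodb.eu-west-1.amazonaws.com", denylist)); // true
console.log(isDenied("sqs.eu-west-1.amazonaws.com", denylist)); // false
```

Note that under this assumption an unescaped dot matches any character, so the patterns are slightly broader than the literal hostnames they resemble.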
  29. Serverless chaos demo
  30. Serverless chaos demo
  31. Serverless chaos demo [Diagram: Client, Amazon S3, Amazon API Gateway, AWS Lambda, Amazon DynamoDB, AWS Lambda, AWS Lambda]
  32. Serverless chaos demo • What if my function takes an extra 300 ms for each invocation? • What if my function returns an error code? • What if I can’t get data from DynamoDB? [Diagram: Client, Amazon S3, Amazon API Gateway, AWS Lambda, Amazon DynamoDB, AWS Lambda, AWS Lambda]
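Each of the three what-ifs above could be written down as a configuration built from the fields shown on the failure-lambda slide. This is a sketch under assumptions: `experimentConfig` and `experiments` are hypothetical names, and the `failureMode` strings `"statuscode"` and `"denylist"` do not appear in the slides (only `"latency"` does), so check the failure-lambda README for the exact values before using them.

```javascript
// Base configuration copied from the failure-lambda slide; injection is off by default.
const baseConfig = {
  isEnabled: false,
  failureMode: "latency",
  rate: 1,
  minLatency: 100,
  maxLatency: 400,
  exceptionMsg: "Exception message!",
  statusCode: 404,
  diskSpace: 100,
  denylist: ["s3.*.amazonaws.com", "dynamodb.*.amazonaws.com"],
};

// Hypothetical helper: enable the base configuration with experiment-specific overrides.
function experimentConfig(overrides) {
  return { ...baseConfig, ...overrides, isEnabled: true };
}

const experiments = {
  // What if my function takes an extra 300 ms for each invocation?
  extraLatency: experimentConfig({ failureMode: "latency", minLatency: 300, maxLatency: 300 }),
  // What if my function returns an error code? ("statuscode" is an assumed mode name.)
  errorCode: experimentConfig({ failureMode: "statuscode", statusCode: 404 }),
  // What if I can't get data from DynamoDB? ("denylist" is an assumed mode name.)
  noDynamoDb: experimentConfig({ failureMode: "denylist", denylist: ["dynamodb.*.amazonaws.com"] }),
};

// The JSON string is the kind of value that would be stored in Parameter Store.
console.log(JSON.stringify(experiments.extraLatency, null, 2));
```

Keeping `isEnabled` false in the base configuration means that switching an experiment off is a single parameter change, which doubles as the "stop" button.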
  33. Demo
  34. What’s next?
  35. What’s next? “Chaos engineering should be done regularly” (Reliability Pillar, AWS Well-Architected Framework)
  36. What’s next? “Chaos engineering should be done regularly, and be part of your CI/CD cycle” (Reliability Pillar, AWS Well-Architected Framework)
  37. Serverless chaos CI/CD demo
  38. Serverless chaos CI/CD demo: Default deploy • What if my function takes an extra 300 ms for each invocation? • What if my function returns an error code? • What if I can’t get data from DynamoDB?
  39. Serverless chaos CI/CD demo: Canary deploy • What if my function takes an extra 300 ms for each invocation? • What if my function returns an error code? • What if I can’t get data from DynamoDB?
  40. Serverless chaos CI/CD demo: Feature flag • What if my function takes an extra 300 ms for each invocation? • What if my function returns an error code? • What if I can’t get data from DynamoDB?
  41. Summary • Chaos engineering helps us find weaknesses and fix them • Chaos engineering is about building confidence • Chaos engineering should be done regularly • Chaos engineering should be part of your CI/CD • It’s not rocket science; you can do it!
  42. Do you want more? • Reliability pillar: https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/welcome.html • Serverless Chaos Demo app: https://demo.serverlesschaos.com • Failure-lambda: https://github.com/gunnargrosch/failure-lambda • Chaos-lambda: https://github.com/adhorn/aws-lambda-chaos-injection/ • Serverless chaos lab: https://github.com/jpbarto/serverless-chaos-lab
  43. Thank you! Gunnar Grosch @gunnargrosch