Chaos engineering (not only) for serverless - Cape Town DevOps September 22 2020

Gunnar Grosch
September 22, 2020

Presented at Cape Town DevOps, September 22nd, 2020.

@gunnargrosch

Chaos engineering has moved from being a buzzword to something companies are adopting to help verify the output of their systems. In this session we'll cover the motivations behind chaos engineering, how we perform chaos experiments, and run some actual experiments in an environment to see how it can help us build reliable applications. Join us as we perform real chaos engineering experiments for serverless and serverful!

Transcript

  1. “You can’t consider your workload to be resilient until you hypothesize how your workload will react to failures, inject those failures to test your design, and then compare your hypothesis to the testing results” Seth Eliot, AWS Well-Architected
  2. Agenda
     • What is chaos engineering?
     • Motivations behind chaos engineering
     • Running chaos experiments
     • Chaos experiments
     • Challenges with serverless
     • Serverless chaos experiments
  3. About me
     • Background in development, operations, and management
     • Organizer of user groups and conferences
     • AWS Serverless Hero
     • Father of three
  4. What is chaos engineering?
     “Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production” (principlesofchaos.org)
  5. What is chaos engineering?
     Chaos engineering is about finding the weaknesses in a system and fixing them before they break
  6. What is chaos engineering?
     Chaos engineering is about building confidence in your system and in your organization
  7. Motivations behind chaos engineering
     • Are your customers getting the experience they should?
     • Are downtime or issues costing you money?
     • Are you confident in your monitoring and alerting?
     • Is your organization ready to handle outages?
     • Are you learning from incidents?
  8. Motivations behind chaos engineering
     “Chaos engineering should be done regularly” (Reliability Pillar, AWS Well-Architected Framework)
  9. Step 1: Define steady state
     • The normal behavior of a system over time
     • System metrics and business metrics
     • Business metrics are usually more useful
     • Steady state is not necessarily continuous
  10. Step 2: Form your hypothesis
      • Use “what ifs” to find it
      • Chaos can be injected at any layer of the stack
      • Scientific “If… then…” method
      • Always fix known problems first
  11. Step 3: Plan and run your experiment
      • Whiteboard the experiment in detail
      • Contain the blast radius
      • Notify the organization
      • Have a “stop” button ready
  12. Step 4: Measure and learn
      • Use metrics to prove or disprove the hypothesis
      • Was the system resilient to the injected failure?
      • Did anything unexpected happen?
      • Share your progress and success
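
The four steps above can be sketched as a small experiment runner. This is only a minimal illustration under stated assumptions, not a tool shown in the talk: `runExperiment`, `measure`, `inject`, `rollback`, and `tolerance` are hypothetical names, the steady state is reduced to a single non-zero number, and the hypothesis is taken to be "the metric stays within the tolerance while the failure is active".

```javascript
// Illustrative experiment runner; every name here is hypothetical.
async function runExperiment({ measure, inject, rollback, tolerance }) {
  // Step 1: capture the steady state (ideally a business metric).
  // Assumes the baseline is non-zero, since it is used as a divisor below.
  const baseline = await measure();

  // Step 3: inject the failure. The rollback always runs, even if the
  // measurement throws, so it plays the role of the "stop" button.
  let during;
  await inject();
  try {
    during = await measure();
  } finally {
    await rollback();
  }

  // Step 2 and step 4: compare the observation with the hypothesis
  // "if we inject the failure, then the metric stays within `tolerance`".
  const deviation = Math.abs(during - baseline) / baseline;
  return { baseline, during, deviation, hypothesisHolds: deviation <= tolerance };
}

// Usage with stand-in functions: a fake "orders per minute" metric that
// drops slightly while the (simulated) failure is active.
let failureActive = false;
const demo = runExperiment({
  measure: async () => (failureActive ? 95 : 100),
  inject: async () => { failureActive = true; },
  rollback: async () => { failureActive = false; },
  tolerance: 0.1,
});
demo.then((result) => console.log(result));
```

In a real experiment `measure` would read from your monitoring system and `inject` would trigger one of the tools listed in the talk; keeping the rollback in a `finally` block is one simple way to contain the blast radius.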
  13. Chaos engineering tools
      • Chaos Monkey (Netflix)
      • SSM Documents
      • Steadybit (steadybit.com)
      • Gremlin (gremlin.com)
      • Chaos Toolkit (chaostoolkit.org)
  14. Challenges with serverless
      “Serverless allows you to build and run applications and services without thinking about servers” (Amazon Web Services (AWS))
  15. Challenges with serverless
      • No servers to manage
      • Less heavy lifting
      • Lots of services to choose from
      • Per function and service configuration
      • More granular architectures
  16. Serverless chaos experiments
      • Inject errors into your code
      • Remove downstream services
      • Alter the concurrency of functions
      • Restrict the capacity of tables
      [Architecture diagram: Client, Amazon Simple Storage Service (Amazon S3), Amazon API Gateway, AWS Lambda, Amazon DynamoDB, AWS Lambda, Amazon Simple Storage Service (Amazon S3)]
  17. Serverless chaos experiments
      • Security policy errors
      • CORS configuration errors
      • Service configuration errors
      • Function disk space failure
      [Architecture diagram: Client, Amazon Simple Storage Service (Amazon S3), Amazon API Gateway, AWS Lambda, Amazon DynamoDB, AWS Lambda, Amazon Simple Storage Service (Amazon S3)]
  18. Serverless chaos experiments
      • Add latency to your functions
      • Cold starts
      • Cloud provider issues
      • Runtime or code issues
      • Integration issues
      • Timeouts
      [Architecture diagram: Client, Amazon Simple Storage Service (Amazon S3), Amazon API Gateway, AWS Lambda, Amazon DynamoDB, AWS Lambda, Amazon Simple Storage Service (Amazon S3)]
  19. Chaos engineering tools for serverless
      • Chaos-lambda (Python)
      • Failure-lambda (NodeJS)
      • Failure-azurefunctions (NodeJS)
      • Failure-cloudfunctions (NodeJS)
  20. Failure-lambda (NodeJS)
      • NPM package for NodeJS Lambdas: https://github.com/gunnargrosch/failure-lambda
      • Configuration using Parameter Store
      • Several failure modes: latency, status code, exception, disk space, denylist

          const failureLambda = require('failure-lambda')
          exports.handler = failureLambda(async (event, context) => { ... })

          {
            "isEnabled": false,
            "failureMode": "latency",
            "rate": 1,
            "minLatency": 100,
            "maxLatency": 400,
            "exceptionMsg": "Exception message!",
            "statusCode": 404,
            "diskSpace": 100,
            "denylist": [ "s3.*.amazonaws.com", "dynamodb.*.amazonaws.com" ]
          }
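
To make the idea behind a configuration-driven wrapper more concrete, here is a rough sketch of how a config shaped like the one on the Failure-lambda slide could drive failure injection. It is not the failure-lambda source and makes several assumptions: `withFailure` and `getConfig` are hypothetical names, the config is supplied by a plain function instead of being read from Parameter Store, `rate` is interpreted as a probability between 0 and 1, and the mode strings `statuscode` and `exception` are guesses (only `latency` appears on the slide). Disk space and denylist modes are left out.

```javascript
// Hypothetical wrapper: a sketch of the idea only, NOT the failure-lambda code.
function withFailure(handler, getConfig, random = Math.random) {
  return async (event, context) => {
    const config = await getConfig();

    // Assumption: `rate` is the probability (0..1) that an invocation is affected.
    const affected = config.isEnabled && random() < config.rate;
    if (!affected) return handler(event, context);

    switch (config.failureMode) {
      case 'latency': {
        // Wait for a delay between minLatency and maxLatency, then run normally.
        const span = config.maxLatency - config.minLatency;
        const delayMs = config.minLatency + Math.floor(random() * span);
        await new Promise((resolve) => setTimeout(resolve, delayMs));
        return handler(event, context);
      }
      case 'statuscode': // assumed mode name
        // Short-circuit with the configured status code.
        return { statusCode: config.statusCode };
      case 'exception': // assumed mode name
        throw new Error(config.exceptionMsg);
      default:
        // Unknown modes fall through to the real handler.
        return handler(event, context);
    }
  };
}

// Usage: wrap a handler and force the status code mode with a fixed config.
const handler = withFailure(
  async () => ({ statusCode: 200, body: 'ok' }),
  async () => ({ isEnabled: true, failureMode: 'statuscode', rate: 1, statusCode: 404 })
);
handler({}, {}).then((response) => console.log(response.statusCode));
```

Passing `random` as a parameter is a design choice for this sketch: it lets a test pin the outcome instead of relying on chance.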
  21. Serverless chaos demo
      [Architecture diagram: Client, Amazon S3, Amazon API Gateway, AWS Lambda, Amazon DynamoDB, AWS Lambda, AWS Lambda]
  22. Serverless chaos demo
      [Architecture diagram: Client, Amazon S3, Amazon API Gateway, AWS Lambda, Amazon DynamoDB, AWS Lambda, AWS Lambda]
      • What if my function takes an extra 300 ms for each invocation?
      • What if my function returns an error code?
      • What if I can’t get data from DynamoDB?
      Hypothesis: If we inject failure into functions, then my application will use graceful degradation.
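
The hypothesis on the demo slide depends on the application degrading gracefully. As a rough sketch of what that could look like in a function (not the code of the demo app), the handler below gives the data lookup a time budget and serves a fallback when the lookup is slow or fails. All names are hypothetical: `makeHandler`, `withTimeout`, and `fetchItems`, where `fetchItems` stands in for a DynamoDB query, and the 200 ms default budget is an arbitrary choice.

```javascript
// Reject if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  // Clear the timer either way so it does not keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Hypothetical handler factory; `fetchItems` stands in for a data store call.
function makeHandler(fetchItems, { timeoutMs = 200, fallbackItems = [] } = {}) {
  return async () => {
    try {
      const items = await withTimeout(fetchItems(), timeoutMs);
      return { statusCode: 200, body: JSON.stringify({ items, degraded: false }) };
    } catch (err) {
      // Degrade gracefully: serve the fallback instead of surfacing the failure.
      return { statusCode: 200, body: JSON.stringify({ items: fallbackItems, degraded: true }) };
    }
  };
}

// Usage: a lookup that fails still produces a usable, clearly marked response.
const degradedHandler = makeHandler(async () => {
  throw new Error('data store unavailable');
});
degradedHandler().then((response) => console.log(response.body));
```

The `degraded` flag is one way to let the client and your metrics tell a fallback response apart from a normal one, which helps when proving or disproving the hypothesis.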
  23. What’s next?
      “Chaos engineering should be done regularly, and be part of your CI/CD cycle” (Reliability Pillar, AWS Well-Architected Framework)
  24. Summary
      • Chaos engineering helps us find weaknesses and fix them
      • Chaos engineering is about building confidence
      • Chaos engineering should be done regularly
      • What tool you use isn’t the critical part
      • It’s not rocket science; you can do it!
  25. Do you want more?
      • Follow @serverlesschaos on Twitter
      • Serverless Chaos Demo app: https://demo.serverlesschaos.com
      • Failure-lambda: https://github.com/gunnargrosch/failure-lambda
      • Failure-cloudfunctions: https://github.com/gunnargrosch/failure-cloudfunctions
      • Failure-azurefunctions: https://github.com/gunnargrosch/failure-azurefunctions
      • Chaos-lambda: https://github.com/adhorn/aws-lambda-chaos-injection/
      • Chaos SSM Documents: https://github.com/adhorn/chaos-ssm-documents
      • YouTube videos and repositories: https://grosch.se