Chaos engineering (not only) for serverless - Cape Town DevOps September 22 2020

Gunnar Grosch
September 22, 2020

Presented at Cape Town DevOps, September 22nd, 2020.

@gunnargrosch

Chaos engineering has moved from being a buzzword to something companies are adopting to help verify the output of their systems. In this session we'll cover the motivations behind chaos engineering and how we perform chaos experiments, then run some actual experiments in an environment to see how they can help us build reliable applications. Join us as we perform real chaos engineering experiments for serverless and serverful!


Transcript

  1. @gunnargrosch Gunnar Grosch, September 22, 2020: Chaos engineering (not only) for serverless
  2. “You can’t consider your workload to be resilient until you hypothesize how your workload will react to failures, inject those failures to test your design, and then compare your hypothesis to the testing results”
    Seth Eliot, AWS Well-Architected
  3. Agenda
    • What is chaos engineering?
    • Motivations behind chaos engineering
    • Running chaos experiments
    • Chaos experiments
    • Challenges with serverless
    • Serverless chaos experiments
  4. About me
    • Background in development, operations, and management
    • Organizer of user groups and conferences
    • AWS Serverless Hero
    • Father of three
  5. What is chaos engineering?
  6. What is chaos engineering? Chaos engineering is not about breaking things
  7. What is chaos engineering? Chaos engineering is not only for production
  8. What is chaos engineering? Chaos engineering is not only for big streaming companies
  9. What is chaos engineering? “Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production”
    principlesofchaos.org
  10. What is chaos engineering? Chaos engineering is about performing controlled experiments to inject failures
  11. What is chaos engineering? Chaos engineering is about finding the weaknesses in a system and fixing them before they break
  12. What is chaos engineering? Chaos engineering is about building confidence in your system and in your organization
  13. Motivations behind chaos engineering
  14. Motivations behind chaos engineering
    • Are your customers getting the experience they should?
    • Is downtime or issues costing you money?
    • Are you confident in your monitoring and alerting?
    • Is your organization ready to handle outages?
    • Are you learning from incidents?
  15. Motivations behind chaos engineering: Don’t ask what happens if a system fails; ask what happens when it fails
  16. Motivations behind chaos engineering: “Chaos engineering should be done regularly”
    Reliability Pillar, AWS Well-Architected Framework
  17. Running chaos experiments
  18. Step 1: Define steady state
    • The normal behavior of a system over time
    • System metrics and business metrics
    • Business metrics are usually more useful
    • Steady state is not necessarily continuous
  19. Step 2: Form your hypothesis
    • Use what ifs to find it
    • Chaos can be injected at any layer of the stack
    • Scientific “If… then…” method
    • Always fix known problems first
  20. Step 3: Plan and run your experiment
    • Whiteboard the experiment in detail
    • Contain the blast radius
    • Notify the organization
    • Have a “stop” button ready
  21. Step 4: Measure and learn
    • Use metrics to prove or disprove the hypothesis
    • Was the system resilient to the injected failure?
    • Did anything unexpected happen?
    • Share your progress and success
  22. Running chaos experiments
    • Define steady state
    • Form your hypothesis
    • Plan and run your experiment
    • Measure and learn
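
The four steps above could be sketched in code roughly as follows. This is a minimal illustration, not something shown in the talk: `isSteady`, `runExperiment`, and every option name (`steadyState`, `measure`, `inject`, `rollback`, `samples`) are hypothetical, and the "stop" button is modeled simply as aborting on the first sample that leaves the steady-state range.

```javascript
// Hypothetical sketch of the four-step loop; none of these names come from the talk.

// Step 1: steady state expressed as an acceptable range for one metric.
function isSteady(value, steadyState) {
  return value >= steadyState.min && value <= steadyState.max;
}

// Steps 2-4: inject a failure, sample the metric, and report whether the hypothesis held.
async function runExperiment({ hypothesis, steadyState, measure, inject, rollback, samples = 3 }) {
  // Do not start if the system is already outside its steady state.
  const baseline = await measure();
  if (!isSteady(baseline, steadyState)) {
    return { hypothesis, started: false, held: false, observations: [baseline] };
  }

  const observations = [];
  let held = true;
  await inject();
  try {
    for (let i = 0; i < samples; i++) {
      const value = await measure();
      observations.push(value);
      if (!isSteady(value, steadyState)) {
        held = false; // "stop" button: abort on the first violation
        break;
      }
    }
  } finally {
    await rollback(); // always remove the injected failure
  }
  return { hypothesis, started: true, held, observations };
}
```

A caller would pass its own `measure` function (ideally returning a business metric, per Step 1) and treat `held: false` as a finding to fix and share rather than as a failed test.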
  23. Chaos experiments
  24. Chaos experiments
    • Resource exhaustion
    • Network reliability
    • Datastore saturation
    • Instance failure
    • Dependencies
    • Application layer
    • Kubernetes
  25. Chaos engineering tools
    • Chaos Monkey (Netflix)
    • SSM Documents
    • Steadybit (steadybit.com)
    • Gremlin (gremlin.com)
    • Chaos Toolkit (chaostoolkit.org)
  26. Chaos demo
  27. Chaos demo
  28. Chaos demo
  29. Chaos demo
  30. Demo
  31. Challenges with serverless
  32. Challenges with serverless: “Serverless allows you to build and run applications and services without thinking about servers”
    Amazon Web Services (AWS)
  33. Challenges with serverless: “There are still servers in serverless”
    Serverhuggers
  34. Challenges with serverless
    • No servers to manage
    • Less heavy lifting
    • Lots of services to choose from
    • Per function and service configuration
    • More granular architectures
  35. Serverless chaos experiments
  36. Serverless chaos experiments
    • Errors
    • Failovers
    • Fallbacks
    • Timeouts
    • Events
  37. Serverless chaos experiments
    • Inject errors into your code
    • Remove downstream services
    • Alter the concurrency of functions
    • Restrict the capacity of tables
    [Diagram labels: Client, Amazon Simple Storage Service (Amazon S3), Amazon API Gateway, AWS Lambda, Amazon DynamoDB, AWS Lambda, Amazon Simple Storage Service (Amazon S3)]
  38. Serverless chaos experiments
    • Security policy errors
    • CORS configuration errors
    • Service configuration errors
    • Function disk space failure
    [Diagram labels: Client, Amazon Simple Storage Service (Amazon S3), Amazon API Gateway, AWS Lambda, Amazon DynamoDB, AWS Lambda, Amazon Simple Storage Service (Amazon S3)]
  39. Serverless chaos experiments
    • Add latency to your functions
    • Cold starts
    • Cloud provider issues
    • Runtime or code issues
    • Integration issues
    • Timeouts
    [Diagram labels: Client, Amazon Simple Storage Service (Amazon S3), Amazon API Gateway, AWS Lambda, Amazon DynamoDB, AWS Lambda, Amazon Simple Storage Service (Amazon S3)]
  40. Chaos engineering tools for serverless
    • Chaos-lambda (Python)
    • Failure-lambda (NodeJS)
    • Failure-azurefunctions (NodeJS)
    • Failure-cloudfunctions (NodeJS)
  41. Failure-lambda (NodeJS)
    • NPM package for NodeJS Lambdas: https://github.com/gunnargrosch/failure-lambda
    • Configuration using Parameter Store
    • Several failure modes: Latency, Status code, Exception, Disk space, Denylist

        const failureLambda = require('failure-lambda')

        exports.handler = failureLambda(async (event, context) => {
          ...
        })

        {
          "isEnabled": false,
          "failureMode": "latency",
          "rate": 1,
          "minLatency": 100,
          "maxLatency": 400,
          "exceptionMsg": "Exception message!",
          "statusCode": 404,
          "diskSpace": 100,
          "denylist": [
            "s3.*.amazonaws.com",
            "dynamodb.*.amazonaws.com"
          ]
        }
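
To make the idea behind such a wrapper concrete, here is a minimal sketch of config-driven failure injection. It is not the failure-lambda source: `withFailure`, `getConfig`, and the injectable `random` parameter are hypothetical names, the configuration is passed in rather than read from Parameter Store, the `failureMode` strings are assumed, and only the latency, exception, and status code modes are modeled.

```javascript
// Illustrative only: NOT the failure-lambda implementation.
// getConfig and random are injected so the sketch runs without AWS.
function withFailure(handler, getConfig, random = Math.random) {
  return async (event, context) => {
    const config = await getConfig();
    // Inject only when enabled, and only for the configured share of invocations.
    if (config.isEnabled && random() < config.rate) {
      if (config.failureMode === 'latency') {
        // Delay by a value between minLatency and maxLatency (milliseconds).
        const span = config.maxLatency - config.minLatency;
        const delayMs = config.minLatency + Math.floor(random() * span);
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      } else if (config.failureMode === 'exception') {
        throw new Error(config.exceptionMsg);
      } else if (config.failureMode === 'statuscode') {
        // Short-circuit: the wrapped handler is never called.
        return { statusCode: config.statusCode };
      }
    }
    return handler(event, context);
  };
}
```

Keeping the configuration outside the function, as the slide describes with Parameter Store, is what allows a failure to be switched on and off without redeploying the code under test.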
  42. Serverless chaos demo
  43. Serverless chaos demo
  44. Serverless chaos demo
    [Diagram labels: Client, Amazon S3, Amazon API Gateway, AWS Lambda, Amazon DynamoDB, AWS Lambda, AWS Lambda]
  45. Serverless chaos demo
    [Diagram labels: Client, Amazon S3, Amazon API Gateway, AWS Lambda, Amazon DynamoDB, AWS Lambda, AWS Lambda]
    • What if my function takes an extra 300 ms for each invocation?
    • What if my function returns an error code?
    • What if I can’t get data from DynamoDB?
    Hypothesis: If we inject failure to functions then my application will use graceful degradation.
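
As one possible reading of "graceful degradation" in the hypothesis above, a function can answer with fallback content when its data source fails instead of surfacing an error to the client. The sketch below assumes this interpretation; `makeHandler`, `fetchItems` (standing in for a DynamoDB read), `fallbackItems`, and the `degraded` flag are all hypothetical and are not taken from the demo app.

```javascript
// Hypothetical handler factory; fetchItems stands in for a DynamoDB read.
function makeHandler(fetchItems, fallbackItems = []) {
  return async () => {
    try {
      const items = await fetchItems();
      return { statusCode: 200, body: JSON.stringify({ items, degraded: false }) };
    } catch (err) {
      // Degrade gracefully: serve fallback content and flag it,
      // so clients and dashboards can tell the response is degraded.
      return { statusCode: 200, body: JSON.stringify({ items: fallbackItems, degraded: true }) };
    }
  };
}
```

Under this design, the "What if I can’t get data from DynamoDB?" experiment is considered passed when the response still has status 200 and `degraded` is `true`.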
  46. Demo
  47. What’s next? “Chaos engineering should be done regularly”
    Reliability Pillar, AWS Well-Architected Framework
  48. What’s next? “Chaos engineering should be done regularly, and be part of your CI/CD cycle”
    Reliability Pillar, AWS Well-Architected Framework
  49. Demo
  50. Summary
    • Chaos engineering helps us find weaknesses and fix them
    • Chaos engineering is about building confidence
    • Chaos engineering should be done regularly
    • What tool you use isn’t the critical part
    • It’s not rocket science; you can do it!
  51. Do you want more?
    • Follow @serverlesschaos on Twitter
    • Serverless Chaos Demo app: https://demo.serverlesschaos.com
    • Failure-lambda: https://github.com/gunnargrosch/failure-lambda
    • Failure-cloudfunctions: https://github.com/gunnargrosch/failure-cloudfunctions
    • Failure-azurefunctions: https://github.com/gunnargrosch/failure-azurefunctions
    • Chaos-lambda: https://github.com/adhorn/aws-lambda-chaos-injection/
    • Chaos SSM Documents: https://github.com/adhorn/chaos-ssm-documents
    • YouTube videos and repositories: https://grosch.se
  52. Thank you! Gunnar Grosch @gunnargrosch