Slide 1

Slide 1 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Gunnar Grosch @gunnargrosch October 14, 2020 Performing chaos in a serverless world Serverless Architecture Conference

Slide 2

Slide 2 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Abstract The principles of chaos engineering have been battletested for years using traditional infrastructure and containerized microservices. But how do they work with serverless functions and managed services?

Slide 3

Slide 3 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda • What is chaos engineering? • Motivations behind chaos engineering • Running chaos experiments • Serverless chaos experiments

Slide 4

Slide 4 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. About me Senior Developer Advocate Background in development, operations, and management Community builder Father of three

Slide 5

Slide 5 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What is chaos engineering?

Slide 6

Slide 6 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What is chaos engineering? Chaos engineering is not about breaking things

Slide 7

Slide 7 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What is chaos engineering? Chaos engineering is not only for production

Slide 8

Slide 8 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What is chaos engineering? Chaos engineering is not only for big streaming companies

Slide 9

Slide 9 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What is chaos engineering? “Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production” principlesofchaos.org

Slide 10

Slide 10 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What is chaos engineering? Chaos engineering is about performing controlled experiments to inject failures

Slide 11

Slide 11 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What is chaos engineering? Chaos engineering is about finding the weaknesses in a system and fixing them before they break

Slide 12

Slide 12 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What is chaos engineering? Chaos engineering is about building confidence in your system and in your organization

Slide 13

Slide 13 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Motivations behind chaos engineering

Slide 14

Slide 14 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Motivations behind chaos engineering Are your customers getting the experience they should? Is downtime or issues costing you money? Are you confident in your monitoring and alerting? Is your organization ready to handle outages? Are you learning from incidents?

Slide 15

Slide 15 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Motivations behind chaos engineering Don’t ask what happens if a system fails; ask what happens when it fails

Slide 16

Slide 16 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Motivations behind chaos engineering “Chaos engineering should be done regularly” Reliability Pillar AWS Well-Architected Framework

Slide 17

Slide 17 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Running chaos experiments

Slide 18

Slide 18 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Step 1: Define steady state The normal behavior of a system over time System metrics and business metrics Business metrics are usually more useful Steady state is not necessarily continuous

Slide 19

Slide 19 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Step 2: Form your hypothesis Use what ifs to find it Chaos can be injected at any layer of the stack Scientific ”If… then…” method Always fix known problems first

Slide 20

Slide 20 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Step 3: Plan and run your experiment Whiteboard the experiment in detail Contain the blast radius Notify the organization Have a “stop” button ready

Slide 21

Slide 21 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Step 4: Measure and learn Use metrics to prove or disprove the hypothesis Was the system resilient to the injected failure? Did anything unexpected happen? Share your progress and success

Slide 22

Slide 22 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Running chaos experiments Define steady state Form your hypothesis Plan and run your experiment Measure and learn

Slide 23

Slide 23 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Serverless chaos experiments

Slide 24

Slide 24 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Serverless chaos experiments Errors Failovers Fallbacks Timeouts Events

Slide 25

Slide 25 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Serverless chaos experiments Inject errors into your code Remove downstream services Alter the concurrency of functions Restrict the capacity of tables Client Amazon Simple Storage Service (Amazon S3) Amazon API Gateway AWS Lambda Amazon DynamoDB AWS Lambda Amazon Simple Storage Service (Amazon S3)

Slide 26

Slide 26 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Serverless chaos experiments Security policy errors CORS configuration errors Service configuration errors Function disk space failure Client Amazon Simple Storage Service (Amazon S3) Amazon API Gateway AWS Lambda Amazon DynamoDB AWS Lambda Amazon Simple Storage Service (Amazon S3)

Slide 27

Slide 27 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Serverless chaos experiments Add latency to your functions • Cold starts • Cloud provider issues • Runtime or code issues • Integration issues • Timeouts Client Amazon Simple Storage Service (Amazon S3) Amazon API Gateway AWS Lambda Amazon DynamoDB AWS Lambda Amazon Simple Storage Service (Amazon S3)

Slide 28

Slide 28 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Failure-lambda NodeJS NPM package for NodeJS Lambdas https://github.com/gunnargrosch/failure-lambda Configuration using Parameter Store Several failure modes • Latency • Status code • Exception • Disk space • Denylist const failureLambda = require('failure-lambda’) exports.handler = failureLambda(async (event, context) => { ... }) { "isEnabled": false, "failureMode": "latency", "rate": 1, "minLatency": 100, "maxLatency": 400, "exceptionMsg": "Exception message!", "statusCode": 404, "diskSpace": 100, “denylist": [ "s3.*.amazonaws.com", "dynamodb.*.amazonaws.com" ] }

Slide 29

Slide 29 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Serverless chaos demo

Slide 30

Slide 30 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Serverless chaos demo

Slide 31

Slide 31 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Serverless chaos demo Client Amazon S3 Amazon API Gateway AWS Lambda Amazon DynamoDB AWS Lambda AWS Lambda

Slide 32

Slide 32 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Client Amazon S3 Amazon API Gateway AWS Lambda Amazon DynamoDB AWS Lambda AWS Lambda Serverless chaos demo • What if my function takes an extra 300 ms for each invocation? • What if my function returns an error code? • What if I can’t get data from DynamoDB?

Slide 33

Slide 33 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Demo

Slide 34

Slide 34 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What’s next?

Slide 35

Slide 35 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What’s next? “Chaos engineering should be done regularly” Reliability Pillar AWS Well-Architected Framework

Slide 36

Slide 36 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What’s next? “Chaos engineering should be done regularly, and be part of your CI/CD cycle” Reliability Pillar AWS Well-Architected Framework

Slide 37

Slide 37 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Serverless chaos CI/CD demo

Slide 38

Slide 38 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Serverless chaos CI/CD demo • What if my function takes an extra 300 ms for each invocation? • What if my function returns an error code? • What if I can’t get data from DynamoDB? Default deploy

Slide 39

Slide 39 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Canary deploy Serverless chaos CI/CD demo • What if my function takes an extra 300 ms for each invocation? • What if my function returns an error code? • What if I can’t get data from DynamoDB?

Slide 40

Slide 40 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Feature flag Serverless chaos CI/CD demo • What if my function takes an extra 300 ms for each invocation? • What if my function returns an error code? • What if I can’t get data from DynamoDB?

Slide 41

Slide 41 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Summary Chaos engineering helps us find weaknesses and fix them Chaos engineering is about building confidence Chaos engineering should be done regularly Chaos engineering should be part of your CI/CD It’s not rocket science; you can do it!

Slide 42

Slide 42 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Do you want more? Reliability pillar https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/welcome.html Serverless Chaos Demo app https://demo.serverlesschaos.com Failure-lambda https://github.com/gunnargrosch/failure-lambda Chaos-lambda https://github.com/adhorn/aws-lambda-chaos-injection/ Serverless chaos lab https://github.com/jpbarto/serverless-chaos-lab

Slide 43

Slide 43 text

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Thank you! Gunnar Grosch @gunnargrosch