Slide 1

Slide 1 text

@gunnargrosch Gunnar Grosch September 22, 2020 Chaos engineering (not only) for serverless

Slide 2

Slide 2 text

“You can’t consider your workload to be resilient until you hypothesize how your workload will react to failures, inject those failures to test your design, and then compare your hypothesis to the testing results”
Seth Eliot, AWS Well-Architected

Slide 3

Slide 3 text

Agenda
• What is chaos engineering?
• Motivations behind chaos engineering
• Running chaos experiments
• Chaos experiments
• Challenges with serverless
• Serverless chaos experiments

Slide 4

Slide 4 text

About me
• Background in development, operations, and management
• Organizer of user groups and conferences
• AWS Serverless Hero
• Father of three

Slide 5

Slide 5 text

What is chaos engineering?

Slide 6

Slide 6 text

What is chaos engineering?
Chaos engineering is not about breaking things

Slide 7

Slide 7 text

What is chaos engineering?
Chaos engineering is not only for production

Slide 8

Slide 8 text

What is chaos engineering?
Chaos engineering is not only for big streaming companies

Slide 9

Slide 9 text

What is chaos engineering?
“Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production”
principlesofchaos.org

Slide 10

Slide 10 text

What is chaos engineering?
Chaos engineering is about performing controlled experiments to inject failures

Slide 11

Slide 11 text

What is chaos engineering?
Chaos engineering is about finding the weaknesses in a system and fixing them before they break

Slide 12

Slide 12 text

What is chaos engineering?
Chaos engineering is about building confidence in your system and in your organization

Slide 13

Slide 13 text

Motivations behind chaos engineering

Slide 14

Slide 14 text

Motivations behind chaos engineering
• Are your customers getting the experience they should?
• Are downtime or other issues costing you money?
• Are you confident in your monitoring and alerting?
• Is your organization ready to handle outages?
• Are you learning from incidents?

Slide 15

Slide 15 text

Motivations behind chaos engineering
Don’t ask what happens if a system fails; ask what happens when it fails

Slide 16

Slide 16 text

Motivations behind chaos engineering
“Chaos engineering should be done regularly”
Reliability Pillar, AWS Well-Architected Framework

Slide 17

Slide 17 text

Running chaos experiments

Slide 18

Slide 18 text

Step 1: Define steady state
• The normal behavior of a system over time
• System metrics and business metrics
• Business metrics are usually more useful
• Steady state is not necessarily continuous

Slide 19

Slide 19 text

Step 2: Form your hypothesis
• Use what-ifs to find it
• Chaos can be injected at any layer of the stack
• Scientific “If… then…” method
• Always fix known problems first

Slide 20

Slide 20 text

Step 3: Plan and run your experiment
• Whiteboard the experiment in detail
• Contain the blast radius
• Notify the organization
• Have a “stop” button ready

Slide 21

Slide 21 text

Step 4: Measure and learn
• Use metrics to prove or disprove the hypothesis
• Was the system resilient to the injected failure?
• Did anything unexpected happen?
• Share your progress and success

Slide 22

Slide 22 text

Running chaos experiments
• Define steady state
• Form your hypothesis
• Plan and run your experiment
• Measure and learn
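The four steps above can be sketched as a small experiment runner. This is a minimal illustration under stated assumptions, not a tool from the talk: the names `steadyState`, `withinSteadyState`, and `runExperiment`, the `ordersPerMinute` metric, and its 90 to 110 range are all hypothetical, and the metric is simulated so the sketch runs without any cloud resources.

```javascript
// Hypothetical sketch; none of these names come from a real library.

// Step 1: steady state expressed as a tolerated range for a business metric.
const steadyState = { metric: 'ordersPerMinute', min: 90, max: 110 };

function withinSteadyState(value, state) {
  return value >= state.min && value <= state.max;
}

// Steps 3 and 4: inject the failure, sample the metric, stop early if the
// abort check trips, then compare the observations with the hypothesis
// (step 2): "if we inject the failure, then the metric stays in steady state".
function runExperiment({ state, inject, rollback, readMetric, samples, shouldAbort }) {
  const observations = [];
  let aborted = false;
  inject();
  try {
    for (let i = 0; i < samples; i++) {
      const value = readMetric(i);
      observations.push(value);
      if (shouldAbort(value)) { // the "stop" button
        aborted = true;
        break;
      }
    }
  } finally {
    rollback(); // always remove the injected failure
  }
  const hypothesisHeld =
    !aborted && observations.every((value) => withinSteadyState(value, state));
  return { hypothesisHeld, aborted, observations };
}

// Usage with a simulated metric that drops while the failure is active.
let failureActive = false;
const result = runExperiment({
  state: steadyState,
  inject: () => { failureActive = true; },
  rollback: () => { failureActive = false; },
  readMetric: (i) => (failureActive ? 100 - i * 5 : 100),
  samples: 4,
  shouldAbort: (value) => value < 50,
});
console.log(result);
```

Putting the rollback in `finally` is one way to make sure the injected failure is removed even when the experiment is stopped early or a metric read throws.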

Slide 23

Slide 23 text

Chaos experiments

Slide 24

Slide 24 text

Chaos experiments
• Resource exhaustion
• Network reliability
• Datastore saturation
• Instance failure
• Dependencies
• Application layer
• Kubernetes

Slide 25

Slide 25 text

Chaos engineering tools
• Chaos Monkey (Netflix)
• SSM Documents
• Steadybit (steadybit.com)
• Gremlin (gremlin.com)
• Chaos Toolkit (chaostoolkit.org)

Slide 26

Slide 26 text

Chaos demo

Slide 27

Slide 27 text

Chaos demo

Slide 28

Slide 28 text

Chaos demo

Slide 29

Slide 29 text

Chaos demo

Slide 30

Slide 30 text

Demo

Slide 31

Slide 31 text

Challenges with serverless

Slide 32

Slide 32 text

Challenges with serverless
“Serverless allows you to build and run applications and services without thinking about servers”
Amazon Web Services (AWS)

Slide 33

Slide 33 text

Challenges with serverless
“There are still servers in serverless”
Serverhuggers

Slide 34

Slide 34 text

Challenges with serverless
• No servers to manage
• Less heavy lifting
• Lots of services to choose from
• Per function and service configuration
• More granular architectures

Slide 35

Slide 35 text

Serverless chaos experiments

Slide 36

Slide 36 text

Serverless chaos experiments
• Errors
• Failovers
• Fallbacks
• Timeouts
• Events

Slide 37

Slide 37 text

Serverless chaos experiments
• Inject errors into your code
• Remove downstream services
• Alter the concurrency of functions
• Restrict the capacity of tables
[Architecture diagram: Client, Amazon Simple Storage Service (Amazon S3), Amazon API Gateway, AWS Lambda, Amazon DynamoDB, AWS Lambda, Amazon Simple Storage Service (Amazon S3)]

Slide 38

Slide 38 text

Serverless chaos experiments
• Security policy errors
• CORS configuration errors
• Service configuration errors
• Function disk space failure
[Architecture diagram: Client, Amazon Simple Storage Service (Amazon S3), Amazon API Gateway, AWS Lambda, Amazon DynamoDB, AWS Lambda, Amazon Simple Storage Service (Amazon S3)]

Slide 39

Slide 39 text

Serverless chaos experiments
• Add latency to your functions
• Cold starts
• Cloud provider issues
• Runtime or code issues
• Integration issues
• Timeouts
[Architecture diagram: Client, Amazon Simple Storage Service (Amazon S3), Amazon API Gateway, AWS Lambda, Amazon DynamoDB, AWS Lambda, Amazon Simple Storage Service (Amazon S3)]

Slide 40

Slide 40 text

Chaos engineering tools for serverless
• Chaos-lambda (Python)
• Failure-lambda (NodeJS)
• Failure-azurefunctions (NodeJS)
• Failure-cloudfunctions (NodeJS)

Slide 41

Slide 41 text

Failure-lambda (NodeJS)
• NPM package for NodeJS Lambdas
• https://github.com/gunnargrosch/failure-lambda
• Configuration using Parameter Store
• Several failure modes: latency, status code, exception, disk space, denylist

    const failureLambda = require('failure-lambda')
    exports.handler = failureLambda(async (event, context) => {
      ...
    })

    {
      "isEnabled": false,
      "failureMode": "latency",
      "rate": 1,
      "minLatency": 100,
      "maxLatency": 400,
      "exceptionMsg": "Exception message!",
      "statusCode": 404,
      "diskSpace": 100,
      "denylist": [
        "s3.*.amazonaws.com",
        "dynamodb.*.amazonaws.com"
      ]
    }
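To make the wrapper idea concrete, here is a minimal sketch of how a handler wrapper could act on a configuration shaped like the one on this slide. It is an illustration only and not the failure-lambda source: `injectFailure` and `getConfig` are hypothetical names, the configuration is passed in through a function instead of being read from Parameter Store, the mode strings `exception` and `statuscode` are assumptions (only `latency` appears on the slide), `rate` is assumed to be a probability between 0 and 1, and the disk space and denylist modes are left out.

```javascript
// Illustrative sketch only; this is not how failure-lambda is implemented.
// getConfig stands in for reading the configuration from Parameter Store.
function injectFailure(handler, getConfig, random = Math.random) {
  return async (event, context) => {
    const config = await getConfig();
    // Assumption: rate is the probability (0..1) that an invocation is affected.
    if (config.isEnabled && random() < config.rate) {
      if (config.failureMode === 'latency') {
        // Wait for a random time between minLatency and maxLatency (ms).
        const span = config.maxLatency - config.minLatency;
        const delay = config.minLatency + Math.floor(random() * span);
        await new Promise((resolve) => setTimeout(resolve, delay));
      } else if (config.failureMode === 'exception') {
        throw new Error(config.exceptionMsg);
      } else if (config.failureMode === 'statuscode') {
        // Short-circuit: respond without calling the real handler.
        return { statusCode: config.statusCode };
      }
    }
    return handler(event, context);
  };
}

// Usage: wrap a handler and force the status code failure mode.
const handler = injectFailure(
  async (event) => ({ statusCode: 200, body: 'ok' }),
  async () => ({ isEnabled: true, failureMode: 'statuscode', rate: 1, statusCode: 404 })
);
handler({}).then((response) => console.log(response));
```

Reading the configuration on every invocation, as this sketch does, lets a failure be switched on and off without redeploying the function, at the cost of one extra lookup per call.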

Slide 42

Slide 42 text

Serverless chaos demo

Slide 43

Slide 43 text

Serverless chaos demo

Slide 44

Slide 44 text

Serverless chaos demo
[Architecture diagram: Client, Amazon S3, Amazon API Gateway, AWS Lambda, Amazon DynamoDB, AWS Lambda, AWS Lambda]

Slide 45

Slide 45 text

Serverless chaos demo
[Architecture diagram: Client, Amazon S3, Amazon API Gateway, AWS Lambda, Amazon DynamoDB, AWS Lambda, AWS Lambda]
• What if my function takes an extra 300 ms for each invocation?
• What if my function returns an error code?
• What if I can’t get data from DynamoDB?
Hypothesis: If we inject failure into the functions, then my application will degrade gracefully.
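The hypothesis on this slide depends on the application having a fallback path. The following is a minimal sketch of what graceful degradation could look like in a function that reads from a datastore; it is not the demo application's code. `makeHandler`, `fetchItems`, `FALLBACK_ITEMS`, and the 300 ms timeout are hypothetical, and `fetchItems` stands in for a DynamoDB query so the sketch runs without AWS access.

```javascript
// Hypothetical sketch; fetchItems stands in for a DynamoDB query.
const FALLBACK_ITEMS = [{ id: 'default', name: 'Popular item' }];

function makeHandler(fetchItems, timeoutMs = 300) {
  return async () => {
    let timer;
    const timeout = new Promise((_, reject) => {
      timer = setTimeout(() => reject(new Error('datastore timeout')), timeoutMs);
    });
    try {
      const pending = Promise.resolve(fetchItems());
      pending.catch(() => {}); // avoid an unhandled rejection if the timeout wins
      const items = await Promise.race([pending, timeout]);
      return { statusCode: 200, body: JSON.stringify({ items, degraded: false }) };
    } catch (err) {
      // Degrade gracefully: serve static fallback data instead of an error.
      return {
        statusCode: 200,
        body: JSON.stringify({ items: FALLBACK_ITEMS, degraded: true }),
      };
    } finally {
      clearTimeout(timer);
    }
  };
}

// Usage: simulate the "can't get data from DynamoDB" what-if.
const handler = makeHandler(async () => {
  throw new Error('DynamoDB unavailable');
});
handler().then((response) => console.log(response.body));
```

A chaos experiment that injects latency or errors into the data access path would then confirm or refute the hypothesis by checking whether clients still receive a usable response.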

Slide 46

Slide 46 text

Demo

Slide 47

Slide 47 text

What’s next?
“Chaos engineering should be done regularly”
Reliability Pillar, AWS Well-Architected Framework

Slide 48

Slide 48 text

What’s next?
“Chaos engineering should be done regularly, and be part of your CI/CD cycle”
Reliability Pillar, AWS Well-Architected Framework

Slide 49

Slide 49 text

Demo

Slide 50

Slide 50 text

Summary
• Chaos engineering helps us find weaknesses and fix them
• Chaos engineering is about building confidence
• Chaos engineering should be done regularly
• What tool you use isn’t the critical part
• It’s not rocket science; you can do it!

Slide 51

Slide 51 text

Do you want more?
• Follow @serverlesschaos on Twitter
• Serverless Chaos Demo app: https://demo.serverlesschaos.com
• Failure-lambda: https://github.com/gunnargrosch/failure-lambda
• Failure-cloudfunctions: https://github.com/gunnargrosch/failure-cloudfunctions
• Failure-azurefunctions: https://github.com/gunnargrosch/failure-azurefunctions
• Chaos-lambda: https://github.com/adhorn/aws-lambda-chaos-injection/
• Chaos SSM Documents: https://github.com/adhorn/chaos-ssm-documents
• YouTube videos and repositories: https://grosch.se

Slide 52

Slide 52 text

Thank you!
Gunnar Grosch @gunnargrosch