Performing chaos engineering in a serverless world (CMY301) - AWS re:Invent Las Vegas December 2 2019

Slide 1

Slide 1 text

Slide 2

Slide 2 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
The principles of chaos engineering have been battle-tested for years using traditional infrastructure and containerized microservices. But how do they work with serverless functions and managed services?

Slide 3

Slide 3 text

Agenda
• What is chaos engineering?
• Motivations behind chaos engineering
• Running chaos experiments
• Challenges with serverless
• Serverless chaos experiments

Slide 4

Slide 4 text

About me
• Evangelist and cofounder, Opsio
• Background in development, operations, and management
• Organizer of user groups and conferences
• Advocate for serverless and chaos engineering
• Father of three

Slide 5

Slide 5 text

Slide 6

Slide 6 text

What is chaos engineering?
Chaos engineering is not about breaking things

Slide 7

Slide 7 text

What is chaos engineering?
Chaos engineering is not resilience engineering

Slide 8

Slide 8 text

What is chaos engineering?
Chaos engineering is not only for production

Slide 9

Slide 9 text

What is chaos engineering?
Chaos engineering is not only for big streaming companies

Slide 10

Slide 10 text

“Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”
principlesofchaos.org

Slide 11

Slide 11 text

What is chaos engineering?
Chaos engineering is about performing controlled experiments to inject failures

Slide 12

Slide 12 text

What is chaos engineering?
Chaos engineering is about finding the weaknesses in a system and fixing them before they break

Slide 13

Slide 13 text

What is chaos engineering?
Chaos engineering is about building confidence in your system and in your organization

Slide 14

Slide 14 text

Slide 15

Slide 15 text

Slide 16

Slide 16 text

Motivations behind chaos engineering
• Are your customers getting the experience they should?
• Are downtime or other issues costing you money?
• Are you confident in your monitoring and alerting?
• Is your organization ready to handle outages?
• Are you learning from incidents?

Slide 17

Slide 17 text


Slide 18

Slide 18 text

Slide 19

Slide 19 text

Motivations behind chaos engineering
Don’t ask what happens if a system fails; ask what happens when it fails

Slide 20

Slide 20 text

Slide 21

Slide 21 text

Step 1: Define steady state
• The normal behavior of a system over time
• Business metrics are usually more useful
• Steady state is not necessarily continuous
• System metrics and business metrics
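One way to make "steady state" concrete is to turn a baseline of metric samples into a tolerance band. The sketch below is only an illustration under assumptions that are not on the slide: the metric (completed orders per minute), the sample values, and the choice of mean plus or minus three standard deviations are all hypothetical.

```python
from statistics import mean, stdev


def steady_state_band(samples, tolerance=3.0):
    """Return (low, high) bounds around the mean of the baseline samples.

    The band is mean +/- tolerance * standard deviation. This is an assumed
    definition of "normal behavior", not one prescribed by the talk.
    """
    if len(samples) < 2:
        raise ValueError("need at least two baseline samples")
    centre = mean(samples)
    spread = stdev(samples)
    return centre - tolerance * spread, centre + tolerance * spread


def within_steady_state(value, band):
    """True if an observed value falls inside the steady-state band."""
    low, high = band
    return low <= value <= high


# Hypothetical business metric: completed orders per minute during normal operation.
baseline_orders_per_minute = [98, 102, 100, 101, 99, 100]
band = steady_state_band(baseline_orders_per_minute)
```

Because steady state is not necessarily continuous, a real baseline would probably be computed per time window (for example weekday daytime versus night) rather than from one flat list.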

Slide 22

Slide 22 text

Step 2: Form your hypothesis
• Use what ifs to find it
• Scientific “If… then…” method
• Always fix known problems first
• Chaos can be injected at any layer of the stack

Slide 23

Slide 23 text

Step 3: Plan and run your experiment
• Whiteboard the experiment in detail
• Notify the organization
• Have a “stop” button ready
• Contain the blast radius
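The "stop" button and the blast radius can both be expressed as experiment settings that the code checks on every invocation. The following is a minimal sketch under assumptions: ChaosSwitch, its fields, and the handle function are hypothetical names, and the settings live in memory here, whereas a real setup would more likely read them from external configuration so they can be flipped without a deploy.

```python
import random


class ChaosSwitch:
    """Hypothetical in-memory experiment settings."""

    def __init__(self, enabled=False, rate=0.0):
        self.enabled = enabled  # the "stop" button: set to False to halt injection
        self.rate = rate        # blast radius: fraction of invocations affected

    def should_inject(self, rng=random.random):
        # rng returns a float in [0, 1); it is injectable to make tests deterministic.
        return self.enabled and rng() < self.rate


def handle(event, switch, rng=random.random):
    """Example handler that fails only when the switch selects this invocation."""
    if switch.should_inject(rng):
        raise RuntimeError("chaos experiment: injected failure")
    return {"statusCode": 200, "body": "ok"}
```

Starting with a small rate (say 0.1) limits the experiment to roughly one invocation in ten, and setting enabled to False stops injection immediately.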

Slide 24

Slide 24 text

Step 4: Measure and learn
• Use metrics to prove or disprove the hypothesis
• Did anything unexpected happen?
• Share your progress and success
• Was the system resilient to the injected failure?
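Using metrics to prove or disprove the hypothesis can be as simple as comparing the same metric from a control period and from the experiment period. This is a sketch only: the function name, the 5% threshold, and the idea of comparing plain averages are assumptions made for illustration, not the method shown in the talk.

```python
def evaluate_experiment(control, experiment, max_relative_drop=0.05):
    """Compare a business metric between a control and an experiment period.

    The hypothesis is treated as holding if the experiment average does not
    fall more than max_relative_drop below the control average.
    """
    if not control or not experiment:
        raise ValueError("both sample lists must be non-empty")
    control_avg = sum(control) / len(control)
    experiment_avg = sum(experiment) / len(experiment)
    if control_avg == 0:
        raise ValueError("control average is zero; relative drop is undefined")
    relative_drop = (control_avg - experiment_avg) / control_avg
    return {
        "control_avg": control_avg,
        "experiment_avg": experiment_avg,
        "relative_drop": relative_drop,
        "hypothesis_holds": relative_drop <= max_relative_drop,
    }
```

A result where hypothesis_holds is False is still a useful outcome: it points at a weakness to fix before the scope of the experiment is increased.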

Slide 25

Slide 25 text

Step 5: Scale up or abort and fix
• Use the learnings to improve
• Increased scope can reveal new effects
• With confidence you can scale up

Slide 26

Slide 26 text

Slide 27

Slide 27 text

Challenges with serverless
“Serverless allows you to build and run applications and services without thinking about servers”

Slide 28

Slide 28 text

Challenges with serverless
“without thinking about servers”

Slide 29

Slide 29 text

Challenges with serverless
• No servers to manage
• Less heavy lifting
• Lots of services to choose from
• Per function and service configuration
• More granular architectures

Slide 30

Slide 30 text

Slide 31

Slide 31 text

Slide 32

Slide 32 text

Common serverless weaknesses

Slide 33

Slide 33 text

Serverless chaos experiments
• Inject errors into your code
• Remove downstream services
• Alter the concurrency of functions
• Restrict the capacity of tables
Diagram labels: Client, Amazon Simple Storage Service (Amazon S3), Amazon API Gateway, AWS Lambda, Amazon DynamoDB, AWS Lambda
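Altering the concurrency of a function is one of the experiments that can be run from outside the function code. The sketch below temporarily caps reserved concurrency and restores the previous setting afterwards. It assumes a client object exposing get_function_concurrency, put_function_concurrency, and delete_function_concurrency with the keyword arguments used by the boto3 Lambda client; DryRunLambdaClient is a hypothetical stand-in so the example runs without AWS access, and the function name "orders" is made up.

```python
from contextlib import contextmanager


@contextmanager
def throttled_concurrency(lambda_client, function_name, limit):
    """Temporarily cap a function's reserved concurrency, then restore it."""
    previous = lambda_client.get_function_concurrency(
        FunctionName=function_name
    ).get("ReservedConcurrentExecutions")  # assumed absent when no limit is set
    lambda_client.put_function_concurrency(
        FunctionName=function_name, ReservedConcurrentExecutions=limit
    )
    try:
        yield
    finally:
        # Always roll back, even if the experiment body raises.
        if previous is None:
            lambda_client.delete_function_concurrency(FunctionName=function_name)
        else:
            lambda_client.put_function_concurrency(
                FunctionName=function_name, ReservedConcurrentExecutions=previous
            )


class DryRunLambdaClient:
    """Hypothetical stand-in that mimics the three client calls used above."""

    def __init__(self):
        self.reserved = {}  # function name -> reserved concurrency

    def get_function_concurrency(self, FunctionName):
        if FunctionName in self.reserved:
            return {"ReservedConcurrentExecutions": self.reserved[FunctionName]}
        return {}

    def put_function_concurrency(self, FunctionName, ReservedConcurrentExecutions):
        self.reserved[FunctionName] = ReservedConcurrentExecutions
        return {"ReservedConcurrentExecutions": ReservedConcurrentExecutions}

    def delete_function_concurrency(self, FunctionName):
        self.reserved.pop(FunctionName, None)


client = DryRunLambdaClient()
with throttled_concurrency(client, "orders", limit=1):
    pass  # observe the application while the function is limited to one execution
```

Against a real account, the client would presumably be created with boto3.client("lambda"). The same save, change, restore pattern could be applied to restricting table capacity, with the rollback in the finally block acting as part of the "stop" button.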

Slide 34

Slide 34 text

Serverless chaos experiments
• Security policy errors
• CORS configuration errors
• Service configuration errors
• Function disk space failure
Diagram labels: Client, Amazon S3, Amazon API Gateway, AWS Lambda, Amazon DynamoDB, AWS Lambda

Slide 35

Slide 35 text

Serverless chaos experiments
• Add latency to your functions
  • Cold starts
  • Cloud provider issues
  • Runtime or code issues
  • Integration issues
  • Timeouts
Diagram labels: Client, Amazon S3, Amazon API Gateway, AWS Lambda, Amazon DynamoDB, AWS Lambda
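Adding latency to a function can be sketched as a decorator around the handler. This is an illustration under assumptions, not the tooling used in the talk: inject_latency, its parameters, and the example handler are hypothetical names, and the 300 ms delay is just an example value.

```python
import functools
import time


def inject_latency(delay_ms, enabled=lambda: True, sleep=time.sleep):
    """Decorator that delays a Lambda-style handler by delay_ms milliseconds.

    enabled is called on every invocation so the experiment can be switched
    off at runtime; sleep is injectable so the delay can be faked in tests.
    """
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(event, context):
            if enabled():
                sleep(delay_ms / 1000.0)
            return handler(event, context)
        return wrapper
    return decorator


@inject_latency(delay_ms=300)
def handler(event, context):
    # Hypothetical handler body; the real function logic would go here.
    return {"statusCode": 200, "body": "hello"}
```

If the injected delay pushes an invocation past the function's configured timeout, the experiment exercises the timeout path instead of the slow path, so the delay should be chosen with the timeout in mind.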

Slide 36

Slide 36 text

Slide 37

Slide 37 text

Serverless chaos demo

Slide 38

Slide 38 text

Serverless chaos demo
Diagram labels: Client, Amazon S3, Amazon API Gateway, AWS Lambda, Amazon DynamoDB, AWS Lambda, AWS Lambda

Slide 39

Slide 39 text

Serverless chaos demo
• What if my function takes an extra 300 ms for each invocation?
• What if my function returns an error code?
• What if there is an exception in the code?
Hypothesis: If we inject failure into functions, then my application will use graceful degradation.
Diagram labels: Client, Amazon S3, Amazon API Gateway, AWS Lambda, Amazon DynamoDB, AWS Lambda, AWS Lambda
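The hypothesis above depends on the caller degrading gracefully when a function returns an error code or raises. A minimal sketch of that behavior follows, assuming Lambda-style response dictionaries with statusCode and body keys; call_with_fallback and the three fake downstream functions are hypothetical and are not taken from the demo application.

```python
def call_with_fallback(fetch, fallback):
    """Call a downstream function and fall back on error status or exception.

    Returns a dict with the data to show and a flag telling the caller that
    the response is degraded.
    """
    try:
        response = fetch()
    except Exception:
        # "What if there is an exception in the code?"
        return {"data": fallback, "degraded": True}
    if response.get("statusCode", 500) >= 400:
        # "What if my function returns an error code?"
        return {"data": fallback, "degraded": True}
    return {"data": response.get("body"), "degraded": False}


def healthy_downstream():
    return {"statusCode": 200, "body": ["item-1", "item-2"]}


def error_code_downstream():
    return {"statusCode": 502, "body": "bad gateway"}


def exception_downstream():
    raise RuntimeError("injected exception")


normal = call_with_fallback(healthy_downstream, fallback=[])
degraded = call_with_fallback(error_code_downstream, fallback=[])
```

Here an empty list stands in for whatever reduced experience the application can still offer, such as cached or default content.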

Slide 40

Slide 40 text

Slide 41

Slide 41 text

Summary
• Everything fails, all the time
• Serverless doesn’t make your application resilient
• Chaos engineering helps us find weaknesses and fix them
• Chaos engineering is about building confidence
• It’s not rocket science; you can do it!

Slide 42

Slide 42 text

Do you want more?
• Follow @gunnargrosch and @serverlesschaos on Twitter
• Try the Serverless Chaos Demo app: https://demo.serverlesschaos.com
• YouTube videos and repositories: https://grosch.se
• Join the chaos engineering Slack: http://bit.ly/chaos-eng-slack
• Visit chaos engineering meetups