Slide 1

Slide 1 text

Improve resiliency and performance with controlled chaos engineering
Gunnar Grosch (@gunnargrosch)
© 2021, Amazon Web Services, Inc. or its Affiliates.

Slide 2

Slide 2 text

Agenda
• Challenges with distributed systems
• Why is chaos engineering hard?
• Introducing AWS Fault Injection Simulator (FIS)
• Key features and use cases
• Automated chaos experiments
• Multiple demos along the way

Slide 3

Slide 3 text

Challenges with distributed systems

Slide 4

Slide 4 text

Distributed systems are complex
[Figure: a Client and a Server exchanging Message and Reply across a Network]
Source: https://aws.amazon.com/builders-library/challenges-with-distributed-systems/

Slide 5

Slide 5 text

Traditional testing is not enough
Testing = verifying a known condition
• Unit testing of components: tested in isolation to ensure function meets expectations
• Functional testing of integrations: each execution path tested to assure expected results

Slide 6

Slide 6 text

And it can get more complicated…
IOError: No space left on device
close failed in file object destructor:
IOError: No space left on device
(the error pair above repeats on the slide)
[Figure: log rotation chain: logfile, logfile.0, logfile.1, logfile.2, logfile.3, logfile.n, each followed by ROTATE]

Slide 7

Slide 7 text

Chaos engineering
Stress, observe, improve
• Improve resilience and performance
• Uncover hidden issues
• Expose blind spots
• Monitoring, observability, and alarm
• And more

Slide 8

Slide 8 text

Phases of chaos engineering
1. Steady state
2. Hypothesis
3. Run experiment
4. Verify
5. Improve
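
The phases listed above can be sketched as a small loop. This is a minimal illustration, not the speaker's code: the function name `run_chaos_experiment`, the `tolerance` parameter, and the simulated success-rate metric are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    steady_before: float
    during: float
    hypothesis_held: bool


def run_chaos_experiment(
    measure: Callable[[], float],
    inject_fault: Callable[[], None],
    remove_fault: Callable[[], None],
    tolerance: float,
) -> ExperimentResult:
    """Walk through steady state, hypothesis, run experiment, and verify."""
    # Steady state: record the metric before any fault is injected.
    steady_before = measure()
    # Hypothesis: the metric stays within `tolerance` of the steady state.
    inject_fault()
    try:
        # Run experiment: observe the metric while the fault is active.
        during = measure()
    finally:
        # Always remove the fault, even if measuring raised an error.
        remove_fault()
    # Verify: compare the observation against the hypothesis.
    held = abs(during - steady_before) <= tolerance
    return ExperimentResult(steady_before, during, held)


# Hypothetical system: the success rate drops while a fault is active.
state = {"fault": False}
result = run_chaos_experiment(
    measure=lambda: 0.90 if state["fault"] else 0.99,
    inject_fault=lambda: state.update(fault=True),
    remove_fault=lambda: state.update(fault=False),
    tolerance=0.05,
)
print(result.hypothesis_held)
```

When the hypothesis does not hold, as in this simulated run, the result feeds the final "Improve" phase: fix the weakness, then repeat the experiment.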

Slide 9

Slide 9 text

Why is chaos engineering difficult?
1. Stitch together different tools and homemade scripts
2. Agents or libraries required to get started
3. Difficult to ensure safety
4. Difficult to reproduce “real-world” events (multiple failures at once)

Slide 10

Slide 10 text

AWS Fault Injection Simulator
Fully managed chaos engineering service
• Easy to get started
• Real-world conditions
• Safeguards

Slide 11

Slide 11 text

Easy to get started
• No need to integrate multiple tools and homemade scripts or install agents
• Use the AWS Management Console or the AWS CLI
• Use pre-existing experiment templates and get started in minutes
• Easily share it with others

Slide 12

Slide 12 text

Real-world conditions
• Run experiments as a sequence of events or in parallel
• Target all levels of the system (host, infrastructure, network, etc.)
• Real faults injected at the service control plane level!
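
One way to picture "in sequence or in parallel" is to group actions into waves, where each action names the actions it must start after. The sketch below is a plain-Python illustration under that assumption; `schedule_waves` and the action names are hypothetical and do not describe how the service schedules work internally.

```python
from typing import Dict, List, Set


def schedule_waves(start_after: Dict[str, List[str]]) -> List[List[str]]:
    """Group actions into waves; actions in the same wave can run in parallel."""
    remaining = {name: set(deps) for name, deps in start_after.items()}
    done: Set[str] = set()
    waves: List[List[str]] = []
    while remaining:
        # An action is ready once every action it waits on has finished.
        ready = sorted(n for n, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("cyclic or unknown dependency among actions")
        waves.append(ready)
        done.update(ready)
        for name in ready:
            del remaining[name]
    return waves


# Hypothetical action names: two faults start together, a third follows both.
plan = {
    "cpu-stress": [],
    "network-latency": [],
    "stop-instances": ["cpu-stress", "network-latency"],
}
waves = schedule_waves(plan)
print(waves)
```

Here the first wave holds the two independent faults and the second wave holds the action that waits for both, which is one way to reproduce multiple failures at once followed by a further event.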

Slide 13

Slide 13 text

Safeguards
• “Stop conditions” alarms
• Integration with Amazon CloudWatch
• Built-in rollbacks
• Fine-grained IAM controls
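
The idea behind stop conditions and rollbacks can be sketched without any AWS calls: check a guard after each fault step, and halt and roll back as soon as it fires. Everything here is hypothetical, including `run_with_stop_conditions`, the simulated error rate, and the 5% threshold; in the service the guard would be a CloudWatch alarm.

```python
from typing import Callable, Iterable, List


def run_with_stop_conditions(
    steps: Iterable[Callable[[], None]],
    alarm_in_alarm: Callable[[], bool],
    rollback: Callable[[], None],
) -> str:
    """Run fault steps, halting and rolling back once the alarm fires."""
    for step in steps:
        step()
        # Check the guard after every step so damage stays bounded.
        if alarm_in_alarm():
            rollback()
            return "stopped"
    # All steps finished; undo the injected faults before returning.
    rollback()
    return "completed"


# Hypothetical workload: each step raises the error rate by 2 points.
error_rate = {"percent": 0}
log: List[str] = []


def make_step(name: str) -> Callable[[], None]:
    def step() -> None:
        log.append(name)
        error_rate["percent"] += 2
    return step


status = run_with_stop_conditions(
    steps=[make_step(n) for n in ("a", "b", "c", "d")],
    alarm_in_alarm=lambda: error_rate["percent"] > 5,
    rollback=lambda: log.append("rollback"),
)
print(status, log)
```

In this run the guard fires after the third step, so the fourth step never executes and the rollback is recorded last.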

Slide 14

Slide 14 text

Components
• Experiment templates
• Experiments
• Actions
• Targets

Slide 15

Slide 15 text

Actions
Actions are the fault injection actions executed during an experiment.
aws::
Actions include:
• Fault type
• Targeted resources
• Timing relative to any other actions
• Fault-specific parameters, such as duration, rollback behavior, or the portion of requests to throttle

Slide 16

Slide 16 text

Targets
Targets define one or more AWS resources on which to carry out an action.
Targets include:
• Resource type
• Resource IDs, tags, and filters
• Selection mode (e.g., ALL, RANDOM)
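
Target resolution as described above (filter by tags, then apply a selection mode) might look like the following sketch. It is an illustration only: `select_targets` and the resource records are hypothetical, and the random pick is spelled `COUNT(n)` here as an assumption of this sketch rather than something stated on the slide.

```python
import random
import re
from typing import Any, Dict, List, Optional


def select_targets(
    resources: List[Dict[str, Any]],
    tags: Dict[str, str],
    selection_mode: str,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Filter resources by tag, then apply the selection mode."""
    rng = rng or random.Random()
    matching = [
        r["id"]
        for r in resources
        if all(r["tags"].get(k) == v for k, v in tags.items())
    ]
    if selection_mode == "ALL":
        return matching
    count = re.fullmatch(r"COUNT\((\d+)\)", selection_mode)
    if count:
        # Random pick of n matching resources, capped at what is available.
        n = min(int(count.group(1)), len(matching))
        return rng.sample(matching, n)
    raise ValueError(f"unsupported selection mode: {selection_mode}")


# Hypothetical inventory; the IDs and tags are placeholders.
inventory = [
    {"id": "i-aaa", "tags": {"env": "test"}},
    {"id": "i-bbb", "tags": {"env": "test"}},
    {"id": "i-ccc", "tags": {"env": "prod"}},
]
all_ids = select_targets(inventory, {"env": "test"}, "ALL")
one_id = select_targets(inventory, {"env": "test"}, "COUNT(1)", random.Random(0))
print(all_ids, one_id)
```

Filtering by tag before selecting keeps the blast radius explicit: production resources in this example can never be picked, whichever mode is used.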

Slide 17

Slide 17 text

Experiment templates
Experiment templates define an experiment and are used in the start-experiment request.
Experiment templates include:
• Actions
• Targets
• Stop condition alarms
• IAM role
• Description
• Tags
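
To make the template parts listed above concrete, here is a sketch of a template as a Python dictionary with a small consistency check. The field names follow the service's JSON layout as I understand it and should be checked against the AWS FIS documentation; every name, tag, account number, and ARN is a placeholder, and `validate_template` is a hypothetical helper, not part of any AWS SDK.

```python
import json
from typing import Any, Dict, List

REQUIRED_KEYS = ("description", "targets", "actions", "stopConditions", "roleArn")


def validate_template(template: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in template]
    targets = template.get("targets", {})
    for name, action in template.get("actions", {}).items():
        # Every target an action refers to must be defined in the template.
        for target_name in action.get("targets", {}).values():
            if target_name not in targets:
                problems.append(
                    f"action {name!r} uses unknown target {target_name!r}"
                )
    return problems


# All names, tags, and ARNs below are hypothetical placeholders.
template = {
    "description": "Stop all tagged test instances",
    "targets": {
        "test-instances": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"env": "test"},
            "selectionMode": "ALL",
        }
    },
    "actions": {
        "stop": {
            "actionId": "aws:ec2:stop-instances",
            "parameters": {"startInstancesAfterDuration": "PT5M"},
            "targets": {"Instances": "test-instances"},
        }
    },
    "stopConditions": [
        {
            "source": "aws:cloudwatch:alarm",
            "value": "arn:aws:cloudwatch:us-east-1:111122223333:alarm:example",
        }
    ],
    "roleArn": "arn:aws:iam::111122223333:role/example-fis-role",
    "tags": {"team": "example"},
}
problems = validate_template(template)
print(json.dumps(template, indent=2) if not problems else problems)
```

Checking a template locally before submitting it is a cheap way to catch an action that points at a target that was renamed or removed.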

Slide 18

Slide 18 text

Experiments
Experiments are a snapshot of the experiment template when it was first launched, with a couple of additions.
Experiments include:
• Snapshot of the experiment
• Creation and start time
• Status
• Execution ID
• Experiment template ID
• IAM role ARN

Slide 19

Slide 19 text

Use cases

Slide 20

Slide 20 text

Use cases
• One-off experiments
• Periodic game days
• Automated experiments

Slide 24

Slide 24 text

Automated experiments
• Recurring scheduled experiments
• Event-triggered experiments
• Continuous delivery experiments
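
For the continuous delivery case, a pipeline step could start an experiment, wait for it to finish, and fail the stage if it did not complete. The sketch below assumes a client object exposing `start_experiment` and `get_experiment` with the response shapes shown, which is how I understand the boto3 FIS client to behave; verify this against the SDK documentation before relying on it. `run_experiment_gate`, `FakeFisClient`, and the IDs are hypothetical, and the fake client exists only so the example runs without AWS access.

```python
import time
from typing import Any, Iterable

TERMINAL = {"completed", "stopped", "failed"}


def run_experiment_gate(
    client: Any,
    template_id: str,
    poll_seconds: float = 10.0,
    max_polls: int = 60,
) -> bool:
    """Start an experiment and report whether it ran to completion."""
    started = client.start_experiment(experimentTemplateId=template_id)
    experiment_id = started["experiment"]["id"]
    for _ in range(max_polls):
        response = client.get_experiment(id=experiment_id)
        status = response["experiment"]["state"]["status"]
        if status in TERMINAL:
            # Only a completed run lets the pipeline stage pass.
            return status == "completed"
        time.sleep(poll_seconds)
    # Still not finished after max_polls checks: treat as a failed gate.
    return False


class FakeFisClient:
    """Stand-in for a real client so the sketch runs without AWS access."""

    def __init__(self, statuses: Iterable[str]) -> None:
        self._statuses = iter(statuses)

    def start_experiment(self, experimentTemplateId: str) -> dict:
        return {"experiment": {"id": "EXP-example"}}

    def get_experiment(self, id: str) -> dict:
        return {"experiment": {"state": {"status": next(self._statuses)}}}


passed = run_experiment_gate(
    FakeFisClient(["initiating", "running", "completed"]),
    "EXT-example",
    poll_seconds=0,
)
print(passed)
```

The same function could be called from a scheduler or an event handler, which covers the recurring and event-triggered cases with only the trigger changing.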

Slide 29

Slide 29 text

Resources
• AWS Well-Architected Framework: https://aws.amazon.com/architecture/well-architected/
• AWS Fault Injection Simulator: https://aws.amazon.com/fis/
• AWS FIS Documentation: https://docs.aws.amazon.com/fis/
• AWS FIS Samples: https://github.com/aws-samples/aws-fault-injection-simulator-samples

Slide 30

Slide 30 text

Thank you!
Gunnar Grosch (@gunnargrosch)