Slide 1

Slide 1 text

Chaos Engineering on AWS: Improve resiliency and performance with controlled chaos engineering

Slide 2

Slide 2 text

© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Improve resiliency and performance with controlled chaos engineering
Tech Talk: Veliswa Boya, Senior Developer Advocate, EMEA (Sub-Saharan Africa), Amazon Web Services

Slide 3

Slide 3 text

Agenda
  Challenges with distributed systems
  What is chaos engineering and why it is hard
  Introducing AWS Fault Injection Simulator (FIS)
  Key features and use cases
  AWS FIS demo walk-through

Slide 4

Slide 4 text

What’s your story?

Slide 5

Slide 5 text

Chaos Engineering

Slide 6

Slide 6 text

Challenges with distributed systems

Slide 7

Slide 7 text

Distributed systems are complex
[Diagram: a client and a server exchange a message and a reply across a network]

Slide 8

Slide 8 text

Traditional testing is not enough
  Unit testing of components: tested in isolation to ensure function meets expectations
  Functional testing of integrations: each execution path tested to assure expected results
  TESTING = VERIFYING A KNOWN CONDITION

Slide 9

Slide 9 text

Chaos: Stress, Observe, Improve
  Improve resilience and performance
  Uncover hidden issues
  Expose blind spots
  Monitoring, observability, and alarms
  And more

Slide 10

Slide 10 text

Phases of chaos engineering
  Steady state
  Hypothesis
  Run experiment
  Verify
  Improve
https://docs.aws.amazon.com/fis/latest/userguide/getting-started-planning.html
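The five phases can be read as a loop. The sketch below is illustrative only: the metric (p99 latency), the 300 ms threshold, and the `measure` and `inject_fault` callables are hypothetical stand-ins, not part of the talk or of AWS FIS.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Hypothesis:
    """A falsifiable statement about steady state (hypothetical example)."""
    description: str
    max_p99_latency_ms: float

    def holds(self, p99_latency_ms: float) -> bool:
        return p99_latency_ms <= self.max_p99_latency_ms


def run_phases(
    measure: Callable[[], float],
    inject_fault: Callable[[], None],
    hypothesis: Hypothesis,
) -> str:
    # Phase 1, steady state: only experiment on a system that is healthy.
    if not hypothesis.holds(measure()):
        return "abort: system is not in steady state"
    # Phase 2, hypothesis: passed in by the caller.
    # Phase 3, run experiment: inject the fault.
    inject_fault()
    # Phase 4, verify: measure again while the fault is active.
    if hypothesis.holds(measure()):
        return "verified: hypothesis held under fault"
    # Phase 5, improve: a failed hypothesis is a finding to fix, then rerun.
    return "improve: hypothesis failed, fix and rerun"


# Dry run with canned readings: healthy before the fault, degraded after.
readings = iter([120.0, 480.0])
result = run_phases(
    measure=lambda: next(readings),
    inject_fault=lambda: None,
    hypothesis=Hypothesis("p99 latency stays under 300 ms", 300.0),
)
```

In this dry run the second reading breaks the threshold, so the loop ends in the improve phase.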

Slide 11

Slide 11 text

Why is chaos engineering difficult?
  Stitch together different tools and homemade scripts
  Agents or libraries required to get started
  Difficult to ensure safety
  Difficult to reproduce “real-world” events (multiple failures at once)

Slide 12

Slide 12 text

AWS Fault Injection Simulator

Slide 13

Slide 13 text

Chaos engineering
  Safeguards
  Real-world conditions
  Easy to get started

Slide 14

Slide 14 text

Easy to get started
  No need to integrate multiple tools and homemade scripts or install agents
  Use the AWS Management Console, the AWS API, or the AWS CLI
  Use pre-existing experiment templates and get started in minutes
  Easily share templates with others

Slide 15

Slide 15 text

Real-world conditions
  Run experiment actions in sequence or in parallel
  Target all levels of the system (host, infrastructure, network, etc.)
  Real faults injected at the service control plane level!
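In an FIS experiment template, sequencing is expressed per action with the `startAfter` field: actions without it start when the experiment starts, so they run in parallel. The sketch below builds such an `actions` map and groups it into waves to show the resulting order. The action names and target names are hypothetical, and the `execution_waves` helper is a simplification for illustration, not how the FIS engine is implemented.

```python
from typing import Any, Dict, List

# Hypothetical "actions" section of an experiment template.
ACTIONS: Dict[str, Dict[str, Any]] = {
    "stopWebInstances": {
        "actionId": "aws:ec2:stop-instances",
        "targets": {"Instances": "webInstances"},
    },
    "rebootDatabase": {
        "actionId": "aws:rds:reboot-db-instances",
        "targets": {"DBInstances": "database"},
    },
    "rebootWorkerInstances": {
        "actionId": "aws:ec2:reboot-instances",
        "targets": {"Instances": "workerInstances"},
        # Runs only after both parallel actions above have finished.
        "startAfter": ["stopWebInstances", "rebootDatabase"],
    },
}


def execution_waves(actions: Dict[str, Dict[str, Any]]) -> List[List[str]]:
    """Group action names into waves; actions in one wave can run in parallel."""
    remaining = dict(actions)
    done = set()
    waves: List[List[str]] = []
    while remaining:
        ready = sorted(
            name
            for name, action in remaining.items()
            if set(action.get("startAfter", [])) <= done
        )
        if not ready:
            raise ValueError("cycle or unknown action name in startAfter")
        waves.append(ready)
        done.update(ready)
        for name in ready:
            del remaining[name]
    return waves


waves = execution_waves(ACTIONS)
```

Here the first wave holds the two independent actions and the second wave holds the dependent one.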

Slide 16

Slide 16 text

Safeguards
  “Stop conditions” alarms
  Integration with Amazon CloudWatch
  Built-in rollbacks
  Fine-grained IAM controls
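A stop condition ties an experiment to a CloudWatch alarm: if the alarm fires, FIS stops the experiment. The sketch below builds the `stopConditions` list for an experiment template. The helper name, the ARN check, and the example alarm ARN are assumptions made for illustration.

```python
import re
from typing import Dict, List

# Loose shape check for a CloudWatch alarm ARN (an assumption, not an official pattern).
_ALARM_ARN = re.compile(r"^arn:aws[a-z-]*:cloudwatch:[a-z0-9-]+:\d{12}:alarm:.+$")


def stop_conditions(alarm_arns: List[str]) -> List[Dict[str, str]]:
    """Build the stopConditions section of an FIS experiment template."""
    if not alarm_arns:
        # A template without a guard alarm declares that explicitly.
        return [{"source": "none"}]
    for arn in alarm_arns:
        if not _ALARM_ARN.match(arn):
            raise ValueError(f"not a CloudWatch alarm ARN: {arn}")
    return [{"source": "aws:cloudwatch:alarm", "value": arn} for arn in alarm_arns]


# Hypothetical alarm in a hypothetical account.
guards = stop_conditions(
    ["arn:aws:cloudwatch:eu-west-1:123456789012:alarm:HighErrorRate"]
)
```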

Slide 17

Slide 17 text

Supported targets for fault injections
  Amazon EC2 instances
  Amazon ECS
  API throttling
  Amazon EKS
  Amazon RDS
  And more to come…

Slide 18

Slide 18 text

AWS Fault Injection Simulator: Overview
[Diagram: an experiment template is created in AWS Fault Injection Simulator through the AWS Management Console or the AWS Command Line Interface, with access controlled by AWS Identity and Access Management. "Start experiment": the FIS engine acts on AWS resources (compute, databases, networking, storage). "Stop experiment": FIS safeguards react to monitoring (Amazon CloudWatch alarms, Amazon EventBridge, third party).]

Slide 19

Slide 19 text

Creating an experiment template ❶

Slide 20

Slide 20 text

Creating an experiment template ❷

Slide 21

Slide 21 text

Stop instance

Slide 22

Slide 22 text

Demo: Stop instance with alarms
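A template for a demo like this could look roughly as follows. This is a sketch, not the template used in the talk: the tag `ChaosReady=true`, the role and alarm ARNs, and the five-minute restart delay are all assumptions. The field names follow the FIS `CreateExperimentTemplate` request.

```python
import uuid
from typing import Any, Dict


def stop_instance_template(
    role_arn: str,
    alarm_arn: str,
    tag_key: str = "ChaosReady",   # hypothetical tag marking opt-in instances
    tag_value: str = "true",
) -> Dict[str, Any]:
    """Request body for a template that stops one tagged EC2 instance."""
    return {
        "clientToken": str(uuid.uuid4()),
        "description": "Stop one tagged EC2 instance and restart it after 5 minutes",
        "roleArn": role_arn,
        # Safeguard: the experiment stops if this CloudWatch alarm fires.
        "stopConditions": [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}],
        "targets": {
            "oneInstance": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",  # pick a single matching instance
            }
        },
        "actions": {
            "stopInstance": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "oneInstance"},
            }
        },
        "tags": {"Name": "stop-instance-demo"},
    }


# Placeholder ARNs; replace with real ones before use.
template = stop_instance_template(
    role_arn="arn:aws:iam::123456789012:role/fis-demo-role",
    alarm_arn="arn:aws:cloudwatch:eu-west-1:123456789012:alarm:HighErrorRate",
)
# With boto3 installed and credentials configured, the template could be created with:
#   boto3.client("fis").create_experiment_template(**template)
```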

Slide 23

Slide 23 text

AWS Fault Injection Simulator
Use case: as part of a CI/CD pipeline

Slide 24

Slide 24 text

References
  https://aws.amazon.com/builders-library/challenges-with-distributed-systems/
  https://docs.aws.amazon.com/fis/latest/userguide/what-is.html
  https://docs.aws.amazon.com/fis/latest/userguide/getting-started-tutorial.html
  https://docs.aws.amazon.com/fis/latest/userguide/getting-started-iam.html
  https://medium.com/the-cloud-architect/chaos-engineering-ab0cc9fbd12a
