Slide 1

Slide 1 text

© 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Chaos Engineering: Getting out of the starting blocks
Adrian Hornsby, Principal Technical Evangelist, Amazon Web Services

Slide 2

Slide 2 text

What currently prevents the wide adoption of chaos engineering in your organization?

Slide 3

Slide 3 text


Slide 4

Slide 4 text

Why is production chaos?

Slide 5

Slide 5 text


Slide 6

Slide 6 text

#0 - DON’T CALL IT CHAOS ENGINEERING.

Slide 7

Slide 7 text

#0 - DON’T CALL IT CHAOS ENGINEERING.

Slide 8

Slide 8 text

#1 - DON’T FOCUS ON CHAOS ENGINEERING, LOOK AT THE BIGGER PICTURE.

Slide 9

Slide 9 text

Good intentions never work [...]

Slide 10

Slide 10 text

Because people already had good intentions

Slide 11

Slide 11 text


Slide 12

Slide 12 text

If good intentions don’t work, what does?

Slide 13

Slide 13 text

The Andon Cord

Slide 14

Slide 14 text

1902

Slide 15

Slide 15 text

Toyota will not allow any defect that they know about to go down the manufacturing line.

Slide 16

Slide 16 text

Source: http://www.autoexpress.co.uk/toyota/prius/34615/japanese-earthquake-hits-car-production

Slide 17

Slide 17 text

Andon Customer Service

Slide 18

Slide 18 text

• Erroneously listed recharge cable as included
• Andon cord pulled and page corrected
• Contacts per unit go from 33% to 3.7%

Slide 19

Slide 19 text


Slide 20

Slide 20 text

"Good intentions never work, you need good mechanisms to make anything happen." Jeff Bezos

Slide 21

Slide 21 text

People have good intentions to start with!

Slide 22

Slide 22 text

Good Mechanisms ≈ Complete Processes: Tools, Adoption, Audit

Slide 23

Slide 23 text

#2 - CHANGE BEGINS WITH UNDERSTANDING.

Slide 24

Slide 24 text

What are the top 5 “painful” reasons for your fires?

Slide 25

Slide 25 text

1. It is always DNS
2. Configuration drift
3. SSL Certificate expiration
4. Deployment failure
5. Failed link to 3rd party provider

Slide 26

Slide 26 text

Anatomy of a COE (Correction of Error)
• What happened?
• What was the impact on customers and your business?
• What were the contributing factors?
• What data do you have to support this?
• What lessons did you learn?
• What corrective actions are you taking?

Slide 27

Slide 27 text

Audit: Weekly Operational Metrics Review
• Continuous inspection mechanism
• Maintains focus on operations
• Foundation of a healthy operations program
Typical agenda, usually divided into fifteen-minute slots
• Share successes and failings
• Action items follow up
• Review COEs
• Review key service metrics
• Identify new best practices

Slide 28

Slide 28 text

Policy Engine
• Automated risk and opportunity analyzer
• Identifies potential risks to availability, infrastructure, security and more
• Highlights opportunities to optimize resource utilization
• Extensible and configurable
• Provides a view into policy compliance
• Allows acknowledgment
• Reports roll up the organization hierarchy
Mechanism to propagate local learnings globally

Slide 29

Slide 29 text

#3 - CHOOSE YOUR TROJAN HORSE.

Slide 30

Slide 30 text

Find the right team to start with:
Not the best (improvements are harder)
Not the worst (they have bigger problems)

Slide 31

Slide 31 text

Choose the metrics to measure improvement: MTTR is __always__ a good default.
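
A minimal sketch of tracking that default metric, assuming MTTR here means mean time to recovery and that each incident's recovery time is already known in minutes. The function name `mttr_minutes` and the sample durations are hypothetical, not from the talk.

```shell
#!/bin/sh
# mttr_minutes DURATION...
# Mean time to recovery over a list of incidents. Each argument is one
# incident's recovery time in minutes (hypothetical sample data below).
mttr_minutes() {
    if [ "$#" -eq 0 ]; then
        echo "usage: mttr_minutes DURATION..." >&2
        return 1
    fi
    # The shell only does integer arithmetic, so awk computes the mean.
    printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.1f\n", sum / NR }'
}

# Compare the chosen team before and after a round of experiments.
echo "before: $(mttr_minutes 42 65 38 55) min"
echo "after:  $(mttr_minutes 30 22 35 25) min"
```

Comparing the same team's number over time matters more than the absolute value, since the point of the metric is to show improvement.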

Slide 32

Slide 32 text

#4 - OVER-INDEX ON THE HYPOTHESIS.

Slide 33

Slide 33 text

STEADY STATE → HYPOTHESIS → RUN EXPERIMENT → VERIFY → IMPROVE
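
The loop on this slide can be sketched as a small shell function. This is an illustration under assumptions, not the speaker's tooling: `run_experiment` and its three arguments are hypothetical names, and the probe, injection, and rollback commands are whatever fits the system under test.

```shell
#!/bin/sh
# run_experiment PROBE INJECT ROLLBACK
#   PROBE    command that exits 0 while the system is in steady state
#   INJECT   command that introduces the failure
#   ROLLBACK command that removes the failure again
# All three are placeholders supplied by the caller.
run_experiment() {
    probe=$1 inject=$2 rollback=$3

    # Steady state: refuse to start if the system is already unhealthy.
    if ! sh -c "$probe"; then
        echo "ABORT: not in steady state before the experiment"
        return 2
    fi

    # Hypothesis and experiment: we expect the probe to keep passing
    # while the failure is injected.
    sh -c "$inject"

    # Verify: record the outcome, then always roll back.
    if sh -c "$probe"; then
        result=0
        echo "PASS: hypothesis held"
    else
        result=1
        echo "FAIL: hypothesis disproved, feed this into improvements"
    fi
    sh -c "$rollback"
    return "$result"
}

# Harmless demonstration with stand-in commands.
run_experiment "true" "echo injecting" "echo rolling back"
```

In a real run the probe might be something like `curl -fsS http://127.0.0.1/api/health`, with `docker stop` and `docker start` on a container as the injection and rollback. Writing the probe first forces the hypothesis to be stated before anything is broken.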

Slide 34

Slide 34 text


Slide 35

Slide 35 text


Slide 36

Slide 36 text

#5 - INTRODUCE CHAOS ENGINEERING EARLY IN THE JOURNEY.

Slide 37

Slide 37 text

Start simple and local!!
$ docker stop 94a214bbeebd

Slide 38

Slide 38 text

DDoS yourself
$ wrk -t12 -c400 -d30s http://127.0.0.1/api/health

Slide 39

Slide 39 text

Burn CPU with stress(-ng)
$ stress-ng --cpu 0 --cpu-method matrixprod -t 60s
https://kernel.ubuntu.com/~cking/stress-ng/

Slide 40

Slide 40 text

Adding latency to the network
$ tc qdisc add dev eth0 root netem delay 300ms
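
A sketch of the same injection with a built-in rollback, assuming a Linux host with `tc` and root privileges. The `run` and `add_latency` helpers and the `DRY_RUN` switch are hypothetical names for illustration; by default the script only prints what it would do.

```shell
#!/bin/sh
# DRY_RUN=1 (the default chosen here) only prints the commands.
# Set DRY_RUN=0 and run as root to actually apply them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# add_latency DEV DELAY SECONDS
# Add DELAY of latency on DEV, then remove it after SECONDS so the
# fault cannot outlive the experiment.
add_latency() {
    dev=$1 delay=$2 seconds=$3
    run tc qdisc add dev "$dev" root netem delay "$delay" || return 1
    # Remove the qdisc even if the wait is interrupted.
    trap 'run tc qdisc del dev "$dev" root netem; trap - INT TERM; exit 130' INT TERM
    run sleep "$seconds"
    trap - INT TERM
    run tc qdisc del dev "$dev" root netem
}

add_latency eth0 300ms 60
```

Pairing every injection with its removal, and bounding it in time, keeps a forgotten rule from turning an experiment into an incident.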

Slide 41

Slide 41 text

Blocks DNS resolution
$ iptables -A INPUT -p tcp -m tcp --dport 53 -j DROP
$ iptables -A INPUT -p udp -m udp --dport 53 -j DROP

Slide 42

Slide 42 text

#6 - BLAST-RADIUS REDUCTION MINDSET.

Slide 43

Slide 43 text

#7 - IF YOU HAVEN’T VERIFIED IT, IT’S PROBABLY BROKEN.

Slide 44

Slide 44 text

Verification:
1. Disaster Recovery & backups
2. Auto scaling
3. Multi-AZ
4. Fault tolerance & self healing
5. People

Slide 45

Slide 45 text


Slide 46

Slide 46 text

Getting out of the starting blocks.

Slide 47

Slide 47 text

Tools, Processes, Culture, Technology

Slide 48

Slide 48 text

Thank you! Adrian Hornsby https://medium.com/@adhorn adhorn