Slide 1

Slide 1 text

Chaos Engineering: When The Network Breaks
Tammy Butow, Principal SRE, Gremlin
@tammybutow

Slide 2

Slide 2 text

THE PROBLEM: Every system is becoming a distributed system.

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

Chaos Engineering: Thoughtful, planned experiments designed to reveal the weaknesses in our systems.

Slide 5

Slide 5 text

Inject something harmful to build an immunity.

Slide 6

Slide 6 text

We test proactively, instead of waiting for an outage.

Slide 7

Slide 7 text

Define the Blast Radius

Slide 8

Slide 8 text

What is the value of Chaos Engineering?

Slide 9

Slide 9 text

Improved Incident Management

Slide 10

Slide 10 text

Fire drills prepare us to respond quickly, calmly, and safely.

Slide 11

Slide 11 text

Measuring the Cost of Downtime

Cost = R + E + C + (B + A)

During the Outage
R = Revenue Lost
E = Employee Productivity

After the Outage
C = Customer Chargebacks (SLA Breaches)

Unquantifiable
B = Brand Defamation
A = Employee Attrition

Amazon is estimated to lose $220,000/min.
The average e-commerce site loses $6,800/min.
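As a rough worked example of the formula on this slide, the sketch below adds up only the quantifiable terms (R, E, C) for a hypothetical 30-minute outage at the slide's average e-commerce rate of $6,800/min. The employee productivity and chargeback figures are made-up placeholders, and B and A are left out because the slide treats them as unquantifiable.

```python
def downtime_cost(revenue_per_min, minutes, employee_cost, chargebacks):
    """Quantifiable part of Cost = R + E + C + (B + A)."""
    revenue_lost = revenue_per_min * minutes  # R
    return revenue_lost + employee_cost + chargebacks  # R + E + C


# Assumed inputs: 30 minutes of downtime, $15,000 of lost productivity (E),
# $10,000 of SLA chargebacks (C). Only the $6,800/min rate is from the slide.
cost = downtime_cost(6800, 30, employee_cost=15000, chargebacks=10000)
print(f"${cost:,}")  # prints $229,000
```

Even with modest placeholder values, the revenue term dominates, which is why the per-minute figure is the one usually quoted.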

Slide 12

Slide 12 text

Network Chaos

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

Network Chaos Engineering Demos
01 Latency Injection
02 Packet Loss
03 Blackhole

Slide 15

Slide 15 text

Hipster Shop Architecture

Slide 16

Slide 16 text

35.238.163.103 Hipster Shop Demo

Slide 17

Slide 17 text

Latency Injection Demo

Slide 18

Slide 18 text

Experiment #4: Latency Attack on Hipster Shop, 1 container (payments), 200 ms. Datadog: HTTP 400/500 errors.
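One way to reason about why 200 ms of added latency can surface as HTTP errors is a toy model in which the caller gives up after a fixed timeout. This is only an illustrative sketch, not how Gremlin injects latency; the function name, the 250 ms timeout, and the baseline latencies are all assumptions.

```python
def call_with_latency(base_latency_ms, injected_ms, timeout_ms):
    """Return an HTTP-like status for one call under injected latency.

    The call succeeds (200) if it finishes within the caller's timeout,
    otherwise the caller reports a gateway timeout (504).
    """
    total_ms = base_latency_ms + injected_ms
    return 200 if total_ms <= timeout_ms else 504


# Hypothetical baseline latencies for four requests to a payments service.
baselines_ms = (20, 40, 80, 120)
statuses = [call_with_latency(b, injected_ms=200, timeout_ms=250) for b in baselines_ms]
print(statuses)  # prints [200, 200, 504, 504]
```

In this model, requests that were already slow are the first to fail, so the error rate depends on how much headroom the timeout leaves over normal latency.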

Slide 19

Slide 19 text

Latency Attack on Payment Container on AWS EKS

Slide 20

Slide 20 text

Experiment #5: Latency on Hipster Shop, 1 instance, 200 ms. Datadog: HTTP 400/500 errors.

Slide 21

Slide 21 text

Latency Attack 1 instance on AWS EKS

Slide 22

Slide 22 text

Packet Loss Demo

Slide 23

Slide 23 text

Experiment #2: Packet Loss on Kubernetes Dashboard (`kubernetes-dashboard`) using Gremlin, 70% for 60 seconds. Datadog: rise in errors (400/500s). Slower responses, but ultimately success.
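The "slower responses, but ultimately success" outcome is what a simple retry model predicts under 70% packet loss. The sketch below is a simulation only, assuming each attempt is lost independently and the sender retries up to a hypothetical limit of 50 attempts; it does not touch the network or model TCP.

```python
import random


def send_with_retries(loss_rate, max_attempts, rng):
    """Retry until a simulated packet gets through.

    Returns the number of attempts used, or None if every attempt was lost.
    """
    for attempt in range(1, max_attempts + 1):
        if rng.random() >= loss_rate:  # this attempt survived the loss
            return attempt
    return None


rng = random.Random(42)  # fixed seed so the simulation is repeatable
attempts = [send_with_retries(0.7, 50, rng) for _ in range(1000)]
delivered = [a for a in attempts if a is not None]
mean_attempts = sum(delivered) / len(delivered)
print(f"delivered {len(delivered)}/1000, mean attempts {mean_attempts:.2f}")
```

With independent 70% loss, the expected number of attempts per delivery is 1 / 0.3, a little over three, so requests complete but take several round trips longer.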

Slide 24

Slide 24 text

Packet Loss Attack 1 container on AWS EKS

Slide 25

Slide 25 text

Blackhole Attack Demo

Slide 26

Slide 26 text

Experiment #3: Blackhole Attack on Hipster Shop, 1 container (payments), 120 seconds. Datadog: HTTP 400/500 errors.

Slide 27

Slide 27 text

Blackhole Attack Payment Container on AWS EKS

Slide 28

Slide 28 text

Experiment #3: Blackhole Attack on Hipster Shop, 1 container (catalogue), 60 seconds. Datadog: HTTP 400/500 errors.

Slide 29

Slide 29 text

Blackhole Attack Catalogue Container on AWS EKS

Slide 30

Slide 30 text

How to communicate results of your Chaos Engineering experiments?

Slide 31

Slide 31 text

Was it expected? Chaos Engineering uncovers unknown side effects.
Was it detected? Ensuring that our monitoring is configured correctly is critical.
Was it mitigated? When possible our systems should gracefully degrade.
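To make "gracefully degrade" concrete, here is a minimal sketch of a page handler that falls back to cached data when its dependency is blackholed. The function names, the exception class, and the catalogue items are hypothetical stand-ins, not code from the Hipster Shop demo.

```python
class BlackholedError(ConnectionError):
    """Raised by the stand-in client when traffic to the service is dropped."""


def fetch_catalogue(blackholed):
    # Hypothetical stand-in for a network call to a catalogue service.
    if blackholed:
        raise BlackholedError("catalogue unreachable")
    return ["vintage typewriter", "film camera"]


def catalogue_page(blackholed, cached=("film camera",)):
    """Serve live data when possible, otherwise a degraded cached view."""
    try:
        return {"items": fetch_catalogue(blackholed), "degraded": False}
    except ConnectionError:
        # Dependency is unreachable: answer with stale data instead of a 500.
        return {"items": list(cached), "degraded": True}


print(catalogue_page(blackholed=True))
```

A blackhole experiment against a handler like this would be "mitigated" if users see the cached view rather than an error page.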

Slide 32

Slide 32 text

Fix the issues. Whether code, configuration or process - iterate and improve.
Can you automate this? Regularly exercise past failures to prevent the drift into failure.
Share your results! Prepare an Executive Summary of what you learned.
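One possible way to capture the three questions from the previous slide (expected, detected, mitigated) so that results can be shared and re-run is a small record type. This is a sketch only; the class, field names, and follow-up wording are assumptions rather than a format the talk prescribes.

```python
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    name: str
    expected: bool   # Was it expected?
    detected: bool   # Was it detected?
    mitigated: bool  # Was it mitigated?

    def follow_ups(self):
        """List the follow-up work implied by each 'no' answer."""
        actions = []
        if not self.expected:
            actions.append("document the unknown side effect")
        if not self.detected:
            actions.append("fix monitoring and alerting")
        if not self.mitigated:
            actions.append("add graceful degradation")
        return actions


# Hypothetical outcome of a blackhole experiment on a payments container.
result = ExperimentResult("blackhole: payments", expected=True, detected=False, mitigated=False)
print(result.follow_ups())
```

A list of such records is a reasonable starting point for the executive summary, and the same records can drive scheduled re-runs of past failures.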

Slide 33

Slide 33 text

Where can you get started?

Slide 34

Slide 34 text

Join us @ Chaos Conf
chaosconf.io
twitter.com/chaosconf
San Francisco, September 26, 2019
Special code: “insider” for $49 tickets
@tammybutow @gremlininc

Slide 35

Slide 35 text

Thank You
Tammy Butow, Principal SRE, Gremlin
tammy@gremlin.com
@tammybutow