Slide 1

Slide 1 text

Chaos Engineering: When the network breaks
June 13, 2019, Velocity San Jose
Tammy Butow, Principal SRE, Gremlin
tammy@gremlin.com @tammybutow

Slide 2

Slide 2 text

THE PROBLEM
Every system is becoming a distributed system.

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

Chaos Engineering
Thoughtful, planned experiments designed to reveal the weaknesses in our systems.

Slide 5

Slide 5 text

build an immunity

Slide 6

Slide 6 text

proactively

Slide 7

Slide 7 text

Define the Blast Radius

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

Improved Incident Management

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

Measuring the Cost of Downtime
Cost = R + E + C + (B + A)
During the Outage
R = Revenue Lost
E = Employee Productivity
After the Outage
C = Customer Chargebacks (SLA Breaches)
Unquantifiable
B = Brand Defamation
A = Employee Attrition
Amazon is estimated to lose $220,000/min.
The average e-commerce site loses $6,800/min.
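As a rough illustration of the formula on this slide, the sketch below adds up only the quantifiable terms (R, E, C) and leaves out B and A, which the slide marks as unquantifiable. The class and field names are hypothetical, and treating the $6,800/min figure purely as lost revenue (with E and C set to zero) is an assumption made for the example, not something the slide states.

```python
from dataclasses import dataclass


@dataclass
class OutageCost:
    """Quantifiable part of Cost = R + E + C + (B + A). Names are illustrative."""
    revenue_lost_per_min: float       # R, accrues during the outage
    productivity_lost_per_min: float  # E, accrues during the outage
    chargebacks: float                # C, one-off SLA payouts after the outage

    def quantifiable_cost(self, minutes: float) -> float:
        # B (brand defamation) and A (employee attrition) are deliberately
        # excluded: the slide lists them as unquantifiable.
        per_minute = self.revenue_lost_per_min + self.productivity_lost_per_min
        return per_minute * minutes + self.chargebacks


# Assumption: the $6,800/min average is used as R only; E and C are set to 0.
ecommerce = OutageCost(revenue_lost_per_min=6800.0,
                       productivity_lost_per_min=0.0,
                       chargebacks=0.0)
thirty_min_cost = ecommerce.quantifiable_cost(30)
print(thirty_min_cost)  # 204000.0
```

Even with E and C at zero, a 30-minute outage at the average e-commerce rate already exceeds $200,000 under these assumptions.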

Slide 12

Slide 12 text

Network Chaos

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

Network Chaos Engineering Demos
01 Latency Injection
02 Packet Loss
03 Blackhole

Slide 15

Slide 15 text

Hipster Shop Architecture

Slide 16

Slide 16 text

Latency Injection Demo

Slide 17

Slide 17 text

Experiment #3: Latency Attack
Application: Hipster Shop
Monitoring: Datadog
Attack tool: Gremlin
Target: `frontend`
Attack: 500ms delay for 120 seconds
Watch for: HTTP 400/500 errors
Hypothesis: 500ms latency should be a non-issue
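The hypothesis behind a latency experiment can be rehearsed in-process before running a real attack. The sketch below is a stand-in only: it wraps a hypothetical handler with an artificial delay and checks the result against a hypothetical 1-second response budget. It is not how Gremlin injects latency (that happens at the network level), and the handler, its 50ms baseline, and the budget are all assumptions. A fake clock keeps the example deterministic.

```python
from typing import Callable


class FakeClock:
    """Deterministic clock so the example does not actually sleep."""

    def __init__(self) -> None:
        self.now = 0.0

    def sleep(self, seconds: float) -> None:
        self.now += seconds

    def time(self) -> float:
        return self.now


def with_injected_latency(call: Callable[[], str], delay_s: float,
                          clock: FakeClock) -> Callable[[], str]:
    """Return a version of `call` that waits `delay_s` before running."""
    def delayed() -> str:
        clock.sleep(delay_s)
        return call()
    return delayed


def within_budget(call: Callable[[], str], budget_s: float,
                  clock: FakeClock) -> bool:
    """True if `call` completes within `budget_s` on the given clock."""
    start = clock.time()
    call()
    return clock.time() - start <= budget_s


clock = FakeClock()


def frontend_handler() -> str:
    # Hypothetical baseline: the handler itself takes 50ms.
    clock.sleep(0.05)
    return "200 OK"


# 500ms of added delay, as in the experiment on the slide.
slow_frontend = with_injected_latency(frontend_handler, 0.5, clock)
hypothesis_holds = within_budget(slow_frontend, budget_s=1.0, clock=clock)
print(hypothesis_holds)
```

If the budget were tighter than the baseline plus the injected delay (for example 300ms), the same check would fail, which is the kind of surprise the experiment is meant to surface.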

Slide 18

Slide 18 text

Live Demo

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

Packet Loss Demo

Slide 21

Slide 21 text

Experiment #2: Packet Loss
Application: Kubernetes Dashboard
Monitoring: Datadog
Attack tool: Gremlin
Target: `kubernetes-dashboard`
Attack: 70% packet loss for 60 seconds
Watch for: Rise in errors (400/500s)
Hypothesis: Slower responses, but ultimately success
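A back-of-the-envelope model helps explain why heavy packet loss can still end in success, just more slowly. The sketch below assumes each attempt is lost independently with probability 0.7 and that one attempt corresponds to one packet. Real TCP retransmission, with backoff and multi-packet requests, is more involved, so treat this as intuition rather than a prediction. The function names are illustrative.

```python
def success_within(attempts: int, loss: float) -> float:
    """Probability that at least one of `attempts` independent tries gets through."""
    return 1.0 - loss ** attempts


def expected_attempts(loss: float) -> float:
    """Mean number of tries until the first success (geometric distribution)."""
    return 1.0 / (1.0 - loss)


LOSS = 0.7  # 70% packet loss, as in the experiment on the slide

for n in (1, 3, 5, 10):
    print(f"success within {n:2d} attempts: {success_within(n, LOSS):.3f}")

print(f"expected attempts: {expected_attempts(LOSS):.2f}")
```

Under these assumptions a single attempt succeeds only 30% of the time, but ten attempts succeed about 97% of the time, at the cost of roughly 3.3 attempts on average.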

Slide 22

Slide 22 text

Demo

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

Blackhole Attack Demo

Slide 25

Slide 25 text

Experiment #1: Blackhole Attack
Application: Hipster Shop
Monitoring: Datadog
Attack tool: Gremlin
Target: `paymentservice`
Attack: Drop all traffic for 120 seconds
Watch for: HTTP 400/500 errors
Hypothesis: Expect payments to fail and errors thrown
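When traffic to a dependency is dropped, the caller typically sees a timeout rather than an explicit error. The sketch below shows one way a checkout path could turn that timeout into a clear failure response instead of hanging. It is an illustration only: `checkout`, `blackholed_charge`, and the response shape are hypothetical and are not taken from the Hipster Shop code.

```python
from typing import Callable, Dict, Union

Response = Dict[str, Union[int, str]]


def blackholed_charge(amount_cents: int) -> str:
    # Stand-in for calling `paymentservice` during a blackhole: no reply
    # ever arrives, so the client-side timeout fires.
    raise TimeoutError("no response from paymentservice")


def checkout(amount_cents: int, charge: Callable[[int], str]) -> Response:
    """Attempt a payment and fail fast with a clear error if it is unreachable."""
    try:
        transaction_id = charge(amount_cents)
    except (TimeoutError, ConnectionError) as exc:
        # Payments fail and an error is returned, matching the hypothesis.
        return {"status": 503, "error": f"payment unavailable: {exc}"}
    return {"status": 200, "transaction_id": transaction_id}


failed = checkout(1000, blackholed_charge)
print(failed["status"], failed["error"])
```

The same function returns a 200 response when given a working `charge` callable, so the failure path can be exercised without touching the happy path.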

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

Results

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

How to communicate results of your Chaos Engineering experiments?

Slide 30

Slide 30 text

Was it expected? Chaos Engineering uncovers unknown side effects.
Was it detected? Ensuring that our monitoring is configured correctly is critical.
Was it mitigated? When possible, our systems should gracefully degrade.
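The three questions above can be captured as a small record per experiment so results are reported consistently. This is a minimal sketch with hypothetical names. The sample values are invented for illustration and are not results from the talk's demos.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ExperimentResult:
    name: str
    expected: bool   # Was it expected?
    detected: bool   # Was it detected?
    mitigated: bool  # Was it mitigated?

    def follow_ups(self) -> List[str]:
        """List the actions implied by each question answered 'no'."""
        actions = []
        if not self.expected:
            actions.append("document the unknown side effect")
        if not self.detected:
            actions.append("fix monitoring or alerting")
        if not self.mitigated:
            actions.append("add graceful degradation")
        return actions


# Hypothetical outcome, for illustration only.
result = ExperimentResult(name="blackhole paymentservice",
                          expected=True, detected=False, mitigated=False)
for action in result.follow_ups():
    print(f"{result.name}: {action}")
```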

Slide 31

Slide 31 text

Fix the issues. Whether code, configuration, or process: iterate and improve.
Can you automate this? Regularly exercise past failures to prevent the drift into failure.
Share your results! Prepare an Executive Summary of what you learned.

Slide 32

Slide 32 text

Where can you get started?

Slide 33

Slide 33 text

Join the Chaos Engineering Community: gremlin.com/community
Join us at Chaos Conf, September 26, 2019
@tammybutow @gremlininc

Slide 34

Slide 34 text

gremlin.com/tammy

Slide 35

Slide 35 text

Thank You
Tammy Butow, Principal SRE, Gremlin
tammy@gremlin.com @tammybutow