Chaos Engineering: When the Network Breaks

June 13, 2019, Velocity San Jose
Tammy Butow, Principal SRE, Gremlin
@tammybutow

THE PROBLEM
Every system is becoming a distributed system.

Chaos Engineering
Thoughtful, planned experiments designed to reveal the weaknesses in our systems.

build an immunity

proactively

Define the Blast Radius

Improved Incident Management

Measuring the Cost of Downtime
Cost = R + E + C + (B + A)
During the Outage
  R = Revenue Lost
  E = Employee Productivity
After the Outage
  C = Customer Chargebacks (SLA Breaches)
Unquantifiable
  B = Brand Defamation
  A = Employee Attrition
Amazon is estimated to lose $220,000/min.
The average e-commerce site loses $6,800/min.
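As a rough worked example of the cost formula, the sketch below adds up the quantifiable terms for a hypothetical 10-minute outage. Only the $6,800/min revenue figure comes from the slide; the outage length and the amounts for E and C are made-up placeholders, and B and A are left out because the slide lists them as unquantifiable.

```python
def downtime_cost(revenue_per_min, minutes, productivity_lost=0.0, chargebacks=0.0):
    """Quantifiable part of Cost = R + E + C + (B + A).

    B (brand defamation) and A (employee attrition) are omitted
    because they cannot be priced directly.
    """
    revenue_lost = revenue_per_min * minutes  # R
    return revenue_lost + productivity_lost + chargebacks  # R + E + C


# Hypothetical 10-minute outage at the slide's average e-commerce rate.
cost = downtime_cost(
    revenue_per_min=6_800,    # from the slide
    minutes=10,               # assumed outage length
    productivity_lost=5_000,  # assumed E, placeholder value
    chargebacks=2_000,        # assumed C, placeholder value
)
print(cost)  # 75000
```

Revenue alone accounts for $68,000 of that total, which is why even short outages are worth preventing.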

Network Chaos

Network Chaos Engineering Demos
01 Latency Injection
02 Packet Loss
03 Blackhole

Hipster Shop Architecture

Latency Injection Demo

Experiment #3: Latency Attack
Hipster Shop, Datadog, Gremlin
Target: `frontend`
Attack: 500ms delay for 120 seconds
Monitor: HTTP 400/500 errors
Hypothesis: 500ms latency should be a non-issue
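To make the idea of latency injection concrete, here is a minimal in-process sketch. It is not how Gremlin injects latency (that happens at the network level on the target host); `with_latency`, `render_home_page`, and `timed_call` are hypothetical names used only for illustration. The wrapper adds a fixed delay in front of a call so the caller's view of a slow dependency can be measured.

```python
import time


def with_latency(func, delay_s):
    """Return a wrapper that sleeps for delay_s before calling func."""
    def wrapper(*args, **kwargs):
        time.sleep(delay_s)  # injected delay
        return func(*args, **kwargs)
    return wrapper


def render_home_page():
    # Hypothetical stand-in for a request to `frontend`.
    return "200 OK"


def timed_call(func):
    """Call func and return (result, elapsed_seconds)."""
    start = time.monotonic()
    result = func()
    return result, time.monotonic() - start


# 500ms delay, matching the experiment's attack parameter.
slow_frontend = with_latency(render_home_page, delay_s=0.5)
result, elapsed = timed_call(slow_frontend)
print(result, f"{elapsed:.2f}s")
```

If the hypothesis holds, the result is unchanged and only the elapsed time grows; a caller with a timeout shorter than the injected delay would instead surface as an error.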

Live Demo

Packet Loss Demo

Experiment #2: Packet Loss
Kubernetes Dashboard, Datadog, Gremlin
Target: `kubernetes-dashboard`
Attack: 70% packet loss for 60 seconds
Monitor: Rise in errors (400/500s)
Hypothesis: Slower responses, but ultimately success
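One way to see why 70% packet loss can still end in "slower responses, but ultimately success" is a simplified retry model. The sketch below assumes each attempt is lost independently with the same probability, which ignores how real TCP retransmission and backoff behave; the function names are illustrative.

```python
def success_probability(loss_rate, attempts):
    """Chance that at least one of `attempts` independent tries gets through."""
    return 1 - loss_rate ** attempts


def expected_attempts(loss_rate):
    """Mean number of tries until the first success (geometric distribution)."""
    return 1 / (1 - loss_rate)


loss = 0.70  # 70% packet loss, as in the experiment
for n in (1, 3, 10):
    print(n, round(success_probability(loss, n), 3))
print(round(expected_attempts(loss), 2))
```

Under this model a single attempt succeeds 30% of the time, ten attempts succeed about 97% of the time, and a delivery takes a little over three tries on average, so requests complete but take noticeably longer.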

Blackhole Attack Demo

Experiment #1: Blackhole Attack
Hipster Shop, Datadog, Gremlin
Target: `paymentservice`
Attack: Drop all traffic for 120 seconds
Monitor: HTTP 400/500 errors
Hypothesis: Expect payments to fail and errors thrown
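The blackhole hypothesis (payments fail and errors are thrown) can be illustrated with a small in-process sketch. It assumes the caller sees dropped traffic as a timeout; `charge_blackholed`, `checkout`, and `PaymentUnavailable` are hypothetical names, not part of Hipster Shop or Gremlin.

```python
class PaymentUnavailable(Exception):
    """Raised when the payment call cannot reach its dependency."""


def charge_blackholed(amount):
    # Hypothetical stand-in for `paymentservice` while all traffic is dropped:
    # the call never gets a response, which the client sees as a timeout.
    raise TimeoutError("no response from paymentservice")


def checkout(amount, charge):
    """Try to charge; turn a timeout into a clear, handled failure."""
    try:
        charge(amount)
    except TimeoutError as exc:
        raise PaymentUnavailable("payment failed, please retry later") from exc
    return "order placed"


try:
    checkout(42.0, charge_blackholed)
except PaymentUnavailable as err:
    print("handled:", err)
```

The point of the experiment is to confirm that the failure is explicit and bounded like this, rather than a hang or an unrelated error elsewhere in the shop.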

Results

How do you communicate the results of your Chaos Engineering experiments?

Was it expected? Chaos Engineering uncovers unknown side effects.
Was it detected? Ensuring that our monitoring is configured correctly is critical.
Was it mitigated? When possible our systems should gracefully degrade.
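The three questions can be captured as a small record so that every experiment is reported the same way. This is only an illustrative sketch: the `ExperimentResult` class, its follow-up wording, and the example outcome are assumptions, not results reported in the talk.

```python
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    name: str
    expected: bool   # Was it expected?
    detected: bool   # Was it detected?
    mitigated: bool  # Was it mitigated?

    def follow_ups(self):
        """List an action for each question answered 'no'."""
        items = []
        if not self.expected:
            items.append("document the unknown side effect")
        if not self.detected:
            items.append("fix monitoring or alerting")
        if not self.mitigated:
            items.append("add graceful degradation")
        return items


# Hypothetical outcome used only to show the shape of a report.
result = ExperimentResult(
    "blackhole paymentservice", expected=True, detected=False, mitigated=False
)
print(result.follow_ups())
```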

Fix the issues. Whether code, configuration or process, iterate and improve.
Can you automate this? Regularly exercise past failures to prevent the drift into failure.
Share your results! Prepare an Executive Summary of what you learned.

Where can you get started?

Join the Chaos Engineering Community: gremlin.com/community
Join us at Chaos Conf, September 26, 2019
@tammybutow @gremlininc

gremlin.com/tammy

Thank You
Tammy Butow, Principal SRE, Gremlin
@tammybutow
