Slide 1

Slide 1 text

Practicing Thoughtful Controlled Chaos Engineering
Ho Ming Li || NGINX meetup

Slide 2

Slide 2 text

Ho Ming Li @HoReaL @GremlinInc

Slide 3

Slide 3 text

Chaos Engineering
Thoughtful, planned experiments designed to reveal the weaknesses in our systems.

Slide 4

Slide 4 text

Like a vaccine, we inject harm to build immunity.

Slide 5

Slide 5 text

GameDay
Dedicated time for teams to collaboratively run Chaos Experiments to reveal weaknesses in your systems.

Slide 6

Slide 6 text

1. Why?
2. How?

Slide 10

Slide 10 text

“Computers aren’t the thing. They’re the thing that gets us to the thing.” - Halt and Catch Fire
Chaos Engineering isn’t the thing. It’s the thing that gets us to Resilience.

Slide 11

Slide 11 text

- Prime Down: Amazon’s sale day turns into fail day (TechCrunch)
- Delta Outage: Computer malfunction results in nationwide ground stop (NBC)
- Slack Outage: Connectivity issues hit workplaces (WSJ)

Slide 12

Slide 12 text

$$$$$$$$$$$$$$$$$$$$
Reputation
CX
Employee Burnout

Slide 14

Slide 14 text

Better Fire Fighting ≠ More Resilient

Slide 18

Slide 18 text

Start Small
Start Simple

Slide 20

Slide 20 text

Start in Staging
Mature to Production

Slide 21

Slide 21 text

Anatomy of a GameDay
GameDay
- Experiment #1: Attack (Inject Failure), Attack, ...
- Experiment #2: Attack, Attack, ...
- Experiment #3: Attack, Attack, ...
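The nesting on this slide (a GameDay holds experiments, and each experiment holds attacks) can be sketched as a small data model. This is only an illustration: the class names, fields, and sample descriptions below are assumptions made for the example, not part of the talk or of any tool's API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Attack:
    """One injected failure (hypothetical model)."""
    description: str


@dataclass
class Experiment:
    """A hypothesis plus the attacks run to test it (hypothetical model)."""
    hypothesis: str
    attacks: List[Attack] = field(default_factory=list)


@dataclass
class GameDay:
    """A dedicated session made up of one or more experiments."""
    name: str
    experiments: List[Experiment] = field(default_factory=list)

    def attack_count(self) -> int:
        # Total number of attacks across every experiment in the session.
        return sum(len(e.attacks) for e in self.experiments)


# Sample data, invented for illustration.
game_day = GameDay(
    name="Staging GameDay",
    experiments=[
        Experiment(
            hypothesis="Frontend gets a 5XX when the datastore is unreachable",
            attacks=[Attack("Block traffic to the datastore")],
        ),
        Experiment(
            hypothesis="User delay roughly equals injected latency",
            attacks=[Attack("Add 100 ms latency"), Attack("Add 300 ms latency")],
        ),
    ],
)

print(game_day.attack_count())
```

Keeping the hypothesis on the experiment rather than on each attack mirrors the slide, where several attacks serve a single experiment.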

Slide 22

Slide 22 text

Stop Guessing. Observe it!

Slide 23

Slide 23 text

Observability
Get rid of the Fog of War so you can clearly see the map and strategize accordingly.
Gain Deep Insight with:
- Metrics
- Logging
- Request Tracing

Slide 24

Slide 24 text

Win/Lose Pass/Fail

Slide 25

Slide 25 text

Hypothesis   Results     Next Step
Resilient    Resilient   Automate
Fail         Fail        Improve
Resilient    Fail        Dig Deeper
Fail         Resilient   Dig Deeper
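The hypothesis/results table reduces to a simple rule: when the result contradicts the hypothesis, dig deeper; when both say resilient, automate; when both say fail, improve. A minimal sketch of that rule follows; the `Outcome` enum and the `next_step` function are names invented for this example.

```python
from enum import Enum


class Outcome(Enum):
    RESILIENT = "Resilient"
    FAIL = "Fail"


def next_step(hypothesis: Outcome, result: Outcome) -> str:
    """Map a (hypothesis, result) pair to the next step from the table."""
    if hypothesis is not result:
        # The system surprised us in either direction: investigate why.
        return "Dig Deeper"
    if result is Outcome.RESILIENT:
        # Expected resilient and it was: automate the experiment.
        return "Automate"
    # Expected failure and it failed: fix the known weakness.
    return "Improve"


for h in Outcome:
    for r in Outcome:
        print(f"{h.value:<10} {r.value:<10} {next_step(h, r)}")
```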

Slide 26

Slide 26 text

GameDay Findings Teaser

Slide 27

Slide 27 text

Sort of Expected
Attack - Can’t connect to DynamoDB.
Expectation - Frontend gets a 5XX Error from Backend.
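To make the expectation concrete, here is a rough sketch of a backend that turns a datastore connection failure into an explicit 5XX response, plus a check of the hypothesis. Everything in it is hypothetical: `handle_request`, `unreachable_datastore`, and the choice of status 503 are assumptions for illustration, and the fake datastore stands in for the real attack. The slide does not say how the actual system behaved.

```python
from typing import Callable, Dict, Tuple


def handle_request(fetch_item: Callable[[str], Dict], key: str) -> Tuple[int, Dict]:
    """Hypothetical backend handler returning (status_code, body)."""
    try:
        return 200, fetch_item(key)
    except ConnectionError:
        # Datastore unreachable: fail fast with an explicit 5XX.
        return 503, {"error": "datastore unavailable"}


def unreachable_datastore(key: str) -> Dict:
    """Stand-in for the attack: every call fails to connect."""
    raise ConnectionError("cannot connect to datastore")


status, body = handle_request(unreachable_datastore, "user-42")

# Hypothesis from the slide: the frontend sees a 5XX from the backend.
hypothesis_holds = 500 <= status <= 599
print(status, hypothesis_holds)
```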

Slide 28

Slide 28 text

Magnified Wait
Attack - Inject small amount of latency between app and database
Expectation - Users experience delay roughly same as injected latency
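The slide does not state why the wait was magnified. One plausible cause, assumed here purely for illustration, is that a single user request makes several sequential database round trips, so each one pays the injected latency. The function name and the numbers below are invented for the example.

```python
def added_user_delay_ms(injected_latency_ms: float, db_round_trips: int) -> float:
    """Extra delay a user sees if every sequential round trip pays the injected latency.

    Assumes the round trips run one after another (no parallelism or caching).
    """
    return injected_latency_ms * db_round_trips


# With 100 ms injected and 5 sequential queries per request, the user
# waits an extra 500 ms, not the 100 ms the expectation suggests.
print(added_user_delay_ms(100, 5))
```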

Slide 29

Slide 29 text

I can see this, but I can’t see that
Attack - Consumer cannot connect to Database
Expectation - Consumer can no longer process messages

Slide 30

Slide 30 text

Loosely coupled... or Not
Attack - Container dies
Expectation - Orchestrator will spawn new container

Slide 31

Slide 31 text

“An Application”
“Edge”: DNS, CDN
“Front End”: LB, API
“Back End”: App/Web Server, Queue, RDB, KV DB, Search Index
“Infrastructure”: Container, Kubernetes, Virtual Machine, Physical Server, Storage, Network, Data Center, Geography

Slide 32

Slide 32 text

Don’t Forget the Human
Last Updated: 04/01/2013
Last Validated: 02/01/2019

Slide 33

Slide 33 text

Reliably Yours
Break Things on Purpose
tinyurl.com/chaoseng
meetup.com/pro/chaos

Slide 34

Slide 34 text

- Hard Disk (Storage)
- NIC/Cables (Network)
- Power Supply
- Bugs in Apps
- Unpredictable Load
- Etc.

Slide 35

Slide 35 text

before

Slide 36

Slide 36 text

What a GameDay isn’t:
• A simple exercise or “box to check”
• An opportunity to maliciously expose faulty services
• A one-time event
• A high-risk endeavor

Slide 37

Slide 37 text

What a GameDay is and can be
• A dedicated time to come together to gain insights
• The execution of one or more experiments
• The proof or disproof of a hypothesis
• A time to test, sometimes destructively, the resilience of your application and architecture