Practicing Thoughtful Controlled Chaos Engineering
Ho Ming Li || NGINX meetup
Slide 2
Ho Ming Li
@HoReaL @GremlinInc
Slide 3
Chaos Engineering
Thoughtful, planned experiments designed to reveal the weaknesses in our systems.
Slide 4
Like a vaccine, we inject
harm to build immunity.
Slide 5
GameDay
Dedicated time for teams to collaboratively run Chaos Experiments to reveal weaknesses in your systems.
Slide 6
1. Why?
2. How?
Slide 10
“Computers aren’t the thing. They’re the thing that gets us to the thing.”
- Halt and Catch Fire
Chaos Engineering isn’t the thing.
It’s the thing that gets us to Resilience.
Slide 11
Prime Down: Amazon’s sale day turns into fail day (TechCrunch)
Delta Outage: Computer malfunction results in nationwide ground stop (NBC)
Slack Outage: Connectivity issues hit workplaces (WSJ)
Observability
Get rid of the Fog of War so you can clearly see the map and strategize accordingly.
Gain Deep Insight with:
- Metrics
- Logging
- Request Tracing
Sort of Expected
Attack
- Can’t connect to DynamoDB.
Expectation
- Frontend gets a 5XX error from Backend.
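The attack and expectation above can be sketched as a tiny in-process simulation. This is only an illustration under stated assumptions: `FaultInjector`, `FakeTable`, and `backend_handler` are hypothetical names, and no real DynamoDB client or Gremlin API is involved.

```python
class FaultInjector:
    """Hypothetical switch that simulates losing connectivity to the database."""

    def __init__(self):
        self.blackhole = False  # True while the attack is running


class FakeTable:
    """Stand-in for a database table client (not the real DynamoDB API)."""

    def __init__(self, injector):
        self.injector = injector
        self.items = {"user-1": {"name": "Ada"}}

    def get(self, key):
        if self.injector.blackhole:
            raise ConnectionError("cannot connect to database")
        return self.items[key]


def backend_handler(table, key):
    """Return (status, body); an unreachable database becomes a 503."""
    try:
        return 200, table.get(key)
    except ConnectionError as exc:
        return 503, {"error": str(exc)}


injector = FaultInjector()
table = FakeTable(injector)

baseline = backend_handler(table, "user-1")
injector.blackhole = True           # start the attack
during_attack = backend_handler(table, "user-1")
injector.blackhole = False          # halt the attack
after_attack = backend_handler(table, "user-1")

# Expectation from the slide: the frontend sees a 5XX from the backend.
hypothesis_holds = 500 <= during_attack[0] < 600
```

Comparing `baseline`, `during_attack`, and `after_attack` mirrors the shape of a real experiment: observe steady state, inject the fault, then confirm recovery once the attack is halted.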
Slide 28
Magnified Wait
Attack
- Inject small amount of latency between app and database
Expectation
- Users experience delay roughly same as injected latency
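One way the observed delay can end up larger than the injected latency is a request path that makes several sequential database calls. That cause is an assumption here (the slide does not state it), and the function names and numbers below are purely illustrative.

```python
def request_latency_ms(db_calls, base_ms, injected_ms):
    """Total time for one user request that makes `db_calls` sequential queries."""
    return db_calls * (base_ms + injected_ms)


def magnification(db_calls, base_ms, injected_ms):
    """How many times larger the user-visible delay is than the injected latency."""
    with_attack = request_latency_ms(db_calls, base_ms, injected_ms)
    without_attack = request_latency_ms(db_calls, base_ms, 0)
    return (with_attack - without_attack) / injected_ms


# Assumed scenario: 5 sequential queries per page, 2 ms each, 100 ms injected.
expected_added_ms = 100
observed_added_ms = request_latency_ms(5, 2, 100) - request_latency_ms(5, 2, 0)
factor = magnification(5, 2, 100)
```

In this toy model the expectation only holds when a request makes a single database call; every additional sequential call adds another copy of the injected latency.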
Slide 29
I can see this, but I can’t see that
Attack
- Consumer cannot connect to Database
Expectation
- Consumer can no longer process messages
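A rough sketch of this scenario, assuming a consumer that puts a message back on the queue when its database write fails. The `consume` function, the in-memory queue, and the `db_up` flag are all hypothetical; a real message broker and database would behave differently in detail.

```python
from collections import deque


def consume(queue, write_to_db, max_messages):
    """Process up to max_messages; put a message back if the DB write fails."""
    processed = 0
    for _ in range(min(max_messages, len(queue))):
        msg = queue.popleft()
        try:
            write_to_db(msg)
            processed += 1
        except ConnectionError:
            queue.appendleft(msg)  # keep the message for a later retry
            break
    return processed


db_up = True
stored = []


def write_to_db(msg):
    """Fake database write that fails while the attack is running."""
    if not db_up:
        raise ConnectionError("database unreachable")
    stored.append(msg)


queue = deque(["m1", "m2", "m3"])

db_up = False                        # attack: consumer cannot reach the database
processed_during = consume(queue, write_to_db, 10)
depth_during = len(queue)            # queue depth is the signal worth watching

db_up = True                         # attack halted
processed_after = consume(queue, write_to_db, 10)
```

Queue depth staying flat or growing while the processed count stays at zero is one signal that could confirm the expectation, provided both are actually being measured.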
Slide 30
Loosely coupled...
… or Not
Attack
- Container dies
Expectation
- Orchestrator will spawn new container
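The expectation relies on the orchestrator reconciling actual state against desired state. The toy loop below illustrates that idea only; `ToyOrchestrator` is a hypothetical class and does not reflect the API of Kubernetes or any real orchestrator.

```python
import itertools


class ToyOrchestrator:
    """Keeps `desired` containers running by replacing any that disappear."""

    def __init__(self, desired):
        self.desired = desired
        self._ids = itertools.count(1)
        self.running = set()
        self.reconcile()

    def reconcile(self):
        """Start containers until the running count matches the desired count."""
        started = []
        while len(self.running) < self.desired:
            name = f"c{next(self._ids)}"
            self.running.add(name)
            started.append(name)
        return started

    def kill(self, name):
        """Attack: remove a container without telling the orchestrator."""
        self.running.discard(name)


orch = ToyOrchestrator(desired=3)
orch.kill("c2")                      # attack: container dies
count_after_kill = len(orch.running)
replacements = orch.reconcile()      # orchestrator notices and spawns a new one
```

A replacement container being spawned only shows that capacity was restored; whether its dependents reconnect cleanly is a separate question that the experiment has to observe.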
Slide 31
“An Application”
“Edge” DNS, CDN
“Front End” LB, API
“Back End” App/Web Server
Queue, RDB, KV DB
Search Index
“Infrastructure”:
Container
Kubernetes
Virtual Machine
Physical Server
Storage
Network
Data Center
Geography
Slide 32
Don’t Forget
the Human
Last Updated: 04/01/2013
Last Validated: 02/01/2019
Slide 33
Reliably Yours
Break Things on Purpose
tinyurl.com/chaoseng
meetup.com/pro/chaos
Slide 34
Hard Disk (Storage)
NIC/Cables (Network)
Power Supply
Bugs in Apps
Unpredictable Load
Etc.
Slide 35
before
Slide 36
What a GameDay isn’t:
• A simple exercise or “box to check”
• An opportunity to maliciously expose faulty services
• A one-time event
• A high-risk endeavor
Slide 37
What a GameDay is and can be
• A dedicated time to come together to gain insights
• The execution of one or more experiments
• The proof or disproof of a hypothesis
• A time to test, sometimes destructively, the resilience of your application and architecture
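The "proof or disproof of a hypothesis" framing can be sketched as a small data structure. This is a minimal illustration, not a prescribed GameDay format: the `Experiment` class, its field names, and the cache example are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    hypothesis: str
    attack: Callable[[], None]   # starts the fault
    halt: Callable[[], None]     # stops the fault and restores the system
    check: Callable[[], bool]    # True if the hypothesis held during the attack

    def run(self) -> str:
        try:
            self.attack()
            held = self.check()
        finally:
            self.halt()          # always halt, even if the check itself fails
        return "proved" if held else "disproved"


# Hypothetical system under test: serve from cache, fall back to the database.
state = {"cache_up": True}


def serve():
    return "from-cache" if state["cache_up"] else "from-db"


experiment = Experiment(
    hypothesis="If the cache is unavailable, requests are served from the database.",
    attack=lambda: state.update(cache_up=False),
    halt=lambda: state.update(cache_up=True),
    check=lambda: serve() == "from-db",
)
result = experiment.run()
```

Putting `halt` in a `finally` block reflects the "controlled" part of the exercise: the attack is stopped whether or not the hypothesis holds.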