USING CHAOS TO BUILD
RESILIENT SYSTEMS
@tammybütow, Gremlin
Slide 2
What’s the scale of your infra?
@tammybütow #QCONNYC
Slide 3
How many services do you
have running in production?
Slide 4
How many engineers do you
have at your company?
Slide 5
A Common Chaos Engineering Journey
Slide 6
TOP 5 MOST POPULAR WAYS TO
USE CHAOS ENGINEERING IN 2018
Slide 7
ADVANCED USES OF CHAOS ENGINEERING
Slide 8
What happened this week: June 2018 Slack Outage
Slide 9
Slide 10
TAMMY BÜTOW
Principal SRE, Gremlin
Causing chaos in prod since 2009.
Previously SRE Manager @ Dropbox
leading Databases, Block Storage and
Code Workflows for 500 million users
and 800 engineers.
@tammybütow
Slide 11
GREMLIN
• We are practitioners of Chaos
Engineering
• We build software that helps engineers
build resilient systems in a safe, secure
and simple way.
• We offer 11 ways to inject chaos for your
Chaos Engineering experiments (e.g.
host/container packet loss and
shutdown)
Slide 12
PART 1: LAYING THE FOUNDATION
Slide 13
Let’s Define A Resilient System:
• A resilient system is a highly available and durable system.
• A resilient system can maintain an acceptable level of service in
the face of failure.
• A resilient system can weather the storm (a misconfiguration, a
large scale natural disaster or controlled chaos engineering).
Slide 14
It would be silly to give an
Olympic pole-vaulter a broom
and ban them from practicing!
Slide 15
“Thoughtful planned experiments designed
to reveal the weaknesses in our systems”
- Kolton Andrus, Gremlin CEO
Slide 16
Think of it like a vaccination:
Inject something harmful in order to
build an immunity.
Slide 17
Eventually systems will break
in many undesired ways.
Break them first on purpose with
controlled chaos!
Slide 18
DOGFOODING
• Using your own product.
• For us that means using
Gremlin for our Chaos
Engineering experiments.
• Failure Fridays
Slide 19
Failure Fridays are dedicated time for
teams to collaboratively focus on using
Chaos Engineering practices to reveal
weaknesses in their services.
Slide 20
WHY DO DISTRIBUTED SYSTEMS NEED CHAOS?
• Unusual, hard-to-debug
failures are common
• Systems & companies scale
rapidly and Chaos
Engineering helps you learn
along the way
Slide 21
FULL-STACK CHAOS ENGINEERING
• You can inject chaos at any
layer.
• API, App, Cache, Database,
OS, Host, Network, Power &
more.
Slide 22
WHY RUN CHAOS ENGINEERING EXPERIMENTS?
Slide 23
Are you confident that your metrics and
alerting are as good as they should be?
#pagerpain
Slide 24
Are you confident your customers are
getting as good an experience
as they should be?
#customerpain
Slide 25
Are you losing money due to downtime
and broken features?
#businesspain
Slide 26
HOW DO YOU RUN
CHAOS ENGINEERING EXPERIMENTS?
Slide 27
HOW TO RUN A CHAOS ENGINEERING EXPERIMENT
• Form a hypothesis
• Consider blast radius
• Run experiment
• Measure results
• Find & fix issues or scale
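One way to picture the five steps above is as a small harness. The sketch below is illustrative only: the `Experiment` class, `run_experiment`, and every field name are assumptions made for this example, not the API of Gremlin or any other tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Experiment:
    """Hypothetical container for one chaos experiment (names are illustrative)."""
    hypothesis: str                  # what we expect to stay true
    blast_radius: List[str]          # hosts or services the attack may touch
    inject: Callable[[], None]       # starts the failure
    rollback: Callable[[], None]     # stops the failure
    measure: Callable[[], float]     # returns the metric being watched
    threshold: float                 # hypothesis holds while metric <= threshold


def run_experiment(exp: Experiment) -> Dict[str, object]:
    """Run one experiment and report whether the hypothesis held."""
    baseline = exp.measure()
    exp.inject()
    try:
        observed = exp.measure()
    finally:
        exp.rollback()               # roll back even if measuring raises
    return {
        "hypothesis": exp.hypothesis,
        "blast_radius": exp.blast_radius,
        "baseline": baseline,
        "observed": observed,
        "held": observed <= exp.threshold,
    }
```

If `held` comes back false, that is the "find & fix issues" step; if it comes back true, the blast radius can be widened for the next run.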
Slide 28
Don’t run before you can walk
Slide 29
The 3 Prerequisites for Chaos Engineering
1. Monitoring & Observability
2. On-Call & Incident Management
3. Know Your Cost of Downtime Per Hour
Slide 30
What Do I Use For Monitoring & Observability?
Slide 31
We All Need To Know The Cost Of Downtime
Slide 32
We All Need Incident Management
Slide 33
HOW TO CHOOSE A CHAOS EXPERIMENT
• Identify top 5 critical systems
• Choose 1 system
• Whiteboard the system
• Select attack: resource/state/network
• Determine scope
Slide 34
WHAT SHOULD WE MEASURE?
• Availability — 500s
• Service specific KPIs
• System metrics: CPU, IO, Disk
• Customer complaints
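For the availability bullet above, a minimal sketch of turning a list of HTTP status codes into an availability figure might look like this. The function name and the choice to treat an empty sample as fully available are assumptions for illustration.

```python
from typing import Iterable


def availability(status_codes: Iterable[int]) -> float:
    """Fraction of requests that did not end in a 5xx server error."""
    codes = list(status_codes)
    if not codes:
        return 1.0  # assumption: no traffic counts as fully available
    server_errors = sum(1 for code in codes if 500 <= code <= 599)
    return 1 - server_errors / len(codes)
```

Comparing this number before and during an experiment gives a simple signal for whether the attack reached customers.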
Slide 35
HOW TO RUN YOUR OWN GAMEDAY!
gremlin.com/gameday
Slide 36
HOW TO RUN YOUR OWN GAMEDAY!
gremlin.com/gameday
Slide 37
EXAMPLE SYSTEM: KUBERNETES RETAIL STORE
User
Primary: kube-01
Node: kube-02
Node: kube-03
Node: kube-04
Slide 38
PART 2: RESOURCE
CHAOS ENGINEERING
Slide 39
RESOURCE CHAOS
We can increase CPU, Disk, IO & Memory
consumption to ensure monitoring is set up to
catch problems.
It is important to catch issues before they turn into
high severity incidents (unable to purchase new
product!) and downtime for customers.
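As a rough illustration of a time-boxed CPU attack, the sketch below keeps one core busy and then stops on its own. It is a stand-in written for this transcript, not the script from the talk; `burn_cpu` is a hypothetical name.

```python
import time


def burn_cpu(seconds: float) -> int:
    """Busy-loop on one core for roughly `seconds`, then stop on its own."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1  # pointless work that keeps the core busy
    return iterations
```

Running `burn_cpu(30)` while watching `top` in another terminal should show the Python process near the top of the list. The built-in deadline limits the blast radius if the terminal is lost.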
Slide 40
CPU CHAOS
Slide 41
https://github.com/tammybutow/chaosengineeringbootcamp
LET’S CREATE A “KNOWN-KNOWN” EXPERIMENT
Slide 42
CHAOS IN TOP
Slide 43
LET’S KILL THE CHAOS NOW
Slide 44
NO MORE CHAOS IN TOP
Slide 45
DISK CHAOS
Slide 46
DISK CHAOS
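A disk attack can be sketched as writing a bounded amount of data to a scratch file and checking how much free space remains. The helper names below are hypothetical, and the size cap is an assumption added so the example cannot fill a volume by accident.

```python
import os
import shutil
import tempfile


def fill_disk(directory: str, megabytes: int) -> str:
    """Write `megabytes` MiB of zeros to a temp file in `directory`; return its path."""
    chunk = b"\0" * (1024 * 1024)
    fd, path = tempfile.mkstemp(prefix="disk-chaos-", dir=directory)
    with os.fdopen(fd, "wb") as handle:
        for _ in range(megabytes):
            handle.write(chunk)
        handle.flush()
        os.fsync(handle.fileno())  # push the data out of the page cache
    return path


def free_megabytes(directory: str) -> int:
    """Free space on the filesystem holding `directory`, in MiB."""
    return shutil.disk_usage(directory).free // (1024 * 1024)
```

Deleting the returned path with `os.remove` ends the experiment; disk alerts should fire before `free_megabytes` gets close to zero.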
Slide 47
MEMORY CHAOS
Slide 48
MEMORY CHAOS
free -m
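To have something for `free -m` to show, a memory attack can be approximated by allocating blocks and holding on to them. This is a sketch under assumptions: `hold_memory` is a hypothetical helper and the 4096-byte page size is assumed, not detected.

```python
MIB = 1024 * 1024
PAGE = 4096  # assumed page size


def hold_memory(megabytes: int) -> list:
    """Allocate `megabytes` MiB and return the blocks so the caller keeps them alive."""
    blocks = []
    for _ in range(megabytes):
        block = bytearray(MIB)
        block[::PAGE] = b"\x01" * (MIB // PAGE)  # write one byte per page
        blocks.append(block)
    return blocks
```

Keeping the returned list in a variable holds the memory; calling `clear()` on it (or letting it go out of scope) releases it again.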
Slide 49
PART 3: STATE
CHAOS ENGINEERING
Slide 50
PROCESS CHAOS
Slide 51
PROCESS CHAOS
Ways to create process chaos on purpose:
• Kill one process
• Loop kill a process
• Spawn new processes
• Fork bomb
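The first item in the list, killing one process, can be tried safely against a throwaway child process. The sketch below assumes a POSIX system; `spawn_victim` and `kill_process` are hypothetical names, and the sleeping child is only a stand-in for a real service.

```python
import signal
import subprocess
import sys


def spawn_victim() -> subprocess.Popen:
    """Start a throwaway child process that just sleeps."""
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])


def kill_process(proc: subprocess.Popen, timeout: float = 5.0) -> int:
    """Send SIGTERM to the child and return its exit status."""
    proc.send_signal(signal.SIGTERM)
    return proc.wait(timeout=timeout)
```

On POSIX, `Popen` reports a child killed by a signal as the negative signal number, so a supervisor can tell a kill from a clean exit. Calling `kill_process` in a loop against a restarted service approximates the "loop kill" item.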
Slide 52
PROCESS CHAOS
pkill -u chaos
Slide 53
SHUTDOWN CHAOS
Slide 54
SHUTDOWN CHAOS
shutdown -h
Slide 55
WHAT ARE OTHER WAYS YOU CAN
TURN OFF A SERVER?
WHAT IF YOU WANT TO
TURN OFF EVERY SERVER
WHEN IT’S ONE WEEK OLD?
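One possible answer to the second question is a scheduled job that picks out hosts by age. The sketch below covers only the selection step; the function name, the input shape (host name mapped to boot time), and the seven-day default are assumptions for illustration, and the actual shutdown call is left out.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def hosts_to_retire(boot_times: Dict[str, datetime], now: datetime,
                    max_age: timedelta = timedelta(days=7)) -> List[str]:
    """Return the hosts whose uptime has reached `max_age`, oldest first."""
    expired = [(booted, host) for host, booted in boot_times.items()
               if now - booted >= max_age]
    return [host for booted, host in sorted(expired)]
```

Feeding the result to whatever shuts hosts down, one host at a time, keeps the blast radius small.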
Slide 56
HALT, REBOOT & POWEROFF CHAOS
halt
Slide 57
WHAT ABOUT SHUTTING DOWN
CONTAINERS AND K8S PODS?
Slide 58
THE MANY WAYS TO KILL CONTAINERS
• Kill self
• Kill a container from the host
• Use one container to kill another
• Use one container to kill several containers
• Use several containers to kill several
Slide 59
The average lifespan of a container is 2.5 days,
and they fail in many unexpected ways.
Slide 60
TIME TRAVEL CHAOS
Slide 61
TIME TRAVEL CHAOS AKA CLOCK SKEW
ntpq
Slide 62
PART 4: NETWORK
CHAOS ENGINEERING
Slide 63
BLACKHOLE CHAOS
Slide 64
BLACKHOLE CHAOS
ip route show
Slide 65
DNS CHAOS
Slide 66
DNS CHAOS
Slide 67
DNS CHAOS
Slide 68
LATENCY CHAOS
Slide 69
LATENCY CHAOS
mtr google.com
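Latency can also be injected at the application layer to check that callers time out and fall back instead of hanging. The sketch below does not touch the network; `with_latency` and `call_with_timeout` are hypothetical helpers written for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as CallTimeout


def with_latency(func, delay_seconds: float):
    """Wrap `func` so each call is delayed, imitating a slow network hop."""
    def slowed(*args, **kwargs):
        time.sleep(delay_seconds)
        return func(*args, **kwargs)
    return slowed


def call_with_timeout(func, timeout_seconds: float, fallback):
    """Return func()'s result, or `fallback` if it takes longer than the timeout."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func)
    try:
        return future.result(timeout=timeout_seconds)
    except CallTimeout:
        return fallback
    finally:
        pool.shutdown(wait=False)  # do not block on the slow call
```

The hypothesis for such an experiment could be: "with 300 ms of added latency on the pricing call, the page still renders from the cached value."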
Slide 70
PACKET LOSS CHAOS
Slide 71
PACKET LOSS CHAOS
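Packet loss is normally injected at the host or container network layer. As a simplified, in-process stand-in, the sketch below drops a fraction of sends and checks that a retrying sender copes. `lossy` and `send_with_retries` are hypothetical names, and the retry count is an assumption.

```python
import random


def lossy(send, loss_rate: float, rng: random.Random):
    """Wrap `send` so a fraction of calls is dropped before reaching it."""
    def wrapped(payload):
        if rng.random() < loss_rate:
            return False  # simulated dropped packet
        return send(payload)
    return wrapped


def send_with_retries(send, payload, max_attempts: int = 3) -> int:
    """Call send(payload) until it returns True; return attempts used, or -1."""
    for attempt in range(1, max_attempts + 1):
        if send(payload):
            return attempt
    return -1
```

A result of `-1` marks the point where the sender gave up, which is the case monitoring and alerting should surface.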
Slide 72
PART 5: COMPLEX OUTAGES
Slide 73
We can combine different types of chaos
engineering experiments to reproduce
complicated outages.
Reproducing outages gives you confidence you
can handle them if/when they happen again.
Slide 74
Let’s go back in time to look at some of the
worst outage stories that kicked off
the introduction of chaos engineering.
Slide 75
DROPBOX’S WORST OUTAGE EVER
Some master-replica pairs were impacted
which resulted in the site going down.
https://blogs.dropbox.com/tech/2014/01/outage-post-mortem/
Slide 76
UBER’S DATABASE OUTAGE
1. Master log replication to S3 failed
2. Logs backed up on the primary
3. Alerts fired to engineer but they are ignored
4. Disk fills up on database primary
5. Engineer deletes unarchived WAL files
6. Error in config prevents promotion
— Matt Ranney, Uber, 2015
Slide 77
OUTAGES HAPPEN.
Slide 78
THERE ARE MANY MORE OUTAGES
YOU CAN READ ABOUT HERE:
https://github.com/danluu/post-mortems
Slide 79
HOW CAN YOU CONTINUE YOUR
CHAOS ENGINEERING JOURNEY?
Slide 80
JOIN THE CHAOS SLACK
GREMLIN.COM/SLACK
Slide 81
LEARN WITH THE GREMLIN COMMUNITY
GREMLIN.COM/COMMUNITY
Slide 82
THE FIRST CHAOS ENGINEERING CONFERENCE!
CHAOSCONF.IO