“Build, test, revise, build. That was our motto.”
@tammybütow DEVOPS @ NIKE DAY
Slide 3
“People constantly updated testing documents,
and every day we made changes in the moment”
Slide 4
NIKE HYPERADAPT 1.0
Slide 5
Slide 6
People Put Nike Hyperadapt 1.0 To The Test:
• On The Court - 20 Basketball Research Athletes
• Pounding Pavement - running at Nike campus
• In The Gym - Nike Employee Training Classes
• Walking And Working - 150 Employees
Slide 7
How does this relate to
Chaos Engineering?
Slide 8
You can use Chaos Engineering to ensure your
systems are as resilient as your sneakers.
@tammybütow, Gremlin
Slide 9
@tammybütow, Gremlin
Slide 10
TAMMY BÜTOW
Principal SRE, Gremlin
Causing chaos in prod since 2009
@tammybütow
Slide 11
GREMLIN
• We are practitioners of Chaos
Engineering
• We build software that helps
engineers build resilient systems
• We offer 11 ways to inject chaos
for your Chaos Engineering
experiments
Slide 12
PART 1: LAYING THE FOUNDATION
Slide 13
It would be silly to give an
Olympic pole-vaulter a broom
and ban them from practicing!
Slide 14
“Thoughtful planned experiments designed
to reveal the weaknesses in our systems”
- Kolton Andrus, Gremlin CEO
@tammybütow, Gremlin
Slide 15
Eventually systems will break
in many undesired ways.
Break them first on purpose with
controlled chaos!
Slide 16
DOGFOODING
• Using your own product.
• For us that means using
Gremlin for our Chaos
Engineering experiments.
• Failure Fridays
Slide 17
Failure Fridays are dedicated time for
teams to collaboratively focus on using
Chaos Engineering practices to reveal
weaknesses in your services.
Slide 18
WHY DO DISTRIBUTED SYSTEMS NEED CHAOS?
• Unusual, hard-to-debug
failures are common
• Systems & companies scale
rapidly and Chaos
Engineering helps you learn
along the way
Slide 19
FULL-STACK CHAOS ENGINEERING
• You can inject chaos at any
layer.
• API, App, Cache, Database,
OS, Host, Network, Power &
more.
Slide 20
WHY RUN CHAOS ENGINEERING EXPERIMENTS?
Slide 21
Are you confident that your metrics and
alerting are as good as they should be?
#pagerpain
Slide 22
Are you confident your customers are
getting as good an experience
as they should be?
#customerpain
Slide 23
Are you losing money due to downtime
and broken features?
#businesspain
Slide 24
HOW DO YOU RUN
CHAOS ENGINEERING EXPERIMENTS?
Slide 25
HOW TO RUN A CHAOS ENGINEERING EXPERIMENT
• Form a hypothesis
• Consider blast radius
• Run experiment
• Measure results
• Find & fix issues or scale
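The five steps above can be sketched as a small harness. This is an illustrative outline only, not Gremlin's API: the `Experiment` fields, the `run` function and the threshold rule are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    hypothesis: str                 # what we expect to stay true during the attack
    blast_radius: list[str]         # hosts or services in scope (hypothetical names)
    inject: Callable[[], None]      # starts the failure
    rollback: Callable[[], None]    # stops the failure
    measure: Callable[[], float]    # returns the metric we watch, e.g. error rate
    threshold: float                # hypothesis holds while metric <= threshold


def run(experiment: Experiment) -> dict:
    """Run one experiment and report whether the hypothesis held."""
    baseline = experiment.measure()
    experiment.inject()
    try:
        observed = experiment.measure()
    finally:
        experiment.rollback()  # always stop the chaos, even if measuring fails
    held = observed <= experiment.threshold
    return {
        "hypothesis": experiment.hypothesis,
        "baseline": baseline,
        "observed": observed,
        "held": held,
        "next_step": "scale the experiment" if held else "find and fix the issue",
    }
```

Putting the rollback in a `finally` block keeps the blast radius bounded even when the measurement step itself fails.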
Slide 26
Don’t run before you can walk
@tammybütow, Gremlin
Slide 27
HOW TO CHOOSE A CHAOS EXPERIMENT
• Identify top 5 critical systems
• Choose 1 system
• Whiteboard the system
• Select attack: resource/
state/network
• Determine scope
Slide 28
WHAT SHOULD WE MEASURE?
• Availability — 500s
• Service specific KPIs
• System metrics: CPU, IO, Disk
• Customer complaints
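As one way to turn "Availability — 500s" into a number, the sketch below computes the share of requests that did not return a 5xx status. The function name and the input shape (a plain list of HTTP status codes) are assumptions for illustration.

```python
def availability(status_codes: list[int]) -> float:
    """Fraction of requests that did not return a 5xx status."""
    if not status_codes:
        raise ValueError("no requests observed")
    server_errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return 1 - server_errors / len(status_codes)
```

Comparing this value before and during an experiment gives a simple pass/fail signal for the hypothesis.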
Slide 29
EXAMPLE SYSTEM: KUBERNETES RETAIL STORE
User
Primary: kube-01
Node: kube-02
Node: kube-03
Node: kube-04
Slide 30
PART 2: RESOURCE
CHAOS ENGINEERING
Slide 31
RESOURCE CHAOS
We can increase CPU, disk, I/O and memory
consumption to ensure monitoring is set up to
catch problems.
It is important to catch issues before they turn into
high-severity incidents (unable to purchase new
product!) and downtime for customers.
Slide 32
CPU CHAOS
Slide 33
https://github.com/tammybutow/chaosengineeringbootcamp
LET’S CREATE A “KNOWN-KNOWN” EXPERIMENT
Slide 34
CHAOS IN TOP
Slide 35
LET’S KILL THE CHAOS NOW
Slide 36
NO MORE CHAOS IN TOP
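The slides show the effect in `top` but not the commands used, so the following is only a minimal sketch of a "known-known" CPU experiment: start busy-loop processes, watch them in `top`, then kill them. The function names are hypothetical and this is not the bootcamp repository's script.

```python
import subprocess
import sys


def start_cpu_chaos(workers: int = 1) -> list[subprocess.Popen]:
    """Start `workers` child processes that each spin in a busy loop."""
    busy_loop = "while True: pass"
    return [subprocess.Popen([sys.executable, "-c", busy_loop]) for _ in range(workers)]


def stop_cpu_chaos(procs: list[subprocess.Popen]) -> None:
    """Kill the burners so CPU usage drops back to baseline."""
    for proc in procs:
        proc.kill()
    for proc in procs:
        proc.wait(timeout=5)
```

Starting with one worker on one host keeps the blast radius small; the worker count can grow once monitoring is confirmed to catch the spike.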
Slide 37
DISK CHAOS
Slide 38
DISK CHAOS
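A disk experiment can be approximated by writing a large file and checking how disk usage responds. The sketch below assumes a scratch directory you are allowed to fill; `fill_disk` and `disk_used_percent` are illustrative names, not commands from the talk.

```python
import os
import shutil
import tempfile


def fill_disk(directory: str, megabytes: int) -> str:
    """Write `megabytes` MiB of zeros into a new file and return its path."""
    fd, path = tempfile.mkstemp(prefix="disk-chaos-", dir=directory)
    chunk = b"\0" * (1024 * 1024)
    with os.fdopen(fd, "wb") as handle:
        for _ in range(megabytes):
            handle.write(chunk)
        handle.flush()
        os.fsync(handle.fileno())  # make sure the data reaches the disk
    return path


def disk_used_percent(directory: str) -> float:
    """Percentage of the filesystem holding `directory` that is in use."""
    usage = shutil.disk_usage(directory)
    return 100 * usage.used / usage.total
```

Deleting the returned file with `os.remove(path)` ends the experiment.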
Slide 39
MEMORY CHAOS
Slide 40
MEMORY CHAOS
free -m
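One simple way to make the numbers in `free -m` move is to allocate memory and hold on to it. This is a rough sketch under that assumption; `consume_memory` is a hypothetical helper, and the 4096-byte page size is assumed rather than queried.

```python
def consume_memory(megabytes: int) -> list[bytearray]:
    """Allocate `megabytes` MiB and touch each page so it becomes resident."""
    page = 4096  # assumed page size
    blocks = []
    for _ in range(megabytes):
        block = bytearray(1024 * 1024)
        # Write one byte per page; untouched zero pages may not count as used.
        block[::page] = b"\x01" * (len(block) // page)
        blocks.append(block)
    return blocks
```

The memory stays allocated for as long as the returned list is referenced; dropping the reference ends the experiment.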
Slide 41
PART 3: STATE
CHAOS ENGINEERING
Slide 42
PROCESS CHAOS
Slide 43
PROCESS CHAOS
Ways to create process chaos on purpose:
• Kill one process
• Loop kill a process
• Spawn new processes
• Fork bomb
Slide 44
PROCESS CHAOS
pkill -u chaos
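`pkill -u chaos` kills every process owned by the user `chaos`. The first two bullets on the earlier slide ("kill one process", "loop kill a process") can also be sketched against a throwaway child process, as below. The victim here is a sleeping Python process, an assumption chosen so the example is safe to run.

```python
import signal
import subprocess
import sys


def spawn_victim() -> subprocess.Popen:
    """Start a harmless child process that just sleeps."""
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])


def kill_process(proc: subprocess.Popen, sig: int = signal.SIGTERM) -> int:
    """Send `sig` to the process and return its exit status."""
    proc.send_signal(sig)
    return proc.wait(timeout=5)


def loop_kill(respawn, rounds: int) -> int:
    """Kill whatever `respawn` starts, `rounds` times; return the kill count."""
    kills = 0
    for _ in range(rounds):
        kill_process(respawn())
        kills += 1
    return kills
```

In a real experiment `respawn` would be replaced by whatever supervises the service, which is the thing a loop kill is meant to exercise.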
Slide 45
SHUTDOWN CHAOS
Slide 46
SHUTDOWN CHAOS
shutdown -h
Slide 47
WHAT ARE OTHER WAYS YOU CAN
TURN OFF A SERVER?
WHAT IF YOU WANT TO
TURN OFF EVERY SERVER
WHEN IT’S ONE WEEK OLD?
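One possible answer to the second question is to pick shutdown targets by age. The sketch assumes you already have an inventory mapping host names to launch times; the function and the inventory shape are hypothetical, and the shutdown itself is left out.

```python
from datetime import datetime, timedelta


def hosts_due_for_shutdown(
    launch_times: dict[str, datetime],
    now: datetime,
    max_age: timedelta = timedelta(days=7),
) -> list[str]:
    """Return the names of hosts that have been up for at least `max_age`."""
    return sorted(
        name for name, launched in launch_times.items() if now - launched >= max_age
    )
```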
Slide 48
HALT, REBOOT & POWEROFF CHAOS
halt
Slide 49
WHAT ABOUT SHUTTING DOWN
CONTAINERS AND K8S PODS?
Slide 50
THE MANY WAYS TO KILL CONTAINERS
• Kill self
• Kill a container from the host
• Use one container to kill another
• Use one container to kill several containers
• Use several containers to kill several
Slide 51
The average lifespan of a container is 2.5 days
And they fail in many unexpected ways.
Slide 52
TIME TRAVEL CHAOS
Slide 53
TIME TRAVEL CHAOS AKA CLOCK SKEW
ntpq
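`ntpq` inspects the NTP daemon's state; the effect of a skewed clock on application code can be explored without touching the system clock by injecting the clock as a parameter. The token-expiry check below is a hypothetical example of code that breaks under skew, not something taken from the talk.

```python
import time
from typing import Callable


def skewed_clock(
    offset_seconds: float, base: Callable[[], float] = time.time
) -> Callable[[], float]:
    """Return a clock that runs `offset_seconds` ahead of (or behind) `base`."""
    return lambda: base() + offset_seconds


def is_expired(expires_at: float, clock: Callable[[], float] = time.time) -> bool:
    """True once the clock has reached the expiry timestamp."""
    return clock() >= expires_at
```

With a clock skewed ten minutes ahead, a token that should be valid for five more minutes is rejected, which is the kind of failure this experiment is meant to surface.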
Slide 54
PART 4: NETWORK
CHAOS ENGINEERING
Slide 55
BLACKHOLE CHAOS
Slide 56
BLACKHOLE CHAOS
ip route show
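From a client's point of view, a blackholed dependency is one that never answers. The sketch below imitates that locally with a socket that listens but never accepts or replies, and shows a caller that gives up after a timeout. It is a stand-in for illustration, assuming a loopback interface is available; a real blackhole attack drops traffic at the network layer.

```python
import socket


def open_blackhole() -> socket.socket:
    """Listen on a local port but never accept or reply."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    return server


def call_dependency(address: tuple, timeout: float = 0.5) -> str:
    """Try to get a reply from `address`; fall back if nothing comes back in time."""
    try:
        with socket.create_connection(address, timeout=timeout) as conn:
            conn.sendall(b"ping")
            conn.recv(1024)  # blocks until data arrives or the timeout fires
            return "ok"
    except OSError:  # includes socket timeouts and refused connections
        return "fallback"
```

The experiment's question is whether every caller has a timeout and a fallback like this, or whether some of them hang.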
Slide 57
DNS CHAOS
Slide 58
DNS CHAOS
Slide 59
DNS CHAOS
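A DNS outage can be simulated in application code by swapping the resolver for one that always fails. The sketch below pairs that with a resolver that serves the last known address when lookups fail. Both names are hypothetical, and serving stale answers is just one mitigation you might test.

```python
import socket
from typing import Callable


def failing_resolver(hostname: str) -> str:
    """Stand-in for a DNS outage: every lookup fails."""
    raise socket.gaierror(f"simulated DNS failure for {hostname}")


class CachingResolver:
    """Resolve via `resolve`, falling back to the last good answer on failure."""

    def __init__(self, resolve: Callable[[str], str]):
        self.resolve = resolve
        self.cache: dict[str, str] = {}

    def lookup(self, hostname: str) -> str:
        try:
            address = self.resolve(hostname)
        except socket.gaierror:
            if hostname in self.cache:
                return self.cache[hostname]  # serve the stale answer
            raise
        self.cache[hostname] = address
        return address
```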
Slide 60
LATENCY CHAOS
Slide 61
LATENCY CHAOS
mtr google.com
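`mtr` measures latency along a network path. To see how your own code copes with added latency, one option is to wrap a call with an artificial delay and time it, as in this sketch. The decorator and timing helper are illustrative assumptions; network-level latency injection works differently.

```python
import functools
import time


def with_latency(delay_seconds: float):
    """Decorator that sleeps for `delay_seconds` before each call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            time.sleep(delay_seconds)
            return func(*args, **kwargs)
        return wrapper
    return decorator


def timed(func, *args, **kwargs):
    """Call `func` and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start
```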
Slide 62
PACKET LOSS CHAOS
Slide 63
PACKET LOSS CHAOS
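Packet loss can be modelled in application code as a sender that drops each message with some probability, which makes it easy to check whether retries recover. This is a simplified sketch with hypothetical names; it does not drop real packets.

```python
import random


def lossy_send(send, loss_rate: float, rng: random.Random):
    """Wrap `send` so each call is dropped with probability `loss_rate`."""
    def wrapper(payload) -> bool:
        if rng.random() < loss_rate:
            return False  # dropped: the payload never reaches `send`
        send(payload)
        return True
    return wrapper


def send_with_retries(send, payload, attempts: int = 3) -> bool:
    """Retry until one attempt is delivered or `attempts` are used up."""
    return any(send(payload) for _ in range(attempts))
```

Passing a seeded `random.Random` keeps a run reproducible, which helps when comparing results before and after a fix.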
Slide 64
PART 5: COMPLEX OUTAGES
Slide 65
We can combine different types of chaos
engineering experiments to reproduce
complicated outages.
Reproducing outages gives you confidence you
can handle them if/when they happen again.
Slide 66
Let’s go back in time to look at some of the
worst outage stories that kicked off
the introduction of chaos engineering.
Slide 67
DROPBOX’S WORST OUTAGE EVER
Some master-replica pairs were impacted
which resulted in the site going down.
https://blogs.dropbox.com/tech/2014/01/outage-post-mortem/
Slide 68
UBER’S DATABASE OUTAGE
1. Master log replication to S3 failed
2. Logs backed up on the primary
3. Alerts fired to an engineer but were ignored
4. Disk filled up on the database primary
5. Engineer deleted unarchived WAL files
6. Error in config prevented promotion
— Matt Ranney, Uber, 2015
Slide 69
OUTAGES HAPPEN.
Slide 70
THERE ARE MANY MORE OUTAGES
YOU CAN READ ABOUT HERE:
https://github.com/danluu/post-mortems
Slide 71
HOW CAN YOU CONTINUE YOUR
CHAOS ENGINEERING JOURNEY?
Slide 72
JOIN THE CHAOS SLACK
GREMLIN.COM/CHAOS
Slide 73
LEARN WITH THE GREMLIN COMMUNITY
GREMLIN.COM/COMMUNITY
Slide 74
THANK YOU
DEVOPS @ NIKE DAY
@tammybütow
#CHAOSENGINEERING