Failure Friday: Start Injecting Failure Today! (DevOpsDays Austin 2015)

5/5/15 @dougbarth DEVOPSDAYS AUSTIN 2015 Failure Friday!

Dev Ops

“Do Not Fear Failure” by Tomasz Stasiuk

How is babby PagerDuty formed?

Designed for reliability
Downstream providers fail: 3 phone providers, 3 email providers, 6 SMS providers
PagerDuty providers fail: 2 cloud providers, 3 data centers

Hung up on details
Bugs in exceptional code paths
Systems not recovering as quickly as expected
What is normal when things are abnormal?

Keep it simple
“KISS Band Member Cupcakes” by Clever Cupcakes

Process
“How to Draw an Owl” by Chester

Get buy-in
“Angry Boss” by Kaushal Karkhanis

Schedule
1 hour recurring meeting
Developers & Operations
List of attacks and identify victim
Finish as much as possible

Before starting
Conference call
Announce the start
Disable the configuration management (CM) system
Open up relevant dashboards
Leave alarms enabled

Attacks
Test a single host and then a DC
5 minutes
Return to a working state
Stop if things break
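
The attack rhythm on this slide can be sketched as a small wrapper. This is only an illustration, not a tool shown in the talk: `run_attack` and `ATTACK_SECONDS` are hypothetical names, and reading "5 minutes" as the duration of each attack is an assumption. The wrapper starts an attack, waits, then reverts, and it also reverts if the operator interrupts it because things broke.

```shell
# Hypothetical helper: run one attack for a fixed window, then always revert.
# ATTACK_SECONDS defaults to 300 (an assumed reading of "5 minutes").
run_attack() {
    start_cmd=$1
    revert_cmd=$2
    duration=${ATTACK_SECONDS:-300}

    # If the operator hits Ctrl-C because things broke, revert before exiting.
    trap 'sh -c "$revert_cmd"; exit 130' INT TERM

    echo "$(date -u +%H:%M:%S) START  $start_cmd"
    if ! sh -c "$start_cmd"; then
        trap - INT TERM
        echo "attack did not start: $start_cmd" >&2
        return 1
    fi

    # Hold the failure in place for the attack window.
    sleep "$duration"

    # Return to a working state and note the time for the log.
    echo "$(date -u +%H:%M:%S) REVERT $revert_cmd"
    sh -c "$revert_cmd"
    trap - INT TERM
}
```

A call might look like `run_attack "service cassandra stop" "service cassandra start"`, run on a single host first and only afterwards across a data center.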

Keep a log
Keep track of actions taken
Times are super important
Also track discoveries and TODOs
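
One low-effort way to keep such a log is a pair of shell functions. This is a minimal sketch rather than anything shown in the talk: the function names `ff_log` and `ff_todos`, the `FF_LOG` variable, and the tab-separated layout are all assumptions made for illustration.

```shell
# Hypothetical log file location; override by setting FF_LOG.
FF_LOG=${FF_LOG:-./failure-friday.log}

# Append one entry: UTC timestamp, a kind (ACTION, DISCOVERY or TODO), and a note.
ff_log() {
    kind=$1
    shift
    printf '%s\t%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$kind" "$*" >> "$FF_LOG"
}

# Print only the TODO notes, ready to be copied into the issue tracker.
ff_todos() {
    awk -F '\t' '$2 == "TODO" { print $3 }' "$FF_LOG"
}
```

Each action gets a timestamp automatically, which makes it easier to line the log up against dashboards and alerts afterwards.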

Use a dedicated chat room

Finishing up
Sound the all clear
Enable configuration management
Move TODOs to issue tracker

Attack Strategies
“Unicorn Attack!” by Sam Howzit

service cassandra stop

service cassandra pause

shutdown -r now

iptables -I INPUT 1 -p tcp --dport 9160 -j DROP
iptables -I INPUT 1 -p tcp --dport 7000 -j DROP
iptables -I OUTPUT 1 -p tcp --sport 9160 -j DROP
iptables -I OUTPUT 1 -p tcp --sport 7000 -j DROP
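
The rules on this slide drop traffic on ports 9160 and 7000, which are Cassandra's default Thrift client port and inter-node port. Since every attack has to end with a return to a working state, it helps to pair the rules with their removal. The sketch below is an assumption-laden illustration, not the talk's tooling: `run`, `block_ports`, `unblock_ports`, `PORTS`, and `DRY_RUN` are hypothetical names, and by default the commands are only printed.

```shell
# Ports taken from the slide: 9160 (Thrift clients) and 7000 (inter-node).
PORTS="9160 7000"

# Print commands by default; set DRY_RUN=0 to execute them (needs root).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Insert DROP rules at the top of INPUT and OUTPUT for each port.
block_ports() {
    for port in $PORTS; do
        run iptables -I INPUT 1 -p tcp --dport "$port" -j DROP
        run iptables -I OUTPUT 1 -p tcp --sport "$port" -j DROP
    done
}

# Delete the same rules by specification to restore normal traffic.
unblock_ports() {
    for port in $PORTS; do
        run iptables -D INPUT -p tcp --dport "$port" -j DROP
        run iptables -D OUTPUT -p tcp --sport "$port" -j DROP
    done
}
```

Running `block_ports` and `unblock_ports` with the default dry run prints the exact commands, which can be pasted into the chat room or log before anything is executed.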

tc qdisc add dev eth0 root netem delay 500ms 100ms loss 15%
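
In the netem command, `delay 500ms 100ms` adds 500 ms of latency with 100 ms of jitter, and `loss 15%` drops 15% of packets. A possible way to pair it with its cleanup is sketched below. The helper names `run`, `degrade_network`, and `restore_network`, and the `DRY_RUN` switch, are hypothetical; only the `tc` commands themselves come from the slide or from standard `tc` usage.

```shell
# Print commands by default; set DRY_RUN=0 to execute them (needs root).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Add 500ms delay with 100ms jitter and 15% packet loss on the interface.
degrade_network() {
    dev=${1:-eth0}
    run tc qdisc add dev "$dev" root netem delay 500ms 100ms loss 15%
}

# Remove the netem qdisc again so the interface uses its default queueing.
restore_network() {
    dev=${1:-eth0}
    run tc qdisc del dev "$dev" root netem
}
```

`tc qdisc show dev eth0` can be used before and after to confirm that the netem qdisc was really added and removed.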

“Results Reader Board” by Rosa Say

Issues fixed
Aggressive restarts by monit
Large files on ext3 volumes
Bad /etc/fstab file
High latency from cache
Low capacity with a lost DC
Missing alerts/metrics

Cultural impact
Knowledge sharing
Highlights untestable systems
Keeps failure handling on everyone’s mind

Future plans
“Robot Swordsman Fight.” by Patrick Gage Kelley

Break more things
Start testing whole DC outages
Break multiple services at once
Distribute failure testing to teams
Automate

Summary
Failures will happen
Proactively test failure handling now
Choose something easy
Automate later

[email protected]
pagerduty.com/jobs

Thank you.
pagerduty.com/jobs
