Slide 1

Slide 1 text

9/15/14 @dougbarth DEVOPSDAYS TORONTO 2014 Failure Friday!

Slide 2

Slide 2 text

Dev Ops

Slide 3

Slide 3 text

DevOps Engineer

Slide 4

Slide 4 text

“DO NOT FEAR FAILURE” BY TOMASZ STASIUK

Slide 5

Slide 5 text

How is babby PagerDuty formed?

Slide 6

Slide 6 text


Slide 7

Slide 7 text

Designed for reliability
Downstream providers fail
  3 phone providers
  3 email providers
  6 SMS providers
PagerDuty providers fail
  2 cloud providers
  3 data centers

Slide 8

Slide 8 text

Hung up on details
Bugs in exceptional code paths
Systems not recovering as quickly as expected
What is normal when things are abnormal?

Slide 9

Slide 9 text


Slide 10

Slide 10 text

Simian Army
Chaos Monkey
Latency Monkey
Chaos Gorilla
Chaos Kong
“WP7WALLPAPER_EVIL_MONKEY_09” BY SKYLER817

Slide 11

Slide 11 text

Keep it simple
“KISS BAND MEMBER CUPCAKES” BY CLEVER CUPCAKES

Slide 12

Slide 12 text

Process
“HOW TO DRAW AN OWL” BY CHESTER

Slide 13

Slide 13 text

Get buy-in
“ANGRY BOSS” BY KAUSHAL KARKHANIS

Slide 14

Slide 14 text

Schedule
1-hour recurring meeting
Developers & Operations
List the attacks and identify the victim
Finish as much as possible

Slide 15

Slide 15 text

Before starting
Disable cron jobs & the configuration management (CM) system
Announce the start
Open up relevant dashboards
Leave alarms enabled

Slide 16

Slide 16 text

Attacks
Test a single host and then a DC
5 minutes
Return to a working state
Stop if things break
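The attack loop on this slide can be sketched as a small shell helper. This is only an illustration, not the tooling used in the talk: it assumes "5 minutes" is how long each attack runs, and the names `timed_attack`, `stamp`, and `ATTACK_SECONDS` are hypothetical.

```shell
#!/bin/sh
# Sketch: run one attack for a fixed time, then return to a working state.
# ATTACK_SECONDS, stamp, and timed_attack are illustrative names (assumptions).
ATTACK_SECONDS="${ATTACK_SECONDS:-300}"   # 300 s = 5 minutes

stamp() { date -u +%H:%M:%S; }

# timed_attack START_CMD STOP_CMD
timed_attack() {
    start_cmd="$1"
    stop_cmd="$2"
    # "Stop if things break": on Ctrl-C or TERM, revert before exiting.
    trap 'sh -c "$stop_cmd"; echo "$(stamp) ABORTED, reverted"; exit 1' INT TERM
    echo "$(stamp) START $start_cmd"
    sh -c "$start_cmd"
    sleep "$ATTACK_SECONDS"
    sh -c "$stop_cmd"
    echo "$(stamp) STOP  $stop_cmd"
    trap - INT TERM
}
```

A call might look like `timed_attack "service cassandra stop" "service cassandra start"`, run as root on the chosen victim host. The printed UTC times can be pasted straight into the session log.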

Slide 17

Slide 17 text

Keep a log
Keep track of actions taken
Times are super important
Also track discoveries and TODOs
Share dashboards/metrics
Chat rooms make this easy
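If no chat room is available, a timestamped plain-text log serves the same purpose. The sketch below is one possible shape, not something shown in the talk; the file name, the `note` and `todos` helpers, and the entry kinds are all assumptions.

```shell
#!/bin/sh
# Sketch: append timestamped entries to a session log.
# FF_LOG, note, and todos are illustrative names (assumptions).
FF_LOG="${FF_LOG:-failure-friday.log}"

# note KIND MESSAGE...   e.g. KIND = ACTION, DISCOVERY, TODO
note() {
    kind="$1"
    shift
    printf '%s [%s] %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$kind" "$*" >> "$FF_LOG"
}

# List the TODO entries so they can be moved to the issue tracker afterwards.
todos() {
    grep -F '[TODO]' "$FF_LOG"
}
```

For example, `note ACTION "stopped cassandra on host1"` followed by `note TODO "add alert for slow restarts"` (both messages are made up) leaves two lines with UTC timestamps, and `todos` prints only the second.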

Slide 18

Slide 18 text

Graphs are awesome

Slide 19

Slide 19 text

Finishing up
Sound the all clear
Re-enable cron jobs & CM
Move TODOs to the issue tracker

Slide 20

Slide 20 text

Attack Strategies
“UNICORN ATTACK!” BY SAM HOWZIT

Slide 21

Slide 21 text

service cassandra stop

Slide 22

Slide 22 text

shutdown -r now

Slide 23

Slide 23 text

iptables -I INPUT 1 -p tcp --dport 9160 -j DROP
iptables -I INPUT 1 -p tcp --dport 7000 -j DROP

iptables -I OUTPUT 1 -p tcp --sport 9160 -j DROP
iptables -I OUTPUT 1 -p tcp --sport 7000 -j DROP
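These rules drop TCP traffic on ports 9160 and 7000 in both directions, which isolates the node from clients and peers without stopping the process. Returning to a working state means deleting the same rules with `-D`. The helper below is a minimal sketch of that pairing. The function names and the `DRY_RUN` switch are assumptions, and with the default `DRY_RUN=1` it only prints the commands.

```shell
#!/bin/sh
# Sketch: block a set of TCP ports with iptables and undo it afterwards.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# port_rules ACTION PORT...   ACTION is -I (insert) or -D (delete)
port_rules() {
    action="$1"
    shift
    for port in "$@"; do
        run iptables "$action" INPUT -p tcp --dport "$port" -j DROP
        run iptables "$action" OUTPUT -p tcp --sport "$port" -j DROP
    done
}

block_ports()   { port_rules -I "$@"; }
unblock_ports() { port_rules -D "$@"; }
```

`block_ports 9160 7000` prints the four insert commands, and `unblock_ports 9160 7000` prints the matching deletes. Setting `DRY_RUN=0` as root applies them for real. With no rule number, `-I` inserts at position 1, the same as the explicit `1` on the slide.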

Slide 24

Slide 24 text

tc qdisc add dev eth0 root netem delay 500ms 100ms loss 5%
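This netem rule adds 500 ms of delay with 100 ms of jitter to outgoing packets on eth0 and drops 5% of them. It stays in place until the qdisc is removed, so it needs an explicit undo step. A minimal sketch of the pair follows. The function names and `DRY_RUN` switch are assumptions, and the default only prints the commands.

```shell
#!/bin/sh
# Sketch: add and remove the slow, lossy network condition on one interface.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# add_latency DEV: 500ms delay, +/-100ms jitter, 5% packet loss
add_latency() {
    dev="$1"
    run tc qdisc add dev "$dev" root netem delay 500ms 100ms loss 5%
}

# clear_latency DEV: remove the netem qdisc added above
clear_latency() {
    dev="$1"
    run tc qdisc del dev "$dev" root netem
}
```

Run `add_latency eth0` at the start of the attack and `clear_latency eth0` at the end. Since the delay applies to all traffic on the interface, including an SSH session to the host, it helps to have the undo command ready before starting.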

Slide 25

Slide 25 text

“RESULTS READER BOARD” BY ROSA SAY

Slide 26

Slide 26 text

Issues fixed
Aggressive restarts by monit
Large files on ext3 volumes
Failing to restart due to bad /etc/fstab file
High latency from network isolated cache
Low capacity with a lost DC
Missing alerts/metrics

Slide 27

Slide 27 text

Cultural impact
Knowledge sharing
Highlights untestable systems
Keeps failure handling on everyone’s mind

Slide 28

Slide 28 text

Future plans
“ROBOT SWORDSMAN FIGHT.” BY PATRICK GAGE KELLEY

Slide 29

Slide 29 text

Break more things
Start testing whole DC outages
Break multiple services at once
Distribute failure testing to teams
Automate

Slide 31

Slide 31 text

Summary
Failures will happen
Proactively test failure handling now
Choose something easy: app server, cache
Automate later

Slide 32

Slide 32 text

Thank you.
pagerduty.com/jobs