Failure Friday: Start Injecting Failure Today!

Doug Barth
September 12, 2014

DevOpsDays Toronto 2014

Video: http://vimeo.com/107528697

Transcript

  1. 9/15/14 @dougbarth DEVOPSDAYS TORONTO 2014 Failure Friday!

  2. Dev Ops

  3. DevOps Engineer

  4. Image: “DO NOT FEAR FAILURE” BY TOMASZ STASIUK

  5. How is babby PagerDuty formed?

  6. 9/15/14 FAILURE FRIDAY!

  7. Designed for reliability

    Downstream providers fail: 3 phone providers, 3 email providers, 6 SMS providers. PagerDuty providers fail: 2 cloud providers, 3 data centers.
  8. Hung up on details

    Bugs in exceptional code paths; systems not recovering as quickly as expected; what is normal when things are abnormal?
  9. 9/15/14 FAILURE FRIDAY!

  10. Simian Army

    Chaos Monkey; Latency Monkey; Chaos Gorilla; Chaos Kong. Image: “WP7WALLPAPER_EVIL_MONKEY_09” BY SKYLER817
  11. Keep it simple

    Image: “KISS BAND MEMBER CUPCAKES” BY CLEVER CUPCAKES
  12. Process

    Image: “HOW TO DRAW AN OWL” BY CHESTER
  13. Get buy-in

    Image: “ANGRY BOSS” BY KAUSHAL KARKHANIS
  14. Schedule

    1 hour recurring meeting; developers & operations; list the attacks and identify a victim; finish as much as possible
  15. Before starting

    Disable cron jobs & the configuration management (CM) system; announce the start; open up relevant dashboards; leave alarms enabled
  16. Attacks

    Test a single host and then a data center (DC); 5 minutes; return to a working state; stop if things break
  17. Keep a log

    Keep track of actions taken; times are super important; also track discoveries and TODOs; share dashboards/metrics; chat rooms make this easy
  18. Graphs are awesome

  19. Finishing up

    Sound the all clear; enable crons & CM; move TODOs to the issue tracker
  20. Attack strategies

    Image: “UNICORN ATTACK!” BY SAM HOWZIT

  21. service cassandra stop

  22. shutdown -r now

  23. iptables -I INPUT 1 -p tcp --dport 9160 -j DROP

    iptables -I INPUT 1 -p tcp --dport 7000 -j DROP
    iptables -I OUTPUT 1 -p tcp --sport 9160 -j DROP
    iptables -I OUTPUT 1 -p tcp --sport 7000 -j DROP
  24. tc qdisc add dev eth0 root netem delay 500ms 100ms loss 5%
  25. Image: “RESULTS READER BOARD” BY ROSA SAY

  26. Issues fixed

    Aggressive restarts by monit; large files on ext3 volumes; failing to restart due to a bad /etc/fstab file; high latency from a network-isolated cache; low capacity with a lost DC; missing alerts/metrics
  27. Cultural impact

    Knowledge sharing; highlights untestable systems; keeps failure handling on everyone’s mind
  28. Future plans

    Image: “ROBOT SWORDSMAN FIGHT.” BY PATRICK GAGE KELLEY
  29. Break more things

    Start testing whole DC outages; break multiple services at once; distribute failure testing to teams; automate
  31. Summary

    Failures will happen; proactively test failure handling now; choose something easy (app server, cache); automate later
  32. pagerduty.com/jobs Thank you.
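
Slides 15 and 19 bracket each session: disable cron jobs and the CM system and announce the start, then re-enable them and sound the all clear. A minimal sketch of that wrapper follows. The service names `cron` and `chef-client` and the `announce` helper are assumptions for illustration, not taken from the talk. The script only prints commands unless `FF_RUN=1` is set.

```shell
#!/usr/bin/env bash
# Sketch of the "before starting" / "finishing up" steps.
# Dry-run by default: commands are printed, not executed. Set FF_RUN=1 to execute.
FF_RUN="${FF_RUN:-0}"
CRON_SERVICE="${CRON_SERVICE:-cron}"      # assumption: named "crond" on some distributions
CM_SERVICE="${CM_SERVICE:-chef-client}"   # assumption: hypothetical CM agent service name

# Execute the command when FF_RUN=1, otherwise print it.
run() {
  if [ "$FF_RUN" = "1" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

# Placeholder announcement: replace with a post to the team chat room.
announce() {
  printf 'ANNOUNCE: %s\n' "$*"
}

# Announce the start, then stop cron and the CM agent so they do not undo an attack.
ff_begin() {
  announce "Failure Friday starting"
  run service "$CRON_SERVICE" stop
  run service "$CM_SERVICE" stop
}

# Re-enable the CM agent and cron, then sound the all clear.
ff_end() {
  run service "$CM_SERVICE" start
  run service "$CRON_SERVICE" start
  announce "All clear"
}
```

Calling `ff_begin` before the first attack and `ff_end` after the last one keeps the two steps symmetric, so nothing stays disabled once the meeting ends.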
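
Slide 16 asks that each attack run for 5 minutes and then return to a working state, and slides 23 and 24 show the network attacks themselves. The sketch below pairs each attack with a matching rollback. The function names are hypothetical, the rollback commands (`iptables -D`, `tc qdisc del`) are one reasonable way to undo the rules rather than something shown in the slides, and the rules are inserted per port rather than in the slide's exact order. Commands are only printed unless `FF_RUN=1` is set, in which case root is required.

```shell
#!/usr/bin/env bash
# Sketch of timed network attacks with rollback. Dry-run unless FF_RUN=1.
FF_RUN="${FF_RUN:-0}"

# Execute the command when FF_RUN=1, otherwise print it.
run() {
  if [ "$FF_RUN" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

# isolate_ports add|del PORT...
# Drops (or stops dropping) TCP traffic to and from the given local ports.
isolate_ports() {
  local action="$1"; shift
  local port
  for port in "$@"; do
    if [ "$action" = "add" ]; then
      run iptables -I INPUT 1 -p tcp --dport "$port" -j DROP
      run iptables -I OUTPUT 1 -p tcp --sport "$port" -j DROP
    else
      run iptables -D INPUT -p tcp --dport "$port" -j DROP
      run iptables -D OUTPUT -p tcp --sport "$port" -j DROP
    fi
  done
}

# degrade_network add|del DEV
# Adds (or removes) the netem delay and loss settings from slide 24.
degrade_network() {
  if [ "$1" = "add" ]; then
    run tc qdisc add dev "$2" root netem delay 500ms 100ms loss 5%
  else
    run tc qdisc del dev "$2" root
  fi
}

# timed_attack SECONDS ATTACK_FUNCTION ARGS...
# Starts the attack, waits, then rolls it back.
timed_attack() {
  local seconds="$1" attack="$2"; shift 2
  "$attack" add "$@"
  run sleep "$seconds"
  "$attack" del "$@"
}

# Example (dry run): timed_attack 300 isolate_ports 9160 7000
```

In a real run it would be worth also rolling back on interruption, since "stop if things break" usually means aborting before the timer expires.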
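
Slide 17 stresses keeping a log with accurate times, and slide 19 moves TODOs to the issue tracker afterwards. The talk suggests a chat room for this. As an alternative sketch, a timestamped text file works too. The file name `ff-log.txt` and the function names are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch of a timestamped Failure Friday log kept in a plain text file.
FF_LOG="${FF_LOG:-ff-log.txt}"   # assumption: hypothetical log file name

# ff_log KIND MESSAGE...
# Appends one line with a UTC timestamp. KIND is e.g. ACTION, DISCOVERY or TODO.
ff_log() {
  local kind="$1"; shift
  printf '%s [%s] %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$kind" "$*" >> "$FF_LOG"
}

# ff_todos
# Lists the TODO entries so they can be copied into the issue tracker.
ff_todos() {
  grep -F '[TODO]' "$FF_LOG"
}

# Example:
#   ff_log ACTION "stopped cassandra on one host"
#   ff_log TODO "add alert for missing metrics"
#   ff_todos
```

UTC timestamps make it easier to line log entries up against dashboards and graphs after the session.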