Failure Friday: Start Injecting Failure Today! (DevOpsDays Austin 2015)

Video: https://vimeo.com/130554527

What would happen to your system if one of your app servers died right now? What about your database server? What if they're just slow? Does your application handle it gracefully? Does your development team get paged? Are you sure?

Netflix famously uses their Simian Army to test these scenarios in production, but setting up that automation might be far down the priority list of a growing startup.

In this talk, we will discuss how PagerDuty started injecting failure into our production systems with minimal effort and the full support of the development teams. We will discuss why you should start proactively injecting failure and the exact steps you can take. We will go over the importance of setting an agenda, keeping a log of the actions taken, and todos that were uncovered. We will talk about why I think your metrics should be linkable, and why you should leave your alerts on during these planned failures. Finally, we will talk about the benefits your company will get from causing all this chaos. At the end of this talk, I hope to have inspired you to go start breaking your production systems, on purpose.
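The log of actions taken, with their times, can be as simple as an append-only text file. The following is a minimal sketch, not the tooling used at PagerDuty: the `LOG_FILE` variable, the `log_entry` function, and the entry kinds are hypothetical names chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch: timestamped log of actions, discoveries, and TODOs.
# LOG_FILE and log_entry are hypothetical names, not from the talk.
LOG_FILE="${LOG_FILE:-failure-friday.log}"

# Append one tab-separated entry: UTC timestamp, kind, free-form message.
# Suggested kinds: ACTION, DISCOVERY, TODO.
log_entry() {
  kind="$1"
  shift
  printf '%s\t%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$kind" "$*" >> "$LOG_FILE"
}
```

For example, `log_entry ACTION "stopped cassandra on one host"` records when the attack began, which makes it easier to line the action up with dashboards afterwards.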

Doug Barth

May 05, 2015

Transcript

  1. 5/5/15 @dougbarth DEVOPSDAYS AUSTIN 2015 Failure Friday!

  2. Dev Ops

  3. 5/5/15 FAILURE FRIDAY!

  4. “DO NOT FEAR FAILURE” BY TOMASZ STASIUK

  5. How is babby PagerDuty formed?

  6. 5/5/15 FAILURE FRIDAY!

  7. Designed for reliability: Downstream providers fail (3 phone providers, 3 email providers, 6 SMS providers); PagerDuty providers fail (2 cloud providers, 3 data centers)
  8. Hung up on details: Bugs in exceptional code paths; Systems not recovering as quickly as expected; What is normal when things are abnormal?
  9. 5/5/15 FAILURE FRIDAY!

  10. 5/5/15 FAILURE FRIDAY!

  11. Keep it simple “KISS BAND MEMBER CUPCAKES” BY CLEVER CUPCAKES
  12. Process “HOW TO DRAW AN OWL” BY CHESTER

  13. Get buy in “ANGRY BOSS” BY KAUSHAL KARKHANIS

  14. Schedule: 1 hour recurring meeting; Developers & Operations; List of attacks and identify victim; Finish as much as possible
  15. Before starting: Conference call; Announce the start; Disable CM system; Open up relevant dashboards; Leave alarms enabled
  16. Attacks: Test a single host and then DC; 5 minutes; Return to a working state; Stop if things break
  17. Keep a log: Keep track of actions taken; Times are super important; Also track discoveries and TODOs
  18. Use a dedicated chat room

  19. Finishing up: Sound the all clear; Enable configuration management; Move TODOs to issue tracker
  20. Attack Strategies “UNICORN ATTACK!” BY SAM HOWZIT

  21. service cassandra stop

  22. service cassandra pause

  23. shutdown -r now

  24. iptables -I INPUT 1 -p tcp --dport 9160 -j DROP
      iptables -I INPUT 1 -p tcp --dport 7000 -j DROP
      iptables -I OUTPUT 1 -p tcp --sport 9160 -j DROP
      iptables -I OUTPUT 1 -p tcp --sport 7000 -j DROP
  25. tc qdisc add dev eth0 root netem delay 500ms 100ms loss 15%
  26. “RESULTS READER BOARD” BY ROSA SAY

  27. Issues fixed: Aggressive restarts by monit; Large files on ext3 volumes; Bad /etc/fstab file; High latency from cache; Low capacity with a lost DC; Missing alerts/metrics
  28. Cultural impact: Knowledge sharing; Highlights untestable systems; Keeps failure handling on everyone’s mind
  29. Future plans “ROBOT SWORDSMAN FIGHT.” BY PATRICK GAGE KELLEY

  30. Break more things: Start testing whole DC outages; Break multiple services at once; Distribute failure testing to teams; Automate
  31. Break more things: Start testing whole DC outages; Break multiple services at once; Distribute failure testing to teams; Automate
  32. Summary: Failures will happen; Proactively test failure handling now; Choose something easy; Automate later
  33. doug@pagerduty.com PAGERDUTY.COM/JOBS

  34. pagerduty.com/jobs Thank you.
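The attack slides combine two ideas: isolate a host with `iptables` rules, and hold the failure for about 5 minutes before returning to a working state. The sketch below shows one way those two steps might be wrapped together so the rollback is not forgotten. It is an illustration under assumptions, not the script used in the talk: ports 9160 and 7000 come from the slides, while the `DRY_RUN` switch and the `run`, `isolate_rules`, and `attack` functions are hypothetical names. It defaults to printing the commands rather than executing them.

```shell
#!/usr/bin/env bash
# Sketch: isolate a Cassandra host for a fixed window, then restore it.
# Ports 9160 and 7000 come from the slides; all names below are hypothetical.
set -u

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them
PORTS="9160 7000"

# Print the command in dry-run mode, otherwise execute it (needs root).
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# "-I" inserts the DROP rules from the slides, "-D" deletes the same rules.
isolate_rules() {
  local action="$1" port
  for port in $PORTS; do
    if [ "$action" = "-I" ]; then
      run iptables -I INPUT 1 -p tcp --dport "$port" -j DROP
      run iptables -I OUTPUT 1 -p tcp --sport "$port" -j DROP
    else
      run iptables -D INPUT -p tcp --dport "$port" -j DROP
      run iptables -D OUTPUT -p tcp --sport "$port" -j DROP
    fi
  done
}

# Hold the attack for $1 seconds (default 300, i.e. 5 minutes), then undo it.
attack() {
  local seconds="${1:-300}"
  trap 'isolate_rules -D' EXIT   # also remove the rules if the shell exits early
  isolate_rules -I
  sleep "$seconds"
  isolate_rules -D
  trap - EXIT
}
```

Running `attack 300` in the default dry-run mode prints the eight `iptables` commands without touching the host. Running it as root with `DRY_RUN=0` would apply them for real, so it is worth reading the dry-run output first.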