Failure Friday: Start Injecting Failure Today! (DevOpsDays Austin 2015)

Video: https://vimeo.com/130554527

What would happen to your system if one of your app servers died right now? What about your database server? What if they're just slow? Does your application handle it gracefully? Does your development team get paged? Are you sure?

Netflix famously uses their Simian Army to test these scenarios in production, but setting up that automation might be far down the priority list of a growing startup.

In this talk, we will discuss how PagerDuty started injecting failure into our production systems with minimal effort and the full support of the development teams. We will discuss why you should start proactively injecting failure and the exact steps you can take. We will go over the importance of setting an agenda, keeping a log of the actions taken, and todos that were uncovered. We will talk about why I think your metrics should be linkable, and why you should leave your alerts on during these planned failures. Finally, we will talk about the benefits your company will get from causing all this chaos. At the end of this talk, I hope to have inspired you to go start breaking your production systems, on purpose.

Doug Barth

May 05, 2015

Transcript

  1. 5/5/15 FAILURE FRIDAY! Designed for reliability

    Downstream providers fail: 3 phone providers, 3 email providers, 6 SMS providers. PagerDuty providers fail: 2 cloud providers, 3 data centers.
  2. Hung up on details

    Bugs in exceptional code paths; systems not recovering as quickly as expected; what is normal when things are abnormal?
  3. Schedule

    1 hour recurring meeting; developers & operations; list of attacks and identify victim; finish as much as possible.
  4. Before starting

    Conference call; announce the start; disable CM system; open up relevant dashboards; leave alarms enabled.
  5. Attacks

    Test a single host and then DC; 5 minutes; return to a working state; stop if things break.
  6. Keep a log

    Keep track of actions taken; times are super important; also track discoveries and TODOs.
  7. Finishing up

    Sound the all clear; enable configuration management; move TODOs to issue tracker.
  8. iptables -I INPUT 1 -p tcp --dport 9160 -j DROP

    iptables -I INPUT 1 -p tcp --dport 7000 -j DROP
    iptables -I OUTPUT 1 -p tcp --sport 9160 -j DROP
    iptables -I OUTPUT 1 -p tcp --sport 7000 -j DROP
  9. Issues fixed

    Aggressive restarts by monit; large files on ext3 volumes; bad /etc/fstab file; high latency from cache; low capacity with a lost DC; missing alerts/metrics.
  10. Break more things

    Start testing whole DC outages; break multiple services at once; distribute failure testing to teams; automate.
  12. Summary

    Failures will happen; proactively test failure handling now; choose something easy; automate later.
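
The attack, logging, and cleanup slides above could be tied together in a small script. What follows is only a sketch under stated assumptions, not PagerDuty's tooling: the names `drop_rules`, `AttackLog`, and `run_attack` are hypothetical, and the only details taken from the slides are the four iptables rules (ports 9160 and 7000), the 5 minute duration, the return to a working state, and the timestamped log of actions and TODOs. Removing the rules with `iptables -D` is an assumption about how the attack would be undone.

```python
import datetime as dt
import subprocess
import time
from dataclasses import dataclass, field

# Ports copied from the iptables slide; everything else here is illustrative.
PORTS = (9160, 7000)


def drop_rules(ports=PORTS):
    """Build the four rule specs from the slide: inbound by --dport, outbound by --sport."""
    rules = []
    for chain, flag in (("INPUT", "--dport"), ("OUTPUT", "--sport")):
        for port in ports:
            rules.append([chain, "-p", "tcp", flag, str(port), "-j", "DROP"])
    return rules


def insert_cmd(rule):
    """iptables command that inserts the rule at position 1 of its chain."""
    chain, *spec = rule
    return ["iptables", "-I", chain, "1", *spec]


def delete_cmd(rule):
    """iptables command that removes the same rule again (assumed cleanup step)."""
    chain, *spec = rule
    return ["iptables", "-D", chain, *spec]


@dataclass
class AttackLog:
    """Timestamped record of actions, discoveries, and TODOs (hypothetical structure)."""
    entries: list = field(default_factory=list)

    def record(self, kind, text, now=None):
        now = now or dt.datetime.now(dt.timezone.utc)
        self.entries.append((now.isoformat(timespec="seconds"), kind, text))

    def todos(self):
        return [text for _, kind, text in self.entries if kind == "todo"]


def run_attack(log, rules, run=None, sleep=time.sleep, duration_s=300):
    """Apply the rules, hold for duration_s seconds, then always revert them."""
    run = run or (lambda cmd: subprocess.run(cmd, check=True))
    applied = []
    try:
        for rule in rules:
            run(insert_cmd(rule))
            applied.append(rule)
            log.record("action", " ".join(insert_cmd(rule)))
        sleep(duration_s)
    finally:
        # Return to a working state even if the attack is interrupted.
        for rule in reversed(applied):
            run(delete_cmd(rule))
            log.record("action", " ".join(delete_cmd(rule)))


if __name__ == "__main__":
    # Dry run: no commands are executed and no time is spent sleeping.
    log = AttackLog()
    run_attack(log, drop_rules(), run=lambda cmd: None, sleep=lambda s: None)
    log.record("todo", "hypothetical TODO noted during the attack")
    for entry in log.entries:
        print(*entry)
```

The `run` and `sleep` parameters are injectable so the sequence can be rehearsed without touching a host. Running it for real would require root privileges, and someone on the call would still need to watch the dashboards and stop if things break.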