
Failure Friday: Start Injecting Failure Today!

Doug Barth
November 11, 2013


Video: https://vimeo.com/79375466

Failure Friday is PagerDuty's manual process for injecting failure into our production systems. It is easy to get started using standard Linux tools. If you start injecting failure, you will discover things that don't work the way you intended them to. Once you find and fix those issues, your systems will be more reliable when failures are not from your own hand.



  1. Designed for reliability • Downstream providers fail • 3 phone providers • 3 email providers • 6 SMS providers • PagerDuty providers fail • 2 cloud providers • 3 data centers
  2. Hung up on details • Bugs in exceptional code paths • Systems not recovering as quickly as expected • What is normal when things are abnormal?
  3. Simian Army • Chaos Monkey • Latency Monkey • Chaos Gorilla • AWS only • Requires ASGs (image: “wp7wallpaper_Evil_monkey_09” by skyler817)
  4. Schedule • 1 hour recurring meeting • Developers & Operations • List of attacks and identify victim • Finish as much as possible
  5. Before starting • Disable cron jobs & CM system • Announce the start • Open up relevant dashboards • Leave alarms enabled
  6. Attacks • Test a single host and then DC • 5 minutes • Return to a working state • Stop if things break
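
The attack loop on this slide (break something, wait, return to a working state, stop early if things go wrong) could be sketched as a small shell helper. This is only an illustration under stated assumptions: `run_attack` is a hypothetical name, and the start and stop commands are whatever your team chooses, not commands taken from the talk.

```shell
#!/bin/sh
# Sketch of a timed attack. run_attack is a hypothetical helper name.
# Usage: run_attack DURATION_SECONDS START_CMD STOP_CMD

run_attack() {
  duration=$1
  start_cmd=$2
  stop_cmd=$3

  # "Stop if things break": on Ctrl-C or TERM, revert first, then exit.
  trap 'sh -c "$stop_cmd"; exit 130' INT TERM

  sh -c "$start_cmd"   # inject the failure
  sleep "$duration"    # let the failure sit (the slide suggests 5 minutes)
  sh -c "$stop_cmd"    # return to a working state

  # Remove the handlers so they do not leak into the rest of the session.
  trap - INT TERM
}

# Harmless demonstration: the "attack" only prints messages.
run_attack 1 "echo 'attack on'" "echo 'attack off'"

# A real run might look like the line below. "myservice" is a placeholder
# service name and 300 seconds matches the 5 minutes on the slide.
# run_attack 300 "sudo service myservice stop" "sudo service myservice start"
```

Passing the revert command up front means the way back to a working state is decided before the failure is injected, not improvised in the middle of an attack.
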
  7. Keep a log • Keep track of actions taken • Times are super important • Also track discoveries and TODOs • Share dashboards/metrics • Chat rooms make this easy
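
The slide recommends a chat room for the log. As a complementary sketch, a timestamped local log could look like the following. The names `log_event`, `list_todos`, and `LOG_FILE`, the default path, and the sample messages are all assumptions made for illustration; none of them come from the talk.

```shell
#!/bin/sh
# Sketch of a timestamped Failure Friday log. All names here are hypothetical.

# Where to write the log; override by setting LOG_FILE in the environment.
LOG_FILE=${LOG_FILE:-${TMPDIR:-/tmp}/failure-friday.log}

# log_event KIND MESSAGE...
# Appends one line with a UTC timestamp, a kind (ACTION, DISCOVERY or TODO),
# and the message, because the times matter when reading graphs afterwards.
log_event() {
  kind=$1
  shift
  printf '%s [%s] %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$kind" "$*" >> "$LOG_FILE"
}

# list_todos prints only the TODO lines, ready to move to an issue tracker.
list_todos() {
  grep ' \[TODO\] ' "$LOG_FILE"
}

# Example entries (the host name and messages are invented for illustration).
log_event ACTION "stopped service on host-1"
log_event DISCOVERY "clients took longer than expected to fail over"
log_event TODO "add an alert for slow failover"
```

Keeping the kind in a fixed position makes it easy to pull the TODOs out at the end of the session with `list_todos`.
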
  8. Finishing up • Sound the all clear • Enable crons & CM • Move TODOs to issue tracker
  9. Network Isolation
       iptables -I INPUT 1 -p tcp --dport 9160 -j DROP
       iptables -I INPUT 1 -p tcp --dport 7000 -j DROP
       iptables -I OUTPUT 1 -p tcp --sport 9160 -j DROP
       iptables -I OUTPUT 1 -p tcp --sport 7000 -j DROP
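
The slide shows only the rules that start the isolation. Returning to a working state also needs the matching removals, so one possible sketch is a helper that prints both sets of commands for a list of ports. `isolation_rules` is a hypothetical name. It only prints commands and executes nothing, so the output can be reviewed before it is run as root. Ports 9160 and 7000 are the ones on the slide; they match Cassandra's default Thrift and inter-node ports, although the slide does not name the service.

```shell
#!/bin/sh
# Sketch: print iptables commands to isolate a host on some TCP ports, and
# the commands that undo the isolation. isolation_rules is a hypothetical name.
# Usage: isolation_rules add|remove PORT...

isolation_rules() {
  action=$1
  shift
  for port in "$@"; do
    if [ "$action" = add ]; then
      # -I CHAIN 1 inserts at the top so the DROP wins over later ACCEPT rules.
      echo "iptables -I INPUT 1 -p tcp --dport $port -j DROP"
      echo "iptables -I OUTPUT 1 -p tcp --sport $port -j DROP"
    else
      # -D with the same rule specification deletes the first matching rule.
      echo "iptables -D INPUT -p tcp --dport $port -j DROP"
      echo "iptables -D OUTPUT -p tcp --sport $port -j DROP"
    fi
  done
}

# Print the commands for the two ports from the slide.
isolation_rules add 9160 7000
isolation_rules remove 9160 7000
```

To apply the rules rather than print them, the output could be piped to a root shell, for example `isolation_rules add 9160 7000 | sudo sh`, once the printed commands have been checked.
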
  10. Issues fixed • Aggressive restarts by monit • Large files on ext3 volumes • Failing to restart due to bad /etc/fstab file • High latency from network isolated cache • Low capacity with a lost DC • Missing alerts/metrics
  11. Ken Rose, Software Engineer: “FF ramps up new on-call people quickly. If we waited for errors to actually happen, it might be a while before a new on-call person actually sees a certain type of error (e.g., db host down). FF brings those types of errors to the surface so everyone involved can see what should happen.”
  12. Break more things • Start testing whole DC outages • Break multiple services at once • Distribute failure testing to teams • Automate
  13. Summary • Failures will happen • Proactively test failure handling now • Choose something easy: app server, cache • Automate later