Failure Friday: Start Injecting Failure Today!

Doug Barth
November 11, 2013

Video: https://vimeo.com/79375466

Failure Friday is PagerDuty's manual process for injecting failure into our production systems. It is easy to get started using standard Linux tools. If you start injecting failure, you will discover things that don't work the way you intended. Once you find and fix those issues, you'll have better reliability when failures are not from your own hand.

Transcript

  1. Failure Friday! DEVOPSDAYS LONDON 2013

  2. OPERATIONS ENGINEER Doug Barth doug@pagerduty.com @dougbarth

  3. American Just George Washington riding an eagle and holding an RPG.
  4. Failure Friday “Do not fear failure” by Tomasz Stasiuk

  5. What is PagerDuty?

  6. 911 dispatch for IT 999 for you guys :)

  7. Designed for reliability • Downstream providers fail • 3 phone providers • 3 email providers • 6 SMS providers • PagerDuty providers fail • 2 cloud providers • 3 data centers
  8. Hung up on details • Bugs in exceptional code paths • Systems not recovering as quickly as expected • What is normal when things are abnormal?
  9. None
  10. Simian Army • Chaos Monkey • Latency Monkey • Chaos Gorilla • AWS only • Requires ASGs “wp7wallpaper_Evil_monkey_09” by skyler817
  11. Keep it simple “KISS Band Member Cupcakes” by Clever Cupcakes

  12. Process “How to Draw an Owl” by Chester

  13. Get buy in “Angry Boss” by Kaushal Karkhanis

  14. Schedule • 1 hour recurring meeting • Developers & Operations • List of attacks and identify victim • Finish as much as possible
  15. Before starting • Disable cron jobs & CM system • Announce the start • Open up relevant dashboards • Leave alarms enabled
  16. Attacks • Test a single host and then DC • 5 minutes • Return to a working state • Stop if things break
  17. Keep a log • Keep track of actions taken • Times are super important • Also track discoveries and TODOs • Share dashboards/metrics • Chat rooms make this easy
  18. Graphs are awesome

  19. Finishing up • Sound the all clear • Enable crons & CM • Move TODOs to issue tracker
  20. Attack Strategies “Unicorn Attack!” by Sam Howzit

  21. Process Failure service stop cassandra

  22. Reboot hosts shutdown -r now

  23. Network Isolation
      iptables -I INPUT 1 -p tcp --dport 9160 -j DROP
      iptables -I INPUT 1 -p tcp --dport 7000 -j DROP
      iptables -I OUTPUT 1 -p tcp --sport 9160 -j DROP
      iptables -I OUTPUT 1 -p tcp --sport 7000 -j DROP
  24. Slow node tc qdisc add dev eth0 root netem delay 500ms 100ms loss 5%
  25. Benefits “Results Reader Board” by Rosa Say

  26. Issues fixed • Aggressive restarts by monit • Large files on ext3 volumes • Failing to restart due to bad /etc/fstab file • High latency from network isolated cache • Low capacity with a lost DC • Missing alerts/metrics
  27. Cultural impact • Knowledge sharing • Highlights untestable systems • Keeps failure handling on everyone’s mind
  28. Software Engineer Ken Rose “FF ramps up new on-call people quickly. If we waited for errors to actually happen, it might be a while before a new on-call person actually sees a certain type of error (e.g., db host down). FF brings those types of errors to the surface so everyone involved can see what should happen.”
  29. Future plans “Robot Swordsman Fight.” by Patrick Gage Kelley

  30. Break more things • Start testing whole DC outages • Break multiple services at once • Distribute failure testing to teams • Automate
  31. Summary • Failures will happen • Proactively test failure handling now • Choose something easy: app server, cache • Automate later
  32. http://pagerduty.com/jobs Thank you. OPERATIONS ENGINEER Doug Barth doug@pagerduty.com @dougbarth
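
The network isolation and slow node attacks on slides 23 and 24, together with the rule from slide 16 to return to a working state, could be wrapped in a small helper along these lines. This is a minimal sketch, not PagerDuty's actual tooling: the function names `isolation_rules`, `slow_node_rule`, `slow_node_undo`, and `apply_cmds`, and the `APPLY` variable, are hypothetical. The ports (9160, 7000) and the netem parameters come from the slides; the revert commands (`iptables -D` and `tc qdisc del`) are assumptions about how one might undo each attack. By default nothing is executed; commands are only printed with a UTC timestamp, which doubles as the action log slide 17 asks for.

```shell
#!/bin/sh
# Hypothetical Failure Friday helper. Dry-run unless APPLY=1 is set.

# Print iptables commands that drop TCP traffic on the given ports.
# $1 is -I (insert at position 1, start the attack) or -D (delete, revert).
isolation_rules() {
  ff_action=$1
  shift
  ff_pos=""
  if [ "$ff_action" = "-I" ]; then
    ff_pos=" 1"
  fi
  for ff_port in "$@"; do
    echo "iptables $ff_action INPUT$ff_pos -p tcp --dport $ff_port -j DROP"
    echo "iptables $ff_action OUTPUT$ff_pos -p tcp --sport $ff_port -j DROP"
  done
}

# Print the tc command that adds delay, jitter, and loss on device $1
# (values taken from slide 24).
slow_node_rule() {
  echo "tc qdisc add dev $1 root netem delay 500ms 100ms loss 5%"
}

# Print the tc command that removes the root qdisc on device $1 again.
slow_node_undo() {
  echo "tc qdisc del dev $1 root"
}

# Read commands from stdin, log each one with a UTC timestamp, and
# execute it only when APPLY=1 (requires root).
apply_cmds() {
  while IFS= read -r ff_cmd; do
    echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) $ff_cmd"
    if [ "${APPLY:-0}" = "1" ]; then
      # Word splitting is intentional: each line is a plain command.
      $ff_cmd
    fi
  done
}

# Example session (commented out so that sourcing this file has no effect):
#   isolation_rules -I 9160 7000 | apply_cmds   # start the attack
#   sleep 300                                   # 5 minutes, per slide 16
#   isolation_rules -D 9160 7000 | apply_cmds   # return to a working state
```

Generating the start and revert commands from the same function keeps the two in step, so the rollback cannot drift from the attack. Running once in dry-run mode before setting `APPLY=1` gives the team a chance to review exactly what will be changed on the victim host.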