Failure Friday: Start Injecting Failure Today!

Doug Barth
November 11, 2013

Video: https://vimeo.com/79375466

Failure Friday is PagerDuty's manual process for injecting failure into our production systems. It is easy to get started using standard Linux tools. If you start injecting failure, you will discover things that don't work the way you intended. Once you find and fix those issues, your systems will be more reliable when failures are not from your own hand.

Transcript

  1. Failure Friday!
    DEVOPSDAYS LONDON 2013

  2. OPERATIONS ENGINEER
    Doug Barth
    @dougbarth

  3. American
    Just George Washington riding an eagle and holding an RPG.

  4. Failure Friday
    “Do not fear failure” by Tomasz Stasiuk

  5. What is PagerDuty?

  6. 911 dispatch for IT
    999 for you guys :)

  7. Designed for reliability
    • Downstream providers fail
      • 3 phone providers
      • 3 email providers
      • 6 SMS providers
    • PagerDuty providers fail
      • 2 cloud providers
      • 3 data centers

  8. Hung up on details
    • Bugs in exceptional code paths
    • Systems not recovering as quickly as expected
    • What is normal when things are abnormal?

  9. (no transcript text)

  10. Simian Army
    • Chaos Monkey
    • Latency Monkey
    • Chaos Gorilla
    • AWS only
    • Requires ASGs
    “wp7wallpaper_Evil_monkey_09” by skyler817

  11. Keep it simple
    “KISS Band Member Cupcakes” by Clever Cupcakes

  12. Process
    “How to Draw an Owl” by Chester

  13. Get buy in
    “Angry Boss” by Kaushal Karkhanis

  14. Schedule
    • 1 hour recurring meeting
    • Developers & Operations
    • List of attacks and identify victim
    • Finish as much as possible

  15. Before starting
    • Disable cron jobs & CM system
    • Announce the start
    • Open up relevant dashboards
    • Leave alarms enabled

  16. Attacks
    • Test a single host and then DC
    • 5 minutes
    • Return to a working state
    • Stop if things break
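
The attack rules above (a fixed window, a return to a working state, and an early stop if things break) can be sketched as a small wrapper. This is a minimal sketch, not PagerDuty's tooling: `run_attack` and its arguments are hypothetical names, and the 300-second default simply mirrors the 5 minutes on the slide.

```shell
# Hypothetical helper: run_attack INJECT_CMD REVERT_CMD [SECONDS]
# Injects a failure, holds it for a fixed window, and always reverts.
run_attack() {
    (
        inject_cmd=$1
        revert_cmd=$2
        duration=${3:-300}   # default mirrors the 5 minutes on the slide

        # Revert whenever this subshell exits: after the window ends,
        # when the injection itself fails, or when the operator stops
        # early with Ctrl-C.
        trap 'sh -c "$revert_cmd"' EXIT
        trap 'exit 130' INT TERM

        sh -c "$inject_cmd" || exit 1
        sleep "$duration"
    )
}
```

Run as root, a call such as `run_attack 'service cassandra stop' 'service cassandra start'` would hold the process down for the window and then bring it back. Keeping the revert in an exit trap means an interrupted attack still returns the host to a working state.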

  17. Keep a log
    • Keep track of actions taken
    • Times are super important
    • Also track discoveries and TODOs
    • Share dashboards/metrics
    • Chat rooms make this easy
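
The slide suggests a chat room for the log. Where a plain file is preferred, a timestamped entry helper could look like the sketch below. The function name `ff_log`, the `FF_LOG` path, and the entry kinds (ACTION, DISCOVERY, TODO) are assumptions chosen for illustration.

```shell
# Hypothetical helper: append a UTC-timestamped entry to a shared log file.
FF_LOG=${FF_LOG:-failure-friday.log}

ff_log() {
    kind=$1   # e.g. ACTION, DISCOVERY, TODO
    shift
    # Times matter most, so every entry leads with an ISO 8601 UTC timestamp.
    printf '%s %s %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "$kind" "$*" >> "$FF_LOG"
}
```

For example, `ff_log ACTION "stopped cassandra on host1"` records what was done and when, and `grep ' TODO ' "$FF_LOG"` later pulls out the items to move to the issue tracker.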

  18. Graphs are awesome

  19. Finishing up
    • Sound the all clear
    • Enable crons & CM
    • Move TODOs to issue tracker

  20. Attack Strategies
    “Unicorn Attack!” by Sam Howzit

  21. Process Failure
    service cassandra stop
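
A process-failure attack also needs the restart and a status check afterwards. The sketch below strings those steps together. Note that `service` takes the service name first and the action second. The `run` and `process_failure` helpers and the `DRY_RUN` variable are hypothetical, and the sketch defaults to printing commands rather than running them.

```shell
# run: print the command when DRY_RUN=1 (the default in this sketch),
# execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: process_failure SERVICE [SECONDS]
process_failure() {
    svc=$1
    hold=${2:-300}   # how long to leave the process down
    run service "$svc" stop
    run sleep "$hold"
    run service "$svc" start
    run service "$svc" status
}
```

Calling `process_failure cassandra` prints the four steps for review. Setting `DRY_RUN=0` as root would execute them.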

  22. Reboot hosts
    shutdown -r now

  23. Network Isolation
    iptables -I INPUT 1 -p tcp --dport 9160 -j DROP
    iptables -I INPUT 1 -p tcp --dport 7000 -j DROP
    iptables -I OUTPUT 1 -p tcp --sport 9160 -j DROP
    iptables -I OUTPUT 1 -p tcp --sport 7000 -j DROP
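
The rules above drop traffic on ports 9160 and 7000, Cassandra's default Thrift client and inter-node ports. Returning to a working state needs a matching `-D` deletion for every `-I` insertion. The sketch below generates both sets from one port list so they cannot drift apart. `isolation_rules` is a hypothetical helper that only prints commands.

```shell
# Hypothetical helper: isolation_rules add|remove PORT...
# Prints (does not run) the iptables commands for the given TCP ports.
isolation_rules() {
    case $1 in
        add)    flag='-I'; pos=' 1' ;;   # insert at the top of the chain
        remove) flag='-D'; pos=''   ;;   # delete by matching rule spec
        *) echo "usage: isolation_rules add|remove PORT..." >&2; return 2 ;;
    esac
    shift
    for port in "$@"; do
        printf 'iptables %s INPUT%s -p tcp --dport %s -j DROP\n'  "$flag" "$pos" "$port"
        printf 'iptables %s OUTPUT%s -p tcp --sport %s -j DROP\n' "$flag" "$pos" "$port"
    done
}
```

The output can be reviewed and then piped to a root shell, for example `isolation_rules add 9160 7000 | sudo sh` to start the attack and `isolation_rules remove 9160 7000 | sudo sh` to end it.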

  24. Slow node
    tc qdisc add dev eth0 root netem delay 500ms 100ms loss 5%
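
The netem settings above add 500ms of delay with 100ms of jitter and drop 5% of packets on `eth0`. Ending the attack means deleting that qdisc again. The sketch below pairs the add with its removal and a check. `slow_node_cmd` is a hypothetical helper that only prints commands, and the `eth0` default is taken from the slide.

```shell
# Hypothetical helper: slow_node_cmd add|remove|show [DEVICE]
# Prints (does not run) the tc command for each phase of the attack.
slow_node_cmd() {
    action=$1
    dev=${2:-eth0}
    case $action in
        add)    printf 'tc qdisc add dev %s root netem delay 500ms 100ms loss 5%%\n' "$dev" ;;
        remove) printf 'tc qdisc del dev %s root netem\n' "$dev" ;;
        show)   printf 'tc qdisc show dev %s\n' "$dev" ;;
        *)      echo "usage: slow_node_cmd add|remove|show [DEVICE]" >&2; return 2 ;;
    esac
}
```

As with any printed command, the output can be reviewed first and then run as root, for example `slow_node_cmd remove | sudo sh` once the attack window is over.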

  25. Benefits
    “Results Reader Board” by Rosa Say

  26. Issues fixed
    • Aggressive restarts by monit
    • Large files on ext3 volumes
    • Failing to restart due to bad /etc/fstab file
    • High latency from network isolated cache
    • Low capacity with a lost DC
    • Missing alerts/metrics

  27. Cultural impact
    • Knowledge sharing
    • Highlights untestable systems
    • Keeps failure handling on everyone’s mind

  28. Software Engineer
    Ken Rose
    “FF ramps up new on-call people quickly. If we waited for errors to actually happen, it might be a while before a new on-call person actually sees a certain type of error (e.g., db host down). FF brings those types of errors to the surface so everyone involved can see what should happen.”

  29. Future plans
    “Robot Swordsman Fight.” by Patrick Gage Kelley

  30. Break more things
    • Start testing whole DC outages
    • Break multiple services at once
    • Distribute failure testing to teams
    • Automate

  31. Summary
    • Failures will happen
    • Proactively test failure handling now
    • Choose something easy: app server, cache
    • Automate later

  32. http://pagerduty.com/jobs
    Thank you.
    OPERATIONS ENGINEER
    Doug Barth
    @dougbarth
