Failure Friday: Start Injecting Failure Today! (DevOpsDays Austin 2015)

Video: https://vimeo.com/130554527

What would happen to your system if one of your app servers died right now? What about your database server? What if they're just slow? Does your application handle it gracefully? Does your development team get paged? Are you sure?

Netflix famously uses their Simian Army to test these scenarios in production, but setting up that automation might be far down the priority list of a growing startup.

In this talk, we will discuss how PagerDuty started injecting failure into our production systems with minimal effort and the full support of the development teams. We will discuss why you should start proactively injecting failure and the exact steps you can take. We will go over the importance of setting an agenda and of keeping a log of the actions taken and the TODOs that were uncovered. We will talk about why I think your metrics should be linkable, and why you should leave your alerts on during these planned failures. Finally, we will talk about the benefits your company will get from causing all this chaos. By the end of this talk, I hope to have inspired you to go start breaking your production systems on purpose.

Doug Barth

May 05, 2015

Transcript

  1. 5/5/15
    @dougbarth
    DEVOPSDAYS AUSTIN 2015
    Failure Friday!

  2. 5/5/15
    FAILURE FRIDAY!
    Dev
    Ops

  3. 5/5/15
    FAILURE FRIDAY!

  4. 5/5/15
    “DO NOT FEAR FAILURE” BY TOMASZ STASIUK

  5. 5/5/15
    FAILURE FRIDAY!
    How is babby PagerDuty formed?

  6. 5/5/15
    FAILURE FRIDAY!

  7. 5/5/15
    Designed for reliability
    FAILURE FRIDAY!
    Downstream providers fail
    3 phone providers
    3 email providers
    6 SMS providers
    PagerDuty providers fail
    2 cloud providers
    3 data centers

  8. 5/5/15
    Hung up on details
    FAILURE FRIDAY!
    Bugs in exceptional code paths
    Systems not recovering as quickly as
    expected
    What is normal when things are
    abnormal?

  9. 5/5/15
    FAILURE FRIDAY!

  10. 5/5/15
    FAILURE FRIDAY!

  11. 5/5/15
    Keep it simple
    “KISS BAND MEMBER CUPCAKES” BY CLEVER CUPCAKES

  12. 5/5/15
    Process
    “HOW TO DRAW AN OWL” BY CHESTER

  13. 5/5/15
    Get buy-in
    “ANGRY BOSS” BY KAUSHAL KARKHANIS

  14. 5/5/15
    Schedule
    FAILURE FRIDAY!
    1 hour recurring meeting
    Developers & Operations
    List of attacks and identify victim
    Finish as much as possible

  15. 5/5/15
    Before starting
    FAILURE FRIDAY!
    Conference call
    Announce the start
    Disable CM system
    Open up relevant dashboards
    Leave alarms enabled

  16. 5/5/15
    Attacks
    FAILURE FRIDAY!
    Test a single host and then DC
    5 minutes
    Return to a working state
    Stop if things break

  17. 5/5/15
    Keep a log
    FAILURE FRIDAY!
    Keep track of actions taken
    Times are super important
    Also track discoveries and TODOs
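
One way to keep such a log is a tiny shell helper that stamps every entry with the time. This is only a sketch: the `log_entry` function, the `LOG_FILE` variable, the entry kinds, and the sample entries are all made up for illustration and are not from the talk.

```shell
# Hypothetical helper: append UTC-timestamped entries to a plain-text log.
# LOG_FILE, log_entry and the ACTION/DISCOVERY/TODO kinds are illustrative names.
LOG_FILE="${LOG_FILE:-failure-friday.log}"

log_entry() {
  kind="$1"
  shift
  # ISO 8601 UTC timestamp, then the entry kind, then the free-form message.
  printf '%s [%s] %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$kind" "$*" >> "$LOG_FILE"
}

# Example entries (invented for illustration).
log_entry ACTION "stopped cassandra on one host"
log_entry DISCOVERY "dashboard for the host is hard to find"
log_entry TODO "link the host dashboard from the runbook"
```

Because each line starts with a timestamp, the log can later be lined up against dashboards and alerts from the same window.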

  18. 5/5/15
    Use a dedicated chat room
    FAILURE FRIDAY!

  19. 5/5/15
    Finishing up
    FAILURE FRIDAY!
    Sound the all clear
    Enable configuration management
    Move TODOs to issue tracker

  20. 5/5/15
    Attack Strategies
    “UNICORN ATTACK!” BY SAM HOWZIT

  21. 5/5/15
    FAILURE FRIDAY!
    service cassandra stop

  22. 5/5/15
    FAILURE FRIDAY!
    service cassandra pause

  23. 5/5/15
    FAILURE FRIDAY!
    shutdown -r now

  24. 5/5/15
    FAILURE FRIDAY!
    iptables -I INPUT 1 -p tcp --dport 9160 -j DROP
    iptables -I INPUT 1 -p tcp --dport 7000 -j DROP
    iptables -I OUTPUT 1 -p tcp --sport 9160 -j DROP
    iptables -I OUTPUT 1 -p tcp --sport 7000 -j DROP
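
Since every attack should end by returning to a working state, it may help to pair each inserted rule with its matching delete. The sketch below assumes the rules shown on this slide (9160 and 7000 are Cassandra's default Thrift client and inter-node ports) and only prints the commands unless `APPLY=1` is set. The `APPLY` switch and the `run`, `block_ports`, and `unblock_ports` helpers are hypothetical names, not part of the talk.

```shell
# Sketch only: prints the commands by default; set APPLY=1 (as root) to run them.
# APPLY, run, block_ports and unblock_ports are hypothetical names.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# Insert DROP rules at the top of INPUT and OUTPUT for each port.
block_ports() {
  for port in "$@"; do
    run iptables -I INPUT 1 -p tcp --dport "$port" -j DROP
    run iptables -I OUTPUT 1 -p tcp --sport "$port" -j DROP
  done
}

# Delete the same rules by specification (-D takes the rule spec, no position).
unblock_ports() {
  for port in "$@"; do
    run iptables -D INPUT -p tcp --dport "$port" -j DROP
    run iptables -D OUTPUT -p tcp --sport "$port" -j DROP
  done
}

block_ports 9160 7000
unblock_ports 9160 7000
```

Deleting by rule specification rather than by position avoids removing the wrong rule if the chain changed during the attack.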

  25. 5/5/15
    FAILURE FRIDAY!
    tc qdisc add dev eth0 root netem delay 500ms 100ms loss 15%
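
The netem settings here add 500ms of delay with 100ms of jitter and 15% random packet loss to traffic leaving the interface. A time-boxed version that also removes the qdisc afterwards might look like the sketch below. It prints the commands unless `APPLY=1` is set; `APPLY`, `DEV`, `DURATION`, and the helper function names are assumptions made for illustration, and the 300-second default mirrors the 5-minute guideline from the attacks slide.

```shell
# Sketch only: prints the commands by default; set APPLY=1 (as root) to run them.
# APPLY, DEV, DURATION and the function names below are hypothetical.
DEV="${DEV:-eth0}"
DURATION="${DURATION:-300}"   # seconds; 5 minutes per attack

run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

slow_network() {
  run tc qdisc add dev "$DEV" root netem delay 500ms 100ms loss 15%
}

restore_network() {
  run tc qdisc del dev "$DEV" root
}

timed_attack() {
  # On Ctrl-C (or TERM), restore the network before exiting.
  trap 'restore_network; trap - INT TERM; exit 1' INT TERM
  slow_network
  # Only wait when the commands are really applied; a dry run returns at once.
  if [ "${APPLY:-0}" = "1" ]; then sleep "$DURATION"; fi
  restore_network
  trap - INT TERM
}

timed_attack
```

Putting the cleanup in the same function as the attack makes it harder to forget the "return to a working state" step.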

  26. 5/5/15
    “RESULTS READER BOARD” BY ROSA SAY

  27. 5/5/15
    Issues fixed
    FAILURE FRIDAY!
    Aggressive restarts by monit
    Large files on ext3 volumes
    Bad /etc/fstab file
    High latency from cache
    Low capacity with a lost DC
    Missing alerts/metrics

  28. 5/5/15
    Cultural impact
    FAILURE FRIDAY!
    Knowledge sharing
    Highlights untestable systems
    Keeps failure handling on everyone’s
    mind

  29. 5/5/15
    Future plans
    “ROBOT SWORDSMAN FIGHT.” BY PATRICK GAGE KELLEY

  30. 5/5/15
    Break more things
    FAILURE FRIDAY!
    Start testing whole DC outages
    Break multiple services at once
    Distribute failure testing to teams
    Automate

  31. 5/5/15
    Break more things
    FAILURE FRIDAY!
    Start testing whole DC outages
    Break multiple services at once
    Distribute failure testing to teams
    Automate

  32. 5/5/15
    Summary
    FAILURE FRIDAY!
    Failures will happen
    Proactively test failure handling now
    Choose something easy
    Automate later

  33. 5/5/15
    FAILURE FRIDAY!
    [email protected]
    PAGERDUTY.COM/JOBS

  34. 5/5/15
    pagerduty.com/jobs
    Thank you.
