Failure Friday (Ignite Talk)

Rich Adams
December 09, 2015

A brief look at how we manually test for failure at PagerDuty.

This was an Ignite talk (20 slides, auto advancing every 15 seconds), so the slides might not be terribly useful on their own.

It was presented at the "12 Talks of Cloudmas" Advanced AWS Meetup at the AWS Loft in San Francisco on December 9, 2015.

Also available from https://richadams.me/talks/ignite-failurefriday/


  1. View Slide

  2. View Slide

  3. :(
    Your server has disappeared for no adequately
    explained reason. Happy debugging.
    If you’d like to know more, you can search online later for this error: ERR_DEAL_WITH_IT

  4. Embrace Failure
    "I come from a long line of quitters.
    My father was a quitter, my
    grandfather was a quitter… I was
    raised to give up."

  5. Simian Army
    "We have found that the best defense against major
    unexpected failures is to fail often. By frequently
    causing failures, we force our services to be built in a
    way that is more resilient."
    The Netflix Simian Army is available on GitHub:

  6. “But We’re Not There Yet…”

  7. View Slide

  8. Failure Friday
    • 1 hour every Friday.
    • Operations and Development teams attend.
    • Manually cause failures (No automation required).
    • Uncovers issues.
    • Builds team culture.
    • Tests incident response…

  9. View Slide

  10. “Good Lord! I'm getting a reading
    of over 40 Mega-Fonzies!”

  11. Getting Started
    • Notify team(s).
    • Set the meeting.
    • Make an agenda.

  12. Get Ready to Rumble
    • Announce the start.
    • Disable cron jobs.
    • Leave alerts on!

  13. View Slide

  14. break
    wait 5 mins
    fix
    wait until !broken
    repeat

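The break / wait / fix / wait-until-recovered cycle on the slide above could be sketched roughly as follows. This is an illustration only, not the tooling used in the talk: `inject_failure`, `restore_service`, `is_healthy`, and the state file are hypothetical placeholders, and the 300-second default simply mirrors the "wait 5mins" step.

```shell
#!/bin/sh
# Sketch of one Failure Friday round: break, wait, fix, wait until healthy.
# inject_failure, restore_service and is_healthy are hypothetical placeholders;
# replace their bodies with whatever applies to the service under test.

STATE_FILE="${STATE_FILE:-/tmp/failure-friday.state}"  # stand-in for real service state
WAIT_SECONDS="${WAIT_SECONDS:-300}"                     # the "wait 5mins" step
POLL_SECONDS="${POLL_SECONDS:-10}"                      # delay between health checks

inject_failure()  { echo broken  > "$STATE_FILE"; }  # placeholder: e.g. stop a process
restore_service() { echo healthy > "$STATE_FILE"; }  # placeholder: undo the failure
is_healthy()      { [ "$(cat "$STATE_FILE" 2>/dev/null)" = healthy ]; }

run_round() {
    echo "break"
    inject_failure
    echo "wait ${WAIT_SECONDS}s"
    sleep "$WAIT_SECONDS"
    echo "fix"
    restore_service
    # Poll until the service reports healthy before starting the next round.
    until is_healthy; do sleep "$POLL_SECONDS"; done
    echo "recovered"
}
```

Calling `run_round` once per agenda item gives the "repeat" step. Because the exercise is manual, a person would normally watch dashboards and alerts during the wait rather than rely on the script alone.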

  15. • Suspend/Stop Process.
    • Reboot Host.
    • Network Isolation.
    • Network latency.

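As a rough illustration of the first three failure types listed above, the sketch below wraps a few standard commands in a dry-run helper. The slides do not say which commands were used, so `kill -STOP`/`kill -CONT`, `reboot`, and the `iptables` rules are assumptions, as are the function names and the `DRY_RUN` switch. The real commands need root privileges.

```shell
#!/bin/sh
# Hedged sketch of manual failure injection. With DRY_RUN=1 (the default) each
# command is only printed, so nothing is changed on the host.
# All function names here are hypothetical; the commands need root when run for real.

DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

suspend_process() { run kill -STOP "$1"; }  # freeze a process by PID without killing it
resume_process()  { run kill -CONT "$1"; }  # let the frozen process continue
reboot_host()     { run reboot; }

# Drop inbound packets to, and outbound packets from, one local TCP port.
isolate_port() {
    run iptables -I INPUT 1 -p tcp --dport "$1" -j DROP
    run iptables -I OUTPUT 1 -p tcp --sport "$1" -j DROP
}

# Delete the two rules added by isolate_port.
unisolate_port() {
    run iptables -D INPUT -p tcp --dport "$1" -j DROP
    run iptables -D OUTPUT -p tcp --sport "$1" -j DROP
}
```

For example, `isolate_port 9042` prints the two rules that would be inserted. Setting `DRY_RUN=0` applies them, and `unisolate_port 9042` removes them again.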

  16. tc qdisc add dev eth0 root netem \
    delay 500ms 100ms \
    loss 5%

  17. tc qdisc del dev eth0 root netem

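The two `tc` commands above add and then remove a netem rule that delays packets by 500ms with 100ms of jitter and drops 5% of them. One possible way to make sure the rule is always removed, even if the session is interrupted, is to pair them with a `trap`. This is only a sketch: `with_latency`, `IFACE`, `DURATION`, and the `DRY_RUN` switch are illustrative names that do not come from the talk.

```shell
#!/bin/sh
# Sketch: add latency and loss for a fixed window, then always clean up.
# IFACE, DURATION and DRY_RUN are assumptions; with DRY_RUN=1 (the default) the
# tc commands are only printed. Running them for real requires root.

DRY_RUN="${DRY_RUN:-1}"
IFACE="${IFACE:-eth0}"
DURATION="${DURATION:-300}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

add_latency()    { run tc qdisc add dev "$IFACE" root netem delay 500ms 100ms loss 5%; }
remove_latency() { run tc qdisc del dev "$IFACE" root netem; }

with_latency() {
    # The subshell gives the traps a scope of their own: the EXIT trap removes
    # the netem rule when the window ends or when the subshell is interrupted.
    (
        trap remove_latency EXIT
        trap 'exit 1' INT TERM
        add_latency
        sleep "$DURATION"
    )
}
```

Running `DURATION=60 with_latency` in dry-run mode prints the add command, waits a minute, and prints the delete command.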

  18. View Slide

  19. • Found bugs (and fixed them).
    • Knowledge sharing between dev and ops.
    • Highlighted untestable systems.
    • Incident training.
    • Excellent for onboarding.

  20. View Slide