Failure Friday (Ignite Talk)

Rich Adams
December 09, 2015

A brief look at how we manually test for failure at PagerDuty.

This was an Ignite talk (20 slides, auto-advancing every 15 seconds), so the slides might not be terribly useful on their own.

It was presented at the "12 Talks of Cloudmas" Advanced AWS Meetup at the AWS Loft in San Francisco on December 9th, 2015.

Also available from https://richadams.me/talks/ignite-failurefriday/

Transcript

  1. :(
    Your server has disappeared for no adequately
    explained reason. Happy debugging.
    If you’d like to know more, you can search online later for this error: ERR_DEAL_WITH_IT

  2. Embrace Failure
    "I come from a long line of quitters.
    My father was a quitter, my
    grandfather was a quitter… I was
    raised to give up."

  3. Simian Army
    "We have found that the best defense against major
    unexpected failures is to fail often. By frequently
    causing failures, we force our services to be built in a
    way that is more resilient."
    http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html
    The Netflix Simian Army is available on GitHub:
    https://github.com/Netflix/SimianArmy

  4. “But We’re Not There Yet…”

  5. Failure Friday
    • 1 hour every Friday.
    • Operations and Development teams attend.
    • Manually cause failures (No automation required).
    • Uncovers issues.
    • Builds team culture.
    • Tests incident response…

  6. “Good Lord! I'm getting a reading
    of over 40 Mega-Fonzies!”

  7. Getting Started
    • Notify team(s).
    • Set the meeting.
    • Make an agenda.

  8. Get Ready to Rumble
    • Announce the start.
    • Disable cron jobs.
    • Leave alerts on!

  9. break
    wait 5 mins
    fix
    wait until !broken
    repeat
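
The loop on this slide could be scripted roughly as follows. This is only a sketch: the talk describes a manual process, and `inject_failure`, `restore_service`, `is_healthy`, `WAIT_SECONDS`, and `POLL_SECONDS` are hypothetical names invented for illustration, with placeholder bodies that only print what they would do.

```shell
#!/bin/sh
# Sketch of the slide's loop: break, wait 5 mins, fix, wait until !broken, repeat.
# Every function and variable name here is a hypothetical placeholder.

WAIT_SECONDS="${WAIT_SECONDS:-300}"   # "wait 5 mins" from the slide
POLL_SECONDS="${POLL_SECONDS:-10}"    # assumed interval between health checks

inject_failure()  { echo "break: $1"; }   # placeholder: cause the failure here
restore_service() { echo "fix: $1"; }     # placeholder: undo the failure here
is_healthy()      { return 0; }           # placeholder: replace with a real check

run_scenario() {
    inject_failure "$1"
    sleep "$WAIT_SECONDS"            # give alerts and dashboards time to react
    restore_service "$1"
    until is_healthy "$1"; do        # wait until the target is no longer broken
        sleep "$POLL_SECONDS"
    done
    echo "recovered: $1"
}

# "repeat": run the same cycle for each target in turn.
run_all() {
    for target in "$@"; do
        run_scenario "$target"
    done
}
```

For example, `run_all host-a host-b` would run the cycle once per host, in order.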

  10. • Suspend/Stop Process.
    • Reboot Host.
    • Network Isolation.
    • Network latency.
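
As a rough illustration, each failure type above maps to a common Linux command. The slide does not say which commands were used for the first three items, so the mapping below is an assumption, and `failure_command` is a hypothetical helper that only prints a command rather than running it. The `isolate` case drops inbound traffic from one source only, which is a partial form of isolation.

```shell
#!/bin/sh
# Dry-run sketch: print (never execute) one plausible command per failure type.
# The command choices are assumptions, not taken from the talk, except the
# netem line, which copies the parameters shown on the later slide.
failure_command() {
    kind="$1"
    target="${2:-}"
    case "$kind" in
        suspend) echo "kill -STOP $target" ;;    # pause a process by PID
        resume)  echo "kill -CONT $target" ;;    # let it continue
        reboot)  echo "reboot" ;;                # restart the host
        isolate) echo "iptables -I INPUT 1 -s $target -j DROP" ;;  # drop inbound from a source
        rejoin)  echo "iptables -D INPUT -s $target -j DROP" ;;    # delete that rule again
        latency) echo "tc qdisc add dev $target root netem delay 500ms 100ms loss 5%" ;;
        *)       echo "unknown failure type: $kind" >&2; return 1 ;;
    esac
}
```

For example, `failure_command suspend 1234` prints `kill -STOP 1234`. Running the printed commands for real generally requires root.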

  11. tc qdisc add dev eth0 root netem \
    delay 500ms 100ms \
    loss 5%

  12. tc qdisc del dev eth0 root netem
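
The add and delete commands from these two slides can be paired so the cleanup is hard to forget. The sketch below assumes a hypothetical `with_latency` helper and a `TC_CMD` variable (neither appears in the talk). It applies the same netem settings (500ms delay with 100ms of jitter, 5% packet loss), waits, then removes them, including when interrupted.

```shell
#!/bin/sh
# Sketch: apply the netem settings from the slides for a fixed time, then remove them.
# TC_CMD and with_latency are illustrative names. Real tc needs root privileges.
TC_CMD="${TC_CMD:-tc}"

with_latency() {
    dev="$1"        # network interface, e.g. eth0
    seconds="$2"    # how long to leave the degradation in place

    # Same parameters as the slide: 500ms delay, 100ms jitter, 5% packet loss.
    $TC_CMD qdisc add dev "$dev" root netem delay 500ms 100ms loss 5% || return 1

    # If interrupted (Ctrl-C or TERM), still remove the qdisc before exiting.
    trap '$TC_CMD qdisc del dev "$dev" root netem; trap - INT TERM; exit 130' INT TERM

    sleep "$seconds"

    # Normal path: remove the netem qdisc and clear the trap.
    $TC_CMD qdisc del dev "$dev" root netem
    trap - INT TERM
}
```

Setting `TC_CMD="echo tc"` before calling `with_latency eth0 300` prints the two commands without changing any network settings, which is a cheap way to check the script first.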

  13. • Found bugs (and fixed them).
    • Knowledge sharing between dev and ops.
    • Highlighted untestable systems.
    • Incident training.
    • Excellent for onboarding.
