Failure Friday (Ignite Talk)

Rich Adams
December 09, 2015

A brief look at how we manually test for failure at PagerDuty.

This was an Ignite talk (20 slides, auto advancing every 15 seconds), so the slides might not be terribly useful on their own.

It was presented at the "12 Talks of Cloudmas" Advanced AWS Meetup at the AWS Loft in San Francisco on December 9th, 2015.

Also available from https://richadams.me/talks/ignite-failurefriday/



  1. :( Your server has disappeared for no adequately explained reason.

    Happy debugging. If you’d like to know more, you can search online later for this error: ERR_DEAL_WITH_IT
  2. Embrace Failure

    "I come from a long line of quitters. My father was a quitter, my grandfather was a quitter… I was raised to give up."
  3. Simian Army

    "We have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient." http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html

    The Netflix Simian Army is available on GitHub: https://github.com/Netflix/SimianArmy
  4. Failure Friday

    • 1 hour every Friday.
    • Operations and Development teams attend.
    • Manually cause failures (No automation required).
    • Uncovers issues.
    • Builds team culture.
    • Tests incident response…
  5. break <thing>
    wait 5mins
    fix <thing>
    wait until !broken
    repeat
  6. • Found bugs (and fixed them).
    • Knowledge sharing between dev and ops.
    • Highlighted untestable systems.
    • Incident training.
    • Excellent for onboarding.
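
The five-step loop on slide 5 (break, wait, fix, wait until recovered, repeat) is run by hand at Failure Friday, with no automation required. Purely as an illustration of the control flow, it could be sketched in Python as below. Everything here is an assumption rather than PagerDuty's tooling: `failure_drill`, `run_session`, and the `break_thing`, `fix_thing`, and `is_broken` callables are hypothetical names, and the 5-second poll interval is invented for the example.

```python
import time
from typing import Callable, Iterable, List, Tuple

# Hypothetical type: one drill is a (break, fix, is_broken) triple of callables.
Drill = Tuple[Callable[[], None], Callable[[], None], Callable[[], bool]]


def failure_drill(
    break_thing: Callable[[], None],
    fix_thing: Callable[[], None],
    is_broken: Callable[[], bool],
    wait_seconds: float = 300.0,   # "wait 5mins" from the slide
    poll_seconds: float = 5.0,     # assumed poll interval, not from the talk
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run one break/wait/fix/recover cycle and return the recovery poll count."""
    break_thing()            # break <thing>
    sleep(wait_seconds)      # wait 5mins
    fix_thing()              # fix <thing>
    polls = 0
    while is_broken():       # wait until !broken
        polls += 1
        sleep(poll_seconds)
    return polls


def run_session(drills: Iterable[Drill], **kwargs) -> List[int]:
    """Repeat the cycle for each drill in turn, one at a time."""
    return [failure_drill(b, f, chk, **kwargs) for b, f, chk in drills]
```

Passing `sleep` as a parameter lets the loop be exercised without actually waiting five minutes. In a real session each step would be a person running a command and watching dashboards, which is where the knowledge sharing and incident training the slides mention come from.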