Failure Friday (Ignite Talk)

Rich Adams
December 09, 2015

A brief look at how we manually test for failure at PagerDuty.

This was an Ignite talk (20 slides, auto advancing every 15 seconds), so the slides might not be terribly useful on their own.

It was presented at the "12 Talks of Cloudmas" Advanced AWS Meetup at the AWS Loft in San Francisco on December 9, 2015.

Also available from https://richadams.me/talks/ignite-failurefriday/

Transcript

  1. None
  2. None
  3. :( Your server has disappeared for no adequately explained reason.
    Happy debugging. If you’d like to know more, you can search online later for this error: ERR_DEAL_WITH_IT
  4. Embrace Failure
    "I come from a long line of quitters. My father was a quitter, my grandfather was a quitter… I was raised to give up."
  5. Simian Army
    "We have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient."
    http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html
    The Netflix Simian Army is available on GitHub: https://github.com/Netflix/SimianArmy
  6. “But We’re Not There Yet…”
  7. None
  8. Failure Friday
    • 1 hour every Friday.
    • Operations and Development teams attend.
    • Manually cause failures (No automation required).
    • Uncovers issues.
    • Builds team culture.
    • Tests incident response…
  9. None
  10. “Good Lord! I'm getting a reading of over 40 Mega-Fonzies!”

  11. Getting Started
    • Notify team(s).
    • Set the meeting.
    • Make an agenda.
  12. Get Ready to Rumble
    • Announce the start.
    • Disable cron jobs.
    • Leave alerts on!
  13. None
  14. break <thing>
    wait 5mins
    fix <thing>
    wait until !broken
    repeat
  15. • Suspend/Stop Process.
    • Reboot Host.
    • Network Isolation.
    • Network latency.
  16. tc qdisc add dev eth0 root netem \
        delay 500ms 100ms \
        loss 5%
  17. tc qdisc del dev eth0 root netem
  18. None
  19. • Found bugs (and fixed them).
    • Knowledge sharing between dev and ops.
    • Highlighted untestable systems.
    • Incident training.
    • Excellent for onboarding.
  20. None
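
The break/wait/fix loop on slide 14 and the tc commands on slides 16 and 17 can be combined into a small script. What follows is a minimal sketch, not the tooling used at PagerDuty. The `run` helper and the `DRY_RUN`, `IFACE`, and `WAIT_SECS` variables are hypothetical names introduced for illustration; only the `tc` arguments come from the slides. It defaults to a dry run that prints each command, because changing a qdisc needs root and degrades the host's network.

```shell
#!/bin/sh
# Sketch of one Failure Friday iteration: break <thing>, wait, fix <thing>.
# Assumption: run(), DRY_RUN, IFACE and WAIT_SECS are illustrative names,
# not part of the talk. The tc arguments mirror slides 16 and 17.
set -eu

IFACE="${IFACE:-eth0}"          # interface used on the slides
WAIT_SECS="${WAIT_SECS:-300}"   # "wait 5mins"
DRY_RUN="${DRY_RUN:-1}"         # 1 = print commands only, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "+ $*"
  else
    "$@"
  fi
}

# Break: add 500ms delay (100ms jitter) and 5% packet loss.
break_network() {
  run tc qdisc add dev "$IFACE" root netem delay 500ms 100ms loss 5%
}

# Fix: remove the netem qdisc again.
fix_network() {
  run tc qdisc del dev "$IFACE" root netem
}

# One pass of the loop. "wait until !broken" is left to the people
# watching alerts and dashboards, so it is not automated here.
drill() {
  break_network
  run sleep "$WAIT_SECS"
  fix_network
}

# Only run the drill when explicitly asked to.
if [ "${1:-}" = "drill" ]; then
  drill
fi
```

Saved under a hypothetical name such as `drill.sh`, `sh drill.sh drill` prints the three commands it would run, and `DRY_RUN=0 sh drill.sh drill` as root would apply them for real. Deciding when the system has recovered is left to the attendees, in keeping with the talk's "no automation required" approach.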