Strategies For Being On Call & Keeping Your Sanity At The Same Time

Presented at #humanops in San Francisco.

Eric Sigler

June 29, 2016

Transcript

  1. Strategies for being on-call & keeping your sanity at the same time. Eric Sigler, Engineering Manager, PagerDuty @esigler
  2. First things first.
  3. No silver bullets.
  4. 3AM pages are a form of technical debt collection.
  5. Invest time in the underlying issues.
  6. Empathy takes time to build.
  7. Before, During, After.
  8. Before Going On Call
  9. Before Going On Call > Get Help!
  10. “I don’t want to get woken up in the middle of the night.” Before Going On Call > Get Help!
  11. Question Everything. (Especially alerts.) Before Going On Call > Review your alerts
  12. How is the alert triggered? Before Going On Call > Review your alerts
  13. <Obligatory Nagios Joke Here> Before Going On Call > Review your alerts
  14. What should someone do when they get the alert? Before Going On Call > Review your alerts
  15. The “everything’s OK” alarm. Before Going On Call > Review your alerts
  16. Why does it matter to the business? Before Going On Call > Review your alerts
  17. Production is down at 3AM? I care! Staging is down at 3AM? Less so. Before Going On Call > Review your alerts
  18. Practice makes perfect. Before Going On Call > Practice Beforehand
  19. Game Days, Chaos Monkeys, Failure Fridays, pick what works for you. Before Going On Call > Practice Beforehand
  20. During Your On Call shift.
  21. Be flexible with those who are on call. During On-Call > Be Flexible.
  22. Kick responders off as soon as possible. During On-Call > Scope down responders quickly
  23. Scope down as soon as you know the business impact. During On-Call > Scope down based on impact quickly
  24. After Going On Call.
  25. Include the impact to the responder in your postmortems. After On-Call > Consider ALL Impacts In Postmortems
  26. Do a periodic review of what alerts were triggered. After On-Call > Take A Look At What Actually Happened
  27. https://github.com/etsy/opsweekly https://opzzz.sh/
  28. Let’s recap:
    - The business needs to invest in on call
    - Get the right people involved, build empathy
    - Take a close look at what actually can wake someone up
    - Have everyone practice and exercise the systems during the day
    - Kick people off the call quickly
    - Kick the problem to daytime hours quickly
    - Consider the impact to the responder in postmortems
    - Do a periodic review of how painful the on call period was
  29. On call can be reasonable, but it takes a lot of investment.
  30. Thanks! @esigler
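
The periodic alert review from slide 26 can be sketched in a few lines of Python. This is only an illustration, not something shown in the talk: the alert log, its field layout, the alert names, and the 22:00 to 07:00 "off hours" window are all assumptions made up for the example. It counts how often each alert paged someone during sleeping hours and how many of its pages were not actionable.

```python
from collections import Counter
from datetime import datetime

# Hypothetical alert log: (timestamp in the responder's local time,
# alert name, whether the responder could actually act on it).
ALERTS = [
    ("2016-06-27T03:12:00", "disk_full_staging", False),
    ("2016-06-27T14:40:00", "api_latency_prod", True),
    ("2016-06-28T02:05:00", "api_latency_prod", True),
    ("2016-06-28T03:30:00", "disk_full_staging", False),
]


def is_off_hours(timestamp, start=22, end=7):
    """Return True if the timestamp falls in the assumed sleeping window."""
    hour = datetime.fromisoformat(timestamp).hour
    return hour >= start or hour < end


def review(alerts):
    """Tally off-hours pages and non-actionable pages per alert name."""
    off_hours = Counter()
    unactionable = Counter()
    for timestamp, name, actionable in alerts:
        if is_off_hours(timestamp):
            off_hours[name] += 1
        if not actionable:
            unactionable[name] += 1
    return off_hours, unactionable


off_hours, unactionable = review(ALERTS)
for name, count in off_hours.most_common():
    print(f"{name}: {count} off-hours page(s), {unactionable[name]} not actionable")
```

An alert that often wakes someone up and is rarely actionable, like the staging alert in this made-up data, is a candidate for the "question everything" review from slides 11 to 17.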