Building a healthy on-call culture

Serhat Can
November 06, 2018

Paging people creates a series of problems unless you invest enough resources in building a healthy “on-call” culture. Nobody wants to be buried in alerts or woken up at 2 am.

There are several points you have to take into account to make on-call suck less. At the center of each of these points are people. If you put your people at the center and design your incident response with them in mind from the start, on-call becomes a competitive advantage.

Transcript

  1. @srhtcn What is on-call duty? Jan Dettmers, Journal of Occupational Health Psychology. https://digest.bps.org.uk/2015/09/10/the-psychological-toll-of-being-off-duty-but-on-call/
  2. “We love stress when it is the right amount of stress. It’s not for nothing that you don't have roller coaster rides going for three weeks.” Robert Sapolsky, “When is Stress Good for You?” https://www.youtube.com/watch?v=6x9zxSCYbVA
  3. @srhtcn Cost of data center outages (*The Ponemon Institute & Emerson Network Power). The impact of downtime & performance degradation: direct revenue loss; unhappy users; loss of credibility.
  4. @srhtcn The impact of an unhealthy on-call culture: direct revenue loss; unhappy users and employees; loss of credibility. Image source: https://pre00.deviantart.net/b620/th/pre/f/2015/144/8/1/sad_pikachu_by_bekkistevenson-d8ujru5.png
  5. @srhtcn Onboarding and training make it safer: explain the basics and set up alert notification rules; give access to the right tools; use shadowing.
  6. @srhtcn Create runbooks
  7. @srhtcn Reduce noise: define what matters to you; prioritize and filter out useless alerts. Don’t page for the alerts that you can fix in the morning!
  8. @srhtcn 6 steps towards a healthy on-call culture: 1. Be transparent 2. Share responsibilities 3. Be prepared 4. Build resilient and sustainable systems 5. Create actionable alerts 6. Learn from your experiences
  9. @srhtcn “We never achieve reliability at the expense of an on-call engineer’s health.” - The Site Reliability Workbook
  10. @srhtcn References - opsgenie.com/blog - Engineering.opsgenie.com - bit.ly/actionable-alerts - https://landing.google.com/sre/book/index.html - https://medium.com/@copyconstruct/on-call-b0bd8c5ea4e0 - unsplash.com and its supporters for amazing photos - Incident management for operations book
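
The "Reduce noise" slide advises filtering out useless alerts and not paging for anything that can wait until morning. A minimal sketch of that idea follows. Everything in it is an assumption made for illustration and does not come from the talk or from any alerting product: the `Alert` fields, the 1 to 5 priority scale, the 09:00 to 18:00 working hours, and the `route` function itself. Weekends and time zones are ignored.

```python
from dataclasses import dataclass
from datetime import datetime, time
from enum import Enum


class Action(Enum):
    PAGE = "page"      # interrupt the on-call engineer right now
    NOTIFY = "notify"  # deliver without paging (someone is already at work)
    DEFER = "defer"    # hold until the next working morning
    DROP = "drop"      # filter out: nobody can act on it


@dataclass(frozen=True)
class Alert:
    name: str
    priority: int     # hypothetical scale: 1 = most urgent, 5 = informational
    actionable: bool  # can a human actually do something about it?


# Assumed working hours; adjust to the team's real schedule.
WORK_START = time(9, 0)
WORK_END = time(18, 0)


def route(alert: Alert, now: datetime) -> Action:
    """Decide what to do with an alert at the given local time."""
    if not alert.actionable:
        return Action.DROP
    if alert.priority <= 2:
        # Urgent and actionable: worth waking someone up.
        return Action.PAGE
    in_hours = WORK_START <= now.time() < WORK_END
    return Action.NOTIFY if in_hours else Action.DEFER
```

With these assumed rules, an actionable priority 1 alert at 2 am still pages, while an actionable priority 4 alert at the same hour is deferred to the morning and a non-actionable alert is dropped at any hour.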