Building a healthy on-call culture

Serhat Can
November 06, 2018

Paging people creates a series of problems unless you invest enough resources in building a healthy “on-call” culture. Nobody wants to be buried in alerts or woken up at 2 am.

There are several points you have to take into account to make on-call suck less. People are at the center of each of them. If you put your people first and design your incident response around them, on-call becomes a competitive advantage.

Transcript

  1. Building a healthy on-call culture Serhat Can @srhtcn @opsgenie

  2. What is on-call duty? Available for work if necessary, especially in an emergency

  3. What is on-call duty? Jan Dettmers, Journal of Occupational Health Psychology https://digest.bps.org.uk/2015/09/10/the-psychological-toll-of-being-off-duty-but-on-call/

  4. “We love stress when it is the right amount of stress. It’s not for nothing that you don't have roller coaster rides going for three weeks.” Robert Sapolsky, When is Stress Good for You? https://www.youtube.com/watch?v=6x9zxSCYbVA

  5. The problem: Most people hate on-call.

  6. Cost of data center outages (source: The Ponemon Institute & Emerson Network Power). The impact of downtime & performance degradation: direct revenue loss, unhappy users, loss of credibility

  7. The impact of an unhealthy on-call culture: direct revenue loss, unhappy users and employees, loss of credibility. Image source: https://pre00.deviantart.net/b620/th/pre/f/2015/144/8/1/sad_pikachu_by_bekkistevenson-d8ujru5.png

  8. Don’t give up!

  9. Step 1: Be transparent

  10. Am I on-call!? What!

  11. Set responsibilities of on-call

  12. Be clear on availability of employees

  13. Step 2: Share responsibilities

  14. Create fair schedules: avoid inappropriate operational load and underload; follow the sun if you can

  15. Put developers on-call. Rising expectations. “You build it, you run it” - Werner Vogels

  16. @srhtcn

  17. Step 3: Be prepared

  18. Onboarding and training make it safer: explain the basics and set up alert notification rules; give access to the right tools; use shadowing

  19. @srhtcn

  20. Create runbooks

  21. Step 4: Build resilient and sustainable systems

  22. @srhtcn

  23. Observe your applications: logging, metrics, distributed tracing, alerts

  24. Apply Chaos Engineering Principles

  25. Step 5: Create actionable alerts

  26. Reduce noise: define what matters to you; prioritize and filter out useless alerts. Don’t page for the alerts that you can fix in the morning!

  27. Route alerts to the right people

  28. @srhtcn

  29. Step 6: Learn from your experiences

  30. Create “blameless” post mortems

  31. @srhtcn

  32. @srhtcn

  33. 6 steps towards a healthy on-call culture: 1. Be transparent 2. Share responsibilities 3. Be prepared 4. Build resilient and sustainable systems 5. Create actionable alerts 6. Learn from your experiences

  34. “We never achieve reliability at the expense of an on-call engineer’s health.” - The Site Reliability Workbook

  35. Most importantly: Care about your people

  36. @srhtcn

  37. References: opsgenie.com/blog; Engineering.opsgenie.com; bit.ly/actionable-alerts; https://landing.google.com/sre/book/index.html; https://medium.com/@copyconstruct/on-call-b0bd8c5ea4e0; unsplash.com and its supporters for amazing photos; Incident management for operations book

  38. @srhtcn @srhtcn
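
The advice on slides 26 and 27 (filter out useless alerts, don't page for what can wait until morning, route alerts to the right people) can be illustrated with a small Python sketch. This is not from the talk: the priority levels, the service-to-team mapping, the working hours, and every name in it (`Priority`, `Alert`, `ROUTES`, `decide`) are assumptions made up for illustration, and a real setup would also account for weekends, time zones, and escalation.

```python
from dataclasses import dataclass
from datetime import datetime, time
from enum import IntEnum


class Priority(IntEnum):
    """Hypothetical priority levels; a lower number is more urgent."""
    P1 = 1
    P2 = 2
    P3 = 3
    P4 = 4
    P5 = 5


@dataclass(frozen=True)
class Alert:
    service: str
    message: str
    priority: Priority


# Assumed mapping from service name to the team that owns it.
ROUTES = {"checkout": "payments-team", "search": "search-team"}
DEFAULT_TEAM = "platform-team"

# Assumed working hours (weekends are ignored to keep the sketch short).
WORK_START = time(9, 0)
WORK_END = time(18, 0)


def route(alert: Alert) -> str:
    """Pick the owning team, falling back to a default team."""
    return ROUTES.get(alert.service, DEFAULT_TEAM)


def decide(alert: Alert, now: datetime) -> tuple[str, str]:
    """Return (action, team).

    action is one of:
      "page"   - interrupt the on-call person right away
      "notify" - low-urgency message during working hours
      "defer"  - hold until the morning
      "drop"   - noise, not worth anyone's attention
    """
    team = route(alert)
    if alert.priority >= Priority.P5:
        return ("drop", team)
    if alert.priority <= Priority.P2:
        return ("page", team)
    # P3 and P4: actionable, but not worth waking anyone up for.
    in_working_hours = WORK_START <= now.time() < WORK_END
    return ("notify" if in_working_hours else "defer", team)
```

With these assumed rules, a P1 alert for `checkout` at 2 am pages `payments-team`, while a P3 alert for `search` at the same hour is deferred until the morning.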