Building a healthy on-call culture (meetup)

Serhat Can
January 24, 2019

Paging people just creates a series of problems unless you invest enough resources in building a healthy on-call culture. Nobody wants to be buried in alerts or woken up at 2 a.m.

There are several points you have to take into account to make on-call suck less. At the center of each of these items, there are people. If you put your people at the center and design your incident response thinking about them in the first place, on-call becomes a competitive advantage.

In this presentation, Serhat starts by defining on-call and explaining why we need a robust on-call culture. At this point, he covers the impact of downtime and performance degradation, such as direct revenue loss and loss of credibility. He then lists six must-haves:

- Be transparent
- Share responsibilities
- Get ready for wartime
- Build resilient and sustainable systems
- Create actionable alerts
- Learn from your experiences

Each of these steps comes with crucial points and advice for both developers and management. In the end, Serhat shows that our efforts to build a better on-call culture pay off in the happiness of our people and our users.


Transcript

  1. Building a healthy on-call culture Serhat Can @srhtcn @opsgenie

  2. @srhtcn What is on-call duty? Available for work if necessary,

    especially in an emergency
  3. 1 in 5 employees in the EU are on-call. Source: Taylor & Francis. (2015, August 4). How does being 'on-call' impact employee fatigue? ScienceDaily. Retrieved April 5, 2018, from www.sciencedaily.com/releases/2015/08/150804074052.htm
  4. In Hospitals

  5. In Large Enterprises

  6. IT in general: operations engineers, DevOps, SRE (whatever you call them), developers, customer success, sales, finance
  7. @srhtcn Who should be on-call?

  8. Current on-call types. Disclaimer: these can change based on what your company does, the level of abstraction at which it uses hardware or software, and the company size. - Outsourced or dedicated on-call teams (whose only job is to respond to incidents) - SysAdmins or operations engineers - Everyone involved in developing and operating software
  9. There is no “one” right way! Iterate over it, and

    find the best fit for your current organizational structure and culture
  10. On-call at Google. For new (not so reliable) services: developers. For established services: SRE (Site Reliability Engineers). Google requires development teams to run their own services if those systems aren’t stable.
  11. On-call at Airbnb, Pinterest, and New Relic: developers, with SRE (Site Reliability Engineers). Developers are on-call for their services, but have SREs working alongside them (usually “embedded” within the team). Airbnb discovered that having a separate operations team “creates a divide and simply doesn’t scale.”
  12. On-call at Datadog, DigitalOcean, and Dropbox: operations and developers. DigitalOcean has both development teams and operations teams on-call, but with a twist: development teams are on-call for their services, while operations teams are on-call for the interactions between the services.
  13. On-call at AWS Developers Developers are responsible for all development

    and operational tasks associated with their services. Amazon’s cultural emphasis on “ownership”: you don’t “own” the code you write, Amazon says, unless you run and maintain it, too.
  14. @srhtcn The problem: Most people hate on-call.

  15. @srhtcn Is it a trap? Jan Dettmers, Journal of Occupational

    Health Psychology https://digest.bps.org.uk/2015/09/10/the-psychological-toll-of-being-off-duty-but-on-call/
  16. We love stress when it is the right amount of

    stress. It’s not for nothing that you don't have roller coaster rides going for three weeks. Robert Sapolsky When is Stress Good for You? https://www.youtube.com/watch?v=6x9zxSCYbVA
  17. @srhtcn Cost of data center outages *The Ponemon Institute &

    Emerson Network Power The impact of downtime & performance degradation Direct revenue loss Unhappy users Loss of credibility
  18. @srhtcn The impact of An unhealthy on-call culture Direct revenue

    loss Unhappy users and employees Loss of credibility Image source: https://pre00.deviantart.net/b620/th/pre/f/2015/144/8/1/sad_pikachu_by_bekkistevenson-d8ujru5.png
  19. @srhtcn Don’t give up!

  20. Be transparent 1. @srhtcn

  21. @srhtcn Am I on-call!? what!

  22. @srhtcn Set responsibilities of on-call

  23. @srhtcn Be clear on availability of employees

  24. Share responsibilities 2. @srhtcn

  25. @srhtcn Create fair schedules Avoid inappropriate operational load and underload

    Follow the sun if you can
  26. @srhtcn Put developers on-call Rising expectations “You build it, you

    run it” - Werner Vogels
  27. @srhtcn

  28. Be prepared 3. @srhtcn

  29. @srhtcn

  30. @srhtcn Onboarding and training make it safer. Explain the basics and set up alert notification rules. Give access to the right tools. Use shadowing.
  31. @srhtcn

  32. @srhtcn Create runbooks
  33. Build resilient and sustainable systems 4. @srhtcn

  34. How many 9s?

  35. @srhtcn Observe your applications Logging Metrics Distributed Tracing Alerts

  36. @srhtcn Apply Chaos Engineering Principles

  37. Create actionable alerts 5. @srhtcn

  38. @srhtcn Reduce noise Define what matters to you Prioritize and

    filter out useless alerts Don’t page for the alerts that you can fix in the morning!
  39. @srhtcn Route alerts to the right people

  40. @srhtcn

  41. Gather more investigative information Take remedial actions One click voice

    and video conferences One click actions on alerts
  42. Learn from your experiences 6. @srhtcn

  43. Continuously improve on-call Onboard new people Update rotations Fix repetitive

    alerts
  44. @srhtcn Create “blameless” post mortems

  45. @srhtcn

  46. @srhtcn

  47. @srhtcn 6 Steps towards a healthy on-call culture 1. Be

    transparent 2. Share responsibilities 3. Be prepared 4. Build resilient and sustainable systems 5. Create actionable alerts 6. Learn from your experiences
  48. @srhtcn We never achieve reliability at the expense of an

    on-call engineer’s health. - The Site Reliability Workbook
  49. Most importantly: Care about your people @srhtcn

  50. @srhtcn

  51. @srhtcn References - opsgenie.com/blog - Engineering.opsgenie.com - bit.ly/actionable-alerts - https://landing.google.com/sre/book/index.html - https://medium.com/@copyconstruct/on-call-b0bd8c5ea4e0 - unsplash.com and its supporters for amazing photos - Incident Management for Operations (book)
  52. @srhtcn @srhtcn
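
Two of the slides above lend themselves to a small worked example: "How many 9s?" (slide 34) and "Don't page for the alerts that you can fix in the morning!" (slide 38). The sketch below is not from the talk. It is a minimal Python illustration, and the names `downtime_budget_minutes`, `Alert`, `route`, the priority levels, and the 09:00 to 18:00 working hours are all assumptions made up for the example.

```python
from dataclasses import dataclass
from datetime import time

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored


def downtime_budget_minutes(nines: int) -> float:
    """Yearly downtime allowed at N nines of availability (3 -> 99.9%)."""
    return MINUTES_PER_YEAR * 10 ** -nines


@dataclass
class Alert:
    name: str
    priority: str     # hypothetical levels: "critical", "high", "low"
    actionable: bool  # can the responder actually do something about it?


# Assumed working hours; adjust to the team's real schedule.
WORKDAY_START = time(9, 0)
WORKDAY_END = time(18, 0)


def route(alert: Alert, now: time) -> str:
    """Decide what to do with an alert at a given local time."""
    if not alert.actionable:
        return "drop"               # noise: nobody can act on it
    if alert.priority == "critical":
        return "page"               # worth waking someone up
    if WORKDAY_START <= now < WORKDAY_END:
        return "notify"             # people are at work anyway
    return "queue_for_morning"      # can wait; do not page at night
```

Under these assumptions, three nines (99.9%) leaves roughly 525.6 minutes of downtime per year, and a low-priority but actionable alert arriving at 2 a.m. is held for the morning instead of paging anyone. Real alerting tools express rules like this as configuration rather than code; the function only makes the decision order explicit.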