Building a healthy on-call culture (meetup)

Serhat Can
January 24, 2019

Paging people just creates a series of problems unless you invest enough resources in building a healthy "on-call" culture. Nobody wants to be buried in alerts or woken up at 2 am.

There are several points you have to take into account to make on-call suck less. At the center of each of these items, there are people. If you put your people at the center and design your incident response thinking about them in the first place, on-call becomes a competitive advantage.

In this presentation, Serhat will start by defining on-call and explaining why we need a robust on-call culture. At this point, he'll mention the impact of downtime and performance degradation, such as direct revenue and credibility losses. Then he'll continue by listing 6 must-haves:

- Be transparent
- Share responsibilities
- Get ready for wartime
- Build resilient and sustainable systems
- Create actionable alerts
- Learn from your experiences

In each of these steps, there will be crucial points and pieces of advice for both developers and management. In the end, Serhat will show that our efforts in building a better on-call culture pay off in the happiness of our people and users.


Transcript

  1. 1 in 5 employees in the EU are on-call. Taylor & Francis. (2015, August 4). How does being 'on-call' impact employee fatigue? ScienceDaily. Retrieved April 5, 2018 from www.sciencedaily.com/releases/2015/08/150804074052.htm
  2. IT in general; Operations engineers, DevOps, SRE (whatever you call them); Developers; Customer success; Sales; Finance
  3. Current on-call types. Disclaimer: these can change based on what your company does, the level of abstraction at which it uses hardware or software, and the company size.
    - Outsourced or dedicated on-call teams (whose only job is to respond to incidents)
    - SysAdmins or Operations engineers
    - Everyone involved in developing and operating software
  4. There is no “one” right way! Iterate over it, and find the best fit for your current organizational structure and culture.
  5. On-call at Google. For new (not so reliable) services: Developers. For established services: SRE (Site Reliability Engineers). Google requires development teams to run their own services if those systems aren’t stable.
  6. On-call at Airbnb, Pinterest, New Relic: Developers, with SRE (Site Reliability Engineers). Developers are on-call for their services, but have SREs working alongside them (usually “embedded” within the team). Airbnb discovered that having a separate operations team “creates a divide and simply doesn’t scale.”
  7. On-call at Datadog, DigitalOcean, and Dropbox: Operations and Developers. DigitalOcean has both development teams and operations teams on-call, but with a twist: development teams are on-call for their services, while operations teams are on-call for the interactions between the services.
  8. On-call at AWS: Developers. Developers are responsible for all development and operational tasks associated with their services. Amazon’s cultural emphasis is on “ownership”: you don’t “own” the code you write, Amazon says, unless you run and maintain it, too.
  9. @srhtcn Is it a trap? Jan Dettmers, Journal of Occupational Health Psychology. https://digest.bps.org.uk/2015/09/10/the-psychological-toll-of-being-off-duty-but-on-call/
  10. “We love stress when it is the right amount of stress. It’s not for nothing that you don't have roller coaster rides going for three weeks.” - Robert Sapolsky, When is Stress Good for You? https://www.youtube.com/watch?v=6x9zxSCYbVA
  11. Cost of data center outages (source: The Ponemon Institute & Emerson Network Power). The impact of downtime & performance degradation: direct revenue loss, unhappy users, loss of credibility.
  12. The impact of an unhealthy on-call culture: direct revenue loss, unhappy users and employees, loss of credibility. Image source: https://pre00.deviantart.net/b620/th/pre/f/2015/144/8/1/sad_pikachu_by_bekkistevenson-d8ujru5.png
  13. Onboarding and training make it safer: explain the basics and set up alert notification rules; give access to the right tools; use shadowing.
  14. Create runbooks (the slide's handwritten checklist did not survive text extraction)
  15. Reduce noise: define what matters to you; prioritize and filter out useless alerts; don’t page for alerts that you can fix in the morning!
  16. Gather more investigative information; take remedial actions; one-click voice and video conferences; one-click actions on alerts.
  17. 6 steps towards a healthy on-call culture:
    1. Be transparent
    2. Share responsibilities
    3. Be prepared
    4. Build resilient and sustainable systems
    5. Create actionable alerts
    6. Learn from your experiences
  18. “We never achieve reliability at the expense of an on-call engineer’s health.” - The Site Reliability Workbook
  19. References:
    - opsgenie.com/blog
    - Engineering.opsgenie.com
    - bit.ly/actionable-alerts
    - https://landing.google.com/sre/book/index.html
    - https://medium.com/@copyconstruct/on-call-b0bd8c5ea4e0
    - unsplash.com and its supporters for amazing photos
    - Incident Management for Operations (book)
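
The "Reduce noise" advice in slide 15 (page only for what cannot wait until morning) could be sketched as a small routing function. This is a minimal Python sketch and not something shown in the talk: the priority labels, the business hours, and the names Alert and route are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import time

# Assumed working hours; adjust to your own team (not from the talk).
BUSINESS_START = time(9, 0)
BUSINESS_END = time(18, 0)


@dataclass
class Alert:
    name: str
    priority: str     # assumed labels: "critical", "low", or "noise"
    actionable: bool  # can the responder actually do something about it?


def route(alert: Alert, now: time) -> str:
    """Return "drop", "page", "notify", or "queue" for an incoming alert."""
    if alert.priority == "noise" or not alert.actionable:
        return "drop"    # filter out useless alerts
    if alert.priority == "critical":
        return "page"    # worth waking someone up at any hour
    if BUSINESS_START <= now < BUSINESS_END:
        return "notify"  # low priority, but someone is at their desk
    return "queue"       # can be fixed in the morning, so do not page
```

With these assumed rules, a low-priority disk warning at 2 am is queued for the morning, while a critical outage at the same hour still pages the on-call engineer.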