Building a healthy on-call culture (meetup)

Building a healthy on-call culture Serhat Can @srhtcn @opsgenie

@srhtcn What is on-call duty? Available for work if necessary,
especially in an emergency

1 in 5 employees in the EU are on-call.
Source: Taylor & Francis. (2015, August 4). How does being 'on-call' impact employee fatigue? ScienceDaily. Retrieved April 5, 2018 from www.sciencedaily.com/releases/2015/08/150804074052.htm

In Hospitals

In Large Enterprises

IT in general - Operations engineers, DevOps, SRE (whatever you call them) - Developers - Customer success - Sales - Finance

@srhtcn Who should be on-call?

Current on-call types
Disclaimer: these can change based on what your company does, the level of abstraction at which it uses hardware or software, and the company size. - Outsourced or dedicated on-call teams (whose only job is to respond to incidents) - SysAdmins or Operations engineers - Everyone involved in developing and operating software

There is no “one” right way! Iterate over it, and
find the best fit for your current organizational structure and culture

On-call at Google - For new (not so reliable) services: Developers - For established services: SRE (Site Reliability Engineers)
Google requires development teams to run their own services if those systems aren’t stable.

On-call at Airbnb, Pinterest, New Relic: Developers, with SRE (Site Reliability Engineers) alongside
Developers are on-call for their services, but have SREs working alongside them (usually “embedded” within the team). Airbnb discovered that having a separate operations team “creates a divide and simply doesn’t scale.”

On-call at Datadog, DigitalOcean, and Dropbox: Operations and Developers
DigitalOcean has both development teams and operations teams on-call, but with a twist: development teams are on-call for their services, while operations teams are on-call for the interactions between the services.

On-call at AWS Developers Developers are responsible for all development
and operational tasks associated with their services. Amazon’s cultural emphasis on “ownership”: you don’t “own” the code you write, Amazon says, unless you run and maintain it, too.

@srhtcn The problem: Most people hate on-call.

@srhtcn Is it a trap? Jan Dettmers, Journal of Occupational
Health Psychology https://digest.bps.org.uk/2015/09/10/the-psychological-toll-of-being-off-duty-but-on-call/

“We love stress when it is the right amount of stress. It’s not for nothing that you don't have roller coaster rides going for three weeks.” - Robert Sapolsky, When is Stress Good for You? https://www.youtube.com/watch?v=6x9zxSCYbVA

@srhtcn Cost of data center outages (source: The Ponemon Institute & Emerson Network Power)
The impact of downtime & performance degradation - Direct revenue loss - Unhappy users - Loss of credibility

@srhtcn The impact of an unhealthy on-call culture - Direct revenue loss - Unhappy users and employees - Loss of credibility
Image source: https://pre00.deviantart.net/b620/th/pre/f/2015/144/8/1/sad_pikachu_by_bekkistevenson-d8ujru5.png

@srhtcn Don’t give up!

@srhtcn 1. Be transparent

@srhtcn Am I on-call!? what!

@srhtcn Set responsibilities of on-call

@srhtcn Be clear on availability of employees

@srhtcn 2. Share responsibilities

@srhtcn Create fair schedules Avoid inappropriate operational load and underload
Follow the sun if you can
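
One way to picture a fair schedule is a plain round-robin rotation, optionally combined with follow-the-sun coverage so that each site is paged only during its own daytime. The sketch below is only an illustration: the function names, the three sites, and their UTC hour windows are assumptions, not something taken from the talk.

```python
from itertools import cycle, islice

def weekly_rotation(engineers, weeks):
    """Assign one engineer per week in round-robin order so shifts are spread evenly."""
    return list(islice(cycle(engineers), weeks))

# Hypothetical follow-the-sun split: (start hour, end hour, site), all in UTC.
FOLLOW_THE_SUN = [
    (0, 8, "APAC"),
    (8, 16, "EMEA"),
    (16, 24, "Americas"),
]

def site_on_call(utc_hour):
    """Return the site whose daytime window covers the given UTC hour (0-23)."""
    for start, end, site in FOLLOW_THE_SUN:
        if start <= utc_hour < end:
            return site
    raise ValueError(f"hour out of range: {utc_hour}")

print(weekly_rotation(["alice", "bob", "carol"], 5))
print(site_on_call(9))
```

With three engineers, nobody is on-call two weeks in a row, and with three sites nobody is paged in the middle of their night. Real schedules also need to handle holidays, swaps, and overrides, which this sketch leaves out.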

@srhtcn Put developers on-call Rising expectations “You build it, you
run it” - Werner Vogels

@srhtcn 3. Be prepared

@srhtcn Onboarding and training make it safer - Explain the basics and set up alert notification rules - Give access to the right tools - Use shadowing

@srhtcn Create runbooks

@srhtcn 4. Build resilient and sustainable systems

How many 9s?
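
To make the "how many 9s" question concrete, the sketch below converts an availability target into a yearly downtime budget. The helper name and the 365-day year are assumptions made for illustration; the arithmetic itself is just (1 - availability) multiplied by the minutes in a year.

```python
# Minutes in a 365-day year (leap years ignored for simplicity).
MINUTES_PER_YEAR = 365 * 24 * 60

def allowed_downtime_minutes(nines):
    """Yearly downtime budget in minutes for an availability of N nines (hypothetical helper)."""
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001
    return MINUTES_PER_YEAR * unavailability

for n in range(2, 6):
    availability = 100 * (1 - 10 ** -n)
    print(f"{n} nines ({availability:.3f}%): {allowed_downtime_minutes(n):.1f} min/year")
```

Each extra nine shrinks the budget tenfold: roughly 3.65 days per year at two nines, under 9 hours at three, and about 5 minutes at five. That budget is what the on-call rotation has to defend, so the target should match what the team can realistically sustain.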

@srhtcn Observe your applications Logging Metrics Distributed Tracing Alerts

@srhtcn Apply Chaos Engineering Principles

@srhtcn 5. Create actionable alerts

@srhtcn Reduce noise Define what matters to you Prioritize and
filter out useless alerts Don’t page for the alerts that you can fix in the morning!
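
The noise-reduction rules above can be sketched as a small triage function: drop what nobody can act on, page only for what cannot wait, and turn the rest into a ticket for the morning. The `Alert` fields, the P1 to P5 scale, and the page threshold are assumptions for illustration, not rules from the talk.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    priority: str     # hypothetical scale: "P1" (most urgent) to "P5" (least urgent)
    actionable: bool  # can a responder actually do something about it?

# Assumed threshold: only the two most urgent priorities wake someone up.
PAGE_PRIORITIES = {"P1", "P2"}

def triage(alert):
    """Return 'drop', 'page', or 'ticket' for a single alert."""
    if not alert.actionable:
        return "drop"    # useless alerts are filtered out entirely
    if alert.priority in PAGE_PRIORITIES:
        return "page"    # urgent and actionable: notify the on-call engineer now
    return "ticket"      # can be fixed in the morning

print(triage(Alert("checkout errors above 5%", "P1", True)))
print(triage(Alert("disk 70% full", "P4", True)))
print(triage(Alert("CPU spike on idle batch host", "P3", False)))
```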

@srhtcn Route alerts to the right people
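
Routing can be as simple as a lookup from a tag on the alert to the team that owns the service, with a fallback so that nothing goes unanswered. The tag name, team names, and fallback below are hypothetical and only illustrate the idea; they do not reflect any particular tool's API.

```python
# Hypothetical mapping from the "service" tag on an alert to the owning on-call team.
ROUTES = {
    "payments": "payments-oncall",
    "search": "search-oncall",
}
DEFAULT_TEAM = "ops-oncall"  # assumed fallback for alerts with no known owner

def route(alert_tags):
    """Pick the on-call team for an alert based on its 'service' tag."""
    return ROUTES.get(alert_tags.get("service"), DEFAULT_TEAM)

print(route({"service": "payments", "region": "eu"}))
print(route({"region": "us"}))
```

Keeping the table keyed by service, not by individual, means the schedule decides who gets paged while the routing rules stay stable as people rotate.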

Gather more investigative information - Take remedial actions - One-click voice and video conferences - One-click actions on alerts

@srhtcn 6. Learn from your experiences

Continuously improve on-call - Onboard new people - Update rotations - Fix repetitive alerts

@srhtcn Create “blameless” post mortems

@srhtcn 6 Steps towards a healthy on-call culture 1. Be
transparent 2. Share responsibilities 3. Be prepared 4. Build resilient and sustainable systems 5. Create actionable alerts 6. Learn from your experiences

@srhtcn We never achieve reliability at the expense of an
on-call engineer’s health. - The Site Reliability Workbook

Most importantly: Care about your people @srhtcn

@srhtcn References - opsgenie.com/blog - engineering.opsgenie.com - bit.ly/actionable-alerts - https://landing.google.com/sre/book/index.html - https://medium.com/@copyconstruct/on-call-b0bd8c5ea4e0 - unsplash.com and its supporters for amazing photos - Incident management for operations book
