Slide 1

Slide 1 text

Building a healthy on-call culture Serhat Can @srhtcn @opsgenie

Slide 2

Slide 2 text

What is on-call duty?
Being available for work if necessary, especially in an emergency.

Slide 3

Slide 3 text

What is on-call duty?
Jan Dettmers, Journal of Occupational Health Psychology
https://digest.bps.org.uk/2015/09/10/the-psychological-toll-of-being-off-duty-but-on-call/

Slide 4

Slide 4 text

“We love stress when it is the right amount of stress. It’s not for nothing that you don't have roller coaster rides going for three weeks.” - Robert Sapolsky, When is Stress Good for You?
https://www.youtube.com/watch?v=6x9zxSCYbVA

Slide 5

Slide 5 text

The problem: Most people hate on-call.

Slide 6

Slide 6 text

Cost of data center outages
*The Ponemon Institute & Emerson Network Power
The impact of downtime & performance degradation
- Direct revenue loss
- Unhappy users
- Loss of credibility

Slide 7

Slide 7 text

The impact of an unhealthy on-call culture
- Direct revenue loss
- Unhappy users and employees
- Loss of credibility
Image source: https://pre00.deviantart.net/b620/th/pre/f/2015/144/8/1/sad_pikachu_by_bekkistevenson-d8ujru5.png

Slide 8

Slide 8 text

Don’t give up!

Slide 9

Slide 9 text

1. Be transparent

Slide 10

Slide 10 text

Am I on-call!? What!

Slide 11

Slide 11 text

Set responsibilities of on-call

Slide 12

Slide 12 text

Be clear on the availability of employees

Slide 13

Slide 13 text

2. Share responsibilities

Slide 14

Slide 14 text

Create fair schedules
- Avoid operational overload and underload
- Follow the sun if you can
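A fair schedule can be checked mechanically. The sketch below is only an illustration, not a method from the talk: it assumes a weekly round-robin rotation and three 8-hour follow-the-sun windows. The names `build_rotation`, `shifts_per_engineer`, `follow_the_sun_region`, the engineer names, and the region labels are all hypothetical.

```python
from collections import Counter
from datetime import date, timedelta
from itertools import cycle


def build_rotation(engineers, start, weeks):
    """Assign one engineer per week in round-robin order (assumed policy)."""
    rotation = cycle(engineers)
    return [(start + timedelta(weeks=i), next(rotation)) for i in range(weeks)]


def shifts_per_engineer(schedule):
    """Count shifts per person so overload and underload are visible."""
    return Counter(name for _, name in schedule)


def follow_the_sun_region(utc_hour):
    """Pick the region on duty, assuming three 8-hour handoff windows."""
    if 0 <= utc_hour < 8:
        return "APAC"
    if 8 <= utc_hour < 16:
        return "EMEA"
    return "AMER"


schedule = build_rotation(["ayse", "bob", "chen"], date(2024, 1, 1), weeks=6)
load = shifts_per_engineer(schedule)
```

With six weeks and three engineers, each person ends up with the same number of shifts; an uneven count in `load` is a hint that the rotation needs rebalancing.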

Slide 15

Slide 15 text

Put developers on-call
- Rising expectations
- “You build it, you run it” - Werner Vogels

Slide 16

Slide 16 text


Slide 17

Slide 17 text

3. Be prepared

Slide 18

Slide 18 text

Onboarding and training make it safer
- Explain the basics and set up alert notification rules
- Give access to the right tools
- Use shadowing

Slide 19

Slide 19 text


Slide 20

Slide 20 text

Create runbooks
[Illegible slide text]

Slide 21

Slide 21 text

4. Build resilient and sustainable systems

Slide 22

Slide 22 text


Slide 23

Slide 23 text

Observe your applications
- Logging
- Metrics
- Distributed tracing
- Alerts

Slide 24

Slide 24 text

Apply Chaos Engineering Principles

Slide 25

Slide 25 text

5. Create actionable alerts

Slide 26

Slide 26 text

Reduce noise
- Define what matters to you
- Prioritize and filter out useless alerts
- Don’t page for alerts that you can fix in the morning!
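One way to picture the noise-reduction rules above is as a small decision function. This is a minimal sketch under assumed conventions, not Opsgenie's API: the `Alert` fields, the "P1" to "P5" priority labels, and the `decide` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    priority: str     # hypothetical levels: "P1" (most urgent) to "P5"
    actionable: bool  # can the responder do something about it right now?


def decide(alert):
    """Return "drop", "page", or "ticket" for an incoming alert."""
    if not alert.actionable:
        return "drop"    # useless alerts are filtered out
    if alert.priority in ("P1", "P2"):
        return "page"    # urgent: wake someone up
    return "ticket"      # can be fixed in the morning


decisions = [
    decide(Alert("checkout-down", "P1", True)),
    decide(Alert("disk-70-percent", "P4", True)),
    decide(Alert("flapping-healthcheck", "P3", False)),
]
```

Which priorities justify a page is a team decision; the threshold here is only a placeholder.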

Slide 27

Slide 27 text

Route alerts to the right people
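Routing can be as simple as a lookup from alert tags to an owning team, with a fallback so nothing goes unowned. The sketch below is illustrative only; the tag names, team names, `ROUTES`, `DEFAULT_TEAM`, and `route` are assumptions and do not come from any specific tool.

```python
# Hypothetical mapping from alert tag to the team that owns it.
ROUTES = {
    "payments": "payments-team",
    "database": "dba-team",
}
DEFAULT_TEAM = "platform-team"  # assumed catch-all owner


def route(alert_tags):
    """Return the first team whose tag matches, else the default team."""
    for tag in alert_tags:
        if tag in ROUTES:
            return ROUTES[tag]
    return DEFAULT_TEAM


owner = route(["checkout", "payments"])
fallback = route(["unknown-service"])
```

Here `owner` resolves to the payments team and `fallback` to the catch-all team, so every alert reaches someone who is able to act on it.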

Slide 28

Slide 28 text


Slide 29

Slide 29 text

6. Learn from your experiences

Slide 30

Slide 30 text

Create “blameless” post-mortems

Slide 31

Slide 31 text


Slide 32

Slide 32 text


Slide 33

Slide 33 text

6 steps towards a healthy on-call culture
1. Be transparent
2. Share responsibilities
3. Be prepared
4. Build resilient and sustainable systems
5. Create actionable alerts
6. Learn from your experiences

Slide 34

Slide 34 text

“We never achieve reliability at the expense of an on-call engineer’s health.” - The Site Reliability Workbook

Slide 35

Slide 35 text

Most importantly: Care about your people

Slide 36

Slide 36 text


Slide 37

Slide 37 text

References
- opsgenie.com/blog
- engineering.opsgenie.com
- bit.ly/actionable-alerts
- https://landing.google.com/sre/book/index.html
- https://medium.com/@copyconstruct/on-call-b0bd8c5ea4e0
- unsplash.com and its supporters for amazing photos
- Incident Management for Operations (book)

Slide 38

Slide 38 text

@srhtcn