How to build a healthy on-call culture

SERHAT CAN | TECHNICAL EVANGELIST | ATLASSIAN | @SRHTCN

https://unsplash.com/photos/yO3whNbzxsc
2014 failure: https://www.theverge.com/2014/10/3/6414949/911-call-failures-fcc
2018 failure: https://edition.cnn.com/2018/12/28/us/centurylink-outage-911-calls/index.html

ON-CALL CAN BE A SOURCE OF STRESS AND BURNOUT. https://unsplash.com/photos/Of8C-QHqagM

The analysis revealed significant effects of extended work availability on the daily start-of-day mood and cortisol awakening response.
EXTENDED WORK AVAILABILITY AND ITS RELATION WITH START-OF-DAY MOOD AND CORTISOL. J OCCUP HEALTH PSYCHOL. 2016 JAN;21(1):105-18. DOI: 10.1037/A0039602. EPUB 2015 AUG 3.

https://www.atlassian.com/blog/software-teams/modern-software-development-trends

https://en.dopl3r.com/memes/hot-topics/microservices/247404

You build it, you run it.
DR. WERNER VOGELS, CTO, AMAZON

Dev - Ops: Developers on-call
Dev - Management: Increasing demands

Everything was fine, until it wasn’t.

INCIDENT TIMELINE
5:30pm: Customers report problem.
5:50pm: Page “alerting” team through Slack app.
5:55pm: On-call engineer looks at recent changes on Jira.
6:00pm: On-call adds me as a responder. So, I get paged.
6:15pm: We bring in the incident response team and enter a Statuspage entry.
6:40pm: We disable one of the clusters and stop the problem.
6:45pm: Get alerts from CloudWatch and associate them with the incident.
8:00pm: After a lot of debugging, we find a bug.
9:00pm: Fix the code and ship it. We still have inconsistencies.
2:00am: Run data sync job and bring the app back into a healthy state.

KEY TAKEAWAYS
Actionable alerts
Training
Transparency
Analysis and learning

TAKEAWAY 1: ACTIONABLE ALERTS PROVIDE CONTEXT AND GUIDANCE TO REDUCE MTTR AND STRESS.

Actionable alerts
Automated alerting: Catch inconsistencies before customer impact. Group similar alerts automatically.
Escalation paths: Make it easy to call for help. Ensure someone is taking care of the problem.
One-click actions and guides: Leverage one-click actions to triage and remediate issues. Have runbooks as guides.
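
To make "group similar alerts automatically" and "escalation paths" concrete, here is a minimal sketch in Python. It is not taken from the talk or from any particular alerting product: the `Alert` fields, the 10-minute grouping window, the function names, and the example escalation path are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # Hypothetical alert shape, chosen for this sketch only.
    service: str
    check: str
    message: str
    minute: int  # minutes since the start of the observation window


def group_alerts(alerts, window_minutes=10):
    """Group alerts with the same (service, check) that arrive within
    window_minutes of the first alert in their group."""
    groups = []
    latest_group = {}  # (service, check) -> index of that key's newest group
    for alert in sorted(alerts, key=lambda a: a.minute):
        key = (alert.service, alert.check)
        idx = latest_group.get(key)
        if idx is not None and alert.minute - groups[idx][0].minute <= window_minutes:
            groups[idx].append(alert)
        else:
            groups.append([alert])
            latest_group[key] = len(groups) - 1
    return groups


def who_to_page(escalation_path, minutes_unacknowledged):
    """Return the last responder on the path whose threshold has been reached.

    escalation_path is a list of (minutes_after_alert, responder) pairs.
    """
    responder = None
    for after_minutes, name in sorted(escalation_path):
        if minutes_unacknowledged >= after_minutes:
            responder = name
    return responder


# Assumed example data, not from the talk.
alerts = [
    Alert("api", "latency", "p99 above threshold", 0),
    Alert("api", "latency", "p99 above threshold", 3),
    Alert("db", "replication", "replica lagging", 4),
    Alert("api", "latency", "p99 above threshold", 30),
]
path = [(0, "primary on-call"), (10, "secondary on-call"), (20, "engineering manager")]

for group in group_alerts(alerts):
    print(group[0].service, group[0].check, len(group))
print(who_to_page(path, 12))
```

Grouping means the on-call engineer gets one page per problem rather than one per alert, and the escalation path guarantees that an unacknowledged page always moves on to someone else.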

TAKEAWAY 2: TRAINING GIVES CONFIDENCE.

Training
Onboarding: Get new engineers ready to be on-call. Explain the basics and give access to the right tools. Use shadowing as you bring new people in.
Game day: Rehearse like it is real. Know your role during incidents and have fun at the same time.


TAKEAWAY 3: TRANSPARENCY MAKES ON-CALL MORE HUMANE.

Transparency
Open company, no bullshit: Make it written, make it available. atlassian.com/software/jira/ops/handbook
Statuspage updates: Communicate incident status with internal and external stakeholders.

TAKEAWAY 4: ANALYZE AND CONTINUOUSLY LEARN FROM EACH INCIDENT.

Analysis and learning
Collect operational data: Record every detail of on-call changes and the incident response process.
Postmortems: Write a detailed document on the incident. While doing that, don’t blame anyone.
Compensate: Remember that on-call is not leisure time. Give your employees something in return.
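
As one illustration of what collected operational data makes possible, the sketch below computes time-to-mitigate and time-to-resolve from recorded timestamps. The times are the ones from the incident timeline earlier in the deck (reported 5:30pm, cluster disabled 6:40pm, healthy again 2:00am); the calendar date, the dictionary keys, and the function names are placeholders assumed for this example.

```python
from datetime import datetime
from statistics import mean


def minutes_between(start, end):
    """Elapsed minutes between two datetimes."""
    return (end - start).total_seconds() / 60


def mean_time_to(incidents, milestone):
    """Average minutes from 'reported' to the given milestone across incidents."""
    return mean(minutes_between(i["reported"], i[milestone]) for i in incidents)


# Times come from the incident timeline; the date itself is a placeholder.
incidents = [
    {
        "reported": datetime(2000, 1, 1, 17, 30),   # customers report problem
        "mitigated": datetime(2000, 1, 1, 18, 40),  # cluster disabled
        "resolved": datetime(2000, 1, 2, 2, 0),     # data sync finished
    },
]

print(mean_time_to(incidents, "mitigated"))  # 70.0
print(mean_time_to(incidents, "resolved"))   # 510.0
```

With more incidents in the list, the same function gives the mean time to resolve (MTTR) that the first takeaway refers to, and tracking it over time shows whether changes to alerts and training are helping.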

KEY TAKEAWAYS
Actionable alerts
Training
Transparency
Analysis and learning

ON-CALL CAN BE A SOURCE OF STRESS AND BURNOUT. https://unsplash.com/photos/Of8C-QHqagM

ON-CALL CAN BE HAPPIER AND HEALTHIER. https://unsplash.com/photos/hRdVSYpffas

Thank you!
SERHAT CAN | TECHNICAL EVANGELIST | ATLASSIAN | @SRHTCN
