How to build a healthy on-call culture leveraging DevOps and AWS

SERHAT CAN | TECHNICAL EVANGELIST | AWS COMMUNITY HERO | @SRHTCN

https://unsplash.com/photos/yO3whNbzxsc
2014 failure: https://www.theverge.com/2014/10/3/6414949/911-call-failures-fcc
2018 failure: https://edition.cnn.com/2018/12/28/us/centurylink-outage-911-calls/index.html

ON-CALL CAN BE A SOURCE OF STRESS AND BURNOUT.
https://unsplash.com/photos/Of8C-QHqagM

“The analysis revealed significant effects of extended work availability on the daily start-of-day mood and cortisol awakening response.”
Source: Extended Work Availability and Its Relation With Start-of-Day Mood and Cortisol. J Occup Health Psychol. 2016 Jan;21(1):105-18. doi: 10.1037/a0039602. Epub 2015 Aug 3.

https://www.atlassian.com/blog/software-teams/modern-software-development-trends

https://en.dopl3r.com/memes/hot-topics/microservices/247404

“You build it, you run it.”
DR. WERNER VOGELS, CTO AMAZON

Dev - Ops: Developers on-call
Dev - Management: Increasing demands

Everything was fine, until it wasn’t.

INCIDENT TIMELINE

5:30pm  Customers report problem.
5:50pm  Page “alerting” team through Slack app.
5:55pm  On-call engineer looks at recent changes on Jira.
6:00pm  On-call adds me as a responder. So, I get paged.
6:15pm  We bring in the incident response team and enter a Statuspage entry.
6:40pm  We disable one of the clusters and stop the problem.
6:45pm  Get alerts from CloudWatch and associate them with the incident.
8:00pm  After a lot of debugging, we find a bug.
9:00pm  Fix the code and ship it. We still have inconsistencies.
2:00am  Run data sync job and bring the app back into a healthy state.

KEY TAKEAWAYS

1. Actionable alerts
2. Training
3. Transparency
4. Analysis and learning

TAKEAWAY 1: ACTIONABLE ALERTS PROVIDE CONTEXT AND GUIDANCE TO REDUCE MTTR AND STRESS.

Actionable alerts

Automated alerting: Catch inconsistencies before customer impact. Tools: Amazon CloudWatch, X-Ray, Systems Manager.
Escalation paths: Make it easy to call for help. Ensure someone is taking care of the problem.
One-click actions and guides: Leverage one-click actions to triage and remediate issues. Have runbooks as guides. Take a look at AWS Systems Manager Automation.
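One way to picture an escalation path and an alert that carries its own guidance is the minimal Python sketch below. It is not taken from the talk: the responder names, the delays, the `who_to_page` and `build_alert` helpers, and the runbook URL are all hypothetical, chosen only to show an alert that says who is responsible right now and where the runbook lives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes without acknowledgement before this step applies
    notify: str         # hypothetical responder or team name


# Hypothetical policy; names and delays are illustrative only.
POLICY = [
    EscalationStep(0, "primary on-call"),
    EscalationStep(10, "secondary on-call"),
    EscalationStep(25, "engineering manager"),
]


def who_to_page(minutes_unacknowledged: int, policy=POLICY) -> str:
    """Return the responder for the latest step whose delay has elapsed."""
    if minutes_unacknowledged < 0:
        raise ValueError("minutes_unacknowledged must be >= 0")
    due = [step for step in policy if step.after_minutes <= minutes_unacknowledged]
    return max(due, key=lambda step: step.after_minutes).notify


def build_alert(summary: str, runbook_url: str, minutes_unacknowledged: int) -> dict:
    """Bundle the context a responder needs: what broke, who owns it, what to do."""
    return {
        "summary": summary,
        "page": who_to_page(minutes_unacknowledged),
        "runbook": runbook_url,  # a guide the responder can follow step by step
    }


if __name__ == "__main__":
    alert = build_alert(
        "Data inconsistency detected in one cluster",
        "https://example.com/runbooks/data-sync",  # placeholder URL
        minutes_unacknowledged=12,
    )
    print(alert)
```

In a real setup the policy would live in the paging tool rather than in code; the sketch only illustrates that an unacknowledged alert should always have a next person to reach.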

TAKEAWAY 2: TRAINING GIVES CONFIDENCE.

Training

Onboarding: Get new engineers ready to be on-call. Explain the basics and give access to the right tools. Use shadowing as you bring new people in.
Game day: Rehearse like it is real. Know your role during incidents and have fun at the same time.

TAKEAWAY 3: TRANSPARENCY MAKES ON-CALL MORE HUMANE.

Transparency

Open company, no bullshit: Make it written, make it available. atlassian.com/software/jira/ops/handbook
Statuspage updates: Communicate incident status with internal and external stakeholders.

TAKEAWAY 4: ANALYZE AND CONTINUOUSLY LEARN FROM EACH INCIDENT.

Analysis and learning

Collect operational data: Record every detail of on-call changes and the incident response process.
Postmortems: Write a detailed document on the incident. While doing that, don’t blame anyone.
Compensate: Remember: On-call is not leisure time. Give your employees something in return.
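As a small illustration of what recorded incident data makes possible, the Python sketch below takes the clock times from the incident timeline earlier in the talk and computes how long each phase took. It is only a sketch under stated assumptions: the talk gives no calendar date, so a placeholder date is used, and the event labels and the `at` and `minutes_between` helpers are hypothetical names introduced here.

```python
from datetime import datetime, timedelta

# Placeholder date: the talk gives clock times only, not a calendar date.
DAY = datetime(2000, 1, 1)


def at(hour: int, minute: int, next_day: bool = False) -> datetime:
    """Build a timestamp on the placeholder day (or the day after)."""
    return DAY.replace(hour=hour, minute=minute) + timedelta(days=1 if next_day else 0)


# Clock times from the incident timeline; the keys are hypothetical labels.
TIMELINE = {
    "reported": at(17, 30),                 # customers report problem
    "paged": at(17, 50),                    # "alerting" team paged through Slack
    "mitigated": at(18, 40),                # cluster disabled, problem stopped
    "recovered": at(2, 0, next_day=True),   # data sync job done, app healthy
}


def minutes_between(start: str, end: str, timeline=TIMELINE) -> int:
    """Whole minutes elapsed between two recorded events."""
    return int((timeline[end] - timeline[start]).total_seconds() // 60)


if __name__ == "__main__":
    print("report -> page:      ", minutes_between("reported", "paged"), "min")
    print("report -> mitigation:", minutes_between("reported", "mitigated"), "min")
    print("report -> recovery:  ", minutes_between("reported", "recovered"), "min")
```

Averaging such per-incident durations across many incidents is what yields figures like MTTR, which is one reason to record the timestamps consistently during every response.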

ON-CALL CAN BE A SOURCE OF STRESS AND BURNOUT.
https://unsplash.com/photos/Of8C-QHqagM

ON-CALL CAN BE HAPPIER AND HEALTHIER
https://unsplash.com/photos/hRdVSYpffas

Thank you!
SERHAT CAN | TECHNICAL EVANGELIST | AWS COMMUNITY HERO | @SRHTCN
