How to build a healthy on-call culture

Raise your hand if you enjoy being buried in alerts or woken up at 2am? (Yeah... thought so.) Ever-rising customer expectations around high availability and performance put massive pressure on the teams who develop and support SaaS products. And teams are literally losing sleep over it.

Until outages and other incidents are a thing of the past, organizations need to invest in a way of dealing with them that won't lead to burnout. In this session, you'll learn how to combine the latest tooling with DevOps practices in pursuit of a sustainable incident response workflow. It's all about actionable alerts, training, transparency, and learning from each incident.

Serhat Can

April 11, 2019

Transcript

  1. How to build a healthy on-call culture. SERHAT CAN | TECHNICAL EVANGELIST | ATLASSIAN | @SRHTCN
  2. https://unsplash.com/photos/yO3whNbzxsc 2014 failure: https://www.theverge.com/2014/10/3/6414949/911-call-failures-fcc 2018 failure: https://edition.cnn.com/2018/12/28/us/centurylink-outage-911-calls/index.html

  3. ON-CALL CAN BE A SOURCE OF STRESS AND BURNOUT. https://unsplash.com/photos/Of8C-QHqagM

  4. The analysis revealed significant effects of extended work availability on the daily start-of-day mood and cortisol awakening response. EXTENDED WORK AVAILABILITY AND ITS RELATION WITH START-OF-DAY MOOD AND CORTISOL. J OCCUP HEALTH PSYCHOL. 2016 JAN;21(1):105-18. DOI: 10.1037/A0039602. EPUB 2015 AUG 3.
  5. https://www.atlassian.com/blog/software-teams/modern-software-development-trends

  6. https://en.dopl3r.com/memes/hot-topics/microservices/247404

  7. You build it, you run it. DR. WERNER VOGELS, CTO, AMAZON
  8. Dev - Ops: Developers on-call. Dev - Management: Increasing demands.

  11. Everything was fine, until it wasn’t.

  16. INCIDENT TIMELINE
    5:30pm: Customers report problem.
    5:50pm: Page “alerting” team through Slack app.
    5:55pm: On-call engineer looks at recent changes on Jira.
    6:00pm: On-call adds me as a responder. So, I get paged.
    6:15pm: We bring in the incident response team and post a Statuspage entry.
  21. INCIDENT TIMELINE
    6:40pm: We disable one of the clusters and stop the problem.
    6:45pm: Get alerts from CloudWatch and associate them with the incident.
    8:00pm: After a lot of debugging, we find a bug.
    9:00pm: Fix the code and ship it. We still have inconsistencies.
    2:00am: Run data sync job and bring the app back into a healthy state.
  22. KEY TAKEAWAYS: Actionable alerts. Training. Transparency. Analysis and learning.

  23. TAKEAWAY 1: ACTIONABLE ALERTS PROVIDE CONTEXT AND GUIDANCE TO REDUCE MTTR AND STRESS.
  24. Actionable alerts
    Automated alerting: Catch inconsistencies before customer impact. Group similar alerts automatically.
    Escalation paths: Make it easy to call for help. Ensure someone is taking care of the problem.
    One-click actions and guides: Leverage one-click actions to triage and remediate issues. Have runbooks as guides.
  27. TAKEAWAY 2: TRAINING GIVES CONFIDENCE.

  28. Training
    Onboarding: Get new engineers ready to be on-call. Explain the basics and give access to the right tools. Use shadowing as you bring new people in.
    Game day: Rehearse like it is real. Know your role during incidents and have fun at the same time.
  30. TAKEAWAY 3: TRANSPARENCY MAKES ON-CALL MORE HUMANE.

  31. Transparency
    Open company, no bullshit: Make it written, make it available. atlassian.com/software/jira/ops/handbook
    Statuspage updates: Communicate incident status with internal and external stakeholders.
  33. TAKEAWAY 4: ANALYZE AND CONTINUOUSLY LEARN FROM EACH INCIDENT.

  34. Analysis and learning
    Collect operational data: Record every detail of on-call changes and the incident response process.
    Postmortems: Write a detailed document on the incident. While doing that, don’t blame anyone.
    Compensate: Remember: On-call is not leisure time. Give your employees something in return.
  37. KEY TAKEAWAYS: Actionable alerts. Training. Transparency. Analysis and learning.

  38. ON-CALL CAN BE A SOURCE OF STRESS AND BURNOUT. https://unsplash.com/photos/Of8C-QHqagM

  39. ON-CALL CAN BE HAPPIER AND HEALTHIER. https://unsplash.com/photos/hRdVSYpffas

  40. Thank you! SERHAT CAN | TECHNICAL EVANGELIST | ATLASSIAN | @SRHTCN
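
The "Actionable alerts" slide mentions grouping similar alerts automatically and having escalation paths so that someone is always taking care of the problem. The slides do not show how either is implemented, so the following is only a minimal sketch under stated assumptions: the `Alert` fields, the `(service, check)` grouping key, and the delays and roles in `ESCALATION_PATH` are all hypothetical and not taken from the talk or from any particular alerting tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # All fields are hypothetical; real alerting tools define their own schema.
    service: str   # which service raised the alert
    check: str     # which check failed
    message: str   # free-text detail


def group_alerts(alerts):
    """Group alerts that share the same (service, check) key."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[(alert.service, alert.check)].append(alert)
    return dict(groups)


# Hypothetical escalation path: (minutes without acknowledgement, who to page).
ESCALATION_PATH = [
    (0, "primary on-call"),
    (10, "secondary on-call"),
    (20, "team lead"),
]


def who_to_page(minutes_unacknowledged, path=ESCALATION_PATH):
    """Return everyone who should have been paged by this point."""
    return [target for delay, target in path if minutes_unacknowledged >= delay]


alerts = [
    Alert("api", "latency", "p99 above 2s"),
    Alert("api", "latency", "p99 above 3s"),
    Alert("db", "disk", "90% full"),
]
groups = group_alerts(alerts)
paged = who_to_page(12)
```

Here the two latency alerts collapse into one group, so the on-call engineer sees two items instead of three, and after 12 minutes without acknowledgement both the primary and the secondary on-call have been paged. A real setup would more likely rely on the grouping and escalation features of the alerting tool itself.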