
How to build a healthy on-call culture

Raise your hand if you enjoy being buried in alerts or woken up at 2am? (Yeah... thought so.) Ever-rising customer expectations around high availability and performance put massive pressure on the teams who develop and support SaaS products. And teams are literally losing sleep over it.

Until outages and other incidents are a thing of the past, organizations need to invest in a way of dealing with them that won't lead to burn-out. In this session, you'll learn how to combine the latest tooling with DevOps practices in the pursuit of a sustainable incident response workflow. It's all about actionable alerts, training, transparency and learning from each incident.

Serhat Can

April 11, 2019

Transcript

  1. SERHAT CAN | TECHNICAL EVANGELIST | ATLASSIAN | @SRHTCN
    How to build a healthy on-call culture
  2. "The analysis revealed significant effects of extended work availability on the daily start-of-day mood and cortisol awakening response."
    Extended work availability and its relation with start-of-day mood and cortisol. J Occup Health Psychol. 2016 Jan;21(1):105-18. doi: 10.1037/a0039602. Epub 2015 Aug 3.
  3-5. INCIDENT TIMELINE
    5:30pm: Customers report problem.
    5:50pm: Page “alerting” team through Slack app.
    5:55pm: On-call engineer looks at recent changes on Jira.
    6:00pm: On-call adds me as a responder. So, I get paged.
    6:15pm: We bring in the incident response team and create a Statuspage entry.
  6-9. INCIDENT TIMELINE
    6:40pm: We disable one of the clusters and stop the problem.
    6:45pm: Get alerts from CloudWatch and associate them with the incident.
    8:00pm: After a lot of debugging, we find a bug.
    9:00pm: Fix the code and ship it. We still have inconsistencies.
    2:00am: Run data sync job and bring the app back into a healthy state.
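The timeline above lends itself to a little arithmetic. The following is a minimal sketch of how the elapsed times could be worked out, assuming the unlabeled times (5:55, 6:00, 6:15) are pm and that 2:00am falls on the following day. The date and the milestone names (`reported`, `paged`, `mitigated`, `fix_shipped`, `recovered`) are placeholders chosen for illustration; the slides give only clock times.

```python
from datetime import datetime, timedelta

# Assumption: the slides give clock times only, so this date is a placeholder.
INCIDENT_DAY = datetime(2019, 1, 1)


def at(hour, minute, next_day=False):
    """Return a timestamp on the placeholder day, or on the day after."""
    return INCIDENT_DAY + timedelta(days=int(next_day), hours=hour, minutes=minute)


# Milestones read off the timeline slides, in 24-hour time.
reported = at(17, 30)                # customers report the problem
paged = at(17, 50)                   # "alerting" team is paged
mitigated = at(18, 40)               # one cluster disabled, problem stops
fix_shipped = at(21, 0)              # code fix shipped
recovered = at(2, 0, next_day=True)  # data sync job restores a healthy state


def minutes_between(start, end):
    """Whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


time_to_page = minutes_between(reported, paged)
time_to_mitigate = minutes_between(reported, mitigated)
time_to_fix = minutes_between(reported, fix_shipped)
time_to_recover = minutes_between(reported, recovered)

print(f"report to page:     {time_to_page} min")
print(f"report to mitigate: {time_to_mitigate} min")
print(f"report to fix:      {time_to_fix} min")
print(f"report to recovery: {time_to_recover} min")
```

Under these assumptions the team was paged 20 minutes after the first customer report, and full recovery took 8.5 hours.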
  10-12. Actionable alerts
    Automated alerting: Catch inconsistencies before customer impact. Group similar alerts automatically.
    Escalation paths: Make it easy to call for help. Ensure someone is taking care of the problem.
    One-click actions and guides: Leverage one-click actions to triage and remediate issues. Have runbooks as guides.
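To make "group similar alerts" and "escalation paths" concrete, here is a minimal sketch. It is not how any particular alerting product works: the `Alert` fields, the grouping key (service plus check name), and the thresholds in `ESCALATION_POLICY` are all hypothetical values chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str   # hypothetical field: which service raised the alert
    check: str     # hypothetical field: which check fired
    message: str


def group_alerts(alerts):
    """Group alerts that share a service and check, so responders see one item."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[(alert.service, alert.check)].append(alert)
    return dict(groups)


# Hypothetical policy: (minutes unacknowledged, who gets notified at that point).
ESCALATION_POLICY = [
    (0, "primary on-call"),
    (10, "secondary on-call"),
    (30, "engineering manager"),
]


def who_to_notify(minutes_unacknowledged):
    """Everyone who should have been notified after this many minutes."""
    return [
        target
        for threshold, target in ESCALATION_POLICY
        if minutes_unacknowledged >= threshold
    ]


alerts = [
    Alert("checkout", "latency", "p99 above threshold in region A"),
    Alert("checkout", "latency", "p99 above threshold in region B"),
    Alert("sync", "consistency", "record counts differ between clusters"),
]

grouped = group_alerts(alerts)
for (service, check), members in grouped.items():
    print(f"{service}/{check}: {len(members)} alert(s)")

print(who_to_notify(12))
```

Grouping by a fixed key is the simplest possible choice; the point is only that three raw alerts become two items to look at, and that an unacknowledged alert always has a next person to reach.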
  13-14. Training
    Onboarding: Get new engineers ready to be on-call. Explain the basics and give access to the right tools. Use shadowing as you bring new people in.
    Game day: Rehearse like it is real. Know your role during incidents and have fun at the same time.
  15-16. Transparency
    Open company, no bullshit: Make it written, make it available. atlassian.com/software/jira/ops/handbook
    Statuspage updates: Communicate incident status with internal and external stakeholders.
  17-19. Analysis and learning
    Collect operational data: Record every detail on on-call changes and incident response process.
    Postmortems: Write a detailed document on the incident. While doing that, don’t blame anyone.
    Compensate: Remember: On-call is not leisure time. Give your employees something in return.