How to build a healthy on-call culture

Serhat Can
September 23, 2019

It is not easy, but here is what you can do about it.

Transcript

  1. How to build a healthy on-call culture. SERHAT CAN | TECHNICAL EVANGELIST | ATLASSIAN | @SRHTCN
  2. "The analysis revealed significant effects of extended work availability on the daily start-of-day mood and cortisol awakening response." EXTENDED WORK AVAILABILITY AND ITS RELATION WITH START-OF-DAY MOOD AND CORTISOL. J OCCUP HEALTH PSYCHOL. 2016 JAN;21(1):105-18. DOI: 10.1037/A0039602. EPUB 2015 AUG 3.
  3. Dev - Ops | Developers on call | Dev - Management | Increasing demands. Developers are often the most qualified people to identify and resolve problems.
  4. Dev - Ops | Developers on call | Dev - Management | Increasing demands. No conflicting incentives. The outcome is better testing, docs, monitoring and alerting.
  5. Dev - Ops | Developers on call | Dev - Management | Increasing demands. Time spent by devs gives managers clear reasons to prioritize reliability work.
  6. INCIDENT TIMELINE (5:30pm, 5:50pm, 6:00): Customers report inconsistency on UI. On call adds me as a responder. So, I get paged. Alerting team on call receives an alert.
  7. INCIDENT TIMELINE (5:30pm, 5:50pm, 6:00, 6:15): Customers report inconsistency on UI. We decide to bring the incident response team together. On call adds me as a responder. So, I get paged. Alerting team on call receives an alert.
  8. INCIDENT TIMELINE (6:40pm, 8:00pm): We disabled one of the clusters. This stopped the problem. After a lot of debugging, we found a bug.
  9. INCIDENT TIMELINE (6:40pm, 8:00pm, 9:00pm): We disabled one of the clusters. This stopped the problem. After a lot of debugging, we found a bug. We sent the fix but there were still inconsistencies.
  10. INCIDENT TIMELINE (6:40pm, 8:00pm, 9:00pm, 2:00am): We disabled one of the clusters. This stopped the problem. After a lot of debugging, we found a bug. We sent the fix but there were still inconsistencies. We ran the data sync job and brought the app back into a healthy state.
  11. LESSONS LEARNED: "YOU HAVEN'T LEARNED ANYTHING UNTIL YOU CHANGE YOUR BEHAVIOR" - ANDREW CLAY SHAFER
  12. Heroism is not sustainable. Even Iron Man needs backup. Make it easy to call for help. Build three-step escalation paths for different priorities. Arrange development duties during on call based on your pager load. Assign on-call temporarily to the engineer making the deployment.
  13. Alerts can become the best or worst friend of on-call engineers. In 2010, a Massachusetts hospital patient died after alarms signaling a critical event went unnoticed by 10 nurses. Patient safety officials shared that there are many reported deaths because of malfunctioning, turned-off, ignored, or unheard alarms. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4926996/
  14. Proactively fight against alert fatigue. No alert is better than a lot of alerts. Make the difference clear: tickets vs alerts. Add context to alerts: runbooks, details, logs, one-click remediation actions. Identify and review repeating alerts.
  15. Why Blameless? Blameful comments help no one; assume good intentions. People are not the root causes of incidents.
  16. "Embracing Open isn't easy. It means being vulnerable, transparent, willing to f*** things up in front of others." atlassian.com/open
  17. Sharing knowledge makes everyone smarter. Encourage people to speak up. Make information accessible to everyone. Create open on-call policies: Are engineers supposed to be on call during nights? If on call during nights, is there flexibility to work from home the next day or start the next day later than usual? Are engineers supposed to do development work during the on-call time? At most how many times in a month would an engineer be on call?
  18. "The best way to predict the future is to create it." - Dr. Forrest C. Shaklee
  19. KEY TAKEAWAYS: Give developers on-call responsibilities. Create sustainable rotations and clear escalation paths. Be open and share knowledge. Create a blameless culture, not just postmortems. Embrace effective alerting practices. Practice incident response. Compensate on call.
  20. PUT PEOPLE FIRST. THE REST WILL FOLLOW. "WE NEVER ACHIEVE RELIABILITY AT THE EXPENSE OF AN ON-CALL ENGINEER’S HEALTH" - THE SITE RELIABILITY WORKBOOK
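
Slide 12 recommends three-step escalation paths for different priorities. The deck does not show how such a path is configured, so the following is only a minimal Python sketch of the idea. The priority labels, role names, wait times, and the `who_to_page` helper are all hypothetical and do not come from the talk or from any particular paging tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    notify: str        # who gets paged at this step (hypothetical role name)
    wait_minutes: int  # minutes to wait for an acknowledgement before escalating


# Hypothetical three-step paths; priorities and timings are assumptions.
ESCALATION_PATHS = {
    "P1": [
        EscalationStep("primary on-call", 5),
        EscalationStep("secondary on-call", 10),
        EscalationStep("engineering manager", 15),
    ],
    "P3": [
        EscalationStep("primary on-call", 30),
        EscalationStep("secondary on-call", 60),
        EscalationStep("team lead", 120),
    ],
}


def who_to_page(priority: str, minutes_unacknowledged: int) -> list[str]:
    """Return everyone who should have been paged after the given delay."""
    paged = []
    elapsed = 0  # minutes at which the current step starts
    for step in ESCALATION_PATHS[priority]:
        if minutes_unacknowledged < elapsed:
            break  # this step has not been reached yet
        paged.append(step.notify)
        elapsed += step.wait_minutes
    return paged


if __name__ == "__main__":
    # An unacknowledged P1 alert after 6 minutes has reached the second step.
    print(who_to_page("P1", 6))
```

Keeping the path as plain data makes it easy to publish alongside the open on-call policies the talk asks for, so nobody has to guess who their backup is.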