How to build a healthy on-call culture

Serhat Can
September 23, 2019

It is not easy, but here is what you can do about it.

Transcript

  1. How to build a healthy on-call culture

    SERHAT CAN | TECHNICAL EVANGELIST | ATLASSIAN | @SRHTCN
  2. WHY > WHAT

  3. 2014 failure: https://www.theverge.com/2014/10/3/6414949/911-call-failures-fcc

    2018 failure: https://edition.cnn.com/2018/12/28/us/centurylink-outage-911-calls/index.html

  4. On Call

  5. The analysis revealed significant effects of extended work availability on the daily start-of-day mood and cortisol awakening response.

    EXTENDED WORK AVAILABILITY AND ITS RELATION WITH START-OF-DAY MOOD AND CORTISOL. J OCCUP HEALTH PSYCHOL. 2016 JAN;21(1):105-18. DOI: 10.1037/A0039602. EPUB 2015 AUG 3.
  6. https://www.atlassian.com/blog/software-teams/modern-software-development-trends

  7. You build it, you run it.

    DR. WERNER VOGELS, CTO AMAZON
  8. Dev - Ops Developers on call Dev - Management Increasing demands

    Developers are often the most qualified people to identify and resolve problems.
  9. Dev - Ops Developers on call Dev - Management Increasing demands

    No conflicting incentives. The outcome is better testing, docs, monitoring and alerting.
  10. Dev - Ops Developers on call Dev - Management Increasing demands

    Time spent by devs gives managers clear reasons to prioritize reliability work.
  11. Everything was fine, until it wasn’t.

  12. INCIDENT TIMELINE

    5:30pm: Customers report inconsistency on UI.
  13. INCIDENT TIMELINE

    5:30pm: Customers report inconsistency on UI.
    5:50pm: Alerting team on call receives an alert.
  14. INCIDENT TIMELINE

    5:30pm: Customers report inconsistency on UI.
    5:50pm: Alerting team on call receives an alert.
    6:00: On call adds me as a responder. So, I get paged.
  15. INCIDENT TIMELINE

    5:30pm: Customers report inconsistency on UI.
    5:50pm: Alerting team on call receives an alert.
    6:00: On call adds me as a responder. So, I get paged.
    6:15: We decide to bring the incident response team together.
  16. INCIDENT TIMELINE

    6:40pm: We disabled one of the clusters. This stopped the problem.
  17. INCIDENT TIMELINE

    6:40pm: We disabled one of the clusters. This stopped the problem.
    8:00pm: After a lot of debugging, we found a bug.
  18. INCIDENT TIMELINE

    6:40pm: We disabled one of the clusters. This stopped the problem.
    8:00pm: After a lot of debugging, we found a bug.
    9:00pm: We sent the fix but there were still inconsistencies.
  19. INCIDENT TIMELINE

    6:40pm: We disabled one of the clusters. This stopped the problem.
    8:00pm: After a lot of debugging, we found a bug.
    9:00pm: We sent the fix but there were still inconsistencies.
    2:00am: We ran the data sync job and brought the app back into a healthy state.
  20. LESSONS LEARNED

    YOU HAVEN'T LEARNED ANYTHING UNTIL YOU CHANGE YOUR BEHAVIOR
    - ANDREW CLAY SHAFER
  21. Heroism is not sustainable

    Even Iron Man needs backup

  22. Heroism is not sustainable

    Even Iron Man needs backup

    Make it easy to call for help
    Build three-step escalation paths for different priorities
    Arrange development duties during on call based on your pager load
    Assign on-call temporarily to the engineer making the deployment
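
The three-step escalation paths mentioned on slide 22 could be captured as plain data. The sketch below is only an illustration: the priority labels (P1 to P3), the role names, and the minute thresholds are assumptions rather than values from the talk, and `who_to_page` is a hypothetical helper, not any paging tool's API.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    notify: str         # who gets paged at this step (hypothetical role names)
    after_minutes: int  # minutes the alert has gone unacknowledged


# Hypothetical three-step paths, one per priority; tune thresholds to your team.
ESCALATION_PATHS: dict[str, list[EscalationStep]] = {
    "P1": [
        EscalationStep("primary on-call", 0),
        EscalationStep("secondary on-call", 5),
        EscalationStep("engineering manager", 15),
    ],
    "P2": [
        EscalationStep("primary on-call", 0),
        EscalationStep("secondary on-call", 15),
        EscalationStep("engineering manager", 45),
    ],
    "P3": [
        EscalationStep("primary on-call", 0),
        EscalationStep("secondary on-call", 60),
        EscalationStep("engineering manager", 240),
    ],
}


def who_to_page(priority: str, minutes_unacknowledged: int) -> list[str]:
    """Return everyone who should have been notified by now for this priority."""
    return [
        step.notify
        for step in ESCALATION_PATHS[priority]
        if step.after_minutes <= minutes_unacknowledged
    ]
```

For example, under these assumed thresholds `who_to_page("P1", 5)` includes the secondary on-call, while a P3 alert left unacknowledged for the same five minutes still reaches only the primary. Keeping the path as data makes it easy to review alongside the on-call policy.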
  23. Alerts can become the best or worst friend of on-call engineers

    In 2010, a Massachusetts hospital patient died after alarms signaling a critical event went unnoticed by 10 nurses. The patient safety officials shared that there are many reported deaths because of malfunctioning, turned off, ignored, or unheard alarms.
    https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4926996/
  24. Proactively fight against alert fatigue

    No alert is better than a lot of alerts.
    Make the difference clear: tickets vs alerts
    Add context on alerts: runbooks, details, logs, one-click remediation actions
    Identify and review repeating alerts
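
One way to identify repeating alerts for review is to count how often each alert fires per service over a period. This is a minimal sketch, assuming alert history is available as simple records; the `Alert` fields, the sample data, and the default threshold of three are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class Alert(NamedTuple):
    name: str     # hypothetical alert name
    service: str  # hypothetical owning service


def repeating_alerts(
    alerts: Iterable[Alert], min_count: int = 3
) -> List[Tuple[Alert, int]]:
    """Return alerts that fired at least min_count times, most frequent first."""
    counts = Counter(alerts)
    return [(alert, n) for alert, n in counts.most_common() if n >= min_count]


# Made-up history for illustration only.
history = [
    Alert("HighLatency", "checkout"),
    Alert("DiskFull", "db"),
    Alert("HighLatency", "checkout"),
    Alert("HighLatency", "checkout"),
    Alert("DiskFull", "db"),
]

for alert, n in repeating_alerts(history):
    print(f"{alert.service}/{alert.name} fired {n} times")
```

The resulting list is a starting point for a regular review: each repeating alert is a candidate for tuning, for conversion into a ticket, or for a runbook link.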
  25. Blameless culture, not just postmortems

  26. Why Blameless?

    Blameful comments help no one, assume good intentions
    People are not the root causes of incidents
  27. Embracing Open isn't easy. It means being vulnerable, transparent, willing to f*** things up in front of others

    atlassian.com/open
  28. Sharing knowledge makes everyone smarter

    Encourage people to speak up
    Make information accessible to everyone
    Create open on-call policies:
      Are engineers supposed to be on call during nights?
      If on call during nights, is there flexibility to work from home the next day or start the next day later than usual?
      Are engineers supposed to do development work during the on-call time?
      At most how many times a month would an engineer be on call?
  29. The best way to predict the future is to create it.

    - Dr. Forrest C. Shaklee
  30. Practice boosts confidence

    Shadowing
    Game days
    Podcasts (@oncallnightmare)
    Team playbooks
  31. Compensate on call

    Remember: on-call time isn’t leisure time

  32. KEY TAKEAWAYS

    Give developers on-call responsibilities
    Create sustainable rotations and clear escalation paths
    Be open and share knowledge
    Create a blameless culture, not just postmortems
    Embrace effective alerting practices
    Practice incident response
    Compensate on call
  33. PUT PEOPLE FIRST. THE REST WILL FOLLOW.

    WE NEVER ACHIEVE RELIABILITY AT THE EXPENSE OF AN ON-CALL ENGINEER’S HEALTH
    - THE SITE RELIABILITY WORKBOOK
  34. Thank you!

    SERHAT CAN | TECHNICAL EVANGELIST | ATLASSIAN | @SRHTCN