How to build a healthy on-call culture leveraging DevOps and AWS

Serhat Can
December 04, 2019

Raise your hand if you enjoy being buried in alerts or woken up at 2 AM. (Yeah, thought so.) Ever-rising customer expectations around high availability and performance put massive pressure on the teams who develop and support SaaS products, and those teams are literally losing sleep over it. Until outages and other incidents are a thing of the past, organizations need to invest in a way of dealing with them that won't lead to burnout. In this session, learn how to combine the latest tooling with DevOps practices on AWS in pursuit of a sustainable incident response workflow.

Transcript

  1. SERHAT CAN | TECHNICAL EVANGELIST | AWS COMMUNITY HERO | @SRHTCN

    How to build a healthy on-call culture using DevOps and AWS
  2. @srhtcn https://unsplash.com/photos/yO3whNbzxsc

    2014 failure: https://www.theverge.com/2014/10/3/6414949/911-call-failures-fcc
    2018 failure: https://edition.cnn.com/2018/12/28/us/centurylink-outage-911-calls/index.html
  3. @srhtcn ON-CALL CAN BE A SOURCE OF STRESS AND BURNOUT.

    https://unsplash.com/photos/Of8C-QHqagM
  4. @srhtcn The analysis revealed significant effects of extended work availability on the daily start-of-day mood and cortisol awakening response.

    EXTENDED WORK AVAILABILITY AND ITS RELATION WITH START-OF-DAY MOOD AND CORTISOL. J OCCUP HEALTH PSYCHOL. 2016 JAN;21(1):105-18. DOI: 10.1037/A0039602. EPUB 2015 AUG 3.
  5. @srhtcn https://www.atlassian.com/blog/software-teams/modern-software-development-trends

  6. @srhtcn https://en.dopl3r.com/memes/hot-topics/microservices/247404

  7. @srhtcn You build it, you run it.

    DR. WERNER VOGELS, CTO AMAZON
  8. @srhtcn Dev - Ops: Developers on-call.

    Dev - Management: Increasing demands.
  11. @srhtcn Everything was fine, until it wasn’t.

  16. @srhtcn INCIDENT TIMELINE

    5:30pm  Customers report problem.
    5:50pm  Page “alerting” team through Slack app.
    5:55    On-call engineer looks at recent changes on Jira.
    6:00    On-call adds me as a responder. So, I get paged.
    6:15    We bring in the incident response team and add a Statuspage entry.
  21. @srhtcn INCIDENT TIMELINE

    6:40pm  We disable one of the clusters and stop the problem.
    6:45pm  Get alerts from CloudWatch and associate them with the incident.
    8:00pm  After a lot of debugging, we find a bug.
    9:00pm  Fix the code and ship it. We still have inconsistencies.
    2:00am  Run data sync job and bring the app back into a healthy state.
  22. @srhtcn KEY TAKEAWAYS

    Actionable alerts. Training. Transparency. Analysis and learning.
  23. @srhtcn TAKEAWAY 1

    ACTIONABLE ALERTS PROVIDE CONTEXT AND GUIDANCE TO REDUCE MTTR AND STRESS.
  24. @srhtcn Actionable alerts

    Automated alerting: Catch inconsistencies before customer impact. Tools: Amazon CloudWatch, X-Ray, Systems Manager.
    Escalation paths: Make it easy to call for help. Ensure someone is taking care of the problem.
    One-click actions and guides: Leverage one-click actions to triage and remediate issues. Have runbooks as guides. Take a look at AWS Systems Manager Automation.
  27. @srhtcn TAKEAWAY 2

    TRAINING GIVES CONFIDENCE.
  28. @srhtcn Training

    Onboarding: Get new engineers ready to be on-call. Explain the basics and give access to the right tools. Use shadowing as you bring new people in.
    Game day: Rehearse like it is real. Know your role during incidents and have fun at the same time.
  30. @srhtcn TAKEAWAY 3

    TRANSPARENCY MAKES ON-CALL MORE HUMANE.
  31. @srhtcn Transparency

    Open company, no bullshit: Make it written, make it available. atlassian.com/software/jira/ops/handbook
    Statuspage updates: Communicate incident status with internal and external stakeholders.
  33. @srhtcn TAKEAWAY 4

    ANALYZE AND CONTINUOUSLY LEARN FROM EACH INCIDENT.
  34. @srhtcn Analysis and learning

    Collect operational data: Record every detail on on-call changes and incident response process.
    Postmortems: Write a detailed document on the incident. While doing that, don’t blame anyone.
    Compensate: Remember, on-call is not leisure time. Give your employees something in return.
  37. @srhtcn KEY TAKEAWAYS

    Actionable alerts. Training. Transparency. Analysis and learning.
  38. @srhtcn ON-CALL CAN BE A SOURCE OF STRESS AND BURNOUT.

    https://unsplash.com/photos/Of8C-QHqagM
  39. @srhtcn ON-CALL CAN BE HAPPIER AND HEALTHIER

    https://unsplash.com/photos/hRdVSYpffas
  40. Thank you!

    SERHAT CAN | TECHNICAL EVANGELIST | AWS COMMUNITY HERO | @SRHTCN
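
The incident timeline (slides 12 to 21) can be turned into elapsed-time figures, which is one simple way to start collecting the operational data mentioned under takeaway 4. The following is a minimal sketch, not something shown in the talk. It assumes the slide times written without a suffix (5:55, 6:00, 6:15) are pm, that 2:00am falls on the following day, and the event labels are shortened paraphrases. The names `TIMELINE`, `parse_clock`, `elapsed_minutes` and `fmt` are hypothetical.

```python
# Clock times are taken from the timeline slides; "pm" on the 5:55, 6:00 and
# 6:15 entries is an assumption, and the labels are shortened paraphrases.
TIMELINE = [
    ("customers report problem", "5:30pm"),
    ("alerting team paged through Slack app", "5:50pm"),
    ("on-call looks at recent changes on Jira", "5:55pm"),
    ("additional responder paged", "6:00pm"),
    ("incident response team and Statuspage entry", "6:15pm"),
    ("cluster disabled, problem stopped", "6:40pm"),
    ("CloudWatch alerts associated with incident", "6:45pm"),
    ("bug found", "8:00pm"),
    ("fix shipped, inconsistencies remain", "9:00pm"),
    ("data sync done, app healthy", "2:00am"),
]


def parse_clock(text):
    """Convert a 12-hour clock string such as '5:30pm' to minutes since midnight."""
    suffix = text[-2:].lower()
    hours, minutes = map(int, text[:-2].split(":"))
    hours = hours % 12 + (12 if suffix == "pm" else 0)
    return hours * 60 + minutes


def elapsed_minutes(timeline):
    """Return (label, minutes since the first event) for each entry.

    A clock time earlier than the previous one is treated as a rollover
    past midnight, so events must be listed in chronological order.
    """
    result = []
    day_offset = 0
    previous = None
    start = None
    for label, clock in timeline:
        minutes = parse_clock(clock)
        if previous is not None and minutes < previous:
            day_offset += 24 * 60
        previous = minutes
        absolute = minutes + day_offset
        if start is None:
            start = absolute
        result.append((label, absolute - start))
    return result


def fmt(minutes):
    """Format a minute count as '+HhMMm'."""
    return f"+{minutes // 60}h{minutes % 60:02d}m"


if __name__ == "__main__":
    for label, minutes in elapsed_minutes(TIMELINE):
        print(fmt(minutes), label)
```

Under these assumptions, the first page goes out 20 minutes after the customer report, the problem is stopped after 70 minutes, and the app is back in a healthy state after 8 hours 30 minutes.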
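
The "escalation paths" and "runbooks as guides" points under takeaway 1 can be illustrated with a small model of an escalation policy. This is a hedged sketch in plain Python, not the API of any alerting product or AWS service. The step names, delays, alert title and runbook URL are all made up for illustration, and `EscalationStep`, `Alert`, `POLICY` and `paged_so_far` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class EscalationStep:
    notify: str         # who gets paged at this step
    after_minutes: int  # delay since alert creation before this step fires


@dataclass
class Alert:
    title: str
    runbook_url: str                       # guidance attached to the alert
    acknowledged_at: Optional[int] = None  # minutes after creation, None if not acknowledged


# Hypothetical policy: names and delays are illustrative, not from the talk.
POLICY = [
    EscalationStep("primary on-call", 0),
    EscalationStep("secondary on-call", 10),
    EscalationStep("engineering manager", 30),
]


def paged_so_far(alert: Alert, policy: List[EscalationStep], minutes_elapsed: int) -> List[str]:
    """Return everyone paged within `minutes_elapsed` minutes of alert creation.

    A step fires once its delay has passed, unless the alert was
    acknowledged strictly before that step's delay.
    """
    cutoff = minutes_elapsed
    if alert.acknowledged_at is not None:
        cutoff = min(cutoff, alert.acknowledged_at)
    return [step.notify for step in policy if step.after_minutes <= cutoff]


if __name__ == "__main__":
    alert = Alert(
        title="Data inconsistency detected",                      # hypothetical
        runbook_url="https://example.com/runbooks/data-sync",     # hypothetical
    )
    for minute in (5, 15, 45):
        print(minute, paged_so_far(alert, POLICY, minute))
```

The point of the design is that an unacknowledged alert keeps moving to the next person, so someone is always taking care of the problem, while an acknowledgement stops further pages.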