How to build a healthy on-call culture leveraging DevOps and AWS

Serhat Can
December 04, 2019

Raise your hand if you enjoy being buried in alerts or awakened at 2 AM? (Yeah, thought so.) Ever-rising customer expectations around high availability and performance put massive pressure on the teams who develop and support SaaS products. And those teams are literally losing sleep over it. Until outages and other incidents are a thing of the past, organizations need to invest in a way of dealing with them that won't lead to burnout. In this session, learn how to combine the latest tooling with DevOps within AWS in the pursuit of a sustainable incident response workflow.

Transcript

  1. SERHAT CAN | TECHNICAL EVANGELIST | AWS COMMUNITY HERO | @SRHTCN

    How to build a healthy on-call culture using DevOps and AWS
  2. @srhtcn ON-CALL CAN BE A SOURCE OF STRESS AND BURNOUT.

    https://unsplash.com/photos/Of8C-QHqagM
  3. @srhtcn The analysis revealed significant effects of extended work availability on the daily start-of-day mood and cortisol awakening response.

    EXTENDED WORK AVAILABILITY AND ITS RELATION WITH START-OF-DAY MOOD AND CORTISOL. J OCCUP HEALTH PSYCHOL. 2016 JAN;21(1):105-18. DOI: 10.1037/A0039602. EPUB 2015 AUG 3.
  6. @srhtcn INCIDENT TIMELINE

    5:30pm: Customers report problem.
    5:50pm: Page “alerting” team through Slack app.
    5:55pm: On-call engineer looks at recent changes on Jira.
    6:00pm: On-call adds me as a responder. So, I get paged.
    6:15pm: We bring in the incident response team and enter a Statuspage entry.
  10. @srhtcn INCIDENT TIMELINE

    6:40pm: We disable one of the clusters and stop the problem.
    6:45pm: Get alerts from CloudWatch and associate them with the incident.
    8:00pm: After a lot of debugging, we find a bug.
    9:00pm: Fix the code and ship it. We still have inconsistencies.
    2:00am: Run data sync job and bring the app back into a healthy state.
  11. @srhtcn Actionable alerts

    Automated alerting: Catch inconsistencies before customer impact. Tools: Amazon CloudWatch, X-Ray, Systems Manager.
    Escalation paths: Make it easy to call for help. Ensure someone is taking care of the problem.
    One-click actions and guides: Leverage one-click actions to triage and remediate issues. Have runbooks as guides. Take a look at AWS Systems Manager Automation.
  14. @srhtcn Training

    Onboarding: Get new engineers ready to be on-call. Explain the basics and give access to the right tools. Use shadowing as you bring new people in.
    Game day: Rehearse like it is real. Know your role during incidents and have fun at the same time.
  16. @srhtcn Transparency

    Open company, no bullshit: Make it written, make it available. atlassian.com/software/jira/ops/handbook
    Statuspage updates: Communicate incident status with internal and external stakeholders.
  18. @srhtcn Analysis and learning

    Collect operational data: Record every detail of on-call changes and the incident response process.
    Postmortems: Write a detailed document on the incident. While doing that, don’t blame anyone.
    Compensate: Remember, on-call is not leisure time. Give your employees something in return.
  21. @srhtcn ON-CALL CAN BE A SOURCE OF STRESS AND BURNOUT.

    https://unsplash.com/photos/Of8C-QHqagM
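
The "Analysis and learning" slide recommends recording every detail of the incident response process. As a minimal sketch of what that data makes possible, the Python below encodes the events from the incident timeline slides and computes how long each phase took. The milestone labels (reported, paged, mitigated, resolved), the helper names, and the placeholder calendar date are assumptions made for illustration and are not part of the talk. The deck gives clock times only, so the unlabeled times (5:55, 6:00, 6:15) are read as pm and 2:00am is read as the following day.

```python
from datetime import datetime, timedelta

# Placeholder date: the deck gives clock times only, so the day is an assumption.
INCIDENT_DAY = datetime(2019, 1, 1)


def at(hour, minute, next_day=False):
    """Build a timestamp on the placeholder day (or the day after)."""
    return INCIDENT_DAY + timedelta(days=int(next_day), hours=hour, minutes=minute)


# (timestamp, milestone label or None, what happened).
# The milestone labels are hypothetical; the events come from the slides.
TIMELINE = [
    (at(17, 30), "reported", "Customers report problem"),
    (at(17, 50), "paged", 'Page "alerting" team through Slack app'),
    (at(17, 55), None, "On-call engineer looks at recent changes on Jira"),
    (at(18, 0), None, "On-call adds another responder"),
    (at(18, 15), None, "Incident response team joins, Statuspage entry created"),
    (at(18, 40), "mitigated", "One cluster disabled, problem stops"),
    (at(18, 45), None, "CloudWatch alerts associated with the incident"),
    (at(20, 0), None, "Bug found"),
    (at(21, 0), None, "Fix shipped, inconsistencies remain"),
    (at(2, 0, next_day=True), "resolved", "Data sync job restores healthy state"),
]


def minutes_between(timeline, start, end):
    """Whole minutes elapsed between two labeled milestones."""
    stamps = {label: ts for ts, label, _ in timeline if label}
    return int((stamps[end] - stamps[start]).total_seconds() // 60)


def summarize(timeline):
    """Minutes from the first customer report to each later milestone."""
    return {
        label: minutes_between(timeline, "reported", label)
        for _, label, _ in timeline
        if label and label != "reported"
    }


if __name__ == "__main__":
    for milestone, minutes in summarize(TIMELINE).items():
        print(f"reported -> {milestone}: {minutes} min")
```

Under these assumptions, the page went out 20 minutes after the first customer report, the problem was stopped after 70 minutes, and the app was back in a healthy state after 510 minutes (8.5 hours).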