How HashiCorp SREs Built HCP's Incident Management Program

Martin Smith
March 08, 2022

Back in 2019 and 2020, HashiCorp’s Cloud SREs worked with our engineering teams to build an incident management program for the soon-to-be-launched HashiCorp Cloud Platform (HCP). I’ll take you on a retrospective of how we initially approached this task, including a walkthrough of the HashiCorp RFC process and some of the questions we encountered along the way (for example, tooling selection challenges, or whether to use words like “retrospective” instead of “postmortem”). I’ll also talk about how that process has continued to evolve into 2022 and what we think the future holds for incident management at HashiCorp.

Transcript

  2. How HashiCorp SREs Built HCP's Incident Management Program (HashiTalks 2022)

  3. Martin Smith (he/him, @martinb3), Site Reliability Engineer at HashiCorp

  4. Building HashiCorp Cloud Platform. Standing on Terraform Cloud’s Shoulders. 01 Getting started
  5. “HashiCorp’s writing culture is our superpower. We get to see how our colleagues think, emulate those processes, and get smarter ourselves.” Margaret Gillette, Senior Director, Talent Development
  7. A timeline of the HashiCorp Cloud Platform. Dates shown: April 2020, July 2020, October 2020, January 2021, February 2021, April 2021, October 2021. Milestones shown: HCS Public Beta, HCS GA, HCP Consul Beta, HCP Vault Beta, HCP Consul GA, HCP Vault GA, HCP Packer Beta.
  9. Incident Management: severities, kickoff, during, after. Retrospective / Postmortem: facilitation, process to run, documentation/artifacts, action items. On-call: monitoring/integrations, rotations, wake up/not wake up, IaC.
  10. Team structure matters. Distribution, size, shape of team. On-call rotations.

  11. Culture comes from within. We really care about humane incident culture, but you can’t change it from the outside.
  12. On-call is creepy. Support requests, weekly data processing, etc. (keeping the lights on). Resist that urge.
  13. Blame is baggage. But why does it matter? And some more exposition or citation.
  14. Incidents have reasons. But why does it matter? And some more exposition or citation.
  15. Always be retro’ing. Not just for incidents anymore. Retro the IM process. Retro the retro process. Retro the on-call experience. All of it.
  16. Tooling for incidents. It broke everything. 02 Next steps

  17. The manual steps in running an incident 😭: form the incident team; page the right people; create a Zoom room; make a Slack channel; ask people to post in Slack; build the retrospective doc; interrupt constantly; scribe what happened; schedule the retrospective; find a facilitator; nag everyone about everything; send emails.
  20. Michael Main (he/him), Site Reliability Engineer at HashiCorp

  21. Learning More from Incidents. Refining our process. 03 Following Up

  22. Change is Hard. Especially when the stakes are high.

  23. Getting a Lay of the Land. What we already had: quantifiable data is already tracked by our incident management software. What we still need: anything that the incident management software can’t know.
  24. Tags. And also the problem with tags. More Data, More Burden. Tags are an easy way to attribute qualitative data, but they are also another step for incident responders to remember.
  25. Getting the Timing Right. Unburdening Incident Responders. Retrospectives. Custom questions are a more engaging format for incident responders, and serving them after the incident concludes gives responders plenty of time to fill in the blanks.
  26. After the Incident. Retrospectives and Ops Review. 04 Communicating

  27. Retrospectives. How our understanding changed.

  28. Summaries of Unusual Size? I don’t think they exist.

  30. Ops Review. Another audience to consider. Aggregating Incident Data. How can we provide a brief summary of each incident, but retain the ability to get into technical specifics if need be?
  31. What the Future Holds. 04 Looking Ahead

  32. Tooling. Have our needs changed? What else is out there?

  33. Scaling Across Teams. Standardization vs Customization. Running incidents is easy, except for when there’s an incident.
  34. More Effective Analysis. Building on the data foundation to learn more. Aggregate Analysis: What can we learn about our division or about the organization as a whole on a quarterly or annual basis? Team Analysis: How can we provide meaningful data on a team-by-team basis about incident response and service health over regular time periods? SLO Analysis: How well can we quantify incident impact against our SLOs and what conclusions can we draw about our SLOs as a result?
  35. Thank You hugs@hashicorp.com | learn.hashicorp.com | discuss.hashicorp.com
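
The SLO Analysis question on slide 34 asks how incident impact can be quantified against SLOs. As a minimal sketch, assuming an availability SLO measured over a fixed window and a known amount of SLO-violating time per incident, the share of error budget consumed could be computed as below. The `Incident` class, the function names, and the sample numbers are hypothetical illustrations, not HashiCorp's tooling or data.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Incident:
    """Hypothetical incident record; not taken from the talk."""
    name: str
    downtime: timedelta  # time the service spent violating its SLO


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Allowed SLO-violating time in the window, e.g. slo_target=0.999."""
    return window * (1.0 - slo_target)


def budget_consumed(incidents, slo_target: float, window: timedelta) -> float:
    """Fraction of the error budget used by the given incidents (1.0 = all of it)."""
    budget = error_budget(slo_target, window)
    total_downtime = sum((i.downtime for i in incidents), timedelta())
    return total_downtime / budget


# Made-up sample data: two incidents in a 30-day window with a 99.9% target.
sample = [
    Incident("example-outage-a", timedelta(minutes=20)),
    Incident("example-outage-b", timedelta(minutes=10)),
]
fraction = budget_consumed(sample, slo_target=0.999, window=timedelta(days=30))
print(f"Error budget consumed: {fraction:.0%}")
```

A per-team or per-quarter view, as raised under Team Analysis and Aggregate Analysis, would amount to grouping the incident records before calling `budget_consumed`; how downtime is actually measured per incident is left open here.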