How HashiCorp SREs Built HCP's Incident Management Program

Martin Smith
March 08, 2022

Back in 2019 and 2020, HashiCorp’s Cloud SREs worked with our engineering teams to build an incident management program for the soon-to-be-launched HashiCorp Cloud Platform (HCP). I’ll take you on a retrospective of how we initially approached this task, including a walkthrough of the HashiCorp RFC process and some of the questions we encountered along the way (for example, how to select tooling, or whether to use words like "retrospective" instead of "postmortem"). I’ll talk about how that process has continued to evolve into 2022, as well as what we think the future holds for incident management at HashiCorp.


Transcript

  1. "HashiCorp’s writing culture is our superpower. We get to see how our colleagues think, emulate those processes, and get smarter ourselves." Margaret Gillette, Senior Director Talent Development
  2. A timeline of the HashiCorp Cloud Platform. Dates shown: April 2020, July 2020, October 2020, January 2021, February 2021, April 2021, October 2021. Milestones shown: HCS Public Beta, HCS GA, HCP Consul Beta, HCP Vault Beta, HCP Consul GA, HCP Vault GA, HCP Packer Beta.
  3. Incident Management: severities, kickoff, during, after. Retrospective / Postmortem: facilitation, process to run, documentation/artifacts, action items. On-call: monitoring/integrations, rotations, wake up/not wake up, IaC.
  4. Culture comes from within. We really care about humane incident culture, but you can’t change it from the outside.
  5. Incidents have reasons. But why does it matter? And some more exposition or citation.
  6. Always be retro’ing. Not just for incidents anymore. Retro the IM process. Retro the retro process. Retro the on-call experience. All of it.
  7. The manual steps in running an incident 😭. Form the incident team; page the right people; create a Zoom room; make a Slack channel; ask people to post in Slack; build the retrospective doc; interrupt constantly; scribe what happened; schedule the retrospective; find a facilitator; nag everyone about everything; send emails.
  8. Getting a Lay of the Land. What we already had: quantifiable data is already tracked by our incident management software. What we still need: anything that the incident management software can’t know.
  9. Tags. And also the problem with tags. More Data, More Burden: tags are an easy way to attribute qualitative data, but they are also another step for incident responders to remember.
  10. Getting the Timing Right. Unburdening Incident Responders. Retrospectives: custom questions are a more engaging format for incident responders, and serving them after the incident concludes gives responders plenty of time to fill in the blanks.
  11. Ops Review. Another audience to consider. Aggregating Incident Data: how can we provide a brief summary of each incident, but retain the ability to get into technical specifics if need be?
  12. More Effective Analysis. Building on the data foundation to learn more. Aggregate Analysis: what can we learn about our division or about the organization as a whole on a quarterly or annual basis? Team Analysis: how can we provide meaningful data on a team-by-team basis about incident response and service health over regular time periods? SLO Analysis: how well can we quantify incident impact against our SLOs and what conclusions can we draw about our SLOs as a result?
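
The SLO Analysis question on the last slide can be made concrete with a small sketch. It is an illustration only, not taken from the talk or from any HashiCorp tooling: it assumes an availability SLO whose error budget is measured in minutes of downtime over a fixed window, and the names `Incident`, `error_budget_minutes`, and `budget_consumed` are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; the fields are assumptions for this sketch."""
    name: str
    downtime_minutes: float  # minutes of unavailability attributed to the incident


def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Allowed downtime, in minutes, for an availability SLO over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_consumed(
    incidents: list[Incident], slo_target: float, window_days: int
) -> dict[str, float]:
    """Map each incident name to the fraction of the error budget it used."""
    budget = error_budget_minutes(slo_target, window_days)
    return {i.name: i.downtime_minutes / budget for i in incidents}


if __name__ == "__main__":
    # Example data is made up purely to show the call shape.
    incidents = [Incident("inc-1", 21.6), Incident("inc-2", 5.0)]
    for name, share in budget_consumed(incidents, 0.999, 30).items():
        print(f"{name}: {share:.0%} of the 30-day error budget")
```

Summing these fractions per team or per quarter would give one possible input to the Team Analysis and Aggregate Analysis questions as well, assuming downtime can be attributed to individual incidents.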