How HashiCorp SREs Built HCP's Incident Management Program

HashiTalks 2022

Martin Smith (he/him, @martinb3), Site Reliability Engineer at HashiCorp

01 Getting started
Building HashiCorp Cloud Platform. Standing on Terraform Cloud’s Shoulders.

“HashiCorp’s writing culture is our superpower. We get to see how our colleagues think, emulate those processes, and get smarter ourselves.”
Margaret Gillette, Senior Director Talent Development

A timeline of the HashiCorp Cloud Platform
Dates shown: April 2020, July 2020, October 2020, January 2021, February 2021, April 2021, October 2021
Milestones shown: HCS Public Beta, HCS GA, HCP Consul Beta, HCP Vault Beta, HCP Consul GA, HCP Vault GA, HCP Packer Beta

Incident Management: Severities, Kickoff, During, After
Retrospective / Postmortem: Facilitation, Process to run, Documentation/artifacts, Action items
On-call: Monitoring/integrations, Rotations, Wake up/not wake up, IaC

Team structure matters
Distribution, size, shape of team. On-call rotations.

Culture comes from within
We really care about humane incident culture. But you can’t change it from the outside.

On-call is creepy
Support requests, weekly data processing, etc.: keeping the lights on. Resist that urge.

Blame is baggage
But why does it matter? And some more exposition or citation.

Incidents have reasons
But why does it matter? And some more exposition or citation.

Always be retro’ing
Not just for incidents anymore. Retro the IM process. Retro the retro process. Retro the on-call experience. All of it.

02 Next steps
Tooling for incidents. It broke everything.

The manual steps in running an incident 😭
Form the incident team
Page the right people
Create a Zoom room
Make a Slack channel
Ask people to post in Slack
Build the retrospective doc
Interrupt constantly
Scribe what happened
Schedule the retrospective
Find a facilitator
Nag everyone about everything
Send emails
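The manual steps on this slide can be treated as a checklist that tooling tracks on the responders' behalf. The following is a minimal sketch, not the tooling HashiCorp adopted: the step names come from the slide, while the grouping into kickoff/during/after phases and the `IncidentChecklist` class are assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Step names are from the slide; assigning them to phases is an assumption.
PHASES = {
    "kickoff": [
        "Form the incident team",
        "Page the right people",
        "Create a Zoom room",
        "Make a Slack channel",
    ],
    "during": [
        "Ask people to post in Slack",
        "Interrupt constantly",
        "Scribe what happened",
        "Nag everyone about everything",
        "Send emails",
    ],
    "after": [
        "Build the retrospective doc",
        "Schedule the retrospective",
        "Find a facilitator",
    ],
}


@dataclass
class IncidentChecklist:
    """Hypothetical tracker for the manual steps of one incident."""

    name: str
    done: set = field(default_factory=set)

    def complete(self, step: str) -> None:
        # Reject steps that are not on the checklist to catch typos early.
        if not any(step in steps for steps in PHASES.values()):
            raise ValueError(f"unknown step: {step}")
        self.done.add(step)

    def remaining(self, phase=None):
        # Return outstanding steps, in slide order, for one phase or all.
        phases = [phase] if phase else list(PHASES)
        return [s for p in phases for s in PHASES[p] if s not in self.done]


if __name__ == "__main__":
    checklist = IncidentChecklist("example-incident")
    checklist.complete("Page the right people")
    for step in checklist.remaining("kickoff"):
        print("TODO:", step)
```

Every item that a tool can mark complete automatically is one fewer thing for a responder to remember mid-incident.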

Michael Main (he/him), Site Reliability Engineer at HashiCorp

03 Following Up
Learning More from Incidents. Refining our process.

Change is Hard
Especially when the stakes are high.

Getting a Lay of the Land
What we already had. Quantifiable data is already tracked by our incident management software.
What we still need. Anything that the incident management software can’t know.

Tags. And also the problem with tags.
More Data, More Burden. Tags are an easy way to attribute qualitative data, but they are also another step for incident responders to remember.

Getting the Timing Right: Unburdening Incident Responders
Retrospectives. Custom questions are a more engaging format for incident responders, and serving them after the incident concludes gives responders plenty of time to fill in the blanks.

04 Communicating
After the Incident. Retrospectives and Ops Review.

Retrospectives: How our understanding changed

Summaries of Unusual Size? I don’t think they exist.

Ops Review: Another audience to consider
Aggregating Incident Data. How can we provide a brief summary of each incident, but retain the ability to get into technical specifics if need be?
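One way to picture the question on this slide is a record that carries both a one-line summary and optional technical detail, so an ops review digest stays brief by default and expands only on request. This is a sketch under assumptions: `IncidentRecord`, its fields, and `ops_review_digest` are hypothetical names, not part of any incident management product.

```python
from dataclasses import dataclass


@dataclass
class IncidentRecord:
    """Hypothetical shape for one incident in an ops review."""

    incident_id: str
    severity: str
    summary: str       # one line, written for a broad audience
    details: str = ""  # technical specifics, shown only on request


def ops_review_digest(incidents, expand=()):
    """Render one line per incident; add details for IDs listed in `expand`."""
    lines = []
    for inc in incidents:
        lines.append(f"{inc.incident_id} [{inc.severity}] {inc.summary}")
        if inc.incident_id in expand and inc.details:
            lines.append(f"    {inc.details}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative data only; these are not real incidents.
    records = [
        IncidentRecord("INC-1", "SEV-2", "Elevated API latency",
                       "Connection pool exhausted after a config change."),
        IncidentRecord("INC-2", "SEV-3", "Delayed email notifications"),
    ]
    print(ops_review_digest(records, expand={"INC-1"}))
```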

05 Looking Ahead
What the Future Holds.

Tooling. Have our needs changed? What else is out there?

Scaling Across Teams: Standardization vs Customization
Running incidents is easy except for when there’s an incident.

More Effective Analysis: Building on the data foundation to learn more
Aggregate Analysis. What can we learn about our division or about the organization as a whole on a quarterly or annual basis?
Team Analysis. How can we provide meaningful data on a team-by-team basis about incident response and service health over regular time periods?
SLO Analysis. How well can we quantify incident impact against our SLOs and what conclusions can we draw about our SLOs as a result?
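The SLO question can be made concrete with a common error-budget calculation. The sketch below is illustrative, not the analysis the speakers describe: it assumes an availability SLO over a fixed window and treats every incident minute as full downtime, which overstates the impact of partial outages. The function names are hypothetical.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    return (1 - slo_target) * window_days * 24 * 60


def budget_consumed(incident_minutes, slo_target: float,
                    window_days: int = 30) -> float:
    """Fraction of the error budget used by the listed incident durations.

    Assumption: each incident counts as total downtime for its duration.
    """
    return sum(incident_minutes) / error_budget_minutes(slo_target, window_days)


if __name__ == "__main__":
    # Illustrative numbers only: two incidents against a 99.9% 30-day SLO.
    used = budget_consumed([10, 20], slo_target=0.999)
    print(f"Error budget consumed: {used:.0%}")
```

A result above 1.0 means the window's incidents exceeded the budget, which is one signal for revisiting either the service or the SLO itself.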

Thank You [email protected] | learn.hashicorp.com | discuss.hashicorp.com
