How HashiCorp SREs Built HCP's Incident Management Program

Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

How HashiCorp SREs Built HCP's Incident Management Program
HashiTalks 2022

Slide 3

Slide 3 text

Martin Smith (he/him)
Site Reliability Engineer at HashiCorp
@martinb3

Slide 4

Slide 4 text

01 Getting started
Building HashiCorp Cloud Platform. Standing on Terraform Cloud’s Shoulders

Slide 5

Slide 5 text

“HashiCorp’s writing culture is our superpower. We get to see how our colleagues think, emulate those processes, and get smarter ourselves.”
Margaret Gillette, Senior Director, Talent Development

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

A timeline of the HashiCorp Cloud Platform
Dates: April 2020, July 2020, October 2020, January 2021, February 2021, April 2021, October 2021
Milestones: HCS Public Beta, HCS GA, HCP Consul Beta, HCP Vault Beta, HCP Consul GA, HCP Vault GA, HCP Packer Beta

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

Incident Management: Severities, Kickoff, During, After
Retrospective / Postmortem: Facilitation, Process to run, Documentation/artifacts, Action items
On-call: Monitoring/integrations, Rotations, Wake up/not wake up, IaC

Slide 10

Slide 10 text

Team structure matters
Distribution, size, shape of team
On-call rotations

Slide 11

Slide 11 text

Culture comes from within
We really care about humane incident culture
But you can’t change it from the outside

Slide 12

Slide 12 text

On-call is creepy
Support requests, weekly data processing, etc. – keeping the lights on
Resist that urge.

Slide 13

Slide 13 text

Blame is baggage
But why does it matter
And some more exposition or citation

Slide 14

Slide 14 text

Incidents have reasons
But why does it matter
And some more exposition or citation

Slide 15

Slide 15 text

Always be retro’ing
Not just for incidents anymore. Retro the IM process. Retro the retro process. Retro the on-call experience. All of it.

Slide 16

Slide 16 text

02 Next steps
Tooling for incidents. It broke everything

Slide 17

Slide 17 text

The manual steps in running an incident 😭
Form the incident team
Page the right people
Create a Zoom room
Make a Slack channel
Ask people to post in Slack
Build the retrospective doc
Interrupt constantly
Scribe what happened
Schedule the retrospective
Find a facilitator
Nag everyone about everything
Send emails
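A minimal sketch of how the manual steps above could be tracked as an ordered checklist, so that nothing is forgotten mid-incident. The step names come from the slide; the `IncidentChecklist` class and its methods are hypothetical and do not represent the tooling the speakers adopted.

```python
from dataclasses import dataclass, field
from typing import List, Set

# Step names are copied from the slide; keeping them in this order is an assumption.
MANUAL_STEPS: List[str] = [
    "Form the incident team",
    "Page the right people",
    "Create a Zoom room",
    "Make a Slack channel",
    "Ask people to post in Slack",
    "Build the retrospective doc",
    "Interrupt constantly",
    "Scribe what happened",
    "Schedule the retrospective",
    "Find a facilitator",
    "Nag everyone about everything",
    "Send emails",
]


@dataclass
class IncidentChecklist:
    """Hypothetical tracker for the manual steps of a single incident."""

    title: str
    done: Set[str] = field(default_factory=set)

    def complete(self, step: str) -> None:
        # Reject typos so that a misspelled step is never silently "completed".
        if step not in MANUAL_STEPS:
            raise ValueError(f"unknown step: {step}")
        self.done.add(step)

    def remaining(self) -> List[str]:
        # Preserve the original ordering so the next step is always first.
        return [s for s in MANUAL_STEPS if s not in self.done]
```

Each entry in such a checklist is a candidate for automation: every step a tool performs on its own is one fewer thing for a responder to remember.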

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

Michael Main (he/him)
Site Reliability Engineer at HashiCorp

Slide 21

Slide 21 text

03 Following Up
Learning More from Incidents. Refining our process

Slide 22

Slide 22 text

Change is Hard
Especially when the stakes are high

Slide 23

Slide 23 text

Getting a Lay of the Land
What we already had. Quantifiable data is already tracked by our incident management software.
What we still need. Anything that the incident management software can’t know.

Slide 24

Slide 24 text

Tags. And also the problem with tags.
More Data, More Burden. Tags are an easy way to attribute qualitative data, but they are also another step for incident responders to remember.

Slide 25

Slide 25 text

Getting the Timing Right
Unburdening Incident Responders
Retrospectives. Custom questions are a more engaging format for incident responders, and serving them after the incident concludes gives responders plenty of time to fill in the blanks.
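As a rough illustration of the custom-question approach, the sketch below models a post-incident questionnaire and reports which questions still need an answer. The question keys and wording are hypothetical (the slides do not list the actual questions), and `unanswered` is an illustrative helper rather than a feature of any particular incident management product.

```python
from typing import Dict, List, Optional

# Hypothetical questions; the talk does not state which ones were used.
RETRO_QUESTIONS: Dict[str, str] = {
    "detection": "How was the incident first detected?",
    "contributing_factors": "What factors contributed to the incident?",
    "customer_impact": "What did customers experience?",
}


def unanswered(answers: Dict[str, Optional[str]]) -> List[str]:
    """Return the keys of questions with no non-blank answer, in question order."""
    return [
        key
        for key in RETRO_QUESTIONS
        if not (answers.get(key) or "").strip()
    ]
```

Because the questions are asked after the incident concludes, a reminder can be driven by the list this function returns, rather than relying on responders to remember tags during the incident.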

Slide 26

Slide 26 text

04 Communicating
After the Incident. Retrospectives and Ops Review

Slide 27

Slide 27 text

Retrospectives
How our understanding changed

Slide 28

Slide 28 text

Summaries of Unusual Size? I don’t think they exist.

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

Ops Review
Another audience to consider
Aggregating Incident Data. How can we provide a brief summary of each incident, but retain the ability to get into technical specifics if need be?

Slide 31

Slide 31 text

04 Looking Ahead
What the Future Holds.

Slide 32

Slide 32 text

Tooling. Have our needs changed? What else is out there?

Slide 33

Slide 33 text

Scaling Across Teams
Standardization vs Customization
Running incidents is easy except for when there’s an incident.

Slide 34

Slide 34 text

More Effective Analysis
Building on the data foundation to learn more
Aggregate Analysis. What can we learn about our division or about the organization as a whole on a quarterly or annual basis?
Team Analysis. How can we provide meaningful data on a team-by-team basis about incident response and service health over regular time periods?
SLO Analysis. How well can we quantify incident impact against our SLOs and what conclusions can we draw about our SLOs as a result?
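One common way to quantify incident impact against an availability SLO is to express incident duration as a fraction of the error budget for the window. The sketch below shows that arithmetic under simplifying assumptions: each incident is treated as a full outage for its whole duration, and the function names, the 99.9% target, and the 30-day window are illustrative rather than taken from the talk.

```python
from typing import Iterable


def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of allowed unavailability for an availability SLO over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_consumed(
    incident_minutes: Iterable[float], slo_target: float, window_days: int
) -> float:
    """Fraction of the error budget used by the given incidents.

    Assumes every incident minute counts as total unavailability, which
    overstates the impact of partial outages.
    """
    total = sum(incident_minutes)
    return total / error_budget_minutes(slo_target, window_days)


if __name__ == "__main__":
    # A 99.9% target over 30 days allows roughly 43.2 minutes of downtime.
    budget = error_budget_minutes(0.999, 30)
    used = budget_consumed([10.8, 10.8], 0.999, 30)
    print(f"budget: {budget:.1f} min, consumed: {used:.0%}")
```

A value above 1.0 means the incidents in the window exceeded the budget, which is one signal that either the service or the SLO itself deserves a closer look.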

Slide 35

Slide 35 text

Thank You
[email protected] | learn.hashicorp.com | discuss.hashicorp.com