How HashiCorp SREs Built HCP's Incident Management Program
HashiTalks 2022
Slide 3
Slide 3 text
Martin Smith
Site Reliability Engineer at HashiCorp
he/him
@martinb3
Slide 4
Slide 4 text
01 Getting started
Building HashiCorp Cloud Platform.
Standing on Terraform Cloud’s Shoulders
Slide 5
Slide 5 text
HashiCorp’s writing culture is our superpower. We get to see how our colleagues think, emulate those processes, and get smarter ourselves.
Margaret Gillette, Senior Director, Talent Development
Slide 7
Slide 7 text
A timeline of the HashiCorp Cloud Platform
April 2020: HCS Public Beta
July 2020: HCS GA
October 2020: HCP Consul Beta
January 2021: HCP Vault Beta
February 2021: HCP Consul GA
April 2021: HCP Vault GA
October 2021: HCP Packer Beta
Slide 9
Slide 9 text
Incident Management: Severities, Kickoff, During, After, Facilitation, Process to run, Documentation/artifacts, Action items, Retrospective / Postmortem
On-call: Monitoring/integrations, Rotations, Wake up/not wake up, IaC
Slide 10
Slide 10 text
Team structure matters
Distribution, size, shape of team
On-call rotations
Slide 11
Slide 11 text
Culture comes from within
We really care about humane incident culture
But you can’t change it from the outside
Slide 12
Slide 12 text
On-call is creepy
Support requests, weekly data processing, etc. (keeping the lights on)
Resist that urge.
Slide 13
Slide 13 text
Blame is baggage
Slide 14
Slide 14 text
Incidents have reasons
Slide 15
Slide 15 text
Always be retro’ing
Not just for incidents anymore.
Retro the IM process. Retro the retro process. Retro the on-call experience. All of it.
Slide 16
Slide 16 text
02 Next steps
Tooling for incidents.
It broke everything
Slide 17
Slide 17 text
The manual steps in running an incident 😭
Form the incident team
Page the right people
Create a Zoom room
Make a Slack channel
Ask people to post in Slack
Build the retrospective doc
Interrupt constantly
Scribe what happened
Schedule the retrospective
Find a facilitator
Nag everyone about everything
Send emails
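One way to picture what incident tooling takes over from these manual steps is a small runbook that groups steps by phase and records each one as it runs. This is a minimal sketch, not the tooling described in the talk: the `Runbook` and `Step` classes, the phase names, the assignment of steps to phases, and the stub actions are all hypothetical, and a real system would call chat, paging, and calendar integrations instead of returning strings.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Incident = Dict[str, str]


@dataclass
class Step:
    name: str
    phase: str  # hypothetical phases: "kickoff" or "after"
    action: Callable[[Incident], str]


@dataclass
class Runbook:
    steps: List[Step] = field(default_factory=list)
    log: List[str] = field(default_factory=list)

    def add(self, name: str, phase: str, action: Callable[[Incident], str]) -> None:
        self.steps.append(Step(name, phase, action))

    def run(self, phase: str, incident: Incident) -> List[str]:
        """Run every step registered for a phase, in order, and log the result."""
        done = []
        for step in self.steps:
            if step.phase == phase:
                result = step.action(incident)
                self.log.append(f"[{phase}] {step.name}: {result}")
                done.append(step.name)
        return done


def build_runbook() -> Runbook:
    # Stub actions only: each lambda stands in for a real integration call.
    rb = Runbook()
    rb.add("Create a Slack channel", "kickoff", lambda i: f"#incident-{i['id']}")
    rb.add("Page the right people", "kickoff", lambda i: f"paged on-call for {i['service']}")
    rb.add("Build the retrospective doc", "after", lambda i: f"retro doc for {i['id']}")
    rb.add("Schedule the retrospective", "after", lambda i: "retro scheduling requested")
    return rb


if __name__ == "__main__":
    runbook = build_runbook()
    runbook.run("kickoff", {"id": "42", "service": "example-service"})
    print("\n".join(runbook.log))
```

Keeping the steps as data means the log doubles as a first draft of the timeline, which is the part responders otherwise have to scribe by hand.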
Slide 20
Slide 20 text
Michael Main
Site Reliability Engineer at HashiCorp
he/him
Slide 21
Slide 21 text
03 Following Up
Learning More from Incidents.
Refining our process
Slide 22
Slide 22 text
Change is Hard
Especially when the stakes are high
Slide 23
Slide 23 text
Getting a Lay of the Land
What we already had.
Quantifiable data is already tracked by our incident management software.
What we still need.
Anything that the incident management software can’t know.
Slide 24
Slide 24 text
Tags.
And also the problem with tags.
More Data, More Burden.
Tags are an easy way to attribute qualitative data, but they are also another step for incident responders to remember.
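The burden described above shows up as incidents that close without the qualitative tags anyone wanted. A minimal sketch of a check for that, assuming tags are written as `key:value` strings; the required keys `cause` and `detection` and the incident IDs are hypothetical and not taken from the talk.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical qualitative tag keys every incident is expected to carry.
REQUIRED_TAG_KEYS = frozenset({"cause", "detection"})


def missing_tag_keys(tags: Iterable[str], required: Iterable[str] = REQUIRED_TAG_KEYS) -> Set[str]:
    """Return the required tag keys that do not appear in a list of key:value tags."""
    present = {tag.split(":", 1)[0] for tag in tags if ":" in tag}
    return set(required) - present


def incidents_needing_tags(incidents: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each incident ID to its missing tag keys, skipping fully tagged incidents."""
    report = {}
    for incident_id, tags in incidents.items():
        missing = missing_tag_keys(tags)
        if missing:
            report[incident_id] = sorted(missing)
    return report


if __name__ == "__main__":
    sample = {
        "INC-1": ["cause:deploy", "detection:alert"],
        "INC-2": ["cause:dns"],
        "INC-3": [],
    }
    print(incidents_needing_tags(sample))
```

A report like this only detects the gap after the fact; it does not remove the extra step for responders, which is the problem the slide raises.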
Slide 25
Slide 25 text
Getting the Timing Right
Unburdening Incident Responders
Retrospectives.
Custom questions are a more engaging format for incident responders, and serving them after the incident concludes gives responders plenty of time to fill in the blanks.
Slide 26
Slide 26 text
04 Communicating
After the Incident.
Retrospectives and Ops Review
Slide 27
Slide 27 text
Retrospectives
How our understanding changed
Slide 28
Slide 28 text
Summaries of Unusual Size?
I don’t think they exist.
Slide 30
Slide 30 text
Ops Review
Another audience to consider
Aggregating Incident Data.
How can we provide a brief summary of each incident, but retain the ability to get into technical specifics if need be?
Slide 31
Slide 31 text
05 Looking Ahead
What the Future Holds.
Slide 32
Slide 32 text
Tooling.
Have our needs changed? What else is out there?
Slide 33
Slide 33 text
Scaling Across Teams
Standardization vs Customization
Running incidents is easy except for when there’s an incident.
Slide 34
Slide 34 text
More Effective Analysis
Building on the data foundation to learn more
Aggregate Analysis.
What can we learn about our division or about the organization as a whole on a quarterly or annual basis?
Team Analysis.
How can we provide meaningful data on a team-by-team basis about incident response and service health over regular time periods?
SLO Analysis.
How well can we quantify incident impact against our SLOs and what conclusions can we draw about our SLOs as a result?
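For the SLO question, one common way to quantify incident impact is as a share of the error budget. The sketch below assumes a time-based availability SLO over a fixed window and weights downtime by the fraction of traffic affected; the function names, the `impact_fraction` parameter, and the example numbers are illustrative assumptions, not figures from the talk.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of allowed unavailability for an availability SLO over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_consumed(
    impact_minutes: float,
    impact_fraction: float,
    slo_target: float,
    window_days: int,
) -> float:
    """Fraction of the error budget used by one incident.

    impact_fraction is the assumed share of requests or customers affected
    (1.0 for a full outage, 0.5 if half of traffic failed).
    """
    if not 0.0 <= impact_fraction <= 1.0:
        raise ValueError("impact_fraction must be between 0 and 1")
    budget = error_budget_minutes(slo_target, window_days)
    return (impact_minutes * impact_fraction) / budget


if __name__ == "__main__":
    budget = error_budget_minutes(0.999, 30)
    used = budget_consumed(20, 1.0, 0.999, 30)
    print(f"budget: {budget:.1f} min, consumed by incident: {used:.0%}")
```

Under these assumptions, a 99.9% target over 30 days allows 43.2 minutes of downtime, so a 20-minute full outage uses roughly 46% of the budget. Summing this figure per team or per quarter is one route into the aggregate and team views listed above.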
Slide 35
Slide 35 text
Thank You
[email protected] | learn.hashicorp.com | discuss.hashicorp.com