How HashiCorp SREs Built HCP's Incident Management Program

Martin Smith
March 08, 2022

Back in 2019 and 2020, HashiCorp’s Cloud SREs worked with our engineering teams to build an incident management program for the soon-to-be-launched HashiCorp Cloud Platform (HCP). I’ll look back at how we initially approached this task, including a walk-through of the HashiCorp RFC process, and at some of the questions we encountered along the way (for example, tooling selection challenges, or whether to use words like retrospective instead of postmortem). I’ll also cover how that process has continued to evolve into 2022 and what we think the future holds for incident management at HashiCorp.

Transcript

  2. How HashiCorp SREs Built HCP's Incident Management Program
    HashiTalks 2022

  3. Martin Smith
    he/him
    @martinb3
    Site Reliability Engineer at HashiCorp

  4. Building HashiCorp Cloud Platform.
    Standing on Terraform Cloud’s Shoulders
    01 Getting started

  5. HashiCorp’s writing culture is our superpower. We get to see how our colleagues think, emulate those processes, and get smarter ourselves.
    Margaret Gillette, Senior Director Talent Development

  7. A timeline of the HashiCorp Cloud Platform
    April 2020: HCS Public Beta
    July 2020: HCS GA
    October 2020: HCP Consul Beta
    January 2021: HCP Vault Beta
    February 2021: HCP Consul GA
    April 2021: HCP Vault GA
    October 2021: HCP Packer Beta

  9. Incident Management: Severities, Kickoff, During, After
    Retrospective / Postmortem: Facilitation, Process to run, Documentation/artifacts, Action items
    On-call: Monitoring/integrations, Rotations, Wake up/not wake up, IaC

  10. Team structure matters
    Distribution, size, shape of team
    On-call rotations

  11. Culture comes from within
    We really care about humane incident culture
    But you can’t change it from the outside

  12. On-call is creepy
    Support requests, weekly data processing, etc – keeping the lights on
    Resist that urge.

  13. Blame is baggage
    But why does it matter
    And some more exposition or citation

  14. Incidents have reasons
    But why does it matter
    And some more exposition or citation

  15. Always be retro’ing
    Not just for incidents anymore.
    Retro the IM process. Retro the retro process.
    Retro the on-call experience. All of it.

  16. Tooling for incidents.
    It broke everything
    02 Next steps

  17. The manual steps in running an incident 😭
    Form the incident team
    Page the right people
    Create a Zoom room
    Make a Slack channel
    Ask people to post in Slack
    Build the retrospective doc
    Interrupt constantly
    Scribe what happened
    Schedule the retrospective
    Find a facilitator
    Nag everyone about everything
    Send emails
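As an illustration only, the manual steps on this slide can be modeled as a checklist that tooling tracks per incident. This is a minimal sketch, not the tooling the talk describes: the `Incident` class, its fields, and the `inc-<number>-<title>` channel naming scheme are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date

# The manual steps from the slide, kept in the order they are listed.
MANUAL_STEPS = [
    "Form the incident team",
    "Page the right people",
    "Create a Zoom room",
    "Make a Slack channel",
    "Ask people to post in Slack",
    "Build the retrospective doc",
    "Interrupt constantly",
    "Scribe what happened",
    "Schedule the retrospective",
    "Find a facilitator",
    "Nag everyone about everything",
    "Send emails",
]


@dataclass
class Incident:
    """Hypothetical record that tracks which manual steps are done."""

    number: int
    title: str
    opened: date
    done: set = field(default_factory=set)

    def channel_name(self) -> str:
        # Assumed naming scheme: inc-<number>-<lowercased-title-with-dashes>
        slug = "-".join(self.title.lower().split())
        return f"inc-{self.number}-{slug}"

    def complete(self, step: str) -> None:
        # Reject anything that is not one of the known steps.
        if step not in MANUAL_STEPS:
            raise ValueError(f"unknown step: {step!r}")
        self.done.add(step)

    def remaining(self) -> list:
        # Preserve the slide's ordering for the steps still open.
        return [s for s in MANUAL_STEPS if s not in self.done]
```

Calling `remaining()` after each `complete()` shows what a responder would otherwise have to remember by hand, which is the burden the slide is pointing at.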

  20. Michael Main
    he/him
    Site Reliability Engineer at HashiCorp

  21. Learning More from Incidents.
    Refining our process
    03 Following Up

  22. Change is Hard
    Especially when the stakes are high

  23. Getting a Lay of the Land
    What we already had.
    Quantifiable data is already tracked by our incident management software.
    What we still need.
    Anything that the incident management software can’t know.

  24. Tags.
    And also the problem with tags.
    More Data, More Burden.
    Tags are an easy way to attribute qualitative data, but they are also another step for incident responders to remember.

  25. Getting the Timing Right
    Unburdening Incident Responders
    Retrospectives.
    Custom questions are a more engaging format for incident responders, and serving them after the incident concludes gives responders plenty of time to fill in the blanks.

  26. After the Incident.
    Retrospectives and Ops Review
    04 Communicating

  27. Retrospectives
    How our understanding changed

  28. Summaries of Unusual Size?
    I don’t think they exist.

  30. Ops Review
    Another audience to consider
    Aggregating Incident Data.
    How can we provide a brief summary of each incident, but retain the ability to get into technical specifics if need be?

  31. What the Future Holds.
    05 Looking Ahead

  32. Tooling.
    Have our needs changed? What else is out there?

  33. Scaling Across Teams
    Standardization vs Customization
    Running incidents is easy except for when there’s an incident.

  34. More Effective Analysis
    Building on the data foundation to learn more
    Aggregate Analysis.
    What can we learn about our division or about the organization as a whole on a quarterly or annual basis?
    Team Analysis.
    How can we provide meaningful data on a team-by-team basis about incident response and service health over regular time periods?
    SLO Analysis.
    How well can we quantify incident impact against our SLOs and what conclusions can we draw about our SLOs as a result?
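One common way to quantify incident impact against an SLO is an error budget: the amount of unavailability the target tolerates over a window. The sketch below is a hedged illustration of that arithmetic, not the analysis the speakers built. The function names are hypothetical, and it makes the simplifying assumption that every incident minute counts as a full outage.

```python
from typing import Iterable

MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of downtime an availability SLO allows over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return (1.0 - slo_target) * window_days * MINUTES_PER_DAY


def budget_consumed(
    incident_minutes: Iterable[float], slo_target: float, window_days: int
) -> float:
    """Fraction of the error budget used by the given incidents.

    Assumption: each incident minute is treated as total unavailability.
    A value above 1.0 means the SLO was missed for the window.
    """
    budget = error_budget_minutes(slo_target, window_days)
    return sum(incident_minutes) / budget
```

For example, a 99.9% target over 30 days allows about 43.2 minutes of downtime, so a single 20-minute incident uses a little under half of the budget. Partial degradation would need a weighting per incident, which this sketch leaves out.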

  35. Thank You
    learn.hashicorp.com | discuss.hashicorp.com
