The No-Nonsense Guide to Runbook Best Practices

Runbooks are a key part of incident management and preserve institutional knowledge. They can be used for both incident response and routine tasks such as database maintenance or generating a complex report. This deck focuses mostly on incident response runbooks.

Hrishikesh Barua

November 04, 2024
Transcript

  1. /02 Runbook Structure: Establish a consistent format for runbooks across the organization. Get buy-in from the team on the chosen format. Structure runbooks as decision trees, keeping them concise. Each runbook should have a single purpose. incidenthub.cloud
  2. /03 Runbook Content: Runbooks should provide clear, actionable instructions. Keep content concise and trim unnecessary details. Include relevant architecture diagrams and links to dashboards. Be aware of the "curse of knowledge" when writing. It's okay to have some manual steps in runbooks.
  3. /04 Updating and Maintaining Runbooks: Update runbooks after incidents based on observed issues. Coordinate with teams to ensure runbook updates happen. Note and fix any inaccuracies discovered during incidents.
  4. /05 Testing Runbooks: Test runbooks from a "clean" machine before deployment. Involve new hires and conduct mock incident exercises. Regularly test runbooks to ensure they work as expected.
  5. /06 Locating Runbooks: Store runbooks in a central, accessible location. Link alerts directly to relevant runbooks. Improve findability through descriptive naming and keywords.
  6. /07 Runbook Ownership: Service teams own the runbooks for their services. The SRE/Ops team owns runbooks for infrastructure and common components. Encourage collective ownership and rotating update responsibilities.
  7. /08 What Not To Do: Avoid overly generic runbooks. Don't have more than one runbook per alert. Never store credentials in runbooks.
  8. /09 Dealing With the Unexpected: If no runbook exists, involve the service owner and document the steps. If runbook steps are wrong or inaccessible, do not execute them blindly. Understand system interactions before executing commands. Pull in subject matter experts if runbook steps are incorrect.
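
The structure advice in the slides (decision trees, a single purpose per runbook, one runbook per alert, stop when steps do not fit) can be sketched in code. The following is only an illustrative sketch, not the deck author's tooling: the `Step` and `Runbook` classes, the `walk` and `alerts_with_multiple_runbooks` helpers, and the `QueueDepthHigh` alert name are all hypothetical and assumed for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Step:
    """One node of a runbook decision tree (hypothetical structure)."""
    instruction: str
    # Maps an observed outcome (for example "yes" or "no") to the next step.
    branches: Dict[str, "Step"] = field(default_factory=dict)


@dataclass
class Runbook:
    name: str
    purpose: str  # each runbook should have a single purpose
    alert: str    # the alert that links directly to this runbook
    root: Step


def walk(runbook: Runbook, answers: List[str]) -> List[str]:
    """Follow the tree with the responder's answers; return the instructions visited."""
    visited: List[str] = []
    remaining = iter(answers)
    step: Optional[Step] = runbook.root
    while step is not None:
        visited.append(step.instruction)
        if not step.branches:
            break  # leaf node: nothing further to decide
        answer = next(remaining, None)
        if answer not in step.branches:
            # The runbook does not cover this outcome: stop instead of guessing.
            visited.append("STOP: outcome not covered; involve the service owner")
            break
        step = step.branches[answer]
    return visited


def alerts_with_multiple_runbooks(runbooks: List[Runbook]) -> List[str]:
    """Return alert names that are linked from more than one runbook."""
    counts: Dict[str, int] = {}
    for rb in runbooks:
        counts[rb.alert] = counts.get(rb.alert, 0) + 1
    return sorted(alert for alert, n in counts.items() if n > 1)


# Hypothetical example runbook for a job-queue backlog alert.
restart = Step("Restart the worker service and confirm the queue drains")
scale = Step("Add a worker and watch the queue depth dashboard")
root = Step("Is the worker process running?", {"no": restart, "yes": scale})
queue_backlog = Runbook("queue-backlog", "Clear a growing job queue", "QueueDepthHigh", root)

for line in walk(queue_backlog, ["no"]):
    print(line)
```

A check like `alerts_with_multiple_runbooks` could run whenever runbooks are updated, so that a second runbook for the same alert is caught before an incident rather than during one.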