Big Red Button: How Stripe Automates Incident Management (SF Women in Infrastructure)

When an incident starts, ten different things need to happen at once. You need to get an incident commander, you need to get all the right people in the room, you need to mitigate the incident, and you need to stay organized. At Stripe, we've built a tool for automating as many of the routine tasks as possible so responders can focus on what humans do best. In this talk, I'll show you the Big Red Button, a web form that sends emails, creates JIRA tickets, opens Slack channels, sends pages, and more. We'll talk about the unique constraints of this tool (for example, how much incident metadata do you ask for up front?) and how our incident management philosophy influenced our design.

Amy Nguyen

June 25, 2018

Transcript

  1. @amyngyn Women in Infrastructure
     Amy Nguyen, Infrastructure Engineer, Stripe
     June 25, 2018
     Big Red Button: How Stripe Automates Incident Management
  2. What's Stripe? Who are you?
     • Stripe builds economic infrastructure for the Internet.
     • Security and reliability are the most important values we can provide to our users.
     • I'm an engineer on the Observability team. Find me at amynguyen.net.
     • This is my cat, Pumpkin. I've had him since middle school!
  3. How do you declare an incident?
     • Page a team manually
     • Send an email to the whole company
     • Use @channel in Slack
     • Scream until someone hears you
     • Kick your CEO out of the biggest conference room in the office and declare it a war room
  4. What about all the little things you have to do?
     • Update your company's status pages (e.g., Twitter, Statuspage, RSS)
     • Inform stakeholders who are not remediating (e.g., legal, communications, security, account managers)
     • Send emails
     • Create a ticket to track the incident
     • Document the incident timeline
     • Announce where the remediation is happening (Slack channel? VC?)
  5. • Randomly generated incident ID
     • For severe/user-facing incidents, automatically pages our communications team
     • Helps you find an incident PM, or incident commander (wait for Connie-Lynne's talk!)
     • As few questions as possible to help with incident panic
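
As a rough illustration of the intake step on this slide, the Python sketch below generates a random incident ID and decides who gets paged. It is not Stripe's implementation: the `IncidentReport` fields, the `incident-` ID prefix, and the pager target names are all hypothetical, and paging an incident commander rotation is only one assumed way the tool might "help you find" a commander.

```python
import secrets
from dataclasses import dataclass, field
from typing import List


def new_incident_id() -> str:
    """Return a random ID such as 'incident-9f3a1c2e' (the format is an assumption)."""
    return "incident-" + secrets.token_hex(4)


@dataclass
class IncidentReport:
    # Hypothetical minimal form: ask only what is needed to start responding.
    summary: str
    severe_or_user_facing: bool = False
    incident_id: str = field(default_factory=new_incident_id)


def teams_to_page(report: IncidentReport) -> List[str]:
    """Return the (hypothetical) pager targets for a new report."""
    # Assumption: every incident pages a rotation that supplies a commander.
    targets = ["incident-commander-rotation"]
    # From the slide: severe or user-facing incidents also page communications.
    if report.severe_or_user_facing:
        targets.append("communications")
    return targets
```

Keeping the form to a summary and a single yes/no question reflects the "as few questions as possible" point; anything else can be collected after responders are already in the room.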
  6. Design Considerations
     • Every task can fail and you must be able to handle it. Give clear instructions on how to manually perform each task.
     • Incident reporting must be as fast as possible. Don't slow down the reporter with questions that are not immediately important for remediation.
     • Over-communicate. Don't make the reporter second-guess who they need to contact. Contact them and figure it out later.
     • Do everything that can be automated. Let humans do what they do best. Find the tasks people are doing repetitively and do everything possible to lower the amount of time spent on incident response.
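
The first consideration above (every task can fail, so tell the reporter how to do it by hand) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the tool's actual code: the `Task` type, `run_all`, and the example task names and instructions are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    name: str
    run: Callable[[], None]      # the automated action; may raise
    manual_instructions: str     # what a human should do if it fails


def run_all(tasks: List[Task]) -> List[str]:
    """Attempt every task and return manual instructions for those that failed."""
    fallbacks = []
    for task in tasks:
        try:
            task.run()
        except Exception as exc:
            # One failed task must not block the remaining ones.
            fallbacks.append(
                f"{task.name} failed ({exc}). Do it manually: {task.manual_instructions}"
            )
    return fallbacks


def open_chat_channel() -> None:
    # Hypothetical stand-in that simulates an outage of the chat API.
    raise RuntimeError("chat API unavailable")


def create_ticket() -> None:
    # Hypothetical stand-in that pretends the ticket was created.
    pass


if __name__ == "__main__":
    demo = [
        Task("Open chat channel", open_chat_channel, "create the channel yourself and invite responders"),
        Task("Create ticket", create_ticket, "file a ticket in the incident project"),
    ]
    for line in run_all(demo):
        print(line)
```

Collecting the failures instead of stopping at the first one means the reporter still gets every automated step that worked, plus a short to-do list for the steps that did not.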
  7. What's next?
     • How do we categorize incidents?
     • How do we make sure that all of the steps of the incident checklist have been followed?
     • How do we make the post-incident review process even easier?
     • Can we learn more about which services have a low bus factor?
  8. artist: mintlodica (twitter / instagram)
     Special thanks to Kiran Bhattaram (@kiranb), Davin Bogan (@davinbogan), Andreas Fuchs (@antifuchs), Taleena Herkenhoff, Robert Pooley, and the rest of the Observability team at Stripe!