Big Red Button: How Stripe Automates Incident Management (SF Women in Infrastructure)

When an incident starts, ten different things need to happen at once. You need to find an incident commander, get all the right people in the room, mitigate the incident, and stay organized. At Stripe, we've built a tool that automates as many of the routine tasks as possible so responders can focus on what humans do best. In this talk, I'll show you the Big Red Button, a web form that sends emails, creates JIRA tickets, opens Slack channels, sends pages, and more. We'll talk about the unique constraints of this tool (for example, how much incident metadata do you ask for up front?) and how our incident management philosophy influenced its design.

Amy Nguyen

June 25, 2018

Transcript

  1. @amyngyn Women in Infrastructure
    Amy Nguyen
    Infrastructure Engineer, Stripe
    June 25, 2018
    Big Red Button
    How Stripe Automates Incident Management

  2. @amyngyn Women in Infrastructure
    What's Stripe? Who are you?
    ● Stripe builds economic infrastructure
    for the Internet.
    ● Security and reliability are the most
    important values we can provide to
    our users.
    ● I'm an engineer on the Observability
    team. Find me at amynguyen.net.
    ● This is my cat, Pumpkin.
    I've had him since middle school!

  3. @amyngyn Women in Infrastructure
    How do you declare an incident?
    ● Page a team manually
    ● Send an email to the whole company
    ● Use @channel in Slack
    ● Scream until someone hears you
    ● Kick your CEO out of the biggest
    conference room in the office and
    declare it a war room

  4. @amyngyn Women in Infrastructure
    What about all the little things you have to do?
    ● Update your company's status pages (e.g., Twitter, Statuspage, RSS)
    ● Inform stakeholders who are not remediating (e.g., legal, communications,
    security, account managers)
    ● Send emails
    ● Create a ticket to track the incident
    ● Document the incident timeline
    ● Announce where the remediation is happening (Slack channel? VC?)

  6. @amyngyn Women in Infrastructure
    Introducing
    Big Red Button

  7. @amyngyn Women in Infrastructure
    ● Randomly generated incident ID
    ● For severe/user-facing incidents,
    automatically pages our
    communications team
    ● Helps you find an incident PM, or
    incident commander (wait for
    Connie-Lynne's talk!)
    ● As few questions as possible to
    help with incident panic
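
The first two bullets on this slide can be sketched in a few lines of Python. This is only an illustration, not Stripe's implementation: the talk does not say how IDs are generated or how severity is recorded, so the word lists, the `IncidentReport` fields, and the team names below are all hypothetical, and "severe/user-facing" is read here as "either one".

```python
import secrets
from dataclasses import dataclass, field

# Hypothetical word lists; the talk does not describe the real ID format.
ADJECTIVES = ["amber", "brisk", "calm", "dusty"]
NOUNS = ["falcon", "harbor", "lantern", "meadow"]


def new_incident_id() -> str:
    """Return a random, easy-to-say ID of the form adjective-noun-NNNN."""
    number = secrets.randbelow(9000) + 1000  # always four digits
    return f"{secrets.choice(ADJECTIVES)}-{secrets.choice(NOUNS)}-{number}"


@dataclass
class IncidentReport:
    # Only the summary is required, to keep the form short under stress.
    summary: str
    severe: bool = False
    user_facing: bool = False
    incident_id: str = field(default_factory=new_incident_id)


def teams_to_page(report: IncidentReport) -> list[str]:
    """Decide who gets paged. Team names are placeholders."""
    teams = ["incident-commander-rotation"]
    if report.severe or report.user_facing:
        teams.append("communications")
    return teams
```

Calling `teams_to_page(IncidentReport(summary="API errors", user_facing=True))` would include the placeholder communications team, while a report with neither flag set pages only the commander rotation.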

  16. @amyngyn Women in Infrastructure
    Design Considerations
    ● Every task can fail and you must be able to handle it.
    Give clear instructions on how to manually perform each task.
    ● Incident reporting must be as fast as possible.
    Don't slow down the reporter with questions that are not immediately important for
    remediation.
    ● Over-communicate.
    Don't make the reporter second-guess who they need to contact.
    Contact them and figure it out later.
    ● Do everything that can be automated. Let humans do what they do best.
    Find the tasks people are doing repetitively and do everything possible to lower the
    amount of time spent on incident response.
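
The first consideration (every task can fail, and the reporter needs manual instructions when it does) might look roughly like the sketch below. It assumes each automated step is a plain callable; the `Task` type, the task names, and the simulated failure are hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    name: str
    run: Callable[[], None]
    manual_instructions: str  # shown to the reporter if `run` raises


def run_all(tasks: list[Task]) -> list[str]:
    """Run every task and never let one failure stop the rest.

    Returns one message per failed task, telling the reporter
    how to do that step by hand.
    """
    fallbacks = []
    for task in tasks:
        try:
            task.run()
        except Exception as exc:  # broad on purpose: any failure needs a fallback
            fallbacks.append(
                f"{task.name} failed ({exc}). {task.manual_instructions}"
            )
    return fallbacks


def open_chat_channel() -> None:
    # Simulated outage of a chat integration.
    raise ConnectionError("chat API unreachable")


def create_ticket() -> None:
    # Stand-in for a real ticket-tracker call; succeeds silently.
    pass
```

Running `run_all` on a list containing both example tasks returns a single message for the chat step, while the ticket step still completes. Collecting failures instead of aborting matches the idea that a broken integration should not block the rest of the response.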

  17. @amyngyn Women in Infrastructure
    What's next?
    ● How do we categorize incidents?
    ● How do we make sure that all of the steps of the incident checklist have
    been followed?
    ● How do we make the post-incident review process even easier?
    ● Can we learn more about which services have a low bus factor?

  18. @amyngyn Women in Infrastructure
    artist: mintlodica (twitter / instagram)
    Special thanks to Kiran Bhattaram (@kiranb), Davin Bogan (@davinbogan), Andreas Fuchs
    (@antifuchs), Taleena Herkenhoff, Robert Pooley, and the rest of the Observability team at Stripe!
