Big Red Button: How Stripe Automates Incident Management (SF Women in Infrastructure)

@amyngyn Women in Infrastructure
Amy Nguyen, Infrastructure Engineer, Stripe
June 25, 2018
Big Red Button: How Stripe Automates Incident Management

What's Stripe? Who are you?
• Stripe builds economic infrastructure for the Internet.
• Security and reliability are the most important values we can provide to our users.
• I'm an engineer on the Observability team. Find me at amynguyen.net.
• This is my cat, Pumpkin. I've had him since middle school!

How do you declare an incident?
• Page a team manually
• Send an email to the whole company
• Use @channel in Slack
• Scream until someone hears you
• Kick your CEO out of the biggest conference room in the office and declare it a war room

What about all the little things you have to do?
• Update your company's status pages (e.g., Twitter, Statuspage, RSS)
• Inform stakeholders who are not remediating (e.g., legal, communications, security, account managers)
• Send emails
• Create a ticket to track the incident
• Document the incident timeline
• Announce where the remediation is happening (Slack channel? VC?)

Introducing Big Red Button

• Randomly generated incident ID
• For severe/user-facing incidents, automatically pages our communications team
• Helps you find an incident PM, or incident commander (wait for Connie-Lynne's talk!)
• As few questions as possible to help with incident panic
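The behavior listed above could be sketched roughly as follows. This is a minimal illustration, not Stripe's implementation: the `Incident` class, the `inc-` ID format, the severity labels, and the team names are all hypothetical, since the talk does not describe them.

```python
import secrets
from dataclasses import dataclass, field
from typing import List

# Hypothetical severity labels; the talk does not name the actual levels.
SEVERE = "severe"
MINOR = "minor"


@dataclass
class Incident:
    summary: str
    severity: str
    user_facing: bool
    # A random ID means the reporter never has to invent or coordinate a name.
    # The "inc-" prefix and 8 hex characters are assumptions for illustration.
    incident_id: str = field(
        default_factory=lambda: "inc-" + secrets.token_hex(4)
    )


def teams_to_page(incident: Incident) -> List[str]:
    """Return the (hypothetical) rotations to page for this incident."""
    # Always look for someone to coordinate the response.
    teams = ["incident-commander"]
    # Severe or user-facing incidents also page communications automatically.
    if incident.severity == SEVERE or incident.user_facing:
        teams.append("communications")
    return teams


incident = Incident("Elevated API error rate", SEVERE, user_facing=True)
print(incident.incident_id, teams_to_page(incident))
```

The reporter answers only three questions here (summary, severity, user-facing or not), in keeping with the goal of asking as little as possible during an incident.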

Design Considerations
• Every task can fail and you must be able to handle it. Give clear instructions on how to manually perform each task.
• Incident reporting must be as fast as possible. Don't slow down the reporter with questions that are not immediately important for remediation.
• Over-communicate. Don't make the reporter second-guess who they need to contact. Contact them and figure it out later.
• Do everything that can be automated. Let humans do what they do best. Find the tasks people are doing repetitively and do everything possible to lower the amount of time spent on incident response.
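The first consideration (every task can fail, so give manual instructions) might look something like the sketch below. It assumes a hypothetical `Task` type and two stand-in tasks; none of these names come from the talk, and a real system would call actual ticketing and status-page APIs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    name: str
    run: Callable[[], None]   # the automated action
    manual_instructions: str  # what a human should do if the automation fails


def run_tasks(tasks: List[Task]) -> List[str]:
    """Run every task and return manual follow-ups for the ones that failed."""
    follow_ups = []
    for task in tasks:
        try:
            task.run()
        except Exception as exc:
            # One failed task must not stop the remaining tasks from running.
            follow_ups.append(
                f"{task.name} failed ({exc}). "
                f"Do it manually: {task.manual_instructions}"
            )
    return follow_ups


def create_ticket() -> None:
    pass  # stand-in for a real ticketing call


def update_status_page() -> None:
    # Simulate a dependency being down during the incident.
    raise RuntimeError("status page API timed out")


follow_ups = run_tasks([
    Task("Create ticket", create_ticket,
         "Open a ticket in the tracker by hand."),
    Task("Update status page", update_status_page,
         "Log in to the status page and post the update yourself."),
])
for line in follow_ups:
    print(line)
```

Catching the failure per task keeps the rest of the checklist moving, and the reporter sees exactly which steps still need a human.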

What's next?
• How do we categorize incidents?
• How do we make sure that all of the steps of the incident checklist have been followed?
• How do we make the post-incident review process even easier?
• Can we learn more about which services have a low bus factor?

Artist: mintlodica (twitter / instagram)
Special thanks to Kiran Bhattaram (@kiranb), Davin Bogan (@davinbogan), Andreas Fuchs (@antifuchs), Taleena Herkenhoff, Robert Pooley, and the rest of the Observability team at Stripe!
