Incident Response Patterns: What we have learned at PagerDuty

Incident Response Patterns What we have learned at PagerDuty @arupchak
#Agile2015 Arup Chakrabarti

Who is this guy?
•  PagerDuty
•  Netflix
•  Amazon
•  A bunch of other stuff

Quick Disclaimer
I did not come up with everything
•  I work with smart people
•  Slides will be posted

What is PagerDuty?

Why are we well positioned?
[Chart: Oct 2014 US East Outage – Outgoing Traffic]

We get a lot of Incident Data

What is Incident Response?
Ability to react to events in a methodical and organized way

DO NOT BE THIS GUY

Why does Incident Response Matter?
•  Expensive
    •  From devops.com
    •  For Fortune 1000
    •  $100,000 per hour for infra
    •  $1 million per hour for app

Why does Incident Response Matter?
•  Customer Confidence
    •  Long outages
    •  Upset users

Why does Incident Response Matter?
•  Unhappy Engineers
    •  Outages are bad enough
    •  Disorganized outages are even worse

Current State of The World
•  Large Customers
    •  Large % of Engineers using PagerDuty
•  Small Customers
    •  Small % of Engineers using PagerDuty
•  Using PagerDuty for >2 yrs
•  Grain of Salt

Current State of The World
[Chart: Largest 25 Customers, Mean Time To Resolution (Hours)]
2010: 14.7, 2011: 8.9, 2012: 2.7, 2013: 2.6, 2014: 2.3, 2015: 2.1

Current State of The World
[Chart: 250 Smaller Customers, MTTR (Hours)]
2010: 11.2, 2011: 14.4, 2012: 20.1, 2013: 8.3, 2014: 4.7, 2015: 4.2
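
For readers unfamiliar with the metric, mean time to resolution is the average time between an incident opening and being resolved. The sketch below is a minimal illustration, assuming each incident is a hypothetical (created, resolved) pair of datetimes; the function name and the sample incidents are invented here and are not how PagerDuty computed the charts.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution_hours(incidents):
    """Average (resolved - created), in hours, over (created, resolved) pairs."""
    if not incidents:
        raise ValueError("need at least one resolved incident")
    total = sum(
        ((resolved - created) for created, resolved in incidents),
        timedelta(),
    )
    return total.total_seconds() / 3600 / len(incidents)


# Hypothetical incidents, invented for illustration only.
example_incidents = [
    (datetime(2015, 8, 3, 9, 0), datetime(2015, 8, 3, 10, 0)),  # 1 hour
    (datetime(2015, 8, 4, 22, 0), datetime(2015, 8, 5, 1, 0)),  # 3 hours
]

print(mean_time_to_resolution_hours(example_incidents))
```

A mean is sensitive to a few very long incidents, which is one more reason to read the charts with the grain of salt mentioned earlier.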

What does this mean?
As an industry, we are getting better at Incident Response!

Are we done then?

Definitions for Rest of Talk
•  Phases of an Incident
    •  Detection
    •  Triage
    •  Notification
    •  Resolution
    •  Post Mortem

Detection

•  Best case – Self Detection
    •  Find out about problems before customers do
•  Worst case – User Detection
    •  Customers calling and yelling

•  At PagerDuty
    •  DataDog
    •  SumoLogic
    •  Wormly
    •  New Relic
    •  Air Brake
    •  Monitis
    •  Crashlytics
    •  In house scripts
    •  …
We use a lot of tools

•  At PagerDuty
    •  Try to use the right tool for the right detection type
    •  e.g. Time Series Data vs. Event Data

•  Time Series Data
    •  High sampling
    •  “How many logins in the last hour?”
•  Event Data
    •  “Did my cron last night fail?”
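
The two detection types can be sketched as two different kinds of question. This is a minimal illustration only: the function names, the one-hour window, and the sample timestamps are assumptions made here, not taken from any of the tools listed earlier.

```python
from datetime import datetime, timedelta


def logins_in_last_hour(login_times, now):
    """Time series style question: count samples inside a sliding window."""
    window_start = now - timedelta(hours=1)
    return sum(1 for t in login_times if window_start < t <= now)


def cron_failed(exit_code):
    """Event style question: did one discrete run fail?"""
    return exit_code != 0


# Hypothetical data, invented for illustration only.
now = datetime(2015, 8, 5, 12, 0)
login_times = [
    datetime(2015, 8, 5, 10, 30),  # outside the one-hour window
    datetime(2015, 8, 5, 11, 15),
    datetime(2015, 8, 5, 11, 59),
]

print(logins_in_last_hour(login_times, now))
print(cron_failed(1))
```

The first question needs a steady stream of samples to be meaningful; the second needs only the outcome of a single run, which is why one tool rarely fits both.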

•  Server Side Data
    •  Throwing 500’s
•  Client Side Data
    •  Cannot resolve DNS

Triage

•  Figure out customer impact
    •  Keyword: Customer
    •  Sliding scale, not binary
    •  Definitions are well understood

•  Bad Definition of Impact
    •  DB Server high CPU
•  Good Definition of Impact
    •  Customer cannot checkout

•  Build Alerts that assess Business Impact
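
An alert phrased in terms of business impact might look like the sketch below. It is only an illustration of the "customer cannot checkout" idea: the function name, the 95% threshold, and the minimum-traffic cutoff are all assumptions made here, not values from the talk.

```python
def checkout_alert(attempts, successes, min_success_rate=0.95, min_attempts=20):
    """Return a customer-facing impact message, or None if checkout looks healthy.

    Thresholds are hypothetical defaults chosen for illustration.
    """
    if attempts < min_attempts:
        return None  # too little traffic to judge in this simplified sketch
    rate = successes / attempts
    if rate < min_success_rate:
        failed_pct = round((1 - rate) * 100)
        return f"{failed_pct}% of customers cannot check out"
    return None


print(checkout_alert(200, 100))
```

The message describes what customers are experiencing rather than a host metric such as CPU, so whoever receives it can judge severity without first translating it.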

•  Make Dashboards Fast and Accessible
    •  Easier Decision Making

Notification

•  Impact should reflect response
    •  Site Outage => Whole team
    •  Partial Outage => 1 Person
    •  No Outage => No one
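
The mapping from impact to responders above can be written down directly. This is a minimal sketch, assuming three impact levels and a hypothetical team roster; the names `Impact` and `responders_for` are invented here and do not come from PagerDuty's product or API.

```python
from enum import Enum


class Impact(Enum):
    NONE = "no outage"
    PARTIAL = "partial outage"
    SITE = "site outage"


def responders_for(impact, on_call, team):
    """Pick who to notify for a given impact level."""
    if impact is Impact.SITE:
        return list(team)  # whole team
    if impact is Impact.PARTIAL:
        return [on_call]  # one person
    return []  # no outage, no one is paged


team = ["alice", "bob", "carol"]  # hypothetical team roster

print(responders_for(Impact.PARTIAL, "alice", team))
```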

•  Alerts need to notify the right people
    •  Minimize the number of people
    •  Avoid “Spray and Pray”

Data Time

[Chart: Top 25 Customers, People Per Incident]

How much sleep per night?
About 6 hours
https://github.com/etsy/opsweekly

Resolution

•  You have a problem
•  You know the impact
•  You have the people
•  Now fix it

•  Get people to the right place
    •  Dedicated Chat Channel
    •  Dedicated Conference Bridge

Chat, Graphs, Bots

•  Chat Rooms
    •  Treat it like an event ledger
    •  All actions go into Chat
    •  All decisions go into Chat

•  Conference Bridges
    •  Incident Commander (IC)
        •  Good at process
        •  Facilitates communication
        •  NOT a resolver

•  Incident Commander
    •  Use timed check-ins
    •  Use consistent language
        •  “Any strong objections?”
        •  “Give me a green/red status”

•  Incident Deputy
    •  Acts as Scribe
    •  Supports the IC
    •  External Communication

•  External Communication
    •  Within the Company
    •  To Customers

When to stop?
When the customer impact is no longer present

Post Mortem

•  Blameless
•  Consistent
•  Easy to find
•  Trusted

•  Blameless
    •  Assignment
    •  Language
    •  Tone

•  Consistent
    •  Wiki Templates
    •  Event Timeline (use Chat!)
    •  People involved
    •  Customer impact
    •  Root Cause
    •  Resolution
    •  Follow up items

•  Easy to Find
    •  Single Wiki Doc
    •  Single Folder
    •  Be as public as possible

•  Trusted
    •  People trust the process
    •  People are not cynical
    •  Make sure it is lightweight

Data Time
How many teams actually resolve their root causes?
22% get resolved in a month

Practice

•  Failure Friday
    •  Practice IC Role
    •  Practice Deputy Role
    •  Rotate

•  Incident Commander Training
    •  Ramp up new IC’s
    •  Spread knowledge throughout company
    •  Listen to old calls

•  Incident Response Lead
    •  Reviews Post Mortems
    •  Participates in outages
    •  Updates Outage Processes

TL;DR
•  Incident Response is Hard
•  PagerDuty has been practicing
•  Better tooling is making this easier

Thank you. We are Hiring pagerduty.com/jobs @arupchak

Post-Mortem Template
•  Overview – Brief description of the outage
•  Root Cause and Resolution – What steps were taken to mitigate the impact
•  Customer Impact – Measurable impact (e.g. 50% of customers could not log in)
•  Responders – Who was involved
•  Timeline – Timeline of events
•  What went well / What did not go well – Retrospective pieces
•  Action Items/Follow-up tasks – Link to bug tracker with all of the follow-up items
•  Internal Communication – Email communication that was sent to the rest of the company
•  External Communication – External communication that was sent to customers
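
One way to keep post-mortems consistent is to treat the template as a structure and check which sections are still empty. The sketch below assumes a hypothetical `PostMortem` class whose fields mirror the template sections; the class, its field names, and the `missing_sections` helper are illustrative only and do not describe PagerDuty's actual tooling.

```python
from dataclasses import dataclass, field, fields


@dataclass
class PostMortem:
    # One field per template section; all names are hypothetical.
    overview: str = ""
    root_cause_and_resolution: str = ""
    customer_impact: str = ""
    responders: list = field(default_factory=list)
    timeline: list = field(default_factory=list)
    what_went_well: str = ""
    what_did_not_go_well: str = ""
    action_items: list = field(default_factory=list)
    internal_communication: str = ""
    external_communication: str = ""

    def missing_sections(self):
        """Names of sections that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Hypothetical draft, invented for illustration only.
draft = PostMortem(
    overview="Login outage",
    customer_impact="50% of customers could not log in",
)

print(draft.missing_sections())
```

A check like this stays lightweight: it only tells the author which sections remain, without enforcing what goes in them.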
