Incident Response Patterns: What we have learned at PagerDuty

This was a talk that I gave at Agile Conf 2015: http://sched.co/36Rc

Arup Chakrabarti

August 04, 2015

Transcript

  1. Incident Response Patterns: What we have learned at PagerDuty
    @arupchak #Agile2015 Arup Chakrabarti
  2. Who is this guy?
    • PagerDuty
    • Netflix
    • Amazon
    • A bunch of other stuff
  3. Quick Disclaimer: I did not come up with everything
    • I work with smart people
    • Slides will be posted
  4. What is PagerDuty?
  5. Why are we well positioned?
    Oct 2014 US East Outage – Outgoing Traffic
  6. We get a lot of Incident Data
  7. What is Incident Response?
  8. What is Incident Response?
    Ability to react to events in a methodical and organized way
  9. DO NOT BE THIS GUY
  10. Why does Incident Response Matter?
  11. Why does Incident Response Matter?
    • Expensive
    • From devops.com
    • For Fortune 1000
    • $100,000 per hour for infra
    • $1mil per hour for app
  12. Why does Incident Response Matter?
    • Customer Confidence
    • Long outages
    • Upset users
  13. Why does Incident Response Matter?
    • Unhappy Engineers
    • Outages are bad enough
    • Disorganized outages are even worse
  14. Current State of The World
    • Large Customers
    • Large % of Engineers using PagerDuty
    • Small Customers
    • Small % of Engineers using PagerDuty
    • Using PagerDuty for >2 yrs
    • Grain of Salt
  15. Current State of The World
    Largest 25 Customers: Mean Time To Resolution (Hours)
    2010: 14.7 • 2011: 8.9 • 2012: 2.7 • 2013: 2.6 • 2014: 2.3 • 2015: 2.1
  16. Current State of The World
    250 Smaller Customers: MTTR (Hours)
    2010: 11.2 • 2011: 14.4 • 2012: 20.1 • 2013: 8.3 • 2014: 4.7 • 2015: 4.2
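
The two slides above report Mean Time To Resolution (MTTR) in hours. The talk does not say how PagerDuty computes it, so the following is only a minimal sketch under one common assumption: MTTR is the average time between an incident being triggered and being resolved, with unresolved incidents skipped. The function name and the sample data are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_time_to_resolution_hours(incidents):
    """Average hours from trigger to resolution.

    `incidents` is an iterable of (triggered_at, resolved_at) pairs.
    Incidents with resolved_at=None are still open and are skipped
    (an assumption of this sketch, not something stated in the talk).
    """
    durations = [
        (resolved_at - triggered_at).total_seconds() / 3600
        for triggered_at, resolved_at in incidents
        if resolved_at is not None
    ]
    if not durations:
        raise ValueError("no resolved incidents to average")
    return mean(durations)


# Hypothetical sample data: two resolved incidents and one still open.
start = datetime(2015, 1, 1, 9, 0)
sample = [
    (start, start + timedelta(hours=1)),
    (start, start + timedelta(hours=3)),
    (start, None),
]
print(mean_time_to_resolution_hours(sample))  # 2.0
```

A mean is sensitive to a few very long incidents, so a median or percentile may be worth tracking alongside it.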
  17. What does this mean?
  18. What does this mean?
    As an industry, we are getting better at Incident Response!
  19. Are we done then?
  20. Definitions for Rest of Talk
    • Phases of an Incident
    • Detection
    • Triage
    • Notification
    • Resolution
    • Post Mortem
  21. Detection
  22. Detection
    • Best case – Self Detection
    • Find out about problems before customers do
    • Worst case – User Detection
    • Customers calling and yelling
  23. Detection
    We use a lot of tools
    • At PagerDuty
    • DataDog
    • SumoLogic
    • Wormly
    • New Relic
    • Air Brake
    • Monitis
    • Crashlytics
    • In house scripts
    • …
  24. Detection
    • At PagerDuty
    • Try to use the right tool for the right detection type
    • e.g. Time Series Data vs. Event Data
  25. Detection
    • Time Series Data
    • High sampling
    • “How many logins in the last hour?”
    • Event Data
    • “Did my cron last night fail?”
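
The two questions on the slide above map to two different kinds of check. As a rough illustration only (the function names and record shapes are hypothetical and do not come from any of the tools listed in the talk), a time series check counts samples in a sliding window, while an event check looks at the outcome of the most recent discrete run:

```python
from datetime import datetime, timedelta


def logins_in_last_hour(login_times, now):
    """Time series style question: count samples inside a one-hour window."""
    window_start = now - timedelta(hours=1)
    return sum(1 for t in login_times if window_start < t <= now)


def cron_failed(job_events, job_name):
    """Event style question: did the most recent run of the job fail?

    Each event is a dict with "job", "finished_at" and "exit_code" keys
    (a made-up shape for this sketch). A job that never reported at all
    is treated as failed, since a missing event is also a problem.
    """
    runs = [e for e in job_events if e["job"] == job_name]
    if not runs:
        return True
    latest = max(runs, key=lambda e: e["finished_at"])
    return latest["exit_code"] != 0
```

The time series check needs frequent samples to be meaningful, whereas the event check only needs one record per run, which is one reason a single tool rarely suits both.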
  26. Detection
    • Server Side Data
    • Throwing 500s
    • Client Side Data
    • Cannot resolve DNS
  27. Triage
  28. Triage
    • Figure out customer impact
    • Keyword: Customer
    • Sliding scale, not binary
    • Definitions are well understood
  29. Triage
    • Bad Definition of Impact
    • DB Server high CPU
    • Good Definition of Impact
    • Customer cannot checkout
  30. Triage
    • Build Alerts that assess Business Impact
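
The triage slides above contrast a resource symptom (high CPU on a DB server) with a customer-facing one (customers cannot check out) and ask for a sliding scale rather than a binary answer. One way this could look in practice, assuming a checkout success rate is available as a metric, is sketched below. The thresholds, labels and function name are illustrative assumptions, not values from the talk.

```python
def assess_checkout_impact(attempts, successes,
                           partial_below=0.99, outage_below=0.50):
    """Map checkout success rate to an impact level on a sliding scale.

    The thresholds are hypothetical defaults; a real team would agree on
    its own definitions so that they are well understood by everyone.
    """
    if attempts == 0:
        # No traffic means no evidence either way.
        return "unknown"
    success_rate = successes / attempts
    if success_rate < outage_below:
        return "outage"
    if success_rate < partial_below:
        return "partial"
    return "none"


print(assess_checkout_impact(1000, 900))  # partial
```

The input is something a customer would notice, so the alert answers the triage question directly instead of leaving responders to infer impact from CPU graphs.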
  31. Triage
    • Make Dashboards Fast and Accessible
    • Easier Decision Making
  32. Notification
  33. Notification
    • Impact should reflect response
    • Site Outage => Whole team
    • Partial Outage => 1 Person
    • No Outage => No one
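
The mapping on the slide above is simple enough to write down as a routing rule. The sketch below is not how PagerDuty implements escalation; the impact labels, the function name and the sample names are assumptions made for illustration.

```python
def responders_for(impact, team, on_call):
    """Return who to notify for a given impact level.

    Mirrors the slide: site outage => whole team, partial outage => one
    person, no outage => no one. Labels are hypothetical.
    """
    if impact == "site_outage":
        return list(team)
    if impact == "partial_outage":
        return [on_call]
    if impact == "no_outage":
        return []
    raise ValueError(f"unknown impact level: {impact!r}")


team = ["alice", "bob", "carol"]  # hypothetical team
print(responders_for("partial_outage", team, on_call="bob"))  # ['bob']
```

Raising on an unknown label, rather than defaulting to paging everyone, keeps a misconfigured alert from turning into "Spray and Pray".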
  34. Notification
    • Alerts need to notify the right people
    • Minimize the number of people
    • Avoid “Spray and Pray”
  35. Notification
    Data Time
  36. Notification
    Top 25 Customers: People Per Incident
  37. Notification
    How much sleep per night?
  38. Notification
    How much sleep per night? About 6 hours
    https://github.com/etsy/opsweekly
  39. Resolution
  40. Resolution
    • You have a problem
    • You know the impact
    • You have the people
    • Now fix it
  41. Resolution
    • Get people to the right place
    • Dedicated Chat Channel
    • Dedicated Conference Bridge
  42. Resolution
    • Chat
    • Graphs
    • Bots
  43. Resolution
    • Chat Rooms
    • Treat it like an event ledger
    • All actions go into Chat
    • All decisions go into Chat
  44. Resolution
    • Conference Bridges
    • Incident Commander (IC)
    • Good at process
    • Facilitates communication
    • NOT a resolver
  45. Resolution
    • Incident Commander
    • Use timed check-ins
    • Use consistent language
    • “Any strong objections?”
    • “Give me a green/red status”
  46. Resolution
    • Incident Deputy
    • Acts as Scribe
    • Support the IC
    • External Communication
  47. Resolution
    • External Communication
    • Within the Company
    • To Customers
  48. Resolution
  49. Resolution
    When to stop?
  50. Resolution
    When to stop? When the customer impact is no longer present
  51. Post Mortem
  52. Post Mortem
    • Blameless
    • Consistent
    • Easy to find
    • Trusted
  53. Post Mortem
    • Blameless
    • Assignment
    • Language
    • Tone
  54. Post Mortem
    • Consistent
    • Wiki Templates
    • Event Timeline (use Chat!)
    • People involved
    • Customer impact
    • Root Cause
    • Resolution
    • Follow up items
  55. Post Mortem
    • Easy to Find
    • Single Wiki Doc
    • Single Folder
    • Be as public as possible
  56. Post Mortem
    • Trusted
    • People trust the process
    • People are not cynical
    • Make sure it is lightweight
  57. Post Mortem
    Data Time
  58. Post Mortem
    Data Time: How many teams actually resolve their root causes?
  59. Post Mortem
    Data Time: How many teams actually resolve their root causes? 22% get resolved in a month
  60. Practice
  61. Practice
    • Failure Friday
    • Practice IC Role
    • Practice Deputy Role
    • Rotate
  62. Practice
    • Incident Commander Training
    • Ramp up new ICs
    • Spread knowledge throughout company
    • Listen to old calls
  63. Practice
    • Incident Response Lead
    • Reviews Post Mortems
    • Participates in outages
    • Updates Outage Processes
  64. TL;DR
    • Incident Response is Hard
  65. TL;DR
    • Incident Response is Hard
    • PagerDuty has been practicing
  66. TL;DR
    • Incident Response is Hard
    • PagerDuty has been practicing
    • Better tooling is making this easier
  67. Thank you. We are Hiring pagerduty.com/jobs @arupchak

  68. Post-Mortem Template
    • Overview – Brief description of the outage
    • Root Cause and Resolution – What steps were taken to mitigate the impact
    • Customer Impact – Measurable impact (e.g. 50% of customers could not log in)
    • Responders – Who was involved
    • Timeline – Timeline of events
    • What went well / What did not go well – Retrospective pieces
    • Action Items/Follow-up tasks – Link to bug tracker with all of the follow-up items
    • Internal Communication – Email communication that was sent to the rest of the company
    • External Communication – External communication that was sent to customers
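
The talk asks for post-mortems that are consistent and lightweight. One way to keep them consistent is to generate each document from the section list on the template slide and to flag sections that are still empty. The sketch below assumes plain-text output; the function names and the "TODO" placeholder are hypothetical, and only the section titles come from the slide.

```python
# Section titles taken from the Post-Mortem Template slide.
SECTIONS = [
    "Overview",
    "Root Cause and Resolution",
    "Customer Impact",
    "Responders",
    "Timeline",
    "What went well / What did not go well",
    "Action Items/Follow-up tasks",
    "Internal Communication",
    "External Communication",
]


def render_post_mortem(title, content):
    """Render a plain-text post-mortem with every section in a fixed order.

    `content` maps section titles to text; missing sections get "TODO".
    """
    lines = [title, ""]
    for section in SECTIONS:
        lines.append(section)
        lines.append(content.get(section, "TODO"))
        lines.append("")
    return "\n".join(lines)


def missing_sections(content):
    """List the sections that are absent or blank, in template order."""
    return [s for s in SECTIONS if not content.get(s, "").strip()]


draft = {"Overview": "Login service returned errors for 20 minutes."}
print(missing_sections(draft))
```

A check like `missing_sections` could run before a post-mortem is marked as done, which keeps the process lightweight while still enforcing the same structure every time.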