Incident Response Patterns: What we have learned at PagerDuty

For Fortune 1000 (from devops.com)
• $100,000 per hour for infrastructure
• $1 million per hour for applications
• Large customers
• Large % of engineers using PagerDuty
• Small customers
• Small % of engineers using PagerDuty
• Using PagerDuty for >2 years
• Grain of salt
Largest 25 Customers: Mean Time To Resolution (Hours)
[Chart, values by year] 2010: 14.7 • 2011: 8.9 • 2012: 2.7 • 2013: 2.6 • 2014: 2.3 • 2015: 2.1
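As a rough illustration of the metric on this slide, mean time to resolution can be computed as the average gap between when an incident is triggered and when it is resolved. The sketch below is not PagerDuty's implementation: the function names `mttr_hours` and `percent_reduction` and the sample incidents are hypothetical. Only the figures 14.7 and 2.1 come from the chart, taken as its first and last values.

```python
from datetime import datetime, timedelta


def mttr_hours(incidents):
    """Mean time to resolution, in hours.

    `incidents` is a list of (triggered_at, resolved_at) datetime pairs.
    This input shape is an assumption made for the example.
    """
    if not incidents:
        raise ValueError("need at least one resolved incident")
    total = sum(
        (resolved - triggered for triggered, resolved in incidents),
        timedelta(),
    )
    return total.total_seconds() / 3600 / len(incidents)


def percent_reduction(before, after):
    """Relative drop from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


# Hypothetical incidents, not PagerDuty data.
start = datetime(2015, 1, 1, 9, 0)
sample = [
    (start, start + timedelta(hours=1)),
    (start, start + timedelta(hours=2)),
    (start, start + timedelta(hours=3, minutes=30)),
]
sample_mttr = mttr_hours(sample)  # (1 + 2 + 3.5) / 3 hours

# First and last values shown on the chart.
chart_drop = percent_reduction(14.7, 2.1)
print(f"sample MTTR: {sample_mttr:.2f} h, chart drop: {chart_drop:.1f}%")
```

Because this is a mean, one very long incident can dominate the result, which is one reason to read the chart with some caution.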
Current State of The World
[Chart: 250 Smaller Customers, MTTR (Hours), 2010 to 2015]
We use a lot of tools
• At PagerDuty
• Datadog
• Sumo Logic
• Wormly
• New Relic
• Airbrake
• Monitis
• Crashlytics
• In-house scripts
• …
• Overview – Brief description of the outage
• Root Cause and Resolution – What steps were taken to mitigate the impact
• Customer Impact – Measurable impact (e.g. 50% of customers could not log in)
• Responders – Who was involved
• Timeline – Timeline of events
• What went well / What did not go well – Retrospective pieces
• Action Items/Follow-up tasks – Link to bug tracker with all of the follow-up items
• Internal Communication – Email communication that was sent to the rest of the company
• External Communication – External communication that was sent to customers
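One way to keep write-ups consistent with the sections listed above is to model them as a small record and check which sections are still empty before the document is circulated. This is a minimal sketch and not a PagerDuty tool: the class name `PostMortem`, its field names, and the `missing_sections` helper are all hypothetical, chosen to mirror the bullet list.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class PostMortem:
    """Hypothetical record with one field per section in the list above."""
    overview: str = ""
    root_cause_and_resolution: str = ""
    customer_impact: str = ""
    responders: List[str] = field(default_factory=list)
    timeline: List[str] = field(default_factory=list)
    went_well: List[str] = field(default_factory=list)
    did_not_go_well: List[str] = field(default_factory=list)
    action_items: List[str] = field(default_factory=list)  # e.g. bug tracker links
    internal_communication: str = ""
    external_communication: str = ""

    def missing_sections(self) -> List[str]:
        """Names of sections that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Hypothetical draft with only two sections filled in.
draft = PostMortem(
    overview="Login service outage",
    responders=["on-call engineer"],
)
missing = draft.missing_sections()
print(missing)
```

Keeping the went-well and did-not-go-well notes as separate fields makes it easy to see when a retrospective only records one side.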