Incident Response and the Incident Complexity Framework

A look at how Simple (https://simple.com) performs incident response using its Incident Complexity Framework.

Curt Micol

June 15, 2015

Transcript

  1. Important Questions: How do we proceed?
    • What is wrong?
    • Who is impacted?
    • When do we communicate to a wider audience?
    • Is the issue resolved?
  2. Success! Built a lot of confidence in our incident response capabilities. If you want to read more, check out the Heroku blog.
  3. What did we learn? More than fits on this slide
    • Zero confidence in our new system due to timeline
    • Our response was almost solely focused on the engineering side of the company
    • Our communication during the event reached only a small number of employees
    • We didn’t have clearly defined methods for answering the important questions
  4. Not So Simple Product: Many features to consider
    • We have partners
    • We have more than one product:
      • Activity
      • Instant (money transfer)
      • Goals
      • And more…
    • Customers interact via web, mobile, and ATMs
    • Requirements: Risk Management and Security
  5. Not So Simple Disruptions: What happens when things go wrong?
    • ACH Transfers
    • Check Review
    • Card transactions
    • ATM transactions
    • Direct Deposits
    • Mobile, Web
    • Onboarding
  6. What do we want?
    • Any system MUST NOT be focused solely on engineering
    • Any system MUST NOT be a burden on responders (or the company) to implement and utilize
    • Any system MUST be usable, and in many cases managed, by teams across the company
    • Any system MUST increase confidence in the response to the incident
    • Any system MUST have a procedure for a dynamic team built to handle any severity of incident
  7. Incident Complexity: How do we determine general impact?
    • How many teams are impacted?
    • Is there an immediate impact to internal customers?
    • Is there an immediate impact to external customers?
    • Is there an immediate impact to our partners?
  8. Incident Complexity Framework: Our method for determining the base-level response for any disruption.
  9. Complexity Levels
    • Five levels
    • Each complexity defines the expectations around response and resolution
    • Clear procedures for communicating to both internal and external customers
    • Determines assignments in the incident organization
    • Organization roles enable a feedback loop
    • As complexity increases, these expectations may include post-incident procedures
  10. Incident Complexity: Three Properties
    • Incident complexity can never decrease
    • Incident complexities require an owner
    • Incident complexities are globally recognized
  11. Incident Command Organization
    • Command:
      • Responsible for management of the incident and response team.
      • Lead: Incident Commander
    • Communications:
      • Responsible for communications to and from internal and external customers.
      • Branches: Marketing, Customer Relations
      • Lead: Incident Signaller
    • Operations:
      • Responsible for the work to resolve the incident
      • Branches: Backend, Frontend, Integration, Infra
      • Lead: Incident Engineer
  12. Check Incident: It might work like this…
    • Checks team finds a failure
    • Sent to Customer Relations Technical Team (CR Tech)
    • CR Tech creates an incident issue
    • Initial complexity assigned here (best guess)
    • Notifies Infrastructure Engineering
    • Roles assigned: IC, IS (standby), IE
    • Work to resolve issue, and potential escalation
  13. Something More Impactful? A Look at a Larger Incident
    • ACH Memo file arrived, but isn’t processing
    • First responder assigns complexity, creates issue
    • Integration trained as Level 4 Incident Commanders
    • Integration brings on an engineer (alert or IRC)
    • Integration assigns an IS
    • IS begins work on communications
      • Keeps track of customer contact
      • Messaging for Status and CR reps
    • Incident Organization iterates on issue/comms
  14. Incidents Are Common: 156 identified incidents since December 2014
    • Level 2 & Level 4 most common
    • Three Level 5s: 2 planned, 1 unplanned
    • One week in April we identified 11 incidents:
      • Including 4 Level 3s, 3 Level 4s
      • Participation from 11 teams
      • One Level 4 spanned 8 teams
  15. Identify Impact: Least severe, most severe
    • Find your most severe disruption:
      • What procedures do you want completed?
      • How do you want to communicate?
      • What is the impact? Who’s affected?
    • Now do the same for least severe
    • Find the commonalities
    • Fill in the gaps (Level 2 through Level 4)
    • Train people to handle coordination and communication
    • Find what works, but avoid making response a burden
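
The three properties on slide 10 and the command organization on slide 11 can be pictured as a small data model. What follows is only a minimal sketch in Python, not Simple's tooling: the names `Section`, `ORGANIZATION`, `Incident`, `escalate`, `assign`, and `unfilled_roles` are hypothetical. It also assumes that levels run from 1 to 5 with a higher number meaning higher complexity, and that "require an owner" means every incident carries a named owner; the slides do not spell out either point.

```python
from dataclasses import dataclass, field
from typing import Optional

# Slide 9 states there are five levels. Assumption: they are numbered 1..5
# and a higher number means a more complex incident.
MIN_LEVEL, MAX_LEVEL = 1, 5


@dataclass(frozen=True)
class Section:
    """One section of the incident command organization (slide 11)."""
    name: str
    lead_role: str
    branches: tuple = ()


ORGANIZATION = (
    Section("Command", "Incident Commander"),
    Section("Communications", "Incident Signaller",
            ("Marketing", "Customer Relations")),
    Section("Operations", "Incident Engineer",
            ("Backend", "Frontend", "Integration", "Infra")),
)


def _check_level(level: int) -> None:
    if not MIN_LEVEL <= level <= MAX_LEVEL:
        raise ValueError(f"complexity must be {MIN_LEVEL}..{MAX_LEVEL}")


@dataclass
class Incident:
    """Hypothetical incident record enforcing the slide 10 properties."""
    title: str
    owner: str
    complexity: int
    roles: dict = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Property: incident complexities require an owner.
        if not self.owner:
            raise ValueError("an incident requires an owner")
        _check_level(self.complexity)

    def escalate(self, new_level: int, owner: Optional[str] = None) -> None:
        """Raise the complexity; optionally hand the incident to a new owner."""
        _check_level(new_level)
        # Property: incident complexity can never decrease.
        if new_level < self.complexity:
            raise ValueError("incident complexity can never decrease")
        self.complexity = new_level
        if owner:
            self.owner = owner

    def assign(self, lead_role: str, person: str) -> None:
        """Assign a person to one of the three lead roles."""
        if lead_role not in {s.lead_role for s in ORGANIZATION}:
            raise ValueError(f"unknown role: {lead_role}")
        self.roles[lead_role] = person

    def unfilled_roles(self) -> list:
        """Lead roles that nobody holds yet, in organization order."""
        return [s.lead_role for s in ORGANIZATION
                if s.lead_role not in self.roles]
```

Under those assumptions, an incident opened at Level 2 with `Incident("Check failure", owner="CR Tech", complexity=2)` can be moved up with `escalate(4)`, while a later `escalate(3)` raises `ValueError`. Keeping the level check inside the record means a best-guess initial complexity can be corrected upward but never quietly lowered.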