AllDayDevOps: What the NTSB teaches us about incident management & postmortems

Michael
October 17, 2018

The National Transportation Safety Board (NTSB) is one of the most widely known government bodies in the world. Its role is to respond to an incident, secure the scene, and understand everything that happened. Given the important and unpredictable nature of its work, the NTSB has an extensive manual that sets out how incidents should be attended and how the investigation should progress.

This session will detail how the NTSB’s approach to its work, and the procedure that drives it, are transferable to us as incident responders. We’ll talk about the NTSB’s pre-incident preparation, incident notification, attending the scene, collecting information from the field, writing up a report, and holding hearings. Throughout, we’ll draw parallels to IT incident management and show how to create applicable processes and procedures that mimic those of the NTSB.

Transcript

  1. What the NTSB teaches us about incident management & postmortems

    Michael Kehoe Staff Site Reliability Engineer
  2. Today’s agenda

    1. Introductions 2. Background on the NTSB 3. NTSB: Investigative Process 4. Recommendations & Most Wanted List 5. How this applies to us? 6. Final thoughts
  3. Michael Kehoe $ WHOAMI • Staff Site Reliability Engineer @

    LinkedIn • Production-SRE Team; • Disaster Recovery • Incident Response • Visibility Engineering • Reliability Principles • Find me online at: • @matrixtek • https://michael-kehoe.io • linkedin.com/in/michaelkkehoe
  4. Production-SRE Team @ LinkedIn $ /USR/BIN/WHOAMI • Disaster Recovery -

    Planning & Automation • Incident Response – Process & Automation • Visibility Engineering – Making use of operational data • Reliability Principles – Defining best practice & automating it
  5. Background on the NTSB JURISDICTION • Aviation • Surface Transportation

    • Marine • Pipeline • Assistance to other agencies/ governments
  6. “The NTSB shall investigate or have investigated and establish the

    facts, circumstances, and cause or probable cause of accidents…” U.S. Code § 1131
  7. “… The Board shall report on the facts and circumstances

    of each accident investigated…The Board shall make each report available to the public at reasonable cost…” U.S. Code § 1131
  8. “The NTSB does not assign fault or blame for an

    accident or incident…accident/incident investigations are fact-finding proceedings with no formal issues and no adverse parties … and are not conducted for the purpose of determining the rights or liabilities of any person.” U.S. Code § 1154
  9. Similar Organizations

    • Italy – Agenzia Nazionale per la Sicurezza del Volo (ANSV) • Canada – Transportation Safety Board of Canada (TSB) • Indonesia – Komite Nasional Keselamatan Transportasi (NTSC) • Netherlands – Dutch Safety Board (DSB) • Australia – Australian Transport Safety Bureau (ATSB) • United Kingdom – Air Accidents Investigation Branch (AAIB) • Germany – Bundesstelle für Flugunfalluntersuchung (BFU) • France – Bureau d’Enquêtes et d’Analyses pour la Sécurité de l’Aviation Civile (BEA)
  10. NTSB Investigation Process 1. Pre-Investigation Preparation 2. Notification & Initial

    Response 3. On-Scene Activities 4. Post-On-Scene Activities
  11. Pre-Investigation Preparation GO TEAM • Go team: On call investigators

    ready for assignments • Investigator-In-Charge (IIC) pre-assigned • Full Go team may contain several subject matter experts; e.g. ◦ Human performance ◦ Aircraft performance ◦ Air Traffic Control
  12. Pre-Investigation Preparation GO TEAM ROSTER • Oncall roster made available

    internally ◦ Phone & Pager numbers • Updated weekly • All personnel should be able to arrive at an airport within 2 hours of notification ◦ Should have essentials on them if they live far away from an airport • Division Chiefs responsible for testing pagers
  13. Notification & Initial Response REGIONAL RESPONSE 1. Regional office notifies

    headquarters of incident 2. Closest regional office to accident will provide at least one investigator to perform PR & “stakedown”
  14. Notification & Initial Response HEADQUARTERS RESPONSE 1. After incident occurs:

    communication center advises IIC and chief of Major Investigations (who subsequently inform their superiors) 2. OAS director decides whether to launch a Go-Team 3. Other executives are made aware by Chief of Major Investigations
  15. Notification & Initial Response NOTIFICATION & ASSIGNMENTS • Go-Team composition

    determined by incident circumstances • Send more specialists if in doubt
  16. Notification & Initial Response PARTY NOTIFICATION • IIC gives party

    status to organizations that can provide technical assistance (airlines, aircraft manufacturers etc.) • Communication center will help with travel arrangements and on-site administrative support • Go-Team will travel together to accident site
  17. On-Scene Activities COMMAND ROOMS • Have meeting rooms to accommodate

    at least 30 people • Have space for media • Ensure you have equipment in command room ◦ PCs ◦ Telephone systems ◦ Forms • IIC is responsible for managing this
  18. On-Scene Activities COMMAND ROOMS • For Major investigations, Administrative support

    is provided • Government purchase card is available for goods or services
  19. On-Scene Activities ORGANIZATIONAL MEETING • Share preliminary information • Organize

    (assign) participants • Organize observers • Establish lines of authority
  20. “The manner in which the IIC conducts the organizational meeting

    will establish the tone of the investigation. Therefore, the importance of being organized, articulate, assertive, composed, and understanding cannot be overstated” Major Investigations Manual Sec 3.2
  21. On-Scene Activities ACCIDENT SITE SAFETY PRECAUTIONS • Safety officer identifies

    & classifies risks and then develops counter-measures • Safety officer performs daily briefings to accident site team.
  22. On-Scene Activities OBSERVERS • Observers may be allowed if they

    do not have self-interest • May include: ◦ Congressional oversight committee(s) ◦ Military personnel ◦ Foreign Governments ◦ Federal Agencies
  23. On-Scene Activities LINE OF AUTHORITY • IIC is the most

    senior person on-scene and all investigative activity is under his/ her control • If IIC cannot resolve an issue, IIC may talk to Chief of Major Investigations • Ability to escalate further if required
  24. On-Scene Activities PROGRESS MEETINGS • On-site progress meetings are held

    daily to: ◦ Disseminate information obtained ◦ Plan the day’s activities ◦ Discuss plans for subsequent investigative activities • Generally start at 6pm • Plan next day’s meeting
  25. On-Scene Activities DAILY ACTIVITIES OF IIC • Headquarters briefing •

    Safety board staff meeting • Party coordinator meeting • Site visit
  26. NTSB Report Structure

    • Factual Information: gather facts about the incident • Analysis: analyze how the facts contributed to the incident • Conclusions: draw conclusions about what happened • Recommendations: write detailed recommendations • Appendices: extra information
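The report structure on this slide maps naturally onto a postmortem template. The following is a minimal sketch of that idea, not something shown in the talk: the `IncidentReport` class, its field names, and the rendering order are all illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentReport:
    """Hypothetical postmortem skeleton mirroring the NTSB report sections."""

    title: str
    factual_information: list[str] = field(default_factory=list)
    analysis: list[str] = field(default_factory=list)
    conclusions: list[str] = field(default_factory=list)
    recommendations: list[str] = field(default_factory=list)
    appendices: list[str] = field(default_factory=list)  # optional extras

    def missing_sections(self) -> list[str]:
        """Return the names of required sections that are still empty."""
        required = {
            "Factual Information": self.factual_information,
            "Analysis": self.analysis,
            "Conclusions": self.conclusions,
            "Recommendations": self.recommendations,
        }
        return [name for name, items in required.items() if not items]

    def render(self) -> str:
        """Render the report as plain text, one heading per section."""
        sections = [
            ("Factual Information", self.factual_information),
            ("Analysis", self.analysis),
            ("Conclusions", self.conclusions),
            ("Recommendations", self.recommendations),
            ("Appendices", self.appendices),
        ]
        lines = [self.title]
        for name, items in sections:
            lines.append("")
            lines.append(name)
            lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)
```

A review step could refuse to publish a report while `missing_sections()` is non-empty, which keeps facts, analysis, conclusions, and recommendations separate in the way the NTSB structure does.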
  27. Post-On-Scene Activities WORK PLANNING • Discuss activities that will follow

    the on-scene phase of investigation • Build timelines for work • Provides avenues for various teams to work together
  28. Post-On-Scene Activities FACTS & ANALYSIS REPORT • A factual report

    based on the field notes and subsequent investigation activities • Each group chairman shall submit an analysis report based on the information contained in his or her factual report.
  29. Post-On-Scene Activities PUBLIC HEARING • Led by IIC/ Hearing Officer

    • Identify witnesses whose testimony is appropriate • The witnesses may be from the parties to the investigation or can be suggested by one or more of the parties. • Purpose: To ensure all relevant information is gathered before writing the report
  30. Post-On-Scene Activities TECHNICAL REVIEW • Provides an additional opportunity for

    all parties to review all factual information • Ensures all issues are resolved • Technical Review is held as soon as possible after public hearing
  31. Post-On-Scene Activities PREPARATION OF FINAL REPORT • Dedicated department to

    help write report • Follows a standard template ◦ Annex 13 to the International Civil Aviation Organization (ICAO) • Contains formal recommendations to manufacturers/ transportation authorities
  32. Recommendations & Most Wanted List • NTSB advocates for particular

    action items based on report(s): ◦ Generally directed towards Transport bodies/ manufacturers • NTSB publicly tracks response of the responsible body https://www.ntsb.gov/safety/mwl/Pages/default.aspx
  33. Applying this to operations PRE-INCIDENT PREPARATION • Have an Incident

    commander pre-assigned • Publish on-call schedules ◦ Manager is responsible • Test on-call pagers regularly • Ensure that you can respond within SLA • Printed copy of Oncall contact info • DR http://i.imgur.com/wvg8IDq.gif
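The preparation checklist above (published schedule, tested pagers, response within SLA) can be checked automatically. The sketch below is one hedged way to do that. The `OnCallEntry` fields, the 15-minute SLA, and the 7-day pager-test window are assumptions chosen for illustration, not values from the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class OnCallEntry:
    """Hypothetical roster row; field names are illustrative."""

    name: str
    phone: str
    last_pager_test: datetime
    minutes_to_respond: int  # worst-case time for this person to get online


def readiness_problems(roster, now, sla_minutes=15,
                       max_test_age=timedelta(days=7)):
    """Return human-readable problems found in the on-call roster."""
    problems = []
    if not roster:
        problems.append("no on-call engineer is scheduled")
    for entry in roster:
        # Pagers should be tested regularly; flag stale tests.
        if now - entry.last_pager_test > max_test_age:
            problems.append(
                f"{entry.name}: pager not tested in the last "
                f"{max_test_age.days} days"
            )
        # Everyone on the roster must be able to respond within the SLA.
        if entry.minutes_to_respond > sla_minutes:
            problems.append(
                f"{entry.name}: cannot respond within {sla_minutes} minutes"
            )
    return problems
```

Running a check like this on a schedule, and sending the result to the manager who owns the roster, mirrors the Division Chiefs' responsibility for pager testing.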
  34. Applying this to operations NOTIFICATION & INITIAL RESPONSE • NOC/

    SiteOps teams notify incident commander + manager ◦ Prod-SRE gets engaged • Prod-SRE Manager/Oncall ◦ Access, Engage, Notify, Mitigate https://docs.microsoft.com/en-us/windows/uwp/design/shell/tiles-and-notifications/images/toast-mirroring.gif
  35. Applying this to operations NOTIFICATION & INITIAL RESPONSE • Once

    verified, we launch full response for Major Incident • Incident commander gives “party status” to observers • Manager informs executives & PR ◦ Periodic updates • Mitigate http://www.roadrunneremaillogin.com/wp-content/uploads/2018/06/RoadRunner-Email.jpg
  36. Applying this to operations ON-SCENE ACTIVITIES • Private + Public

    slack work-channels • IC is empowered to make decisions • Organizational call to ensure: ◦ Problem is understood ◦ Area of investigations assigned http://www.gpla.com/static/img/projects/ubisofts-e3-social-media-war-room/war-room.gif
  37. Applying this to operations ON-SCENE ACTIVITIES • War room ◦

    Incident commander drives the war- room ◦ Roles & responsibilities assigned to each “party” ◦ Communication at regular cadence to execs ◦ Admin ensures supplies and food • Gathering data and updating timeline doc http://www.gpla.com/static/img/projects/ubisofts-e3-social-media-war-room/war-room.gif
  38. Applying this to operations POST ON-SCENE ACTIVITIES • Post mortem

    ◦ Dedicated team ◦ PM Template ◦ Blameless • “Postmortem rollup” ◦ Action items are prioritized ◦ Weekly reporting on status of action- items https://www.economist.com/sites/default/files/imagecache/1280-width/20180414_OFP021.gif
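A "postmortem rollup" with prioritized action items and weekly status reporting could be sketched as follows. The `ActionItem` fields, the priority scale (1 is most urgent), and the status values are assumptions for illustration; the talk does not specify a data model.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ActionItem:
    """Hypothetical postmortem action item."""

    incident_id: str
    description: str
    owner: str
    priority: int  # assumed scale: 1 = most urgent
    status: str    # assumed values: "open", "in_progress", "done"


def weekly_rollup(items):
    """Return (status counts, unfinished items sorted by priority, owner)."""
    counts = Counter(item.status for item in items)
    outstanding = sorted(
        (item for item in items if item.status != "done"),
        key=lambda item: (item.priority, item.owner),
    )
    return counts, outstanding
```

The counts give the headline numbers for a weekly report, and the sorted list shows which owners hold the most urgent unfinished work.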
  39. Applying this to operations MOST WANTED LIST • Use the

    post-incident process to improve and hold people accountable for action items • Keep track of recurring issues/ repeaters https://clip2art.com/images/meeting-clipart-animated-gif-2.gif
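Tracking "repeaters" amounts to grouping incidents by cause and flagging any cause seen more than once. Here is a minimal sketch, assuming each postmortem records a free-form cause tag; the tag names and the `find_repeaters` helper are hypothetical.

```python
from collections import defaultdict


def find_repeaters(incidents, min_occurrences=2):
    """Group (incident_id, cause_tag) pairs and keep recurring causes.

    Returns a dict mapping each cause tag that appears at least
    `min_occurrences` times to the list of incident IDs that share it.
    """
    by_cause = defaultdict(list)
    for incident_id, cause in incidents:
        by_cause[cause].append(incident_id)
    return {
        cause: ids
        for cause, ids in by_cause.items()
        if len(ids) >= min_occurrences
    }


# Example input: two incidents share the "expired-cert" cause tag.
repeaters = find_repeaters([
    ("INC-1", "expired-cert"),
    ("INC-2", "bad-deploy"),
    ("INC-3", "expired-cert"),
])
```

Causes that show up here are natural candidates for an internal "most wanted" list, since they indicate action items that were never completed or did not address the underlying problem.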
  40. Final Thoughts

    • NTSB Investigative Process: a complete incident + postmortem process • Invest: the more you put in, the more you’ll get out • Accountability: accountability for improvements/action items