What the NTSB teaches us about incident management & postmortems

Michael
September 04, 2018

The National Transportation Safety Board (NTSB) is one of the most widely known government bodies in the world. Its role is to respond to an incident, secure the scene, and understand everything that happened. Given the important and unpredictable nature of its work, it has an extensive manual that sets out how incidents should be attended and how the investigation should progress.

This session details how the NTSB’s approach to its work, and the procedure that drives it, transfers to us as incident responders. We’ll cover the NTSB’s pre-incident preparation, incident notification, on-scene response, collection of information from the field, report writing, and hearings. Throughout, we’ll draw parallels to IT incident management and show how to create processes and procedures that mirror those of the NTSB.

Transcript

  1. What the NTSB teaches us about
    incident management & postmortems
    ​Jeff Weiner
    ​Chief Executive Officer
    ​Michael Kehoe
    ​Staff Site Reliability Engineer
    ​Nina Mushiana
    ​Sr Site Reliability Manager

  2. Agenda and Vision

  3. Today’s
    agenda
    1 Introductions
    2 Background on the NTSB
    3 NTSB: Investigative Process
    4 Recommendations & Most Wanted List
    5 How does this apply to us?
    6 Final thoughts

  4. Michael Kehoe
    ​$ /USR/BIN/WHOAMI
    ● Staff Site Reliability Engineer @ LinkedIn
    ● Production-SRE Team
    ● Funny accent = Australian + 4 years American

  5. Nina Mushiana
    ​$ /USR/BIN/WHOAMI
    ● Sr Site Reliability Engineer Manager @ LinkedIn
    ● Production-SRE Team & Site-Ops

  6. Production-SRE Team @ LinkedIn
    ​$ /USR/BIN/WHOAMI
    ● Disaster Recovery - Planning & Automation
    ● Incident Response – Process & Automation
    ● Visibility Engineering – Making use of
    operational data
    ● Reliability Principles – Defining best practice &
    automating it

  7. Incident Command System (ICS)
    https://training.fema.gov/emiweb/is/icsresource/assets/reviewmaterials.pdf

  8. Background on the NTSB

  9. Background on the NTSB
    ​JURISDICTION
    ● Aviation
    ● Surface Transportation
    ● Marine
    ● Pipeline
    ● Assistance to other agencies/ governments

  10. “The NTSB shall investigate or have investigated and
    establish the facts, circumstances, and cause or
    probable cause of accidents…”
    49 U.S. Code § 1131

  11. “… The Board shall report on the facts and
    circumstances of each accident investigated…The
    Board shall make each report available to the public
    at reasonable cost…”
    49 U.S. Code § 1131

  12. “The NTSB does not assign fault or blame for an
    accident or incident…accident/incident
    investigations are fact-finding proceedings with no
    formal issues and no adverse parties … and are not
    conducted for the purpose of determining the rights
    or liabilities of any person.”
    49 C.F.R. § 831.4

  13. Similar Organizations
    ● Italy – Agenzia nazionale per la Sicurezza del Volo (ANSV)
    ● Canada – Transportation Safety Board of Canada (TSB)
    ● Indonesia – Komite Nasional Keselamatan Transportasi (NTSC)
    ● Netherlands – Dutch Safety Board (DSB)
    ● Australia – Australian Transport Safety Bureau (ATSB)
    ● United Kingdom – Air Accidents Investigation Branch (AAIB)
    ● Germany – Bundesstelle für Flugunfalluntersuchung (BFU)
    ● France – Bureau d’Enquêtes et d’Analyses pour la Sécurité de l’Aviation Civile (BEA)

  14. NTSB Investigation Process

  15. NTSB Investigation Process
    1. Pre-Investigation Preparation
    2. Notification & Initial Response
    3. On-Scene Activities
    4. Post-On-Scene Activities

  16. 1. Pre-Investigation
    Preparation

  17. Pre-Investigation Preparation
    ​GO TEAM
    ● Go team: On call investigators ready for
    assignments
    ● Investigator-in-Charge (IIC) pre-assigned
    ● Full Go team may contain several subject
    matter experts; e.g.
    ○ Human performance
    ○ Aircraft performance
    ○ Air Traffic Control

  18. Pre-Investigation Preparation
    ​GO TEAM ROSTER
    ● On-call roster made available internally
    ○ Phone & Pager numbers
    ● Updated weekly
    ● All personnel should be able to arrive at an
    airport within 2 hours of notification
    ○ Should have essentials on them if they
    live far away from an airport
    ● Division Chiefs responsible for testing pager

  19. 2. Notification & Initial
    Response

  20. Notification & Initial Response
    ​REGIONAL RESPONSE
    1. Regional office notifies headquarters of
    incident
    2. Closest regional office to accident will
    provide at least one investigator to perform
    PR & “stakedown”

  21. Notification & Initial Response
    ​HEADQUARTERS RESPONSE
    1. After incident occurs: communication center
    advises IIC and chief of Major Investigations
    (who subsequently inform their superiors)
    2. Office of Aviation Safety (OAS) director decides whether to
    launch a Go-Team
    3. Other executives are made aware by Chief of
    Major Investigations

  22. Notification & Initial Response
    ​NOTIFICATION & ASSIGNMENTS
    ● Go-Team composition determined by
    incident circumstances
    ● Send more specialists if in doubt
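
A minimal Python sketch of the composition rule above, applied to an IT response team. The trait names, role names, `SPECIALISTS_BY_TRAIT` and `CORE_TEAM` are all hypothetical, not NTSB or LinkedIn terms. Traits that are only suspected are paged as well, which is one way to read "send more specialists if in doubt".

```python
# Hypothetical mapping from incident traits to specialist roles.
SPECIALISTS_BY_TRAIT = {
    "database": ["database-oncall"],
    "network": ["network-oncall"],
    "deploy": ["release-engineering"],
}

# Hypothetical roles that join every response.
CORE_TEAM = ["incident-commander", "production-sre"]


def compose_response_team(traits, unsure_traits=()):
    """Return the ordered, de-duplicated list of roles to page.

    traits: traits known to be involved in the incident.
    unsure_traits: traits that might be involved; they are paged too.
    """
    team = list(CORE_TEAM)
    for trait in list(traits) + list(unsure_traits):
        for role in SPECIALISTS_BY_TRAIT.get(trait, []):
            if role not in team:
                team.append(role)
    return team


print(compose_response_team(["database"], unsure_traits=["network"]))
```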

  23. Notification & Initial Response
    ​PARTY NOTIFICATION
    ● IIC gives party status to organizations that
    can provide technical assistance (airlines,
    aircraft manufacturers etc.)
    ● Communication center will help with travel
    arrangements and on-site administrative
    support
    ● Go-Team will travel together to accident site

  24. 3. On-Scene Activities

  25. On-Scene Activities
    ​COMMAND ROOMS
    ● Have meeting rooms to accommodate at least
    30 people
    ● Have space for media
    ● Ensure you have equipment in command
    room
    ○ PCs
    ○ Telephone systems
    ○ Forms
    ● IIC is responsible for managing this

  26. On-Scene Activities
    ​COMMAND ROOMS
    ● For major investigations, administrative
    support is provided
    ● Government purchase card is available for
    goods or services

  27. On-Scene Activities
    ​ORGANIZATIONAL MEETING
    ● Share preliminary information
    ● Organize (assign) participants
    ● Organize observers
    ● Establish lines of authority

  28. “The manner in which the IIC conducts the
    organizational meeting will establish the tone of the
    investigation. Therefore, the importance of being
    organized, articulate, assertive, composed, and
    understanding cannot be overstated”
    Major Investigations Manual Sec 3.2

  29. On-Scene Activities
    ​ACCIDENT SITE SAFETY PRECAUTIONS
    ● Safety officer identifies & classifies risks and
    then develops counter-measures
    ● Safety officer performs daily briefings to
    accident site team.

  30. On-Scene Activities
    ​OBSERVERS
    ● Observers may be allowed if they do not have
    self-interest
    ● May include:
    ○ Congressional oversight committee(s)
    ○ Military personnel
    ○ Foreign Governments
    ○ Federal Agencies

  31. On-Scene Activities
    ​LINE OF AUTHORITY
    ● IIC is the most senior person on-scene and all
    investigative activity is under his/ her control
    ● If IIC cannot resolve an issue, IIC may talk to
    Chief of Major Investigations
    ● Ability to escalate further if required
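
The line of authority above can be sketched as an ordered escalation chain. This is an illustrative Python sketch only: the role names in `ESCALATION_CHAIN` are hypothetical IT equivalents, not the NTSB's actual titles.

```python
# Hypothetical chain, ordered from the on-scene authority upward.
ESCALATION_CHAIN = ["incident-commander", "sre-manager", "engineering-director"]


def next_escalation(current_role, chain=ESCALATION_CHAIN):
    """Return the role to escalate to, or None if already at the top."""
    index = chain.index(current_role)  # raises ValueError for unknown roles
    if index + 1 < len(chain):
        return chain[index + 1]
    return None


print(next_escalation("incident-commander"))
```

Keeping the chain as data rather than as branching logic makes it easy to publish alongside the on-call roster.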

  32. On-Scene Activities
    ​PROGRESS MEETINGS
    ● On-site progress meetings are held daily to:
    ○ Disseminate information obtained
    ○ Plan the day’s activities
    ○ Discuss plans for subsequent
    investigative activities
    ● Generally start at 6pm
    ● Plan next day’s meeting

  33. On-Scene Activities
    ​DAILY ACTIVITIES OF IIC
    ● Headquarters briefing
    ● Safety board staff meeting
    ● Party coordinator meeting
    ● Site visit

  34. 4. Post-On-Scene Activities

  35. NTSB Report Structure
    ● Factual Information: Gathering facts about the incident
    ● Analysis: Analyze how the facts contributed to the incident
    ● Conclusions: Draw conclusions about what happened
    ● Recommendations: Write detailed recommendations
    ● Appendices: Extra information
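
As a rough illustration, the report sections on this slide can be turned into a reusable postmortem skeleton. This Python sketch assumes the sections run in the order Factual Information, Analysis, Conclusions, Recommendations, Appendices; `REPORT_SECTIONS` and `render_skeleton` are hypothetical names, and the plain-text underline style is just one possible layout.

```python
# Section names and prompts taken from the slide; the order is an assumption.
REPORT_SECTIONS = [
    ("Factual Information", "Facts gathered about the incident"),
    ("Analysis", "How the facts contributed to the incident"),
    ("Conclusions", "What happened"),
    ("Recommendations", "Detailed recommendations"),
    ("Appendices", "Extra information"),
]


def render_skeleton(title):
    """Return a plain-text report skeleton with one heading per section."""
    lines = [title, "=" * len(title), ""]
    for name, prompt in REPORT_SECTIONS:
        lines += [name, "-" * len(name), f"TODO: {prompt}", ""]
    return "\n".join(lines)


print(render_skeleton("Postmortem: example incident"))
```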

  36. Post-On-Scene Activities
    ​WORK PLANNING
    ● Discuss activities that will follow the on-scene
    phase of investigation
    ● Build timelines for work
    ● Provides avenues for various teams to work
    together

  37. Post-On-Scene Activities
    ​FACTS & ANALYSIS REPORT
    ● A factual report based on the field notes and
    subsequent investigation activities
    ● Each group chairman shall submit an analysis
    report based on the information contained in
    his or her factual report.

  38. Post-On-Scene Activities
    ​PUBLIC HEARING
    ● Led by IIC/ Hearing Officer
    ● Identify witnesses whose testimony is
    appropriate
    ● The witnesses may be from the parties to the
    investigation or can be suggested by one or
    more of the parties.
    ● Purpose: To ensure all relevant information is
    gathered before writing the report

  39. Post-On-Scene Activities
    ​TECHNICAL REVIEW
    ● Provides an additional opportunity for all
    parties to review all factual information
    ● Ensures all issues are resolved
    ● Technical Review is held as soon as possible
    after public hearing

  40. Post-On-Scene Activities
    ​PREPARATION OF FINAL REPORT
    ● Dedicated department to help write report
    ● Follows a standard template
    ○ Annex 13 to the Convention on International
    Civil Aviation (ICAO)
    ● Contains formal recommendations to
    manufacturers/ transportation authorities

  41. Recommendations &
    Most Wanted List

  42. Recommendations & Most Wanted List
    ● NTSB advocates for particular action items
    based on report(s):
    ○ Generally directed towards Transport
    bodies/ manufacturers
    ● NTSB publicly tracks response of the
    responsible body
    https://www.ntsb.gov/safety/mwl/Pages/default.aspx

  43. How does this relate to all of us?

  44. 1. Pre-Investigation
    Preparation

  45. Applying this to operations
    ​PRE-INCIDENT PREPARATION
    ● Have an Incident commander pre-assigned
    ● Publish on-call schedules
    ○ Manager is responsible
    ● Test on-call pagers regularly
    ● Ensure that you can respond within SLA
    ● Printed copy of on-call contact info
    ● Disaster recovery (DR)
    http://i.imgur.com/wvg8IDq.gif
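
The preparation checklist above lends itself to an automated readiness check. The Python sketch below is illustrative only: `OnCallEntry`, `readiness_problems`, the role name `incident-commander`, the 15-minute SLA and the 7-day pager-test window are all assumptions, not values from the talk.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class OnCallEntry:
    name: str
    role: str
    last_pager_test: date
    response_minutes: int  # expected time to respond once paged


def readiness_problems(roster, today, sla_minutes=15, max_test_age_days=7):
    """Return human-readable problems found in an on-call roster."""
    problems = []
    if not any(entry.role == "incident-commander" for entry in roster):
        problems.append("no incident commander is pre-assigned")
    for entry in roster:
        if today - entry.last_pager_test > timedelta(days=max_test_age_days):
            problems.append(f"{entry.name}: pager not tested recently")
        if entry.response_minutes > sla_minutes:
            problems.append(f"{entry.name}: cannot respond within SLA")
    return problems


roster = [
    OnCallEntry("alex", "incident-commander", date(2018, 9, 1), 10),
    OnCallEntry("sam", "database", date(2018, 8, 1), 30),
]
for problem in readiness_problems(roster, today=date(2018, 9, 4)):
    print(problem)
```

Running a check like this on a schedule gives the responsible manager a concrete signal that the published roster is still usable.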

  46. 2. Notification & Initial
    Response

  47. Applying this to operations
    ​NOTIFICATION & INITIAL RESPONSE
    ● NOC/SiteOps team notifies incident
    commander + manager
    ○ Prod-SRE gets engaged
    ● Prod-SRE Manager/Oncall
    ○ Assess, Engage, Notify, Mitigate
    https://docs.microsoft.com/en-us/windows/uwp/design/shell/tiles-and-notifications/images/toast-mirroring.gif

  48. Applying this to operations
    ​NOTIFICATION & INITIAL RESPONSE
    ● Once verified, we launch full response for Major
    Incident
    ● Incident commander gives “party status” to
    observers
    ● Manager informs executives & PR
    ○ Periodic updates
    ● Mitigate
    http://www.roadrunneremaillogin.com/wp-content/uploads/2018/06/RoadRunner-Email.jpg

  49. 3. On-Scene Activities

  50. Applying this to operations
    ​ON-SCENE ACTIVITIES
    ● Private + public Slack work channels
    ● IC is empowered to make decisions
    ● Organizational call to ensure:
    ○ Problem is understood
    ○ Areas of investigation are assigned
    http://www.gpla.com/static/img/projects/ubisofts-e3-social-media-war-room/war-room.gif

  51. Applying this to operations
    ​ON-SCENE ACTIVITIES
    ● War room
    ○ Incident commander drives the
    war-room
    ○ Roles & responsibilities assigned to each
    “party”
    ○ Communication at regular cadence to
    execs
    ○ Admin ensures supplies and food
    ● Gathering data and updating timeline doc
    http://www.gpla.com/static/img/projects/ubisofts-e3-social-media-war-room/war-room.gif
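
One possible shape for the timeline doc mentioned above is an append-only log that is always rendered in time order, even when notes are entered late. This is a hedged Python sketch; `TimelineEvent`, `IncidentTimeline` and the output format are hypothetical, not a description of any tool used by the speakers.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(order=True)
class TimelineEvent:
    at: datetime
    author: str = field(compare=False)  # excluded from ordering
    note: str = field(compare=False)  # excluded from ordering


class IncidentTimeline:
    def __init__(self):
        self._events = []

    def record(self, at, author, note):
        """Add an event and keep the log sorted by timestamp."""
        self._events.append(TimelineEvent(at, author, note))
        self._events.sort()

    def render(self):
        """Return one line per event, earliest first."""
        return "\n".join(
            f"{e.at:%H:%M} [{e.author}] {e.note}" for e in self._events
        )


timeline = IncidentTimeline()
timeline.record(datetime(2018, 9, 4, 10, 30), "ic", "Mitigation started")
timeline.record(datetime(2018, 9, 4, 10, 5), "noc", "Alert fired")
print(timeline.render())
```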

  52. 4. Post-On-Scene Activities

  53. Applying this to operations
    ​POST ON-SCENE ACTIVITIES
    ● Post mortem
    ○ Dedicated team
    ○ PM Template
    ○ Blameless
    ● “Postmortem rollup”
    ○ Action items are prioritized
    ○ Weekly reporting on status of
    action-items
    https://www.economist.com/sites/default/files/imagecache/1280-width/20180414_OFP021.gif
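
The "postmortem rollup" above could be produced by a small script. The Python sketch below assumes a hypothetical `ActionItem` record with a numeric priority where 1 is most urgent; the report wording is illustrative.

```python
from dataclasses import dataclass


@dataclass
class ActionItem:
    title: str
    owner: str
    priority: int  # 1 is most urgent
    done: bool = False


def weekly_rollup(items):
    """Return report lines: a summary, then open items, most urgent first."""
    open_items = sorted(
        (item for item in items if not item.done),
        key=lambda item: (item.priority, item.title),
    )
    closed = sum(1 for item in items if item.done)
    lines = [f"{closed} of {len(items)} action items closed"]
    lines += [
        f"P{item.priority} {item.title} (owner: {item.owner})"
        for item in open_items
    ]
    return lines


items = [
    ActionItem("Add alert", "sam", 2),
    ActionItem("Fix failover", "alex", 1),
    ActionItem("Update runbook", "kim", 3, done=True),
]
print("\n".join(weekly_rollup(items)))
```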

  54. Recommendations:
    Most Wanted List

  55. Applying this to operations
    ​MOST WANTED LIST
    ● Use the post-incident process to improve
    and hold people accountable for action
    items
    ● Keep track of recurring issues/ repeaters
    https://clip2art.com/images/meeting-clipart-animated-gif-2.gif
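
Tracking repeaters can start as a simple count of incidents per cause. This Python sketch assumes each postmortem is tagged with a single cause label; the tags, incident IDs and the threshold of 2 are hypothetical.

```python
from collections import Counter


def find_repeaters(incidents, threshold=2):
    """Return (cause, count) pairs for causes seen at least `threshold` times.

    incidents: iterable of (incident_id, cause_tag) pairs.
    """
    counts = Counter(cause for _, cause in incidents)
    return [(cause, n) for cause, n in counts.most_common() if n >= threshold]


incidents = [
    ("INC-1", "expired-cert"),
    ("INC-2", "bad-deploy"),
    ("INC-3", "expired-cert"),
]
print(find_repeaters(incidents))
```

A list like this is a natural input for an internal "most wanted list" of fixes that keep getting deferred.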

  56. Final Thoughts

  57. Final Thoughts
    ● NTSB Investigative Process: Complete Incident + Postmortem process
    ● Invest: The more you put in, the more you’ll get out
    ● Accountability: Accountability for improvements/action items

  58. Questions?
