AllDayDevOps: What the NTSB teaches us about incident management & postmortems

Michael
October 17, 2018

The National Transportation Safety Board (NTSB) is one of the most widely known government bodies in the world. Its role is to arrive at the scene of an incident, secure it and understand everything that happened. Given the important and unpredictable nature of this work, the NTSB has an extensive manual that sets out how incidents should be attended and how the investigation should progress.

This session will detail how the NTSB’s approach to its work, and the procedure that drives it, is transferable to us as incident responders. We’ll cover the NTSB’s pre-incident preparation, incident notification, on-scene response, collection of information in the field, report writing and public hearings. Throughout, we’ll draw parallels to IT incident management and show how to create processes and procedures that mirror those of the NTSB.

Transcript

  1. What the NTSB teaches us about
    incident management & postmortems
    Michael Kehoe
    Staff Site Reliability Engineer

  2. Agenda and Vision

  3. Today’s
    agenda
    1 Introductions
    2 Background on the NTSB
    3 NTSB: Investigative Process
    4 Recommendations & Most Wanted List
    5 How does this apply to us?
    6 Final thoughts

  4. Michael Kehoe
    $ WHOAMI
    • Staff Site Reliability Engineer @ LinkedIn
    • Production-SRE Team:
    • Disaster Recovery
    • Incident Response
    • Visibility Engineering
    • Reliability Principles
    • Find me online at:
    • @matrixtek
    • https://michael-kehoe.io
    • linkedin.com/in/michaelkkehoe

  5. Production-SRE Team @ LinkedIn
    $ /USR/BIN/WHOAMI
    ● Disaster Recovery - Planning & Automation
    ● Incident Response – Process & Automation
    ● Visibility Engineering – Making use of
    operational data
    ● Reliability Principles – Defining best practice
    & automating it

  6. Incident Command System (ICS)
    https://training.fema.gov/emiweb/is/icsresource/assets/reviewmaterials.pdf

  7. Background on the NTSB

  8. Background on the NTSB
    JURISDICTION
    ● Aviation
    ● Surface Transportation
    ● Marine
    ● Pipeline
    ● Assistance to other agencies/ governments

  9. “The NTSB shall investigate or have investigated and
    establish the facts, circumstances, and cause or
    probable cause of accidents…”
    49 U.S. Code § 1131

  10. “… The Board shall report on the facts and
    circumstances of each accident investigated…The
    Board shall make each report available to the public
    at reasonable cost…”
    49 U.S. Code § 1131

  11. “The NTSB does not assign fault or blame for an
    accident or incident…accident/incident
    investigations are fact-finding proceedings with no
    formal issues and no adverse parties … and are not
    conducted for the purpose of determining the rights
    or liabilities of any person.”
    49 CFR § 831.4 (see also 49 U.S. Code § 1154)

  12. Similar Organizations
    ● Italy – Agenzia nazionale per la
    Sicurezza del Volo (ANSV)
    ● Canada – Transportation Safety Board
    of Canada (TSB)
    ● Indonesia – Komite Nasional
    Keselamatan Transportasi (NTSC)
    ● Netherlands – Dutch Safety Board
    (DSB)
    ● Australia – Australian Transport Safety
    Bureau (ATSB)
    ● United Kingdom – Air Accidents
    Investigation Branch (AAIB)
    ● Germany – Bundesstelle für
    Flugunfalluntersuchung (BFU)
    ● France – Bureau d’Enquêtes et
    d’Analyses pour la Sécurité de
    l’Aviation Civile (BEA)

  13. NTSB Investigation Process

  14. NTSB Investigation Process
    1. Pre-Investigation Preparation
    2. Notification & Initial Response
    3. On-Scene Activities
    4. Post-On-Scene Activities

  15. 1. Pre-Investigation
    Preparation

  16. Pre-Investigation Preparation
    GO TEAM
    ● Go team: On call investigators ready for
    assignments
    ● Investigator-In-Charge (IIC) pre-assigned
    ● Full Go team may contain several subject
    matter experts; e.g.
    ○ Human performance
    ○ Aircraft performance
    ○ Air Traffic Control

  17. Pre-Investigation Preparation
    GO TEAM ROSTER
    ● On-call roster made available internally
    ○ Phone & Pager numbers
    ● Updated weekly
    ● All personnel should be able to arrive at an
    airport within 2 hours of notification
    ○ Should have essentials on them if they
    live far away from an airport
    ● Division Chiefs responsible for testing pager

  18. 2. Notification & Initial
    Response

  19. Notification & Initial Response
    REGIONAL RESPONSE
    1. Regional office notifies headquarters of
    incident
    2. Closest regional office to accident will
    provide at least one investigator to perform
    PR & “stakedown”

  20. Notification & Initial Response
    HEADQUARTERS RESPONSE
    1. After incident occurs: communication center
    advises IIC and chief of Major Investigations
    (who subsequently inform their superiors)
    2. OAS director decides whether to launch a
    Go-Team
    3. Other executives are made aware by Chief of
    Major Investigations

  21. Notification & Initial Response
    NOTIFICATION & ASSIGNMENTS
    ● Go-Team composition determined by
    incident circumstances
    ● Send more specialists if in doubt

  22. Notification & Initial Response
    PARTY NOTIFICATION
    ● IIC gives party status to organizations that
    can provide technical assistance (airlines,
    aircraft manufacturers etc.)
    ● Communication center will help with travel
    arrangements and on-site administrative
    support
    ● Go-Team will travel together to accident site

  23. 3. On-Scene Activities

  24. On-Scene Activities
    COMMAND ROOMS
    ● Have meeting rooms to accommodate at least
    30 people
    ● Have space for media
    ● Ensure you have equipment in command
    room
    ○ PCs
    ○ Telephone systems
    ○ Forms
    ● IIC is responsible for managing this

  25. On-Scene Activities
    COMMAND ROOMS
    ● For Major investigations, Administrative
    support is provided
    ● Government purchase card is available for
    goods or services

  26. On-Scene Activities
    ORGANIZATIONAL MEETING
    ● Share preliminary information
    ● Organize (assign) participants
    ● Organize observers
    ● Establish lines of authority

  27. “The manner in which the IIC conducts the
    organizational meeting will establish the tone of the
    investigation. Therefore, the importance of being
    organized, articulate, assertive, composed, and
    understanding cannot be overstated”
    Major Investigations Manual Sec 3.2

  28. On-Scene Activities
    ACCIDENT SITE SAFETY PRECAUTIONS
    ● Safety officer identifies & classifies risks and
    then develops counter-measures
    ● Safety officer performs daily briefings to
    accident site team.

  29. On-Scene Activities
    OBSERVERS
    ● Observers may be allowed if they do not have
    self-interest
    ● May include:
    ○ Congressional oversight committee(s)
    ○ Military personnel
    ○ Foreign Governments
    ○ Federal Agencies

  30. On-Scene Activities
    LINE OF AUTHORITY
    ● IIC is the most senior person on-scene and all
    investigative activity is under his/ her control
    ● If IIC cannot resolve an issue, IIC may talk to
    Chief of Major Investigations
    ● Ability to escalate further if required

  31. On-Scene Activities
    PROGRESS MEETINGS
    ● On-site progress meetings are held daily to:
    ○ Disseminate information obtained
    ○ Plan the day’s activities
    ○ Discuss plans for subsequent
    investigative activities
    ● Generally start at 6pm
    ● Plan next day’s meeting

  32. On-Scene Activities
    DAILY ACTIVITIES OF IIC
    ● Headquarters briefing
    ● Safety board staff meeting
    ● Party coordinator meeting
    ● Site visit

  33. 4. Post-On-Scene Activities

  34. NTSB Report Structure
    ● Factual Information: Gathering facts
    about the incident
    ● Analysis: Analyze how the facts
    contributed to the incident
    ● Conclusions: Draw conclusions about
    what happened
    ● Recommendations: Write detailed
    recommendations
    ● Appendices: Extra information
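
The five report sections on this slide map naturally onto a postmortem template. The following is only a minimal sketch of that idea: the `Report` class, its field names and the plain-text rendering are illustrative assumptions, not the NTSB's actual template or anything shown in the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Report:
    """Hypothetical postmortem document using the section names from the slide."""
    factual_information: str
    analysis: str
    conclusions: str
    recommendations: list = field(default_factory=list)
    appendices: list = field(default_factory=list)

    def render(self):
        """Render the sections in a fixed order; appendices only if present."""
        parts = [
            "Factual Information\n" + self.factual_information,
            "Analysis\n" + self.analysis,
            "Conclusions\n" + self.conclusions,
            "Recommendations\n" + "\n".join(f"- {r}" for r in self.recommendations),
        ]
        if self.appendices:
            parts.append("Appendices\n" + "\n".join(f"- {a}" for a in self.appendices))
        return "\n\n".join(parts)


report = Report(
    factual_information="Checkout error rate rose at 14:02.",
    analysis="A config push removed a required setting.",
    conclusions="Config validation did not cover this setting.",
    recommendations=["Add schema validation to the config pipeline"],
)
text = report.render()
print(text)
```

Keeping the facts in a separate section from the analysis makes it harder to slip interpretation into the timeline.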

  35. Post-On-Scene Activities
    WORK PLANNING
    ● Discuss activities that will follow the on-scene
    phase of investigation
    ● Build timelines for work
    ● Provide avenues for various teams to work
    together

  36. Post-On-Scene Activities
    FACTS & ANALYSIS REPORT
    ● A factual report based on the field notes and
    subsequent investigation activities
    ● Each group chairman shall submit an analysis
    report based on the information contained in
    his or her factual report.

  37. Post-On-Scene Activities
    PUBLIC HEARING
    ● Led by IIC/ Hearing Officer
    ● Identify witnesses whose testimony is
    appropriate
    ● The witnesses may be from the parties to the
    investigation or can be suggested by one or
    more of the parties.
    ● Purpose: To ensure all relevant information is
    gathered before writing the report

  38. Post-On-Scene Activities
    TECHNICAL REVIEW
    ● Provides an additional opportunity for all
    parties to review all factual information
    ● Ensures all issues are resolved
    ● Technical Review is held as soon as possible
    after public hearing

  39. Post-On-Scene Activities
    PREPARATION OF FINAL REPORT
    ● Dedicated department to help write report
    ● Follows a standard template
    ○ Annex 13 to the Convention on
    International Civil Aviation (ICAO)
    ● Contains formal recommendations to
    manufacturers/ transportation authorities

  40. Recommendations &
    Most Wanted List

  41. Recommendations & Most Wanted List
    ● NTSB advocates for particular action items
    based on report(s):
    ○ Generally directed towards Transport
    bodies/ manufacturers
    ● NTSB publicly tracks response of the
    responsible body
    https://www.ntsb.gov/safety/mwl/Pages/default.aspx

  42. How does this relate to all of us?

  43. 1. Pre-Investigation
    Preparation

  44. Applying this to operations
    PRE-INCIDENT PREPARATION
    ● Have an Incident commander pre-assigned
    ● Publish on-call schedules
    ○ Manager is responsible
    ● Test on-call pagers regularly
    ● Ensure that you can respond within SLA
    ● Printed copy of on-call contact info
    ● DR
    http://i.imgur.com/wvg8IDq.gif
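
Parts of this checklist can be checked automatically. The following is a minimal sketch, assuming a hypothetical roster format: the `OnCallEntry` fields, the 7-day pager-test window and the 15-minute response SLA are illustrative values, not figures from the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class OnCallEntry:
    """One person on the published on-call schedule (hypothetical format)."""
    name: str
    phone: str
    last_pager_test: datetime     # when this person's pager was last tested
    expected_response: timedelta  # how quickly this person can respond


def readiness_problems(roster, now,
                       max_test_age=timedelta(days=7),
                       response_sla=timedelta(minutes=15)):
    """Return human-readable problems found in the roster."""
    problems = []
    for entry in roster:
        if now - entry.last_pager_test > max_test_age:
            problems.append(f"{entry.name}: pager test overdue")
        if entry.expected_response > response_sla:
            problems.append(f"{entry.name}: cannot respond within SLA")
    return problems


now = datetime(2018, 10, 17, 9, 0)
roster = [
    OnCallEntry("alice", "555-0100", now - timedelta(days=2), timedelta(minutes=10)),
    OnCallEntry("bob", "555-0101", now - timedelta(days=10), timedelta(minutes=30)),
]
problems = readiness_problems(roster, now)
for line in problems:
    print(line)
```

A check like this could run on a schedule and send its output to the manager responsible for the roster.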

  45. 2. Notification & Initial
    Response

  46. Applying this to operations
    NOTIFICATION & INITIAL RESPONSE
    ● NOC/SiteOps team notifies incident
    commander + manager
    ○ Prod-SRE gets engaged
    ● Prod-SRE Manager/Oncall
    ○ Assess, Engage, Notify, Mitigate
    https://docs.microsoft.com/en-us/windows/uwp/design/shell/tiles-and-notifications/images/toast-mirroring.gif

  47. Applying this to operations
    NOTIFICATION & INITIAL RESPONSE
    ● Once verified, we launch full response for Major
    Incident
    ● Incident commander gives “party status” to
    observers
    ● Manager informs executives & PR
    ○ Periodic updates
    ● Mitigate
    http://www.roadrunneremaillogin.com/wp-content/uploads/2018/06/RoadRunner-Email.jpg
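
Periodic updates are easier to keep up when the schedule is computed rather than remembered. This is a minimal sketch, assuming a fixed 30-minute cadence. The `update_times` helper and the cadence are illustrative assumptions; the talk does not specify an interval.

```python
from datetime import datetime, timedelta


def update_times(start, end, cadence=timedelta(minutes=30)):
    """Return the times between start and end at which a status update is due."""
    due = []
    t = start + cadence
    while t <= end:
        due.append(t)
        t += cadence
    return due


start = datetime(2018, 10, 17, 14, 0)
end = datetime(2018, 10, 17, 15, 45)
schedule = [t.strftime("%H:%M") for t in update_times(start, end)]
print(schedule)
```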

  48. 3. On-Scene Activities

  49. Applying this to operations
    ON-SCENE ACTIVITIES
    ● Private + public Slack work channels
    ● IC is empowered to make decisions
    ● Organizational call to ensure:
    ○ Problem is understood
    ○ Areas of investigation are assigned
    http://www.gpla.com/static/img/projects/ubisofts-e3-social-media-war-room/war-room.gif

  50. Applying this to operations
    ON-SCENE ACTIVITIES
    ● War room
    ○ Incident commander drives the war room
    ○ Roles & responsibilities assigned to each
    “party”
    ○ Communication at regular cadence to
    execs
    ○ Admin ensures supplies and food
    ● Gathering data and updating timeline doc
    http://www.gpla.com/static/img/projects/ubisofts-e3-social-media-war-room/war-room.gif
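
The timeline doc is the operational counterpart of the NTSB's field notes. As a minimal sketch of one way to keep it, the following records entries as they arrive and renders them in chronological order. The `Timeline` class and its methods are hypothetical names chosen for illustration, not a tool referenced in the talk.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(order=True)
class TimelineEntry:
    at: datetime                        # only this field is used for ordering
    author: str = field(compare=False)
    note: str = field(compare=False)


class Timeline:
    """Append-only incident timeline; entries may be added out of order."""

    def __init__(self):
        self._entries = []

    def add(self, at, author, note):
        self._entries.append(TimelineEntry(at, author, note))

    def render(self):
        """Return the timeline as text, sorted by time of occurrence."""
        return "\n".join(
            f"{e.at:%H:%M} [{e.author}] {e.note}" for e in sorted(self._entries)
        )


timeline = Timeline()
timeline.add(datetime(2018, 10, 17, 14, 12), "ic", "Rolled back deploy")
timeline.add(datetime(2018, 10, 17, 14, 2), "noc", "Error rate alert fired")
rendered = timeline.render()
print(rendered)
```

Sorting at render time lets responders log facts as they learn them, which is rarely in the order the events happened.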

  51. 4. Post-On-Scene Activities

  52. Applying this to operations
    POST ON-SCENE ACTIVITIES
    ● Post mortem
    ○ Dedicated team
    ○ PM Template
    ○ Blameless
    ● “Postmortem rollup”
    ○ Action items are prioritized
    ○ Weekly reporting on status of action items
    https://www.economist.com/sites/default/files/imagecache/1280-width/20180414_OFP021.gif
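
A "postmortem rollup" can be as simple as a weekly summary of outstanding action items, ordered by priority. This is a minimal sketch under stated assumptions: the `ActionItem` fields, the three status values and the convention that priority 1 is most urgent are all illustrative, not details from the talk.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ActionItem:
    title: str
    owner: str
    priority: int  # 1 is most urgent (assumed convention)
    status: str    # "open", "in_progress" or "done" (assumed values)


def weekly_rollup(items):
    """Return report lines: status counts, then unfinished items by priority."""
    counts = Counter(item.status for item in items)
    lines = [
        f"open={counts['open']} in_progress={counts['in_progress']} done={counts['done']}"
    ]
    outstanding = sorted(
        (item for item in items if item.status != "done"),
        key=lambda item: item.priority,
    )
    for item in outstanding:
        lines.append(f"P{item.priority} {item.title} ({item.owner}, {item.status})")
    return lines


items = [
    ActionItem("Add alert on queue depth", "alice", 2, "open"),
    ActionItem("Fix failover runbook", "bob", 1, "in_progress"),
    ActionItem("Remove stale DNS entry", "carol", 3, "done"),
]
report_lines = weekly_rollup(items)
print("\n".join(report_lines))
```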

  53. Recommendations:
    Most Wanted List

  54. Applying this to operations
    MOST WANTED LIST
    ● Use the post-incident process to improve
    and hold people accountable for action
    items
    ● Keep track of recurring issues/ repeaters
    https://clip2art.com/images/meeting-clipart-animated-gif-2.gif
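
Tracking repeaters can start with a simple count of incidents per cause. The following is a minimal sketch, assuming each postmortem is tagged with a single cause label. The `repeaters` function, the incident IDs and the labels are made up for illustration.

```python
from collections import Counter


def repeaters(incidents, threshold=2):
    """Return (cause, count) pairs seen at least `threshold` times, most frequent first."""
    counts = Counter(cause for _, cause in incidents)
    return [(cause, n) for cause, n in counts.most_common() if n >= threshold]


# Hypothetical (incident id, cause label) pairs taken from past postmortems.
incidents = [
    ("INC-1", "expired certificate"),
    ("INC-2", "bad config push"),
    ("INC-3", "expired certificate"),
    ("INC-4", "expired certificate"),
    ("INC-5", "bad config push"),
    ("INC-6", "disk full"),
]
most_wanted = repeaters(incidents)
print(most_wanted)
```

The resulting list plays the role of an internal "most wanted list": the causes that keep coming back are the ones whose action items deserve the closest follow-up.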

  55. Final Thoughts

  56. Final Thoughts
    ● NTSB Investigative Process: Complete
    incident + postmortem process
    ● Invest: The more you put in, the more
    you’ll get out
    ● Accountability: Accountability for
    improvements/action items

  57. Questions?
