Learning from Firefighters

In this presentation, I explain what we can learn from wilderness firefighters about increasing the reliability of our systems.

This version of the talk was given at the 2024 Edition of Open Source Summit Europe as part of the "Cloud Open" track.

Kerim Satirli

September 18, 2024

Transcript

  1. Order #1: Keep informed on system conditions and forecasts.

    • logs, metrics, traces • outliers and predictions • track missing data
  2. Order #2: Know what your system is doing at all times.

    • architecture • external factors • (actual) weather
  3. Order #3: Base all actions on current and expected behavior of the system.

    • chaos testing • game days • incident retrospectives
  4. Order #4: Identify minimum reliability levels and make them known.

    • message queues • degrade gracefully • communicate early
  5. Order #5: Post honeypots when there is possible danger.

    • monitor honey tokens • guard against insiders • possibly bait attackers
  6. Order #6: Be alert. Keep calm. Think clearly. Act decisively.

    • rotate roles consistently • trust your training • know your plays
  7. Order #7: Maintain prompt comms with your own and adjoining teams.

    • organization • corp comms • customers
  8. Order #8: Give clear instructions and ensure they are understood.

    • verify comprehension • document choices • buddy system
  9. Order #9: Maintain control of your systems at all times.

    • avoid tunnel vision • don't overextend • accuracy beats expediency
  10. Order #10: Protect systems aggressively, having provided for safety first.

    • restore supported services • collect forensics • keep taking notes
  11. systems reliability is a product of structured command and control

    structured command and control requires situational awareness
  12. systems reliability is a product of structured command and control

    structured command and control requires situational awareness

    situational awareness requires strong communication
  13. Incident Management for Operations. Schnepp, Vidal, Hawley. 2017.

    Fatal Defect: Chasing Killer Computer Bugs. Ivars Peterson. 1995.