Learning from Firefighters
to improve System Reliability
Slide 2
Slide 2 text
Kerim Satirli
Senior Developer Advocate,
Infrastructure & Orchestration
he / him
Slide 3
Slide 3 text
ca. 75 meters
ca. 100 meters
Slide 4
Slide 4 text
ca. 825 meters
ca. 300 meters
Slide 5
Slide 5 text
No content
Slide 6
Slide 6 text
10 minutes
Slide 7
Slide 7 text
3000 acres
Slide 8
Slide 8 text
5000 acres
Slide 9
Slide 9 text
regulations follow incidents
Slide 10
Slide 10 text
Rules are Tools
Slide 11
Slide 11 text
No content
Slide 12
Slide 12 text
No content
Slide 13
Slide 13 text
Order #1
Keep informed on system
conditions and forecasts.
• logs, metrics, traces
• outliers and predictions
• track missing data
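The bullets on outliers and missing data can be illustrated with a minimal sketch. It assumes a metric arrives as plain lists of timestamps (in seconds) and values. The function names, the 1.5x gap tolerance, and the 3.5 threshold are illustrative assumptions, not something the talk prescribes.

```python
from statistics import median


def find_gaps(timestamps, expected_interval, tolerance=1.5):
    """Return (previous, current) timestamp pairs where data appears to be missing.

    A gap is reported when two consecutive samples are further apart than
    expected_interval * tolerance. Both parameters are assumptions to tune.
    """
    gaps = []
    for prev, curr in zip(timestamps, timestamps[1:]):
        if curr - prev > expected_interval * tolerance:
            gaps.append((prev, curr))
    return gaps


def find_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (median/MAD based) exceeds threshold."""
    med = median(values)
    deviations = [abs(v - med) for v in values]
    mad = median(deviations)
    if mad == 0:
        # All values (or most of them) are identical; nothing to score against.
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


if __name__ == "__main__":
    print(find_gaps([0, 10, 20, 50, 60], expected_interval=10))
    print(find_outliers([100, 102, 98, 101, 99, 500]))
```

A missing sample is often as informative as a bad one, which is why the gap check is kept separate from the outlier check.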
Slide 14
Slide 14 text
Order #2
Know what your system
is doing at all times.
• architecture
• external factors
• (actual) weather
Slide 15
Slide 15 text
Order #3
Base all actions on current and
expected behavior of the system.
• chaos testing
• game days
• incident retrospectives
Slide 16
Slide 16 text
Order #4
Identify minimum reliability
levels and make them known.
• message queues
• degrade gracefully
• communicate early
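One way to read "degrade gracefully" and "communicate early" together is a fallback that serves a reduced response and says so. The sketch below is an assumption-laden illustration: `fetch_recommendations`, the `primary` callable, and the `degraded` flag are hypothetical names, not part of the talk.

```python
def fetch_recommendations(user_id, primary, fallback=()):
    """Return items from the primary source, or a static fallback if it fails.

    `primary` is any callable taking a user id (hypothetical). The result
    carries a `degraded` flag so callers can tell users early that they are
    seeing a reduced experience.
    """
    try:
        return {"items": list(primary(user_id)), "degraded": False}
    except Exception:
        # Broad catch for brevity in this sketch; real code would likely
        # narrow this to timeouts and connection errors, and log the failure.
        return {"items": list(fallback), "degraded": True}


if __name__ == "__main__":
    def broken_backend(user_id):
        raise TimeoutError("recommendation service unavailable")

    print(fetch_recommendations(42, broken_backend, fallback=["bestsellers"]))
```

The explicit flag is the design choice worth noting: the minimum service level is still met, and the degradation is made known rather than hidden.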
Slide 17
Slide 17 text
Order #5
Post honeypots when there
is possible danger.
• monitor honey tokens
• guard against insiders
• possibly bait attackers
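Monitoring honey tokens can be as simple as checking whether a decoy credential ever shows up in access logs, since no legitimate client should use one. The following is a rough sketch under stated assumptions: the token values, the `credential` field, and the log entry shape are all made up for illustration.

```python
# Decoy identifiers that no legitimate client should ever present.
# These values are placeholders invented for this example.
HONEY_TOKENS = frozenset({"decoy-backup-service-key", "decoy-admin-api-key"})


def find_honey_token_hits(log_entries, tokens=HONEY_TOKENS):
    """Return every log entry whose credential is a known honey token.

    Each entry is assumed to be a dict with a "credential" key; entries
    without that key are ignored.
    """
    return [entry for entry in log_entries if entry.get("credential") in tokens]


if __name__ == "__main__":
    logs = [
        {"source": "10.0.0.5", "credential": "real-service-key"},
        {"source": "10.0.0.9", "credential": "decoy-admin-api-key"},
        {"source": "10.0.0.7"},
    ]
    for hit in find_honey_token_hits(logs):
        print("honey token used from", hit["source"])
```

Any hit is worth an alert, whether it points to an outside attacker or an insider.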
Slide 18
Slide 18 text
Order #6
Be alert.
Keep calm.
Think clearly.
Act decisively.
• rotate roles consistently
• trust your training
• know your plays
Slide 19
Slide 19 text
Order #7
Maintain prompt comms with
your own and adjoining teams.
• organization
• corp comms
• customers
Slide 20
Slide 20 text
Order #8
Give clear instructions and
ensure they are understood.
• verify comprehension
• document choices
• buddy system
Slide 21
Slide 21 text
Order #9
Maintain control of your
systems at all times.
• avoid tunnel vision
• don't overextend
• accuracy beats expediency
Slide 22
Slide 22 text
Order #10
Protect systems aggressively,
having provided for safety first.
• restore supported services
• collect forensics
• keep taking notes
Slide 23
Slide 23 text
Slide 24
Slide 24 text
systems reliability is a product of
structured command and control
Slide 25
Slide 25 text
systems reliability is a product of
structured command and control
structured command and control
requires situational awareness
Slide 26
Slide 26 text
systems reliability is a product of
structured command and control
structured command and control
requires situational awareness
situational awareness
requires strong communication
Slide 27
Slide 27 text
systems reliability is a product
of strong communication