Slide 1

Slide 1 text

Learning from Firefighters to improve System Reliability

Slide 2

Slide 2 text

Kerim Satirli
Senior Developer Advocate, Infrastructure & Orchestration
he / him
@ksatirli

Slide 3

Slide 3 text

ca. 75 meters ca. 100 meters

Slide 4

Slide 4 text

ca. 825 meters ca. 300 meters

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

10 minutes

Slide 7

Slide 7 text

3000 acres

Slide 8

Slide 8 text

5000 acres

Slide 9

Slide 9 text

regulations follow incidents

Slide 10

Slide 10 text

Rules are Tools

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

Order #1
Keep informed on system conditions and forecasts.
• logs, metrics, traces
• outliers and predictions
• track missing data
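
The "outliers" and "track missing data" points above can be sketched in a few lines of Python. This is a minimal illustration, not something taken from the talk: the function names, the 1.5x gap tolerance, and the 3.5 modified z-score threshold are all assumptions chosen for the example.

```python
from statistics import median


def find_gaps(timestamps, expected_interval, tolerance=1.5):
    """Return (previous, current) pairs where samples are further apart than expected.

    `timestamps` is a sorted list of numbers (for example, seconds).
    `tolerance` is an assumed slack factor, not a recommended value.
    """
    gaps = []
    for prev, curr in zip(timestamps, timestamps[1:]):
        if curr - prev > expected_interval * tolerance:
            gaps.append((prev, curr))
    return gaps


def find_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (based on the median absolute deviation) exceeds `threshold`."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All values are (nearly) identical; nothing can be ranked as an outlier.
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical samples: one scrape interval is missing, one latency value is extreme.
scrape_times = [0, 60, 120, 300, 360]
latencies_ms = [100, 102, 98, 101, 99, 500]

missing = find_gaps(scrape_times, expected_interval=60)
unusual = find_outliers(latencies_ms)
```

A gap in the data is treated as a signal in its own right here: a metric that stops arriving is reported separately from a metric that arrives with an unusual value.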

Slide 14

Slide 14 text

Order #2
Know what your system is doing at all times.
• architecture
• external factors
• (actual) weather

Slide 15

Slide 15 text

Order #3
Base all actions on current and expected behavior of the system.
• chaos testing
• game days
• incident retrospectives
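
As a rough illustration of the "chaos testing" point, a fault can be injected into a call path so the team can observe how the system actually behaves. The sketch below is an assumption-laden toy, not a tool named in the talk: `chaotic`, `InjectedFault`, and `fetch_profile` are hypothetical names.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real call when a fault is injected on purpose."""


def chaotic(func, failure_rate, rng=None):
    """Wrap `func` so that roughly `failure_rate` of calls raise InjectedFault.

    `rng` can be a seeded random.Random instance to make a game day repeatable.
    """
    if rng is None:
        rng = random.Random()

    def wrapper(*args, **kwargs):
        # random() returns a float in [0.0, 1.0), so a rate of 1.0 always fails
        # and a rate of 0.0 never does.
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def fetch_profile(user_id):
    # Hypothetical stand-in for a real downstream call.
    return {"id": user_id}


always_broken = chaotic(fetch_profile, failure_rate=1.0)
never_broken = chaotic(fetch_profile, failure_rate=0.0)
```

In a game day, the wrapped function would replace the real one for a limited scope, and the observed behavior would feed into the incident retrospective.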

Slide 16

Slide 16 text

Order #4
Identify minimum reliability levels and make them known.
• message queues
• degrade gracefully
• communicate early
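
One way to read "degrade gracefully" and "communicate early" together is a wrapper that falls back to a reduced answer and records that it did so. The following is a minimal sketch under that assumption; `with_fallback`, `personalized`, and `bestsellers` are hypothetical names invented for the example.

```python
def with_fallback(primary, fallback, on_degrade=None):
    """Return a callable that tries `primary` and serves `fallback` if it fails.

    `on_degrade` is an optional hook called with the exception, for example to
    post a status notice before customers start asking.
    """
    def call(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception as exc:
            if on_degrade is not None:
                on_degrade(exc)
            return fallback(*args, **kwargs)

    return call


def personalized(user_id):
    # Hypothetical primary path that is currently unavailable.
    raise TimeoutError("recommendation service did not respond")


def bestsellers(user_id):
    # Hypothetical minimum level of service: a static, non-personalized list.
    return ["bestseller-1", "bestseller-2"]


notices = []
recommend = with_fallback(
    personalized,
    bestsellers,
    on_degrade=lambda exc: notices.append(type(exc).__name__),
)
result = recommend("user-42")
```

The fallback is the "minimum reliability level" made concrete: it is decided ahead of time, and the hook makes the degradation visible instead of silent.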

Slide 17

Slide 17 text

Order #5
Post honeypots when there is possible danger.
• monitor honey tokens
• guard against insiders
• possibly bait attackers
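
"Monitor honey tokens" can be approximated by scanning logs for decoy credentials that no legitimate process should ever use. The sketch below assumes plain-text log lines and made-up token values; the function name and both tokens are hypothetical.

```python
def find_honey_token_use(log_lines, tokens):
    """Return (line_number, token) pairs for every decoy token seen in the logs.

    Any hit deserves attention: a honey token has no legitimate use, so its
    appearance suggests that someone found and tried the bait.
    """
    hits = []
    for number, line in enumerate(log_lines, start=1):
        for token in sorted(tokens):
            if token in line:
                hits.append((number, token))
    return hits


# Made-up decoy values for illustration only.
HONEY_TOKENS = {"decoy-api-key-0001", "decoy-db-password-0002"}

access_log = [
    "GET /health 200 key=real-key-abc",
    "GET /admin/export 403 key=decoy-api-key-0001",
    "POST /login 401 user=svc password=decoy-db-password-0002",
]

alerts = find_honey_token_use(access_log, HONEY_TOKENS)
```

In practice the check would run continuously against a log stream and page someone on the first hit, since the expected number of hits is zero.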

Slide 18

Slide 18 text

Order #6
Be alert. Keep calm. Think clearly. Act decisively.
• rotate roles consistently
• trust your training
• know your plays

Slide 19

Slide 19 text

Order #7
Maintain prompt comms with your own and adjoining teams.
• organization
• corp comms
• customers

Slide 20

Slide 20 text

Order #8
Give clear instructions and ensure they are understood.
• verify comprehension
• document choices
• buddy system

Slide 21

Slide 21 text

Order #9
Maintain control of your systems at all times.
• avoid tunnel vision
• don't overextend
• accuracy beats expediency

Slide 22

Slide 22 text

Order #10
Protect systems aggressively, having provided for safety first.
• restore supported services
• collect forensics
• keep taking notes

Slide 23

Slide 23 text

Summary

Slide 24

Slide 24 text

systems reliability is a product of structured command and control

Slide 25

Slide 25 text

systems reliability is a product of structured command and control
structured command and control requires situational awareness

Slide 26

Slide 26 text

systems reliability is a product of structured command and control
structured command and control requires situational awareness
situational awareness requires strong communication

Slide 27

Slide 27 text

systems reliability is a product of strong communication

Slide 28

Slide 28 text

Incident Management for Operations. Schnepp, Vidal, Hawley. 2017.
Fatal Defect: Chasing Killer Computer Bugs. Ivars Peterson. 1995.

Slide 29

Slide 29 text

Thank you
speakerdeck.com/ksatirli