Slide 1
Slide 1 text
DISTRIBUTED FAILURE: Learning lessons from aviation. Andrew Godwin, @andrewgodwin
Slide 2
Slide 2 text
Hi, I’m Andrew Godwin
Slide 3
Slide 3 text
Content Warning: aviation accidents, road accidents, discussion of death
Slide 4
Slide 4 text
Software is difficult.
Slide 5
Slide 5 text
Distributed is even harder.
Slide 6
Slide 6 text
No content
Slide 7
Slide 7 text
Not unique to distributed systems
Slide 8
Slide 8 text
No content
Slide 9
Slide 9 text
Who's solved this? Aviation.
Slide 10
Slide 10 text
A Boeing 747 has six million parts
Slide 11
Slide 11 text
A Boeing 747 has six million parts
Slide 12
Slide 12 text
Deaths per billion hours (UK 1990-2000): Walking 220, Car 130, Airplane 30.8, Train 30
Slide 13
Slide 13 text
People matter as much as machines
Slide 14
Slide 14 text
Aviation Accident Causes (2005 Nall report): Pilot 76%, Mechanical 16%, Other 9%
Slide 15
Slide 15 text
Let's look at some aviation principles
Slide 16
Slide 16 text
Principle #1: Hard Failure
Slide 17
Slide 17 text
If something is wrong it turns itself off
Slide 18
Slide 18 text
This only works if you have redundancy
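As a minimal sketch of the two slides above, the following shows a component that switches itself off when it detects a bad state, and a caller that relies on redundancy to carry on. The names `Replica`, `ReplicaDown`, and `call_with_failover` are hypothetical and chosen only for illustration; they are not from the talk.

```python
class ReplicaDown(Exception):
    """Raised when a replica has switched itself off."""


class Replica:
    """Hypothetical service replica that fails hard instead of limping on."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def self_check(self, state_ok):
        # Hard failure: on any sign of a bad state, stop serving entirely
        # rather than returning possibly wrong answers.
        if not state_ok:
            self.healthy = False

    def handle(self, request):
        if not self.healthy:
            raise ReplicaDown(self.name)
        return f"{self.name} handled {request}"


def call_with_failover(replicas, request):
    # Redundancy is what makes hard failure safe: move on to the next replica.
    for replica in replicas:
        try:
            return replica.handle(request)
        except ReplicaDown:
            continue
    raise RuntimeError("all replicas are down")
```

With two replicas, switching one off still leaves the request served by the other; with none left, the caller gets a loud error rather than a quiet wrong answer.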
Slide 19
Slide 19 text
No content
Slide 20
Slide 20 text
These are great ways to ensure you never fix something.
Slide 21
Slide 21 text
No accident or outage has a single cause. Stop your code getting into odd states.
Slide 22
Slide 22 text
No content
Slide 23
Slide 23 text
Single points of failure can be good
Slide 24
Slide 24 text
No content
Slide 25
Slide 25 text
Principle #2: Good Alerting
Slide 26
Slide 26 text
Cockpits are incredibly selective about what sets off an audio alarm
Slide 27
Slide 27 text
Alert fatigue is real. Avoid at all costs.
Slide 28
Slide 28 text
Never, ever, put all errors in the same place
Slide 29
Slide 29 text
Critical; Normal; Background
Slide 30
Slide 30 text
Critical: wakes someone up, actionable; Normal; Background
Slide 31
Slide 31 text
Critical: wakes someone up, actionable; Normal: fixed over the next week; Background
Slide 32
Slide 32 text
Critical: wakes someone up, actionable; Normal: fixed over the next week; Background: metrics, not errors
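One way to keep the three levels apart in code is to give each its own destination, so nothing lands in a shared error pile. This is a sketch under assumptions: `Severity` and `AlertRouter` are hypothetical names, and the in-memory lists stand in for whatever paging, ticketing, and metrics systems are actually in use.

```python
from collections import Counter
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"      # wakes someone up, must be actionable
    NORMAL = "normal"          # fixed over the next week
    BACKGROUND = "background"  # a metric, not an error


class AlertRouter:
    """Hypothetical router: each severity goes to a different place."""

    def __init__(self):
        self.pages = []           # stand-in for an on-call pager
        self.tickets = []         # stand-in for a ticket queue
        self.metrics = Counter()  # stand-in for a metrics backend

    def report(self, severity, name, message=""):
        if severity is Severity.CRITICAL:
            self.pages.append((name, message))
        elif severity is Severity.NORMAL:
            self.tickets.append((name, message))
        else:
            # Background events are only counted, never shown as errors.
            self.metrics[name] += 1
```

Only the `pages` list ever reaches a human at night, which keeps the set of things that can wake someone up small and deliberate.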
Slide 33
Slide 33 text
Have you been ignoring an error for weeks? Then turn off its error reporting.
Slide 34
Slide 34 text
Principle #3: Find your limits
Slide 35
Slide 35 text
Everything will fail. You should know when.
Slide 36
Slide 36 text
Copyright Boeing
Slide 37
Slide 37 text
What's your Minimum Equipment List?
Slide 38
Slide 38 text
REQUIRED OPTIONAL
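A software version of a Minimum Equipment List could be as simple as declaring which components are required and which are optional, then deciding from that whether to keep serving. The component names and the `can_serve` function below are hypothetical, purely to illustrate the split.

```python
# Hypothetical component lists; a real service would define its own.
REQUIRED = {"database", "auth"}
OPTIONAL = {"search", "recommendations"}


def can_serve(status):
    """Decide whether to serve, given a dict of component name -> is it up.

    Returns (ok, missing_required, degraded). Components absent from
    `status` are treated as down.
    """
    missing_required = sorted(c for c in REQUIRED if not status.get(c, False))
    degraded = sorted(c for c in OPTIONAL if not status.get(c, False))
    return (not missing_required, missing_required, degraded)
```

Losing an optional component leaves the service running in a known degraded mode; losing a required one is a reason to stop.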
Slide 39
Slide 39 text
Did you load test? Did you fuzz test?
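Fuzz testing does not need heavy tooling to get started. The sketch below feeds random short strings to a function and records anything other than the expected rejection. Both `parse_port` (the function under test) and `fuzz` are hypothetical examples, and the fixed seed is an assumption made so that runs are repeatable.

```python
import random


def parse_port(text):
    """Hypothetical function under test: parse a TCP port number."""
    value = int(text)  # raises ValueError on non-numeric input
    if not 0 < value < 65536:
        raise ValueError(f"port out of range: {value}")
    return value


def fuzz(func, runs=1000, seed=0):
    """Call func with random short strings; return unexpected crashes."""
    rng = random.Random(seed)
    alphabet = "0123456789-+ xe.\n"
    crashes = []
    for _ in range(runs):
        length = rng.randint(0, 8)
        text = "".join(rng.choice(alphabet) for _ in range(length))
        try:
            func(text)
        except ValueError:
            pass  # the documented way to reject bad input
        except Exception as exc:
            crashes.append((text, exc))
    return crashes
```

Any entry in the returned list is an input that took the code somewhere its author did not plan for, which is exactly the kind of limit worth knowing in advance.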
Slide 40
Slide 40 text
You don't have to perfectly scale.
Slide 41
Slide 41 text
Risk is fine when you're informed!
Slide 42
Slide 42 text
Principle #4: Build for failure
Slide 43
Slide 43 text
No single thing in an aircraft can fail and take it down.
Slide 44
Slide 44 text
We all want this for our code, but the way to do it is to build for failure.
Slide 45
Slide 45 text
Kill your application randomly. Practice server network failures. Develop on unreliable connections.
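A small way to practice these failures during development is to wrap calls so that they fail at a configurable rate, then check that the calling code copes. This is a sketch, not a chaos-testing framework: `flaky`, `retry`, and `InjectedFailure` are hypothetical names, and the injectable random generator is an assumption made to keep tests repeatable.

```python
import random


class InjectedFailure(Exception):
    """Raised by the wrapper to simulate a network or server failure."""


def flaky(func, failure_rate, rng=None):
    """Wrap func so each call fails with probability failure_rate."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure(f"injected failure calling {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def retry(func, attempts):
    """Call func up to `attempts` times, re-raising the last injected failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except InjectedFailure as exc:
            last_error = exc
    raise last_error
```

Running a test suite with a non-zero `failure_rate` on outbound calls shows quickly which code paths assume the network always works.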
Slide 46
Slide 46 text
The majority of pilot training is handling emergencies.
Slide 47
Slide 47 text
No content
Slide 48
Slide 48 text
Use checklists. Don't rely on memory.
Slide 49
Slide 49 text
If you practice failure, you'll be ready when the inevitable happens.
Slide 50
Slide 50 text
Aviation Accident Causes (2005 Nall report): Pilot 76%, Mechanical 16%, Other 9%
Slide 51
Slide 51 text
Principle #5: Communicate well
Slide 52
Slide 52 text
Distributed software means separate teams.
Slide 53
Slide 53 text
As you grow, communication becomes exponentially harder.
Slide 54
Slide 54 text
No content
Slide 55
Slide 55 text
No content
Slide 56
Slide 56 text
No content
Slide 57
Slide 57 text
Clear communication is vital.
Slide 58
Slide 58 text
Write everything down.
Slide 59
Slide 59 text
Have a clear chain of command.
Slide 60
Slide 60 text
Make decisions.
Slide 61
Slide 61 text
Principle #6: No blame culture
Slide 62
Slide 62 text
How do I know all these aviation stats?
Slide 63
Slide 63 text
Every incident is reported and investigated.
Slide 64
Slide 64 text
There is never a single cause of a problem.
Slide 65
Slide 65 text
Make it very difficult to do again.
Slide 66
Slide 66 text
No content
Slide 67
Slide 67 text
No content
Slide 68
Slide 68 text
Encourage reporting.
Slide 69
Slide 69 text
Reward maintenance as well as firefighting
Slide 70
Slide 70 text
No content
Slide 71
Slide 71 text
In aviation, every rule is written in blood.
Slide 72
Slide 72 text
Software is not yet there. But we are getting closer.
Slide 73
Slide 73 text
Margaret Hamilton: her error detection code saved Apollo 11
Slide 74
Slide 74 text
Therac-25: killed 3, severely injured at least 3 more
Slide 75
Slide 75 text
No content
Slide 76
Slide 76 text
No content
Slide 77
Slide 77 text
Hard failure. Good alerting. Find your limits. Build for failure. Communicate well. No blame culture.
Slide 78
Slide 78 text
Thanks.