Hi, I’m
Andrew Godwin
• Django core developer
• Senior Software Engineer at
• Private + Instrument pilot
Slide 3
Slide 3 text
Content Warning
Slide 4
Slide 4 text
Software is difficult.
Slide 5
Slide 5 text
By Derek Lowe
"Things I won't work with"
Slide 6
Slide 6 text
On Hexanitrohexaazaisowurtzitane
"...a more stable form of it, by mixing it with TNT.
Yes, this is an example of something that becomes less explosive
as a one-to-one cocrystal with TNT."
Slide 7
Slide 7 text
On “Sand Won’t Save You This Time”
"...the operator is confronted with the problem of coping
with a metal-fluorine fire.
For dealing with this situation, I have always
recommended a good pair of running shoes."
Slide 8
Slide 8 text
Unicode
Locales
Time
Calendars
Geography
Money
Slide 9
Slide 9 text
Network latency
Hardware unreliability
Deadlocks
Bit flips
Ambiguous specifications
No documentation
Slide 10
Slide 10 text
Not unique to software
We just move faster and hit them at higher speed.
Slide 11
Slide 11 text
No content
Slide 12
Slide 12 text
Who's solved this? Aviation.
Slide 13
Slide 13 text
A Boeing 747 has six million parts
Slide 14
Slide 14 text
A Boeing 747 has six million parts
…and a 0.000006% accident rate
Slide 15
Slide 15 text
Deaths per billion hours (per passenger, UK 1990-2000)
Walking: 220
Car: 130
Airplane: 30.8
Train: 30
Slide 16
Slide 16 text
People matter as much as machines
Slide 17
Slide 17 text
Aviation Accident Causes (2005 Nall report)
Pilot: 76%
Mechanical: 16%
Other: 9%
Slide 18
Slide 18 text
Let's look at some aviation principles
And how we can apply them to software.
Slide 19
Slide 19 text
Principle #1
Hard Failure
Slide 20
Slide 20 text
If something is wrong it turns itself off
Autopilots, engines, air conditioning, and more
Slide 21
Slide 21 text
This only works if you have redundancy
All of these systems have a backup that lets you land.
Slide 22
Slide 22 text
"We'll ignore errors so the site doesn't crash!"
"Save the invalid data and we'll fix it later"
Slide 23
Slide 23 text
These are great ways to ensure you
never fix something.
Slide 24
Slide 24 text
No accident or outage has a single cause.
Stop your code getting into odd states.
Slide 25
Slide 25 text
Fail hard if anything unexpected happens
Validate all your data strictly in and out
Deploy changes early and often
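A minimal sketch of "fail hard" plus strict validation at the boundary. The `Payment` type, its fields, and the currency whitelist are hypothetical and exist only for illustration; the point is that unexpected input raises immediately instead of being saved and fixed later.

```python
from dataclasses import dataclass


class ValidationError(ValueError):
    """Raised when data does not match what we expect. Never swallowed."""


@dataclass(frozen=True)
class Payment:
    # Hypothetical record type, used only for illustration.
    account_id: int
    amount_cents: int
    currency: str


ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}  # assumed whitelist


def parse_payment(raw: dict) -> Payment:
    """Validate strictly on the way in; raise instead of guessing."""
    expected = {"account_id", "amount_cents", "currency"}
    if set(raw) != expected:
        odd = sorted(str(key) for key in set(raw) ^ expected)
        raise ValidationError(f"missing or unexpected fields: {odd}")
    account_id = raw["account_id"]
    amount = raw["amount_cents"]
    currency = raw["currency"]
    # bool is a subclass of int in Python, so reject it explicitly.
    for name, value in (("account_id", account_id), ("amount_cents", amount)):
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValidationError(f"{name} must be an integer, got {value!r}")
    if amount <= 0:
        raise ValidationError(f"amount_cents must be positive, got {amount}")
    if currency not in ALLOWED_CURRENCIES:
        raise ValidationError(f"unknown currency {currency!r}")
    return Payment(account_id, amount, currency)
```

A caller that receives `"amount_cents": "500"` gets a `ValidationError` rather than a silently coerced value, which keeps odd states out of the rest of the code.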
Slide 26
Slide 26 text
Single points of failure can be good
Only one place to look when things go wrong!
Slide 27
Slide 27 text
No content
Slide 28
Slide 28 text
Principle #2
Good Alerting
Slide 29
Slide 29 text
Cockpits are incredibly selective about
what sets off an audio alarm
Slide 30
Slide 30 text
Alert fatigue is real. Avoid at all costs.
Slide 31
Slide 31 text
Never, ever, put all errors in the same place
Slide 32
Slide 32 text
Critical
Normal
Background
Slide 33
Slide 33 text
Critical
Normal
Background
Wakes someone up. Actionable.
Slide 34
Slide 34 text
Critical
Normal
Background
Wakes someone up. Actionable.
Fixed over the next week.
Slide 35
Slide 35 text
Critical
Normal
Background
Wakes someone up. Actionable.
Fixed over the next week.
Metrics, not errors.
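One way the three tiers could look in code. This is a sketch under assumptions: `AlertRouter` and its in-memory lists stand in for a real pager, ticket queue, and metrics system, none of which the talk names.

```python
from collections import Counter
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"      # wakes someone up; must be actionable
    NORMAL = "normal"          # fixed over the next week
    BACKGROUND = "background"  # counted as a metric, not reported as an error


class AlertRouter:
    """Hypothetical router; the pager and ticket queue are plain lists here."""

    def __init__(self):
        self.pages = []
        self.tickets = []
        self.metrics = Counter()

    def report(self, name: str, severity: Severity, action: str = "") -> None:
        if severity is Severity.CRITICAL:
            # A page with no stated action is noise, so refuse to send it.
            if not action:
                raise ValueError("critical alerts must say what to do")
            self.pages.append((name, action))
        elif severity is Severity.NORMAL:
            self.tickets.append(name)
        else:
            self.metrics[name] += 1
```

Each tier lands in a different place, so a burst of background events can never bury the one alert that needs a human.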
Slide 36
Slide 36 text
Have you been ignoring an error for weeks?
Then turn off its error reporting.
Slide 37
Slide 37 text
Principle #3
Find your limits
Slide 38
Slide 38 text
Everything will fail. You should know when.
Slide 39
Slide 39 text
Copyright Boeing
Slide 40
Slide 40 text
What's your Minimum Equipment List?
What can you run the system without?
Slide 41
Slide 41 text
REQUIRED: Lavatory ashtrays, Seatbelt signs, Fuel caps
OPTIONAL: Air conditioning, Passenger video screens, Weather radar
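A software version of a Minimum Equipment List might be a startup check that separates required components from optional ones. The sketch below is an assumption about how that could be structured; `preflight` and the component names in the usage note are made up for illustration.

```python
from typing import Callable, Dict, List, Tuple

REQUIRED = "required"
OPTIONAL = "optional"


def preflight(checks: Dict[str, Tuple[str, Callable[[], bool]]]) -> List[str]:
    """Run health checks before serving traffic.

    Returns the optional components that are down (run degraded without
    them) and raises if any required component is down.
    """
    degraded = []
    for name, (kind, check) in checks.items():
        try:
            ok = bool(check())
        except Exception:
            # A health check that blows up counts as a failed check.
            ok = False
        if ok:
            continue
        if kind == REQUIRED:
            raise RuntimeError(f"required component unavailable: {name}")
        degraded.append(name)
    return degraded
```

For example, `preflight({"database": (REQUIRED, db_ok), "search": (OPTIONAL, search_ok)})` would start without search but refuse to start without the database. Writing the list down forces the decision to be made before the outage, not during it.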
Slide 42
Slide 42 text
Did you load test? Did you fuzz test?
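Fuzz testing does not need heavy tooling to get started. The sketch below throws random strings at a small, hypothetical `parse_percentage` function and accepts only two outcomes: a value in range, or a `ValueError`. Anything else propagates and fails the run. The alphabet and run count are arbitrary choices.

```python
import random
import string


def parse_percentage(text: str) -> int:
    """Hypothetical function under test: accepts '0%' through '100%'."""
    if not text.endswith("%"):
        raise ValueError("missing % sign")
    digits = text[:-1]
    # isdigit() alone also accepts non-ASCII digits, so check both.
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("not a number")
    value = int(digits)
    if value > 100:
        raise ValueError("out of range")
    return value


def fuzz(target, runs: int = 10_000, seed: int = 0) -> int:
    """Feed random strings to target; only ValueError is an accepted failure.

    Returns how many inputs were accepted.
    """
    rng = random.Random(seed)
    alphabet = string.digits + "%-+. \u0661\u0662" + string.ascii_letters
    accepted = 0
    for _ in range(runs):
        length = rng.randint(0, 6)
        text = "".join(rng.choice(alphabet) for _ in range(length))
        try:
            result = target(text)
        except ValueError:
            continue  # rejecting bad input loudly is the desired behaviour
        assert 0 <= result <= 100, f"invalid result {result!r} for {text!r}"
        accepted += 1
    return accepted
```

A fixed seed keeps failures reproducible; a real setup would more likely use a dedicated fuzzing or property-testing tool.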
Slide 43
Slide 43 text
You don't have to perfectly scale.
But you do have to know where your limits are.
Slide 44
Slide 44 text
Risk is fine when you're informed!
Unknowns are the most dangerous thing.
Slide 45
Slide 45 text
Principle #4
Build for failure
Slide 46
Slide 46 text
No single thing in an aircraft can
fail and take it down.
Slide 47
Slide 47 text
We all want this for our code, but
the way to do it is to build for failure.
Slide 48
Slide 48 text
Kill your application randomly
Practice server network failures
Develop on unreliable connections
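Practising failure can start small. The sketch below injects connection failures into a call at a configurable rate and exercises a bounded retry against them. All names here (`unreliable`, `call_with_retry`, `FlakyConnection`) are illustrative assumptions, not an existing library.

```python
import random


class FlakyConnection(ConnectionError):
    """Injected failure, kept distinct from real ones so logs stay honest."""


def unreliable(func, failure_rate: float, rng: random.Random):
    """Wrap func so that a fraction of calls fail before reaching it."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise FlakyConnection("injected network failure")
        return func(*args, **kwargs)
    return wrapper


def call_with_retry(func, attempts: int = 5):
    """The code being practised: retry a bounded number of times, then fail hard."""
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except ConnectionError as exc:
            last_error = exc
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

Running a development environment with, say, `failure_rate=0.2` on every outbound call makes the retry and give-up paths part of everyday testing rather than something first seen in production.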
Slide 49
Slide 49 text
The majority of pilot training is
handling emergencies.
Slide 50
Slide 50 text
No content
Slide 51
Slide 51 text
Use checklists. Don't rely on memory.
Slide 52
Slide 52 text
If you practice failure, you'll be ready
when the inevitable happens.
Slide 53
Slide 53 text
Aviation Accident Causes (2005 Nall report)
Pilot: 76%
Mechanical: 16%
Other: 9%
Slide 54
Slide 54 text
Principle #5
Communicate well
Slide 55
Slide 55 text
"You have control"
"I have control"
"You have control"
Slide 56
Slide 56 text
Complex software means
separate teams.
Slide 57
Slide 57 text
As you grow, communication becomes
exponentially harder.
Slide 58
Slide 58 text
No content
Slide 59
Slide 59 text
No content
Slide 60
Slide 60 text
No content
Slide 61
Slide 61 text
Clear communication is vital.
Slide 62
Slide 62 text
Write everything down.
Written specs = less time in meetings.
Slide 63
Slide 63 text
Have a clear chain of command.
Slide 64
Slide 64 text
Make decisions.
They don't have to be perfect, just good enough.
Slide 65
Slide 65 text
Principle #6
No blame culture
Slide 66
Slide 66 text
How do I know all these aviation stats?
Slide 67
Slide 67 text
Every incident is reported and investigated.
Slide 68
Slide 68 text
There is never a single cause of a problem.
Slide 69
Slide 69 text
Make it very difficult to do again.
Why did your software let this happen? What's the UX of your admin tools like?
Slide 70
Slide 70 text
No content
Slide 71
Slide 71 text
No content
Slide 72
Slide 72 text
Encourage reporting.
Don't blame anyone for a mistake. They're unlikely to make it again.
Slide 73
Slide 73 text
Reward maintenance as well as firefighting
It's easy to look good when you ship broken code and are always heroically fixing it.
Slide 74
Slide 74 text
No content
Slide 75
Slide 75 text
In aviation, every rule is written in blood.
Slide 76
Slide 76 text
Software is not yet there.
But we are getting closer.
Slide 77
Slide 77 text
Margaret Hamilton
Her error detection code saved Apollo 11
Slide 78
Slide 78 text
Patriot Missile
Floating-point bug killed 28
Slide 79
Slide 79 text
Therac-25
Killed 3, severely injured
at least 3 more
Slide 80
Slide 80 text
Uber Autonomous
Vehicle
Saw a pedestrian and chose
to hit her
Slide 81
Slide 81 text
No content
Slide 82
Slide 82 text
Hard failure
Good alerting
Find your limits
Build for failure
Communicate well
No blame culture