Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

Hi, I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot

Slide 3

Slide 3 text

Content Warning

Slide 4

Slide 4 text

Software is difficult.

Slide 5

Slide 5 text

"Things I Won't Work With" by Derek Lowe

Slide 6

Slide 6 text

On Hexanitrohexaazaisowurtzitane "...a more stable form of it, by mixing it with TNT. Yes, this is an example of something that becomes less explosive as a one-to-one cocrystal with TNT."

Slide 7

Slide 7 text

On “Sand Won’t Save You This Time” "...the operator is confronted with the problem of coping with a metal-fluorine fire. For dealing with this situation, I have always recommended a good pair of running shoes."

Slide 8

Slide 8 text

Unicode • Locales • Time • Calendars • Geography • Money

Slide 9

Slide 9 text

Network latency • Hardware unreliability • Deadlocks • Bit flips • Ambiguous specifications • No documentation

Slide 10

Slide 10 text

Not unique to software. We just move faster and hit them at higher speed.

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

Who's solved this? Aviation.

Slide 13

Slide 13 text

A Boeing 747 has six million parts

Slide 14

Slide 14 text

A Boeing 747 has six million parts… and a 0.000006% accident rate.

Slide 15

Slide 15 text

Deaths per billion hours (per passenger, UK 1990-2000): Walking 220 • Car 130 • Airplane 30.8 • Train 30

Slide 16

Slide 16 text

People matter as much as machines

Slide 17

Slide 17 text

Aviation Accident Causes (2005 Nall report): Pilot 76% • Mechanical 16% • Other 9%

Slide 18

Slide 18 text

Let's look at some aviation principles, and how we can apply them to software.

Slide 19

Slide 19 text

Principle #1 Hard Failure

Slide 20

Slide 20 text

If something is wrong, it turns itself off. Autopilots, engines, air conditioning, and more.

Slide 21

Slide 21 text

This only works if you have redundancy. All of these systems have a backup that lets you land.

Slide 22

Slide 22 text

"We'll ignore errors so the site doesn't crash!" "Save the invalid data and we'll fix it later"

Slide 23

Slide 23 text

These are great ways to ensure you never fix something.

Slide 24

Slide 24 text

No accident or outage has a single cause. Stop your code getting into odd states.

Slide 25

Slide 25 text

Fail hard if anything unexpected happens • Validate all your data strictly in and out • Deploy changes early and often
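
As a minimal sketch of "fail hard" and "validate strictly", the Python below rejects any record that is not exactly the expected shape instead of saving it for later repair. The `Payment` record, its fields, and the `ALLOWED_CURRENCIES` set are hypothetical names invented for illustration; they are not from the talk.

```python
from dataclasses import dataclass


class ValidationError(ValueError):
    """Raised when a record does not match the expected shape."""


@dataclass(frozen=True)
class Payment:  # hypothetical record type, for illustration only
    account_id: int
    amount_cents: int
    currency: str


ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}  # assumed whitelist
EXPECTED_FIELDS = {"account_id", "amount_cents", "currency"}


def parse_payment(raw: dict) -> Payment:
    """Validate on the way in; raise instead of guessing or coercing."""
    if set(raw) != EXPECTED_FIELDS:
        odd = sorted(map(str, set(raw) ^ EXPECTED_FIELDS))
        raise ValidationError(f"missing or unexpected fields: {odd}")

    for name in ("account_id", "amount_cents"):
        value = raw[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValidationError(f"{name} must be an integer, got {value!r}")

    if raw["amount_cents"] <= 0:
        raise ValidationError("amount_cents must be positive")

    currency = raw["currency"]
    if not isinstance(currency, str) or currency not in ALLOWED_CURRENCIES:
        raise ValidationError(f"unsupported currency: {currency!r}")

    return Payment(raw["account_id"], raw["amount_cents"], currency)
```

The design choice is that there is no "best effort" branch: a string where an integer belongs is an error, not something to convert quietly, so odd states are stopped at the boundary.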

Slide 26

Slide 26 text

Single points of failure can be good. Only one place to look when things go wrong!

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

Principle #2 Good Alerting

Slide 29

Slide 29 text

Cockpits are incredibly selective about what sets off an audio alarm

Slide 30

Slide 30 text

Alert fatigue is real. Avoid at all costs.

Slide 31

Slide 31 text

Never, ever, put all errors in the same place

Slide 32

Slide 32 text

Critical • Normal • Background

Slide 33

Slide 33 text

Critical: Wakes someone up. Actionable. • Normal • Background

Slide 34

Slide 34 text

Critical: Wakes someone up. Actionable. • Normal: Fixed over the next week. • Background

Slide 35

Slide 35 text

Critical: Wakes someone up. Actionable. • Normal: Fixed over the next week. • Background: Metrics, not errors.
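
One way to keep errors out of "the same place" is to route them by tier at the point where they are reported. The sketch below is an assumption about how that could look, not the speaker's implementation: `AlertRouter` and its in-memory lists stand in for a real pager, ticket queue, and metrics system.

```python
from collections import Counter
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"      # wakes someone up; must be actionable
    NORMAL = "normal"          # fixed over the next week
    BACKGROUND = "background"  # a metric, not an error


class AlertRouter:
    """Hypothetical router; the three sinks are in-memory stand-ins."""

    def __init__(self):
        self.pages = []           # would page the on-call person
        self.tickets = []         # would file work for the coming week
        self.metrics = Counter()  # would increment a counter on a dashboard

    def report(self, severity, name, detail=""):
        if severity is Severity.CRITICAL:
            self.pages.append((name, detail))
        elif severity is Severity.NORMAL:
            self.tickets.append((name, detail))
        elif severity is Severity.BACKGROUND:
            # Background events are only counted, never stored as errors.
            self.metrics[name] += 1
        else:
            # Fail hard on anything that is not one of the three tiers.
            raise ValueError(f"unknown severity: {severity!r}")
```

For example, `router.report(Severity.BACKGROUND, "cache_miss")` only bumps a counter, so a noisy event can never crowd out a page.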

Slide 36

Slide 36 text

Have you been ignoring an error for weeks? Then turn off its error reporting.

Slide 37

Slide 37 text

Principle #3 Find your limits

Slide 38

Slide 38 text

Everything will fail. You should know when.

Slide 39

Slide 39 text

Copyright Boeing

Slide 40

Slide 40 text

What's your Minimum Equipment List? What can you run the system without?

Slide 41

Slide 41 text

REQUIRED: Lavatory ashtrays • Seatbelt signs • Fuel caps. OPTIONAL: Air conditioning • Passenger video screens • Weather radar.
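
A software "Minimum Equipment List" can be written down as an explicit startup check. The sketch below assumes a hypothetical `check_dependencies` helper and made-up dependency names; it refuses to run without a required dependency and reports which optional ones are degraded.

```python
def check_dependencies(status, required, optional):
    """Return the optional dependencies that are down, sorted by name.

    `status` maps a dependency name to True (healthy) or False (down).
    A dependency missing from `status` is treated as down, because an
    unknown state is not a safe state.
    Raises RuntimeError if any required dependency is down.
    """
    down_required = sorted(name for name in required if not status.get(name, False))
    if down_required:
        raise RuntimeError(f"cannot start, required dependencies down: {down_required}")
    return sorted(name for name in optional if not status.get(name, False))


# Hypothetical example: the database is required, cache and search are not.
degraded = check_dependencies(
    status={"database": True, "cache": False},
    required={"database"},
    optional={"cache", "search"},
)
```

Here `degraded` lists `cache` and `search`, and the service can start in a reduced mode; with the database down, the same call raises instead.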

Slide 42

Slide 42 text

Did you load test? Did you fuzz test?

Slide 43

Slide 43 text

You don't have to perfectly scale. But you do have to know where your limits are.

Slide 44

Slide 44 text

Risk is fine when you're informed! Unknowns are the most dangerous thing.

Slide 45

Slide 45 text

Principle #4 Build for failure

Slide 46

Slide 46 text

No single thing in an aircraft can fail and take it down.

Slide 47

Slide 47 text

We all want this for our code, but the way to do it is to build for failure.

Slide 48

Slide 48 text

Kill your application randomly • Practice server network failures • Develop on unreliable connections
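
Practising failure can start small, for instance by wrapping a network call so that it fails some fraction of the time during development. This is a rough sketch under stated assumptions: `flaky`, `InjectedFailure`, and `call_with_retries` are hypothetical helpers, and the retry loop is only one possible way for calling code to cope.

```python
import random


class InjectedFailure(ConnectionError):
    """A deliberately injected fault, distinguishable from real ones."""


def flaky(func, failure_rate, rng):
    """Wrap func so that each call fails with probability failure_rate."""
    def wrapper(*args, **kwargs):
        # rng.random() returns a float in [0.0, 1.0).
        if rng.random() < failure_rate:
            raise InjectedFailure("injected fault")
        return func(*args, **kwargs)
    return wrapper


def call_with_retries(func, attempts):
    """Call func up to `attempts` times, re-raising the last connection error."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except ConnectionError as exc:
            last_error = exc
    raise last_error


def fetch_profile():
    """Stand-in for a real network call."""
    return {"name": "example"}


# Fail roughly one call in three; a seeded generator keeps runs repeatable.
unreliable_fetch = flaky(fetch_profile, failure_rate=0.3, rng=random.Random(42))
```

Running the test suite against `unreliable_fetch` instead of `fetch_profile` shows quickly which callers assume the network never fails.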

Slide 49

Slide 49 text

The majority of pilot training is handling emergencies.

Slide 50

Slide 50 text

No content

Slide 51

Slide 51 text

Use checklists. Don't rely on memory.

Slide 52

Slide 52 text

If you practice failure, you'll be ready when the inevitable happens.

Slide 53

Slide 53 text

Aviation Accident Causes (2005 Nall report): Pilot 76% • Mechanical 16% • Other 9%

Slide 54

Slide 54 text

Principle #5 Communicate well

Slide 55

Slide 55 text

"You have control" "I have control" "You have control"

Slide 56

Slide 56 text

Complex software means separate teams.

Slide 57

Slide 57 text

As you grow, communication becomes exponentially harder.

Slide 58

Slide 58 text

No content

Slide 59

Slide 59 text

No content

Slide 60

Slide 60 text

No content

Slide 61

Slide 61 text

Clear communication is vital.

Slide 62

Slide 62 text

Write everything down. Written specs = less time in meetings.

Slide 63

Slide 63 text

Have a clear chain of command.

Slide 64

Slide 64 text

Make decisions. They don't have to be perfect, just good enough.

Slide 65

Slide 65 text

Principle #6 No blame culture

Slide 66

Slide 66 text

How do I know all these aviation stats?

Slide 67

Slide 67 text

Every incident is reported and investigated.

Slide 68

Slide 68 text

There is never a single cause of a problem.

Slide 69

Slide 69 text

Make it very difficult to do again. Why did your software let this happen? What's the UX of your admin tools like?

Slide 70

Slide 70 text

No content

Slide 71

Slide 71 text

No content

Slide 72

Slide 72 text

Encourage reporting. Don't blame anyone for a mistake. They're unlikely to make it again.

Slide 73

Slide 73 text

Reward maintenance as well as firefighting. It's easy to look good when you ship broken and are always heroically fixing it.

Slide 74

Slide 74 text

No content

Slide 75

Slide 75 text

In aviation, every rule is written in blood.

Slide 76

Slide 76 text

Software is not yet there. But we are getting closer.

Slide 77

Slide 77 text

Margaret Hamilton: her error detection code saved Apollo 11.

Slide 78

Slide 78 text

Patriot Missile: floating-point bug killed 28.

Slide 79

Slide 79 text

Therac-25: killed 3, severely injured at least 3 more.

Slide 80

Slide 80 text

Uber Autonomous Vehicle: saw a pedestrian and chose to hit her.

Slide 81

Slide 81 text

No content

Slide 82

Slide 82 text

Hard failure • Good alerting • Find your limits • Build for failure • Communicate well • No blame culture

Slide 83

Slide 83 text

Thanks. Andrew Godwin • @andrewgodwin • aeracode.org