Slide 1

Slide 1 text

DISTRIBUTED FAILURE: Learning lessons from aviation
Andrew Godwin (@andrewgodwin)

Slide 2

Slide 2 text

Hi, I’m Andrew Godwin

Slide 3

Slide 3 text

Content Warning: aviation accidents, road accidents, discussion of death

Slide 4

Slide 4 text

Software is difficult.

Slide 5

Slide 5 text

Distributed is even harder.

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

Not unique to distributed systems

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

Who's solved this? Aviation.

Slide 10

Slide 10 text

A Boeing 747 has six million parts

Slide 12

Slide 12 text

Deaths per billion hours (UK 1990-2000)
Walking: 220
Car: 130
Airplane: 30.8
Train: 30

Slide 13

Slide 13 text

People matter as much as machines

Slide 14

Slide 14 text

Aviation Accident Causes (2005 Nall report)
Pilot: 76%
Mechanical: 16%
Other: 9%

Slide 15

Slide 15 text

Let's look at some aviation principles

Slide 16

Slide 16 text

Principle #1 Hard Failure

Slide 17

Slide 17 text

If something is wrong it turns itself off

Slide 18

Slide 18 text

This only works if you have redundancy
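
One way to picture hard failure backed by redundancy is the minimal Python sketch below. It is an illustration only, not something from the talk: the `Replica` class, the `ReplicaFailed` exception, and `call_with_redundancy` are hypothetical names. A replica that hits any unexpected error marks itself offline instead of continuing in an unknown state, and the caller moves on to the next replica.

```python
class ReplicaFailed(Exception):
    """Raised when a replica detects a problem and takes itself offline."""


class Replica:
    """Hypothetical service replica that fails hard on any internal error."""

    def __init__(self, name, handler):
        self.name = name
        self.handler = handler
        self.healthy = True

    def call(self, request):
        if not self.healthy:
            raise ReplicaFailed(f"{self.name} is offline")
        try:
            return self.handler(request)
        except Exception as exc:
            # Hard failure: switch off rather than limp along in an odd state.
            self.healthy = False
            raise ReplicaFailed(f"{self.name} shut down: {exc}") from exc


def call_with_redundancy(replicas, request):
    """Try each replica in turn; give up only when every one is offline."""
    for replica in replicas:
        try:
            return replica.call(request)
        except ReplicaFailed:
            continue
    raise RuntimeError("all replicas are offline")


def broken(request):
    raise ValueError("corrupt internal state")


def working(request):
    return request.upper()


primary = Replica("primary", broken)
standby = Replica("standby", working)
result = call_with_redundancy([primary, standby], "ping")
```

Without the standby, the same shutdown would be a full outage, which is why the two ideas go together.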

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

These are great ways to ensure you never fix something.

Slide 21

Slide 21 text

No accident or outage has a single cause. Stop your code getting into odd states.

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

Single points of failure can be good

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

Principle #2 Good Alerting

Slide 26

Slide 26 text

Cockpits are incredibly selective about what sets off an audio alarm

Slide 27

Slide 27 text

Alert fatigue is real. Avoid at all costs.

Slide 28

Slide 28 text

Never, ever, put all errors in the same place

Slide 29

Slide 29 text

Critical Normal Background

Slide 30

Slide 30 text

Critical: Wakes someone up. Actionable.
Normal
Background

Slide 31

Slide 31 text

Critical: Wakes someone up. Actionable.
Normal: Fixed over the next week.
Background

Slide 32

Slide 32 text

Critical: Wakes someone up. Actionable.
Normal: Fixed over the next week.
Background: Metrics, not errors.
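
The three tiers could be wired up roughly as in the sketch below. Everything here is an assumption made for illustration: `Severity`, `AlertRouter`, and the in-memory lists stand in for whatever paging, ticketing, and metrics systems a team actually uses. The point it shows is that each tier goes to a different place, and that a critical alert is rejected unless it names an action.

```python
from collections import Counter
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"      # wakes someone up; must be actionable
    NORMAL = "normal"          # fixed over the next week
    BACKGROUND = "background"  # a metric, not an error


class AlertRouter:
    """Hypothetical router that keeps each severity tier in its own place."""

    def __init__(self):
        self.pages = []           # stand-in for a paging service
        self.tickets = []         # stand-in for a work queue
        self.metrics = Counter()  # stand-in for a metrics backend

    def report(self, name, severity, action=None):
        if severity is Severity.CRITICAL:
            if not action:
                raise ValueError("critical alerts need a stated action")
            self.pages.append((name, action))
        elif severity is Severity.NORMAL:
            self.tickets.append(name)
        else:
            self.metrics[name] += 1


router = AlertRouter()
router.report("primary-db-down", Severity.CRITICAL, action="fail over to replica")
router.report("cert-expires-in-20-days", Severity.NORMAL)
router.report("cache-miss", Severity.BACKGROUND)
router.report("cache-miss", Severity.BACKGROUND)
```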

Slide 33

Slide 33 text

Have you been ignoring an error for weeks? Then turn off its error reporting.

Slide 34

Slide 34 text

Principle #3 Find your limits

Slide 35

Slide 35 text

Everything will fail. You should know when.

Slide 36

Slide 36 text

Copyright Boeing

Slide 37

Slide 37 text

What's your Minimum Equipment List?

Slide 38

Slide 38 text

REQUIRED OPTIONAL
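
A software version of a Minimum Equipment List might look like the sketch below. The component names and the `check_mel` function are hypothetical, chosen only to illustrate the required/optional split: the service refuses to run when a required dependency is down, and runs in a degraded mode when only optional ones are missing.

```python
# Hypothetical component lists; a real service would define its own.
REQUIRED = {"database", "auth"}
OPTIONAL = {"search", "recommendations"}


def check_mel(health):
    """Decide whether to serve, given a component name -> bool health map.

    A component missing from the map is treated as down.
    """
    down = {name for name in REQUIRED | OPTIONAL if not health.get(name, False)}
    return {
        "can_serve": not (down & REQUIRED),
        "blocked_by": sorted(down & REQUIRED),
        "degraded": sorted(down & OPTIONAL),
    }


status = check_mel(
    {"database": True, "auth": True, "search": False, "recommendations": True}
)
grounded = check_mel({"database": False, "auth": True})
```

Writing the list down ahead of time means the decision to keep serving during an incident is already made, not improvised.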

Slide 39

Slide 39 text

Did you load test? Did you fuzz test?
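
As a rough idea of what a first fuzz test can look like, the sketch below throws random short strings at a function and records any exception other than the one it is documented to raise. Both `parse_port` and `fuzz` are hypothetical examples written for this illustration; dedicated fuzzing tools go much further than this.

```python
import random
import string


def parse_port(text):
    """Hypothetical function under test: parse a TCP port or raise ValueError."""
    port = int(text)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


def fuzz(func, runs=1000, seed=0):
    """Feed random strings to func; collect anything other than ValueError."""
    rng = random.Random(seed)  # seeded so a failing run can be repeated
    alphabet = string.digits + string.ascii_letters + " -+._"
    unexpected = []
    for _ in range(runs):
        length = rng.randint(0, 8)
        text = "".join(rng.choice(alphabet) for _ in range(length))
        try:
            func(text)
        except ValueError:
            pass  # the documented way to reject bad input
        except Exception as exc:
            unexpected.append((text, exc))
    return unexpected


crashes = fuzz(parse_port)
```

An empty `crashes` list only says that these random inputs found nothing; it marks where the known limits are, not that none exist.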

Slide 40

Slide 40 text

You don't have to perfectly scale.

Slide 41

Slide 41 text

Risk is fine when you're informed!

Slide 42

Slide 42 text

Principle #4 Build for failure

Slide 43

Slide 43 text

No single thing in an aircraft can fail and take it down.

Slide 44

Slide 44 text

We all want this for our code, but the way to do it is to build for failure.

Slide 45

Slide 45 text

Kill your application randomly
Practice server network failures
Develop on unreliable connections
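
Developing against an unreliable connection can be approximated in tests with a small fault-injection wrapper, sketched below. `FlakyConnection`, `call_with_retries`, and `fetch` are hypothetical names for illustration; the wrapper makes a configurable fraction of calls raise `ConnectionError` so that retry and fallback paths are exercised routinely, not only during a real outage.

```python
import random


class FlakyConnection:
    """Wraps a callable and makes a fraction of calls fail, like a bad network."""

    def __init__(self, func, failure_rate, seed=None):
        self.func = func
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)

    def __call__(self, *args, **kwargs):
        # random() is in [0.0, 1.0), so 0.0 never fails and 1.0 always fails.
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected failure")
        return self.func(*args, **kwargs)


def call_with_retries(func, attempts, *args):
    """Call func up to `attempts` times, re-raising the last ConnectionError."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func(*args)
        except ConnectionError as exc:
            last_error = exc
    raise last_error


def fetch(key):
    """Hypothetical remote call."""
    return f"value-for-{key}"


never_down = FlakyConnection(fetch, failure_rate=0.0)
always_down = FlakyConnection(fetch, failure_rate=1.0)
```

In day-to-day development a rate somewhere between the two extremes, for example `failure_rate=0.2`, keeps the failure path in regular use.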

Slide 46

Slide 46 text

The majority of pilot training is handling emergencies.

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

Use checklists. Don't rely on memory.

Slide 49

Slide 49 text

If you practice failure, you'll be ready when the inevitable happens.

Slide 50

Slide 50 text

Aviation Accident Causes (2005 Nall report)
Pilot: 76%
Mechanical: 16%
Other: 9%

Slide 51

Slide 51 text

Principle #5 Communicate well

Slide 52

Slide 52 text

Distributed software means separate teams.

Slide 53

Slide 53 text

As you grow, communication becomes exponentially harder.

Slide 54

Slide 54 text

No content

Slide 55

Slide 55 text

No content

Slide 56

Slide 56 text

No content

Slide 57

Slide 57 text

Clear communication is vital.

Slide 58

Slide 58 text

Write everything down.

Slide 59

Slide 59 text

Have a clear chain of command.

Slide 60

Slide 60 text

Make decisions.

Slide 61

Slide 61 text

Principle #6 No blame culture

Slide 62

Slide 62 text

How do I know all these aviation stats?

Slide 63

Slide 63 text

Every incident is reported and investigated.

Slide 64

Slide 64 text

There is never a single cause of a problem.

Slide 65

Slide 65 text

Make it very difficult to do again.

Slide 66

Slide 66 text

No content

Slide 67

Slide 67 text

No content

Slide 68

Slide 68 text

Encourage reporting.

Slide 69

Slide 69 text

Reward maintenance as well as firefighting

Slide 70

Slide 70 text

No content

Slide 71

Slide 71 text

In aviation, every rule is written in blood.

Slide 72

Slide 72 text

Software is not yet there. But we are getting closer.

Slide 73

Slide 73 text

Margaret Hamilton: her error detection code saved Apollo 11

Slide 74

Slide 74 text

Therac-25: killed 3, severely injured at least 3 more

Slide 75

Slide 75 text

No content

Slide 76

Slide 76 text

No content

Slide 77

Slide 77 text

Hard failure
Good alerting
Find your limits
Build for failure
Communicate well
No blame culture

Slide 78

Slide 78 text

Thanks.