Building for Failure: Learning lessons from aviation

Andrew Godwin @andrewgodwin

Hi, I'm Andrew Godwin
Django core developer
Senior Software Engineer at [company logo]
Apparently now does software architecture

flickr.com/photos/russss/16735398019/

Commercial flying is very safe
General aviation is still not bad
Fatal accidents per million hours:
  Airlines        0.2
  Cars/trucks     0.53
  Light aircraft  11.2
  Motorcycles     15.6
Source: 2005 Nall report, 2004 NHTSA stats, 1991-2000 FAA stats, 40mph avg. road speed

GA (general aviation) accident causes
  Pilot       76%
  Mechanical  16%
  Other       9%
Source: 2005 Nall report

Every accident is analysed
To see how we can prevent it happening again

AF447 UA232 AC143

Equipment is built for failure
Not if, but when.

Training is focused around failure
When the time comes, you're prepared.

Software is... not as reliable
How often do you see bugs? Crashes?

Therac-25
Killed 3. Injured more.

How do we improve?
How do we make more reliable systems?

GA (general aviation) accident causes
  Pilot       76%
  Mechanical  16%
  Other       9%
Source: 2005 Nall report

Software issue causes
Human, Automation, Unavoidable

Bad Patterns
Soft Failure, Noisy Warnings, Poor Testing, People Reliance

Soft Failure: obscure errors and try to carry on
Hard Failure: quit at the first error and log it
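
To make the contrast concrete, here is a minimal sketch of the same config loader written both ways. The function names and `ConfigError` are hypothetical, invented for this illustration; only `json` and `logging` are real standard-library modules.

```python
import json
import logging

logger = logging.getLogger("config")


class ConfigError(Exception):
    """Hypothetical error type for this sketch."""


def load_config_soft(text):
    # Soft failure: hide the error and carry on with a guess.
    try:
        return json.loads(text)
    except Exception:
        return {}  # the caller never learns the config was broken


def load_config_hard(text):
    # Hard failure: log the problem and stop at the first error.
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        logger.error("Invalid config: %s", exc)
        raise ConfigError("Could not parse config") from exc
```

With the soft version, a broken config silently becomes an empty one and the real failure surfaces somewhere else, much later. The hard version stops where the problem actually is.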

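Exceptions
Raise clear, verbose exceptions. Capture and log with e.g. Sentry.

    import requests
    from requests.exceptions import RequestException

    class APIFetchError(Exception):
        """Raised when the user list cannot be fetched."""

    try:
        requests.get("http://api.company.com/users/")
    except RequestException as exc:
        # Chain the original error so the traceback keeps the real cause
        raise APIFetchError("Could not get user list") from exc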

Noisy Warnings: engineers ignore logs/notifications
Precise Warnings: alert on actionable things, then fix them

Actionable Warnings
Don't warn about things you will ignore.
Email administrators when it needs attention and can be fixed.
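
One possible way to enforce this in Python is a logging filter that only lets through records explicitly marked as actionable. This is a sketch, not a prescribed setup: `ActionableFilter`, `ListHandler` and the `actionable` flag are names made up here, and `ListHandler` stands in for a real alerting handler such as the standard library's `logging.handlers.SMTPHandler`.

```python
import logging


class ActionableFilter(logging.Filter):
    """Pass only records explicitly marked as actionable."""

    def filter(self, record):
        return getattr(record, "actionable", False)


class ListHandler(logging.Handler):
    """Stand-in for an email handler; collects messages in a list."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


alerts = ListHandler()
alerts.addFilter(ActionableFilter())

logger = logging.getLogger("billing")
logger.addHandler(alerts)
logger.propagate = False  # keep the sketch quiet on stderr

logger.warning("Cache miss rate is a bit high")  # nobody will act on this
logger.error("Payment API key expired", extra={"actionable": True})
```

Only the second message reaches the alert handler. Whoever writes the log call has to decide up front whether someone can act on it.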

Poor Testing: small changes can cause regressions
Good Testing: tests that are complete and not fragile

100% Coverage Fallacy
You can cover code lines with useless tests.
You can have too many tests that are fragile, so you ignore them.

    def test_critical_function(self):
        try:
            call_critical_code()
        except:
            # This always breaks, just cover it
            pass
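
For contrast, a sketch of tests that assert on behaviour instead of just executing lines. `apply_discount` is a hypothetical stand-in for the critical code, not something from the talk.

```python
import unittest


def apply_discount(price, percent):
    """Hypothetical critical function used only for this sketch."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price * (100 - percent) / 100


class ApplyDiscountTests(unittest.TestCase):
    def test_reduces_price(self):
        # Checks the result, so a regression actually fails the test
        self.assertEqual(apply_discount(200, 25), 150)

    def test_rejects_invalid_percent(self):
        # The failure case is an expected, named exception
        with self.assertRaises(ValueError):
            apply_discount(200, 150)
```

Both versions give the same line coverage. Only this one notices when the function starts returning the wrong answer.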

People Reliance: people forget, or go on holiday
Automation & Docs: things are reproducible and reliable

Checklists
The step between manual and automation. Cheap and very effective.
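
A checklist can start life as a text file. As a rough sketch of the next step towards automation, a small script can walk through the steps and refuse to continue until each one is confirmed. `run_checklist` and the deploy steps are hypothetical examples.

```python
def run_checklist(steps, confirm=input):
    """Ask for confirmation of each step; stop at the first one not confirmed."""
    for number, step in enumerate(steps, start=1):
        answer = confirm(f"{number}. {step} [y/N] ")
        if answer.strip().lower() != "y":
            raise RuntimeError(f"Checklist stopped at step {number}: {step}")
    return True


# Hypothetical deploy checklist; the steps are placeholders.
DEPLOY_CHECKLIST = [
    "Tests pass on the release branch",
    "Database backup taken",
    "Rollback plan written down",
]
```

Calling `run_checklist(DEPLOY_CHECKLIST)` prompts interactively. Each step that later gets automated can be replaced by a check the script performs itself.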

Finding the Limits
How will you know when things break?

Image: © Boeing 2010

Load Testing
Make sure it's realistic. Replay is best.
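
Replay means re-sending recorded traffic with its original timing. A minimal sketch follows, assuming a simplified log format of `<seconds> <METHOD> <path>` per line; real access logs need a real parser. `parse_log`, `replay` and the `send` callback are hypothetical names.

```python
import time


def parse_log(lines):
    """Parse lines of the assumed form '<seconds> <METHOD> <path>'."""
    entries = []
    for line in lines:
        timestamp, method, path = line.split()
        entries.append((float(timestamp), method, path))
    return entries


def replay(entries, send, speedup=1.0, sleep=time.sleep):
    """Re-send recorded requests, keeping the original gaps between them."""
    previous = None
    for timestamp, method, path in entries:
        if previous is not None:
            # Wait the recorded gap, shortened by the speedup factor
            sleep((timestamp - previous) / speedup)
        send(method, path)
        previous = timestamp
```

`send` is whatever issues the request against the system under test. Raising `speedup` above 1 turns a recording of normal traffic into a heavier load with the same shape.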

"Chaos Monkey" Turn off a server during quiet periods and
see what happens
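
A toy version of the idea, not Netflix's actual tool: pick one server at random and stop it, but only inside a quiet window. The 02:00 to 05:00 window and the `stop_server` hook are assumptions for this sketch.

```python
import random
from datetime import time


def is_quiet_period(now, start=time(2, 0), end=time(5, 0)):
    """True if `now` (a datetime.time) falls in the assumed quiet window."""
    return start <= now < end


def chaos_monkey(servers, now, stop_server, rng=random):
    """Stop one randomly chosen server, but only during the quiet period."""
    if not servers or not is_quiet_period(now):
        return None
    victim = rng.choice(servers)
    stop_server(victim)  # hypothetical hook into your infrastructure
    return victim
```

The useful part is what happens afterwards: whether anything alerted, whether traffic moved elsewhere, and whether anyone knew how to bring the server back.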

Restore from backups
Try using them to populate a staging environment.
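
As a small-scale sketch of a restore drill, here is the idea using SQLite, whose `Connection.backup` method copies a whole database. An in-memory database stands in for the staging environment; `restore_to_staging` and `check_restore` are hypothetical helpers, and in practice `backup` would be a connection opened on the real backup file.

```python
import sqlite3


def restore_to_staging(backup):
    """Copy a backup database (an open connection) into a fresh staging one."""
    staging = sqlite3.connect(":memory:")
    backup.backup(staging)  # copies every page of the source database
    return staging


def check_restore(staging, table, minimum_rows):
    """Fail loudly if the restored table is missing or suspiciously empty."""
    # `table` must be a trusted name; it is interpolated into the SQL
    (count,) = staging.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    if count < minimum_rows:
        raise RuntimeError(f"{table} has {count} rows, expected >= {minimum_rows}")
    return count
```

A backup that has never been restored is only a guess. Running a check like this on a schedule turns it into something you have evidence for.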

The "Red Team"
Employees tasked specifically with breaking things

You can't predict everything.
You need to work out how to respond to problems.

Redundancy or Acceptable Loss
Redundancy: what do you fall back to?
Acceptable Loss: quantify the loss, and recovery.
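
One way to express "what do you fall back to?" in code is to serve the last known value when the primary source fails, and to say so explicitly. This is a sketch under assumptions: `fetch_with_fallback` is a hypothetical helper, and a plain dict stands in for a real cache.

```python
import logging

logger = logging.getLogger("fallback")


def fetch_with_fallback(fetch_primary, cache, key):
    """Try the primary source; fall back to the last known value.

    Returns (value, is_stale) so callers can quantify what was lost.
    """
    try:
        value = fetch_primary(key)
    except Exception as exc:
        if key not in cache:
            raise  # no fallback available: fail hard
        logger.warning("Primary failed for %r, serving stale value: %s", key, exc)
        return cache[key], True
    cache[key] = value
    return value, False
```

The `is_stale` flag is the "quantify the loss" half: callers know they are looking at old data and can decide whether that is acceptable.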

No single cause
Nearly all problems are cascading or multiple failures.

Clear command chains
Who makes decisions? Who does the fixing?

No blame culture
It's not someone's mistake, it's that your system let them do it.

Communication is vital
If you don't talk normally, how will you cope with problems?

Leadership can blind
Make sure people understand their responsibilities.

Crew Resource Management
It stops captains flying planes into the ground.

Increase your "bus factor"
People get ill, stressed or leave. You should have redundancy.

Don't reward bad code
It's easy to create bugs and then look busy fixing them.

Where are we?
Most people stumble around issues and focus on building things fast.
There is no need for perfection, but work out what would be worst and prepare for that.

Good engineering is not just code
It is process; interaction; sharing knowledge and burden.
"Rockstars" not talking to each other produce awful code that interacts badly.
Teams must communicate: about expectations, problems, failure and solutions.

Slower can be faster
It might take time to write a specification, but it will save you way more time later.
The cleaner your code is, the more you clean up, the less you have to maintain and the faster you fix and improve things.

My advice to you?
Checklists.
Restore your backups.
Work out roughly what happens for every part of a system failing, and if you care.
Reward people whose code quietly works, not those who firefight and take the glory.
Checklists.

Thanks. Andrew Godwin @andrewgodwin
