Building for Failure: Learning lessons from aviation

A talk I gave at PyCon Taiwan 2017.

Andrew Godwin

June 10, 2017

Transcript

  1. Andrew Godwin @andrewgodwin

  2. Hi, I'm Andrew Godwin: Django core developer; Senior Software Engineer at [...]; apparently now does software architecture.
  3. flickr.com/photos/russss/16735398019/

  4. None
  5. Commercial flying is very safe

    Fatal accidents per million hours:
    Airlines: 0.2
    Light aircraft: 11.2
    Cars/trucks: 0.53
    Motorcycles: 15.6
    Source: 2005 Nall report, 2004 NHTSA stats, 1991-2000 FAA stats, 40mph avg. road speed
    General aviation is still not bad
  6. GA accident causes

    Pilot: 76%
    Mechanical: 16%
    Other: 9%
    Source: 2005 Nall report
  7. Every accident is analysed, to see how we can prevent it happening
  8. AF447

  9. AF447 UA232

  10. AF447 UA232 AC143

  11. Equipment is built for failure Not if, but when.

  12. Training is focused around failure When the time comes, you're

    prepared
  13. Software is... not as reliable How often do you see

    bugs? Crashes?
  14. Therac-25 Killed 3 Injured more

  15. How do we improve? How do we make more reliable

    systems?
  16. GA accident causes

    Pilot: 76%
    Mechanical: 16%
    Other: 9%
    Source: 2005 Nall report
  17. Software issue causes: Human, Automation, Unavoidable

  18. Bad Patterns: Soft Failure, Noisy Warnings, Poor Testing, People Reliance

  19. Soft Failure: Obscure errors and try to carry on.
    Hard Failure: Quit at the first error and log it.
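
The difference between the two can be sketched roughly as follows. This is a minimal illustration, assuming a hypothetical helper that reads a port number from configuration; the function names and the default value are made up for the example and are not from the talk.

```python
import logging

logger = logging.getLogger("config")


def parse_port_soft(raw):
    """Soft failure: hide the error and carry on with a guess."""
    try:
        return int(raw)
    except ValueError:
        return 8000  # silently wrong; nobody learns the config is broken


def parse_port_hard(raw):
    """Hard failure: log the error and stop at the first problem."""
    try:
        return int(raw)
    except ValueError:
        logger.error("Invalid port in configuration: %r", raw)
        raise
```

With the soft version a typo such as `"80a"` quietly becomes port 8000; with the hard version the same typo is logged and the `ValueError` reaches the caller.
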
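  20. Exceptions: Raise clear, verbose exceptions. Capture and log with e.g. Sentry.

        import requests
        from requests.exceptions import RequestException

        class APIFetchError(Exception):
            pass  # assumed definition; the slide only uses the name

        try:
            requests.get("http://api.company.com/users/")
        except RequestException as exc:
            raise APIFetchError("Could not get user list") from exc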
  21. Noisy Warnings: Engineers ignore logs/notifications.
    Precise Warnings: Alert on actionable things, then fix them.
  22. Actionable Warnings: Don't warn about things you will ignore. Email administrators when something needs attention and can be fixed.
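
One possible way to separate actionable alerts from background noise with the standard `logging` module is sketched below. The `actionable` flag, the `ListHandler` class and the log messages are assumptions made for this example; in a real setup the list-based handler would be replaced by something that emails or pages a person, such as `logging.handlers.SMTPHandler`.

```python
import logging


class ActionableOnly(logging.Filter):
    """Pass only records explicitly marked as actionable."""

    def filter(self, record):
        return getattr(record, "actionable", False)


class ListHandler(logging.Handler):
    """Stand-in for an email or pager handler; stores messages in a list."""

    def __init__(self):
        super().__init__()
        self.sent = []

    def emit(self, record):
        self.sent.append(record.getMessage())


alerts = ListHandler()
alerts.addFilter(ActionableOnly())

logger = logging.getLogger("billing")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the example quiet on the console
logger.addHandler(alerts)

logger.warning("Cache miss rate is 12%")  # noise: nobody will act on this
logger.error("Disk 95% full on db1", extra={"actionable": True})
```

Only the second message reaches `alerts.sent`, because it is the only one a person can do something about.
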
  23. Poor Testing: Small changes can cause regressions.
    Good Testing: Tests that are complete and not fragile.
  24. 100% Coverage Fallacy: You can cover code lines with useless tests. You can have too many tests that are fragile, so you ignore them.

        def test_critical_function(self):
            try:
                call_critical_code()
            except:
                # This always breaks, just cover it
                pass
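
For contrast, a test that checks behaviour rather than merely executing lines might look like the sketch below. The `apply_discount` function is a hypothetical stand-in for "critical code" and is not from the talk.

```python
import unittest


def apply_discount(price, percent):
    """Hypothetical critical function used only for this example."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (100 - percent) / 100, 2)


class ApplyDiscountTests(unittest.TestCase):
    def test_reduces_price(self):
        self.assertEqual(apply_discount(200.0, 25), 150.0)

    def test_rejects_invalid_percent(self):
        # The failure path is asserted, not swallowed
        with self.assertRaises(ValueError):
            apply_discount(200.0, 150)
```

Both tests cover the same lines a swallowed-exception test would, but they fail when the behaviour changes.
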
  25. People Reliance: People forget, or go on holiday.
    Automation & Docs: Things are reproducible and reliable.
  26. Checklists The step between manual and automation. Cheap and very

    effective.
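
A checklist can be made slightly more formal without fully automating it. The sketch below is one way to do that, assuming a hypothetical deploy checklist; the step texts and function name are illustrative only.

```python
def run_checklist(steps, confirm=input):
    """Walk through steps in order; stop at the first one not confirmed.

    Returns the list of completed steps.
    """
    done = []
    for number, step in enumerate(steps, start=1):
        answer = confirm(f"{number}. {step} [y/N] ")
        if answer.strip().lower() != "y":
            print(f"Stopped at step {number}: {step}")
            break
        done.append(step)
    return done


# Hypothetical deploy checklist; the steps are illustrative
DEPLOY_STEPS = [
    "Backups verified in the last 24 hours",
    "Migrations tested on staging",
    "On-call engineer notified",
]
```

Calling `run_checklist(DEPLOY_STEPS)` prompts for each step in turn. Steps that turn out to be mechanical can later be replaced by code, one at a time.
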
  27. Finding the Limits How will you know when things break?

  28. Image: © Boeing 2010

  29. Load Testing Make sure it's realistic. Replay is best.
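
Replaying recorded traffic could be sketched as below. This assumes the requests have already been parsed into `(timestamp, path)` pairs and that the caller supplies a `send` function; both are assumptions for the example, and a real replay would also need to handle methods, headers and bodies.

```python
import time


def replay(entries, send, speedup=1.0, sleep=time.sleep):
    """Replay recorded requests, preserving the gaps between them.

    `entries` is an iterable of (timestamp_seconds, path) pairs sorted by
    time; `send` is a caller-supplied callable that issues one request.
    Returns the number of requests sent.
    """
    previous = None
    count = 0
    for timestamp, path in entries:
        if previous is not None:
            # Wait for the recorded gap, shortened by the speedup factor
            sleep((timestamp - previous) / speedup)
        send(path)
        previous = timestamp
        count += 1
    return count
```

Raising `speedup` above 1.0 compresses the recorded gaps, which is one way to push realistic traffic towards the system's limits.
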

  30. "Chaos Monkey" Turn off a server during quiet periods and

    see what happens
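
A very small version of this idea is sketched below. The quiet window of 02:00 to 05:00, the server names and the `stop_server` callable are all assumptions for illustration; this is not the Netflix tool of the same name.

```python
import random
from datetime import time


def in_quiet_period(now, start=time(2, 0), end=time(5, 0)):
    """True if `now` (a datetime.time) is inside the quiet window."""
    return start <= now < end


def chaos_monkey(servers, now, stop_server, rng=random):
    """Stop one randomly chosen server, but only during the quiet period.

    `stop_server` is a caller-supplied callable (hypothetical here).
    Returns the chosen server name, or None if nothing was stopped.
    """
    if not servers or not in_quiet_period(now):
        return None
    victim = rng.choice(list(servers))
    stop_server(victim)
    return victim
```

The return value makes it easy to record which server was stopped, so that whatever happens next can be traced back to the experiment.
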
  31. Restore from backups Try using them to populate a staging

    environment
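
A restore drill can be sketched with SQLite, which stands in here for whatever database is really in use. The `users` table and its rows are hypothetical; the point is that the backup is restored into a separate "staging" database and then checked.

```python
import os
import sqlite3
import tempfile


def backup_database(source, backup_path):
    """Write a full copy of `source` to a file using SQLite's backup API."""
    target = sqlite3.connect(backup_path)
    try:
        source.backup(target)
    finally:
        target.close()


def restore_to_staging(backup_path):
    """Load the backup file into a fresh in-memory 'staging' database."""
    backup = sqlite3.connect(backup_path)
    staging = sqlite3.connect(":memory:")
    try:
        backup.backup(staging)
    finally:
        backup.close()
    return staging


# Hypothetical production data for the drill
production = sqlite3.connect(":memory:")
production.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
production.executemany("INSERT INTO users (name) VALUES (?)", [("ada",), ("grace",)])
production.commit()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "backup.sqlite3")
    backup_database(production, path)
    staging = restore_to_staging(path)

# The restore only counts if the data is actually there
restored = staging.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

Comparing `restored` against the production row count is a crude check, but it catches the common case of a backup that exists and cannot be used.
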
  32. The "Red Team" Employees tasked specifically with breaking things

  33. None
  34. You can't predict everything. You need to work out how

    to respond to problems.
  35. Redundancy or Acceptable Loss

  36. Redundancy: What do you fall back to?
    Acceptable Loss: Quantify the loss, and recovery.
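
One way to combine a fallback with a quantified loss is to serve the last good value and report how stale it is. The class below is a sketch under that assumption; the class name, the `fetch` callable and the use of staleness in seconds as the measure of loss are all choices made for this example.

```python
import time


class StaleCacheFallback:
    """Serve the last good value when the primary source fails."""

    def __init__(self, fetch, clock=time.monotonic):
        self._fetch = fetch  # caller-supplied callable (hypothetical)
        self._clock = clock
        self._value = None
        self._fetched_at = None

    def get(self):
        """Return (value, staleness_seconds); staleness is 0.0 when fresh."""
        try:
            self._value = self._fetch()
            self._fetched_at = self._clock()
            return self._value, 0.0
        except Exception:
            if self._fetched_at is None:
                raise  # nothing to fall back to: fail hard
            return self._value, self._clock() - self._fetched_at
```

The caller sees the staleness on every call, so it can decide how old is still acceptable and when to give up instead.
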
  37. No single cause: Nearly all problems are cascading or multiple failures.
  38. Clear command chains Who makes decisions? Who does the fixing?

  39. No blame culture It's not someone's mistake, it's that your

    system let them do it
  40. None
  41. Communication is vital If you don't talk normally, how will

    you cope with problems?
  42. Leadership can blind Make sure people understand their responsibilities

  43. Crew Resource Management It stops captains flying planes into the

    ground.
  44. Increase your "bus factor" People get ill, stressed or leave.

    You should have redundancy.
  45. Don't reward bad code It's easy to create bugs and

    then look busy fixing them
  46. None
  47. Where are we? Most people stumble around issues and focus on building things fast. There is no need for perfection, but work out what would be worst and prepare for that.
  48. Good engineering is not just code It is process; interaction;

    sharing knowledge and burden. "Rockstars" not talking to each other produce awful code that interacts badly. Teams must communicate - about expectations, problems, failure and solutions.
  49. Slower can be faster It might take time to write

    a specification, but it will save you way more time later. The cleaner your code is, the more you clean up, the less you have to maintain and the faster you fix and improve things.
  50. My advice to you? Checklists. Restore your backups. Work out roughly what happens when each part of a system fails, and whether you care. Reward people whose code quietly works, not those who firefight and take the glory.
  51. My advice to you? Checklists. Restore your backups. Work out roughly what happens when each part of a system fails, and whether you care. Reward people whose code quietly works, not those who firefight and take the glory. Checklists.
  52. Thanks. Andrew Godwin @andrewgodwin