Building for Failure: Learning lessons from aviation

A talk I gave at PyCon Taiwan 2017.

Andrew Godwin

June 10, 2017

Transcript

  1. 2.

    Hi, I'm Andrew Godwin
    Django core developer
    Senior Software Engineer at
    Apparently now does software architecture
  3. 5.

    Commercial flying is very safe
    General aviation is still not bad

    Fatal accidents per million hours:

        Airlines        0.2
        Light aircraft  11.2
        Cars/trucks     0.53
        Motorcycles     15.6

    Source: 2005 Nall report, 2004 NHTSA stats, 1991-2000 FAA stats, 40mph avg. road speed
  5. 19.

    Soft Failure: Obscure errors and try to carry on
    Hard Failure: Quit at the first error and log it
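A minimal sketch of the two styles, assuming a hypothetical `parse_port` helper that does not appear in the talk. The soft version hides a bad value behind a default; the hard version logs the problem and stops at the first error.

```python
import logging

logger = logging.getLogger("config")


def parse_port_soft(raw):
    """Soft failure: swallow the error and carry on with a guess."""
    try:
        return int(raw)
    except ValueError:
        return 8000  # the bad input is now invisible to the caller


def parse_port_hard(raw):
    """Hard failure: log the problem and stop at the first error."""
    try:
        return int(raw)
    except ValueError:
        logger.error("Invalid port value: %r", raw)
        raise
```

With the soft version, `parse_port_soft("eighty")` quietly returns the default; with the hard version the same input raises `ValueError` after leaving a log line that names the offending value.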
  6. 20.

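    Exceptions
    Raise clear, verbose exceptions. Capture and log with e.g. Sentry

        import requests
        from requests.exceptions import RequestException

        class APIFetchError(Exception):
            """Raised when the user list cannot be fetched."""

        try:
            requests.get("http://api.company.com/users/")
        except RequestException as exc:
            raise APIFetchError("Could not get user list") from exc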
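The same pattern can be exercised without a network. This sketch assumes a hypothetical `fetch` callable in place of `requests.get`, and uses the standard `logging` module as a stand-in for a service such as Sentry. `get_user_list`, `broken_fetch` and the choice of `ConnectionError` are illustrative, not from the talk.

```python
import logging

logger = logging.getLogger("api")


class APIFetchError(Exception):
    """Raised when the user list cannot be retrieved."""


def get_user_list(fetch, url="http://api.company.com/users/"):
    """Call fetch(url), wrapping low-level errors in a clear one."""
    try:
        return fetch(url)
    except ConnectionError as exc:
        # Chain the original error so the traceback keeps the root cause.
        raise APIFetchError(f"Could not get user list from {url}") from exc


def broken_fetch(url):
    # Hypothetical transport that always fails, for demonstration only.
    raise ConnectionError("connection refused")


def main():
    try:
        get_user_list(broken_fetch)
    except APIFetchError:
        # logger.exception records the message plus the full traceback.
        logger.exception("User sync failed")
        return 1
    return 0
```

Chaining with `from exc` keeps the low-level error attached as `__cause__`, so whatever collects the log still sees why the fetch failed.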
  7. 22.

    Actionable Warnings
    Don't warn about things you will ignore
    Email administrators when it needs attention and can be fixed
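One way to approximate this with the standard `logging` module is to forward only warnings that are explicitly marked as actionable. This is a sketch under assumptions: the `actionable` flag, `ActionableFilter` and `ListHandler` are hypothetical names, and the in-memory handler stands in for a real email handler.

```python
import logging


class ActionableFilter(logging.Filter):
    """Pass only records explicitly marked as actionable."""

    def filter(self, record):
        return getattr(record, "actionable", False)


class ListHandler(logging.Handler):
    """Stand-in for an email handler: collects messages in memory."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.sent = []

    def emit(self, record):
        self.sent.append(record.getMessage())


alerts = ListHandler()
alerts.addFilter(ActionableFilter())

logger = logging.getLogger("ops")
logger.addHandler(alerts)
logger.propagate = False  # keep the demo output quiet

# Noise nobody will act on: not forwarded to administrators.
logger.warning("Cache miss rate slightly elevated")
# Needs attention and can be fixed: forwarded.
logger.warning("Disk almost full on db1", extra={"actionable": True})
```

In a real deployment, `logging.handlers.SMTPHandler` could take the place of `ListHandler` so that the filtered records are emailed rather than stored.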
  8. 24.

    100% Coverage Fallacy
    You can cover code lines with useless tests
    You can have too many tests that are fragile so you ignore them

        def test_critical_function(self):
            try:
                call_critical_code()
            except:
                # This always breaks, just cover it
                pass
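For contrast, a sketch of tests that assert on behaviour rather than merely executing lines. `climb_rate` is a hypothetical critical function invented for this illustration; it is not from the talk.

```python
import unittest


def climb_rate(start_ft, end_ft, minutes):
    """Hypothetical critical function: average climb in feet per minute."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return (end_ft - start_ft) / minutes


class ClimbRateTests(unittest.TestCase):
    def test_returns_expected_rate(self):
        # Asserts on the result, so a wrong answer fails the test.
        self.assertEqual(climb_rate(1000, 4000, 6), 500)

    def test_rejects_zero_duration(self):
        # The failure mode is part of the contract and is checked too.
        with self.assertRaises(ValueError):
            climb_rate(1000, 4000, 0)


def run_tests():
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ClimbRateTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Both versions would report the same line coverage for the function under test; only this one fails when the function returns the wrong value.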
  9. 25.

    People Reliance: People forget, or go on holiday
    Automation & Docs: Things are reproducible and reliable
  14. 47.

    Where are we?
    Most people stumble around issues and focus on building things fast. There is no need for perfection - but work out what would be worst and prepare for that.
  15. 48.

    Good engineering is not just code
    It is process; interaction; sharing knowledge and burden. "Rockstars" not talking to each other produce awful code that interacts badly. Teams must communicate - about expectations, problems, failure and solutions.
  16. 49.

    Slower can be faster
    It might take time to write a specification, but it will save you way more time later. The cleaner your code is, the more you clean up, the less you have to maintain and the faster you fix and improve things.
  17. 50.

    My advice to you?
    Checklists. Restore your backups. Work out roughly what happens when each part of a system fails, and whether you care. Reward people whose code quietly works, not those who firefight and take the glory.
  18. 51.

    My advice to you?
    Checklists. Restore your backups. Work out roughly what happens when each part of a system fails, and whether you care. Reward people whose code quietly works, not those who firefight and take the glory. Checklists.
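The checklist advice can be turned into something executable. This is a sketch only: `Item`, `run_checklist`, `restore_matches` and the three example items are all hypothetical, and the backup check is a placeholder for restoring a real backup and comparing it with the source.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Item:
    name: str
    check: Callable[[], bool]  # returns True when the item is satisfied


def run_checklist(items: List[Item]) -> Tuple[List[str], List[str]]:
    """Run every item, even after a failure, and report both lists."""
    passed, failed = [], []
    for item in items:
        try:
            ok = item.check()
        except Exception:
            ok = False  # a crashing check counts as a failed item
        (passed if ok else failed).append(item.name)
    return passed, failed


def restore_matches(original: bytes, restored: bytes) -> bool:
    # Hypothetical stand-in for restoring a backup and comparing it.
    return original == restored


release_checklist = [
    Item("backup restores cleanly", lambda: restore_matches(b"data", b"data")),
    Item("migrations applied", lambda: True),
    Item("on-call rota updated", lambda: False),
]
```

Calling `run_checklist(release_checklist)` returns the names of the passed and failed items. Running every item instead of stopping at the first failure gives the full picture in one pass, which is the point of a checklist.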