Building for Failure: Learning lessons from aviation

A talk I gave at PyCon Taiwan 2017.

Andrew Godwin

June 10, 2017

Transcript

  1. Andrew Godwin
    @andrewgodwin

  2. Hi, I'm Andrew Godwin
    Django core developer
    Senior Software Engineer at
    Apparently now does software architecture

  3. flickr.com/photos/russss/16735398019/

  5. Commercial flying is very safe
    General aviation is still not bad
    Fatal accidents per million hours:
    AIRLINES 0.2
    LIGHT AIRCRAFT 11.2
    CARS/TRUCKS 0.53
    MOTORCYCLES 15.6
    Source: 2005 Nall report, 2004 NHTSA stats, 1991-2000 FAA stats, 40mph avg. road speed

  6. GA ACCIDENT CAUSES
    Pilot 76%
    Mechanical 16%
    Other 9%
    Source: 2005 Nall report

  7. Every accident is analysed
    To see how we can prevent it happening again

  8. AF447

  9. AF447
    UA232

  10. AF447
    UA232
    AC143

  11. Equipment is built for failure
    Not if, but when.

  12. Training is focused around failure
    When the time comes, you're prepared

  13. Software is... not as reliable
    How often do you see bugs? Crashes?

  14. Therac-25
    Killed 3
    Injured more

  15. How do we improve?
    How do we make more reliable systems?

  16. GA ACCIDENT CAUSES
    Pilot 76%
    Mechanical 16%
    Other 9%
    Source: 2005 Nall report

  17. SOFTWARE ISSUE CAUSES
    Human
    Automation
    Unavoidable

  18. Bad Patterns
    Soft Failure
    Noisy Warnings
    Poor Testing
    People Reliance

  19. Soft Failure
    Obscure errors and try to carry on
    Hard Failure
    Quit at the first error and log it

  20. Exceptions
    Raise clear, verbose exceptions.
    Capture and log with e.g. Sentry
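        import requests

        class APIFetchError(Exception):
            """The application's own error for a failed API fetch."""

        try:
            requests.get("http://api.company.com/users/")
        except requests.RequestException as exc:
            raise APIFetchError("Could not get user list") from exc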

  21. Noisy Warnings
    Engineers ignore logs/notifications
    Precise Warnings
    Alert on actionable things, then fix them

  22. Actionable Warnings
    Don't warn about things you will ignore
    Email administrators only when something
    needs attention and can be fixed
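
One way to sketch this with Python's standard `logging` module is to put a filter on the handler that notifies administrators, so that only records explicitly marked as actionable get through. The `actionable` attribute, the `ActionableFilter` class and the in-memory `ListHandler` (standing in for an email handler such as `logging.handlers.SMTPHandler`) are assumptions made for illustration, not something shown in the talk.

```python
import logging


class ActionableFilter(logging.Filter):
    """Pass only records flagged as actionable (hypothetical convention)."""

    def filter(self, record):
        return getattr(record, "actionable", False)


class ListHandler(logging.Handler):
    """Stand-in for an email handler: collects messages in a list."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.sent = []

    def emit(self, record):
        self.sent.append(record.getMessage())


def build_alert_logger(name="alerts-demo"):
    """Return a logger plus the handler that would notify administrators."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = ListHandler()
    handler.addFilter(ActionableFilter())
    logger.addHandler(handler)
    return logger, handler


logger, handler = build_alert_logger()
# Noise: nobody can act on this, so it never reaches the administrators.
logger.warning("Cache miss rate slightly elevated")
# Actionable: names the problem and what to do about it.
logger.error("Disk 95% full on db1; rotate logs", extra={"actionable": True})
```

Making "actionable" an explicit opt-in keeps the default quiet: a new warning has to justify itself before it can interrupt anyone.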

  23. Poor Testing
    Small changes can cause regressions
    Good Testing
    Tests that are complete and not fragile

  24. 100% Coverage Fallacy
    You can cover code lines with useless tests
    You can have too many tests that are fragile so you ignore them
        def test_critical_function(self):
            try:
                call_critical_code()
            except:
                # This always breaks, just cover it
                pass
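
By contrast, a test earns its coverage when it asserts on behaviour, including the failure path, instead of swallowing errors. A minimal sketch with `unittest`; `parse_port` is a hypothetical function invented here purely so there is something to test.

```python
import unittest


def parse_port(value):
    """Hypothetical function under test: parse a TCP port number."""
    port = int(value)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


class ParsePortTests(unittest.TestCase):
    def test_valid_port_is_returned_as_int(self):
        self.assertEqual(parse_port("8080"), 8080)

    def test_out_of_range_port_is_rejected(self):
        # The failure path is asserted, not swallowed.
        with self.assertRaises(ValueError):
            parse_port("70000")

    def test_non_numeric_port_is_rejected(self):
        with self.assertRaises(ValueError):
            parse_port("http")


def run_tests():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParsePortTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Each test here fails for exactly one reason, which is what keeps a suite from becoming fragile enough to ignore.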

  25. People Reliance
    People forget, or go on holiday
    Automation & Docs
    Things are reproducible and reliable

  26. Checklists
    The step between manual
    and automation.
    Cheap and very effective.
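
A checklist can live in the repository as plain data with a tiny runner, which makes it easy to replace individual steps with automation later. The sketch below works under that assumption; the step texts and the `run_checklist` helper are hypothetical.

```python
def run_checklist(steps, confirm=input):
    """Walk through steps in order; stop at the first one not confirmed.

    `confirm` receives a prompt and returns the operator's answer, so it
    can be swapped for a scripted callable in tests.
    Returns the list of completed steps.
    """
    done = []
    for number, step in enumerate(steps, start=1):
        answer = confirm(f"[{number}/{len(steps)}] {step} (y/n): ")
        if answer.strip().lower() != "y":
            print(f"Stopped at step {number}: {step}")
            break
        done.append(step)
    return done


# Hypothetical deploy checklist.
DEPLOY_STEPS = [
    "Last night's backup restored successfully on staging",
    "Migrations applied on staging",
    "On-call engineer knows a deploy is starting",
]
```

Calling `run_checklist(DEPLOY_STEPS)` prompts at the terminal; anything other than `y` halts the run at that step.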

  27. Finding the Limits
    How will you know when things break?

  28. Image: © Boeing 2010

  29. Load Testing
    Make sure it's realistic. Replay is best.
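
Replay means driving the system with traffic recorded from production rather than synthetic guesses. A rough sketch: parse entries from a recorded log and hand them to a sender at the recorded pacing. The log format (`<unix-timestamp> <METHOD> <path>`) and the `parse_log` and `replay` helpers are assumptions; a real replay would issue HTTP requests where this sketch calls `send`.

```python
import time


def parse_log(lines):
    """Yield (timestamp, method, path) from '<unix-ts> <METHOD> <path>' lines."""
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        timestamp, method, path = parts
        yield float(timestamp), method, path


def replay(lines, send, speedup=1.0, sleep=time.sleep):
    """Call send(method, path) for each entry, keeping the recorded gaps.

    A `speedup` above 1 compresses time to push the system harder.
    Returns the number of requests sent.
    """
    previous = None
    count = 0
    for timestamp, method, path in parse_log(lines):
        if previous is not None:
            sleep(max(0.0, timestamp - previous) / speedup)
        send(method, path)
        previous = timestamp
        count += 1
    return count
```

Keeping the recorded gaps between requests preserves bursts and lulls, which a fixed request rate would flatten out.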

  30. "Chaos Monkey"
    Turn off a server during quiet periods and see what happens
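
A hedged sketch of the idea: pick one server at random and stop it, but only inside a configured quiet window. The quiet hours, the server names and the `stop_server` callable are all hypothetical; this is not the real Chaos Monkey tool, only the same principle in miniature.

```python
import random
from datetime import datetime, time as dtime


def in_quiet_period(now, start=dtime(2, 0), end=dtime(5, 0)):
    """True if `now` falls in the (assumed) low-traffic window."""
    return start <= now.time() < end


def unleash(servers, stop_server, now=None, rng=random):
    """Stop one randomly chosen server during quiet hours.

    Returns the name of the server stopped, or None if nothing was done.
    """
    now = now or datetime.now()
    if not servers or not in_quiet_period(now):
        return None
    victim = rng.choice(list(servers))
    stop_server(victim)
    return victim
```

Returning the victim's name gives the caller something to log, so the experiment itself leaves a record of what was switched off and when.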

  31. Restore from backups
    Try using them to populate a staging environment
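
A restore drill can be very small. The sketch below uses SQLite only because it ships with Python; the helper names are hypothetical, and a real drill would use your own database's backup and restore tooling. It takes a backup, loads it into a fresh "staging" database and checks that the row counts match.

```python
import os
import sqlite3
import tempfile


def back_up(source, backup_path):
    """Write a full copy of `source` to a backup file."""
    target = sqlite3.connect(backup_path)
    try:
        source.backup(target)
    finally:
        target.close()


def restore_to_staging(backup_path):
    """Load the backup file into a fresh in-memory 'staging' database."""
    backup = sqlite3.connect(backup_path)
    staging = sqlite3.connect(":memory:")
    try:
        backup.backup(staging)
    finally:
        backup.close()
    return staging


def row_count(conn, table):
    # Table name is interpolated directly; only pass trusted names here.
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


def restore_drill(production, tables):
    """Back up, restore into staging, and compare row counts per table."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "backup.db")
        back_up(production, path)
        staging = restore_to_staging(path)
    try:
        return {t: row_count(production, t) == row_count(staging, t)
                for t in tables}
    finally:
        staging.close()
```

Matching row counts are only a first check; the point of the drill is that the restore path gets exercised before it is needed.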

  32. The "Red Team"
    Employees tasked specifically with breaking things

  34. You can't predict everything.
    You need to work out how to respond to problems.

  35. Redundancy or Acceptable Loss

  36. Redundancy
    What do you fall back to?
    Acceptable Loss
    Quantify the loss, and recovery.
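
In code, the redundancy question often comes down to an explicit fallback that also reports when it was used, so the loss can be quantified. A minimal sketch; `fetch_with_fallback` and its `degraded` flag are hypothetical names, not from the talk.

```python
def fetch_with_fallback(primary, fallback, log=print):
    """Try the primary source; on failure, record it and use the fallback.

    Returns (value, degraded) so callers can tell they are serving the
    fallback and count how often that happens.
    """
    try:
        return primary(), False
    except Exception as exc:
        # Log the failure loudly before degrading; never hide it.
        log(f"primary failed ({exc!r}); serving fallback")
        return fallback(), True
```

If the fallback itself raises, the error propagates, which is the point at which the acceptable-loss decision has to be made by a person.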

  37. No single cause
    Nearly all problems are cascading or multiple failures

  38. Clear command chains
    Who makes decisions? Who does the fixing?

  39. No blame culture
    It's not someone's mistake, it's that your system let them do it

  41. Communication is vital
    If you don't talk normally, how will you cope with problems?

  42. Leadership can blind
    Make sure people understand their responsibilities

  43. Crew Resource Management
    It stops captains flying planes into the ground.

  44. Increase your "bus factor"
    People get ill, stressed or leave. You should have redundancy.

  45. Don't reward bad code
    It's easy to create bugs and then look busy fixing them

  47. Where are we?
    Most people stumble around issues and
    focus on building things fast.
    There is no need for perfection - but work
    out what would be worst and prepare for that.

  48. Good engineering is not just code
    It is process; interaction; sharing knowledge and burden.
    "Rockstars" not talking to each other
    produce awful code that interacts badly.
    Teams must communicate - about expectations,
    problems, failure and solutions.

  49. Slower can be faster
    It might take time to write a specification,
    but it will save you way more time later.
    The cleaner your code is, the more you clean
    up, the less you have to maintain and the faster
    you fix and improve things.

  50. My advice to you?
    Checklists.
    Restore your backups.
    Work out roughly what happens for every
    part of a system failing, and whether you care.
    Reward people whose code quietly works,
    not those who firefight and take the glory.

  51. My advice to you?
    Checklists.
    Restore your backups.
    Work out roughly what happens for every
    part of a system failing, and whether you care.
    Reward people whose code quietly works,
    not those who firefight and take the glory.
    Checklists.

  52. Thanks.
    Andrew Godwin
    @andrewgodwin
