Distributed Failure: Learning Lessons From Aviation

A talk I first gave at Code Europe Warsaw, spring 2018.

Andrew Godwin

April 24, 2018

Transcript

  1. DISTRIBUTED FAILURE
    Learning lessons from aviation
    Andrew Godwin
    @andrewgodwin

  2. Hi, I’m
    Andrew Godwin

  3. Content Warning
    Aviation accidents
    Road accidents
    Discussion of death

  4. Software is difficult.

  5. Distributed is even harder.

  6. Not unique to distributed systems

  7. Who's solved this? Aviation.

  8. A Boeing 747 has six million parts

  10. Deaths per billion hours (UK 1990-2000)
    Walking: 220
    Car: 130
    Airplane: 30.8
    Train: 30

  11. People matter as much as machines

  12. Aviation Accident Causes (2005 Nall report)
    Pilot: 76%
    Mechanical: 16%
    Other: 9%

  13. Let's look at some aviation principles

  14. Principle #1
    Hard Failure

  15. If something is wrong it turns itself off

  16. This only works if you have redundancy

  17. These are great ways to ensure you
    never fix something.

  18. No accident or outage has a single cause.
    Stop your code getting into odd states.

  19. Single points of failure can be good

  20. Principle #2
    Good Alerting

  21. Cockpits are incredibly selective about
    what sets off an audio alarm

  22. Alert fatigue is real. Avoid at all costs.

  23. Never, ever, put all errors in the same place

  24. Critical
    Normal
    Background

  25. Critical: Wakes someone up. Actionable.
    Normal
    Background

  26. Critical: Wakes someone up. Actionable.
    Normal: Fixed over the next week.
    Background

  27. Critical: Wakes someone up. Actionable.
    Normal: Fixed over the next week.
    Background: Metrics, not errors.

  28. Have you been ignoring an error for weeks?
    Then turn off its error reporting.

  29. Principle #3
    Find your limits

  30. Everything will fail. You should know when.

  31. Copyright Boeing

  32. What's your Minimum Equipment List?

  33. REQUIRED OPTIONAL

  34. Did you load test? Did you fuzz test?

  35. You don't have to perfectly scale.

  36. Risk is fine when you're informed!

  37. Principle #4
    Build for failure

  38. No single thing in an aircraft can
    fail and take it down.

  39. We all want this for our code, but
    the way to do it is to build for failure.

  40. Kill your application randomly
    Practice server network failures
    Develop on unreliable connections

  41. The majority of pilot training is
    handling emergencies.

  42. Use checklists. Don't rely on memory.

  43. If you practice failure, you'll be ready
    when the inevitable happens.

  44. Aviation Accident Causes (2005 Nall report)
    Pilot: 76%
    Mechanical: 16%
    Other: 9%

  45. Principle #5
    Communicate well

  46. Distributed software means
    separate teams.

  47. As you grow, communication becomes
    exponentially harder.

  48. Clear communication is vital.

  49. Write everything down.

  50. Have a clear chain of command.

  51. Make decisions.

  52. Principle #6
    No blame culture

  53. How do I know all these aviation stats?

  54. Every incident is reported and investigated.

  55. There is never a single cause of a problem.

  56. Make it very difficult to do again.

  57. Encourage reporting.

  58. Reward maintenance as well as firefighting

  59. In aviation, every rule is written in blood.

  60. Software is not yet there.
    But we are getting closer.

  61. Margaret Hamilton
    Her error detection code saved Apollo 11

  62. Therac-25
    Killed 3, severely injured at least 3 more

  63. Hard failure
    Good alerting
    Find your limits
    Build for failure
    Communicate well
    No blame culture

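The three alert tiers from Principle #2 (slides 24 to 27) could be sketched roughly as below. This is only an illustration, not code from the talk: `Severity`, `AlertRouter`, and the in-memory `pages`, `tickets`, and `metrics` stores are hypothetical names standing in for whatever paging service, issue tracker, and metrics system a team actually uses. The point it tries to show is that each tier goes to its own destination, and that a critical alert is rejected unless it says what to do.

```python
from collections import Counter
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"      # wakes someone up; must be actionable
    NORMAL = "normal"          # fixed over the next week
    BACKGROUND = "background"  # counted as a metric, not reported as an error


class AlertRouter:
    """Sends each event to a separate destination based on its severity."""

    def __init__(self):
        self.pages = []           # hypothetical stand-in for a paging service
        self.tickets = []         # hypothetical stand-in for an issue tracker
        self.metrics = Counter()  # hypothetical stand-in for a metrics store

    def report(self, name, severity, action=None):
        if severity is Severity.CRITICAL:
            # A page with no action attached only trains people to ignore it.
            if not action:
                raise ValueError("critical alerts must say what to do")
            self.pages.append((name, action))
        elif severity is Severity.NORMAL:
            self.tickets.append(name)
        else:
            # Background events are only counted, never raised as errors.
            self.metrics[name] += 1


router = AlertRouter()
router.report("primary database unreachable", Severity.CRITICAL,
              action="fail over to the replica")
router.report("nightly export failed", Severity.NORMAL)
router.report("cache miss", Severity.BACKGROUND)
router.report("cache miss", Severity.BACKGROUND)

print(len(router.pages), len(router.tickets), router.metrics["cache miss"])
```

Keeping the three destinations separate is one way to follow slide 23: nothing in the background tier can ever page anyone, so noise there cannot cause alert fatigue.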