Distributed Failure: Learning Lessons From Aviation

A talk I first gave at Code Europe Warsaw, spring 2018.

Andrew Godwin

April 24, 2018

Transcript

  1. DISTRIBUTED
    FAILURE
    Andrew Godwin
    @andrewgodwin
    Learning lessons from aviation

  2. Hi, I’m
    Andrew Godwin

  3. Content Warning
    Aviation accidents
    Road accidents
    Discussion of death

  4. Software is difficult.

  5. Distributed is even harder.

  7. Not unique to distributed systems

  9. Who's solved this? Aviation.

  10. A Boeing 747 has six million parts

  12. Deaths per billion hours (UK 1990-2000)
    Walking: 220
    Car: 130
    Airplane: 30.8
    Train: 30

  13. People matter as much as machines

  14. Aviation Accident Causes (2005 Nall report)
    Pilot: 76%
    Mechanical: 16%
    Other: 9%

  15. Let's look at some aviation principles

  16. Principle #1
    Hard Failure

  17. If something is wrong it turns itself off

  18. This only works if you have redundancy
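
One way to picture slides 17 and 18 in software is a replica that shuts itself off when its self-check fails, paired with a caller that moves on to a redundant replica. This is a minimal sketch, not something from the talk: `Replica`, `ReplicaFailed`, and `call_with_redundancy` are hypothetical names, and the `healthy` flag stands in for whatever self-check a real service would run.

```python
class ReplicaFailed(Exception):
    """Raised when a replica is off or has just shut itself off."""


class Replica:
    """Hypothetical service replica that fails hard on a bad self-check."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy  # stand-in for a real self-check (assumption)
        self.running = True

    def handle(self, request):
        if not self.running:
            raise ReplicaFailed(f"{self.name} is off")
        if not self.healthy:
            # Hard failure: stop serving rather than limp along in an odd state.
            self.running = False
            raise ReplicaFailed(f"{self.name} failed its self-check and shut down")
        return f"{self.name} handled {request}"


def call_with_redundancy(replicas, request):
    """Try each replica in turn; raise only if every one of them has failed."""
    errors = []
    for replica in replicas:
        try:
            return replica.handle(request)
        except ReplicaFailed as exc:
            errors.append(str(exc))
    raise RuntimeError("all replicas failed: " + "; ".join(errors))


replicas = [Replica("a", healthy=False), Replica("b")]
result = call_with_redundancy(replicas, "ping")
print(result)
```

Here replica "a" turns itself off and the request is answered by "b". With only one replica in the list, the same hard failure would become an outage, which is the point of slide 18.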

  20. These are great ways to ensure you
    never fix something.

  21. No accident or outage has a single cause.
    Stop your code getting into odd states.

  23. Single points of failure can be good

  25. Principle #2
    Good Alerting

  26. Cockpits are incredibly selective about
    what sets off an audio alarm

  27. Alert fatigue is real. Avoid at all costs.

  28. Never, ever, put all errors in the same place

  32. Critical: Wakes someone up. Actionable.
    Normal: Fixed over the next week.
    Background: Metrics, not errors.
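
The three tiers could be wired up so that each severity lands in a different place, which is also one way to avoid putting all errors in the same sink. The sketch below is illustrative only: `Severity`, `AlertRouter`, and the in-memory lists are hypothetical stand-ins for a real pager, ticket tracker, and metrics system, and the rule that a critical alert must carry an action is my reading of "Actionable".

```python
from collections import Counter
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"      # wakes someone up; must be actionable
    NORMAL = "normal"          # fixed over the next week
    BACKGROUND = "background"  # recorded as a metric, not an error


class AlertRouter:
    """Hypothetical router that keeps each severity in its own destination."""

    def __init__(self):
        self.pages = []           # stand-in for a paging system
        self.tickets = []         # stand-in for a ticket queue
        self.metrics = Counter()  # stand-in for a metrics backend

    def report(self, severity, name, action=None):
        if severity is Severity.CRITICAL:
            # Refuse to page anyone without telling them what to do.
            if not action:
                raise ValueError("a critical alert must say what to do")
            self.pages.append((name, action))
        elif severity is Severity.NORMAL:
            self.tickets.append(name)
        else:
            self.metrics[name] += 1


router = AlertRouter()
router.report(Severity.CRITICAL, "primary-db-down", action="fail over to replica")
router.report(Severity.NORMAL, "cert-expires-in-14-days")
router.report(Severity.BACKGROUND, "cache-miss")
router.report(Severity.BACKGROUND, "cache-miss")
print(len(router.pages), len(router.tickets), router.metrics["cache-miss"])
```

Only the first call would wake anyone up. The repeated cache misses never appear as errors at all, only as a count.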

  33. Have you been ignoring an error for weeks?
    Then turn off its error reporting.

  34. Principle #3
    Find your limits

  35. Everything will fail. You should know when.

  36. Copyright Boeing

  37. What's your Minimum Equipment List?

  38. REQUIRED OPTIONAL
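
A software equivalent of a Minimum Equipment List might be an explicit split of dependencies into required and optional, checked before the service accepts traffic. The following is a rough sketch under assumptions: the dependency names and the `can_dispatch` helper are invented for illustration and do not come from the talk.

```python
# Hypothetical dependency lists; a real service would define its own.
REQUIRED = {"database", "auth"}
OPTIONAL = {"search", "recommendations"}


def can_dispatch(available):
    """Decide whether the service may start with the given dependencies.

    Returns (ok, missing_required, degraded): ok is False if anything
    required is missing; degraded lists optional features to switch off.
    """
    available = set(available)
    missing_required = sorted(REQUIRED - available)
    degraded = sorted(OPTIONAL - available)
    return (not missing_required, missing_required, degraded)


ok, missing, degraded = can_dispatch({"database", "auth", "search"})
print(ok, missing, degraded)
```

In this run the service may start, with recommendations switched off. Writing the two lists down ahead of time means the decision to run degraded is made calmly rather than during an incident.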

  39. Did you load test? Did you fuzz test?

  40. You don't have to perfectly scale.

  41. Risk is fine when you're informed!

  42. Principle #4
    Build for failure

  43. No single thing in an aircraft can
    fail and take it down.

  44. We all want this for our code, but
    the way to do it is to build for failure.

  45. Kill your application randomly
    Practice server network failures
    Develop on unreliable connections
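
Practising failure can start very small, for example by wrapping a network call so that it fails at random during development and confirming that callers cope. This is a sketch only: `flaky`, `with_retries`, `InjectedFailure`, and `fetch_profile` are hypothetical names, and a seeded `random.Random` is used so that a failing run can be reproduced.

```python
import random


class InjectedFailure(Exception):
    """Raised by the wrapper in place of a real network error."""


def flaky(func, failure_rate, rng):
    """Wrap func so each call fails with probability failure_rate."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def with_retries(func, attempts):
    """Call func up to `attempts` times, re-raising the last injected failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    def wrapper(*args, **kwargs):
        last_error = None
        for _ in range(attempts):
            try:
                return func(*args, **kwargs)
            except InjectedFailure as exc:
                last_error = exc
        raise last_error
    return wrapper


def fetch_profile(user_id):
    """Hypothetical stand-in for a call to another service."""
    return {"id": user_id}


rng = random.Random(0)  # seeded so a bad run can be replayed
unreliable_fetch = flaky(fetch_profile, failure_rate=0.3, rng=rng)
resilient_fetch = with_retries(unreliable_fetch, attempts=5)

failures = 0
for user_id in range(100):
    try:
        resilient_fetch(user_id)
    except InjectedFailure:
        failures += 1
print(f"{failures} of 100 calls failed after retries")
```

Raising `failure_rate` or lowering `attempts` shows quickly how much unreliability the calling code tolerates before users would notice.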

  46. The majority of pilot training is
    handling emergencies.

  48. Use checklists. Don't rely on memory.

  49. If you practice failure, you'll be ready
    when the inevitable happens.

  50. Aviation Accident Causes (2005 Nall report)
    Pilot: 76%
    Mechanical: 16%
    Other: 9%

  51. Principle #5
    Communicate well

  52. Distributed software means
    separate teams.

  53. As you grow, communication becomes
    exponentially harder.

  57. Clear communication is vital.

  58. Write everything down.

  59. Have a clear chain of command.

  60. Make decisions.

  61. Principle #6
    No blame culture

  62. How do I know all these aviation stats?

  63. Every incident is reported and investigated.

  64. There is never a single cause of a problem.

  65. Make it very difficult to do again.

  68. Encourage reporting.

  69. Reward maintenance as well as firefighting

  71. In aviation, every rule is written in blood.

  72. Software is not yet there.
    But we are getting closer.

  73. Margaret Hamilton
    Her error detection code saved Apollo 11

  74. Therac-25
    Killed 3, severely injured at least 3 more

  77. Hard failure
    Good alerting
    Find your limits
    Build for failure
    Communicate well
    No blame culture

  78. Thanks.
