Distributed Failure: Learning Lessons From Aviation

A talk I first gave at Code Europe Warsaw, spring 2018.

Andrew Godwin

April 24, 2018

Transcript

  1. DISTRIBUTED FAILURE: Learning lessons from aviation. Andrew Godwin (@andrewgodwin)

  2. Hi, I’m Andrew Godwin

  3. Content Warning: Aviation accidents, road accidents, discussion of death

  4. Software is difficult.

  5. Distributed is even harder.

  6. None
  7. Not unique to distributed systems

  8. None
  9. Who's solved this? Aviation.

  10. A Boeing 747 has six million parts

  11. A Boeing 747 has six million parts

  12. Deaths per billion hours (UK 1990-2000): Walking 220, Car 130, Airplane 30.8, Train 30

  13. People matter as much as machines

  14. Aviation Accident Causes (2005 Nall report): Pilot 76%, Mechanical 16%, Other 9%

  15. Let's look at some aviation principles

  16. Principle #1: Hard Failure

  17. If something is wrong it turns itself off

  18. This only works if you have redundancy

  19. None
  20. These are great ways to ensure you never fix something.

  21. No accident or outage has a single cause. Stop your code getting into odd states.

  22. None
  23. Single points of failure can be good

  24. None
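
To make the hard-failure principle concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the names `Replica`, `ReplicaDown`, and `call_with_redundancy` are hypothetical and do not come from the talk. A replica that detects a fault turns itself off instead of limping along, and that is only safe because the caller has other replicas to fall back on.

```python
class ReplicaDown(Exception):
    """Raised when a replica has turned itself off."""


class Replica:
    """Hypothetical service replica that fails hard on a bad state."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.running = True

    def handle(self, request):
        if not self.running:
            raise ReplicaDown(f"{self.name} is off")
        if not self.healthy:
            # Hard failure: stop serving instead of returning doubtful answers.
            self.running = False
            raise ReplicaDown(f"{self.name} detected a fault and shut down")
        return f"{self.name} handled {request}"


def call_with_redundancy(replicas, request):
    """Try each replica in turn; fail loudly only if every one is down."""
    errors = []
    for replica in replicas:
        try:
            return replica.handle(request)
        except ReplicaDown as exc:
            errors.append(str(exc))
    raise RuntimeError("all replicas down: " + "; ".join(errors))
```

With one faulty and one healthy replica, the faulty one switches itself off and the request is still served by the other. With no healthy replica left, the caller gets an explicit error rather than a half-working system.
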
  25. Principle #2: Good Alerting

  26. Cockpits are incredibly selective about what sets off an audio alarm

  27. Alert fatigue is real. Avoid at all costs.

  28. Never, ever, put all errors in the same place

  29. Critical Normal Background

  30. Critical: Wakes someone up. Actionable.

  31. Critical: Wakes someone up. Actionable. Normal: Fixed over the next week.

  32. Critical: Wakes someone up. Actionable. Normal: Fixed over the next week. Background: Metrics, not errors.

  33. Have you been ignoring an error for weeks? Then turn off its error reporting.

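
One possible way to keep the three tiers apart in code is sketched below. This is a hedged illustration, not the speaker's implementation: `Severity` and `AlertRouter` are hypothetical names, and the in-memory lists and counter stand in for a real paging system, ticket queue, and metrics backend.

```python
from collections import Counter
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"      # wakes someone up; must be actionable
    NORMAL = "normal"          # fixed over the next week
    BACKGROUND = "background"  # a metric, not an error


class AlertRouter:
    """Hypothetical router that keeps each tier in its own place."""

    def __init__(self):
        self.pages = []           # stand-in for a paging system
        self.tickets = []         # stand-in for a work queue
        self.metrics = Counter()  # stand-in for a metrics backend

    def report(self, severity, name, detail=""):
        if severity is Severity.CRITICAL:
            self.pages.append((name, detail))
        elif severity is Severity.NORMAL:
            self.tickets.append((name, detail))
        else:
            # Background events are only counted, never alerted on.
            self.metrics[name] += 1
```

The design choice worth noting is that background events never produce an alert at all. They only move a counter, so they cannot add to alert fatigue.
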
  34. Principle #3: Find your limits

  35. Everything will fail. You should know when.

  36. Copyright Boeing

  37. What's your Minimum Equipment List?

  38. REQUIRED OPTIONAL

  39. Did you load test? Did you fuzz test?

  40. You don't have to perfectly scale.

  41. Risk is fine when you're informed!
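
A software analogue of a Minimum Equipment List could look like the sketch below. It assumes each dependency exposes a simple health check. The `Dependency` class and `preflight` function are hypothetical names chosen for illustration. The service may start with optional pieces missing, but not with required ones.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Dependency:
    name: str
    required: bool
    check: Callable[[], bool]  # returns True when the dependency is usable


def preflight(deps):
    """Return (ok_to_start, missing_required, missing_optional)."""
    missing_required, missing_optional = [], []
    for dep in deps:
        try:
            ok = dep.check()
        except Exception:
            ok = False  # a check that crashes counts as a failed check
        if not ok:
            target = missing_required if dep.required else missing_optional
            target.append(dep.name)
    return (not missing_required), missing_required, missing_optional
```

Returning the list of missing optional dependencies, instead of just a boolean, keeps the risk informed: the service starts degraded and the operator can see exactly what it is running without.
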

  42. Principle #4: Build for failure

  43. No single thing in an aircraft can fail and take it down.

  44. We all want this for our code, but the way to do it is to build for failure.

  45. Kill your application randomly. Practice server network failures. Develop on unreliable connections.

  46. The majority of pilot training is handling emergencies.

  47. None
  48. Use checklists. Don't rely on memory.

  49. If you practice failure, you'll be ready when the inevitable happens.

  50. Aviation Accident Causes (2005 Nall report): Pilot 76%, Mechanical 16%, Other 9%

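
Practicing failure can start very small. The sketch below, under the assumption that injected faults are acceptable in a development or test environment, wraps a function so that a chosen fraction of calls fail on purpose, and pairs it with a simple retry helper. The names `flaky`, `InjectedFailure`, and `call_with_retries` are hypothetical and not from the talk.

```python
import random


class InjectedFailure(Exception):
    """Raised by the wrapper to simulate an outage."""


def flaky(func, failure_rate, rng=None):
    """Wrap func so that roughly failure_rate of calls fail on purpose."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def call_with_retries(func, attempts, *args, **kwargs):
    """Call func up to `attempts` times, re-raising the last injected failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func(*args, **kwargs)
        except InjectedFailure as exc:
            last_error = exc
    raise last_error
```

Passing a seeded `random.Random` as `rng` makes a failure drill reproducible, which helps when turning a surprising run into a checklist item.
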
  51. Principle #5: Communicate well

  52. Distributed software means separate teams.

  53. As you grow, communication becomes exponentially harder.

  54. None
  55. None
  56. None
  57. Clear communication is vital.

  58. Write everything down.

  59. Have a clear chain of command.

  60. Make decisions.

  61. Principle #6: No blame culture

  62. How do I know all these aviation stats?

  63. Every incident is reported and investigated.

  64. There is never a single cause of a problem.

  65. Make it very difficult to do again.

  66. None
  67. None
  68. Encourage reporting.

  69. Reward maintenance as well as firefighting

  70. None
  71. In aviation, every rule is written in blood.

  72. Software is not yet there. But we are getting closer.

  73. Margaret Hamilton: Her error detection code saved Apollo 11

  74. Therac-25: Killed 3, severely injured at least 3 more

  75. None
  76. None
  77. Hard failure. Good alerting. Find your limits. Build for failure. Communicate well. No blame culture.

  78. Thanks.