You Have Control

My keynote from PyCon Israel 2018.

Andrew Godwin

June 04, 2018

Transcript

  1. None
  2. Hi, I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot
  3. Content Warning

  4. Software is difficult.

  5. "Things I Won't Work With" by Derek Lowe

  6. On Hexanitrohexaazaisowurtzitane: "...a more stable form of it, by mixing it with TNT. Yes, this is an example of something that becomes less explosive as a one-to-one cocrystal with TNT."
  7. On “Sand Won’t Save You This Time”: "...the operator is confronted with the problem of coping with a metal-fluorine fire. For dealing with this situation, I have always recommended a good pair of running shoes."
  8. Unicode, Locales, Time, Calendars, Geography, Money

  9. Network latency, hardware unreliability, deadlocks, bit flips, ambiguous specifications, no documentation
  10. Not unique to software. We just move faster and hit them at higher speed.
  11. None
  12. Who's solved this? Aviation.

  13. A Boeing 747 has six million parts

  14. A Boeing 747 has six million parts …and a 0.000006% accident rate
  15. Deaths per billion hours (per passenger, UK 1990-2000)
    Walking: 220
    Car: 130
    Airplane: 30.8
    Train: 30
  16. People matter as much as machines

  17. Aviation Accident Causes (2005 Nall report): Pilot 76%, Mechanical 16%, Other 9%
  18. Let's look at some aviation principles. And how we can apply them to software.
  19. Principle #1 Hard Failure

  20. If something is wrong, it turns itself off. Autopilots, engines, air conditioning, and more.
  21. This only works if you have redundancy. All of these systems have a backup that lets you land.
  22. "We'll ignore errors so the site doesn't crash!" "Save the invalid data and we'll fix it later"
  23. These are great ways to ensure you never fix something.

  24. No accident or outage has a single cause. Stop your code getting into odd states.
  25. Fail hard if anything unexpected happens. Validate all your data strictly in and out. Deploy changes early and often.
  26. Single points of failure can be good. Only one place to look when things go wrong!
  27. None
  28. Principle #2 Good Alerting

  29. Cockpits are incredibly selective about what sets off an audio alarm.
  30. Alert fatigue is real. Avoid at all costs.

  31. Never, ever, put all errors in the same place

  32. Critical, Normal, Background

  33. Critical: Wakes someone up. Actionable.

  34. Critical: Wakes someone up. Actionable.
    Normal: Fixed over the next week.
  35. Critical: Wakes someone up. Actionable.
    Normal: Fixed over the next week.
    Background: Metrics, not errors.
  36. Have you been ignoring an error for weeks? Then turn off its error reporting.
  37. Principle #3 Find your limits

  38. Everything will fail. You should know when.

  39. Copyright Boeing

  40. What's your Minimum Equipment List? What can you run the system without?
  41. REQUIRED or OPTIONAL? Lavatory ashtrays, air conditioning, seatbelt signs, passenger video screens, fuel caps, weather radar
  42. Did you load test? Did you fuzz test?

  43. You don't have to perfectly scale. But you do have to know where your limits are.
  44. Risk is fine when you're informed! Unknowns are the most dangerous thing.
  45. Principle #4 Build for failure

  46. No single thing in an aircraft can fail and take it down.
  47. We all want this for our code, but the way to do it is to build for failure.
  48. Kill your application randomly. Practice server network failures. Develop on unreliable connections.
  49. The majority of pilot training is handling emergencies.

  50. None
  51. Use checklists. Don't rely on memory.

  52. If you practice failure, you'll be ready when the inevitable happens.
  53. Aviation Accident Causes (2005 Nall report): Pilot 76%, Mechanical 16%, Other 9%
  54. Principle #5 Communicate well

  55. "You have control" "I have control" "You have control"

  56. Complex software means separate teams.

  57. As you grow, communication becomes exponentially harder.

  58. None
  59. None
  60. None
  61. Clear communication is vital.

  62. Write everything down. Written specs = less time in meetings.

  63. Have a clear chain of command.

  64. Make decisions. They don't have to be perfect, just good enough.
  65. Principle #6 No blame culture

  66. How do I know all these aviation stats?

  67. Every incident is reported and investigated.

  68. There is never a single cause of a problem.

  69. Make it very difficult to do again. Why did your software let this happen? What's the UX of your admin tools like?
  70. None
  71. None
  72. Encourage reporting. Don't blame anyone for a mistake. They're unlikely to make it again.
  73. Reward maintenance as well as firefighting. It's easy to look good when you ship broken and are always heroically fixing it.
  74. None
  75. In aviation, every rule is written in blood.

  76. Software is not yet there. But we are getting closer.

  77. Margaret Hamilton: Her error detection code saved Apollo 11

  78. Patriot Missile: Floating-point bug killed 28

  79. Therac-25: Killed 3, severely injured at least 3 more

  80. Uber Autonomous Vehicle: Saw a pedestrian and chose to hit her
  81. None
  82. Hard failure. Good alerting. Find your limits. Build for failure. Communicate well. No blame culture.
  83. Thanks. Andrew Godwin @andrewgodwin aeracode.org
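
The "Hard failure" advice on slides 22 to 25 (fail hard on anything unexpected, validate data strictly in and out) could look roughly like the sketch below. It is an illustration only and not code from the talk: the `Order` record, `parse_order`, `ValidationError`, and the `ALLOWED_CURRENCIES` set are all hypothetical names invented for the example. The point is that an unexpected input raises immediately instead of being saved "to fix later".

```python
from dataclasses import dataclass


class ValidationError(ValueError):
    """Raised when a record does not match the expected shape (hypothetical)."""


@dataclass(frozen=True)
class Order:
    # Hypothetical record type, used only for this illustration.
    order_id: int
    quantity: int
    currency: str


# Assumed whitelist; a real system would define its own.
ALLOWED_CURRENCIES = {"USD", "EUR", "ILS"}
EXPECTED_FIELDS = {"order_id", "quantity", "currency"}


def parse_order(raw: dict) -> Order:
    """Validate strictly on the way in; raise instead of guessing."""
    if not isinstance(raw, dict):
        raise ValidationError(f"expected a dict, got {type(raw).__name__}")
    if set(raw) != EXPECTED_FIELDS:
        # Missing or extra fields are both treated as hard failures.
        raise ValidationError(
            f"unexpected or missing fields: {sorted(set(raw) ^ EXPECTED_FIELDS)}"
        )
    for name in ("order_id", "quantity"):
        value = raw[name]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValidationError(f"{name} must be an int, got {type(value).__name__}")
    if raw["quantity"] <= 0:
        raise ValidationError("quantity must be positive")
    currency = raw["currency"]
    if not isinstance(currency, str) or currency not in ALLOWED_CURRENCIES:
        raise ValidationError(f"unknown currency: {currency!r}")
    return Order(raw["order_id"], raw["quantity"], currency)
```

Calling `parse_order({"order_id": 1, "quantity": "2", "currency": "EUR"})` raises `ValidationError` rather than silently coercing the string. Whether a missing or extra field should also be fatal is a design choice; this sketch takes the strictest reading of the slide.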