You Have Control

My keynote from PyCon Israel 2018.

Andrew Godwin

June 04, 2018

Transcript

  2. Hi, I’m
    Andrew Godwin
    • Django core developer
    • Senior Software Engineer at
    • Private + Instrument pilot

  3. Content Warning

  4. Software is difficult.

  5. By Derek Lowe
    "Things I won't work with"

  6. On Hexanitrohexaazaisowurtzitane
    "...a more stable form of it, by mixing it with TNT.
    Yes, this is an example of something that becomes less explosive
    as a one-to-one cocrystal with TNT."

  7. On “Sand Won’t Save You This Time”
    "...the operator is confronted with the problem of coping
    with a metal-fluorine fire.
    For dealing with this situation, I have always
    recommended a good pair of running shoes."

  8. Unicode
    Locales
    Time
    Calendars
    Geography
    Money

  9. Network latency
    Hardware unreliability
    Deadlocks
    Bit flips
    Ambiguous specifications
    No documentation

  10. Not unique to software
    We just move faster and hit them at higher speed.

  12. Who's solved this? Aviation.

  13. A Boeing 747 has six million parts

  14. A Boeing 747 has six million parts
    …and a 0.000006% accident rate

  15. Deaths per billion hours (per passenger, UK 1990-2000)
    Walking: 220
    Car: 130
    Airplane: 30.8
    Train: 30

  16. People matter as much as machines

  17. Aviation Accident Causes (2005 Nall report)
    Pilot: 76%
    Mechanical: 16%
    Other: 9%

  18. Let's look at some aviation principles
    and how we can apply them to software.

  19. Principle #1
    Hard Failure

  20. If something is wrong it turns itself off
    Autopilots, engines, air conditioning, and more

  21. This only works if you have redundancy
    All of these systems have a backup that lets you land.

  22. "We'll ignore errors so the site doesn't crash!"
    "Save the invalid data and we'll fix it later"

  23. These are great ways to ensure you
    never fix something.

  24. No accident or outage has a single cause.
    Stop your code getting into odd states.

  25. Fail hard if anything unexpected happens
    Validate all your data strictly in and out
    Deploy changes early and often

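As an illustration of failing hard and validating strictly, here is a minimal Python sketch. The `Signup` type, the `parse_signup` function, and the field names are hypothetical and are not taken from the talk; the point is only that unexpected input raises immediately instead of being saved for later.

```python
from dataclasses import dataclass


class ValidationError(ValueError):
    """Raised as soon as data does not match what we expect."""


@dataclass(frozen=True)
class Signup:
    # Hypothetical record used only for this illustration.
    email: str
    age: int


def parse_signup(raw: dict) -> Signup:
    # Reject missing or unknown fields instead of guessing what the caller meant.
    expected = {"email", "age"}
    if set(raw) != expected:
        got = sorted(map(str, raw))
        raise ValidationError(f"expected fields {sorted(expected)}, got {got}")
    email, age = raw["email"], raw["age"]
    if not isinstance(email, str) or "@" not in email:
        raise ValidationError(f"invalid email: {email!r}")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(age, bool) or not isinstance(age, int) or not 0 <= age <= 150:
        raise ValidationError(f"invalid age: {age!r}")
    return Signup(email=email, age=age)
```

A string such as `"30"` for `age` is rejected rather than silently coerced, so invalid data never reaches storage.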

  26. Single points of failure can be good
    Only one place to look when things go wrong!

  28. Principle #2
    Good Alerting

  29. Cockpits are incredibly selective about
    what sets off an audio alarm

  30. Alert fatigue is real. Avoid at all costs.

  31. Never, ever, put all errors in the same place

  32. Critical
    Normal
    Background

  33. Critical: Wakes someone up. Actionable.
    Normal
    Background

  34. Critical: Wakes someone up. Actionable.
    Normal: Fixed over the next week.
    Background

  35. Critical: Wakes someone up. Actionable.
    Normal: Fixed over the next week.
    Background: Metrics, not errors.

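The three alert tiers could be sketched in Python roughly as follows. The `AlertRouter` class and its in-memory lists are assumptions made for illustration; a real system would presumably forward to a paging service, an issue tracker, and a metrics store instead.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"      # wakes someone up, must be actionable
    NORMAL = "normal"          # fixed over the next week
    BACKGROUND = "background"  # counted as a metric, not reported as an error


class AlertRouter:
    """Hypothetical router that keeps each tier in its own destination."""

    def __init__(self):
        self.pages = []     # stand-in for a paging service
        self.tickets = []   # stand-in for an issue tracker
        self.counters = {}  # stand-in for a metrics system

    def report(self, severity: Severity, name: str, detail: str = "") -> None:
        if severity is Severity.CRITICAL:
            self.pages.append((name, detail))
        elif severity is Severity.NORMAL:
            self.tickets.append((name, detail))
        elif severity is Severity.BACKGROUND:
            self.counters[name] = self.counters.get(name, 0) + 1
        else:
            # Fail hard on anything that is not one of the three tiers.
            raise TypeError(f"unknown severity: {severity!r}")
```

Because each tier has its own destination, background noise can grow without ever reaching the channel that wakes someone up.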

  36. Have you been ignoring an error for weeks?
    Then turn off its error reporting.

  37. Principle #3
    Find your limits

  38. Everything will fail. You should know when.

  39. Copyright Boeing

  40. What's your Minimum Equipment List?
    What can you run the system without?

  41. Lavatory ashtrays
    Air conditioning
    Seatbelt signs
    Passenger video screens
    Fuel caps
    Weather radar
    REQUIRED OPTIONAL

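One possible way to write down a Minimum Equipment List for a software system is sketched below in Python. The component names and the `can_serve` function are hypothetical examples, not something from the talk; the idea is simply to record in advance which failures stop the service and which only degrade it.

```python
REQUIRED = "required"
OPTIONAL = "optional"

# Hypothetical component list for a web application.
EQUIPMENT = {
    "primary_database": REQUIRED,
    "session_store": REQUIRED,
    "search_index": OPTIONAL,
    "recommendations": OPTIONAL,
}


def can_serve(failed: set) -> tuple:
    """Return (ok, grounded_by, degraded) for a set of failed component names."""
    unknown = failed - set(EQUIPMENT)
    if unknown:
        # A component nobody classified is an unknown, so refuse to guess.
        raise KeyError(f"unclassified components: {sorted(unknown)}")
    grounded_by = sorted(n for n in failed if EQUIPMENT[n] == REQUIRED)
    degraded = sorted(n for n in failed if EQUIPMENT[n] == OPTIONAL)
    return (not grounded_by, grounded_by, degraded)
```

With this table, `can_serve({"search_index"})` reports that the system can still run in a degraded state, while losing `primary_database` grounds it.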

  42. Did you load test? Did you fuzz test?

  43. You don't have to perfectly scale.
    But you do have to know where your limits are.

  44. Risk is fine when you're informed!
    Unknowns are the most dangerous thing.

  45. Principle #4
    Build for failure

  46. No single thing in an aircraft can
    fail and take it down.

  47. We all want this for our code, but
    the way to do it is to build for failure.

  48. Kill your application randomly
    Practice server network failures
    Develop on unreliable connections

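Practicing failure can start very small. The Python sketch below injects connection failures into a dependency so that the calling code's error handling gets exercised during development. `FlakyService` and `call_with_retries` are hypothetical helpers written for this illustration, not part of any library.

```python
import random


class FlakyService:
    """Wraps a callable and makes a fraction of calls fail (test helper)."""

    def __init__(self, func, failure_rate: float, rng: random.Random):
        self.func = func
        self.failure_rate = failure_rate
        self.rng = rng

    def __call__(self, *args, **kwargs):
        # rng.random() is in [0.0, 1.0), so 0.0 never fails and 1.0 always fails.
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected failure")
        return self.func(*args, **kwargs)


def call_with_retries(func, attempts: int = 5):
    """Retry on ConnectionError, then fail hard once attempts run out."""
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except ConnectionError as exc:
            last_error = exc
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

Wrapping a dependency as `FlakyService(fetch, 0.2, random.Random())` in a development environment would make roughly one call in five fail, which shows quickly whether the retry and give-up paths behave as intended.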

  49. The majority of pilot training is
    handling emergencies.

  51. Use checklists. Don't rely on memory.

  52. If you practice failure, you'll be ready
    when the inevitable happens.

  53. Aviation Accident Causes (2005 Nall report)
    Pilot: 76%
    Mechanical: 16%
    Other: 9%

  54. Principle #5
    Communicate well

  55. "You have control"
    "I have control"
    "You have control"

  56. Complex software means
    separate teams.

  57. As you grow, communication becomes
    exponentially harder.

  61. Clear communication is vital.

  62. Write everything down.
    Written specs = less time in meetings.

  63. Have a clear chain of command.

  64. Make decisions.
    They don't have to be perfect, just good enough.

  65. Principle #6
    No blame culture

  66. How do I know all these aviation stats?

  67. Every incident is reported and investigated.

  68. There is never a single cause of a problem.

  69. Make it very difficult to do again.
    Why did your software let this happen? What's the UX of your admin tools like?

  72. Encourage reporting.
    Don't blame anyone for a mistake. They're unlikely to make it again.

  73. Reward maintenance as well as firefighting
    It's easy to look good when you ship broken and are always heroically fixing it.

  75. In aviation, every rule is written in blood.

  76. Software is not yet there.
    But we are getting closer.

  77. Margaret Hamilton
    Her error detection code saved Apollo 11

  78. Patriot Missile
    Floating-point bug killed 28

  79. Therac-25
    Killed 3, severely injured
    at least 3 more

  80. Uber Autonomous
    Vehicle
    Saw a pedestrian and chose
    to hit her

  82. Hard failure
    Good alerting
    Find your limits
    Build for failure
    Communicate well
    No blame culture

  83. Thanks.
    Andrew Godwin
    @andrewgodwin aeracode.org
