Andrew Godwin - What can programmers learn from pilots?

What can Python-based software teams learn from aviation? Why should software always fail hard? What's wrong with too many error logs? And why are ops people already like pilots? Learn all this, and about planes, too.

https://us.pycon.org/2015/schedule/presentation/375/

PyCon 2015

April 18, 2015

Transcript

  1. What can Programmers learn from Pilots? Andrew Godwin @andrewgodwin

  2. Hi, I'm Andrew Godwin. Author of Django 1.7 & South migrations. Senior Software Engineer at […]. Really likes cheese. FAA & EASA PPL, working on IR.
  3. flickr.com/photos/russss/16735398019/

  4. (1) Learning about aviation (2) Applying lessons to coding

  5. Commercial flying is very safe. General aviation is still not bad. Fatal accidents per million hours: Airlines 0.2, GA 11.2, Cars/trucks 0.53, Motorcycles 15.6. Source: 2005 Nall report, 2004 NHTSA stats, 1991-2000 FAA stats, 40mph avg. road speed.
  6. GA ACCIDENT CAUSES: Pilot 76%, Mechanical 16%, Other 9%. Source: 2005 Nall report.
  7. COMMON CAUSES: Controlled flight into terrain (CFIT). Disorientation in clouds (VFR in IMC). Bad decision making (get-there-itis).
  8. WHY DO I KNOW THIS? Detailed investigation of every accident

  9. HOW DOES IT HELP US? Let's look at common problems

  10. Soft Failure: Explicit disengage signals. Covering inaccurate instruments. Replacing parts at first sign of issues.
  11. Soft Failure: Crash hard on any serious error. Redundancy, not single system reliability. Freedom to get rid of servers whenever.
  12. Noisy Warnings: Limited number of warning sounds. Clear, unambiguous text & speech. No constant low-level warnings.
  13. Noisy Warnings: Don't email/notify on every tiny error. Choose 5 top errors, solve them first. If you ignore it for a week, delete the warning.
  14. Poor Testing: Every part tested to destruction. Well known statistical limits. Knowing when, not if, things fail.
  15. Image: © Boeing 2010

  16. Poor Testing: Test latency, memory issues, dodgy network and other unusual things. Interactions are as important as individual units.
  17. Automation Reliance: Tested without autopilot/instruments. Plane usually advises, rarely controls. Easy to see what's happening and why.
  18. flickr.com/photos/wkharmon/4631001766

  19. Automation Reliance: Don't rely on magical automatic failover. Regularly practice manual recovery steps. Know what your systems are doing.
  20. People Reliance: Checklists for everything. Warnings built around common assumptions. Reduce workload at critical times.
  21. None
  22. People Reliance: Checklists for releases/testing/onboarding. Automate common tasks. Reduce workload at critical times.
  23. Bad Priorities: Aviate, Navigate, Communicate. Minimum Equipment Lists. Mayday priority.

  24. Minimum Equipment Quiz: Passenger video screens. Lavatory ashtrays. Air conditioning. Fuel receptacle caps. Seatbelt signs. Weather radar.
  25. Minimum Equipment Quiz: Passenger video screens. Lavatory ashtrays. Air conditioning. Fuel receptacle caps. Seatbelt signs. Weather radar.
  26. Margaret Hamilton

  27. Bad Priorities: What are your critical features? What can you do without? Know what you want to fix first and test most.
  28. Unclear Responsibility: Single person always in command. Others are always listened to. Clear, concise communication.
  29. Unclear Responsibility: Single person makes key decisions. Others are always listened to. Clear specifications and expectations.
  30. Blame Culture: There is never a single cause of an accident. Individual problems identified and addressed. Blaming someone solves nothing.
  31. Blame Culture: There is never a single cause of a problem. Work back and find all of the bad factors. Blaming people makes things worse.
  32. Deadlines: Always carry extra fuel. Always have an alternate. Land safely rather than at the destination.
  33. Deadlines: Don't schedule everyone at maximum. Always expect unknown problems. Ship good code rather than to a deadline.
  34. Takeaways

  35. Checklists: First step before automation

  36. Filter unimportant errors: Keep ignoring it? It's not important.

  37. Pick your key features: Don't worry about breaking minor stuff

  38. Reward good decisions: It's often not the people staying late

  39. Ops are like pilots: Boredom punctuated by moments of terror

  40. Thanks. Andrew Godwin @andrewgodwin eventbrite.com/jobs
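
Slides 13 and 36 suggest picking the top 5 errors and solving those first, rather than notifying on everything. One way this might look in Python is sketched below. It is only an illustration and not code from the talk: the `top_errors` helper and the `sample_log` data are hypothetical, and a real setup would read messages from whatever error tracker or log store the team uses.

```python
from collections import Counter


def top_errors(messages, limit=5):
    """Return the `limit` most frequent error messages with their counts.

    Hypothetical helper: `messages` is any iterable of error strings.
    """
    return Counter(messages).most_common(limit)


# Hypothetical sample data standing in for a real error log.
sample_log = [
    "TimeoutError: upstream API",
    "KeyError: 'user_id'",
    "TimeoutError: upstream API",
    "ConnectionResetError: db",
    "TimeoutError: upstream API",
    "KeyError: 'user_id'",
]

# Print the most frequent errors first, so they get attention first.
for message, count in top_errors(sample_log):
    print(f"{count:>3}  {message}")
```

Ranking by frequency is just one possible definition of "top". A team could equally rank by user impact or by how often a warning gets ignored.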