What can Engineers learn from Aviation & Space?

Yury Nino

April 23, 2022
Transcript

  1. What is Site Reliability Engineering?

    • "SRE is what happens when you ask a software engineer to design an operations function".
    • Originated at Google in 2003.
    • Framework for operating large-scale systems reliably.
    • Focuses on running systems in production.
    www.yurynino.dev
  2. Site Reliability Engineering Principles

    1. SRE needs Service Level Objectives (SLOs), with consequences.
    2. SREs must have time to make tomorrow better than today.
    3. SRE teams have the ability to regulate their workload.
    4. Failure is an opportunity to improve.
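The first principle, SLOs with consequences, is often put into practice as an error budget. The following is a minimal sketch under stated assumptions, not something taken from the slides: the `SloWindow` shape, the 99.9% target in the usage note, and the "freeze releases when the budget is spent" consequence are all illustrative choices.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts observed over one SLO window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def error_budget_remaining(window: SloWindow, slo_target: float) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if window.total_requests == 0:
        return 1.0  # no traffic in the window, so nothing has been spent
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * window.total_requests
    return 1.0 - window.failed_requests / allowed_failures


def release_policy(remaining: float) -> str:
    """Assumed consequence: freeze feature releases once the budget is gone."""
    return "ship" if remaining > 0.0 else "freeze"
```

For example, with a 99.9% target, 400 failures out of 1,000,000 requests leave roughly 60% of the budget, so the assumed policy still allows shipping; 1,500 failures would overspend it and trigger the freeze.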
  3. US Airways Flight 1549

    Hudson River, New York, January 15, 2009.
    https://en.wikipedia.org/wiki/US_Airways_Flight_1549
  4. What happened?

    • During climbout, the plane struck a flock of Canada geese at an altitude of 2,818 feet.
    • Realizing that both engines had shut down, Sullenberger took control while Skiles worked the checklist for engine restart.
    • Air traffic control cleared Sullenberger to land at Teterboro.
    • Sullenberger initially responded "Yes", but then: "We can't do it ... We're gonna be in the Hudson".
    • All 155 people on board were rescued.
  5. Lessons Learned

    • Good decision-making and teamwork by the cockpit crew.
    • The performance of the flight crew during the evacuation.
    • The A320 is certified for extended overwater operation.
    • Improved pilot training for water landings.
    • The report made 34 recommendations.
  6. Radio Communication

    • KLM: KLM 4805 is now ready for take-off and we are waiting for our ATC clearance.
    • TOWER: KLM 4805 you are cleared to the Papa beacon, climb to and maintain [...]
    • KLM: Ah roger sir, [...] We are now at take-off.
  7. Cockpit conversation on KLM plane in Dutch:

    PAN-AM (RADIO): OK, will report when we're clear.
    • Flight Eng: Is he not clear then?
    • Captain: What do you say?
    • Flight Eng: Is he not clear, that Pan American?
    • Captain: Oh yes
    583 Dead - 61 Injured
  8. Lessons Learned

    • Communications between teams were unclear.
    • A junior team member noticed the mistake, but was ignored.
    • If you do not have observability, you should not make decisions.
    • Cockpit procedures were also changed after the accident.
    • The course of action was later expanded into what is known today as crew resource management (CRM), which states that all pilots, no matter how experienced they are, are allowed to contradict each other.
  9. What happened?

    • Preliminary investigations revealed serious flight control problems in a sensor and other instruments, tied to a design flaw involving the Maneuvering Characteristics Augmentation System (MCAS) of the MAX series.
    • The replacement sensor that was installed had been mis-calibrated during an earlier repair.
    • MCAS was designed to rely on a single AOA sensor, making it vulnerable to erroneous input from that sensor.
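The single-sensor point has a direct software analogy: automation that acts on one unverified input inherits every fault of that input. The sketch below is illustrative only and does not describe how MCAS works or how it was later changed; the function name `cross_checked_reading` and the `max_spread` tolerance are hypothetical. It shows one common pattern, refusing to act automatically when redundant sources disagree.

```python
from statistics import median
from typing import Optional, Sequence


def cross_checked_reading(
    readings: Sequence[float], max_spread: float
) -> Optional[float]:
    """Return the median of redundant readings, or None if they cannot be trusted.

    None means "do not act automatically; escalate to a human operator".
    """
    if len(readings) < 2:
        return None  # a single source cannot be cross-checked
    if max(readings) - min(readings) > max_spread:
        return None  # sources disagree beyond the tolerance
    return median(readings)
```

With two readings of 5.0 and 5.4 and a tolerance of 1.0, the function returns their median; with 5.0 and 22.5 it returns `None`, leaving the decision to the operator rather than to the automation.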
  10. Lessons Learned

    • During the design and certification of the Boeing 737-8 (MAX), assumptions cannot be made about flight crew response to malfunctions.
    • Guidance on MCAS, or more detailed flight manuals, could have helped crews respond properly to uncommanded MCAS activation.
    • Documentation in the aircraft flight and maintenance log is required.
    • As a result, the United States Federal Aviation Administration and Boeing issued warnings and training advisories to all operators of the MAX series to avoid letting the MCAS cause similar tragedies.
  11. What happened? Smoke in the Cockpit

    • CAPTAIN: Which engine is it?
    • FIRST OFFICER: It's the le.. It's the right one!
    • CAPTAIN: Okay, throttle it back.
    • CAPTAIN: Shut it down.
    • PA: We had a problem with the right engine and shut it down.
    47 Dead - 74 Injured
  12. Lessons Learned

    • Always re-verify your working assumptions.
    • Wrong mitigation actions may appear to work due to coincidence or complex behaviors.
    • Personal knowledge should be tested against a changing environment.
  13. What happened?

    • Meanwhile, TCAS gives BTC2937 a command to climb and DHL611 a command to descend.
    • The controller gives BTC2937 a command to descend.
    • The BTC2937 pilots decide that the human is better informed than the computer and follow the human's instructions.
    • Both planes proceed to descend, still on a collision course.
    71 Dead - No survivors
  14. What happened?

    • Zürich Air Traffic Control center under maintenance
      ◦ Ground-based conflict alert offline
      ◦ System slow to respond
      ◦ Phones not working
    • Only one controller on duty, monitoring two different screens.
    • Controller under high workload, and cannot transfer to other center.
    • Controller notices planes in conflict late.
  15. Lessons Learned

    • Failsafe systems may surprise human operators.
    • Explicit protocols to resolve conflicts between different sources must be in place in advance.
    • Hope Is Not A Strategy!
      ◦ Critical functions must be properly staffed.
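One way to read the second lesson in engineering terms is that the tie-break between conflicting sources should be written down before an incident, not decided under pressure. The sketch below is a hypothetical illustration: the source names `automated_alert` and `operator`, and the ordering between them, are assumptions made for the example rather than anything stated on the slides.

```python
from typing import Mapping, Optional

# Assumed precedence, agreed in advance: the lower number wins.
PRECEDENCE = {"automated_alert": 0, "operator": 1}


def resolve(instructions: Mapping[str, str]) -> Optional[str]:
    """Pick the instruction from the highest-precedence source that is present."""
    known = [source for source in instructions if source in PRECEDENCE]
    if not known:
        return None  # no recognized source, so there is nothing to act on
    winner = min(known, key=PRECEDENCE.__getitem__)
    return instructions[winner]
```

Called with `{"operator": "descend", "automated_alert": "climb"}`, the function returns `"climb"` because the table ranks the automated source first. The point of the pattern is that the ordering lives in one reviewed place instead of in each responder's head.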
  16. Resilience Engineering

    • To be able to construct a mental representation of the situation.
    • To be able to assess risk and threats as relevant for the flight.
    • To be able to switch from a situation under control.
    • To be able to maintain a relevant level of confidence.
    • To be able to make a decision in a complex.
  17. Resilience Engineering

    • To be able to make intelligent use of procedures.
    • To be able to use available technical and human resources.
    • To be able to manage time and time pressure.
    • To be able to cooperate with crew members and other staff.
    • To be able to properly use and manage information.