Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

What can Engineers learn from Aviation & Space? @yurynino www.yurynino.dev Inspired by Alon Altman SRE EDU

Slide 3

Slide 3 text

What is Site Reliability Engineering?
● "SRE is what happens when you ask a software engineer to design an operations team."
● Originated at Google in 2003.
● Framework for operating large-scale systems reliably.
● Focuses on running systems in production.

Slide 4

Slide 4 text

Site Reliability Engineering Principles
1. SRE needs Service Level Objectives (SLOs), with consequences.
2. SREs must have time to make tomorrow better than today.
3. SRE teams have the ability to regulate their workload.
4. Failure is an opportunity to improve.
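One way to read "SLOs, with consequences" is as an error budget: the SLO fixes how many failures are tolerable, and spending that budget triggers an agreed consequence. The sketch below is only an illustration under assumptions. The function names, the 99.9% target in the usage note, and the "pause releases" policy are hypothetical and do not come from the talk.

```python
def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        return 1.0  # no traffic yet, so nothing has been spent
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


def release_allowed(slo_target: float, total_requests: int,
                    failed_requests: int) -> bool:
    """Hypothetical consequence: pause releases once the budget is spent."""
    return error_budget_remaining(slo_target, total_requests, failed_requests) > 0.0
```

With a 99.9% target and one million requests, roughly 1,000 failures are tolerable; 250 failures leave about three quarters of the budget, while 1,500 failures overspend it and `release_allowed` returns `False`.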

Slide 5

Slide 5 text

What SREs Learn from Glynn Lunney and Spacecraft
https://flyingbarron.medium.com/glynn-lunney-sre-leadership-9b34ed34eee8

Slide 6

Slide 6 text


Slide 7

Slide 7 text


Slide 8

Slide 8 text

US Airways Flight 1549
Hudson River, New York
January 15th, 2009
https://en.wikipedia.org/wiki/US_Airways_Flight_1549

Slide 9

Slide 9 text

What happened?
● During climbout, the plane struck a flock of Canada geese at an altitude of 2,818 feet.
● Realizing that both engines had shut down, Sullenberger took control while Skiles worked the checklist for engine restart.
● After a return to LaGuardia proved impossible, air traffic control cleared the flight to land at Teterboro.
● Sullenberger initially responded "Yes", but then: "We can't do it ... We're gonna be in the Hudson".
● All 155 people on board were rescued.

Slide 10

Slide 10 text

Lessons Learned
● Good decision-making and teamwork by the cockpit crew.
● The performance of the flight crew during the evacuation.
● The A320 is certified for extended overwater operation.
● Improved pilot training for water landings.
● The report made 34 recommendations.

Slide 11

Slide 11 text

Tenerife Airport March 27th, 1977

Slide 12

Slide 12 text

Radio Communication
● KLM: KLM 4805 is now ready for take-off and we are waiting for our ATC clearance.
● TOWER: KLM 4805 you are cleared to the Papa beacon, climb to and maintain [...]
● KLM: Ah roger sir, [...] We are now at take-off.

Slide 13

Slide 13 text

PAN-AM (RADIO): OK, will report when we're clear.
Cockpit conversation on KLM plane in Dutch:
● Flight Eng: Is he not clear then?
● Captain: What do you say?
● Flight Eng: Is he not clear, that Pan American?
● Captain: Oh yes
583 Dead - 61 Injured

Slide 14

Slide 14 text

Lessons Learned
● Communications between teams were unclear.
● A junior team member noticed the mistake, but was ignored.
● If you do not have observability, you should not make decisions.
● Cockpit procedures were also changed after the accident.
● The course of action was later expanded into what is known today as crew resource management (CRM), which states that all pilots, no matter how experienced they are, are allowed to contradict each other.

Slide 15

Slide 15 text

Lion Air Flight 610
October 29th, 2018
Boeing 737 MAX

Slide 16

Slide 16 text

What happened?
● Preliminary investigations revealed serious flight control problems and signs of failures in an angle-of-attack (AOA) sensor and other instruments, tied to a design flaw involving the Maneuvering Characteristics Augmentation System (MCAS) of the MAX series.
● The replacement sensor that was installed had been mis-calibrated during an earlier repair.
● MCAS was designed to rely on a single AOA sensor, making it vulnerable to erroneous input from that sensor.
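The single-sensor dependency has a familiar software analogue: automation that acts on one unverified input. The sketch below is a generic illustration of cross-checking redundant readings before acting on them. It is not a description of MCAS or of any avionics logic, and the function name and disagreement tolerance are assumptions made for this example.

```python
from statistics import median
from typing import Optional, Sequence


def cross_checked_reading(readings: Sequence[float],
                          max_disagreement: float) -> Optional[float]:
    """Return the median of redundant readings, or None if they cannot be trusted.

    None means "do not act automatically": either there is only one source,
    or the sources disagree by more than the tolerance.
    """
    if len(readings) < 2:
        return None  # a single source cannot be cross-checked
    if max(readings) - min(readings) > max_disagreement:
        return None  # sources disagree; hand the decision to a human
    return median(readings)
```

Returning `None` instead of a best guess is the design choice here: when inputs disagree, the automation steps back and makes the disagreement visible rather than acting on a possibly faulty value.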

Slide 17

Slide 17 text

Lessons Learned
● The design and certification of the Boeing 737-8 (MAX) should not have relied on assumptions about flight crew response to malfunctions.
● Guidance on MCAS, or more detailed flight manuals, could have helped crews respond properly to uncommanded MCAS activation.
● Documentation in the aircraft flight and maintenance log is required.
● As a result, the United States Federal Aviation Administration and Boeing issued warnings and training advisories to all operators of the MAX series to avoid letting MCAS cause similar tragedies.

Slide 18

Slide 18 text

British Midland Flight 92 from Heathrow to Belfast
United Kingdom
January 8th, 1989

Slide 19

Slide 19 text

What happened?
Smoke in the Cockpit
● CAPTAIN: Which engine is it?
● FIRST OFFICER: It's the le.. It's the right one!
● CAPTAIN: Okay, throttle it back.
● CAPTAIN: Shut it down.
● PA: We had a problem with the right engine and shut it down.
47 Dead - 74 Injured

Slide 20

Slide 20 text

Lessons Learned
● Always re-verify your working assumptions.
● Wrong mitigation actions may appear to work due to coincidence or complex behaviors.
● Personal knowledge should be tested against a changing environment.
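In operations terms, "a wrong mitigation may appear to work" suggests not declaring success on a single good data point. A minimal sketch, assuming a hypothetical stream of error-rate samples collected after the mitigation, could require a sustained run of healthy samples before the assumption is accepted. The function name and parameters are illustrative only.

```python
from typing import Sequence


def mitigation_confirmed(error_rates: Sequence[float], threshold: float,
                         required_consecutive: int) -> bool:
    """True only if the most recent samples are all below the threshold.

    error_rates holds samples taken after the mitigation, oldest first.
    """
    if required_consecutive <= 0:
        raise ValueError("required_consecutive must be positive")
    if len(error_rates) < required_consecutive:
        return False  # not enough evidence yet; keep watching
    recent = error_rates[-required_consecutive:]
    return all(rate < threshold for rate in recent)
```

A brief dip right after the action does not count as confirmation, and a late spike such as `[0.01, 0.01, 0.2]` reopens the question instead of being averaged away.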

Slide 21

Slide 21 text

Überlingen, Germany July 1st, 2002 DHL TU Zürich

Slide 22

Slide 22 text

What happened?
● Meanwhile, TCAS gives BTC2937 a command to climb and DHL611 a command to descend.
● The controller gives BTC2937 a command to descend.
● The BTC2937 pilots decide that the human is better informed than the computer and follow the human's instructions.
● Both planes proceed to descend, still on a collision course.
71 Dead - No survivors

Slide 23

Slide 23 text

What happened?
● Zürich Air Traffic Control center under maintenance
  ○ Ground-based conflict alert offline
  ○ System slow to respond
  ○ Phones not working
● Only one controller on duty, monitoring two different screens.
● Controller under high workload, and cannot transfer to other center.
● Controller notices planes in conflict late.

Slide 24

Slide 24 text

Lessons Learned
● Failsafe systems may surprise human operators.
● Explicit protocols to resolve conflicts between different sources must be in place in advance.
● Hope Is Not A Strategy!
  ○ Critical functions must be properly staffed.
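An explicit protocol for conflicting sources can be as simple as a precedence table agreed before the incident, so nobody has to decide under pressure which source wins. The sketch below is a hypothetical illustration: the source names and their ranking are assumptions chosen to mirror the scenario above, not an operational procedure.

```python
from typing import Dict, Optional

# Hypothetical precedence agreed in advance: the lower number wins.
PRECEDENCE: Dict[str, int] = {
    "collision_avoidance_system": 0,
    "human_controller": 1,
}


def resolve(instructions: Dict[str, str],
            precedence: Dict[str, int] = PRECEDENCE) -> Optional[str]:
    """Return the instruction from the highest-precedence known source.

    instructions maps a source name to what that source is asking for.
    Sources missing from the precedence table are ignored.
    """
    known = [source for source in instructions if source in precedence]
    if not known:
        return None  # no recognized source, so nothing to follow
    winner = min(known, key=lambda source: precedence[source])
    return instructions[winner]
```

For example, `resolve({"human_controller": "descend", "collision_avoidance_system": "climb"})` returns `"climb"`, because the table, not the operator in the moment, settles the conflict.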

Slide 25

Slide 25 text

Resilience Engineering
● To be able to construct a mental representation of the situation.
● To be able to assess risk and threats as relevant for the flight.
● To be able to switch from a situation under control.
● To be able to maintain a relevant level of confidence.
● To be able to make a decision in a complex.

Slide 26

Slide 26 text

Resilience Engineering
● To be able to make intelligent use of procedures.
● To be able to use available technical and human resources.
● To be able to manage time and time pressure.
● To be able to cooperate with crew members and other staff.
● To be able to properly use and manage information.

Slide 27

Slide 27 text

Inspired by Alon Altman SRE EDU