Building Safety Critical Systems

Marianne Bellotti
September 22, 2022
A two-hour workshop on safety in the software engineering space, given as part of Strange Loop 2022


Transcript

  1. About Me • Author of “Kill It With Fire” • 20+ years of software experience • Specialities: ◦ System dynamics ◦ Applied formal methods ◦ Architecture and system rescue • Engineering manager at Rebellion Defense
  2. The Workshop • Overview of Safety ◦ Traditional engineering idea of safety ◦ Safety as ergonomics and systems thinking • The Role of Models and Specification in Safety • Building Models ◦ Common problems ◦ Approaches ◦ Verification hot spots • Drafting a Model ◦ Ground to Takeoff transition ◦ Develop a model ◦ Feedback and discussion
  3. The Workshop • A workshop, not a two-hour lecture! ◦ Interactive! ◦ Small group work ◦ Be prepared to move a little bit • Please respect everyone’s threat models ◦ If asked to mask up by neighbors, please do so. I have masks available • Join the channel #workshop-safety • Make sure you have a scrap paper pack and a pen ;)
  4. What Do We Mean When We Say “Safe”? • Is it unsafe if it’s a contributing factor? ◦ 1.4 million accidents involving flip-flops (Sheilas’ Wheels, 2013) ◦ An estimated 1,200 people die in fatal muggings over their sneakers (GQ, 2015) • Is it unsafe if the harm is intentional? (CDC, 2013) ◦ 99.4% of car deaths are accidental ◦ Only 4% of gun deaths are accidental ◦ 65% of gun fatalities are suicides
  5. What Do We Mean When We Say “Safe”? • The traditional view of Safety Critical: ◦ Likelihood of hazard (SLO) ◦ Unit testing ◦ Configuration control ◦ Formal change management ◦ Software of unknown pedigree • Specific to industries: ◦ Aerospace: DO-178C ◦ Rail: EN 50126, EN 50129 ◦ Automotive: ISO 26262 ◦ Nuclear: IEC 61513 • IEC 61508 ← closest thing to a general standard
  6. What Do We Mean When We Say “Safe”? • The traditional view of Safety Critical (Credit: CMU SEI, 2013)
  7. What Do We Mean When We Say “Safe”? • Safety engineering focuses on formally verifying that the technology will adhere to its requirements in all situations • When it’s impossible to verify, safety engineering estimates the likelihood of failure and assembles an acceptable risk budget, similar to an SLO ◦ Hardware can always break ◦ “Soft” real-time constraints • Traditional safety engineering relies on the requirements of safe operation being clearly defined • As systems get more intertwined and complex, we end up with more software that we do not think of as “safety critical” but that can nevertheless cause problems
  8. What Do We Mean When We Say “Safe”? • Risk ◦ Does the operator understand the risks of using the technology in a particular way? ◦ Does the technology hide something about the context? ▪ Distraction ▪ Misleading/Confusing • Mitigation ◦ Can the operator stop unsafe events in progress? ◦ Is operation predictable/deterministic?
  9. What Do We Mean When We Say “Safe”? Resilient, Available, Accessible, Controlled, Verified, Reliable, Explainable, Fault-Tolerant
  10. What Do We Mean When We Say “Safe”? The same qualities (Resilient, Available, Accessible, Controlled, Verified, Reliable, Explainable, Fault-Tolerant) sorted into Group A and Group B
  11. High-Assurance Cyber Military Systems (HACMS) • seL4 kernel ◦ highest assurance of isolation between applications running in the system • Model system in AADL (Architecture Analysis & Design Language) • Check model • Separate functions into verifiable components • Write components in domain-specific languages that eliminate bad habits in C
  12. Reasoning About Systems • What are the parts of the system? • How do they interact? • What is the expected behavior? • How do the parts create the behavior?
  13. Model This System • Online REPL ◦ Web-based application ◦ Frontend where people enter code ◦ Button to run/execute code ◦ Results displayed back to the user • Examples ◦ https://go.dev/play/ ◦ https://www.ideone.com/ ◦ https://www.codiva.io/
  14. Problems with Models • Most engineers start by modeling how a system looks ◦ What hardware is there? ◦ What instances are there? ◦ What protocols/APIs do they interact over? • This doesn’t tell us anything about how the system behaves, which is what we need to verify.
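To make the contrast concrete: instead of listing the REPL's parts, we can model the behavior of its run button as a small state machine. The states and event names below are an illustrative assumption, not from the talk; the point is that a behavioral model is something we can actually check.

```python
# Behavioral sketch of the online REPL exercise: the lifecycle of a code
# submission, modeled as states plus events. State and event names are
# illustrative assumptions, not from the talk.
from enum import Enum, auto

class ReplState(Enum):
    IDLE = auto()            # editor open, nothing running
    RUNNING = auto()         # code submitted, awaiting result
    SHOWING_RESULT = auto()  # output displayed to the user
    FAILED = auto()          # execution errored or timed out

def step(state: ReplState, event: str) -> ReplState:
    """Pure transition function: unknown (state, event) pairs are no-ops."""
    transitions = {
        (ReplState.IDLE, "run_clicked"): ReplState.RUNNING,
        (ReplState.RUNNING, "completed"): ReplState.SHOWING_RESULT,
        (ReplState.RUNNING, "timeout"): ReplState.FAILED,
        (ReplState.SHOWING_RESULT, "run_clicked"): ReplState.RUNNING,
        (ReplState.FAILED, "run_clicked"): ReplState.RUNNING,
    }
    return transitions.get((state, event), state)
```

A structural model (frontend, executor, database) would not let us ask questions like "can a result be shown while code is still running?"; this one does.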
  15. Problems with Models • Creating blind spots by self-selecting what areas of behavior are important ◦ How consistent is your scope across the model? ◦ How do we know the generalizations are correct? • Only shows you problems you already know about • Thinking in axioms: ◦ We’re making an assumption that this state is always true ◦ We document and monitor it.
  16. That being said… Even models with biases and flaws can be useful. The process of writing them down often triggers ah-ha! moments
  17. Cheat Sheet to Verification • What can connect to what? ◦ Identity ◦ Policy ◦ Shared resources (memory) • Will processes return in time? ◦ Time deadlines ◦ Concurrency issues • How do we transition between states? ◦ What is the correct behavior? ◦ What is the impossible behavior?
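The last cheat-sheet item, correct versus impossible behavior, can be phrased as properties checked over a trace of states. This is a minimal sketch in the spirit of temporal-logic checks; the trace shape and predicates are assumptions for illustration.

```python
# Property checks over a recorded trace of system states. "always" and
# "never" are safety-style properties; "eventually" is liveness-style.
def always(trace, predicate):
    """Correct behavior: predicate holds in every state of the trace."""
    return all(predicate(s) for s in trace)

def never(trace, predicate):
    """Impossible behavior: predicate holds in no state of the trace."""
    return not any(predicate(s) for s in trace)

def eventually(trace, predicate):
    """Liveness-style property: predicate holds in at least one state."""
    return any(predicate(s) for s in trace)

# Illustrative drone trace: altitude must never be negative, and the
# mission should eventually return to the ground.
trace = [
    {"mode": "takeoff", "alt": 0.0},
    {"mode": "flying", "alt": 30.0},
    {"mode": "landing", "alt": 5.0},
    {"mode": "ground", "alt": 0.0},
]
```

Real model checkers explore every reachable trace rather than one recorded run, but the property vocabulary is the same.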
  18. [State diagram] Top-level states: Ground, Takeoff, Hovering, Flying, Landing, Critical. Sub-modes: Idle, Calibration, Normal, Hand, Manual, Flightplan, Followme, Lookat, Point of Interest (POI), Return to Home (RTH), Critical RTH, Critical Landing, Emergency Landing, Emergency Ground
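The top-level states in this diagram can be captured as an explicit allowed-transition table; anything not in the table is an impossible behavior. The particular set of allowed transitions below is an assumed reading of the diagram, not taken from the talk.

```python
# Sketch of the drone's top-level flight states with an explicit
# allowed-transition table. The table contents are an assumption for
# illustration; a real model would derive them from the flight stack.
from enum import Enum, auto

class FlightState(Enum):
    GROUND = auto()
    TAKEOFF = auto()
    HOVERING = auto()
    FLYING = auto()
    LANDING = auto()
    CRITICAL = auto()

# Any transition not listed here is treated as impossible and rejected.
ALLOWED = {
    FlightState.GROUND:   {FlightState.TAKEOFF},
    FlightState.TAKEOFF:  {FlightState.HOVERING, FlightState.CRITICAL},
    FlightState.HOVERING: {FlightState.FLYING, FlightState.LANDING, FlightState.CRITICAL},
    FlightState.FLYING:   {FlightState.HOVERING, FlightState.LANDING, FlightState.CRITICAL},
    FlightState.LANDING:  {FlightState.GROUND, FlightState.CRITICAL},
    FlightState.CRITICAL: {FlightState.LANDING, FlightState.GROUND},
}

def transition(current: FlightState, target: FlightState) -> FlightState:
    """Return the new state, or raise if the model forbids the move."""
    if target not in ALLOWED[current]:
        raise ValueError(f"impossible transition: {current.name} -> {target.name}")
    return target
```

Making the table explicit is what lets a checker enumerate impossible behaviors, such as jumping from Ground straight to Flying.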
  19. Unsafe States • Sensor failure • Latency • Losing line of sight • Connectivity • Weather conditions • Projectiles/obstacles → For each: Assess the Risk, Mitigate the Risk
  20. [Diagram] States (Ground, Takeoff, Hovering, Flying, Landing, Critical) and sensors: Accelerometer, Gyroscope, Magnetic compass, Barometer, GPS sensor, Distance sensor (ultrasonic, laser, or LIDAR)
  21. Draft a Model • Pick a transition between two states ◦ What are the components (software, sensors, hardware)? ◦ What are the steps that create the transition? ◦ What controls or fallbacks are involved? • Design some tests ◦ What should be impossible? ◦ What should eventually be true? ◦ What should always be true? ◦ How do we know?
  22. Ground → Takeoff (components: Barometer, Distance Sensor (ultrasonic, laser, or LIDAR)) • Get ground elevation • Check hardware system health • Check for obstructions
  23. Ground → Takeoff • Get ground elevation • Check hardware system health • Check for obstructions • Impossible: ◦ Takeoff if hardware in failure ◦ Takeoff if operator too close • Fallback: ◦ If sensor fails, do not allow takeoff
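The impossible states and fallback above can be written as explicit preconditions on the takeoff transition. This is a minimal sketch: the reading names and the operator-distance threshold are assumptions for illustration, not values from the talk.

```python
# Precondition check for the Ground -> Takeoff transition. Field names
# and MIN_OPERATOR_DISTANCE_M are illustrative assumptions.
from dataclasses import dataclass

MIN_OPERATOR_DISTANCE_M = 5.0  # assumed safety radius, not from the talk

@dataclass
class SensorReadings:
    hardware_healthy: bool      # aggregate hardware system health check
    sensors_responding: bool    # did every queried sensor answer?
    operator_distance_m: float  # from the distance sensor
    obstruction_detected: bool  # anything in the takeoff path?

def may_take_off(r: SensorReadings) -> bool:
    """Impossible states become explicit preconditions; a silent or
    failed sensor falls back to 'do not allow takeoff'."""
    if not r.sensors_responding:  # fallback: sensor failure blocks takeoff
        return False
    if not r.hardware_healthy:    # impossible: takeoff with hardware in failure
        return False
    if r.operator_distance_m < MIN_OPERATOR_DISTANCE_M:  # impossible: operator too close
        return False
    if r.obstruction_detected:    # obstruction check from the slide
        return False
    return True
```

Note the ordering: the sensor-failure fallback is checked first, so no other precondition is ever trusted on stale or missing data.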
  24. Determining States • Going back to finite state machines ◦ Moore machines: f(state) → state’ ◦ Mealy machines: f(state, input) → state’ • A state is a product of a previous state, or a previous state AND an input • What are the inputs to the component? ◦ Distance sensor: query from flight supervisor ◦ Does Pending + Query = a distinct state not otherwise recorded?
  25. Determining States • There are no inputs that can combine with a Pending state to produce a state we don’t otherwise know about • But this depends on the system and the behavior we wish to model • Database ◦ Inputs: Read query, write query ◦ Pending write + Read query = blocked
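The database example is a Mealy-style machine: f(state, input) → state'. This sketch adds an assumed Commit input to close the loop; the state and input names are illustrative, and the one combination the slide calls out, Pending write + Read query, is the distinct state Blocked.

```python
# Mealy-style transition function for the database example. The COMMIT
# input and the exact state names are assumptions for illustration.
from enum import Enum, auto

class DbState(Enum):
    IDLE = auto()
    PENDING_WRITE = auto()
    BLOCKED = auto()

class DbInput(Enum):
    READ = auto()
    WRITE = auto()
    COMMIT = auto()

def db_step(state: DbState, inp: DbInput) -> DbState:
    """f(state, input) -> state'. Unlisted combinations are no-ops."""
    if state == DbState.IDLE and inp == DbInput.WRITE:
        return DbState.PENDING_WRITE
    if state == DbState.PENDING_WRITE and inp == DbInput.READ:
        return DbState.BLOCKED  # the distinct state the slide calls out
    if state in (DbState.PENDING_WRITE, DbState.BLOCKED) and inp == DbInput.COMMIT:
        return DbState.IDLE     # assumed: commit releases the pending write
    return state
```

Enumerating every (state, input) pair like this is exactly how you discover whether a combination produces a state you had not otherwise recorded.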
  26. Generalizing Inputs • Some inputs have infinite states ◦ Strings! ◦ Numbers • Again, the key issue is distinct state change, so to model these cases we tend to sort those inputs into categories ◦ Valid/invalid ◦ Less than threshold/More than threshold • Typically we don’t use magic numbers in specs but it’s okay to do that when you’re starting out
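Sorting an infinite input space into finite categories might look like this. The altitude threshold is exactly the kind of "magic number" the slide permits when starting out; both the number and the category names are assumptions for illustration.

```python
# Collapse an infinite input space (arbitrary strings) into the finite
# categories a model can enumerate. The threshold is a placeholder
# "magic number", as the slide allows for early drafts.
ALTITUDE_THRESHOLD_M = 120.0  # illustrative ceiling, not from the talk

def categorize_altitude(raw: str) -> str:
    """Map any string input to one of three abstract input states."""
    try:
        value = float(raw)
    except ValueError:
        return "invalid"          # unparseable strings
    if value < 0:
        return "invalid"          # physically meaningless altitudes
    if value <= ALTITUDE_THRESHOLD_M:
        return "below_threshold"
    return "above_threshold"
```

With three categories instead of infinitely many numbers, the transition table over (state, input) pairs stays finite and checkable.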
  27. Draft a Model • Pick a transition between two states ◦ What are the components (software, sensors, hardware)? ◦ What are the inputs of each component? ◦ What are the steps that create the transition? ◦ What controls or fallbacks are involved? • Design some tests ◦ What should be impossible? ◦ What should eventually be true? ◦ What should always be true? ◦ How do we know?
  28. Thank You! • Check #workshop-safety for slides and resources • Give me feedback: ◦ https://forms.gle/1PTdSNB4xmVfcCco9 • Watch this space → bellotti.tech