Why do computers stop and what can be done about it?

Advanced Operating Systems
Lucas Bleme
Why Do Computers Stop and What Can Be Done About It?

Agenda
• Problem (motivation)
• Fault Tolerant Execution
• Fault Tolerant Communication
• Fault Tolerant Storage
• Related Work
• Final Evaluation

High availability systems
• Patient monitoring, air traffic control, online transaction processing, geo-navigation
• Parts of the system may fail, but the system as a whole must tolerate those failures
• Availability != Reliability
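To make the distinction between availability and reliability concrete, here is a minimal sketch using the textbook definition availability = MTBF / (MTBF + MTTR), where MTTR is the mean time to repair. The two systems and their figures are hypothetical, chosen only to show that a less reliable system (lower MTBF) can still be the more available one if it recovers quickly.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical system A: fails rarely, but each repair takes 100 hours.
a = availability(mtbf_hours=10_000, mttr_hours=100)

# Hypothetical system B: fails ten times as often, but recovers in one minute.
b = availability(mtbf_hours=1_000, mttr_hours=1 / 60)

print(f"A: {a:.5f}  B: {b:.5f}")
print("B is less reliable but more available:", b > a)
```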

Experiment
• Interviews gathered the system failures reported over 7 months, covering 2000 systems from Tandem Computers and about 10 million system hours
• 166 failures reported; 11 years MTBF (mean time between failures)
• Reported failures do not cover power outages < 4h or "infant mortality" failures

Source: https://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf

Overall solution
• Reduce administrative mistakes by building self-configuring systems with minimal maintenance and operator interaction
• Avoid immature products that still suffer from "infant mortality"
• Build fault-tolerant software (execution, communication and storage)

Fault Tolerant Execution
• Decompose a large system into modules, each a single unit of service and failure
• Fail fast: a module either functions correctly or stops operating (defensive programming)
• Share no state between processes; they communicate only via messages carried by a message system
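The three ideas on this slide can be sketched together in a few lines. This is only an illustration under stated assumptions: a Python thread stands in for an isolated process, `queue.Queue` stands in for the message system, and the names `FailFast`, `withdraw` and `account_module` are hypothetical.

```python
import queue
import threading


class FailFast(Exception):
    """Raised when a module detects that it cannot operate correctly."""


def withdraw(balance: int, amount: int) -> int:
    # Defensive checks: refuse to continue on any inconsistent input.
    if amount <= 0:
        raise FailFast(f"invalid amount: {amount}")
    if amount > balance:
        raise FailFast("insufficient funds")
    return balance - amount


def account_module(inbox: queue.Queue, outbox: queue.Queue) -> None:
    balance = 100  # private state, reachable only through messages
    while True:
        amount = inbox.get()
        if amount is None:  # shutdown message
            return
        try:
            balance = withdraw(balance, amount)
            outbox.put(("ok", balance))
        except FailFast as err:
            # Report the failure and stop instead of limping on.
            outbox.put(("failed", str(err)))
            return


inbox, outbox = queue.Queue(), queue.Queue()
worker = threading.Thread(target=account_module, args=(inbox, outbox))
worker.start()
for amount in (30, -5, 10):  # the third message is never processed
    inbox.put(amount)
worker.join()

replies = []
while not outbox.empty():
    replies.append(outbox.get())
print(replies)
```

The module stops at the first bad request, so the caller learns about the fault immediately and can hand the remaining work to another instance.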

Fault Tolerant Execution
• Leverage the "Heisenbug" hypothesis (most production software faults are transient) by re-executing failing components
• Functional Recovery Routines (FRRs) achieved a 76% success rate in recovering system execution in MVS/XA, increasing its MTBF by a factor of 4
Source: https://dl.acm.org/doi/abs/10.1109/TSE.1987.232855
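Re-execution of a failing component can be sketched as a bounded retry loop. The fault here is simulated: `make_flaky` is a hypothetical helper that fails a fixed number of times before succeeding, standing in for a timing-dependent bug that disappears when the operation is run again. This is not how FRRs are implemented in MVS/XA; it only illustrates the hypothesis.

```python
class TransientFault(Exception):
    """A fault that may not recur when the operation is re-executed."""


def make_flaky(failures_before_success: int):
    """Build an operation that fails a given number of times, then succeeds."""
    remaining = failures_before_success

    def operation() -> str:
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            raise TransientFault("simulated timing-dependent fault")
        return "done"

    return operation


def run_with_retry(operation, max_attempts: int = 3):
    """Re-execute the operation until it succeeds or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(), attempt
        except TransientFault:
            if attempt == max_attempts:
                raise  # a fault that keeps recurring is not a Heisenbug


result, attempts = run_with_retry(make_flaky(2))
print(result, "after", attempts, "attempts")
```

A fault that survives every retry is re-raised, which keeps the fail-fast behaviour for deterministic bugs.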

Fault Tolerant Execution
Process-pairs design approaches:
• Lockstep: the primary and the backup processes synchronously execute the same instruction stream on independent processors
• State checkpoint: the primary process sends its state changes to the backup process prior to each major event
• Delta checkpoint: the primary sends the backup logical changes rather than physical state, improving performance and increasing isolation
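The checkpointing approaches can be illustrated with a toy process pair. This sketch assumes both "processes" are plain objects in one interpreter, and the class and method names (`Primary`, `Backup`, `receive_checkpoint`) are hypothetical. The point it shows is the ordering: the change reaches the backup before the primary acts on it, so the backup can take over with the same state.

```python
class Backup:
    def __init__(self) -> None:
        self.state: dict = {}

    def receive_checkpoint(self, changes: dict) -> None:
        # Only the changed keys are sent, not the whole state.
        self.state.update(changes)


class Primary:
    def __init__(self, backup: Backup) -> None:
        self.state: dict = {}
        self.backup = backup

    def handle(self, key: str, value: int) -> None:
        changes = {key: value}
        self.backup.receive_checkpoint(changes)  # checkpoint first
        self.state.update(changes)               # then apply locally


backup = Backup()
primary = Primary(backup)
primary.handle("orders", 1)
primary.handle("orders", 2)
primary.handle("stock", 40)

state_at_failure = dict(primary.state)
primary = None  # simulate the primary's processor failing

# The backup already holds every checkpointed change and can take over.
print(backup.state == state_at_failure, backup.state)
```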

Fault Tolerant Execution
Persistent process-pairs + transactions:
• Transactions group operations into a consistent transformation of state
• They allow programmers to reset everything back to the beginning of the transaction (undo changes)
• Combined with persistent process-pairs, they cover hardware faults and Heisenbugs, promoting robustness
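Resetting state to the beginning of a transaction can be sketched with an in-memory snapshot. This is a simplification that assumes the whole state fits in one dictionary; a real system would use a log rather than a full copy. The `transaction` helper is hypothetical.

```python
import copy
from contextlib import contextmanager


@contextmanager
def transaction(state: dict):
    """Undo every change made inside the block if an exception escapes."""
    snapshot = copy.deepcopy(state)
    try:
        yield state
    except Exception:
        state.clear()
        state.update(snapshot)  # back to the beginning of the transaction
        raise


state = {"stock": 5, "orders": []}
try:
    with transaction(state) as s:
        s["stock"] -= 1
        s["orders"].append("order-1")
        raise RuntimeError("simulated fault in the middle of the transaction")
except RuntimeError:
    pass

print(state)
```

Because the state is clean again after the fault, a restarted process can simply re-run the transaction from its start.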

Fault Tolerant Communication
• Communication is the most unreliable part of a distributed system
• Timeouts, message sequence numbers and sessions make process-pairs work properly
• Interaction between transactions and sessions
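How timeouts and sequence numbers combine within a session can be sketched as follows. Everything here is simulated under stated assumptions: `LossyLink` is a hypothetical link that loses the acknowledgement of chosen transmissions, and a lost acknowledgement (returned as `None`) stands in for a timeout at the sender. The receiver uses the sequence number to discard the retransmitted duplicate.

```python
class Receiver:
    def __init__(self) -> None:
        self.expected_seq = 0
        self.delivered: list = []

    def receive(self, seq: int, payload: str) -> int:
        """Deliver each message once; return the next expected sequence number."""
        if seq == self.expected_seq:
            self.delivered.append(payload)
            self.expected_seq += 1
        # A duplicate (seq < expected_seq) is acknowledged but not re-delivered.
        return self.expected_seq


class LossyLink:
    """Hypothetical link that loses the ack of the listed transmissions."""

    def __init__(self, receiver: Receiver, drop_acks_on: set) -> None:
        self.receiver = receiver
        self.drop_acks_on = drop_acks_on
        self.sends = 0

    def send(self, seq: int, payload: str):
        self.sends += 1
        ack = self.receiver.receive(seq, payload)
        if self.sends in self.drop_acks_on:
            return None  # ack lost: the sender observes a timeout
        return ack


def send_reliably(link: LossyLink, seq: int, payload: str, max_tries: int = 3) -> bool:
    for _ in range(max_tries):
        if link.send(seq, payload) == seq + 1:
            return True  # acknowledged
    return False  # give up after repeated timeouts


receiver = Receiver()
link = LossyLink(receiver, drop_acks_on={2})
results = [
    send_reliably(link, seq, payload)
    for seq, payload in enumerate(["debit 10", "credit 10", "commit"])
]
print(results, receiver.delivered, link.sends)
```

The second message is transmitted twice because its acknowledgement is lost, yet it is delivered to the application only once.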

Fault Tolerant Storage
• Remote replicas with different administrators, hardware and environments protect against 75% of the failures (all non-software failures)
• Use transactions to coordinate data updates, ensuring that all or none of them apply
• Partition data among discs and nodes to limit the scope of failures
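The all-or-none property for stored data can be demonstrated with Python's standard `sqlite3` module, used here only as a convenient stand-in for a transactional store. The `account` table and the `transfer` function are hypothetical. Using the connection as a context manager commits when the block succeeds and rolls back when an exception escapes it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> None:
    with conn:  # commit on success, roll back if an exception escapes
        conn.execute(
            "UPDATE account SET balance = balance + ? WHERE name = ?", (amount, dst)
        )
        conn.execute(
            "UPDATE account SET balance = balance - ? WHERE name = ?", (amount, src)
        )


transfer(conn, "a", "b", 30)  # both updates apply

try:
    transfer(conn, "a", "b", 500)  # the debit violates the CHECK constraint
except sqlite3.IntegrityError:
    pass  # the credit that already ran is rolled back with it

balances = dict(conn.execute("SELECT name, balance FROM account"))
print(balances)
```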

Wrapping Up
• Hardware is a minor contributor to system outages *
• Applying modularity, defensive programming and process-pairs, and tolerating soft faults (Heisenbugs), increases a system's MTBF
• Resumable sessions provide fault-tolerant communication
• Transaction atomicity coordinates data changes, providing fault-tolerant storage
* "in the future hardware will be even more reliable due to better design, increased levels of integration, and reduced numbers of connectors" (GRAY, 1985, p. 12).

Related Work
[Mourad] Mourad, S. and Andrews, D., "The Reliability of the Operating System", Digest of 15th Annual Int. Sym. on Fault-Tolerant Computing, June 1985. IEEE Computer Society Press.
[Adams] Adams, E., "Optimizing Preventative Service of Software Products", IBM J. Res. and Dev., Vol. 28, No. 1, Jan. 1984.
[Borg] Borg, A., Baumbach, J., Glazer, S., "A Message System Supporting Fault-tolerance", ACM OS Review, Vol. 17, No. 5, 1984.

Final Evaluation
The report provides great insights into the design of fault-tolerant systems. Modern approaches to building resilient distributed systems, such as Circuit Breaker, Distributed Caching, and Read Replicas, share the same goals as the presented techniques for fault-tolerant execution, communication and storage.

Thank you! https://speakerdeck.com/andreybleme Lucas Bleme
