Why do computers stop and what can be done about it?

Presentation of Jim Gray's technical report "Why Do Computers Stop and What Can Be Done About It?", given at the Advanced Operating Systems seminar at DCC/UFMG.

Presentation (PT-BR): https://www.youtube.com/watch?v=sfv0QwcWlTU
Original report: https://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf

Lucas Bleme

March 14, 2021

Transcript

  1. High availability systems
    • Patient monitoring, air traffic control, online transactions, geo navigation
    • Parts of the system may fail, but the system as a whole must tolerate those failures
    • Availability != Reliability
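The difference between availability and reliability can be made concrete with the standard steady-state formula, availability = MTBF / (MTBF + MTTR). The sketch below is only an illustration: the function name and the sample numbers are hypothetical, not figures from the report.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Two hypothetical systems with the same reliability (same MTBF)
# but different repair times (MTTR).
slow_repair = availability(mtbf_hours=1000, mttr_hours=10)
fast_repair = availability(mtbf_hours=1000, mttr_hours=1)
print(f"slow repair: {slow_repair:.4f}, fast repair: {fast_repair:.4f}")
```

Both systems fail equally often, yet the one that repairs faster is more available, which is why the two properties have to be treated separately.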
  2. Experiment
    • Failure reports gathered over 7 months from more than 2,000 Tandem systems, about 10 million system hours
    • 166 failures reported; 11 years MTBF (mean time between failures)
    • The reported failures do not cover power outages shorter than 4 hours or "infant mortality" failures
  3. Overall solution
    • Reduce administrative mistakes by building self-configuring systems with minimal maintenance and operator interaction
    • Avoid immature products that still suffer from "infant mortality"
    • Build fault-tolerant software (execution, communication and storage)
  4. Fault Tolerant Execution
    • Decompose a large system into modules, each a single unit of service and failure
    • Fail fast: a module either functions correctly or stops operating (defensive programming)
    • Share no state between processes; they communicate only via messages carried by a message system
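The fail-fast and message-passing ideas on this slide can be sketched as follows. This is a minimal illustration, not the Tandem design: it uses a thread and standard-library queues as stand-ins for a process and a message system, and `worker` and `run` are hypothetical names.

```python
import queue
import threading


def worker(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Hypothetical module: keeps no shared state, talks only through queues."""
    while True:
        msg = inbox.get()
        if msg is None:  # shutdown sentinel
            return
        # Fail fast: validate the input and stop instead of computing a wrong answer.
        if not isinstance(msg, int) or msg < 0:
            outbox.put(("failed", msg))
            return
        outbox.put(("ok", msg * 2))


def run(messages):
    """Send messages to a fresh worker and collect whatever it replied."""
    inbox, outbox = queue.Queue(), queue.Queue()
    thread = threading.Thread(target=worker, args=(inbox, outbox))
    thread.start()
    for message in messages:
        inbox.put(message)
    inbox.put(None)
    thread.join()
    replies = []
    while not outbox.empty():
        replies.append(outbox.get())
    return replies


print(run([1, 2, -3, 4]))
```

The worker stops at the first invalid message, so the message after it is never processed: the module is either correct or stopped, never silently wrong.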
  5. Fault Tolerant Execution
    • Leverage the "Heisenbug" hypothesis by relying on re-execution of failing components
    • Functional Recovery Routines (FRRs) recovered system execution in 76% of cases in MVS/XA, increasing its MTBF by a factor of 4
    Source: https://dl.acm.org/doi/abs/10.1109/TSE.1987.232855
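Re-execution under the Heisenbug hypothesis can be sketched as a bounded retry. This is an illustrative sketch only: `TransientFault`, `retry` and `make_flaky` are hypothetical names, and the "soft fault" is simulated by a counter rather than by a real timing-dependent bug.

```python
class TransientFault(Exception):
    """Stand-in for a soft fault that tends to disappear on re-execution."""


def retry(operation, attempts=3):
    """Re-run the operation up to `attempts` times; re-raise the last fault."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except TransientFault as err:
            last_error = err
    raise last_error


def make_flaky(failures_before_success):
    """Build an operation that fails a fixed number of times, then succeeds."""
    calls = {"count": 0}

    def operation():
        calls["count"] += 1
        if calls["count"] <= failures_before_success:
            raise TransientFault(f"soft fault on call {calls['count']}")
        return "done"

    return operation


print(retry(make_flaky(2)))
```

A fault that persists across every attempt (closer to a "Bohrbug") is still surfaced to the caller, so retrying does not hide hard failures.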
  6. Fault Tolerant Execution
    Process-pairs design approaches:
    • Lockstep: the primary and backup processes synchronously execute the same instruction stream on independent processors
    • State checkpoint: the primary process sends its state changes to the backup process prior to each major event
    • Delta checkpoint: the primary sends logical rather than physical updates to the backup, improving performance and increasing isolation
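The difference between state and delta checkpoints can be sketched with a toy account service. This is a simplified illustration under stated assumptions: both "processes" are plain objects in one program, and `Primary`, `Backup` and their methods are hypothetical names.

```python
class Backup:
    """Hypothetical backup process holding a copy of the primary's state."""

    def __init__(self):
        self.balances = {}

    def state_checkpoint(self, account, new_balance):
        # State checkpoint: receive the new physical value and overwrite.
        self.balances[account] = new_balance

    def delta_checkpoint(self, account, amount):
        # Delta checkpoint: receive the logical change and apply it locally.
        self.balances[account] = self.balances.get(account, 0) + amount


class Primary:
    """Hypothetical primary process that checkpoints before each update."""

    def __init__(self, backup, mode):
        if mode not in ("state", "delta"):
            raise ValueError("mode must be 'state' or 'delta'")
        self.balances = {}
        self.backup = backup
        self.mode = mode

    def deposit(self, account, amount):
        new_balance = self.balances.get(account, 0) + amount
        # Checkpoint to the backup prior to the major event (the update).
        if self.mode == "state":
            self.backup.state_checkpoint(account, new_balance)
        else:
            self.backup.delta_checkpoint(account, amount)
        self.balances[account] = new_balance


backup = Backup()
primary = Primary(backup, mode="delta")
primary.deposit("alice", 10)
primary.deposit("alice", 5)
print(backup.balances == primary.balances)
```

In delta mode the backup never copies a value computed by the primary, it only replays the logical operation, which is one way to read the slide's point about increased isolation.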
  7. Fault Tolerant Execution
    Persistent process-pairs + transactions:
    • Transactions group operations into a consistent transformation of state
    • They allow programmers to reset everything back to the beginning of the transaction (undo changes)
    • Combined with persistent process-pairs, they cover hardware faults and Heisenbugs, improving robustness
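The "reset everything back to the beginning of the transaction" idea can be sketched with an in-memory undo log. This is a minimal sketch, not a real transaction manager: the `Transaction` class is hypothetical and covers only a single dictionary, with no durability or concurrency control.

```python
_MISSING = object()  # marks keys that did not exist before the transaction


class Transaction:
    """Hypothetical transaction over a dict, with an undo log for rollback."""

    def __init__(self, store):
        self.store = store
        self.undo_log = []

    def write(self, key, value):
        # Record the old value first so the change can be undone later.
        self.undo_log.append((key, self.store.get(key, _MISSING)))
        self.store[key] = value

    def rollback(self):
        # Undo the writes in reverse order.
        for key, old_value in reversed(self.undo_log):
            if old_value is _MISSING:
                self.store.pop(key, None)
            else:
                self.store[key] = old_value
        self.undo_log.clear()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, traceback):
        if exc_type is not None:
            self.rollback()  # abort: restore the state at transaction start
        else:
            self.undo_log.clear()  # commit: keep the changes
        return False  # let the exception propagate to the caller


accounts = {"a": 100, "b": 0}
try:
    with Transaction(accounts) as tx:
        tx.write("a", 50)
        raise RuntimeError("simulated failure between the two updates")
except RuntimeError:
    pass
print(accounts)
```

The failure happens after the first write but before the second, and the rollback leaves the accounts exactly as they were, so the caller can simply retry the whole transaction.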
  8. Fault Tolerant Communication
    • Communication is the most unreliable part of a distributed system
    • Timeouts, message sequence numbers and sessions let process-pairs work properly
    • Interaction between transactions and sessions
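How sequence numbers let a session detect duplicated and lost messages can be sketched from the receiver's side. This is an illustrative sketch only: `Session` and its return values are hypothetical, and timeouts and retransmission by the sender are left out.

```python
class Session:
    """Hypothetical receiver side of a session with message sequence numbers."""

    def __init__(self):
        self.next_expected = 0
        self.delivered = []

    def receive(self, seq, payload):
        if seq < self.next_expected:
            # Retransmission of something already delivered: do not redeliver.
            return "duplicate"
        if seq > self.next_expected:
            # An earlier message is missing: the sender has to resend it.
            return "gap"
        self.delivered.append(payload)
        self.next_expected += 1
        return "ack"


session = Session()
print(session.receive(0, "debit"))   # first delivery
print(session.receive(0, "debit"))   # sender timed out and resent
print(session.receive(2, "audit"))   # message 1 was lost in transit
print(session.receive(1, "credit"))  # resend of the lost message
print(session.delivered)
```

Because duplicates are recognised by their sequence number, a sender (or a backup process that takes over) can safely resend after a timeout without an operation being applied twice.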
  9. Fault Tolerant Storage
    • Remote replicas with different administrators, hardware and environments protect against 75% of failures (all non-software failures)
    • Use transactions to coordinate data updates, ensuring that all or none of them apply
    • Partition data among disks and nodes to limit the scope of failures
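The way partitioning limits the scope of a failure can be sketched with a simple hash partitioner. This is only an illustration under stated assumptions: the report does not prescribe a partitioning scheme, CRC32 is an arbitrary choice of stable hash, and `partition_for` and `keys_affected` are hypothetical helpers.

```python
import zlib


def partition_for(key: str, partitions: int) -> int:
    """Map a key to a partition using a stable hash (CRC32, an arbitrary choice)."""
    if partitions <= 0:
        raise ValueError("partitions must be positive")
    return zlib.crc32(key.encode("utf-8")) % partitions


def keys_affected(keys, partitions, failed_partition):
    """Return only the keys stored on the failed partition."""
    return [k for k in keys if partition_for(k, partitions) == failed_partition]


keys = [f"account-{i}" for i in range(100)]
lost = keys_affected(keys, partitions=4, failed_partition=0)
print(f"{len(lost)} of {len(keys)} keys are unavailable while partition 0 is down")
```

Only the keys that map to the failed disk or node become unavailable; the rest of the data stays reachable.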
  10. Wrapping Up
    • Hardware is a minor contributor to system outages *
    • Applying modularity, defensive programming and process-pairs, and tolerating soft faults (Heisenbugs), increases a system's MTBF
    • Resumable sessions provide fault-tolerant communication
    • Transaction atomicity coordinates data changes and provides fault-tolerant storage
    * "in the future hardware will be even more reliable due to better design, increased levels of integration, and reduced numbers of connectors" (Gray, 1985, p. 12).
  11. Related Work
    [Mourad] Mourad, S. and Andrews, D., "The Reliability of the Operating System", Digest of 15th Annual Int. Sym. on Fault-Tolerant Computing, June 1985. IEEE Computer Society Press.
    [Adams] Adams, E., "Optimizing Preventative Service of Software Products", IBM J. Res. and Dev., Vol. 28, No. 1, Jan. 1984.
    [Borg] Borg, A., Baumbach, J., Glazer, S., "A Message System Supporting Fault-tolerance", ACM OS Review, Vol. 17, No. 5, 1984.
  12. Final Evaluation
    The report provides valuable insights into the design of fault-tolerant systems. Modern approaches to building resilient distributed systems, such as Circuit Breaker, Distributed Caching and Read Replicas, share the same goals as the techniques presented for fault-tolerant execution, communication and storage.