Why do computers stop and what can be done about it?

Presentation of Jim Gray's technical report "Why Do Computers Stop and What Can Be Done About It?", given at the Advanced Operating Systems seminar at DCC/UFMG.

Presentation (PT-BR): https://www.youtube.com/watch?v=sfv0QwcWlTU
Original report: https://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf

Lucas Bleme

March 14, 2021

Transcript

  1. Advanced Operating Systems
    Lucas Bleme
    Why Do Computers Stop and What Can Be Done About It?

  2. Agenda
    Problem (motivation)
    Fault Tolerant Execution
    Fault Tolerant Communication
    Fault Tolerant Storage
    Related Work
    Final Evaluation

  3. High availability systems
    ● Patient monitoring, air traffic control, online transaction processing, geo navigation
    ● Parts of the system may fail, but the system as a whole must tolerate those failures
    ● Availability != Reliability
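
The difference between the two can be illustrated with a small calculation. This is a minimal sketch, assuming the usual ratio availability = MTBF / (MTBF + MTTR), where MTTR is the mean time to repair; the two systems and their numbers are hypothetical and do not come from the report.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical system A: fails rarely, but each repair is slow.
a = availability(mtbf_hours=10_000, mttr_hours=10)

# Hypothetical system B: fails 100x more often, but is repaired 100x faster.
b = availability(mtbf_hours=100, mttr_hours=0.1)

print(f"A: {a:.4%}  B: {b:.4%}")
```

Both systems are available roughly 99.9% of the time, yet B fails a hundred times more often than A, so it is far less reliable.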

  4. Experiment
    ● Study of system failures reported over 7 months, covering more than 2000 Tandem
    systems and over 10 million system hours
    ● 166 failures reported; MTBF (mean time between failures) of 11 years
    ● The reported failures exclude power outages shorter than 4 hours and "infant mortality" failures

  5. Source: https://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf

  6. Source: https://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf

  7. Overall solution
    ● Reduce administrative mistakes by building self-configuring systems that need minimal
    maintenance and operator interaction
    ● Avoid immature products that still suffer from "infant mortality"
    ● Build fault-tolerant software (execution, communication and storage)

  8. Fault Tolerant Execution
    ● Decompose large systems into modules, each a unit of service and of failure
    ● Fail fast: a module either functions correctly or stops operating (defensive programming)
    ● Share no state between processes; they communicate only through messages
    carried by a message system
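
The three ideas above can be sketched together. This is an illustrative sketch only: `debit`, `serve` and `FailFastError` are hypothetical names, and in-process `queue.Queue` objects stand in for a real message system.

```python
import queue


class FailFastError(Exception):
    """Raised when a module detects that it cannot operate correctly."""


def debit(balance: int, amount: int) -> int:
    """Fail-fast module: check inputs and results, stop instead of guessing."""
    if amount <= 0:
        raise FailFastError(f"invalid amount: {amount}")
    new_balance = balance - amount
    if new_balance < 0:
        raise FailFastError("insufficient funds")
    return new_balance


def serve(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Process requests until a None sentinel arrives or a fault is detected.

    The balance is private to this function; the only way to reach it is
    by sending a message through `inbox`.
    """
    balance = 100
    while (request := inbox.get()) is not None:
        try:
            balance = debit(balance, request)
        except FailFastError as err:
            outbox.put(("failed", str(err)))
            return  # stop operating rather than continue in a doubtful state
        outbox.put(("ok", balance))


inbox, outbox = queue.Queue(), queue.Queue()
for message in (30, 500, 10, None):
    inbox.put(message)
serve(inbox, outbox)
replies = [outbox.get() for _ in range(outbox.qsize())]
print(replies)
```

The second request cannot be satisfied, so the module reports the failure and stops; the third request is never processed.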

  9. Fault Tolerant Execution
    ● Take advantage of the "Heisenbug" hypothesis (most production software
    faults are transient) by re-executing failing components
    ● Functional Recovery Routines (FRRs) recovered system execution in 76%
    of the cases in MVS/XA, increasing its MTBF by a factor of 4
    Source: https://dl.acm.org/doi/abs/10.1109/TSE.1987.232855
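
Recovery by re-execution can be sketched as a retry wrapper. This is a hypothetical illustration, not the FRR mechanism of MVS/XA: `TransientFault`, `FlakyComponent` and `run_with_retry` are invented names, and the transient fault is simulated by failing a fixed number of times.

```python
class TransientFault(Exception):
    """A fault that is expected to disappear when the operation is re-run."""


class FlakyComponent:
    """Simulated component that fails a fixed number of times, then succeeds."""

    def __init__(self, failures_before_success: int) -> None:
        self.remaining = failures_before_success
        self.calls = 0

    def __call__(self) -> str:
        self.calls += 1
        if self.remaining > 0:
            self.remaining -= 1
            raise TransientFault("simulated Heisenbug")
        return "done"


def run_with_retry(operation, max_attempts: int = 3):
    """Re-execute `operation` until it succeeds or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientFault:
            if attempt == max_attempts:
                raise  # the fault persisted, so it is not transient after all


component = FlakyComponent(failures_before_success=2)
result = run_with_retry(component, max_attempts=3)
print(result, component.calls)
```

A fault that survives every attempt is re-raised, so a deterministic bug still stops the component instead of being retried forever.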

  10. Fault Tolerant Execution
    Process-pair design approaches:
    ● Lockstep: the primary and backup processes synchronously execute
    the same instruction stream on independent processors
    ● State checkpoint: the primary process sends its state changes to the
    backup process prior to each major event
    ● Delta checkpoint: the primary sends logical rather than physical updates
    to the backup, improving performance and increasing isolation between the two
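
A delta-checkpointing pair can be sketched in a few lines. This is a simplified, single-process illustration: the `Primary` and `Backup` classes and the `take_over` helper are hypothetical, and a direct method call stands in for the checkpoint message.

```python
class Backup:
    """Holds a copy of the primary's state, built only from received deltas."""

    def __init__(self) -> None:
        self.state: dict = {}

    def apply_delta(self, delta: dict) -> None:
        self.state.update(delta)


class Primary:
    """Serves requests and checkpoints each change to its backup first."""

    def __init__(self, backup: Backup) -> None:
        self.state: dict = {}
        self.backup = backup

    def handle(self, key: str, value: int) -> int:
        delta = {key: value}            # logical change, not a full state copy
        self.backup.apply_delta(delta)  # checkpoint before applying locally
        self.state.update(delta)
        return value


def take_over(backup: Backup) -> Primary:
    """Promote the backup after the primary fails and give it a fresh backup."""
    new_backup = Backup()
    new_backup.state = dict(backup.state)
    promoted = Primary(new_backup)
    promoted.state = dict(backup.state)
    return promoted


backup = Backup()
primary = Primary(backup)
primary.handle("x", 1)
primary.handle("y", 2)

# Simulate the loss of the primary: only the backup's state survives.
del primary
promoted = take_over(backup)
promoted.handle("z", 3)
print(promoted.state)
```

Because only deltas cross the boundary, the backup never sees the primary's memory directly, which is the isolation the slide refers to.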

  11. Fault Tolerant Execution
    Persistent process-pairs + transactions:
    ● A transaction groups operations into a single consistent transformation
    of state
    ● Programmers can reset everything back to the beginning of the
    transaction (undo changes)
    ● Combined with persistent process-pairs, transactions cover hardware faults
    and Heisenbugs, improving robustness
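
The undo behaviour can be sketched with an in-memory dictionary. This is a minimal sketch under strong assumptions: the `transaction` helper is hypothetical, state lives in a single process, and a snapshot replaces the logging a real transaction manager would use.

```python
import copy
from contextlib import contextmanager


@contextmanager
def transaction(state: dict):
    """Apply a group of changes to `state`, or none of them if an error occurs."""
    snapshot = copy.deepcopy(state)
    try:
        yield state
    except Exception:
        # Undo: reset everything back to the beginning of the transaction.
        state.clear()
        state.update(snapshot)
        raise


accounts = {"alice": 100, "bob": 0}

# This transfer fails halfway through, so both accounts are restored.
try:
    with transaction(accounts) as tx:
        tx["alice"] -= 150
        if tx["alice"] < 0:
            raise ValueError("insufficient funds")
        tx["bob"] += 150
except ValueError:
    pass
after_abort = dict(accounts)

# This transfer succeeds, so both changes are kept.
with transaction(accounts) as tx:
    tx["alice"] -= 40
    tx["bob"] += 40

print(after_abort, accounts)
```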

  12. Fault Tolerant Communication
    ● Communication is the most unreliable part of a distributed system
    ● Timeouts, message sequence numbers and sessions let process-pairs
    work properly
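    ● Transactions interact with sessions: when a transaction aborts, the session
    sequence number is reset to its value at the start of the transaction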
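
The role of sequence numbers can be sketched from the receiver's side. This is an illustrative sketch: the `Session` class and its return values are hypothetical, and sender timeouts are simulated by calling `receive` again with the same sequence number.

```python
class Session:
    """Receiver side of a session: accepts each sequence number once, in order."""

    def __init__(self) -> None:
        self.expected = 0
        self.delivered: list = []

    def receive(self, seq: int, payload: str) -> str:
        if seq < self.expected:
            return "duplicate"  # retransmission of a message already delivered
        if seq > self.expected:
            return "gap"        # an earlier message was lost; ask for a resend
        self.delivered.append(payload)
        self.expected += 1
        return "ack"


session = Session()
results = [
    session.receive(0, "a"),  # first delivery
    session.receive(0, "a"),  # sender timed out waiting for the ack and resent
    session.receive(2, "c"),  # message 1 was lost on the way
    session.receive(1, "b"),  # the lost message is resent
    session.receive(2, "c"),  # now message 2 can be accepted
]
print(results, session.delivered)
```

Each payload is delivered exactly once and in order, even though one message was duplicated and another was lost.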

  13. Fault Tolerant Storage
    ● Remote replicas with different administrators, hardware and environments
    protect against 75% of the failures (all non-software failures)
    ● Use transactions to coordinate data updates, ensuring that all or none of
    them apply
    ● Partition data among discs and nodes to limit the scope of failures
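
How partitioning limits the scope of a failure can be sketched with a simple hash placement. This is a hypothetical illustration: the node names, the `node_for` placement rule and the `affected_keys` helper are assumptions made for the example and are not taken from the report.

```python
import zlib

NODES = ["node-0", "node-1", "node-2"]  # hypothetical node names


def node_for(key: str) -> str:
    """Place a key on a node using a stable hash of the key."""
    return NODES[zlib.crc32(key.encode("utf-8")) % len(NODES)]


def affected_keys(keys: list, failed_node: str) -> list:
    """Keys that become unavailable when `failed_node` goes down."""
    return [key for key in keys if node_for(key) == failed_node]


keys = [f"account-{i}" for i in range(12)]
for node in NODES:
    lost = affected_keys(keys, node)
    print(f"{node} down: {len(lost)} of {len(keys)} keys unavailable")
```

A node failure makes only the keys placed on that node unavailable; keys on the other nodes remain reachable.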

  14. Wrapping Up
    ● Hardware is a minor contributor to system outages *
    ● Modularity, defensive programming, process-pairs and tolerance of soft
    faults (Heisenbugs) increase a system's MTBF
    ● Resumable sessions provide fault-tolerant communication
    ● Transaction atomicity coordinates data changes, providing fault-tolerant
    storage
    * "in the future hardware will be even more reliable due to better design, increased levels of integration, and reduced numbers of connectors" (GRAY, 1985, p. 12).

  15. Related Work
    [Mourad] Mourad, S. and Andrews, D., "The Reliability of the Operating System", Digest of 15th
    Annual Int. Sym. on Fault Tolerant Computing, June 1985. IEEE Computer Society Press.
    [Adams] Adams, E., "Optimizing Preventative Service of Software Products", IBM J. Res. and Dev.,
    Vol. 28, No. 1, Jan. 1984.
    [Borg] Borg, A., Baumbach, J., Glazer, S., "A Message System Supporting Fault-tolerance", ACM
    OS Review, Vol. 17, No. 5, 1984.

  16. Final Evaluation
    The report provides great insights into the design of fault-tolerant systems.
    Modern approaches to building resilient distributed systems, such as Circuit
    Breaker, Distributed Caching and Read Replicas, share the goals of the
    techniques presented for fault-tolerant execution, communication and
    storage.

  17. Thank you! https://speakerdeck.com/andreybleme
    Lucas Bleme
