Reliability Engineering - Mary Poppendieck

The days of trying to prevent failures are gone. In today's high-volume, cloud-based systems, anything that can go wrong will eventually go wrong. It is far better to spend our time engineering fault tolerance than pursuing the impossible goal of fault prevention. Not only is reliability engineering one of the highest-paying jobs in software engineering today, it is also a job full of unique challenges that demand creative thinking and problem solving.

This talk is about the multiple aspects of reliability engineering that have become critically important as our world has become increasingly dependent on software systems.

DevOpsDays Zurich

May 14, 2019

Transcript

1. www.poppendieck.com
    [email protected]
    Mary Poppendieck
    Photo © Tom Poppendieck

4. Community Dial Office
    Credit: Connections Museum Seattle

8. Bell Telephone Laboratories: No. 2 ESS
    Electronic Switching System for Community Exchanges
    Design goal: cost-to-install and reliability equivalent to existing electromechanical systems
    => Reliability goal: maximum 2 hours downtime in 40 years
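
For scale, a quick back-of-envelope on that target (my arithmetic, not from the slides): two hours of downtime in 40 years is roughly five and a half 9's of availability.

```python
# Back-of-envelope: 2 hours of downtime allowed in 40 years.
hours_in_40_years = 40 * 365.25 * 24     # ~350,640 hours
availability = 1 - 2 / hours_in_40_years
print(f"{availability:.7f}")             # ~0.9999943 -- about five and a half 9's
```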

9. Redundancy
    Isolation

10. 1: Detect mismatch.
    2: Determine and isolate the faulty unit.
    3: Run diagnostics to isolate the fault to a small number of components.
    4: Print out a repair notification.
    5: Resume normal operation upon repair.
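
A minimal sketch of that five-step duplex recovery loop in Python. This is a toy model for illustration only: the real No. 2 ESS implemented it in hardware, and every name and component here is invented.

```python
# A toy model of the five-step duplex recovery loop; the real No. 2 ESS
# implemented this in hardware, and every name here is invented.

class Unit:
    """One half of a duplexed processor pair."""
    def __init__(self, name, faulty=False):
        self.name, self.faulty, self.isolated = name, faulty, False

    def step(self):
        return "bad" if self.faulty else "ok"   # a fault produces a mismatch

def diagnose(unit):
    """Step 3: narrow the fault to a small set of components (stubbed)."""
    return ["card-7", "card-12"] if unit.faulty else []

def recover(active, standby):
    if active.step() == standby.step():
        return                                        # step 1: no mismatch
    faulty = active if diagnose(active) else standby
    faulty.isolated = True                            # step 2: isolate faulty unit
    suspects = diagnose(faulty)                       # step 3: run diagnostics
    print(f"repair {faulty.name}: check {suspects}")  # step 4: repair notification
    faulty.faulty, faulty.isolated = False, False     # step 5: resume upon repair

recover(Unit("A"), Unit("B", faulty=True))
```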

11. In 1990, more than half of AT&T's network crashed after a switch at a switching center suffered a minor problem and shut down the center. When the center came back up, it sent a message to other centers, causing them to trip, shut down, and reset. This continued for 9 hours!
    Redundancy
    Isolation

12. "A Case for Redundant Arrays of Inexpensive Disks (RAID)"
    David A. Patterson, Garth Gibson, and Randy H. Katz
    ACM SIGMOD Conference, June 1988
    Photo: Creative Commons
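
The paper's core idea: stripe data across inexpensive disks and keep parity so any single failed disk can be rebuilt from the survivors. A minimal sketch of XOR parity (my illustration, not the paper's notation):

```python
# Minimal illustration of RAID-style XOR parity (not from the paper).
data_disks = [b"\x01\x02", b"\x0f\x10", b"\xaa\xbb"]

# Parity disk: byte-wise XOR of all data disks.
parity = bytes(a ^ b ^ c for a, b, c in zip(*data_disks))

# If disk 1 fails, rebuild it from the surviving disks plus parity.
rebuilt = bytes(a ^ c ^ p for a, c, p in zip(data_disks[0], data_disks[2], parity))
assert rebuilt == data_disks[1]
```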

13. Jeff Dean and Sanjay Ghemawat join Google (from DEC).
    Responsible for core infrastructure: Google File System and MapReduce.
    "Ultimately, it was this frustration of being one level removed from real users using my work that led me to go to a startup." …Jeff Dean*
    Photos: Jeff Dean, Sanjay Ghemawat; Google hardware running BigTable (Computer History Museum)
    * Jeff Dean, LADIS 2009 Keynote

14. Enterprise-size installations (~1000 servers)
    Main cost is people; roughly 1 person : 100 servers
    # of people grows linearly with servers
    => Consolidate work onto fewer, larger systems
    Internet-scale installations (clusters of ~1000 servers)
    Costs are ~6-7x lower than enterprise-size installations
    Main cost is hardware and power; people are ~5-10% of total cost
    => Scale out over up: more, smaller, commodity components
    * James Hamilton, LADIS 2008 Keynote
    Photo © Tom Poppendieck
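
A rough back-of-envelope of Hamilton's point. The ratios come from the slide; the dollar figures are invented purely for the sketch:

```python
# Illustrative numbers only; ratios from the slide, dollar figures invented.
servers = 1000
salary = 150_000            # per admin-year (assumed)
server_cost = 3_000         # amortized hardware+power per server-year (assumed)

# Enterprise: ~1 admin per 100 servers.
enterprise_people = servers / 100 * salary
enterprise_total = enterprise_people + servers * server_cost
print(f"enterprise people share: {enterprise_people / enterprise_total:.0%}")   # ~33%

# Internet scale: automation pushes the ratio to ~1 admin per 1000 servers,
# which lands people in the slide's ~5-10% band.
scale_people = servers / 1000 * salary
scale_total = scale_people + servers * server_cost
print(f"internet-scale people share: {scale_people / scale_total:.0%}")         # ~5%
```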

15. Architecture
    Redundant (only gets you four 9's)
    Partitioned (aggressively limit blast radius)
    Decomposed into small services with decoupled deployment
    Monitoring, profiling, debugging hooks at all levels
    Practices
    Heavily instrumented applications to detect failure
    Canary releases
    Failover to replicas / other datacenters
    Bad-backend detection and isolation
    Easy-to-use design patterns and abstractions for applications
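
The "only gets you four 9's" caveat follows from the arithmetic of independent replicas: n replicas of availability a give 1 - (1 - a)^n, and correlated failures keep real systems from climbing much higher. A quick sketch (the 99% per-replica figure is an assumption for illustration):

```python
# Availability of n independent replicas, each available a fraction of the time.
def redundant_availability(a: float, n: int) -> float:
    return 1 - (1 - a) ** n

for n in range(1, 4):
    print(n, f"{redundant_availability(0.99, n):.6f}")
# 1 -> 0.990000, 2 -> 0.999900 (four 9's), 3 -> 0.999999
# In practice failures are correlated, so real systems fall short of these numbers.
```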

16. Ben Treynor joined Google to lead a team of seven software engineers running a production environment.
    Site Reliability Engineering: "What happens when a software engineer is tasked with what used to be called operations."*
    Goal: Eliminate toil – work that is manual, repetitive, tactical, devoid of enduring value, and that scales linearly as the service grows.
    Goal: Pursue maximum change velocity without violating service level objectives.
    * Ben Treynor, Google VP, head of the first SREs

17. Monitoring
    Emergency Response
    Change Management
    Capacity Planning
    Availability
    Latency
    Performance
    Efficiency
    Photo © Tom Poppendieck
    * Jim Ostergaard, VP Operations, Target, Opstoberfest 2018

18. The Error Budget – A Systems Engineering Approach
    Photo © Tom Poppendieck
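
The arithmetic behind an error budget, sketched in Python. The SLO value and request counts below are invented for the example: a 99.9% availability SLO leaves a 0.1% budget of failed requests, releases continue while the budget lasts, and freeze when it is spent.

```python
# Hypothetical error-budget bookkeeping; the 99.9% SLO and counts are invented.
slo = 0.999
budget_fraction = 1 - slo                 # 0.1% of requests may fail this window

requests_this_window = 50_000_000
failed_this_window = 32_000

budget = budget_fraction * requests_this_window   # 50,000 failures allowed
remaining = budget - failed_this_window

print(f"budget remaining: {remaining:,.0f} failures "
      f"({remaining / budget:.0%} of the window's budget)")
if remaining <= 0:
    print("budget spent: freeze releases, prioritize reliability work")
```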

19. Fitness Function: an objective function that is used to summarize how close a given design solution is to achieving its set aims.
    Example 1: Cyclomatic Complexity
    Measure the number of decisions in a set of code.
    Example 2: Cyclic Dependency
    Test that a package dependency cycle does not exist.
    Credit: Neal Ford – Evolutionary Architectures
    Photo © Tom Poppendieck
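
A minimal sketch of Example 2 as an automated fitness function. The package names and the hand-declared dependency graph are hypothetical; a real pipeline would extract the graph from the codebase and run this check on every build.

```python
# Hypothetical fitness function: fail the build if package deps form a cycle.
deps = {
    "web": ["service"],
    "service": ["storage"],
    "storage": [],          # add "web" here and the check fails
}

def has_cycle(graph):
    visiting, done = set(), set()
    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True      # back edge found: a cycle exists
        visiting.add(node)
        if any(visit(d) for d in graph.get(node, [])):
            return True
        visiting.remove(node)
        done.add(node)
        return False
    return any(visit(n) for n in graph)

assert not has_cycle(deps), "fitness function failed: dependency cycle"
```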

21. Chaos Engineering
    Building Confidence in System Behavior through Experiments
    Building Expertise in Emergency Response through Practice
    Photo © Tom Poppendieck
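
The shape of a chaos experiment, sketched in Python. Every probe, fault, and threshold here is invented: define steady state, inject a fault, verify the system stays within bounds, and always roll the fault back.

```python
# Skeleton of a chaos experiment; the probe, fault, and threshold are invented.

def steady_state_ok():
    """Probe: a real experiment would query an SLI such as the error rate."""
    error_rate = 0.0004          # stand-in for a metrics-system query
    return error_rate < 0.001    # hypothesis: error rate stays under 0.1%

def inject_fault():
    print("terminating one replica (simulated)")

def rollback():
    print("restoring the replica (simulated)")

def run_experiment():
    assert steady_state_ok(), "abort: system unhealthy before the experiment"
    inject_fault()
    try:
        # Did redundancy and failover absorb the fault?
        print("hypothesis held" if steady_state_ok()
              else "hypothesis refuted: weakness found before users hit it")
    finally:
        rollback()               # always clean up, pass or fail

run_experiment()
```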

22. Product (Continuous Integration)
    Ops (Continuous Delivery)
    Architecture (Pipeline Fitness Function)
    Infrastructure (Production Fitness Function)

23. Pros:
    Responsible for User Experience
    High Pay (~50% above SW Engineer)
    High Demand for People
    Expanding New Field
    Broad Skills Required
    Significant Autonomy
    Cons:
    Not Responsible for Product
    On Call
    Very Demanding Job
    Fluid Job Description
    Expertise may not be Portable
    Significant Exposure

24. Reliable Systems
    Trustworthy: Work as Expected
    Fault Tolerant: Limited Blast Radius
    Resilient: Rapid Recovery
    Available: Scalable Capacity
    Safe: Do No Harm
    Secure: Not Vulnerable to Attack
    Durable: Sustainable Over Time
    Photo © Tom Poppendieck

25. Mary Poppendieck
    [email protected]
    www.poppendieck.com
    Photo © Tom Poppendieck