Reliability Engineering - Mary Poppendieck

The days of trying to prevent failures are gone. In today’s high-volume, cloud-based systems, anything that can go wrong will eventually go wrong. It is far better to spend our time engineering fault tolerance than pursuing the impossible goal of fault prevention. Reliability engineering is not only one of the highest-paying jobs in software engineering today; it is also a job full of unique challenges that demand creative thinking and problem-solving.

This talk is about the multiple aspects of reliability engineering that have become critically important as our world has become increasingly dependent on software systems.

DevOpsDays Zurich

May 14, 2019

  1. Bell Telephone Laboratories: No. 2 ESS Electronic Switching System for Community Exchanges

    Design goal: cost-to-install and reliability equivalent to existing electromechanical systems
    => Reliability goal: maximum 2 hours of downtime in 40 years
    Copyright©2019 Poppendieck.LLC
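To put the reliability goal on the slide above in more familiar terms, the following minimal sketch converts "2 hours of downtime in 40 years" into an availability figure and a count of "nines". The function names and the 365.25-day average year are assumptions made for this illustration; they do not come from the talk.

```python
import math

HOURS_PER_YEAR = 365.25 * 24  # assumption: average year length, including leap days


def availability(downtime_hours: float, period_years: float) -> float:
    """Fraction of the period during which the system is up."""
    total_hours = period_years * HOURS_PER_YEAR
    return 1.0 - downtime_hours / total_hours


def nines(avail: float) -> float:
    """Number of 'nines' of availability, e.g. 0.999 is about 3 nines."""
    return -math.log10(1.0 - avail)


# The No. 2 ESS goal quoted on the slide: 2 hours of downtime in 40 years.
goal = availability(downtime_hours=2, period_years=40)
print(f"availability: {goal:.5%}")
print(f"nines: {nines(goal):.2f}")
print(f"downtime allowance per year: {2 / 40 * 60:.0f} minutes")
```

Under these assumptions the goal works out to a little over five nines, or about three minutes of downtime per year.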
  2. 1: Detect mismatch.

    2: Determine and isolate the faulty unit.
    3: Run diagnostics to isolate the fault to a small number of components.
    4: Print out repair notification.
    5: Resume normal operation upon repair.
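The five steps on the slide above can be illustrated with a toy model of two duplicated units that run the same computation and compare results. This is only a sketch of the general duplicate-and-compare idea, not a description of how the No. 2 ESS was actually built; `Unit`, `DuplexPair`, `self_test`, and the doubling workload are all hypothetical names invented for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Unit:
    """Hypothetical processing unit; `compute` stands in for real work."""
    name: str
    compute: Callable[[int], int]
    in_service: bool = True

    def self_test(self) -> bool:
        # Toy diagnostic: a healthy unit doubles a known input.
        return self.compute(21) == 42


@dataclass
class DuplexPair:
    a: Unit
    b: Unit
    log: List[str] = field(default_factory=list)

    def _isolate(self, units: List[Unit]) -> Optional[Unit]:
        # Steps 2 and 3: run diagnostics and take the failing unit out of service.
        for unit in units:
            if not unit.self_test():
                unit.in_service = False
                return unit
        return None

    def run(self, x: int) -> int:
        active = [u for u in (self.a, self.b) if u.in_service]
        if not active:
            raise RuntimeError("no unit in service")
        results = [u.compute(x) for u in active]
        # Step 1: detect a mismatch between the duplicated units.
        if len(active) == 2 and results[0] != results[1]:
            faulty = self._isolate(active)
            if faulty is not None:
                # Step 4: record a repair notification.
                self.log.append(f"REPAIR: {faulty.name}")
            active = [u for u in active if u.in_service]
            results = [u.compute(x) for u in active]
        return results[0]

    def repair(self, unit: Unit, compute: Callable[[int], int]) -> None:
        # Step 5: the repaired unit rejoins and duplex operation resumes.
        unit.compute = compute
        unit.in_service = True


pair = DuplexPair(Unit("A", lambda x: 2 * x), Unit("B", lambda x: 2 * x + 1))
print(pair.run(5), pair.log)
```

The service keeps answering from the healthy unit while the faulty one is out of service, which is the point of the design: a fault becomes a repair ticket rather than an outage.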
  3. In 1990, more than half of AT&T’s network crashed after a switch at a switching center suffered a minor problem and shut down the center.

    When the center came back up, it sent a message to other centers, causing them to trip, shut down, and reset. This continued for 9 hours!
    Redundancy. Isolation.
  4. A Case for Redundant Arrays of Inexpensive Disks (RAID)

    David A. Patterson, Garth Gibson, and Randy H. Katz
    ACM SIGMOD Conference, June 1988
  5. Jeff Dean and Sanjay Ghemawat join Google (from DEC). Responsible for core infrastructure.

    Google File System and MapReduce
    “Ultimately, it was this frustration of being one level removed from real users using my work that led me to go to a startup.” …Jeff Dean
    Photo captions: Computer History Museum; Google Hardware; Running BigTable; Jeff Dean; Sanjay Ghemawat
    * Jeff Dean, LADIS 2009 Keynote
  6. Enterprise Size Installations (~1000 servers)

    Main cost is people; roughly 1 person : 100 servers
    # of people grows linearly with servers
    => Consolidate work onto fewer, larger systems

    Internet Scale Installations (clusters of ~1000 servers)
    Costs are ~6-7X lower than Enterprise Size Installations
    Main cost is hardware and power; people ~5-10% of total cost
    => Scale out over up: more smaller, commodity components

    * James Hamilton, LADIS 2008 Keynote
  7. Architecture

    Redundant (only gets you four 9’s)
    Partitioned (aggressively limit blast radius)
    Decomposed into small services with decoupled deployment
    Monitoring, profiling, debugging hooks at all levels

    Practices
    Heavily instrumented applications to detect failure
    Canary releases
    Failover to replicas / other datacenters
    Bad backend detection and isolation
    Easy-to-use design patterns and abstractions for applications
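As one illustration of "bad backend detection and isolation" from the slide above, the sketch below ejects a backend from the routing pool after a run of consecutive failures and re-admits it after a cooldown. The slide does not describe a specific mechanism, so the class names, thresholds, and cooldown policy are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class BackendHealth:
    consecutive_failures: int = 0
    ejected_until: float = 0.0  # timestamp until which the backend is skipped


class BackendPool:
    """Hypothetical pool that isolates backends which keep failing."""

    def __init__(self, backends: List[str], max_failures: int = 3,
                 cooldown_s: float = 30.0) -> None:
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.health: Dict[str, BackendHealth] = {b: BackendHealth() for b in backends}

    def record(self, backend: str, ok: bool, now: float) -> None:
        """Record the outcome of one request sent to `backend` at time `now`."""
        h = self.health[backend]
        if ok:
            h.consecutive_failures = 0
            return
        h.consecutive_failures += 1
        if h.consecutive_failures >= self.max_failures:
            # Isolate: stop routing to this backend until the cooldown expires.
            h.ejected_until = now + self.cooldown_s
            h.consecutive_failures = 0

    def available(self, now: float) -> List[str]:
        """Backends that may receive traffic at time `now`."""
        return [b for b, h in self.health.items() if now >= h.ejected_until]


pool = BackendPool(["a", "b"], max_failures=2, cooldown_s=10.0)
pool.record("b", ok=False, now=0.0)
pool.record("b", ok=False, now=1.0)
print(pool.available(now=1.0))
```

Time is passed in explicitly rather than read from a clock, which keeps the sketch deterministic and easy to test; a production version would also need health probes and a limit on how much of the pool can be ejected at once.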
  8. Ben Treynor joined Google to lead a team of seven software engineers running a production environment.

    Site Reliability Engineering: “What happens when a software engineer is tasked with what used to be called operations.”*
    Goal: Eliminate toil (work that is manual, repetitive, tactical, devoid of enduring value, and that scales linearly as the service grows).
    Goal: Pursue maximum change velocity without violating service level objectives.
    * Ben Treynor, Google VP, head of first SREs
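One common way to balance change velocity against a service level objective is an error budget: the downtime an SLO permits over a window. The slide above does not name this technique, so the sketch below is an assumption about how the second goal might be put into practice, and the 30-day window and function names are illustrative.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime (in minutes) that an availability SLO allows over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def can_release(slo: float, downtime_minutes: float, window_days: int = 30) -> bool:
    """True while the downtime already spent is still within the budget."""
    return downtime_minutes < error_budget_minutes(slo, window_days)


# Example: a 99.9% availability SLO over a 30-day window.
budget = error_budget_minutes(0.999)
print(f"budget: {budget:.1f} minutes")
print("release allowed after 20 min of downtime:", can_release(0.999, 20))
print("release allowed after 50 min of downtime:", can_release(0.999, 50))
```

While budget remains, the team can keep shipping changes; once it is spent, the same rule argues for slowing releases and investing in reliability.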
  9. Monitoring, Emergency Response, Change Management, Capacity Planning

    Availability, Latency, Performance, Efficiency
    * Jim Ostergaard, VP Operations, Target, Opstoberfest 2018
  10. Fitness function: an objective function that is used to summarize how close a given design solution is to achieving its set aims.

    Example 1: Cyclomatic Complexity. Measure the number of decisions in a set of code.
    Example 2: Cyclic Dependency. Test that a package dependency cycle does not exist.
    Credit: Neal Ford, Evolutionary Architectures
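The two examples on the slide above can each be written as an automated check. The sketch below is a minimal illustration, not code from the talk or from Neal Ford's work: the dependency graph is a hand-written dictionary standing in for real package metadata, and the complexity count is a simplified approximation built on Python's `ast` module.

```python
import ast
from typing import Dict, List


def has_dependency_cycle(deps: Dict[str, List[str]]) -> bool:
    """Depth-first search; a back edge to a node on the current path is a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color: Dict[str, int] = {}

    def visit(node: str) -> bool:
        color[node] = GRAY
        for nxt in deps.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GRAY:
                return True
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in deps)


def cyclomatic_complexity(source: str) -> int:
    """Approximate complexity: one plus the number of decision points."""
    decisions = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.If, ast.For, ast.While, ast.IfExp, ast.ExceptHandler)):
            decisions += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two extra branches.
            decisions += len(node.values) - 1
    return decisions + 1


# Hypothetical package graph: each key depends on the packages in its list.
packages = {"web": ["core"], "core": ["storage"], "storage": []}
print("cycle present:", has_dependency_cycle(packages))

SAMPLE = (
    "def f(x):\n"
    "    if x > 0 and x < 10:\n"
    "        return 1\n"
    "    for i in range(x):\n"
    "        pass\n"
    "    return 0\n"
)
print("complexity:", cyclomatic_complexity(SAMPLE))
```

Run in a build pipeline, a check like this fails the build when a cycle appears or when complexity crosses an agreed threshold, which turns an architectural aim into an objective, repeatable test.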
  11. Chaos Engineering

    Building Confidence in System Behavior through Experiments
    Building Expertise in Emergency Response through Practice
  12. Product (Continuous Integration)

    Ops (Continuous Delivery)
    Architecture (Pipeline Fitness Function)
    Infrastructure (Production Fitness Function)
  13. Pros: Responsible for User Experience; High Pay (~50% higher than software engineer); High Demand for People; Expanding New Field; Broad Skills Required; Significant Autonomy

    Cons: Not Responsible for Product; On Call; Very Demanding Job; Fluid Job Description; Expertise may not be Portable; Significant Exposure
  14. Reliable Systems

    Trustworthy: Work as Expected
    Fault Tolerant: Limited Blast Radius
    Resilient: Rapid Recovery
    Available: Scalable Capacity
    Safe: Do No Harm
    Secure: Not Vulnerable to Attack
    Durable: Sustainable Over Time