Reliability Engineering - Mary Poppendieck

The days of trying to prevent failures are gone. In today's high-volume, cloud-based systems, anything that can go wrong will eventually go wrong. It is far better to spend our time engineering fault tolerance than pursuing the impossible goal of fault prevention. Not only is reliability engineering one of the highest-paying jobs in software engineering today, it is also a job full of unique challenges that demand creative thinking and problem solving.

This talk is about the multiple aspects of reliability engineering that have become critically important as our world has become increasingly dependent on software systems.

DevOpsDays Zurich

May 14, 2019

Transcript

1. www.poppendieck.com
    [email protected]
    Mary Poppendieck
    Photo © Tom Poppendieck

4. Community Dial Office
    Credit: Connections Museum Seattle

8. Bell Telephone Laboratories: No. 2 ESS
    Electronic Switching System for Community Exchanges
    Design goal: cost-to-install and reliability equivalent to existing electromechanical systems
    => Reliability goal: maximum 2 hours downtime in 40 years
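
For scale, a quick back-of-envelope on that target (my arithmetic, not from the slides): two hours of downtime in 40 years is roughly five and a half 9's of availability.

```python
# Back-of-envelope: 2 hours of downtime allowed in 40 years.
hours_in_40_years = 40 * 365.25 * 24     # ~350,640 hours
availability = 1 - 2 / hours_in_40_years
print(f"{availability:.7f}")             # ~0.9999943 -- about five and a half 9's
```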

9. Redundancy
    Isolation

10. 1: Detect mismatch.
    2: Determine and isolate the faulty unit.
    3: Run diagnostics to isolate the fault to a small number of components.
    4: Print out a repair notification.
    5: Resume normal operation upon repair.
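
A minimal sketch of that five-step duplex recovery loop in Python. This is a toy model for illustration only: the real No. 2 ESS implemented it in hardware, and every name and component here is invented.

```python
# A toy model of the five-step duplex recovery loop; the real No. 2 ESS
# implemented this in hardware, and every name here is invented.

class Unit:
    """One half of a duplexed processor pair."""
    def __init__(self, name, faulty=False):
        self.name, self.faulty, self.isolated = name, faulty, False

    def step(self):
        return "bad" if self.faulty else "ok"   # a fault produces a mismatch

def diagnose(unit):
    """Step 3: narrow the fault to a small set of components (stubbed)."""
    return ["card-7", "card-12"] if unit.faulty else []

def recover(active, standby):
    if active.step() == standby.step():
        return                                        # step 1: no mismatch
    faulty = active if diagnose(active) else standby
    faulty.isolated = True                            # step 2: isolate faulty unit
    suspects = diagnose(faulty)                       # step 3: run diagnostics
    print(f"repair {faulty.name}: check {suspects}")  # step 4: repair notification
    faulty.faulty, faulty.isolated = False, False     # step 5: resume upon repair

recover(Unit("A"), Unit("B", faulty=True))
```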

11. In 1990, more than half of AT&T's network crashed after a switch at a switching center suffered a minor problem and shut down the center. When the center came back up, it sent a message to other centers, causing them to trip, shut down, and reset. This continued for 9 hours!
    Redundancy
    Isolation

12. "A Case for Redundant Arrays of Inexpensive Disks (RAID)"
    David A. Patterson, Garth Gibson, and Randy H. Katz
    ACM SIGMOD Conference, June 1988
    Photo: Creative Commons
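
The paper's core idea: stripe data across inexpensive disks and keep parity so any single failed disk can be rebuilt from the survivors. A minimal sketch of XOR parity (my illustration, not the paper's notation):

```python
# Minimal illustration of RAID-style XOR parity (not from the paper).
data_disks = [b"\x01\x02", b"\x0f\x10", b"\xaa\xbb"]

# Parity disk: byte-wise XOR of all data disks.
parity = bytes(a ^ b ^ c for a, b, c in zip(*data_disks))

# If disk 1 fails, rebuild it from the surviving disks plus parity.
rebuilt = bytes(a ^ c ^ p for a, c, p in zip(data_disks[0], data_disks[2], parity))
assert rebuilt == data_disks[1]
```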

13. Jeff Dean and Sanjay Ghemawat join Google (from DEC).
    Responsible for core infrastructure: Google File System and MapReduce.
    "Ultimately, it was this frustration of being one level removed from real users using my work that led me to go to a startup." …Jeff Dean*
    Photos: Jeff Dean, Sanjay Ghemawat; Google hardware running BigTable (Computer History Museum)
    * Jeff Dean, LADIS 2009 Keynote

14. Enterprise-size installations (~1000 servers)
    Main cost is people; roughly 1 person : 100 servers
    # of people grows linearly with servers
    => Consolidate work onto fewer, larger systems
    Internet-scale installations (clusters of ~1000 servers)
    Costs are ~6-7x lower than enterprise-size installations
    Main cost is hardware and power; people are ~5-10% of total cost
    => Scale out over up: more, smaller, commodity components
    * James Hamilton, LADIS 2008 Keynote
    Photo © Tom Poppendieck
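
A rough back-of-envelope of Hamilton's point. The ratios come from the slide; the dollar figures are invented purely for the sketch:

```python
# Illustrative numbers only; ratios from the slide, dollar figures invented.
servers = 1000
salary = 150_000            # per admin-year (assumed)
server_cost = 3_000         # amortized hardware+power per server-year (assumed)

# Enterprise: ~1 admin per 100 servers.
enterprise_people = servers / 100 * salary
enterprise_total = enterprise_people + servers * server_cost
print(f"enterprise people share: {enterprise_people / enterprise_total:.0%}")   # ~33%

# Internet scale: automation pushes the ratio to ~1 admin per 1000 servers,
# which lands people in the slide's ~5-10% band.
scale_people = servers / 1000 * salary
scale_total = scale_people + servers * server_cost
print(f"internet-scale people share: {scale_people / scale_total:.0%}")         # ~5%
```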

15. Architecture
    Redundant (only gets you four 9's)
    Partitioned (aggressively limit blast radius)
    Decomposed into small services with decoupled deployment
    Monitoring, profiling, debugging hooks at all levels
    Practices
    Heavily instrumented applications to detect failure
    Canary releases
    Failover to replicas / other datacenters
    Bad-backend detection and isolation
    Easy-to-use design patterns and abstractions for applications
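
The "only gets you four 9's" caveat follows from the arithmetic of independent replicas: n replicas of availability a give 1 - (1 - a)^n, and correlated failures keep real systems from climbing much higher. A quick sketch (the 99% per-replica figure is an assumption for illustration):

```python
# Availability of n independent replicas, each available a fraction of the time.
def redundant_availability(a: float, n: int) -> float:
    return 1 - (1 - a) ** n

for n in range(1, 4):
    print(n, f"{redundant_availability(0.99, n):.6f}")
# 1 -> 0.990000, 2 -> 0.999900 (four 9's), 3 -> 0.999999
# In practice failures are correlated, so real systems fall short of these numbers.
```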

16. Ben Treynor joined Google to lead a team of seven software engineers running a production environment.
    Site Reliability Engineering: "What happens when a software engineer is tasked with what used to be called operations."*
    Goal: Eliminate toil – work that is manual, repetitive, tactical, devoid of enduring value, and that scales linearly as the service grows.
    Goal: Pursue maximum change velocity without violating service level objectives.
    * Ben Treynor, Google VP, head of the first SREs

17. Monitoring
    Emergency Response
    Change Management
    Capacity Planning
    Availability
    Latency
    Performance
    Efficiency
    Photo © Tom Poppendieck
    * Jim Ostergaard, VP Operations, Target, Opstoberfest 2018

18. The Error Budget – A Systems Engineering Approach
    Photo © Tom Poppendieck
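
The arithmetic behind an error budget, sketched in Python. The SLO value and request counts below are invented for the example: a 99.9% availability SLO leaves a 0.1% budget of failed requests, releases continue while the budget lasts, and freeze when it is spent.

```python
# Hypothetical error-budget bookkeeping; the 99.9% SLO and counts are invented.
slo = 0.999
budget_fraction = 1 - slo                 # 0.1% of requests may fail this window

requests_this_window = 50_000_000
failed_this_window = 32_000

budget = budget_fraction * requests_this_window   # 50,000 failures allowed
remaining = budget - failed_this_window

print(f"budget remaining: {remaining:,.0f} failures "
      f"({remaining / budget:.0%} of the window's budget)")
if remaining <= 0:
    print("budget spent: freeze releases, prioritize reliability work")
```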

19. Fitness Function: an objective function that is used to summarize how close a given design solution is to achieving its set aims.
    Example 1: Cyclomatic Complexity
    Measure the number of decisions in a set of code.
    Example 2: Cyclic Dependency
    Test that a package dependency cycle does not exist.
    Credit: Neal Ford – Evolutionary Architectures
    Photo © Tom Poppendieck
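
A minimal sketch of Example 2 as an automated fitness function. The package names and the hand-declared dependency graph are hypothetical; a real pipeline would extract the graph from the codebase and run this check on every build.

```python
# Hypothetical fitness function: fail the build if package deps form a cycle.
deps = {
    "web": ["service"],
    "service": ["storage"],
    "storage": [],          # add "web" here and the check fails
}

def has_cycle(graph):
    visiting, done = set(), set()
    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True      # back edge found: a cycle exists
        visiting.add(node)
        if any(visit(d) for d in graph.get(node, [])):
            return True
        visiting.remove(node)
        done.add(node)
        return False
    return any(visit(n) for n in graph)

assert not has_cycle(deps), "fitness function failed: dependency cycle"
```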

21. Chaos Engineering
    Building Confidence in System Behavior through Experiments
    Building Expertise in Emergency Response through Practice
    Photo © Tom Poppendieck
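
The shape of a chaos experiment, sketched in Python. Every probe, fault, and threshold here is invented: define steady state, inject a fault, verify the system stays within bounds, and always roll the fault back.

```python
# Skeleton of a chaos experiment; the probe, fault, and threshold are invented.

def steady_state_ok():
    """Probe: a real experiment would query an SLI such as the error rate."""
    error_rate = 0.0004          # stand-in for a metrics-system query
    return error_rate < 0.001    # hypothesis: error rate stays under 0.1%

def inject_fault():
    print("terminating one replica (simulated)")

def rollback():
    print("restoring the replica (simulated)")

def run_experiment():
    assert steady_state_ok(), "abort: system unhealthy before the experiment"
    inject_fault()
    try:
        # Did redundancy and failover absorb the fault?
        print("hypothesis held" if steady_state_ok()
              else "hypothesis refuted: weakness found before users hit it")
    finally:
        rollback()               # always clean up, pass or fail

run_experiment()
```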

22. Product (Continuous Integration)
    Ops (Continuous Delivery)
    Architecture (Pipeline Fitness Function)
    Infrastructure (Production Fitness Function)

23. Pros:
    Responsible for User Experience
    High Pay (~50% above SW Engineer)
    High Demand for People
    Expanding New Field
    Broad Skills Required
    Significant Autonomy
    Cons:
    Not Responsible for Product
    On Call
    Very Demanding Job
    Fluid Job Description
    Expertise may not be Portable
    Significant Exposure

24. Reliable Systems
    Trustworthy: Work as Expected
    Fault Tolerant: Limited Blast Radius
    Resilient: Rapid Recovery
    Available: Scalable Capacity
    Safe: Do No Harm
    Secure: Not Vulnerable to Attack
    Durable: Sustainable Over Time
    Photo © Tom Poppendieck

25. Mary Poppendieck
    [email protected]
    www.poppendieck.com
    Photo © Tom Poppendieck