Immutable Infrastructure: On Misnomers, Zoology, and Pre-accidents

You’ve written some code, run its tests, bundled an artifact, and shipped it. Sadly, as soon as it serves its first request, live traffic, configuration drift, or a partial deployment failure reveals a production-impacting issue. Oops.

This talk discusses immutable infrastructure. It touches on deployment patterns that encourage replacement over modification, while exploring the accuracy of terms, the pets vs. cattle metaphor, and safety properties inherent to its practice.

Anthony M.

August 09, 2016

Transcript

  1. Today’s Adventure
    1. Misnomers: A Series of Lies
    2. Zoology: Pets, Cattle, and Majestic Birds
    3. Pre-accidents: Pre-crash Safety and Triggering Weakness
  2. 1. Misnomers: A Series of Lies
    2. Zoology: Pets, Cattle, and Majestic Birds
    3. Pre-accidents: Pre-crash Safety and Triggering Weakness
  3. [1] pry(main)> AnNumber = Struct.new(:value)
    => AnNumber
    [2] pry(main)> one = AnNumber.new(1)
    => #<struct AnNumber value=1>
    [3] pry(main)> two = AnNumber.new(2)
    => #<struct AnNumber value=2>
    [4] pry(main)> one == two
    => false
    [5] pry(main)> one.value = 2
    => 2
    [6] pry(main)> one == two
    => true
    ಥ_ಥ
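The pry session in slide 3 shows in-place mutation turning `one` into something equal to `two`. As a minimal sketch of the alternative, the example below freezes each instance and builds a replacement instead of modifying the original. The `with_value` helper is hypothetical and does not appear in the slides.

```ruby
# Same struct as in the slide; with_value is a hypothetical helper.
AnNumber = Struct.new(:value) do
  # Build a frozen replacement instead of modifying the receiver.
  def with_value(new_value)
    self.class.new(new_value).freeze
  end
end

one = AnNumber.new(1).freeze
two = AnNumber.new(2).freeze

mutation_error = nil
begin
  one.value = 2            # mutation attempt on a frozen struct
rescue RuntimeError => e   # FrozenError (Ruby 2.5+) is a RuntimeError subclass
  mutation_error = e.class
end

replacement = one.with_value(2)

puts one == two            # prints false: one still holds 1
puts replacement == two    # prints true: the replacement holds 2
```

The original `one` can still be inspected after the replacement exists, which is the property that replacement-over-modification deployments rely on.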
  4. Immutable Infrastructure
    • CPU: Fetch instructions
    • Memory: Store values
    • Disk: Read or write files
    • Network: Transmit data
  5. 1. Misnomers: A Series of Lies
    2. Zoology: Pets, Cattle, and Majestic Birds
    3. Pre-accidents: Pre-crash Safety and Triggering Weakness
  6. Pets vs. Livestock
    Pets: • Unique • Enduring • Diagnosis and care • Lifespan in years
    Livestock: • Indistinguishable • Transitory • Economies of scale • Lifespan in months
    Traditional vs. Cloud
    days, hours, minutes
  7. 1. Misnomers: A Series of Lies
    2. Zoology: Pets, Cattle, and Majestic Birds
    3. Pre-accidents: Pre-crash Safety and Triggering Weakness
  8. 42%
    “System administration, which includes operator actions, system configuration, and system maintenance, was the main source of failures -- 42%.”
    http://research.microsoft.com/en-us/um/people/gray/papers/TandemTR85.7_WhyDoComputersStop.pdf
  9. Pre-accidents
    • Operators don’t cause failure, but trigger weakness
    • Weakness lurks everywhere
    • Discovering/evaluating weakness emboldens safety
    • Organization-wide collaboration is key
    • Learning, blamelessness, consistency
    Local context / Understand weakness / Learn warning signs / Intervene
  10. “When these problems emerge in the pre-crash phase, the time window for attempting a crash avoidance maneuver is normally very small.”
    http://www-nrd.nhtsa.dot.gov/Pubs/811617.pdf
  11. “And that’s the moment when the idea of a ‘premortem’ for high-explosive experimentation was born.”
    From the foreword of Pre-Accident Investigations: An Introduction to Organizational Safety
  12. • Each phase is independent
    • Additive and destructive actions on disposable resources
    • Gradated exposure of the artifact
      ◦ Ratio-based load balancing
      ◦ Weighted DNS records
    Gradation, Observability, Rollback
    [Diagram: service.app.com, App v2, App v1, 90%, 10%]
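Ratio-based load balancing, as named on slide 12, can be sketched as a weighted pick between two versions. This is an illustrative sketch only: the backend names, the `pick_backend` function, and the assignment of 90% to v1 and 10% to v2 are assumptions, since the slide does not state which version receives which share.

```ruby
# Hypothetical weights mirroring the slide's 90%/10% split; which version
# receives which share is an assumption made here for illustration.
WEIGHTS = { 'app-v1' => 90, 'app-v2' => 10 }.freeze

# Pick a backend in proportion to its weight.
# `roll` is an integer in 0...total, injectable so the choice is testable.
def pick_backend(weights, roll = rand(weights.values.sum))
  weights.each do |backend, weight|
    return backend if roll < weight
    roll -= weight
  end
  raise ArgumentError, 'roll is outside the total weight'
end

# Simulate traffic; the counts land near the configured ratio.
counts = Hash.new(0)
10_000.times { counts[pick_backend(WEIGHTS)] += 1 }
puts counts
```

Widening the exposure of the new version means changing the weights only; neither running version is modified.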
  13. • Gradation begets observability
    • Smoke test without impacting service
    • Blue-green with real user traffic
    • Evaluate metrics from two versions side-by-side
    Gradation, Observability, Rollback
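Evaluating metrics from two versions side-by-side, as on slide 13, might look like the sketch below. The `Sample` struct, the `regression?` check, the counter values, and the 1% tolerance are all assumptions for illustration; the slides do not name a metrics system or a threshold.

```ruby
# Hypothetical per-version counters; in practice these would come from a
# metrics system, which the slides do not name.
Sample = Struct.new(:requests, :errors) do
  def error_rate
    requests.zero? ? 0.0 : errors.to_f / requests
  end
end

# True when the candidate's error rate exceeds the baseline's by more than
# `tolerance` (an assumed, absolute margin).
def regression?(baseline, candidate, tolerance = 0.01)
  candidate.error_rate - baseline.error_rate > tolerance
end

v1 = Sample.new(9_000, 45)  # baseline: 0.5% errors
v2 = Sample.new(1_000, 30)  # candidate: 3.0% errors

puts regression?(v1, v2)    # prints true: v2 is 2.5 points worse than v1
```

Because both versions serve real traffic at once, the comparison uses measurements from the same time window rather than a before-and-after guess.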
  14. • Reduced to the modification of a pointer
    Gradation, Observability, Rollback
    [Diagram: service.app.com, App v2, App v1, 100%, 10%]
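Slide 14 describes rollback as the modification of a pointer. A minimal sketch under that reading follows; the `Router` class, its method names, and the lambda handlers are hypothetical stand-ins for whatever load balancer or DNS record holds the pointer in practice.

```ruby
# Hypothetical router: `live` is the pointer; deployed versions are never modified.
class Router
  attr_reader :live

  def initialize(versions, live)
    @versions = versions.freeze
    point_to(live)
  end

  # Deploy and rollback are the same operation: repoint `live`.
  def point_to(name)
    raise ArgumentError, "unknown version: #{name}" unless @versions.key?(name)
    @live = name
  end

  def handle(request)
    @versions.fetch(@live).call(request)
  end
end

router = Router.new(
  { 'v1' => ->(req) { "v1 handled #{req}" },
    'v2' => ->(req) { "v2 handled #{req}" } },
  'v1'
)

router.point_to('v2')        # roll forward
router.point_to('v1')        # roll back: only the pointer changed
puts router.handle('GET /')  # prints "v1 handled GET /"
```

Since v1 was never altered while v2 was live, rolling back restores exactly the behavior that was running before.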
  15. Summary
    • Immutability simplifies understanding, reasoning
    • Deep understanding enables pre-mortem practices
    • Both combine to provide safety guarantees, remediation
    • Not without caveats -- statefulness, dependencies, etc.
  16. Further Reading
    • Pre-Accident Investigations: An Introduction to Organizational Safety by Todd Conklin
    • Promise Theory: Principles and Applications by Jan A. Bergstra and Mark Burgess
    • The Human Side of Postmortems by David Zwieback
    • Why Do Computers Stop and What Can Be Done About It? by Jim Gray
    • Computer Immunology by Mark Burgess
    • How Complex Systems Fail by Richard Cook
    • Understanding and Dealing with Operator Mistakes in Internet Services by Nagaraja et al.
    • Building Robust Systems by Gerald Jay Sussman
    • Do Not Blame Users for Misconfigurations by Xu et al.
    • Literally any public post-mortem, human performance review, and system safety evaluation you can get your hands on
  17. Further Listening
    • The Pre-Accident Podcast
    • The SafetyStratus Podcast
    • Food Fight Show
    • The Ship Show
    • Arrested DevOps
    • Cloud-Native After Dark
    • HangOps