Planning for the Outages You're Going to Cause

Ramin K
November 20, 2019

You're going to cause an outage. It's cool. So let's have a plan. This talk discusses common ways to reduce the severity and scope of outages.

Transcript

  1. “Sometime I like to tell people how they should live. I think because I know a few secrets I have the answers To everything, and that’s not true. But sometimes the people I speak to in my overbearing way listen, tilting Their heads a little, implying they’re not ready yet To take my talk at face value. I might manage To corner somebody most of an afternoon;” Lisa Lewis (poet), “While I’m Walking”
  2. Failure is not an option, it’s a guarantee. Second Law of Thermodynamics; Murphy’s Law; Chaos Theory / Emergent (Mis)Behavior; Sidney Dekker / Erik Hollnagel (books!).
  3. “We can't impose our will on a system. We can listen to what the system tells us, and discover how its properties and our values can work together to bring forth something much better than could ever be produced by our will alone.” Donella Meadows (PhD), Thinking in Systems: A Primer
  4. Boundary Conditions. Outages of commission are interesting. Outages of omission are maintenance problems. Standing still is an existential problem to solve.
  5. Foundational services cause larger outages, or why I left network engineering. Frontend; Backend, API, cache; Infrastructure (I work here); PG&E.
  6. Practice makes perfect. Zen Release Advice: In order to get good at releases, you must first release. And then release some more. That’s it, you can leave now.
  7. Small batch size is best batch size. Small incremental changes are: easier to reason about; easier to test; easier to release (with work). Less human intensive too.
  8. Engineering, not just for features. One person on an eight-person team is not an unreasonable investment in reducing the length and scope of incidents. Wait, good releases are a feature.
  9. Pre and post press release your release. The key here is communication. Docs, emails, roadmaps, tickets, office hours, and internal presentations are all potential tools. Tell the org what they’re getting, repeatedly.
  10. Make sure everyone knows that looking for failure is important. “The current goal is $x. My change is $y. I think $z is our most likely failure. Do you agree? What else am I missing?”
  11. No one complained, we must have been successful. Do we have signal? For both failure and success?
  12. Summary. It takes time to build processes, releasable software, and teamwork. Don’t rush it. Don’t dawdle either. Go forth and have your least worst outage.
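
The question on slide 11 ("Do we have signal? For both failure and success?") can be made concrete with a small sketch. This is not from the talk; it is one possible illustration, and the `ReleaseSignals` type, its counter names, and the 1% failure threshold are all hypothetical. The point it tries to show is that an absence of complaints is treated as "unknown", not as success.

```python
from dataclasses import dataclass


@dataclass
class ReleaseSignals:
    """Hypothetical counters collected after a release (names are assumptions)."""
    success_events: int  # positive evidence, e.g. requests the new version served correctly
    failure_events: int  # negative evidence, e.g. errors attributed to the new version


def release_verdict(signals: ReleaseSignals, max_failure_ratio: float = 0.01) -> str:
    """Classify a release using both success and failure signal.

    The 1% default threshold is an arbitrary value for illustration only.
    """
    total = signals.success_events + signals.failure_events
    if total == 0:
        # No one complained, but nothing confirmed success either.
        return "unknown"
    if signals.failure_events / total > max_failure_ratio:
        return "failed"
    return "succeeded"


if __name__ == "__main__":
    print(release_verdict(ReleaseSignals(success_events=0, failure_events=0)))
    print(release_verdict(ReleaseSignals(success_events=990, failure_events=10)))
    print(release_verdict(ReleaseSignals(success_events=900, failure_events=100)))
```

Returning a third state instead of a boolean is a deliberate choice in this sketch: it forces whoever reads the result to notice when there is no signal at all.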