Five NEINs of Availability

We've all been there: it's 3 AM, the system is down, everything is on fire and it's up to us to make it better. We do some digging, deploy a fix and draft a post-mortem. We might even identify some things we could have done differently, or suggest a process to avoid such problems in the future. Everyone sits down for the ceremonial presentation of the post-mortem and nods sagely, then goes back to work secure in the belief that valuable lessons have been learned... right up until the next time the system crashes and we go through the motions again.

In this session we'll consider not what could be done differently, but what shouldn't be done at all: common engineering antipatterns that will degrade our system and hurt its availability unless we avoid them.

Tomer Gabel

June 23, 2020

Transcript

  1. Don’t… (1) MAKE IT IMPOSSIBLE TO BREAK
    Image by Bert Heymans on Flickr (CC BY-NC-ND 2.0)
  2. • Human error is inevitable
    • More process won’t solve the problem
    • More process will screw you over
    • Avoid the process entirely, or provide means of circumventing it
  3. Barriers
    • Regulation
      – Apply only where required, and/or
      – Allow access with proper logging/auditing. It’ll end up cheaper
    • Lack of trust
      – You trust them to build it, but not to operate it?
      – Won’t help you anyway
  4. Don’t… (4) MAKE PROBLEMS GO AWAY
    Top image by chaouki on Flickr (CC BY-SA 2.0), bottom image by Caffeinatrix on Flickr (CC BY-NC-ND 2.0)
  5. Panic Reboot
    Nasty side effects
    Obvious:
    • Data loss
    • Data corruption
    • Interrupted users
    Subtle:
    • Partial transactions
    • Data inconsistency
    • Abnormal load (e.g. cache warmup)
  6. Panic Reboot
    Nasty side effects
    Lost opportunity
    It’s a unique state
    • Unexpected
    • Pathological
    • Visible
    Collect data!
    • Thread dumps
    • Heap dumps
    • Metrics, logs
    Act on it!
  7. There’s an issue! Fix it! WHY ISN’T IT FIXED YET? Is it fixed yet? Can you send a status update? Don’t forget to save the screenshots!
  8. In conclusion
    Don't…
    1. Make it impossible to break
    2. Put artificial barriers in place
    3. Think of failures as exceptional
    4. Make problems go away
    5. Harass your debuggers
  9. In conclusion
    Don't…
    1. Make it impossible to break
    2. Put artificial barriers in place
    3. Think of failures as exceptional
    4. Make problems go away
    5. Harass your debuggers
    Do…
    • Trust your engineers
    • Assume and plan for failure
    • Gather evidence before acting
    • Invest in incident management
  10. QUESTIONS? Thank you for listening
    [email protected]
    @tomerg
    On GitHub: https://github.com/holograph
    This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
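
Slide 6 advises collecting data (thread dumps, heap dumps, metrics, logs) before a panic reboot destroys the faulty state. The following is a minimal sketch of that idea in Python, not something taken from the talk: the function names `capture_thread_dump`, `collect_evidence` and `restart_with_evidence`, the `restart` callable and the `metrics` dictionary are all hypothetical, and the thread dump relies on the CPython-specific `sys._current_frames()`. Heap dumps are not covered here; on a JVM they would come from the platform's own tooling.

```python
import datetime
import pathlib
import sys
import threading
import traceback


def capture_thread_dump() -> str:
    """Return a text dump of the current stack of every live thread."""
    names = {t.ident: t.name for t in threading.enumerate()}
    sections = []
    # sys._current_frames() maps thread id -> topmost frame (CPython-specific).
    for ident, frame in sys._current_frames().items():
        header = f"Thread {names.get(ident, 'unknown')} (id={ident})"
        stack = "".join(traceback.format_stack(frame))
        sections.append(f"{header}\n{stack}")
    return "\n".join(sections)


def collect_evidence(out_dir: str, metrics: dict) -> pathlib.Path:
    """Write a thread dump and a metrics snapshot to a timestamped folder."""
    now = datetime.datetime.now(datetime.timezone.utc)
    target = pathlib.Path(out_dir) / f"incident-{now:%Y%m%dT%H%M%SZ}"
    target.mkdir(parents=True, exist_ok=True)
    (target / "threads.txt").write_text(capture_thread_dump())
    (target / "metrics.txt").write_text(
        "\n".join(f"{key}={value}" for key, value in sorted(metrics.items()))
    )
    return target


def restart_with_evidence(restart, out_dir: str, metrics: dict) -> pathlib.Path:
    """Collect evidence first, then call the (hypothetical) restart action."""
    saved = collect_evidence(out_dir, metrics)
    restart()  # assumption: caller supplies whatever actually restarts the service
    return saved
```

Wrapping the restart in a function like this makes evidence collection the default path rather than an extra step someone has to remember at 3 AM; the saved folder can then be attached to the post-mortem.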