Five NEINs of Availability

We've all been there: it's 3 AM, the system is down, everything is on fire, and it's up to us to make it better. We do some digging, deploy a fix and draft a post-mortem. We might even identify some things we could have done differently, or suggest a process to avoid such problems in the future. Everyone sits down for the ceremonial presentation of the post-mortem and nods sagely, then goes back to work secure in the belief that valuable lessons have been learned... right up until the next time the system crashes and we go through the motions again.

In this session we'll consider not what could be done differently, but what shouldn't be done at all: common engineering antipatterns that, unless we avoid them, will degrade our system and hurt its availability.

Tomer Gabel

June 23, 2020
Transcript

  1. Five NEINs of Availability Tomer Gabel Cloud Nein, June 2020

  2. Same Old Song and Dance (image by Sarah Giboni on Flickr, CC BY 2.0)
  3. Who am I?

  4. 1. Don’t… MAKE IT IMPOSSIBLE TO BREAK (image by Bert Heymans on Flickr, CC BY-NC-ND 2.0)
  5. • Human error is inevitable
     • More process won’t solve the problem
     • More process will screw you over
     • Avoid the process entirely, or provide means of circumventing it
  6. 2. Don’t… PUT ARTIFICIAL BARRIERS IN PLACE (image by Jessica BKK on Flickr, CC BY 2.0)
  7. Barriers
     • Regulation
       – Apply only where required, and/or
       – Allow access with proper logging/auditing. It’ll end up cheaper
     • Lack of trust
       – You trust them to build it, but not to operate it?
       – Won’t help you anyway
  8. 3. Don’t… THINK OF FAILURES AS EXCEPTIONAL (image by Piratenmensch on Flickr, CC BY-SA 2.0)
  9. Source: AWS case study by Slack

  10. None
  11. None
  12. None
  13. 4. Don’t… MAKE PROBLEMS GO AWAY (top image by chaouki on Flickr, CC BY-SA 2.0; bottom image by Caffeinatrix on Flickr, CC BY-NC-ND 2.0)
  14. Source: Imgflip

  15. Panic Reboot: Nasty side effects
      Obvious:
      • Data loss
      • Data corruption
      • Interrupted users
      Subtle:
      • Partial transactions
      • Data inconsistency
      • Abnormal load (e.g. cache warmup)
  16. Panic Reboot: Nasty side effects, Lost opportunity
      It’s a unique state:
      • Unexpected
      • Pathological
      • Visible
      Collect data!
      • Thread dumps
      • Heap dumps
      • Metrics, logs
      Act on it!
  17. Panic Reboot: Nasty side effects, Lost opportunity
      Bryan Cantrill, “Debugging Microservices in Production”, QCon SF 2015
  18. 5. Don’t… HARASS YOUR DEBUGGERS (image by mariana neri on Flickr, CC BY-NC-ND 2.0)
  19. There’s an issue! Fix it! WHY ISN’T IT FIXED YET? Is it fixed yet? Can you send a status update? Don’t forget to save the screenshots!
  20. In conclusion
      Don't…
      1. Make it impossible to break
      2. Put artificial barriers in place
      3. Think of failures as exceptional
      4. Make problems go away
      5. Harass your debuggers
  21. In conclusion
      Don't…
      1. Make it impossible to break
      2. Put artificial barriers in place
      3. Think of failures as exceptional
      4. Make problems go away
      5. Harass your debuggers
      Do…
      • Trust your engineers
      • Assume and plan for failure
      • Gather evidence before acting
      • Invest in incident management
  22. QUESTIONS? Thank you for listening
      tomer@tomergabel.com
      @tomerg
      On GitHub: https://github.com/holograph
      This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
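
Slide 16 ("Collect data!") and the closing advice to "gather evidence before acting" can be illustrated with a small sketch. The Python code below is not from the talk; it is a minimal example under stated assumptions: the function names `collect_evidence` and `restart_with_evidence`, the `incident-evidence` directory and the `restart` callback are all hypothetical. It only covers a thread dump and a tiny metrics snapshot for the current process. Heap dumps and real metrics would come from whatever tooling your platform provides.

```python
import json
import os
import sys
import threading
import time
import traceback
from pathlib import Path


def collect_evidence(out_dir):
    """Write a thread dump and a few basic metrics to out_dir; return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%S")

    # Thread dump: one stack trace per live thread in this process.
    names = {t.ident: t.name for t in threading.enumerate()}
    dump_path = out / f"threads-{stamp}.txt"
    with dump_path.open("w") as f:
        for ident, frame in sys._current_frames().items():
            f.write(f"Thread {names.get(ident, '?')} ({ident}):\n")
            f.write("".join(traceback.format_stack(frame)))
            f.write("\n")

    # Minimal metrics snapshot (assumption: a real system would query
    # its own metrics and logging stack here instead).
    metrics_path = out / f"metrics-{stamp}.json"
    metrics = {
        "timestamp": time.time(),
        "pid": os.getpid(),
        "thread_count": threading.active_count(),
    }
    metrics_path.write_text(json.dumps(metrics, indent=2))
    return [dump_path, metrics_path]


def restart_with_evidence(restart, out_dir="incident-evidence"):
    """Collect evidence first, then call the (hypothetical) restart callback."""
    try:
        paths = collect_evidence(out_dir)
    except Exception:
        # Evidence collection is best effort and must not block recovery.
        paths = []
    restart()
    return paths
```

The design choice worth noting is the ordering: the restart still happens, but only after the unusual state has been captured, and a failure to capture it never delays recovery.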