Slide 1

Slide 1 text

Five NEINs of Availability
Tomer Gabel
Cloud Nein, June 2020

Slide 2

Slide 2 text

Same Old Song and Dance
Image by Sarah Giboni on Flickr (CC BY 2.0)

Slide 3

Slide 3 text

Who am I?

Slide 4

Slide 4 text

1. Don’t… MAKE IT IMPOSSIBLE TO BREAK
Image by Bert Heymans on Flickr (CC BY-NC-ND 2.0)

Slide 5

Slide 5 text

• Human error is inevitable
• More process won’t solve the problem
• More process will screw you over
• Avoid the process entirely, or provide means of circumventing it

Slide 6

Slide 6 text

2. Don’t… PUT ARTIFICIAL BARRIERS IN PLACE
Image by Jessica BKK on Flickr (CC BY 2.0)

Slide 7

Slide 7 text

Barriers
• Regulation
  – Apply only where required, and/or
  – Allow access with proper logging/auditing. It’ll end up cheaper
• Lack of trust
  – You trust them to build it, but not to operate it?
  – Won’t help you anyway
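
A minimal sketch of the "allow access with proper logging/auditing" option, in Python. The `audited` decorator, the `audit` logger name, and `read_production_config` are hypothetical names chosen for illustration; they are not from the talk. The idea is that the engineer is let through, and the access is recorded instead of blocked.

```python
import functools
import logging

# Hypothetical dedicated logger; in practice it would ship to an audit store.
audit_log = logging.getLogger("audit")


def audited(action):
    """Record who performed `action` before letting the call proceed."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(user, *args, **kwargs):
            # Log first, so the record exists even if the call itself fails.
            audit_log.info("user=%s action=%s args=%r", user, action, args)
            return func(user, *args, **kwargs)
        return wrapper
    return decorator


@audited("read-production-config")
def read_production_config(user, key):
    # Stand-in for a real production lookup (assumed data).
    return {"db.pool.size": 20}.get(key)
```

Calling `read_production_config("alice", "db.pool.size")` returns the value and leaves an audit record, with no approval step in the engineer's way.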

Slide 8

Slide 8 text

3. Don’t… THINK OF FAILURES AS EXCEPTIONAL
Image by Piratenmensch on Flickr (CC BY-SA 2.0)

Slide 9

Slide 9 text

Source: AWS case study by Slack

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

4. Don’t… MAKE PROBLEMS GO AWAY
Top image by chaouki on Flickr (CC BY-SA 2.0), bottom image by Caffeinatrix on Flickr (CC BY-NC-ND 2.0)

Slide 14

Slide 14 text

Source: Imgflip

Slide 15

Slide 15 text

Panic Reboot
Nasty side effects
Obvious:
• Data loss
• Data corruption
• Interrupted users
Subtle:
• Partial transactions
• Data inconsistency
• Abnormal load (e.g. cache warmup)

Slide 16

Slide 16 text

Panic Reboot
Nasty side effects
Lost opportunity
It’s a unique state
• Unexpected
• Pathological
• Visible
Collect data!
• Thread dumps
• Heap dumps
• Metrics, logs
Act on it!
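
A minimal sketch of "collect data, then act", in Python. It snapshots every thread's stack before a restart is allowed to run. The function names and the `restart` and `sink` callbacks are hypothetical. `sys._current_frames()` is a CPython-specific call, so treat this as an assumption about the runtime. Heap dumps and metrics would need separate tooling.

```python
import sys
import threading
import time
import traceback


def capture_thread_dump() -> str:
    """Return a text snapshot of every live thread's current stack."""
    names = {t.ident: t.name for t in threading.enumerate()}
    sections = []
    for ident, frame in sys._current_frames().items():
        header = f"Thread {names.get(ident, 'unknown')} (id={ident})"
        stack = "".join(traceback.format_stack(frame))
        sections.append(f"{header}\n{stack}")
    return "\n".join(sections)


def collect_then_restart(restart, sink):
    """Gather evidence first, then run the (hypothetical) restart action."""
    sink(f"captured_at={time.time():.0f}\n{capture_thread_dump()}")
    restart()
```

Here `sink` could write to a file or an incident ticket. The point is only the ordering: the pathological state is recorded before the reboot destroys it.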

Slide 17

Slide 17 text

Panic Reboot
Nasty side effects
Lost opportunity
Bryan Cantrill, Debugging Microservices in Production, QCon SF 2015

Slide 18

Slide 18 text

5. Don’t… HARASS YOUR DEBUGGERS
Image by mariana neri on Flickr (CC BY-NC-ND 2.0)

Slide 19

Slide 19 text

There’s an issue!
Fix it!
WHY ISN’T IT FIXED YET?
Is it fixed yet?
Can you send a status update?
Don’t forget to save the screenshots!

Slide 20

Slide 20 text

In conclusion
Don't…
1. Make it impossible to break
2. Put artificial barriers in place
3. Think of failures as exceptional
4. Make problems go away
5. Harass your debuggers

Slide 21

Slide 21 text

In conclusion
Don't…
1. Make it impossible to break
2. Put artificial barriers in place
3. Think of failures as exceptional
4. Make problems go away
5. Harass your debuggers
Do…
• Trust your engineers
• Assume and plan for failure
• Gather evidence before acting
• Invest in incident management

Slide 22

Slide 22 text

QUESTIONS?
Thank you for listening
[email protected]
@tomerg
On GitHub: https://github.com/holograph
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.