Five NEINs of Availability
Tomer Gabel
Cloud Nein, June 2020
Slide 2
Same Old Song and Dance
Image by Sarah Giboni on Flickr (CC BY 2.0)
Slide 3
Who am I?
Slide 4
1. Don’t… MAKE IT IMPOSSIBLE TO BREAK
Image by Bert Heymans on Flickr (CC BY-NC-ND 2.0)
Slide 5
• Human error is inevitable
• More process won’t solve the problem
• More process will screw you over
• Avoid the process entirely, or provide means of circumventing it
Slide 6
2. Don’t… PUT ARTIFICIAL BARRIERS IN PLACE
Image by Jessica BKK on Flickr (CC BY 2.0)
Slide 7
• Regulation
– Apply only where required, and/or
– Allow access with proper logging/auditing. It’ll end up cheaper
• Lack of trust
– You trust them to build it, but not to operate it?
– Won’t help you anyway
Barriers
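As a rough illustration of "allow access with proper logging/auditing", the sketch below wraps a sensitive operation so that it is recorded rather than blocked. The `audited` decorator, the `audit` logger name, and `restart_service` are hypothetical names chosen for this example; they do not come from the talk.

```python
import functools
import logging
import os
import time

# Hypothetical logger name; route it to whatever audit sink you already have.
audit_log = logging.getLogger("audit")


def audited(action):
    """Record who ran a sensitive operation and how it ended, without blocking it."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Assumption: the operator is identified by the USER environment variable.
            user = os.environ.get("USER", "unknown")
            started = time.time()
            audit_log.info("start action=%s user=%s args=%r", action, user, args)
            try:
                result = func(*args, **kwargs)
            except Exception:
                audit_log.exception("failed action=%s user=%s", action, user)
                raise
            elapsed = time.time() - started
            audit_log.info("done action=%s user=%s seconds=%.3f", action, user, elapsed)
            return result
        return wrapper
    return decorator


@audited("restart-service")
def restart_service(name):
    # Hypothetical operation; a real one would call your orchestrator.
    return f"restarted {name}"
```

The engineer keeps the ability to act during an incident, and the audit trail still answers who did what, and when.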
Slide 8
3. Don’t… THINK OF FAILURES AS EXCEPTIONAL
Image by Piratenmensch on Flickr (CC BY-SA 2.0)
Slide 9
Source: AWS case study by Slack
Slide 13
4. Don’t… MAKE PROBLEMS GO AWAY
Top image by chaouki on Flickr (CC BY-SA 2.0), bottom image by Caffeinatrix on Flickr (CC BY-NC-ND 2.0)
Slide 14
Source: Imgflip
Slide 15
Panic Reboot
Nasty side effects
Obvious:
• Data loss
• Data corruption
• Interrupted users
Subtle:
• Partial transactions
• Data inconsistency
• Abnormal load (e.g. cache warmup)
Slide 16
Panic Reboot
Nasty side effects
Lost opportunity
It’s a unique state
• Unexpected
• Pathological
• Visible
Collect data!
• Thread dumps
• Heap dumps
• Metrics, logs
Act on it!
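A minimal sketch of the "collect data" step, assuming the misbehaving service is a Python process that can run a hook before it is restarted. The function names and the output directory layout are hypothetical. On the JVM, tools such as `jstack` (thread dumps) and `jmap` (heap dumps) play the same role.

```python
import sys
import threading
import time
import traceback
from pathlib import Path


def capture_thread_dump():
    """Return the current stack of every live thread as text."""
    names = {t.ident: t.name for t in threading.enumerate()}
    sections = []
    # sys._current_frames() maps thread id -> that thread's topmost frame.
    for ident, frame in sys._current_frames().items():
        header = f"Thread {names.get(ident, 'unknown')} (id={ident})"
        stack = "".join(traceback.format_stack(frame))
        sections.append(f"{header}\n{stack}")
    return "\n".join(sections)


def collect_evidence(out_dir):
    """Write a timestamped thread dump to out_dir before anything is restarted."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    dump_file = target / f"threads-{time.strftime('%Y%m%dT%H%M%S')}.txt"
    dump_file.write_text(capture_thread_dump())
    return dump_file
```

Calling `collect_evidence("/var/tmp/incident")` (a hypothetical path) as the first step of a restart procedure preserves the pathological state for later analysis instead of discarding it.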
Slide 17
Panic Reboot
Nasty side effects
Lost opportunity
Bryan Cantrill
Debugging Microservices in Production, QCon SF 2015
Slide 18
5. Don’t… HARASS YOUR DEBUGGERS
Image by mariana neri on Flickr (CC BY-NC-ND 2.0)
Slide 19
There’s an issue! Fix it!
WHY ISN’T IT FIXED YET?
Is it fixed yet?
Can you send a status update?
Don’t forget to save the screenshots!
Slide 20
In conclusion
Don't…
1. Make it impossible to break
2. Put artificial barriers in place
3. Think of failures as exceptional
4. Make problems go away
5. Harass your debuggers
Do…
• Trust your engineers
• Assume and plan for failure
• Gather evidence before acting
• Invest in incident management
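One way to read "assume and plan for failure" in code is to treat a failing dependency as a normal branch instead of an exception to the design. The sketch below is an illustration, not something shown in the talk: `call_with_retries` and its parameters are hypothetical, and retry with exponential backoff plus a fallback is only one of several possible techniques.

```python
import random
import time


def call_with_retries(operation, attempts=3, base_delay=0.1, fallback=None, sleep=time.sleep):
    """Call operation(); on failure back off and retry, then use the fallback."""
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Exponential backoff with jitter to avoid synchronized retries.
                delay = base_delay * (2 ** attempt)
                sleep(delay + random.uniform(0, delay))
    if fallback is not None:
        # Degraded but available: e.g. serve a cached or default value.
        return fallback()
    raise RuntimeError(f"operation failed after {attempts} attempts") from last_error
```

For example, `call_with_retries(fetch_profile, fallback=lambda: cached_profile)` (both names hypothetical) keeps serving a stale value while the dependency is down.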
Slide 22
QUESTIONS?
Thank you for listening
[email protected]
@tomerg
On GitHub:
https://github.com/holograph
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.