Michael Nitschinger | Couchbase, Inc.
The Walking Dead
A Survival Guide to Reactive Resilient Applications
Slide 2
Slide 2 text
the right
Mindset
2
Slide 3
Slide 3 text
“The more you sweat in peace, the less
you bleed in war.”
- U.S. Marine Corps
3
Slide 6
Slide 6 text
Not so fast, mister fancy tests!
6
Slide 7
Slide 7 text
Always ask yourself:
What can go wrong?
7
Slide 8
Slide 8 text
Fault Tolerance
101
8
Slide 9
Slide 9 text
Fault Error Failure
A fault is a latent defect that can cause an
error when activated.
9
Slide 10
Slide 10 text
Fault Error Failure
Errors are the manifestations of faults.
10
Slide 11
Slide 11 text
Fault Error Failure
Failure occurs when the service no longer
complies with its specifications.
11
Slide 12
Slide 12 text
Fault Error Failure
Errors are inevitable. We need to
detect, recover and mitigate
them before they become failures.
12
Slide 13
Slide 13 text
Reliability
is the probability that a system will perform
failure free for a given amount of time.
MTTF: Mean Time To Failure
MTTR: Mean Time To Repair
13
Slide 14
Slide 14 text
Availability
is the percentage of time the system is able to
perform its function.
availability = MTTF / (MTTF + MTTR)
14
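As a worked example of the availability formula, the sketch below computes availability from MTTF and MTTR. The function name and the sample figures (1000 hours between failures, 1 hour to repair) are assumptions chosen for illustration, not numbers from the talk.

```python
def availability(mttf: float, mttr: float) -> float:
    """Return availability as a fraction; MTTF and MTTR share one time unit."""
    if mttf < 0 or mttr < 0 or mttf + mttr == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf / (mttf + mttr)


# Hypothetical service: fails every 1000 hours on average, repaired in 1 hour.
print(f"{availability(1000.0, 1.0):.4%}")
# Halving the repair time raises availability without touching MTTF.
print(f"{availability(1000.0, 0.5):.4%}")
```

The second call illustrates why recovery speed matters as much as failure rate: shrinking MTTR improves availability even when faults are just as frequent.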
Slide 15
Slide 15 text
Expression        Availability   Downtime/Year
Three 9s          99.9%          525.6 min
Four 9s           99.99%         52.56 min
Four 9s and a 5   99.995%        26.28 min
Five 9s           99.999%        5.256 min
Six 9s            99.9999%       0.5256 min
                  100%           0 min
15
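The downtime figures in the table follow directly from the availability percentage. A minimal sketch, assuming a 365-day year (which is what the table's numbers imply); the function name is made up for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes per year a system may be down at the given availability."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for pct in (99.9, 99.99, 99.995, 99.999, 99.9999, 100.0):
    print(f"{pct}% -> {downtime_minutes_per_year(pct):.4f} min/year")
```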
Slide 16
Slide 16 text
Pop Quiz!
Edge Service
User Service Session Store Data Warehouse
Wanted: 99.99% Availability
??? ??? ???
16
Slide 17
Slide 17 text
Pop Quiz!
Edge Service
User Service Session Store Data Warehouse
Wanted: 99.99% Availability
99.99% 99.99% 99.99%
17
Slide 18
Slide 18 text
Pop Quiz!
Edge Service
User Service Session Store Data Warehouse
Wanted: 99.99% Availability
~99.999% ~99.999% ~99.999%
18
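The quiz can be checked with a little arithmetic. The sketch below assumes that the Edge Service needs all three dependencies to be up at once and that their failures are independent, so the availabilities multiply; both are modelling assumptions, and the function names are invented here. Under that model the exact per-dependency requirement is roughly 99.9967%, which the slide rounds up to the next "nine".

```python
import math
from typing import Iterable


def serial_availability(parts: Iterable[float]) -> float:
    """Availability of a chain in which every part must be up (independent failures)."""
    return math.prod(parts)


def required_per_dependency(target: float, count: int) -> float:
    """Availability each of `count` equal dependencies needs to reach `target`."""
    return target ** (1 / count)


# Three dependencies at 99.99% each fall short of a 99.99% target.
print(f"{serial_availability([0.9999] * 3):.6%}")
# What each dependency actually needs.
print(f"{required_per_dependency(0.9999, 3):.6%}")
# Three dependencies at 99.999% each clear the target.
print(f"{serial_availability([0.99999] * 3):.6%}")
```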
Slide 19
Slide 19 text
Fault Tolerant
Architecture
19
Slide 20
Slide 20 text
Units of Mitigation
are the basic units of
error containment and recovery.
20
Slide 21
Slide 21 text
Escalation
is used when recovery or mitigation
is not possible inside the unit.
21
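One way to picture units of mitigation and escalation is a chain of handlers, where each unit tries to recover locally and hands the error to its parent when it cannot. This is a minimal sketch under that assumption; the `Unit` class, the unit names and the error strings are all hypothetical.

```python
from typing import Callable, Optional


class Unit:
    """A unit of mitigation that recovers locally or escalates to its parent."""

    def __init__(
        self,
        name: str,
        parent: Optional["Unit"] = None,
        can_recover: Callable[[str], bool] = lambda error: False,
    ) -> None:
        self.name = name
        self.parent = parent
        self.can_recover = can_recover

    def handle(self, error: str) -> str:
        """Return the name of the unit that recovered from the error."""
        if self.can_recover(error):
            return self.name
        if self.parent is None:
            raise RuntimeError(f"unrecoverable error: {error}")
        return self.parent.handle(error)  # escalation


cluster = Unit("cluster", can_recover=lambda error: True)
node = Unit("node", cluster, lambda error: error == "service crashed")
service = Unit("service", node, lambda error: error == "socket closed")
endpoint = Unit("endpoint", service)

print(endpoint.handle("socket closed"))    # handled one level up
print(endpoint.handle("service crashed"))  # escalated twice
```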
Slide 22
Slide 22 text
Escalation
Cluster
  Node, Node
    Service (x5)
      Endpoint (x5)
22
Slide 26
Slide 26 text
Redundancy
Active/Active    Active/Standby (Active/Passive)    N+M
Trade-off axes: Cost, Time To Recover
26
Slide 27
Slide 27 text
The Fault Observer
receives system and error events and
can guide and orchestrate detection and recovery.
[Diagram labels: Unit (x4), Observer, Listener (x2)]
27
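A fault observer can be sketched as a small publish/subscribe hub: units report events to it and listeners are notified. The class and method names below are assumptions made for this illustration and are not taken from any particular library.

```python
from typing import Callable, List, Tuple

Listener = Callable[[str, str], None]


class FaultObserver:
    """Collects events reported by units and forwards them to listeners."""

    def __init__(self) -> None:
        self._listeners: List[Listener] = []
        self.history: List[Tuple[str, str]] = []

    def subscribe(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def report(self, unit: str, description: str) -> None:
        """Record an event and notify every listener in subscription order."""
        self.history.append((unit, description))
        for listener in self._listeners:
            listener(unit, description)


observer = FaultObserver()
observer.subscribe(lambda unit, description: print(f"[log] {unit}: {description}"))
observer.report("endpoint-1", "connection reset")
```

Keeping a history in the observer is one possible design choice; it lets recovery logic look at patterns across units instead of reacting to single events.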
Slide 30
Slide 30 text
Detecting
Errors
30
Slide 31
Slide 31 text
A silent system
is a dead system.
31
Slide 32
Slide 32 text
A System Monitor
helps to study behaviour and to
make sure it is operating as specified.
32
http://cdn-www.airliners.net/aviation-photos/photos/9/2/1/0982129.jpg
Slide 33
Slide 33 text
https://github.com/Netflix/Turbine
33
Slide 34
Slide 34 text
Periodic Checking
Heartbeats monitor tasks or remote services
and initiate recovery
Routine Exercises prevent idle
unit starvation and surface malfunctions
34
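Heartbeat checking can be sketched as a table of "last seen" timestamps plus a rule for when a unit counts as overdue. The `HeartbeatMonitor` class and its injectable clock are assumptions for this example; a real system would also decide what recovery to start for each overdue unit.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat per unit and reports units that went quiet."""

    def __init__(self, timeout_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout = timeout_seconds
        self._clock = clock
        self._last_seen: Dict[str, float] = {}

    def beat(self, unit: str) -> None:
        """Record a heartbeat from the given unit."""
        self._last_seen[unit] = self._clock()

    def overdue(self) -> List[str]:
        """Return the units whose last heartbeat is older than the timeout."""
        now = self._clock()
        return sorted(unit for unit, seen in self._last_seen.items()
                      if now - seen > self._timeout)


monitor = HeartbeatMonitor(timeout_seconds=5.0)
monitor.beat("session-store")
print(monitor.overdue())  # nothing is overdue right after a heartbeat
```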
Slide 35
Slide 35 text
[Diagram: an Endpoint's Netty pipeline. Netty writes pass through two Encoders and Netty reads pass through two Decoders. With no traffic, an event is fired on idle.]
35
Slide 36
Slide 36 text
Riding over Transients
is used to defer error recovery
if the error is temporary.
“‘Patience is a virtue’ to allow the true signature of
an error to show itself.”
- Robert S. Hanmer
36
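One simple way to ride over transients is to start recovery only after several errors in a row, so a single blip is ignored. The sketch below assumes a consecutive-error threshold; the class name and the threshold of 3 are illustrative choices, and other policies (time windows, leaky buckets) fit the same idea.

```python
class TransientFilter:
    """Signals recovery only after `threshold` consecutive errors."""

    def __init__(self, threshold: int) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self._threshold = threshold
        self._consecutive_errors = 0

    def record(self, ok: bool) -> bool:
        """Record one observation; return True when recovery should start."""
        if ok:
            self._consecutive_errors = 0  # the error was transient
            return False
        self._consecutive_errors += 1
        return self._consecutive_errors >= self._threshold


error_filter = TransientFilter(threshold=3)
for ok in (False, True, False, False, False):
    print(ok, error_filter.record(ok))
```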
Slide 40
Timeout
so that callers do not wait forever and
keep holding up the resource.
40
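A timeout can be sketched with the standard library's `concurrent.futures`: the caller waits for a bounded time and then falls back. The helper name `call_with_timeout` and the fallback value are assumptions for this example. Note that the slow call itself keeps running in its worker thread; the sketch only frees the caller.

```python
import concurrent.futures
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_timeout(fn: Callable[[], T], timeout_seconds: float, fallback: T) -> T:
    """Run fn in a worker thread; return fallback if it takes too long."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        future.cancel()  # best effort: a call that already started is not interrupted
        return fallback
    finally:
        executor.shutdown(wait=False)  # do not block the caller on the worker


print(call_with_timeout(lambda: "fast answer", 1.0, "fallback"))
```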
Slide 41
Slide 41 text
Failover
to a redundant unit when the error has been
detected and isolated.
Reminder: Redundancy
Active/Active    Active/Standby    N+M
Trade-off axes: Cost, Time To Recover
41
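Failover can be illustrated with a helper that tries redundant units in order and returns the first successful answer. This is a simplified sketch: it treats any exception as a detected error and does not model isolation of the failed unit. The helper and the two stand-in units are hypothetical.

```python
from typing import Callable, List, Sequence


def call_with_failover(units: Sequence[Callable[[str], str]], request: str) -> str:
    """Send the request to each unit in turn until one succeeds."""
    errors: List[Exception] = []
    for unit in units:
        try:
            return unit(request)
        except Exception as exc:  # any error counts as a detected failure here
            errors.append(exc)
    raise RuntimeError(f"all {len(units)} units failed: {errors}")


def active(request: str) -> str:
    raise ConnectionError("active unit is down")


def standby(request: str) -> str:
    return f"standby handled {request}"


print(call_with_failover([active, standby], "GET /user/42"))
```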
Slide 42
Slide 42 text
Intelligent Retries
[Chart: time between retries (y-axis) vs. number of attempts (x-axis) for Fixed, Linear and Exponential delay strategies]
42
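The three delay strategies can be written as small functions of the attempt number. The sketch below is one possible shape, assuming a base delay and a cap on attempts; the function names and the injectable `sleep` parameter are made up for this example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fixed(base: float, attempt: int) -> float:
    return base


def linear(base: float, attempt: int) -> float:
    return base * attempt


def exponential(base: float, attempt: int) -> float:
    return base * 2 ** (attempt - 1)


def retry(operation: Callable[[], T], max_attempts: int,
          delay_for: Callable[[float, int], float], base: float = 0.1,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            sleep(delay_for(base, attempt))
    raise ValueError("max_attempts must be at least 1")


for strategy in (fixed, linear, exponential):
    print(strategy.__name__, [strategy(1, n) for n in range(1, 5)])
```

Limiting the number of attempts matters as much as the delay: unbounded retries against a struggling service add load exactly when it can least handle it.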
Slide 43
Slide 43 text
Restart
can be used as a last resort, with the
trade-off of losing state and time.
43
Slide 44
Slide 44 text
Fail Fast
to shed load and give a great partial service
rather than a bad complete one.
Boundary
44
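Failing fast at a boundary can be sketched with a bounded queue that rejects new work immediately when it is full, instead of letting requests pile up. The `BoundedInbox` class and `Overloaded` exception are hypothetical names; the queue itself is the standard library's `queue.Queue`.

```python
import queue


class Overloaded(Exception):
    """Raised immediately when the inbox cannot accept more work."""


class BoundedInbox:
    def __init__(self, capacity: int) -> None:
        self._queue: "queue.Queue[str]" = queue.Queue(maxsize=capacity)

    def submit(self, request: str) -> None:
        """Accept the request or fail fast; never block the caller."""
        try:
            self._queue.put_nowait(request)
        except queue.Full:
            raise Overloaded(f"rejected {request!r}: inbox is full") from None

    def take(self) -> str:
        """Remove and return the oldest accepted request."""
        return self._queue.get_nowait()


inbox = BoundedInbox(capacity=2)
for request in ("a", "b", "c"):
    try:
        inbox.submit(request)
        print("accepted", request)
    except Overloaded as exc:
        print(exc)
```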
Slide 45
Slide 45 text
Backpressure
& Batching!
45
Slide 46
Slide 46 text
Case Study: Hystrix
https://raw.githubusercontent.com/wiki/Netflix/Hystrix/images/hystrix-flow-chart-original.png
46
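To give a feel for the kind of flow the Hystrix chart describes, here is a heavily simplified circuit breaker with a fallback. It is not the Hystrix API and leaves out most of what Hystrix does (thread-pool isolation, timeouts, metrics windows); the class name, threshold and reset interval are all assumptions for this sketch.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Opens after repeated failures and short-circuits calls to a fallback."""

    def __init__(self, failure_threshold: int, reset_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._reset_seconds = reset_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, operation: Callable[[], T], fallback: Callable[[], T]) -> T:
        was_open = self._opened_at is not None
        if was_open and self._clock() - self._opened_at < self._reset_seconds:
            return fallback()  # circuit open: do not touch the failing dependency
        try:
            result = operation()  # closed circuit, or a single trial call
        except Exception:
            self._failures += 1
            if was_open or self._failures >= self._threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            return fallback()
        self._failures = 0
        self._opened_at = None  # success closes the circuit
        return result


breaker = CircuitBreaker(failure_threshold=2, reset_seconds=10.0)
print(breaker.call(lambda: "live data", lambda: "cached data"))
```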
Slide 47
Slide 47 text
And more!
Recovery
• Rollback
• Roll-Forward
• Checkpoints
• Data Reset
Mitigation
• Bounded Queuing
• Expansive Controls
• Marking Data
• Error Correcting Codes
47
Slide 49
Slide 49 text
Recommended
Reading
49
Slide 50
Slide 50 text
Patterns for
Fault-Tolerant Software
by Robert S. Hanmer
50
Slide 51
Slide 51 text
Release It!
by Michael T. Nygard
51
Slide 52
Slide 52 text
Announcement
Couchbase Server 4.0 Developer Preview!
52
http://blog.couchbase.com/introducing-developer-preview-for-couchbase-server-4.0