@daschl
#Voxxed
The Walking Dead
A Survival Guide to Resilient Applications
Michael Nitschinger
Slide 2
the right
Mindset
Slide 3
“The more you sweat in peace, the less
you bleed in war.”
- U.S. Marine Corps
Slide 4
Slide 5
Slide 6
Not so fast, mister fancy tests!
Slide 7
Always ask yourself:
What can go wrong?
Slide 8
Fault Tolerance
101
Slide 9
Fault Error Failure
A fault is a latent defect that can cause an
error when activated.
Slide 10
Fault Error Failure
Errors are the manifestations of faults.
Slide 11
Fault Error Failure
Failure occurs when the service no longer
complies with its specifications.
Slide 12
Fault Error Failure
Errors are inevitable. We need to
detect, recover and mitigate
them before they become failures.
Slide 13
Reliability
is the probability that a system will perform
failure-free for a given amount of time.
MTTF Mean Time To Failure
MTTR Mean Time To Repair
Slide 14
Availability
is the percentage of time the system is able to
perform its function.
availability = MTTF / (MTTF + MTTR)
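As a worked example of the formula above, here is a minimal Python sketch. The function name and the sample figures (1000 hours between failures, 1 hour to repair) are illustrative assumptions, not numbers from the talk.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is able to perform its function."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical system: fails every 1000 hours on average, takes 1 hour to repair.
print(f"{availability(1000, 1):.4%}")
```

Note that halving the repair time improves availability about as much as doubling the time between failures, which is why fast recovery matters as much as avoiding errors.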
Slide 15
Expression        Availability   Downtime/Year
Three 9s          99.9%          525.6 min
Four 9s           99.99%         52.56 min
Four 9s and a 5   99.995%        26.28 min
Five 9s           99.999%        5.256 min
Six 9s            99.9999%       0.5256 min
                  100%           0
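The downtime column can be reproduced from the availability percentage. A small sketch, assuming a 365-day year (525,600 minutes), which is what the figures in the table imply; the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years are ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Expected unavailable minutes per year for a given availability."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for pct in (99.9, 99.99, 99.995, 99.999, 99.9999, 100.0):
    print(f"{pct}% -> {downtime_minutes_per_year(pct):.4f} min")
```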
Slide 16
Pop Quiz!
Edge Service
Wanted: 99.99% Availability
User Service: ???   Session Store: ???   Data Warehouse: ???
Slide 17
Pop Quiz!
Edge Service
Wanted: 99.99% Availability
User Service: 99.999%   Session Store: 99.999%   Data Warehouse: 99.999%
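The quiz answer can be checked with a little arithmetic. The sketch below assumes that the Edge Service needs all three backends for every request and that they fail independently, so their availabilities multiply; both assumptions and the function name are mine, and the Edge Service's own availability is left out.

```python
from math import prod


def serial_availability(dependencies):
    """Availability of a service that needs every dependency to be up."""
    return prod(dependencies)


wanted = 0.9999
three_at_five_nines = serial_availability([0.99999] * 3)
three_at_four_nines = serial_availability([0.9999] * 3)

print(f"3 x 99.999%: {three_at_five_nines:.5%}, meets target: {three_at_five_nines >= wanted}")
print(f"3 x 99.99%:  {three_at_four_nines:.5%}, meets target: {three_at_four_nines >= wanted}")
```

Under these assumptions, three dependencies at 99.99% each are not enough for a 99.99% target, because every additional dependency in the request path lowers the combined figure.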
Slide 18
Fault Tolerant
Architecture
Slide 19
Units of Mitigation
are the basic units of
error containment and recovery.
Slide 20
Slide 21
Redundancy
[Diagram: Active/Active, Active/Standby and N+M compared by Cost and Time To Recover]
Slide 22
Escalation
is used when recovery or mitigation
is not possible inside the unit.
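Escalation can be pictured as a chain of units, each of which either recovers from an error itself or hands it to its parent. The sketch below is only an illustration in plain Python; the `Unit` class, its parameters and the error types are hypothetical and do not reflect any particular framework's supervision API.

```python
class Unit:
    """A unit of mitigation that recovers locally or escalates to its parent."""

    def __init__(self, name, recoverable=(), parent=None):
        self.name = name
        self.recoverable = tuple(recoverable)  # error types this unit can recover from
        self.parent = parent

    def handle(self, error: Exception) -> str:
        """Return the name of the unit that recovered from the error."""
        if isinstance(error, self.recoverable):
            return self.name  # recovered inside this unit
        if self.parent is None:
            raise error  # nobody left to escalate to
        return self.parent.handle(error)  # escalate one level up


root = Unit("root", recoverable=(ConnectionError,))
worker = Unit("worker", recoverable=(TimeoutError,), parent=root)

print(worker.handle(TimeoutError()))     # handled by the worker itself
print(worker.handle(ConnectionError()))  # escalated to the root
```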
Slide 23
Escalation
taken from http://letitcrash.com/post/30165507578/shutdown-patterns-in-akka-2
Slide 24
The Fault Observer
receives system and error events and
can guide and orchestrate detection and recovery
[Diagram: Units, Observer, Listeners]
Slide 25
Slide 26
Slide 27
Detecting
Errors
Slide 28
A silent system
is a dead system.
Slide 29
A System Monitor
helps to study the system's behaviour and to
make sure it is operating as specified.
http://upload.wikimedia.org/wikipedia/commons/3/3b/Mission_control_center.jpg
Slide 30
https://github.com/Netflix/Turbine
Slide 31
Periodic Checking
Heartbeats monitor tasks or remote services
and initiate recovery
Routine Exercises prevent idle
unit starvation and surface malfunctions
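A heartbeat check like the one described above can be sketched in a few lines. This is a generic, framework-free illustration; the `HeartbeatMonitor` class, its timeout parameter and the injectable clock are assumptions made for the example.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per unit and reports units that went silent."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout = timeout_seconds
        self.clock = clock  # injectable so the example can be tested without waiting
        self.last_seen = {}

    def beat(self, unit):
        """Record a heartbeat from the given unit."""
        self.last_seen[unit] = self.clock()

    def silent_units(self):
        """Units whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            unit for unit, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

A caller would invoke `silent_units()` periodically and start recovery (retry, failover, restart) for whatever it returns.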
Slide 32
Utilizing Netty’s IdleStateHandler
Slide 33
Riding over Transients
is used to defer error recovery
if the error is temporary.
“‘Patience is a virtue’ to allow the true signature of
an error to show itself.”
- Robert S. Hanmer
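One way to ride over transients is to count errors in a "leaky bucket" and only start recovery once the bucket fills up faster than it drains. The sketch below picks that technique for illustration; the class name, threshold and leak interval are assumptions, not details from the talk.

```python
import time


class LeakyBucket:
    """Counts errors; the count drains by one every leak_interval seconds."""

    def __init__(self, threshold, leak_interval, clock=time.monotonic):
        self.threshold = threshold
        self.leak_interval = leak_interval
        self.clock = clock
        self.level = 0
        self.last_leak = clock()

    def _leak(self):
        # Drain one error for every full interval that has passed.
        leaked = int((self.clock() - self.last_leak) // self.leak_interval)
        if leaked:
            self.level = max(0, self.level - leaked)
            self.last_leak += leaked * self.leak_interval

    def record_error(self) -> bool:
        """Record an error; return True once recovery should be triggered."""
        self._leak()
        self.level += 1
        return self.level >= self.threshold
```

Isolated errors drain away on their own, while a burst of errors within a short time crosses the threshold and signals that the problem is not transient.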
Slide 37
Failover
to a redundant unit when the error has been
detected and isolated.
Reminder: Redundancy
[Diagram: Active/Active, Active/Standby and N+M compared by Cost and Time To Recover]
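Failing over to a redundant unit, as described on the Failover slide, can be sketched as trying each unit in order until one succeeds. The function and the two stand-in units below are hypothetical; real failover would also need error detection and isolation before switching.

```python
def call_with_failover(units, request):
    """Send the request to the first unit that answers without an error."""
    last_error = None
    for unit in units:
        try:
            return unit(request)
        except Exception as error:  # this unit is considered failed
            last_error = error
    raise RuntimeError("all redundant units failed") from last_error


def primary(request):
    # Stand-in for an active unit that is currently down.
    raise ConnectionError("primary is down")


def standby(request):
    # Stand-in for a healthy standby unit.
    return f"standby handled {request}"


print(call_with_failover([primary, standby], "GET /user/42"))
```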
Slide 38
Intelligent Retries
[Chart: Time between Retries vs. Number of Attempts for Fixed, Linear and Exponential strategies]
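The three strategies differ only in how the wait grows with each attempt. A minimal sketch, assuming a base delay in seconds and a doubling factor for the exponential case; the function names and the injectable `sleep` are illustrative choices.

```python
import time


def retry_delays(strategy, base_seconds, attempts):
    """Delays to wait after the 1st, 2nd, ... failed attempt."""
    if strategy == "fixed":
        return [base_seconds for _ in range(attempts)]
    if strategy == "linear":
        return [base_seconds * n for n in range(1, attempts + 1)]
    if strategy == "exponential":
        return [base_seconds * 2 ** (n - 1) for n in range(1, attempts + 1)]
    raise ValueError(f"unknown strategy: {strategy}")


def retry(operation, delays, sleep=time.sleep):
    """Call operation; after each failure wait the next delay, then try again."""
    for delay in delays:
        try:
            return operation()
        except Exception:
            sleep(delay)
    return operation()  # final attempt; its exception propagates to the caller


print(retry_delays("fixed", 1, 4))
print(retry_delays("linear", 1, 4))
print(retry_delays("exponential", 1, 4))
```

In practice a cap on the delay and some random jitter are often added so that many clients do not retry in lockstep.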
Slide 39
Restart
can be used as a last resort, at the
cost of losing state and time.
Slide 40
Fail Fast
to shed load and give a great partial service
rather than a bad complete one.
Boundary
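Failing fast at a boundary can be sketched as rejecting work immediately once a fixed number of requests is in flight, instead of queueing it. The `Boundary` and `Overloaded` names and the limit are hypothetical choices for this example.

```python
import threading


class Overloaded(Exception):
    """Raised immediately when the boundary is at capacity."""


class Boundary:
    """Admits at most max_in_flight concurrent calls and sheds the rest."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self._lock = threading.Lock()

    def call(self, handler, request):
        with self._lock:
            if self.in_flight >= self.max_in_flight:
                raise Overloaded("at capacity, request rejected")
            self.in_flight += 1
        try:
            return handler(request)  # runs outside the lock
        finally:
            with self._lock:
                self.in_flight -= 1
```

Callers that receive `Overloaded` get an answer right away and can degrade gracefully, while admitted requests keep their normal latency.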
Slide 41
Backpressure
& Batching!
Slide 42
Case Study: Hystrix
https://raw.githubusercontent.com/wiki/Netflix/Hystrix/images/hystrix-flow-chart-original.png
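Hystrix is built around the circuit breaker pattern with fallbacks. The sketch below is a simplified, generic illustration of that pattern in Python and is not the Hystrix API: it counts consecutive failures rather than an error percentage over a rolling window, and all names and parameters are assumptions.

```python
import time


class CircuitBreaker:
    """Opens after repeated failures and serves the fallback while open."""

    def __init__(self, failure_threshold, reset_timeout, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to stay open before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()  # open: fail fast without calling the operation
            # Timeout elapsed: half-open, let this one call through as a trial.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result
```

While the circuit is open the failing dependency gets time to recover, and callers receive the fallback immediately instead of waiting for a timeout.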
Slide 43
And more!
Recovery
• Rollback
• Roll-Forward
• Checkpoints
• Data Reset
Mitigation
• Bounded Queuing
• Expansive Controls
• Marking Data
• Error Correcting Codes
Slide 45
Watch it in
Action
Slide 46
Recommended
Reading
Slide 47
Patterns for
Fault-Tolerant Software
by Robert S. Hanmer