Michael Nitschinger
May 12, 2015
The Walking Dead - A Survival Guide to Resilient Reactive Applications
I gave this talk at GeeCon 2015 in Krakow. Recording will be available through the GeeCon channels.
Transcript
The Walking Dead A Survival Guide to Resilient Reactive Applications
Michael Nitschinger @daschl
The Right Mindset
“The more you sweat in peace, the less you bleed in war.” - U.S. Marine Corps
Not so fast, mister fancy tests!
Always ask yourself: What can go wrong?
Fault Tolerance 101
Fault, Error, Failure
A fault is a latent defect that can cause an error when activated.
Errors are the manifestations of faults.
Failure occurs when the service no longer complies with its specifications.
Errors are inevitable. We need to detect, recover from and mitigate them before they become failures.
Reliability is the probability that a system will perform failure-free for a given amount of time.
MTTF: Mean Time To Failure
MTTR: Mean Time To Repair
Availability is the percentage of time the system is able to perform its function.
availability = MTTF / (MTTF + MTTR)
Expression        Availability   Downtime/Year
Three 9s          99.9%          525.6 min
Four 9s           99.99%         52.56 min
Four 9s and a 5   99.995%        26.28 min
Five 9s           99.999%        5.256 min
Six 9s            99.9999%       0.5256 min
                  100%           0
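The availability formula and the downtime figures above can be checked with a few lines of Java. This is a minimal sketch: the class and method names are illustrative and not from the talk, and a 365-day year is assumed, which is what the 525.6 min figure for three nines implies.

```java
// Illustrative sketch; names are hypothetical and not from the talk.
final class Availability {
    // Minutes in a 365-day year (assumption, consistent with the table above).
    static final double MINUTES_PER_YEAR = 365 * 24 * 60;

    // availability = MTTF / (MTTF + MTTR); both values must use the same time unit.
    static double fromMttfMttr(double mttf, double mttr) {
        return mttf / (mttf + mttr);
    }

    // Expected downtime per year for an availability given as a fraction (0.999 = three nines).
    static double downtimeMinutesPerYear(double availability) {
        return (1.0 - availability) * MINUTES_PER_YEAR;
    }

    public static void main(String[] args) {
        // Example: a failure every 999 hours on average, one hour to repair.
        double a = fromMttfMttr(999, 1);
        System.out.println("availability = " + a);
        System.out.println("downtime min/year = " + downtimeMinutesPerYear(a));
    }
}
```

Lowering MTTR improves availability just as raising MTTF does, which is why the later slides spend so much time on detection and recovery.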
Pop Quiz!
An Edge Service sits in front of a User Service, a Session Store and a Data Warehouse.
Wanted: 99.99% availability. What does each of the three services behind it need to provide?
First guess: 99.99%, 99.99%, 99.99%
Answer: ~99.999%, ~99.999%, ~99.999%
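The reasoning behind the quiz can be sketched in code. The sketch assumes that every request needs all three services and that they fail independently, so their availabilities multiply. Neither assumption is stated on the slides, and the names are hypothetical.

```java
// Illustrative sketch; assumes independent failures and a strictly serial dependency chain.
final class SerialAvailability {
    // Availability of a request path that needs every dependency to be up:
    // the product of the individual availabilities.
    static double combined(double... availabilities) {
        double product = 1.0;
        for (double a : availabilities) {
            product *= a;
        }
        return product;
    }

    // Availability each of n equally reliable dependencies needs so that
    // the chain as a whole reaches the target: the n-th root of the target.
    static double requiredPerDependency(double target, int n) {
        return Math.pow(target, 1.0 / n);
    }

    public static void main(String[] args) {
        System.out.println(combined(0.9999, 0.9999, 0.9999));
        System.out.println(requiredPerDependency(0.9999, 3));
    }
}
```

Under these assumptions, three dependencies at 99.99% each combine to roughly 99.97%, which misses the target. Each dependency has to be noticeably better than the availability wanted at the edge.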
Fault Tolerant Architecture
Units of Mitigation are the basic units of error containment and recovery.
Escalation is used when recovery or mitigation is not possible inside the unit.
[Diagram, shown on four consecutive slides: Escalation across a Cluster with two Nodes, five Services and five Endpoints]
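Escalation through nested units of mitigation could look roughly like this. This is a sketch under stated assumptions: the unit names follow the diagram, while the error labels and the recovery predicate are made up for illustration.

```java
import java.util.function.Predicate;

// Illustrative sketch; the error labels and the recovery check are hypothetical.
final class MitigationUnit {
    private final String name;
    private final MitigationUnit parent;        // null for the outermost unit
    private final Predicate<String> canRecover; // decides whether this unit can handle an error

    MitigationUnit(String name, MitigationUnit parent, Predicate<String> canRecover) {
        this.name = name;
        this.parent = parent;
        this.canRecover = canRecover;
    }

    // Returns the name of the unit that handled the error, or "unhandled".
    String handle(String error) {
        if (canRecover.test(error)) {
            return name;                 // contained and recovered inside this unit
        }
        if (parent != null) {
            return parent.handle(error); // escalate to the enclosing unit
        }
        return "unhandled";
    }

    public static void main(String[] args) {
        MitigationUnit cluster = new MitigationUnit("cluster", null, e -> e.equals("node-down"));
        MitigationUnit node = new MitigationUnit("node", cluster, e -> e.equals("service-crash"));
        MitigationUnit service = new MitigationUnit("service", node, e -> e.equals("endpoint-lost"));
        MitigationUnit endpoint = new MitigationUnit("endpoint", service, e -> e.equals("socket-reset"));
        System.out.println(endpoint.handle("socket-reset"));
        System.out.println(endpoint.handle("node-down"));
    }
}
```

Each unit tries to contain the error first and only hands it upwards when it cannot, so the blast radius stays as small as possible.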
[Chart: Redundancy, plotting Cost against Time To Recover for Active/Active, Active/Standby, N+M and Active/Passive]
The Fault Observer receives system and error events and can guide and orchestrate detection and recovery.
[Diagram: Units, Observer, Listeners]
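A fault observer can be approximated with a small publish/subscribe class. This is a minimal sketch; the `report` and `subscribe` names and the string-based event format are assumptions for illustration only.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Illustrative sketch; method names and the event format are hypothetical.
final class FaultObserver {
    // CopyOnWriteArrayList lets listeners subscribe while events are being delivered.
    private final List<Consumer<String>> listeners = new CopyOnWriteArrayList<>();

    // Listeners are the parties interested in faults, e.g. logging or recovery logic.
    void subscribe(Consumer<String> listener) {
        listeners.add(listener);
    }

    // Units call this to report a system or error event.
    void report(String unit, String event) {
        String message = unit + ": " + event;
        for (Consumer<String> listener : listeners) {
            listener.accept(message);
        }
    }

    public static void main(String[] args) {
        FaultObserver observer = new FaultObserver();
        observer.subscribe(System.out::println);
        observer.report("endpoint-1", "connection reset");
    }
}
```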
Detecting Errors
A silent system is a dead system.
A System Monitor helps to study behaviour and to make sure it is operating as specified.
Image: http://upload.wikimedia.org/wikipedia/commons/3/3b/Mission_control_center.jpg
https://github.com/Netflix/Turbine
Periodic Checking
Heartbeats monitor tasks or remote services and initiate recovery.
Routine Exercises prevent idle unit starvation and surface malfunctions.
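Heartbeat monitoring boils down to remembering when each unit was last heard from. The sketch below takes timestamps as parameters so that it stays deterministic; the class name and the idea of returning a list of overdue units are assumptions, not something the slides prescribe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch; names are hypothetical.
final class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever a heartbeat from the given unit arrives.
    void beat(String unit, long nowMillis) {
        lastSeen.put(unit, nowMillis);
    }

    // Units whose last heartbeat is older than the timeout; recovery would be initiated for these.
    List<String> overdue(long nowMillis) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastSeen.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

In a real system `overdue` would be called from a scheduled task with the current time, and its result would feed whatever recovery action fits the unit.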
[Diagram: Endpoint pipeline with Encoders, Netty Writes, Netty Reads and Decoders; with No Traffic, an Event on Idle is raised]
Riding over Transients is used to defer error recovery if the error is temporary.
“‘Patience is a virtue’ to allow the true signature of an error to show itself.” - Robert S. Hanmer
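One simple way to ride over transients is to act only after several consecutive failures. This is a sketch of that idea, not the only option (a leaky bucket counter is another). The threshold and the names are illustrative.

```java
// Illustrative sketch; the consecutive-failure rule is one possible policy.
final class TransientFilter {
    private final int threshold;
    private int consecutiveFailures;

    TransientFilter(int threshold) {
        this.threshold = threshold;
    }

    // Records one observation and returns true once the error should be
    // treated as persistent, i.e. recovery should no longer be deferred.
    boolean record(boolean failed) {
        if (failed) {
            consecutiveFailures++;
        } else {
            consecutiveFailures = 0; // a success resets the count: the error was transient
        }
        return consecutiveFailures >= threshold;
    }
}
```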
And more!
• Complete Parameter Checking
• Watchdogs
• Voting
• Checksums
• Routine Audits
Recovery and Mitigation of Errors
Timeout to not wait forever and keep holding up the resource.
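In plain Java, a timeout on an asynchronous result can be expressed with `CompletableFuture.get(timeout, unit)`. The helper below is a minimal sketch; the fallback-value policy and the class name are assumptions for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative sketch; returning a fallback value is one possible policy.
final class TimeoutExample {
    // Waits at most timeoutMillis for the result and returns the fallback otherwise.
    static <T> T getOrFallback(CompletableFuture<T> future, long timeoutMillis, T fallback) {
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // mark the future as cancelled so nobody else keeps waiting on it
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the operation itself failed
        }
    }
}
```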
Failover to a redundant unit when the error has been detected and isolated.
[Chart, Redundancy Reminder: Cost against Time To Recover for Active/Active, Active/Standby and N+M]
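A failover across redundant units can be sketched as trying them in order. The use of `Supplier` to stand in for a unit and the exception policy are assumptions. A production version would also have to isolate the failed unit rather than try it again on the next call.

```java
import java.util.List;
import java.util.function.Supplier;

// Illustrative sketch; a Supplier stands in for a call to one redundant unit.
final class Failover {
    // Tries each unit in order and returns the first successful result.
    static <T> T call(List<Supplier<T>> units) {
        RuntimeException last = null;
        for (Supplier<T> unit : units) {
            try {
                return unit.get();
            } catch (RuntimeException e) {
                last = e; // remember the error and fail over to the next unit
            }
        }
        throw new IllegalStateException("all units failed", last);
    }
}
```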
Intelligent Retries
[Chart: Time between Retries against Number of Attempts for Fixed, Linear and Exponential strategies]
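The three retry strategies differ only in how the delay grows with the attempt number. The sketch below computes that delay; the cap (`maxMillis`) and the doubling factor for the exponential case are assumptions added for illustration.

```java
// Illustrative sketch; the cap and the factor of two are assumptions.
final class Backoff {
    enum Strategy { FIXED, LINEAR, EXPONENTIAL }

    // Delay before retry number `attempt` (1-based), capped at maxMillis.
    static long delayMillis(Strategy strategy, int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis; // FIXED: always the base delay
        if (strategy == Strategy.LINEAR) {
            delay = baseMillis * attempt; // grows by the base delay per attempt
        } else if (strategy == Strategy.EXPONENTIAL) {
            // doubles per attempt; stops early once the cap is reached
            for (int i = 1; i < attempt && delay < maxMillis; i++) {
                delay *= 2;
            }
        }
        return Math.min(delay, maxMillis);
    }

    public static void main(String[] args) {
        for (int attempt = 1; attempt <= 5; attempt++) {
            System.out.println(attempt + ": " + delayMillis(Strategy.EXPONENTIAL, attempt, 100, 10_000));
        }
    }
}
```

Adding random jitter to the computed delay is a common refinement, so that many clients do not retry at the same instant.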
Restart can be used as a last resort, with the trade-off of losing state and time.
Fail Fast to shed load and give a partial but great service rather than a complete but bad one.
[Diagram label: Boundary]
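Failing fast at a boundary can be sketched with a semaphore that rejects work instead of queueing it. The class name and the use of an empty `Optional` to signal rejection are assumptions for illustration.

```java
import java.util.Optional;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Illustrative sketch; an empty Optional signals that the call was shed.
final class FailFastBoundary {
    private final Semaphore permits;

    FailFastBoundary(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    // Runs the work if a permit is free; otherwise rejects immediately instead of waiting.
    <T> Optional<T> tryCall(Supplier<T> work) {
        if (!permits.tryAcquire()) {
            return Optional.empty(); // shed load: the caller gets a fast answer
        }
        try {
            return Optional.ofNullable(work.get());
        } finally {
            permits.release();
        }
    }
}
```

Callers that are rejected can degrade gracefully, for example by serving a cached or partial response.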
Backpressure & Batching!
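Both ideas can be shown with a bounded queue: a full queue tells the producer to slow down, and the consumer drains several items at once. This is a minimal sketch using `java.util.concurrent` only; reactive libraries express backpressure differently, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative sketch; not how a reactive streams library implements backpressure.
final class BatchingBuffer<T> {
    private final BlockingQueue<T> queue;

    BatchingBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Returns false when the buffer is full: the signal for the producer to slow down.
    boolean offer(T item) {
        return queue.offer(item);
    }

    // Removes up to maxBatch items so the consumer can process them in one go.
    List<T> nextBatch(int maxBatch) {
        List<T> batch = new ArrayList<>();
        queue.drainTo(batch, maxBatch);
        return batch;
    }
}
```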
Case Study: Hystrix
Image: https://raw.githubusercontent.com/wiki/Netflix/Hystrix/images/hystrix-flow-chart-original.png
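The core idea behind a circuit breaker, which Hystrix is known for, can be sketched in a few lines. This is deliberately not the Hystrix API and is far simpler than what Hystrix does: the consecutive-failure threshold, the names and the explicit timestamps are all assumptions for illustration.

```java
// Illustrative sketch of a circuit breaker; this is not the Hystrix API.
final class CircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private int failures;
    private long openedAt = -1; // -1 means the circuit is closed

    CircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    // True if a call may be attempted: the circuit is closed, or it has been
    // open long enough that a trial call is allowed.
    boolean allowRequest(long nowMillis) {
        return openedAt < 0 || nowMillis - openedAt >= openMillis;
    }

    // A successful call closes the circuit and clears the failure count.
    void recordSuccess() {
        failures = 0;
        openedAt = -1;
    }

    // Once the threshold is reached, every further failure (re)opens the circuit.
    void recordFailure(long nowMillis) {
        failures++;
        if (failures >= failureThreshold) {
            openedAt = nowMillis;
        }
    }
}
```

While the circuit is open, callers fail fast instead of piling up on a dependency that is already struggling.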
And more!
Recovery
• Rollback
• Roll-Forward
• Checkpoints
• Data Reset
Mitigation
• Bounded Queuing
• Expansive Controls
• Marking Data
• Error Correcting Codes
Recommended Reading
Patterns for Fault-Tolerant Software by Robert S. Hanmer
Release It! by Michael T. Nygard
Any Questions?
twitter @daschl email
[email protected]
Thank you!