Michael Nitschinger
April 23, 2015
The Walking Dead - A Survival Guide to Resilient Reactive Applications
This talk was given at JAX 2015 in Mainz.
Transcript
Michael Nitschinger | Couchbase, Inc.
The Walking Dead: A Survival Guide to Reactive Resilient Applications
The Right Mindset
“The more you sweat in peace, the less you bleed in war.”
– U.S. Marine Corps
Not so fast, mister fancy tests!
Always ask yourself: What can go wrong?
Fault Tolerance 101
Fault → Error → Failure
A fault is a latent defect that can cause an error when activated.
Errors are the manifestations of faults.
Failure occurs when the service no longer complies with its specifications.
Errors are inevitable. We need to detect, recover from, and mitigate them before they become failures.
Reliability is the probability that a system will perform failure-free for a given amount of time.
MTTF: Mean Time To Failure
MTTR: Mean Time To Repair
Availability is the percentage of time the system is able to perform its function.
availability = MTTF / (MTTF + MTTR)
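As a quick worked example of the formula above, here is a minimal sketch. The function name and the MTTF/MTTR figures are hypothetical and chosen only for illustration.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical numbers: a failure every 1000 hours on average, 1 hour to repair.
example = availability(1000, 1)
print(f"{example:.4%}")  # 99.9001%
```

With these assumed numbers the system just reaches three 9s. Halving the repair time improves availability about as much as doubling the time between failures.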
Expression        Availability   Downtime/Year
Three 9s          99.9%          525.6 min
Four 9s           99.99%         52.56 min
Four 9s and a 5   99.995%        26.28 min
Five 9s           99.999%        5.256 min
Six 9s            99.9999%       0.5256 min
                  100%           0
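The downtime figures follow directly from the number of minutes in a year. A small sketch, assuming a 365-day year (the helper name is made up for this example):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes per year the system may be down at the given availability."""
    return (100.0 - availability_percent) / 100.0 * MINUTES_PER_YEAR


for pct in (99.9, 99.99, 99.995, 99.999, 99.9999):
    print(f"{pct}% -> {downtime_minutes_per_year(pct):.4g} min/year")
```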
Pop Quiz!
Services: Edge Service, User Service, Session Store, Data Warehouse
Wanted: 99.99% availability
Per-service availability: ??? ??? ???
Per-service availability: 99.99% 99.99% 99.99%
Per-service availability: ~99.999% ~99.999% ~99.999%
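One way to see why 99.99% per service is not enough: if a request needs several components in series and they fail independently, their availabilities multiply. The slides do not spell out the dependency structure, so the series arrangement and the independence of failures are assumptions of this sketch.

```python
from math import prod


def serial_availability(*dependencies: float) -> float:
    """Availability of a chain in which every dependency must be up."""
    return prod(dependencies)


four_nines_each = serial_availability(0.9999, 0.9999, 0.9999)
five_nines_each = serial_availability(0.99999, 0.99999, 0.99999)

print(f"{four_nines_each:.4%}")  # 99.9700%
print(f"{five_nines_each:.4%}")  # 99.9970%
```

Under these assumptions, three components at 99.99% each fall short of the 99.99% target, while roughly 99.999% each leaves some headroom.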
Fault Tolerant Architecture
Units of Mitigation are the basic units of error containment and recovery.
Escalation is used when recovery or mitigation is not possible inside the unit.
Escalation
[Diagram, shown in four steps: a Cluster containing two Nodes, five Services, and five Endpoints]
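Units of mitigation and escalation can be sketched as a chain of handlers, where each unit either recovers from an error itself or passes it to its parent. The class, the error names, and the assignment of errors to levels below are hypothetical, not taken from the talk.

```python
class Unit:
    """A unit of mitigation with an optional parent to escalate to."""

    def __init__(self, name, recoverable, parent=None):
        self.name = name
        self.recoverable = recoverable  # error kinds this unit can handle itself
        self.parent = parent

    def handle(self, error_kind):
        """Return the name of the unit that ends up handling the error."""
        if error_kind in self.recoverable:
            return self.name
        if self.parent is None:
            raise RuntimeError(f"unhandled error: {error_kind}")
        return self.parent.handle(error_kind)  # escalate one level up


# Hypothetical hierarchy mirroring Endpoint -> Service -> Node -> Cluster.
cluster = Unit("cluster", {"node_down"})
node = Unit("node", {"service_crash"}, parent=cluster)
service = Unit("service", {"endpoint_reset"}, parent=node)
endpoint = Unit("endpoint", {"socket_timeout"}, parent=service)

print(endpoint.handle("socket_timeout"))  # endpoint
print(endpoint.handle("node_down"))       # cluster
```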
Redundancy
[Chart: Cost versus Time To Recover for Active/Active, Active/Standby, N+M, and Active/Passive]
The Fault Observer receives system and error events and can guide and orchestrate detection and recovery.
[Diagram: four Units, an Observer, and two Listeners]
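A fault observer can be approximated by a small publish/subscribe hub: units report events, and listeners react to the kinds they care about. This is a minimal sketch; the class, method names, and event shape are assumptions made for illustration.

```python
class FaultObserver:
    """Collects events from units and forwards them to registered listeners."""

    def __init__(self):
        self._listeners = {}  # event kind -> list of callbacks
        self.events = []      # full history, useful for later diagnosis

    def subscribe(self, kind, listener):
        self._listeners.setdefault(kind, []).append(listener)

    def report(self, unit, kind, detail=""):
        event = (unit, kind, detail)
        self.events.append(event)
        for listener in self._listeners.get(kind, []):
            listener(event)


observer = FaultObserver()
restarted = []
# Hypothetical recovery action: remember which unit should be restarted.
observer.subscribe("error", lambda event: restarted.append(event[0]))

observer.report("service-1", "error", "connection lost")
observer.report("service-2", "info", "started")
print(restarted)  # ['service-1']
```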
Detecting Errors
A silent system is a dead system.
A System Monitor helps to study behaviour and to make sure the system is operating as specified.
http://cdn-www.airliners.net/aviation-photos/photos/9/2/1/0982129.jpg
https://github.com/Netflix/Turbine
Periodic Checking
• Heartbeats monitor tasks or remote services and initiate recovery.
• Routine Exercises prevent idle unit starvation and surface malfunctions.
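The heartbeat idea can be sketched as a monitor that records when each unit was last heard from and lists the ones that have been silent too long. The class and parameter names are hypothetical. The clock is injectable so that the example runs deterministically.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per unit and reports overdue units."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.last_seen = {}

    def beat(self, unit):
        self.last_seen[unit] = self.clock()

    def overdue(self):
        """Units whose last heartbeat is older than the timeout."""
        now = self.clock()
        return [unit for unit, seen in self.last_seen.items()
                if now - seen > self.timeout_seconds]


# Fake clock for the example; real code would use the default monotonic clock.
now = [0.0]
monitor = HeartbeatMonitor(timeout_seconds=5.0, clock=lambda: now[0])
monitor.beat("a")
monitor.beat("b")
now[0] = 4.0
monitor.beat("a")
now[0] = 7.0
silent = monitor.overdue()
print(silent)  # ['b']
```

A caller would typically run `overdue()` on a schedule and start recovery for every unit it returns.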
[Diagram: Endpoint pipeline with two Encoders, Netty Writes, Netty Reads, and two Decoders; labels: No Traffic, Event on Idle]
Riding over Transients is used to defer error recovery if the error is temporary.
“‘Patience is a virtue’ to allow the true signature of an error to show itself.” - Robert S. Hanmer
The Leaky Bucket
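A leaky bucket counter is one way to ride over transients: every error adds to the bucket, the bucket drains at a steady rate, and recovery is triggered only when it overflows. The sketch below is one simple variant with hypothetical names and thresholds, and timestamps are passed in explicitly to keep it deterministic.

```python
class LeakyBucket:
    """Error counter that drains over time and overflows only on bursts."""

    def __init__(self, capacity, leak_per_second):
        self.capacity = capacity
        self.leak_per_second = leak_per_second
        self.level = 0.0
        self.last = None

    def record_error(self, now):
        """Add one error at time `now` (seconds); True means the bucket overflowed."""
        if self.last is not None:
            leaked = (now - self.last) * self.leak_per_second
            self.level = max(0.0, self.level - leaked)
        self.last = now
        self.level += 1.0
        return self.level > self.capacity


bucket = LeakyBucket(capacity=3, leak_per_second=1.0)
sparse = [bucket.record_error(t) for t in (0.0, 10.0, 20.0)]
burst = [bucket.record_error(t) for t in (20.1, 20.2, 20.3)]
print(sparse)  # [False, False, False]
print(burst)   # [False, False, True]
```

Occasional errors drain away before the next one arrives, while a rapid burst fills the bucket and signals that the error is not transient.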
And more!
• Complete Parameter Checking
• Watchdogs
• Voting
• Checksums
• Routine Audits
Recovery and Mitigation of Errors
Timeout to avoid waiting forever and holding up the resource.
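A timeout can be sketched with the standard library by running the call on a worker thread and bounding how long the caller waits for the result. The helper name and fallback value are hypothetical. Note that in this sketch the caller stops waiting, but the slow call itself is not cancelled.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def call_with_timeout(fn, timeout_seconds, fallback):
    """Run fn on a worker thread; return fallback if it takes too long."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn)
    try:
        return future.result(timeout=timeout_seconds)
    except TimeoutError:
        return fallback
    finally:
        # Do not block on the worker; the abandoned call finishes in the background.
        executor.shutdown(wait=False)


def slow_call():
    time.sleep(0.2)
    return "late answer"


slow = call_with_timeout(slow_call, timeout_seconds=0.05, fallback="fallback")
fast = call_with_timeout(lambda: "ok", timeout_seconds=1.0, fallback="fallback")
print(slow, fast)  # fallback ok
```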
Failover to a redundant unit when the error has been detected and isolated.
[Chart, Redundancy Reminder: Cost versus Time To Recover for Active/Active, Active/Standby, and N+M]
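In its simplest form, failover means trying the redundant units in order until one of them answers. The following sketch assumes that a failed unit raises `ConnectionError`; the function and unit names are made up for the example.

```python
def call_with_failover(units, request):
    """Try each redundant unit in order; return the first successful answer."""
    last_error = None
    for unit in units:
        try:
            return unit(request)
        except ConnectionError as error:
            last_error = error  # this unit is considered failed, try the next
    raise RuntimeError("all redundant units failed") from last_error


def primary(request):
    raise ConnectionError("primary is down")


def standby(request):
    return f"standby handled {request}"


result = call_with_failover([primary, standby], "GET /users/42")
print(result)  # standby handled GET /users/42
```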
Intelligent Retries
[Chart: Time between Retries versus Number of Attempts for Fixed, Linear, and Exponential strategies]
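The three retry strategies differ only in how the delay grows with the attempt number. A minimal sketch, assuming 1-based attempt numbers and a base delay of one second (both are choices made for this example):

```python
def retry_delay(strategy, attempt, base=1.0):
    """Seconds to wait before the given retry attempt (attempt starts at 1)."""
    if strategy == "fixed":
        return base
    if strategy == "linear":
        return base * attempt
    if strategy == "exponential":
        return base * 2 ** (attempt - 1)
    raise ValueError(f"unknown strategy: {strategy}")


delays = {
    strategy: [retry_delay(strategy, attempt) for attempt in range(1, 6)]
    for strategy in ("fixed", "linear", "exponential")
}
for strategy, values in delays.items():
    print(strategy, values)
```

Real clients usually also cap the maximum delay and the number of attempts so that a retry loop cannot run forever.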
Restart can be used as a last resort, with the trade-off of losing state and time.
Fail Fast to shed load and give a great partial service rather than a bad complete one.
[Diagram: Boundary]
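One common way to fail fast at a boundary is to cap the number of requests in flight and reject the excess immediately instead of queueing it. The sketch below uses a non-blocking semaphore; the class and exception names are hypothetical.

```python
import threading


class Overloaded(Exception):
    """Raised when a request is rejected to shed load."""


class FailFastGate:
    """Admits at most max_in_flight concurrent calls and rejects the rest."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def call(self, fn):
        if not self._slots.acquire(blocking=False):
            raise Overloaded("too many requests in flight")
        try:
            return fn()
        finally:
            self._slots.release()


gate = FailFastGate(max_in_flight=1)


def outer():
    # While this call holds the only slot, a second call is rejected at once.
    try:
        gate.call(lambda: "inner")
    except Overloaded:
        return "inner rejected"
    return "inner accepted"


nested = gate.call(outer)
after = gate.call(lambda: "ok")
print(nested, after)  # inner rejected ok
```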
Backpressure & Batching!
Case Study: Hystrix
https://raw.githubusercontent.com/wiki/Netflix/Hystrix/images/hystrix-flow-chart-original.png
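Hystrix is built around the circuit breaker pattern. The sketch below is not the Hystrix API. It is a simplified, single-threaded illustration of the pattern with hypothetical names and thresholds: after enough consecutive failures the breaker opens and short-circuits calls, and after a reset window it lets one trial call through.

```python
import time


class CircuitOpen(Exception):
    """Raised when a call is short-circuited without reaching the backend."""


class CircuitBreaker:
    def __init__(self, failure_threshold, reset_seconds, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_seconds:
                raise CircuitOpen("short-circuited")
            # Reset window elapsed: fall through and allow one trial call.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result


now = [0.0]  # fake clock so the example is deterministic
breaker = CircuitBreaker(failure_threshold=2, reset_seconds=10.0, clock=lambda: now[0])


def failing():
    raise ConnectionError("backend down")


outcomes = []
for _ in range(3):
    try:
        breaker.call(failing)
    except CircuitOpen:
        outcomes.append("short-circuited")
    except ConnectionError:
        outcomes.append("failed")

now[0] = 11.0  # reset window elapsed, one trial call is allowed
outcomes.append(breaker.call(lambda: "ok"))
print(outcomes)  # ['failed', 'failed', 'short-circuited', 'ok']
```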
And more!
Recovery
• Rollback
• Roll-Forward
• Checkpoints
• Data Reset
Mitigation
• Bounded Queuing
• Expansive Controls
• Marking Data
• Error Correcting Codes
Recommended Reading
• Patterns for Fault-Tolerant Software by Robert S. Hanmer
• Release It! by Michael T. Nygard
Announcement: Couchbase Server 4.0 developer preview!
http://blog.couchbase.com/introducing-developer-preview-for-couchbase-server-4.0
Any Questions?
Twitter: @daschl
Email: [email protected]
Thank you!