Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Crash Only Software
Search
Antoine Grondin
April 28, 2016
Programming
0
130
Crash Only Software
Antoine Grondin
April 28, 2016
Tweet
Share
Other Decks in Programming
See All in Programming
Flutter with Dart MCP: All You Need - 박제창 2025 I/O Extended Busan
itsmedreamwalker
0
150
Reading Rails 1.0 Source Code
okuramasafumi
0
250
テストコードはもう書かない:JetBrains AI Assistantに委ねる非同期処理のテスト自動設計・生成
makun
0
540
ユーザーも開発者も悩ませない TV アプリ開発 ~Compose の内部実装から学ぶフォーカス制御~
taked137
0
190
AIを活用し、今後に備えるための技術知識 / Basic Knowledge to Utilize AI
kishida
22
5.9k
個人軟體時代
ethanhuang13
0
330
The Past, Present, and Future of Enterprise Java
ivargrimstad
0
420
JSONataを使ってみよう Step Functionsが楽しくなる実践テクニック #devio2025
dafujii
1
640
Updates on MLS on Ruby (and maybe more)
sylph01
1
180
そのAPI、誰のため? Androidライブラリ設計における利用者目線の実践テクニック
mkeeda
2
2.8k
AI Agents: How Do They Work and How to Build Them @ Shift 2025
slobodan
0
110
OSS開発者という働き方
andpad
5
1.7k
Featured
See All Featured
Imperfection Machines: The Place of Print at Facebook
scottboms
268
13k
Automating Front-end Workflow
addyosmani
1370
200k
Testing 201, or: Great Expectations
jmmastey
45
7.7k
How GitHub (no longer) Works
holman
315
140k
Designing for humans not robots
tammielis
253
25k
ピンチをチャンスに:未来をつくるプロダクトロードマップ #pmconf2020
aki_iinuma
127
53k
Designing for Performance
lara
610
69k
Code Reviewing Like a Champion
maltzj
525
40k
Unsuck your backbone
ammeep
671
58k
Intergalactic Javascript Robots from Outer Space
tanoku
272
27k
Why Our Code Smells
bkeepers
PRO
339
57k
Faster Mobile Websites
deanohume
309
31k
Transcript
Happiness through Crash-Only Software @AntoineGrondin
Who am I
None
None
None
None
None
None
None
crash-only
crash-only Catchy term that describes a way of coding and
organizing your infrastructure.
crash-only sounds like a crazy idea
crash-only it’s not =]
crash-only Failure doesn’t throw system into chaos.
crash-only Equilibrium toward progress and consistency.
litmus test
litmus test kill -9 every 10s
kill -9 Is progress still happening?
kill -9 Will you need to urgently wake up?
kill -9 Can you say “wtv, it’ll fix itself”
kill -9 Will you keep all important data?
kill -9 Will your system remain consistent?
litmus test Answered no to any of the previous?
None
None
None
None
None
preventing failure is a lost battle
it’s a lost battle Can only push failure into more
and more corners.
it’s a lost battle Failure minimization code adds complexity.
it’s a lost battle Added complexity induces more failures.
it’s a lost battle Attempts to fight failures result in
more failures.
None
None
only way to win is to choose not to fight
choose not to fight Failures will happen no matter what
you do.
choose not to fight Be pessimistic, assume it will happen
all the time.
choose not to fight How to design a pessimistic system
while minimizing complexity?
None
write less code
crash-only
crash-only: write less code collapse code paths
None
None
None
None
None
None
None
write less code
None
None
None
None
write less code do not gracefully shut things down
write less code do not gracefully shut things down false
illusion of safety
write less code do not gracefully shut things down counterproductive
write less code do not gracefully shut things down adds
complexity -> adds failure modes
beyond a single component
crash-only architecture made of crash-only components respect a contract
#1 servers try to process requests or crash
#2 clients send requests until success or TTL
#3 most components are stateless
#4 non-volatile state pushed outside of applications
distinguish between types of data #5
pick proper type of datastore based on volatility #6
communicate in a specific manner #7
communications
requests must be self-describing communications
requests must be self-describing time to live (TTL) communications
requests must be self-describing is_idempotent flag communications
None
on failure: retry_after gives a chance to crash-only component to
restart communications
exponential randomized backoffs communications
every resource acquisition is leased communications
communications
None
None
None
None
None
None
None
None
None
None
None
None
communication sidebar: exactly-once delivery
when you send exactly 1 message to exactly 1 server
exactly-once delivery
hint: it’s impossible exactly-once delivery
only 2 alternatives exactly-once delivery
only 2 alternatives: at-least-once delivery at-most-once delivery exactly-once delivery
IF message IS idempotent upon failure, retry_after until TTL exactly
at-least-once delivery
IF message IS NOT idempotent upon failure, rollback exactly at-most-once
delivery
None
None
None
None
delivery tradeoff I can see… tradeoffs… everywhere…
request have deadline request have is_idempotent flag servers try to
process or crash clients retry until deadline crash-only communication
state
volatile state
non-volatile state
crash-only datastores
RDBMS: postgres crash-only datastores
KV: BerkeleyDB crash-only datastores
append-log based datastores crash-only datastores
real-story
Disk Image Converter at DigitalOcean real-story
requirements real-story
Need to convert millions of images when we change format.
(say qcow to raw) real-story - requirements
Want to migrate ASAP to avoid running legacy code. real-story
- requirements
Don’t want to watch the thing. real-story - requirements
design real-story
Acceptable: at-least-once delivery real-story - design
at-least-once delivery If an image converted >1 time, we don’t
really care. Waste some time, better than losing customer image. real-story - design
non-volatile data: stored in crash-only DB real-story - design
volatile data: stored in Redis (not crash-only) real-story - design
fancy schema real-story - design
all jobs in a queue real-story - design
job leases stored in hash with TTL real-story - design
workers look for jobs without leases real-story - design
Step 1: make a lease Step 2: refresh lease while
working Step 3: delete job+lease once done real-story - design
Step 1: make a lease Step 2: refresh lease while
working Step 3: delete job+lease once done real-story - design
Step 1: make a lease Step 2: refresh lease while
working Step 3: delete job+lease once done real-story - design
if worker dies? real-story - design
if worker dies: its leases expire jobs will get picked
up again real-story - design
if worker dies: its leases expire jobs will get picked
up again real-story - design
if worker fails to delete job once done? real-story -
design
job will be performed again that’s ok real-story - design
if redis crashes? real-story - design
can reconstruct the queue from primary data source real-story -
design
outcome real-story
when issue arose real-story - outcome
get paged by operator real-story - outcome
tell operator to let it go real-story - outcome
tell operator to let it go … the process manager
restarts components real-story - outcome
tell operator to let it go … if issues, system
converges toward progress real-story - outcome
never had to fix anything (touches wood) real-story - outcome
still in use real-story - outcome
and we’re happy real-story - outcome
key points real-story
non-volatile state stored in crash-only DB real-story - key points
volatile state stored in crash-unsafe DB (more like a message
bus) real-story - key points
resources are leased real-story - key points
resources are self-describing real-story - key points
at-least-once vs. at-most-once tradeoff deliberately thought out real-story - key
points
crash-only
is it a panacea? crash-only
no crash-only
other important considerations remain crash-only
caveats
Things That Are Still Important caveats
Things That Are Still Important: Circuit Breakers caveats
Things That Are Still Important: Fallbacks caveats
Things That Are Still Important: Error Recovery caveats
Things That Are Still Important: Degraded Modes caveats
Things That Are Still Important: Cattle vs Pets caveats
Things That Are Still Important: Error Tracking/Reporting caveats
Things That Are Still Important: Debugging Crashed Components caveats
Goes in hand with those strategies caveats
It’s not a Free Lunch caveats
It’s not a Free Lunch (it’s a good step toward
one) caveats
conclusion
Crash-Only software is great! conclusion
Don’t need to wake up! Crash-Only Software Is Great
Stuff just works. Crash-Only Software Is Great
Failures are fun to look at. Crash-Only Software Is Great
It feels good. Crash-Only Software Is Great
Easy to fix when it goes wrong. Crash-Only Software Is
Great
Reduces complexity in code. Crash-Only Software Is Great
Don’t need to wake up! Crash-Only Software Is Great
References Recursive Restartability Candea & Fox, 2001 Crash-Only software Candea
& Fox, 2003 Crash-Only software, More than meets the eye LWN.net, https://lwn.net/Articles/191059/ A Crash Course In Failure NPlus1.org, http://web.archive.org/web/20090430014122/http://nplus1.org/articles/a-crash- course-in-failure/
La Fin comments, ideas: tweet me @AntoineGrondin