Crash Only Software
Antoine Grondin
April 28, 2016
Programming
Transcript
Happiness through Crash-Only Software @AntoineGrondin
Who am I
crash-only
Catchy term that describes a way of coding and organizing your infrastructure.
Sounds like a crazy idea. It's not =]
Failure doesn't throw the system into chaos.
Equilibrium toward progress and consistency.
litmus test
kill -9 every 10s
Is progress still happening?
Will you need to urgently wake up?
Can you say "wtv, it'll fix itself"?
Will you keep all important data?
Will your system remain consistent?
Answered no to any of the previous?
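The litmus test above can be tried on a toy component. The following is a minimal sketch, not the speaker's tooling: the worker script, the progress file and the `litmus` helper are all hypothetical names invented for illustration. It starts a worker, kills it without warning at a fixed interval, restarts it, and records whether durable progress kept moving forward.

```python
import subprocess
import sys
import time

# Hypothetical worker: durably increments a counter, with no shutdown handling at all.
WORKER = r"""
import os, sys
path = sys.argv[1]
while True:
    try:
        with open(path) as f:
            n = int(f.read())
    except (FileNotFoundError, ValueError):
        n = 0
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(n + 1))
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # atomic swap: readers see the old or the new count
"""


def read_progress(path):
    """Return the last durably recorded count, or 0 if nothing was recorded yet."""
    try:
        with open(path) as f:
            return int(f.read())
    except (FileNotFoundError, ValueError):
        return 0


def litmus(path, rounds=3, interval=10.0):
    """Start the worker, kill it after `interval` seconds, repeat; return progress per round."""
    seen = []
    for _ in range(rounds):
        proc = subprocess.Popen([sys.executable, "-c", WORKER, path])
        time.sleep(interval)
        proc.kill()  # SIGKILL on POSIX: the worker gets no chance to clean up
        proc.wait()
        seen.append(read_progress(path))
    return seen
```

If the returned counts keep increasing from round to round, this toy worker passes the "is progress still happening?" question; the other questions still need to be answered for a real system.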
preventing failure is a lost battle
Can only push failure into more and more corners.
Failure minimization code adds complexity.
Added complexity induces more failures.
Attempts to fight failures result in more failures.
only way to win is to choose not to fight
Failures will happen no matter what you do.
Be pessimistic, assume it will happen all the time.
How to design a pessimistic system while minimizing complexity?
write less code
collapse code paths
do not gracefully shut things down
false illusion of safety
counterproductive
adds complexity -> adds failure modes
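One way to read "collapse code paths" and "do not gracefully shut things down" is that startup and recovery become the same code, and shutdown is simply a crash. The sketch below assumes a hypothetical component whose only state is a small JSON file; the function names and the `.state-` temp prefix are illustrative, not from the talk. Full durability would also need an fsync on the directory, which is omitted here.

```python
import json
import os
import tempfile


def save_state(path: str, state: dict) -> None:
    """Replace the state file so that a kill at any point leaves the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".state-", suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    os.replace(tmp_path, path)  # atomic rename on POSIX filesystems


def start(path: str) -> dict:
    """The only startup path: always recover, whether the last run crashed or not."""
    directory = os.path.dirname(os.path.abspath(path))
    for name in os.listdir(directory):
        if name.startswith(".state-") and name.endswith(".tmp"):
            os.remove(os.path.join(directory, name))  # leftover from a crash mid-write
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"done": 0}  # first run ever


def stop() -> None:
    """The only shutdown path: exit immediately, exactly like a crash would."""
    os._exit(1)
```

Because `start` never assumes a clean previous exit, there is no separate "recovery mode" to test, and `stop` needs no cleanup logic that could itself fail.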
beyond a single component
crash-only architecture: made of crash-only components that respect a contract
#1 servers try to process requests or crash
#2 clients send requests until success or TTL
#3 most components are stateless
#4 non-volatile state pushed outside of applications
#5 distinguish between types of data
#6 pick proper type of datastore based on volatility
#7 communicate in a specific manner
communications
requests must be self-describing
time to live (TTL)
is_idempotent flag
on failure: retry_after gives a chance to crash-only component to restart
exponential randomized backoffs
every resource acquisition is leased
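A self-describing request and a randomized exponential backoff could look like the sketch below. The `Request` class, its field names and the `backoff_delay` helper are assumptions made for illustration; the slides only name the TTL, the `is_idempotent` flag and the backoff idea. The jitter strategy shown (uniform between zero and an exponentially growing ceiling) is one common choice, not necessarily the one the speaker used.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """A request that carries everything needed to decide how to treat it."""
    payload: str
    deadline: float      # absolute time derived from the TTL; after it, nobody wants the answer
    is_idempotent: bool  # True if processing the request twice is harmless

    def expired(self, now: float) -> bool:
        return now >= self.deadline


def new_request(payload: str, ttl: float, is_idempotent: bool, now: float) -> Request:
    """Turn a relative TTL into an absolute deadline at creation time."""
    return Request(payload=payload, deadline=now + ttl, is_idempotent=is_idempotent)


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 30.0, rng=random) -> float:
    """Exponential randomized backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))
```

Randomizing the delay keeps many clients from retrying in lockstep against a component that has just restarted.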
communication sidebar: exactly-once delivery
when you send exactly 1 message to exactly 1 server
hint: it's impossible
only 2 alternatives: at-least-once delivery, at-most-once delivery
IF message IS idempotent: upon failure, retry_after until TTL (at-least-once delivery)
IF message IS NOT idempotent: upon failure, rollback (at-most-once delivery)
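The two delivery modes can be sketched as a single client-side function. Everything here is hypothetical: the `Unavailable` exception stands in for whatever a restarting server answers with, and `now` and `sleep` are injected so the behavior can be checked without waiting on a real clock.

```python
class Unavailable(Exception):
    """Hypothetical failure reply from a crashed or restarting server."""

    def __init__(self, retry_after: float):
        super().__init__(f"retry after {retry_after}s")
        self.retry_after = retry_after


def deliver(send, is_idempotent: bool, deadline: float, now, sleep):
    """Idempotent: at-least-once (retry until the deadline). Otherwise: at-most-once."""
    while True:
        try:
            return send()
        except Unavailable as err:
            if not is_idempotent:
                raise  # at-most-once: never resend, the caller rolls back
            if now() + err.retry_after >= deadline:
                raise  # TTL exhausted: give up instead of retrying forever
            sleep(err.retry_after)  # give the crash-only server time to restart
```

With an idempotent request a duplicate is harmless, so retrying is safe; with a non-idempotent one the sketch refuses to risk a duplicate and surfaces the failure after a single attempt.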
delivery tradeoff: I can see… tradeoffs… everywhere…
crash-only communication
requests have a deadline
requests have an is_idempotent flag
servers try to process or crash
clients retry until deadline
state
volatile state
non-volatile state
crash-only datastores
RDBMS: postgres
KV: BerkeleyDB
append-log based datastores
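To illustrate why an append-log design suits crash-only operation, here is a toy key/value store, written as a sketch and not modeled on any of the products named above. The `AppendLog` class and its JSON-lines record format are assumptions. Opening the log always replays it and discards a torn final record, so there is no difference between a clean start and recovery after a kill.

```python
import json
import os


class AppendLog:
    """Toy append-only key/value log: replaying the file is the normal way to open it."""

    def __init__(self, path: str):
        self.path = path
        self.data = {}
        good = 0  # byte offset just past the last complete record
        if os.path.exists(path):
            with open(path, "rb") as f:
                for line in f:
                    if not line.endswith(b"\n"):
                        break  # torn tail left by a crash in the middle of an append
                    try:
                        key, value = json.loads(line)
                    except (ValueError, TypeError):
                        break  # unreadable record: ignore it and everything after it
                    self.data[key] = value
                    good += len(line)
            with open(path, "r+b") as f:
                f.truncate(good)  # drop the torn tail so new appends start clean
        self.file = open(path, "ab")

    def put(self, key: str, value) -> None:
        record = json.dumps([key, value]).encode() + b"\n"
        self.file.write(record)
        self.file.flush()
        os.fsync(self.file.fileno())  # acknowledge only once the record is on disk
        self.data[key] = value

    def get(self, key: str, default=None):
        return self.data.get(key, default)
```

A write that was never acknowledged may be lost after a crash, but every acknowledged write survives and the store never needs a separate repair step.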
real-story
Disk Image Converter at DigitalOcean
requirements
Need to convert millions of images when we change format (say qcow to raw).
Want to migrate ASAP to avoid running legacy code.
Don't want to watch the thing.
design
Acceptable: at-least-once delivery
If an image is converted >1 time, we don't really care. Waste some time, better than losing a customer image.
non-volatile data: stored in crash-only DB
volatile data: stored in Redis (not crash-only)
fancy schema
all jobs in a queue
job leases stored in hash with TTL
workers look for jobs without leases
Step 1: make a lease
Step 2: refresh lease while working
Step 3: delete job+lease once done
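The three lease steps can be sketched with an in-memory stand-in for the queue and the lease hash. This is not the Redis schema from the talk, which the slides do not show; the `LeasedQueue` class, its method names and the explicit `now` argument are assumptions chosen so that lease expiry can be demonstrated without a real clock.

```python
class LeasedQueue:
    """In-memory stand-in for a job queue plus a lease table with expiry times."""

    def __init__(self, jobs, lease_ttl: float):
        self.jobs = list(jobs)  # all jobs in a queue
        self.leases = {}        # job -> (worker, expiry time)
        self.lease_ttl = lease_ttl

    def _has_live_lease(self, job, now: float) -> bool:
        lease = self.leases.get(job)
        return lease is not None and lease[1] > now

    def acquire(self, worker: str, now: float):
        """Step 1: lease the first job that has no live lease, or return None."""
        for job in self.jobs:
            if not self._has_live_lease(job, now):
                self.leases[job] = (worker, now + self.lease_ttl)
                return job
        return None

    def refresh(self, worker: str, job, now: float) -> bool:
        """Step 2: extend the lease while working; False means the lease was lost."""
        lease = self.leases.get(job)
        if lease is None or lease[0] != worker or lease[1] <= now:
            return False
        self.leases[job] = (worker, now + self.lease_ttl)
        return True

    def complete(self, job) -> None:
        """Step 3: delete the job and its lease once done."""
        if job in self.jobs:
            self.jobs.remove(job)
        self.leases.pop(job, None)
```

A worker that dies simply stops calling `refresh`; once its lease expires, the next `acquire` hands the job to someone else, with no failure detector involved.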
if worker dies? its leases expire, jobs will get picked up again
if worker fails to delete job once done? job will be performed again, that's ok
if redis crashes? can reconstruct the queue from primary data source
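Reconstructing the queue from the primary data source can be as small as a query over durable records. The sketch below assumes a hypothetical record shape (a dict with `id` and `format` keys) since the slides do not describe the real one.

```python
def rebuild_queue(images, target_format="raw"):
    """Recompute pending jobs from durable records; safe to run after any crash."""
    # Hypothetical record shape: {"id": ..., "format": ...}
    return sorted(img["id"] for img in images if img["format"] != target_format)
```

Because at-least-once delivery is acceptable here, re-enqueuing an image whose conversion was already in flight costs some wasted work but loses nothing.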
outcome
when issue arose
get paged by operator
tell operator to let it go
… the process manager restarts components
… if issues, system converges toward progress
never had to fix anything (touches wood)
still in use
and we're happy
key points
non-volatile state stored in crash-only DB
volatile state stored in crash-unsafe DB (more like a message bus)
resources are leased
resources are self-describing
at-least-once vs. at-most-once tradeoff deliberately thought out
crash-only
is it a panacea?
no
other important considerations remain
caveats
Things That Are Still Important:
Circuit Breakers
Fallbacks
Error Recovery
Degraded Modes
Cattle vs Pets
Error Tracking/Reporting
Debugging Crashed Components
Goes hand in hand with those strategies
It's not a Free Lunch (it's a good step toward one)
conclusion
Crash-Only software is great!
Don't need to wake up!
Stuff just works.
Failures are fun to look at.
It feels good.
Easy to fix when it goes wrong.
Reduces complexity in code.
Don't need to wake up!
References
Recursive Restartability, Candea & Fox, 2001
Crash-Only software, Candea & Fox, 2003
Crash-Only software, More than meets the eye, LWN.net, https://lwn.net/Articles/191059/
A Crash Course In Failure, NPlus1.org, http://web.archive.org/web/20090430014122/http://nplus1.org/articles/a-crash-course-in-failure/
The End
comments, ideas: tweet me @AntoineGrondin