Design for Retry (Nodevember)
Aria Stewart
November 15, 2014
Transcript
Hi! I'm Aria Stewart, that's @aredridel on just about every service out there. Right now I'm an engineer at PayPal, working on the open source Kraken.js framework.
I'm going to talk about errors. It's going to be okay.
We all know HTTP
if (err) { alert(err.message); } else { doMyThing(); }
• 2xx: OK
• 3xx: Go elsewhere
• 4xx: Tell the user what they did wrong
• 5xx: Bail out and log an error
I'd call this error avoidance.
You can't avoid errors
Here's the secret: handle errors instead.
• 4xx: Tell the user what they did wrong
• 5xx: Save that request and do something with it later.
Retry it. 5xx are errors the requester can handle.
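As a rough illustration of that split, here is a minimal sketch that maps a status code to what the requester does next. The names `actionForStatus`, `handleResponse` and `retryLater` are invented for this example, and the in-memory array stands in for whatever durable store a real client would use.

```javascript
// Hypothetical helper: decide what the requester should do with a status code.
function actionForStatus(status) {
  if (status >= 200 && status < 300) return "done";
  if (status >= 300 && status < 400) return "follow";    // go elsewhere
  if (status >= 400 && status < 500) return "tell-user"; // the request itself is wrong
  if (status >= 500 && status < 600) return "retry";     // save it, try again later
  return "unknown";
}

// Hypothetical in-memory list of requests to resubmit later.
// A real client would persist this somewhere that survives a restart.
const retryLater = [];

// Keep the request around when the server failed, so it can be retried.
function handleResponse(request, status) {
  const action = actionForStatus(status);
  if (action === "retry") retryLater.push(request);
  return action;
}
```

The point of the sketch is only that a 5xx keeps the request instead of discarding it; when and how often to resubmit is a separate decision.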
But you can't just do things twice? We must make operations idempotent.
Idempotency: repeated actions have no further effect and give the same result.
This means being smart about IDs. Don't recycle! Check if things are already done. They are? Just give the same answer again.
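One way this could look in code, as a sketch rather than a recipe: the caller picks a fresh ID per operation, and the service remembers the answer it gave for each ID. `createAccount`, `deposit` and `newRequestId` are hypothetical names, and the `Map` stands in for durable storage.

```javascript
const { randomUUID } = require("crypto");

// Fresh, never-recycled ID for each logical operation (assumed to be chosen by the caller).
function newRequestId() {
  return randomUUID();
}

// Hypothetical account service whose deposit can safely be repeated.
function createAccount() {
  let balance = 0;
  const completed = new Map(); // requestId -> the answer already given

  return {
    deposit(requestId, amount) {
      // Already done? Just give the same answer again.
      if (completed.has(requestId)) return completed.get(requestId);
      balance += amount;
      const result = { requestId, balance };
      completed.set(requestId, result);
      return result;
    },
    balance: () => balance,
  };
}
```

Retrying `deposit` with the same ID returns the stored result and leaves the balance alone; only a new ID moves money again.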
Causes!
• database down
• bug in a service
• Deploy in progress
• power failure
• kicked a cable
• Network congestion
• Capacity exceeded
• Microbursts
• Tree fell on the data center
• earthquake
• tornado
• birds, snakes and aeroplanes
• Black Friday
• Slashdot effect
• Interns
• QA tests
• DoS attack
You need a queue
Lots of ways to do it:
• Database on the nodes
• Log file
• Queue server
gearman: queues built in. There are many alternatives, but gearmand is very simple. The memcache of job queues.
Three statuses:
• OK (Like 200)
• FAIL (Like 400)
• ERROR (Like 500)
Design so ERROR can be retried.
gearmand automatically tries an ERROR job again. And again. And again.
If it isn't sure it worked? Tries it again.
You cannot know if an error is a failure.
Error handling gets simpler:
• Exception? ERROR.
• Database down? ERROR.
• Downstream service timeout? ERROR.
Maybe you retry right away.
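To make the three statuses concrete, here is a small sketch of a worker wrapper. It is not gearman's API: `InvalidJobError`, `runJob` and `runWithRetry` are names made up for this example, and the retry loop is an assumed stand-in for what a queue server would do on the worker's behalf.

```javascript
// Hypothetical error type for requests that are simply wrong and will never succeed.
class InvalidJobError extends Error {}

// Run a job handler and collapse every outcome into OK, FAIL or ERROR.
function runJob(handler, payload) {
  try {
    return { status: "OK", result: handler(payload) };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    if (err instanceof InvalidJobError) {
      return { status: "FAIL", reason }; // like 400: retrying will not help
    }
    return { status: "ERROR", reason };  // like 500: anything else can be retried
  }
}

// Retry right away while the status is ERROR; OK and FAIL are final.
function runWithRetry(handler, payload, maxAttempts = 3) {
  let outcome;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    outcome = runJob(handler, payload);
    if (outcome.status !== "ERROR") break;
  }
  return outcome;
}
```

The handler never has to decide which failures are recoverable; it only has to flag the ones it knows are the caller's fault.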
How many of you have used a job queue?
You have used a job queue
Let me tell you about one:
• TRILLIONS of messages
• MILLIONS of nodes
• 100% availability (at least partial) for years. 32 years.
• Resilient to MILLIONS of bad actors. It is attached to the most malicious network.
EMAIL.
• 250: OK
• 4xx: RETRY
• 5xx: Fail
Responsibility for messages:
• 250: accept responsibility
• 4xx: reject responsibility
• 5xx: return responsibility
Reject responsibility. If there's an error? Fail fast. The requester can retry.
Fail fast. Queue work you can't reject. Reject everything you can if there is an error.
You need a smart client. Keeps outstanding requests. Resubmit. Try a different server! Try a second queue service. Maybe have a fallback plan.
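A smart client along these lines might be sketched as follows. Everything here is an assumption made for illustration: the `SmartClient` class, the `send(server, request)` transport, and the choice to keep it synchronous so the control flow is easy to read (a real transport would be asynchronous).

```javascript
// Sketch of a client that owns its requests until some server answers them.
// `send(server, request)` is a hypothetical transport returning "OK", "FAIL" or "ERROR".
class SmartClient {
  constructor(servers, send) {
    this.servers = servers;
    this.send = send;
    this.outstanding = new Map(); // request id -> request not yet answered
  }

  // Remember the request first, then try to deliver it.
  submit(request) {
    this.outstanding.set(request.id, request);
    return this.flush();
  }

  // Resubmit every outstanding request, trying each server in turn.
  flush() {
    for (const [id, request] of this.outstanding) {
      for (const server of this.servers) {
        const status = this.send(server, request);
        if (status === "OK" || status === "FAIL") {
          this.outstanding.delete(id); // answered; FAIL means retrying won't help
          break;
        }
        // ERROR: leave it outstanding and try a different server
      }
    }
    return this.outstanding.size; // requests still waiting for a later flush
  }
}
```

Calling `flush()` again later, for example on a timer or when connectivity returns, is what turns the kept requests into retries.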
Smart clients on the device. Toto, we're not in AWS anymore.
Ever lose an email because you've been logged out?
Latency + mutable state = distributed system. The CAP theorem applies!
C = Consistency. If there's state that one part knows of that another doesn't? That's inconsistency.
Job queues are controlled inconsistency.
Ever try to write email on the web while not on the Internet? It's cloud easy!
This is really good for offline-first design! Being offline is the ultimate retriable error.
Some ideas
Queue things in localStorage
Use third-party storage
Integrate third-party services with this approach.
Use different strategies for available resources vs. contended ones.
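The idea of queueing things in localStorage could be sketched like this. The functions take any object with `getItem` and `setItem`, so in a browser you would pass `window.localStorage`; `memoryStorage` is a stand-in so the sketch also runs outside a browser. The `"outbox"` key, the function names and the `send` callback are all hypothetical.

```javascript
// Minimal stand-in for the browser's localStorage (getItem / setItem only).
function memoryStorage() {
  const data = new Map();
  return {
    getItem: (key) => (data.has(key) ? data.get(key) : null),
    setItem: (key, value) => { data.set(key, String(value)); },
  };
}

const OUTBOX_KEY = "outbox"; // hypothetical storage key

// Read the queued requests; an empty or missing key means an empty queue.
function readOutbox(storage) {
  return JSON.parse(storage.getItem(OUTBOX_KEY) || "[]");
}

// Persist the request before any attempt to send it.
function enqueue(storage, request) {
  const queue = readOutbox(storage);
  queue.push(request);
  storage.setItem(OUTBOX_KEY, JSON.stringify(queue));
}

// Try to send everything and keep whatever could not be delivered.
// `send(request)` is hypothetical and returns true once the server accepted it.
function drain(storage, send) {
  const remaining = readOutbox(storage).filter((request) => !send(request));
  storage.setItem(OUTBOX_KEY, JSON.stringify(remaining));
  return remaining.length;
}
```

While offline, `drain` leaves the queue untouched; once the network is back, the same call empties it. Pairing each queued request with a fresh ID keeps a resend from doing the work twice.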
Thank you! I hope you have lots of ideas queued up. Save your ideas and unspool them onto Twitter when you get home. Let me know if this changed how you think about designing applications!