Design for Retry (Oneshot Budapest)
Aria Stewart
November 21, 2014
Programming
Transcript
Design for Retry: Microservices, REST, and why Idempotency is the only way to scale
I'm Aria Stewart, that's @aredridel just about everywhere. I'm here thanks to PayPal. I work on the open source Kraken.js framework.
I'm going to talk about errors. It's going to be okay.
if (err) { alert(err.message); } else { doMyThing(); }
We all know HTTP
- 2xx: OK
- 3xx: Go elsewhere
- 4xx: Tell the user what they did wrong
- 5xx: Bail out and log an error
I'd call this error avoidance.
You can't avoid errors
Here's the secret: handle errors instead.
- 4xx: Tell the user what they did wrong
- 5xx: Save that request and do something with it later
Retry it. 5xx are errors the requester can handle.
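As a rough sketch of that split, assuming a hypothetical `handleResponse` helper and a plain array standing in for a durable queue (neither is from the talk), the 5xx branch saves the request instead of giving up:

```javascript
// Hypothetical helper: decide what to do with a response, by status class.
// `pending` is an in-memory stand-in for a durable queue.
function handleResponse(request, status, pending) {
  if (status >= 500) {
    // Server-side trouble: keep the request so it can be retried later.
    pending.push(request);
    return "queued";
  }
  if (status >= 400) {
    // The request itself is wrong: retrying it unchanged will not help.
    return "tell-user";
  }
  if (status >= 300) {
    return "go-elsewhere";
  }
  return "done";
}

// Retry everything that was saved. `send` is an assumed function that
// takes a request and returns an HTTP status code.
function retryPending(pending, send) {
  const batch = pending.splice(0, pending.length);
  for (const request of batch) {
    // Requests that fail with 5xx again land back in `pending`.
    handleResponse(request, send(request), pending);
  }
}
```

Retrying like this is only safe when the operation behind the request can be repeated without harm.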
But you can't just do things twice? We must make operations idempotent.
Idempotency: repeated actions have no further effect and give the same result.
This means being smart about IDs. Don't recycle! Check if things are already done. They are? Just give the same answer again.
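A minimal sketch of that idea, assuming a hypothetical charge service where the caller picks a request ID and never reuses it. The `Map` stands in for whatever durable store a real service would use:

```javascript
// Hypothetical payment service. `results` stands in for a durable store
// keyed by a client-chosen request ID that is never recycled.
function createChargeService() {
  const results = new Map();
  let chargesMade = 0;

  function charge(requestId, amount) {
    // Already done? Give the same answer again, with no new side effect.
    if (results.has(requestId)) {
      return results.get(requestId);
    }
    chargesMade += 1; // the side effect happens at most once per ID
    const result = { requestId, amount, receipt: `receipt-${chargesMade}` };
    results.set(requestId, result);
    return result;
  }

  return { charge, count: () => chargesMade };
}
```

Calling `charge("req-1", 100)` twice performs one charge and returns the same receipt both times, so a retry after a lost response is harmless.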
Causes!
- Database down
- Bug in a service
- Deploy in progress
- Power failure
- Kicked a cable
- Network congestion
- Capacity exceeded
- Microbursts
- Tree fell on the data center
- Earthquake
- Tornado
- Birds, snakes and aeroplanes
- Black Friday
- Slashdot effect
- Interns
- QA tests
- DoS attack
You need a queue
Lots of ways to do it:
- Database on each node. Maybe LevelDB?
- Log file
- Queue server
gearman: queues built in. There are many alternatives, but gearmand is very simple. The memcache of job queues.
Three statuses:
- OK (like 200)
- FAIL (like 400)
- ERROR (like 500)
Design so ERROR can be retried.
gearmand automatically tries a job ERROR again. And again. And again.
If it isn't sure it worked? Tries it again.
You cannot know if an error is a failure.
Error handling gets simpler:
- Exception? ERROR.
- Database down? ERROR.
- Downstream service timeout? ERROR.
Maybe you retry right away.
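One way this could look in a worker, as a sketch only: `InvalidJobError`, `runJob`, and `runWithRetry` are hypothetical names, not gearman's API. Anything that is not a known bad request collapses into ERROR, and only ERROR is retried:

```javascript
// Hypothetical error type for "the request itself is bad" (like a 4xx).
class InvalidJobError extends Error {}

// Run a job handler and map the outcome to one of the three statuses.
function runJob(handler, payload) {
  try {
    return { status: "OK", result: handler(payload) };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    if (err instanceof InvalidJobError) {
      return { status: "FAIL", reason }; // retrying will not help
    }
    // Exception, database down, timeout: all the same retriable status.
    return { status: "ERROR", reason };
  }
}

// Retry right away, but only on ERROR, up to `maxAttempts` times.
function runWithRetry(handler, payload, maxAttempts = 3) {
  let outcome;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    outcome = runJob(handler, payload);
    if (outcome.status !== "ERROR") break;
  }
  return outcome;
}
```

The handler needs no error-specific branches of its own: it throws, and the wrapper decides whether the job goes around again.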
How many of you have used a job queue?
You have used a job queue
Let me tell you about one:
- TRILLIONS of messages
- MILLIONS of nodes
- 100% availability (at least partial) for years. 32 years.
- Resilient to MILLIONS of bad actors. It is attached to the most malicious network.
EMAIL.
- 250: OK
- 4xx: Retry
- 5xx: Fail
Responsibility for messages:
- 250: accept responsibility
- 4xx: reject responsibility
- 5xx: return responsibility
Reject responsibility. If there's an error? Fail fast. The requester can retry.
Fail fast. Queue work you can't reject. Reject everything you can if there is an error.
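One possible reading of this rule, sketched with assumed names (`createIntake`, `healthy`, and `maxDepth` are all hypothetical): a node refuses new work immediately when it is unhealthy or full, and queues only what it has agreed to own.

```javascript
// Hypothetical intake for one node. `healthy()` is an assumed check that
// downstream dependencies respond; `maxDepth` caps how much work this node
// takes responsibility for.
function createIntake({ healthy, maxDepth }) {
  const queue = [];

  function accept(job) {
    if (!healthy() || queue.length >= maxDepth) {
      // Fail fast: the requester still owns the job and can retry.
      return { accepted: false, status: 503 };
    }
    queue.push(job); // from here on, this node is responsible for the job
    return { accepted: true, status: 202 };
  }

  return { accept, depth: () => queue.length };
}
```

The status codes mirror the email model: 202 accepts responsibility, 503 hands it straight back.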
You need a smart client. Keeps outstanding requests. Resubmit. Try a different server! Try a second queue service. Maybe have a fallback plan.
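A small sketch of such a client, under stated assumptions: `createSmartClient` is a hypothetical name, `servers` is any list of endpoints, and `send(server, request)` is an assumed function that returns true once a server accepts the request and returns false or throws otherwise.

```javascript
// Hypothetical smart client: remembers requests until some server
// accepts them, and walks the server list on every attempt.
function createSmartClient(servers, send) {
  const outstanding = new Map(); // request ID -> request not yet accepted

  function trySend(request) {
    for (const server of servers) {
      try {
        if (send(server, request)) {
          outstanding.delete(request.id);
          return true;
        }
      } catch (err) {
        // Treat any exception as retriable and try the next server.
      }
    }
    return false;
  }

  return {
    submit(request) {
      outstanding.set(request.id, request); // keep it until it is accepted
      return trySend(request);
    },
    // Resubmit everything still outstanding; returns how many remain.
    resubmitAll() {
      for (const request of [...outstanding.values()]) trySend(request);
      return outstanding.size;
    },
  };
}
```

Because a request may reach a server more than once (for example when the reply is lost), this only works against idempotent operations.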
Smart clients on the device. Toto, we're not in AWS anymore.
Ever lose an email because you've been logged out?
Latency + mutable state = distributed system. The CAP theorem applies!
C = Consistency. If there's state that one part knows of that another doesn't? That's inconsistency.
Job queues are controlled inconsistency.
Ever try to write email on the web while not on the Internet? It's cloud easy!
This is really good for offline-first design! Being offline is the ultimate retriable error.
Some ideas
Use your queue as a place to measure for system sizing
Queue things in localStorage
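A minimal sketch of what that might look like. `createOfflineQueue` and the `"pending-requests"` key are hypothetical; `storage` is anything with the Web Storage `getItem`/`setItem` methods, and `send` is an assumed function that returns true when a request was delivered.

```javascript
// Hypothetical offline queue persisted as JSON under a single storage key.
function createOfflineQueue(storage, key = "pending-requests") {
  const load = () => JSON.parse(storage.getItem(key) || "[]");
  const save = (items) => storage.setItem(key, JSON.stringify(items));

  return {
    enqueue(request) {
      save([...load(), request]);
    },
    // Try each queued request; keep the ones that could not be delivered.
    flush(send) {
      const remaining = load().filter((request) => {
        try {
          return !send(request);
        } catch {
          return true; // offline or failing: retry on the next flush
        }
      });
      save(remaining);
      return remaining.length;
    },
  };
}
```

In a browser this could be created with `createOfflineQueue(window.localStorage)` and flushed whenever the page comes back online. Since the queue lives in storage, it survives a reload.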
Use third-party storage
Integrate third-party services with this approach.
Use different strategies for available resources vs. contended ones
Thank you! I hope you have lots of ideas queued up. Save your ideas and unspool them onto Twitter when you get home. Let me know if this changed how you think about designing applications!