Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
It's 10pm: Do You Know Where Your Writes Are?
Search
Jeremy Mikola
October 12, 2017
Programming
0
240
It's 10pm: Do You Know Where Your Writes Are?
Presented October 12, 2017 at MongoDB.local San Francisco.
Jeremy Mikola
October 12, 2017
Tweet
Share
More Decks by Jeremy Mikola
See All by Jeremy Mikola
PHP Internals for the Inquisitive Developer
jmikola
1
730
Bulletproof MongoDB
jmikola
0
510
Zero to Sixty with MongoDB
jmikola
3
1.1k
DOs and DON’Ts of MongoDB
jmikola
13
3.2k
Five Years of Beta
jmikola
0
170
Rethinking Extension Development for PHP and HHVM
jmikola
2
930
What's New in MongoDB 3.2
jmikola
0
140
Async PHP with React
jmikola
28
11k
NoSQL Lightning Talks (MongoDB, Cassandra, MySQL)
jmikola
1
290
Other Decks in Programming
See All in Programming
Takumiから考えるSecurity_Maturity_Model.pdf
gessy0129
1
110
The Past, Present, and Future of Enterprise Java
ivargrimstad
0
360
Claude Codeと2つの巻き戻し戦略 / Two Rewind Strategies with Claude Code
fruitriin
0
200
RubyとGoでゼロから作る証券システム: 高信頼性が求められるシステムのコードの外側にある設計と運用のリアル
free_world21
0
180
今更考える「単一責任原則」 / Thinking about the Single Responsibility Principle
tooppoo
3
1.3k
浮動小数の比較について
kishikawakatsumi
0
370
Geminiの機能を調べ尽くしてみた
naruyoshimi
0
190
2026年は Rust 置き換えが流行る! / 20260220-niigata-5min-tech
girigiribauer
0
220
Go1.26 go fixをプロダクトに適用して困ったこと
kurakura0916
0
320
CDIの誤解しがちな仕様とその対処TIPS
futokiyo
0
160
PJのドキュメントを全部Git管理にしたら、一番喜んだのはAIだった
nanaism
0
230
AHC061解説
shun_pi
0
310
Featured
See All Featured
State of Search Keynote: SEO is Dead Long Live SEO
ryanjones
0
150
The agentic SEO stack - context over prompts
schlessera
0
680
Building Flexible Design Systems
yeseniaperezcruz
330
40k
Being A Developer After 40
akosma
91
590k
10 Git Anti Patterns You Should be Aware of
lemiorhan
PRO
659
61k
svc-hook: hooking system calls on ARM64 by binary rewriting
retrage
1
140
Keith and Marios Guide to Fast Websites
keithpitt
413
23k
Lessons Learnt from Crawling 1000+ Websites
charlesmeaden
PRO
1
1.1k
Cheating the UX When There Is Nothing More to Optimize - PixelPioneers
stephaniewalter
287
14k
Site-Speed That Sticks
csswizardry
13
1.1k
StorybookのUI Testing Handbookを読んだ
zakiyama
31
6.6k
Refactoring Trust on Your Teams (GOTO; Chicago 2020)
rmw
35
3.4k
Transcript
It’s 10PM: Do you know where your writes are? Jeremy
Mikola jmikola
It’s 11AM: Do you know where your writes are? Jeremy
Mikola jmikola
On the roadmap Retryable writes Zombie cursor cleanup Cluster-wide killOp
Retryable Writes
Retryable Writes
Retryable Writes
You’re updating a document
You’re updating a document db.coll.updateOne( { _id: 16 }, {
$inc: { count: 1 }} );
Murphy’s Law kicks in
Is it safe to retry the update?
Did our message never make it to the server?
Did we lose the server’s reply?
Did something else happen?
Did something else happen?
Did something else happen?
Did something else happen?
Did something else happen?
There’s no way to retrieve an operation’s state
Let’s review some best practices How To Write Resilient MongoDB
Applications
This was our update… db.coll.updateOne( { _id: 16 }, {
$inc: { count: 1 }} );
Errors and retry strategies Transient network error Persistent outage Command
error Never retry Always retry Retry once
Errors and retry strategies Transient network error Persistent outage Command
error Never retry May undercount Always retry Retry once
Errors and retry strategies Transient network error Persistent outage Command
error Never retry May undercount OK Always retry Retry once
Errors and retry strategies Transient network error Persistent outage Command
error Never retry May undercount OK OK Always retry Retry once
Errors and retry strategies Transient network error Persistent outage Command
error Never retry May undercount OK OK Always retry May overcount Retry once
Errors and retry strategies Transient network error Persistent outage Command
error Never retry May undercount OK OK Always retry May overcount Wastes time Retry once
Errors and retry strategies Transient network error Persistent outage Command
error Never retry May undercount OK OK Always retry May overcount Wastes time Wastes time Retry once
Errors and retry strategies Transient network error Persistent outage Command
error Never retry May undercount OK OK Always retry May overcount Wastes time Wastes time Retry once May overcount
Errors and retry strategies Transient network error Persistent outage Command
error Never retry May undercount OK OK Always retry May overcount Wastes time Wastes time Retry once May overcount OK
Errors and retry strategies Transient network error Persistent outage Command
error Never retry May undercount OK OK Always retry May overcount Wastes time Wastes time Retry once May overcount OK OK
There’s no good solution for transient network errors
We can safely retry idempotent operations Transient network error Persistent
outage Command error Retry once
We can safely retry idempotent operations Transient network error Persistent
outage Command error Retry once OK
We can safely retry idempotent operations Transient network error Persistent
outage Command error Retry once OK OK
We can safely retry idempotent operations Transient network error Persistent
outage Command error Retry once OK OK OK
Safe-to-retry inserts db.coll.insertOne( { _id: 18, name: "Alice" } );
Safe-to-retry deletes db.coll.deleteOne( { _id: 20 } ); db.coll.deleteMany( {
status: "inactive" } );
Safe-to-retry updates db.coll.updateOne( { _id: 22 }, { $set: {
status: "active" }} );
Why can’t we retrieve an operation’s state?
In MongoDB 3.4, state is tied to connection objects
MongoDB 3.6 introduces logical sessions
MongoDB 3.6 introduces logical sessions Sessions allow us to maintain
cluster-wide state about the user and their operations.
MongoDB 3.6 introduces logical sessions Sessions allow us to maintain
cluster-wide state about the user and their operations. Sessions are not tied to connections.
Retrying writes with a session
Retrying writes with a session
Retrying writes with a session
Retrying writes with a session update
Retrying writes with a session
Retrying writes with a session update
Retrying writes with a session
We can trust the server to Do the Right Thing™
We can trust the server to Do the Right Thing™
If the write already executed, return the result we missed.
We can trust the server to Do the Right Thing™
If the write already executed, return the result we missed. If the write never executed, do it now and return its result.
Sessions are cluster-wide
Sessions are cluster-wide update
Sessions are cluster-wide update
Sessions are cluster-wide
Sessions are cluster-wide
Sessions are cluster-wide update
Taking advantage of retryable writes ?retryWrites=true mongodb://…
One down, two to go Retryable writes Zombie cursor cleanup
Cluster-wide killOp
Zombie Cursor Cleanup
Zombie Cursor Cleanup
You’re running a long query
You’re running a long query cursor = db.coll.find(); cursor.forEach(function() {
// lengthy processing… });
You’re running a long query cursor = db.coll.find(); cursor.forEach(function() {
// lengthy processing… });
Cursors have a timeout
Cursors have a timeout A er 10 minutes, the server
will close a cursor due to inactivity.
Cursors have a timeout A er 10 minutes, the server
will close a cursor due to inactivity. Issuing a getMore resets the clock.
Disabling cursor timeouts cursor = db.coll.find( { }, { noCursorTimeout:
true } ); cursor.forEach(function() { // lengthy processing…
Disabling cursor timeouts cursor = db.coll.find( { }, { noCursorTimeout:
true } ); cursor.forEach(function() { // lengthy processing…
Executing our long query
Executing our long query find
Executing our long query
Executing our long query getMore
Executing our long query
Executing our long query getMore
Executing our long query
Executing our long query getMore
Executing our long query getMore
Executing our long query getMore
Executing our long query getMore
Executing our long query getMore
Executing our long query getMore
Executing our long query getMore
None
None
None
None
A zombie cursor is born > db.serverStatus() { "metrics": {
"cursor": { "open": { "noTimeout": 1, "total": 1
What happened last night?
What happened last night? (from the server’s POV)
What happened last night? (from the server’s POV)
What happened last night? (from the server’s POV) find
What happened last night? (from the server’s POV)
What happened last night? (from the server’s POV) getMore
What happened last night? (from the server’s POV)
What happened last night? (from the server’s POV) getMore
What happened last night? (from the server’s POV)
What happened last night? (from the server’s POV)
What happened last night? (from the server’s POV)
Avoiding zombie cursors with logical sessions
Avoiding zombie cursors with logical sessions Sessions also have a
timeout.
Avoiding zombie cursors with logical sessions Sessions also have a
timeout. We can associate queries with a session
Querying with a session session = client.startSession(); cursor = db.coll.find(
{ }, { session: session } );
Executing our long query
Executing our long query
Executing our long query
Executing our long query find
Executing our long query
Executing our long query getMore
Executing our long query
Executing our long query getMore
Executing our long query
Executing our long query
Executing our long query session expires
Executing our long query
Did we just punt on the timeout issue?
Session timeouts are non-negotiable
Session timeouts are non-negotiable Idle sessions will expire.
Session timeouts are non-negotiable Idle sessions will expire. Any operation
using the session resets the clock.
Two down, one to go Retryable writes Zombie cursor cleanup
Cluster-wide killOp
Cluster-wide killOp
Cluster-wide killOp
You’re running an operation that may never complete
You’re running an operation that may never complete cursor =
db.coll.find( { … } // table scans for days );
You’ve made a terrible mistake
Step 1: Find the operation ID > db.currentOp() { "inprog"
: [ { "desc" : "conn2", "threadId" : "140181791471360", "connectionId" : 2, "client" : "127.0.0.1:49456", "appName" : "MongoDB Shell", "active" : true, "opid" : 132921,
Step 2: Kill the operation ID > db.killOp(132921) { "info":
"attempting to kill op", "ok": 1 }
Lather, rinse, repeat
Lather, rinse, repeat > connect("mongodb://shard-2.example.com")
Lather, rinse, repeat > connect("mongodb://shard-2.example.com") > db.currentOp() { "inprog" :
[ // … ] }
Lather, rinse, repeat > connect("mongodb://shard-2.example.com") > db.currentOp() { "inprog" :
[ // … ] } > db.killOp(…)
Lather, rinse, repeat > connect("mongodb://shard-2.example.com") > db.currentOp() { "inprog" :
[ // … ] } > db.killOp(…)
How did this happen? mongos shard 1 shard 2 shard
3
How did this happen? mongos shard 1 shard 2 shard
3
How did this happen? mongos shard 1 shard 2 shard
3
Cluster-wide killOp with logical sessions
Cluster-wide killOp with logical sessions Any operation may be associated
with a session.
Cluster-wide killOp with logical sessions Any operation may be associated
with a session. Terminating a session will end all of its associated operations.
Terminating a session session = client.startSession(); cursor = db.coll.find( {
… }, // table scans for days { session: session } );
Querying with sessions mongos shard 1 shard 2 shard 3
Querying with sessions mongos shard 1 shard 2 shard 3
Querying with sessions mongos shard 1 shard 2 shard 3
Querying with sessions mongos shard 1 shard 2 shard 3
Querying with sessions mongos shard 1 shard 2 shard 3
Querying with sessions mongos shard 1 shard 2 shard 3
Querying with sessions mongos shard 1 shard 2 shard 3
Querying with sessions mongos shard 1 shard 2 shard 3
Querying with sessions mongos shard 1 shard 2 shard 3
Querying with sessions mongos shard 1 shard 2 shard 3
That’s a wrap Retryable writes Zombie cursor cleanup Cluster-wide killOp
One last point
Resilence is primarily the driver’s domain
Resilence is primarily the driver’s domain Server discovery and monitoring
Resilence is primarily the driver’s domain Server discovery and monitoring
Elections and failover recovery
Resilence is primarily the driver’s domain Server discovery and monitoring
Elections and failover recovery Load-balancing mongos connections
Resilence is primarily the driver’s domain Server discovery and monitoring
Elections and failover recovery Load-balancing mongos connections Routing queries by read preference
Addressing resilence on the server-side
Addressing resilence on the server-side Tracking operation state
Addressing resilence on the server-side Tracking operation state Cluster-wide sessions
Providing a relatively easy upgrade path
Providing a relatively easy upgrade path No need to rewrite
applications
Providing a relatively easy upgrade path No need to rewrite
applications Opting in to retryable writes
Providing a relatively easy upgrade path No need to rewrite
applications Opting in to retryable writes New API for client session objects
Providing a relatively easy upgrade path No need to rewrite
applications Opting in to retryable writes New API for client session objects Pass session option as needed
Inside the spec process mongodb/specifications
Inside the spec process /sessions mongodb/specifications
Inside the spec process /sessions /retryable-writes mongodb/specifications
Inside the spec process /sessions /retryable-writes /causal-consistency mongodb/specifications
Inside the spec process /sessions /retryable-writes /causal-consistency /retryable-reads mongodb/specifications
In the meantime… How To Write Resilient MongoDB Applications
Thanks!