Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
It's 10pm: Do You Know Where Your Writes Are?
Search
Jeremy Mikola
October 12, 2017
Programming
0
230
It's 10pm: Do You Know Where Your Writes Are?
Presented October 12, 2017 at MongoDB.local San Francisco.
Jeremy Mikola
October 12, 2017
Tweet
Share
More Decks by Jeremy Mikola
See All by Jeremy Mikola
PHP Internals for the Inquisitive Developer
jmikola
1
710
Bulletproof MongoDB
jmikola
0
490
Zero to Sixty with MongoDB
jmikola
3
1.1k
DOs and DON’Ts of MongoDB
jmikola
13
3.2k
Five Years of Beta
jmikola
0
160
Rethinking Extension Development for PHP and HHVM
jmikola
2
890
What's New in MongoDB 3.2
jmikola
0
130
Async PHP with React
jmikola
28
11k
NoSQL Lightning Talks (MongoDB, Cassandra, MySQL)
jmikola
1
270
Other Decks in Programming
See All in Programming
Vibe codingでおすすめの言語と開発手法
uyuki234
0
150
AI Agent Tool のためのバックエンドアーキテクチャを考える #encraft
izumin5210
5
1.5k
TestingOsaka6_Ozono
o3
0
240
開発に寄りそう自動テストの実現
goyoki
2
1.6k
HTTPプロトコル正しく理解していますか? 〜かわいい猫と共に学ぼう。ฅ^•ω•^ฅ ニャ〜
hekuchan
2
580
SQL Server 2025 LT
odashinsuke
0
120
AIエージェントの設計で注意するべきポイント6選
har1101
6
2.8k
Jetpack XR SDKから紐解くAndroid XR開発と技術選定のヒント / about-androidxr-and-jetpack-xr-sdk
drumath2237
1
230
脳の「省エネモード」をデバッグする ~System 1(直感)と System 2(論理)の切り替え~
panda728
PRO
0
130
AtCoder Conference 2025「LLM時代のAHC」
imjk
2
620
ThorVG Viewer In VS Code
nors
0
500
AI時代を生き抜く 新卒エンジニアの生きる道
coconala_engineer
1
490
Featured
See All Featured
Design and Strategy: How to Deal with People Who Don’t "Get" Design
morganepeng
132
19k
The Organizational Zoo: Understanding Human Behavior Agility Through Metaphoric Constructive Conversations (based on the works of Arthur Shelley, Ph.D)
kimpetersen
PRO
0
210
The AI Revolution Will Not Be Monopolized: How open-source beats economies of scale, even for LLMs
inesmontani
PRO
3
2.8k
Mobile First: as difficult as doing things right
swwweet
225
10k
Building Better People: How to give real-time feedback that sticks.
wjessup
370
20k
Exploring the Power of Turbo Streams & Action Cable | RailsConf2023
kevinliebholz
37
6.2k
Pawsitive SEO: Lessons from My Dog (and Many Mistakes) on Thriving as a Consultant in the Age of AI
davidcarrasco
0
39
Groundhog Day: Seeking Process in Gaming for Health
codingconduct
0
70
Building a Modern Day E-commerce SEO Strategy
aleyda
45
8.4k
The Hidden Cost of Media on the Web [PixelPalooza 2025]
tammyeverts
2
130
Taking LLMs out of the black box: A practical guide to human-in-the-loop distillation
inesmontani
PRO
3
2k
Stewardship and Sustainability of Urban and Community Forests
pwiseman
0
81
Transcript
It’s 10PM: Do you know where your writes are? Jeremy
Mikola jmikola
It’s 11AM: Do you know where your writes are? Jeremy
Mikola jmikola
On the roadmap Retryable writes Zombie cursor cleanup Cluster-wide killOp
Retryable Writes
Retryable Writes
Retryable Writes
You’re updating a document
You’re updating a document db.coll.updateOne( { _id: 16 }, {
$inc: { count: 1 }} );
Murphy’s Law kicks in
Is it safe to retry the update?
Did our message never make it to the server?
Did we lose the server’s reply?
Did something else happen?
Did something else happen?
Did something else happen?
Did something else happen?
Did something else happen?
There’s no way to retrieve an operation’s state
Let’s review some best practices How To Write Resilient MongoDB
Applications
This was our update… db.coll.updateOne( { _id: 16 }, {
$inc: { count: 1 }} );
Errors and retry strategies Transient network error Persistent outage Command
error Never retry Always retry Retry once
Errors and retry strategies Transient network error Persistent outage Command
error Never retry May undercount Always retry Retry once
Errors and retry strategies Transient network error Persistent outage Command
error Never retry May undercount OK Always retry Retry once
Errors and retry strategies Transient network error Persistent outage Command
error Never retry May undercount OK OK Always retry Retry once
Errors and retry strategies Transient network error Persistent outage Command
error Never retry May undercount OK OK Always retry May overcount Retry once
Errors and retry strategies Transient network error Persistent outage Command
error Never retry May undercount OK OK Always retry May overcount Wastes time Retry once
Errors and retry strategies Transient network error Persistent outage Command
error Never retry May undercount OK OK Always retry May overcount Wastes time Wastes time Retry once
Errors and retry strategies Transient network error Persistent outage Command
error Never retry May undercount OK OK Always retry May overcount Wastes time Wastes time Retry once May overcount
Errors and retry strategies Transient network error Persistent outage Command
error Never retry May undercount OK OK Always retry May overcount Wastes time Wastes time Retry once May overcount OK
Errors and retry strategies Transient network error Persistent outage Command
error Never retry May undercount OK OK Always retry May overcount Wastes time Wastes time Retry once May overcount OK OK
There’s no good solution for transient network errors
We can safely retry idempotent operations Transient network error Persistent
outage Command error Retry once
We can safely retry idempotent operations Transient network error Persistent
outage Command error Retry once OK
We can safely retry idempotent operations Transient network error Persistent
outage Command error Retry once OK OK
We can safely retry idempotent operations Transient network error Persistent
outage Command error Retry once OK OK OK
Safe-to-retry inserts db.coll.insertOne( { _id: 18, name: "Alice" } );
Safe-to-retry deletes db.coll.deleteOne( { _id: 20 } ); db.coll.deleteMany( {
status: "inactive" } );
Safe-to-retry updates db.coll.updateOne( { _id: 22 }, { $set: {
status: "active" }} );
Why can’t we retrieve an operation’s state?
In MongoDB 3.4, state is tied to connection objects
MongoDB 3.6 introduces logical sessions
MongoDB 3.6 introduces logical sessions Sessions allow us to maintain
cluster-wide state about the user and their operations.
MongoDB 3.6 introduces logical sessions Sessions allow us to maintain
cluster-wide state about the user and their operations. Sessions are not tied to connections.
Retrying writes with a session
Retrying writes with a session
Retrying writes with a session
Retrying writes with a session update
Retrying writes with a session
Retrying writes with a session update
Retrying writes with a session
We can trust the server to Do the Right Thing™
We can trust the server to Do the Right Thing™
If the write already executed, return the result we missed.
We can trust the server to Do the Right Thing™
If the write already executed, return the result we missed. If the write never executed, do it now and return its result.
Sessions are cluster-wide
Sessions are cluster-wide update
Sessions are cluster-wide update
Sessions are cluster-wide
Sessions are cluster-wide
Sessions are cluster-wide update
Taking advantage of retryable writes ?retryWrites=true mongodb://…
One down, two to go Retryable writes Zombie cursor cleanup
Cluster-wide killOp
Zombie Cursor Cleanup
Zombie Cursor Cleanup
You’re running a long query
You’re running a long query cursor = db.coll.find(); cursor.forEach(function() {
// lengthy processing… });
You’re running a long query cursor = db.coll.find(); cursor.forEach(function() {
// lengthy processing… });
Cursors have a timeout
Cursors have a timeout A er 10 minutes, the server
will close a cursor due to inactivity.
Cursors have a timeout A er 10 minutes, the server
will close a cursor due to inactivity. Issuing a getMore resets the clock.
Disabling cursor timeouts cursor = db.coll.find( { }, { noCursorTimeout:
true } ); cursor.forEach(function() { // lengthy processing…
Disabling cursor timeouts cursor = db.coll.find( { }, { noCursorTimeout:
true } ); cursor.forEach(function() { // lengthy processing…
Executing our long query
Executing our long query find
Executing our long query
Executing our long query getMore
Executing our long query
Executing our long query getMore
Executing our long query
Executing our long query getMore
Executing our long query getMore
Executing our long query getMore
Executing our long query getMore
Executing our long query getMore
Executing our long query getMore
Executing our long query getMore
None
None
None
None
A zombie cursor is born > db.serverStatus() { "metrics": {
"cursor": { "open": { "noTimeout": 1, "total": 1
What happened last night?
What happened last night? (from the server’s POV)
What happened last night? (from the server’s POV)
What happened last night? (from the server’s POV) find
What happened last night? (from the server’s POV)
What happened last night? (from the server’s POV) getMore
What happened last night? (from the server’s POV)
What happened last night? (from the server’s POV) getMore
What happened last night? (from the server’s POV)
What happened last night? (from the server’s POV)
What happened last night? (from the server’s POV)
Avoiding zombie cursors with logical sessions
Avoiding zombie cursors with logical sessions Sessions also have a
timeout.
Avoiding zombie cursors with logical sessions Sessions also have a
timeout. We can associate queries with a session
Querying with a session session = client.startSession(); cursor = db.coll.find(
{ }, { session: session } );
Executing our long query
Executing our long query
Executing our long query
Executing our long query find
Executing our long query
Executing our long query getMore
Executing our long query
Executing our long query getMore
Executing our long query
Executing our long query
Executing our long query session expires
Executing our long query
Did we just punt on the timeout issue?
Session timeouts are non-negotiable
Session timeouts are non-negotiable Idle sessions will expire.
Session timeouts are non-negotiable Idle sessions will expire. Any operation
using the session resets the clock.
Two down, one to go Retryable writes Zombie cursor cleanup
Cluster-wide killOp
Cluster-wide killOp
Cluster-wide killOp
You’re running an operation that may never complete
You’re running an operation that may never complete cursor =
db.coll.find( { … } // table scans for days );
You’ve made a terrible mistake
Step 1: Find the operation ID > db.currentOp() { "inprog"
: [ { "desc" : "conn2", "threadId" : "140181791471360", "connectionId" : 2, "client" : "127.0.0.1:49456", "appName" : "MongoDB Shell", "active" : true, "opid" : 132921,
Step 2: Kill the operation ID > db.killOp(132921) { "info":
"attempting to kill op", "ok": 1 }
Lather, rinse, repeat
Lather, rinse, repeat > connect("mongodb://shard-2.example.com")
Lather, rinse, repeat > connect("mongodb://shard-2.example.com") > db.currentOp() { "inprog" :
[ // … ] }
Lather, rinse, repeat > connect("mongodb://shard-2.example.com") > db.currentOp() { "inprog" :
[ // … ] } > db.killOp(…)
Lather, rinse, repeat > connect("mongodb://shard-2.example.com") > db.currentOp() { "inprog" :
[ // … ] } > db.killOp(…)
How did this happen? mongos shard 1 shard 2 shard
3
How did this happen? mongos shard 1 shard 2 shard
3
How did this happen? mongos shard 1 shard 2 shard
3
Cluster-wide killOp with logical sessions
Cluster-wide killOp with logical sessions Any operation may be associated
with a session.
Cluster-wide killOp with logical sessions Any operation may be associated
with a session. Terminating a session will end all of its associated operations.
Terminating a session session = client.startSession(); cursor = db.coll.find( {
… }, // table scans for days { session: session } );
Querying with sessions mongos shard 1 shard 2 shard 3
Querying with sessions mongos shard 1 shard 2 shard 3
Querying with sessions mongos shard 1 shard 2 shard 3
Querying with sessions mongos shard 1 shard 2 shard 3
Querying with sessions mongos shard 1 shard 2 shard 3
Querying with sessions mongos shard 1 shard 2 shard 3
Querying with sessions mongos shard 1 shard 2 shard 3
Querying with sessions mongos shard 1 shard 2 shard 3
Querying with sessions mongos shard 1 shard 2 shard 3
Querying with sessions mongos shard 1 shard 2 shard 3
That’s a wrap Retryable writes Zombie cursor cleanup Cluster-wide killOp
One last point
Resilence is primarily the driver’s domain
Resilence is primarily the driver’s domain Server discovery and monitoring
Resilence is primarily the driver’s domain Server discovery and monitoring
Elections and failover recovery
Resilence is primarily the driver’s domain Server discovery and monitoring
Elections and failover recovery Load-balancing mongos connections
Resilence is primarily the driver’s domain Server discovery and monitoring
Elections and failover recovery Load-balancing mongos connections Routing queries by read preference
Addressing resilence on the server-side
Addressing resilence on the server-side Tracking operation state
Addressing resilence on the server-side Tracking operation state Cluster-wide sessions
Providing a relatively easy upgrade path
Providing a relatively easy upgrade path No need to rewrite
applications
Providing a relatively easy upgrade path No need to rewrite
applications Opting in to retryable writes
Providing a relatively easy upgrade path No need to rewrite
applications Opting in to retryable writes New API for client session objects
Providing a relatively easy upgrade path No need to rewrite
applications Opting in to retryable writes New API for client session objects Pass session option as needed
Inside the spec process mongodb/specifications
Inside the spec process /sessions mongodb/specifications
Inside the spec process /sessions /retryable-writes mongodb/specifications
Inside the spec process /sessions /retryable-writes /causal-consistency mongodb/specifications
Inside the spec process /sessions /retryable-writes /causal-consistency /retryable-reads mongodb/specifications
In the meantime… How To Write Resilient MongoDB Applications
Thanks!