It's 10pm: Do You Know Where Your Writes Are?
Jeremy Mikola
October 12, 2017
Presented October 12, 2017 at MongoDB.local San Francisco.
Transcript
It’s 10PM: Do you know where your writes are? Jeremy Mikola (jmikola)
It’s 11AM: Do you know where your writes are? Jeremy Mikola (jmikola)
On the roadmap: retryable writes, zombie cursor cleanup, cluster-wide killOp
Retryable Writes
You’re updating a document
    db.coll.updateOne({ _id: 16 }, { $inc: { count: 1 } });
Murphy’s Law kicks in
Is it safe to retry the update?
Did our message never make it to the server?
Did we lose the server’s reply?
Did something else happen?
There’s no way to retrieve an operation’s state
Let’s review some best practices: “How To Write Resilient MongoDB Applications”
This was our update…
    db.coll.updateOne({ _id: 16 }, { $inc: { count: 1 } });
Errors and retry strategies
                Transient network error   Persistent outage   Command error
Never retry     May undercount            OK                  OK
Always retry    May overcount             Wastes time         Wastes time
Retry once      May overcount             OK                  OK
There’s no good solution for transient network errors
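To make the "May overcount" cell concrete, here is a minimal sketch of the lost-reply case. It is a toy in-memory model, not MongoDB code: `makeLossyServer`, `incrementWithRetryOnce`, and the `dropReply` option are hypothetical names invented for illustration.

```javascript
// Toy in-memory "server" that applies { $inc: { count: 1 } } to one document.
// Hypothetical model for illustration only; not MongoDB code.
function makeLossyServer() {
  const doc = { _id: 16, count: 0 };
  return {
    doc,
    // Applies the increment, then optionally "loses" the reply on the way back.
    incrementCount({ dropReply = false } = {}) {
      doc.count += 1;
      if (dropReply) {
        throw new Error("transient network error: reply lost");
      }
      return { matchedCount: 1, modifiedCount: 1 };
    },
  };
}

// "Retry once" strategy: on any error, send the same write one more time.
function incrementWithRetryOnce(server, firstAttemptOptions) {
  try {
    return server.incrementCount(firstAttemptOptions);
  } catch (err) {
    // The client cannot tell whether the first attempt was applied.
    return server.incrementCount();
  }
}

const overcountServer = makeLossyServer();
incrementWithRetryOnce(overcountServer, { dropReply: true });
console.log(overcountServer.doc.count); // 2: one logical update, applied twice
```

If the error had meant the request never arrived, the same retry would have produced the correct count of 1. The client sees the same exception in both cases, which is why no strategy in the table is safe for transient network errors.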
We can safely retry idempotent operations
                Transient network error   Persistent outage   Command error
Retry once      OK                        OK                  OK
Safe-to-retry inserts
    db.coll.insertOne({ _id: 18, name: "Alice" });
Safe-to-retry deletes
    db.coll.deleteOne({ _id: 20 });
    db.coll.deleteMany({ status: "inactive" });
Safe-to-retry updates
    db.coll.updateOne({ _id: 22 }, { $set: { status: "active" } });
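One way to see why these operations are safe to retry is to apply each one once and twice and compare the resulting data. The sketch below does that against a toy `Map`-backed collection. The helpers (`insertOne`, `deleteOne`, `setStatus`, `incCount`, `sameAfterRetry`) are hypothetical stand-ins, not the MongoDB API.

```javascript
// Toy collection keyed by _id; hypothetical helpers, not the MongoDB API.
function insertOne(coll, doc) {
  if (coll.has(doc._id)) {
    // A retried insert with an explicit _id fails without changing anything.
    throw new Error("duplicate key: " + doc._id);
  }
  coll.set(doc._id, { ...doc });
}

function deleteOne(coll, id) {
  return coll.delete(id); // deleting an already-deleted document is a no-op
}

function setStatus(coll, id, status) {
  const doc = coll.get(id);
  if (doc) doc.status = status; // $set to a constant: same result every time
}

function incCount(coll, id) {
  const doc = coll.get(id);
  if (doc) doc.count += 1; // $inc: each application changes the result
}

// Runs `op` once on one copy of the data and twice on another, then compares.
function sameAfterRetry(seedDocs, op) {
  const run = (times) => {
    const coll = new Map(seedDocs.map((d) => [d._id, { ...d }]));
    for (let i = 0; i < times; i++) {
      try {
        op(coll);
      } catch (err) {
        // e.g. duplicate key on the retried insert; the data is unchanged
      }
    }
    return JSON.stringify([...coll.entries()]);
  };
  return run(1) === run(2);
}

const seed = [{ _id: 16, count: 0 }, { _id: 20 }, { _id: 22, status: "new" }];
console.log(sameAfterRetry(seed, (c) => insertOne(c, { _id: 18, name: "Alice" }))); // true
console.log(sameAfterRetry(seed, (c) => deleteOne(c, 20)));                         // true
console.log(sameAfterRetry(seed, (c) => setStatus(c, 22, "active")));               // true
console.log(sameAfterRetry(seed, (c) => incCount(c, 16)));                          // false
```

Only the `$inc`-style update differs after a retry, which matches the earlier slide's counter example.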
Why can’t we retrieve an operation’s state?
In MongoDB 3.4, state is tied to connection objects
MongoDB 3.6 introduces logical sessions
Sessions allow us to maintain cluster-wide state about the user and their operations.
Sessions are not tied to connections.
Retrying writes with a session (diagram: update)
We can trust the server to Do the Right Thing™
If the write already executed, return the result we missed.
If the write never executed, do it now and return its result.
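The two rules above can be sketched with a toy server that remembers each write's result under a session ID plus a per-session write number. This is an illustrative model only, not MongoDB's implementation: `makeDedupingServer`, `retryableIncrement`, and the `"session-A"` identifier are hypothetical names.

```javascript
// Toy server that remembers each write's result by (session ID, write number).
// Illustrative model only; names are hypothetical, not MongoDB internals.
function makeDedupingServer() {
  const doc = { _id: 16, count: 0 };
  const completed = new Map(); // "sessionId:writeNumber" -> saved result

  return {
    doc,
    incrementCount(sessionId, writeNumber, { dropReply = false } = {}) {
      const key = sessionId + ":" + writeNumber;
      if (!completed.has(key)) {
        // The write never executed: do it now and record its result.
        doc.count += 1;
        completed.set(key, { matchedCount: 1, modifiedCount: 1 });
      }
      if (dropReply) {
        throw new Error("transient network error: reply lost");
      }
      // Either a fresh result or the one the client missed the first time.
      return completed.get(key);
    },
  };
}

// Retry once, reusing the same session ID and write number for the retry.
function retryableIncrement(server, sessionId, writeNumber, firstAttemptOptions) {
  try {
    return server.incrementCount(sessionId, writeNumber, firstAttemptOptions);
  } catch (err) {
    return server.incrementCount(sessionId, writeNumber);
  }
}

const dedupingServer = makeDedupingServer();
const retriedResult = retryableIncrement(dedupingServer, "session-A", 1, { dropReply: true });
console.log(dedupingServer.doc.count, retriedResult.modifiedCount); // 1 1
```

Because the retry carries the same identifiers as the first attempt, the server can tell a repeat from a new write, and the counter ends at 1 even though the reply was lost.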
Sessions are cluster-wide (diagram: update)
Taking advantage of retryable writes: mongodb://…?retryWrites=true
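As a small sketch of opting in, the helper below appends the `retryWrites=true` option from the slide to a connection string. `withRetryableWrites` and the `localhost` URIs are hypothetical and exist only for this example. Drivers parse the option from the URI themselves.

```javascript
// Appends retryWrites=true to a MongoDB connection string.
// Hypothetical helper for illustration; the example URIs are assumptions.
function withRetryableWrites(uri) {
  if (/[?&]retryWrites=/.test(uri)) {
    return uri; // leave an explicit setting alone
  }
  const separator = uri.includes("?") ? "&" : "?";
  return uri + separator + "retryWrites=true";
}

console.log(withRetryableWrites("mongodb://localhost:27017/test"));
// mongodb://localhost:27017/test?retryWrites=true
console.log(withRetryableWrites("mongodb://localhost:27017/test?w=majority"));
// mongodb://localhost:27017/test?w=majority&retryWrites=true
```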
One down, two to go: retryable writes, zombie cursor cleanup, cluster-wide killOp
Zombie Cursor Cleanup
You’re running a long query
    cursor = db.coll.find();
    cursor.forEach(function() {
      // lengthy processing…
    });
Cursors have a timeout
After 10 minutes, the server will close a cursor due to inactivity.
Issuing a getMore resets the clock.
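The idle-timeout rule can be sketched as a toy cursor table driven by an explicit clock. This is not how mongod is implemented: `makeCursorTable`, `reapIdle`, and `isOpen` are hypothetical names, and the `noCursorTimeout` option is modeled only as a flag that exempts a cursor from the sweep.

```javascript
const CURSOR_TIMEOUT_MS = 10 * 60 * 1000; // the 10-minute idle limit from the slide

// Toy server-side cursor table driven by an explicit clock (milliseconds).
// Hypothetical model for illustration; not how mongod is implemented.
function makeCursorTable() {
  const cursors = new Map(); // cursorId -> { lastUsed, noTimeout }
  let nextId = 1;
  return {
    find(now, { noCursorTimeout = false } = {}) {
      const id = nextId++;
      cursors.set(id, { lastUsed: now, noTimeout: noCursorTimeout });
      return id;
    },
    getMore(now, id) {
      const cursor = cursors.get(id);
      if (!cursor) throw new Error("cursor " + id + " not found");
      cursor.lastUsed = now; // any getMore resets the idle clock
    },
    // Periodic sweep: close cursors that have been idle for too long.
    reapIdle(now) {
      for (const [id, cursor] of cursors) {
        if (!cursor.noTimeout && now - cursor.lastUsed >= CURSOR_TIMEOUT_MS) {
          cursors.delete(id);
        }
      }
    },
    isOpen(id) {
      return cursors.has(id);
    },
  };
}

const minutes = (n) => n * 60 * 1000;
const table = makeCursorTable();
const busy = table.find(0);
const idle = table.find(0);
const pinned = table.find(0, { noCursorTimeout: true });

table.getMore(minutes(8), busy); // activity at the 8-minute mark
table.reapIdle(minutes(12));

console.log(table.isOpen(busy));   // true: only 4 minutes idle
console.log(table.isOpen(idle));   // false: idle for 12 minutes
console.log(table.isOpen(pinned)); // true: exempt from the sweep
```

The `pinned` cursor survives every sweep no matter how long the client stays silent. In this model, nothing else would ever close it.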
Disabling cursor timeouts
    cursor = db.coll.find({ }, { noCursorTimeout: true });
    cursor.forEach(function() {
      // lengthy processing…
    });
Executing our long query (diagram: find, then repeated getMore)
A zombie cursor is born
    > db.serverStatus()
    { "metrics": { "cursor": { "open": { "noTimeout": 1, "total": 1 …
What happened last night? (from the server’s POV; diagram: find, getMore, getMore)
Avoiding zombie cursors with logical sessions
Sessions also have a timeout.
We can associate queries with a session.
Querying with a session
    session = client.startSession();
    cursor = db.coll.find({ }, { session: session });
Executing our long query (diagram: find, getMore, getMore, session expires)
Did we just punt on the timeout issue?
Session timeouts are non-negotiable
Idle sessions will expire.
Any operation using the session resets the clock.
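A minimal sketch of how session expiry cleans up cursors: every operation refreshes its session, and a sweep removes expired sessions together with the cursors they own. The model is a toy with hypothetical names (`makeSessionServer`, `reapExpiredSessions`). The 30-minute timeout is an assumption for this example, since the slides do not state the value.

```javascript
const SESSION_TIMEOUT_MS = 30 * 60 * 1000; // assumed value for this sketch

// Toy server where every cursor belongs to a session.
// Hypothetical model for illustration; not MongoDB's implementation.
function makeSessionServer() {
  const sessions = new Map(); // sessionId -> time of last use
  const cursors = new Map();  // cursorId -> owning sessionId
  let nextCursorId = 1;

  // Any operation using the session resets its clock.
  function touch(now, sessionId) {
    sessions.set(sessionId, now);
  }

  return {
    find(now, sessionId) {
      touch(now, sessionId);
      const id = nextCursorId++;
      cursors.set(id, sessionId);
      return id;
    },
    getMore(now, cursorId) {
      const sessionId = cursors.get(cursorId);
      if (sessionId === undefined) throw new Error("cursor " + cursorId + " not found");
      touch(now, sessionId);
    },
    // Expire idle sessions and close every cursor they own.
    reapExpiredSessions(now) {
      for (const [sessionId, lastUsed] of sessions) {
        if (now - lastUsed >= SESSION_TIMEOUT_MS) {
          sessions.delete(sessionId);
          for (const [cursorId, owner] of cursors) {
            if (owner === sessionId) cursors.delete(cursorId);
          }
        }
      }
    },
    openCursorCount() {
      return cursors.size;
    },
  };
}

const mins = (n) => n * 60 * 1000;
const sessionServer = makeSessionServer();
const longQuery = sessionServer.find(0, "session-A");
sessionServer.getMore(mins(20), longQuery);  // client is still working

sessionServer.reapExpiredSessions(mins(40)); // 20 minutes idle: session survives
console.log(sessionServer.openCursorCount()); // 1

// The client goes away and sends nothing further.
sessionServer.reapExpiredSessions(mins(60)); // 40 minutes idle: session expires
console.log(sessionServer.openCursorCount()); // 0
```

Unlike a per-cursor flag, the session clock in this model cannot be switched off. An abandoned cursor is therefore always reclaimed eventually.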
Two down, one to go: retryable writes, zombie cursor cleanup, cluster-wide killOp
Cluster-wide killOp
You’re running an operation that may never complete
    cursor = db.coll.find(
      { … } // table scans for days
    );
You’ve made a terrible mistake
Step 1: Find the operation ID
    > db.currentOp()
    {
      "inprog" : [
        {
          "desc" : "conn2",
          "threadId" : "140181791471360",
          "connectionId" : 2,
          "client" : "127.0.0.1:49456",
          "appName" : "MongoDB Shell",
          "active" : true,
          "opid" : 132921,
          …
Step 2: Kill the operation ID
    > db.killOp(132921)
    { "info": "attempting to kill op", "ok": 1 }
Lather, rinse, repeat
    > connect("mongodb://shard-2.example.com")
    > db.currentOp()
    {
      "inprog" : [
        // …
      ]
    }
    > db.killOp(…)
How did this happen? (diagram: mongos, shard 1, shard 2, shard 3)
Cluster-wide killOp with logical sessions
Any operation may be associated with a session.
Terminating a session will end all of its associated operations.
Terminating a session
    session = client.startSession();
    cursor = db.coll.find(
      { … }, // table scans for days
      { session: session }
    );
Querying with sessions (diagram: mongos, shard 1, shard 2, shard 3)
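To contrast per-shard killOp with ending a session, here is a toy sharded cluster in which a query fans out into one operation per shard, each tagged with its session. The names (`makeCluster`, `killSession`, `runningOpCount`, the shard and session labels) are hypothetical and do not reflect the server's actual commands or internals.

```javascript
// Toy sharded cluster: each shard tracks its own running operations.
// Hypothetical model for illustration; not MongoDB's implementation.
function makeCluster(shardNames) {
  // shard name -> (opid -> owning sessionId)
  const shards = new Map(shardNames.map((name) => [name, new Map()]));
  let nextOpId = 1;
  return {
    // mongos-style fan-out: the query becomes one operation per shard.
    find(sessionId) {
      for (const ops of shards.values()) {
        ops.set(nextOpId++, sessionId);
      }
    },
    currentOp(shardName) {
      return [...shards.get(shardName).keys()];
    },
    // Per-shard approach: kill one operation on one shard at a time.
    killOp(shardName, opid) {
      return shards.get(shardName).delete(opid);
    },
    // Session-based approach: end every operation tagged with the session.
    killSession(sessionId) {
      let killed = 0;
      for (const ops of shards.values()) {
        for (const [opid, owner] of ops) {
          if (owner === sessionId) {
            ops.delete(opid);
            killed++;
          }
        }
      }
      return killed;
    },
    runningOpCount() {
      let total = 0;
      for (const ops of shards.values()) total += ops.size;
      return total;
    },
  };
}

const cluster = makeCluster(["shard-1", "shard-2", "shard-3"]);
cluster.find("session-A"); // the runaway query
cluster.find("session-B"); // an unrelated query that should keep running

console.log(cluster.runningOpCount());         // 6
console.log(cluster.killSession("session-A")); // 3
console.log(cluster.runningOpCount());         // 3
```

With `killOp` alone, the same cleanup would take a `currentOp` lookup and a `killOp` call on each shard, which is the "lather, rinse, repeat" loop from the earlier slides.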
That’s a wrap: retryable writes, zombie cursor cleanup, cluster-wide killOp
One last point
Resilience is primarily the driver’s domain:
Server discovery and monitoring
Elections and failover recovery
Load-balancing mongos connections
Routing queries by read preference
Addressing resilience on the server side:
Tracking operation state
Cluster-wide sessions
Providing a relatively easy upgrade path:
No need to rewrite applications
Opting in to retryable writes
New API for client session objects
Pass session option as needed
Inside the spec process: mongodb/specifications
/sessions
/retryable-writes
/causal-consistency
/retryable-reads
In the meantime… “How To Write Resilient MongoDB Applications”
Thanks!