Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
It's 10pm: Do You Know Where Your Writes Are?
Search
Jeremy Mikola
October 12, 2017
Programming
0
180
It's 10pm: Do You Know Where Your Writes Are?
Presented October 12, 2017 at MongoDB.local San Francisco.
Jeremy Mikola
October 12, 2017
Tweet
Share
More Decks by Jeremy Mikola
See All by Jeremy Mikola
PHP Internals for the Inquisitive Developer
jmikola
1
580
Bulletproof MongoDB
jmikola
0
440
Zero to Sixty with MongoDB
jmikola
3
1.1k
DOs and DON’Ts of MongoDB
jmikola
13
3k
Five Years of Beta
jmikola
0
130
Rethinking Extension Development for PHP and HHVM
jmikola
2
760
What's New in MongoDB 3.2
jmikola
0
88
Async PHP with React
jmikola
28
11k
NoSQL Lightning Talks (MongoDB, Cassandra, MySQL)
jmikola
1
210
Other Decks in Programming
See All in Programming
Pythonでもちょっとリッチな見た目のアプリを設計してみる
ueponx
1
270
Compose でデザインと実装の差異を減らすための取り組み
oidy
1
280
AWS Organizations で実現する、 マルチ AWS アカウントのルートユーザー管理からの脱却
atpons
0
110
DMMオンラインサロンアプリのSwift化
hayatan
0
280
ISUCON14感想戦で85万点まで頑張ってみた
ponyo877
1
790
令和7年版 あなたが使ってよいフロントエンド機能とは
mugi_uno
12
6.1k
『改訂新版 良いコード/悪いコードで学ぶ設計入門』活用方法−爆速でスキルアップする!効果的な学習アプローチ / effective-learning-of-good-code
minodriven
29
4.9k
ゼロからの、レトロゲームエンジンの作り方
tokujiros
3
1.2k
ESLintプラグインを使用してCDKのセオリーを適用する
yamanashi_ren01
2
420
Alba: Why, How and What's So Interesting
okuramasafumi
0
240
為你自己學 Python
eddie
0
540
Java Webフレームワークの現状 / java web framework at burikaigi
kishida
9
2.1k
Featured
See All Featured
For a Future-Friendly Web
brad_frost
176
9.5k
Large-scale JavaScript Application Architecture
addyosmani
510
110k
The MySQL Ecosystem @ GitHub 2015
samlambert
250
12k
What's in a price? How to price your products and services
michaelherold
244
12k
Imperfection Machines: The Place of Print at Facebook
scottboms
267
13k
Designing Experiences People Love
moore
139
23k
Unsuck your backbone
ammeep
669
57k
Fight the Zombie Pattern Library - RWD Summit 2016
marcelosomers
232
17k
CoffeeScript is Beautiful & I Never Want to Write Plain JavaScript Again
sstephenson
160
15k
Principles of Awesome APIs and How to Build Them.
keavy
126
17k
VelocityConf: Rendering Performance Case Studies
addyosmani
327
24k
Put a Button on it: Removing Barriers to Going Fast.
kastner
60
3.7k
Transcript
It’s 10PM: Do you know where your writes are? Jeremy
Mikola jmikola
It’s 11AM: Do you know where your writes are? Jeremy
Mikola jmikola
On the roadmap Retryable writes Zombie cursor cleanup Cluster-wide killOp
Retryable Writes
Retryable Writes
Retryable Writes
You’re updating a document
You’re updating a document db.coll.updateOne( { _id: 16 }, {
$inc: { count: 1 }} );
Murphy’s Law kicks in
Is it safe to retry the update?
Did our message never make it to the server?
Did we lose the server’s reply?
Did something else happen?
Did something else happen?
Did something else happen?
Did something else happen?
Did something else happen?
There’s no way to retrieve an operation’s state
Let’s review some best practices How To Write Resilient MongoDB
Applications
This was our update… db.coll.updateOne( { _id: 16 }, {
$inc: { count: 1 }} );
Errors and retry strategies Transient network error Persistent outage Command
error Never retry Always retry Retry once
Errors and retry strategies Transient network error Persistent outage Command
error Never retry May undercount Always retry Retry once
Errors and retry strategies Transient network error Persistent outage Command
error Never retry May undercount OK Always retry Retry once
Errors and retry strategies Transient network error Persistent outage Command
error Never retry May undercount OK OK Always retry Retry once
Errors and retry strategies Transient network error Persistent outage Command
error Never retry May undercount OK OK Always retry May overcount Retry once
Errors and retry strategies Transient network error Persistent outage Command
error Never retry May undercount OK OK Always retry May overcount Wastes time Retry once
Errors and retry strategies Transient network error Persistent outage Command
error Never retry May undercount OK OK Always retry May overcount Wastes time Wastes time Retry once
Errors and retry strategies Transient network error Persistent outage Command
error Never retry May undercount OK OK Always retry May overcount Wastes time Wastes time Retry once May overcount
Errors and retry strategies Transient network error Persistent outage Command
error Never retry May undercount OK OK Always retry May overcount Wastes time Wastes time Retry once May overcount OK
Errors and retry strategies Transient network error Persistent outage Command
error Never retry May undercount OK OK Always retry May overcount Wastes time Wastes time Retry once May overcount OK OK
There’s no good solution for transient network errors
We can safely retry idempotent operations Transient network error Persistent
outage Command error Retry once
We can safely retry idempotent operations Transient network error Persistent
outage Command error Retry once OK
We can safely retry idempotent operations Transient network error Persistent
outage Command error Retry once OK OK
We can safely retry idempotent operations Transient network error Persistent
outage Command error Retry once OK OK OK
Safe-to-retry inserts db.coll.insertOne( { _id: 18, name: "Alice" } );
Safe-to-retry deletes db.coll.deleteOne( { _id: 20 } ); db.coll.deleteMany( {
status: "inactive" } );
Safe-to-retry updates db.coll.updateOne( { _id: 22 }, { $set: {
status: "active" }} );
Why can’t we retrieve an operation’s state?
In MongoDB 3.4, state is tied to connection objects
MongoDB 3.6 introduces logical sessions
MongoDB 3.6 introduces logical sessions Sessions allow us to maintain
cluster-wide state about the user and their operations.
MongoDB 3.6 introduces logical sessions Sessions allow us to maintain
cluster-wide state about the user and their operations. Sessions are not tied to connections.
Retrying writes with a session
Retrying writes with a session
Retrying writes with a session
Retrying writes with a session update
Retrying writes with a session
Retrying writes with a session update
Retrying writes with a session
We can trust the server to Do the Right Thing™
We can trust the server to Do the Right Thing™
If the write already executed, return the result we missed.
We can trust the server to Do the Right Thing™
If the write already executed, return the result we missed. If the write never executed, do it now and return its result.
Sessions are cluster-wide
Sessions are cluster-wide update
Sessions are cluster-wide update
Sessions are cluster-wide
Sessions are cluster-wide
Sessions are cluster-wide update
Taking advantage of retryable writes ?retryWrites=true mongodb://…
One down, two to go Retryable writes Zombie cursor cleanup
Cluster-wide killOp
Zombie Cursor Cleanup
Zombie Cursor Cleanup
You’re running a long query
You’re running a long query cursor = db.coll.find(); cursor.forEach(function() {
// lengthy processing… });
You’re running a long query cursor = db.coll.find(); cursor.forEach(function() {
// lengthy processing… });
Cursors have a timeout
Cursors have a timeout A er 10 minutes, the server
will close a cursor due to inactivity.
Cursors have a timeout A er 10 minutes, the server
will close a cursor due to inactivity. Issuing a getMore resets the clock.
Disabling cursor timeouts cursor = db.coll.find( { }, { noCursorTimeout:
true } ); cursor.forEach(function() { // lengthy processing…
Disabling cursor timeouts cursor = db.coll.find( { }, { noCursorTimeout:
true } ); cursor.forEach(function() { // lengthy processing…
Executing our long query
Executing our long query find
Executing our long query
Executing our long query getMore
Executing our long query
Executing our long query getMore
Executing our long query
Executing our long query getMore
Executing our long query getMore
Executing our long query getMore
Executing our long query getMore
Executing our long query getMore
Executing our long query getMore
Executing our long query getMore
None
None
None
None
A zombie cursor is born > db.serverStatus() { "metrics": {
"cursor": { "open": { "noTimeout": 1, "total": 1
What happened last night?
What happened last night? (from the server’s POV)
What happened last night? (from the server’s POV)
What happened last night? (from the server’s POV) find
What happened last night? (from the server’s POV)
What happened last night? (from the server’s POV) getMore
What happened last night? (from the server’s POV)
What happened last night? (from the server’s POV) getMore
What happened last night? (from the server’s POV)
What happened last night? (from the server’s POV)
What happened last night? (from the server’s POV)
Avoiding zombie cursors with logical sessions
Avoiding zombie cursors with logical sessions Sessions also have a
timeout.
Avoiding zombie cursors with logical sessions Sessions also have a
timeout. We can associate queries with a session
Querying with a session session = client.startSession(); cursor = db.coll.find(
{ }, { session: session } );
Executing our long query
Executing our long query
Executing our long query
Executing our long query find
Executing our long query
Executing our long query getMore
Executing our long query
Executing our long query getMore
Executing our long query
Executing our long query
Executing our long query session expires
Executing our long query
Did we just punt on the timeout issue?
Session timeouts are non-negotiable
Session timeouts are non-negotiable Idle sessions will expire.
Session timeouts are non-negotiable Idle sessions will expire. Any operation
using the session resets the clock.
Two down, one to go Retryable writes Zombie cursor cleanup
Cluster-wide killOp
Cluster-wide killOp
Cluster-wide killOp
You’re running an operation that may never complete
You’re running an operation that may never complete cursor =
db.coll.find( { … } // table scans for days );
You’ve made a terrible mistake
Step 1: Find the operation ID > db.currentOp() { "inprog"
: [ { "desc" : "conn2", "threadId" : "140181791471360", "connectionId" : 2, "client" : "127.0.0.1:49456", "appName" : "MongoDB Shell", "active" : true, "opid" : 132921,
Step 2: Kill the operation ID > db.killOp(132921) { "info":
"attempting to kill op", "ok": 1 }
Lather, rinse, repeat
Lather, rinse, repeat > connect("mongodb://shard-2.example.com")
Lather, rinse, repeat > connect("mongodb://shard-2.example.com") > db.currentOp() { "inprog" :
[ // … ] }
Lather, rinse, repeat > connect("mongodb://shard-2.example.com") > db.currentOp() { "inprog" :
[ // … ] } > db.killOp(…)
Lather, rinse, repeat > connect("mongodb://shard-2.example.com") > db.currentOp() { "inprog" :
[ // … ] } > db.killOp(…)
How did this happen? mongos shard 1 shard 2 shard
3
How did this happen? mongos shard 1 shard 2 shard
3
How did this happen? mongos shard 1 shard 2 shard
3
Cluster-wide killOp with logical sessions
Cluster-wide killOp with logical sessions Any operation may be associated
with a session.
Cluster-wide killOp with logical sessions Any operation may be associated
with a session. Terminating a session will end all of its associated operations.
Terminating a session session = client.startSession(); cursor = db.coll.find( {
… }, // table scans for days { session: session } );
Querying with sessions mongos shard 1 shard 2 shard 3
Querying with sessions mongos shard 1 shard 2 shard 3
Querying with sessions mongos shard 1 shard 2 shard 3
Querying with sessions mongos shard 1 shard 2 shard 3
Querying with sessions mongos shard 1 shard 2 shard 3
Querying with sessions mongos shard 1 shard 2 shard 3
Querying with sessions mongos shard 1 shard 2 shard 3
Querying with sessions mongos shard 1 shard 2 shard 3
Querying with sessions mongos shard 1 shard 2 shard 3
Querying with sessions mongos shard 1 shard 2 shard 3
That’s a wrap Retryable writes Zombie cursor cleanup Cluster-wide killOp
One last point
Resilence is primarily the driver’s domain
Resilence is primarily the driver’s domain Server discovery and monitoring
Resilence is primarily the driver’s domain Server discovery and monitoring
Elections and failover recovery
Resilence is primarily the driver’s domain Server discovery and monitoring
Elections and failover recovery Load-balancing mongos connections
Resilence is primarily the driver’s domain Server discovery and monitoring
Elections and failover recovery Load-balancing mongos connections Routing queries by read preference
Addressing resilence on the server-side
Addressing resilence on the server-side Tracking operation state
Addressing resilence on the server-side Tracking operation state Cluster-wide sessions
Providing a relatively easy upgrade path
Providing a relatively easy upgrade path No need to rewrite
applications
Providing a relatively easy upgrade path No need to rewrite
applications Opting in to retryable writes
Providing a relatively easy upgrade path No need to rewrite
applications Opting in to retryable writes New API for client session objects
Providing a relatively easy upgrade path No need to rewrite
applications Opting in to retryable writes New API for client session objects Pass session option as needed
Inside the spec process mongodb/specifications
Inside the spec process /sessions mongodb/specifications
Inside the spec process /sessions /retryable-writes mongodb/specifications
Inside the spec process /sessions /retryable-writes /causal-consistency mongodb/specifications
Inside the spec process /sessions /retryable-writes /causal-consistency /retryable-reads mongodb/specifications
In the meantime… How To Write Resilient MongoDB Applications
Thanks!