It's 10pm: Do You Know Where Your Writes Are?

It’s 10PM: Do you know where your writes are?
Jeremy Mikola (jmikola)

It’s 11AM: Do you know where your writes are?

On the roadmap
Retryable writes
Zombie cursor cleanup
Cluster-wide killOp

Retryable Writes

You’re updating a document
    db.coll.updateOne(
      { _id: 16 },
      { $inc: { count: 1 } }
    );

Murphy’s Law kicks in

Is it safe to retry the update?

Did our message never make it to the server?

Did we lose the server’s reply?

Did something else happen?

There’s no way to retrieve an operation’s state

Let’s review some best practices
How To Write Resilient MongoDB Applications

This was our update…
    db.coll.updateOne(
      { _id: 16 },
      { $inc: { count: 1 } }
    );

Errors and retry strategies
                Transient network error   Persistent outage   Command error
Never retry     May undercount            OK                  OK
Always retry    May overcount             Wastes time         Wastes time
Retry once      May overcount             OK                  OK

There’s no good solution for transient network errors
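To make the "may overcount" cells concrete, here is a minimal sketch in plain JavaScript (no MongoDB involved) of a transient network error in which the server applies the $inc but its reply is lost. The FlakyServer class and the updateWithRetry helper are hypothetical names invented for this illustration, not part of any driver.

```javascript
// Hypothetical in-memory stand-in for a server; not a MongoDB API.
class FlakyServer {
  constructor() {
    this.count = 0;
    this.dropNextReply = false;
  }

  // Applies { $inc: { count: 1 } }, then possibly "loses" the reply.
  inc() {
    this.count += 1;
    if (this.dropNextReply) {
      this.dropNextReply = false;
      throw new Error("network error: reply lost");
    }
    return { modifiedCount: 1 };
  }
}

// Retries the increment up to maxRetries times after a failure.
function updateWithRetry(server, maxRetries) {
  for (let attempt = 0; ; attempt++) {
    try {
      return server.inc();
    } catch (err) {
      if (attempt >= maxRetries) throw err;
    }
  }
}

const server = new FlakyServer();
server.dropNextReply = true;
updateWithRetry(server, 1); // "retry once"
console.log(server.count); // 2: the increment was applied twice (overcount)
```

The client cannot tell a lost request from a lost reply, so in this model any blind retry of a non-idempotent update risks applying it twice.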

We can safely retry idempotent operations
              Transient network error   Persistent outage   Command error
Retry once    OK                        OK                  OK

Safe-to-retry inserts
    db.coll.insertOne( { _id: 18, name: "Alice" } );

Safe-to-retry deletes
    db.coll.deleteOne( { _id: 20 } );
    db.coll.deleteMany( { status: "inactive" } );

Safe-to-retry updates
    db.coll.updateOne(
      { _id: 22 },
      { $set: { status: "active" } }
    );
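As a rough illustration of why a $set can be retried while an $inc cannot, the sketch below applies each operation twice to a plain JavaScript object. The applySet and applyInc helpers are hypothetical stand-ins that only mimic the operators; they are not MongoDB functions.

```javascript
// Hypothetical helpers that mimic $set and $inc on a plain object.
function applySet(doc, fields) {
  return { ...doc, ...fields };
}

function applyInc(doc, field, amount) {
  return { ...doc, [field]: (doc[field] || 0) + amount };
}

const original = { _id: 22, status: "inactive", count: 0 };

// Applying $set once or twice yields the same document.
const setOnce = applySet(original, { status: "active" });
const setTwice = applySet(setOnce, { status: "active" });
console.log(setOnce.status === setTwice.status); // true

// Applying $inc twice does not: the retry changes the outcome.
const incOnce = applyInc(original, "count", 1);
const incTwice = applyInc(incOnce, "count", 1);
console.log(incOnce.count, incTwice.count); // 1 2
```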

Why can’t we retrieve an operation’s state?

In MongoDB 3.4, state is tied to connection objects

MongoDB 3.6 introduces logical sessions
Sessions allow us to maintain cluster-wide state about the user and their operations.
Sessions are not tied to connections.

Retrying writes with a session
[Diagram: update]

We can trust the server to Do the Right Thing™

If the write already executed, return the result we missed.
If the write never executed, do it now and return its result.
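The two rules above can be sketched as a small in-memory model. It assumes the server remembers each write's result under a session ID and a per-session transaction number that the client repeats on the retry. SessionAwareServer and its fields are hypothetical names, and the real server's bookkeeping is more involved than this.

```javascript
// Hypothetical in-memory model of a server that remembers write results
// per session; this is not the MongoDB implementation.
class SessionAwareServer {
  constructor() {
    this.count = 0;
    this.results = new Map(); // "sessionId:txnNumber" -> saved result
  }

  inc(sessionId, txnNumber) {
    const key = `${sessionId}:${txnNumber}`;
    if (this.results.has(key)) {
      // The write already executed: return the result the client missed.
      return this.results.get(key);
    }
    // The write never executed: do it now and remember its result.
    this.count += 1;
    const result = { modifiedCount: 1, count: this.count };
    this.results.set(key, result);
    return result;
  }
}

const srv = new SessionAwareServer();
const first = srv.inc("session-a", 1); // imagine this reply is lost in transit
const retry = srv.inc("session-a", 1); // same identifiers on the retry
console.log(srv.count);       // 1: the increment ran only once
console.log(first === retry); // true: the saved result was returned
```

A new write in the same session would use the next transaction number, so it would execute normally instead of being treated as a retry.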

Sessions are cluster-wide
[Diagram: update]

Taking advantage of retryable writes
    mongodb://…?retryWrites=true

One down, two to go
Retryable writes
Zombie cursor cleanup
Cluster-wide killOp

Zombie Cursor Cleanup

You’re running a long query
    cursor = db.coll.find();
    cursor.forEach(function() {
      // lengthy processing…
    });

Cursors have a timeout
After 10 minutes, the server will close a cursor due to inactivity.
Issuing a getMore resets the clock.
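A minimal sketch of the idle clock described above, written as plain JavaScript with explicit timestamps instead of real timers. IdleCursor is a hypothetical model, not a driver or server class; only the 10-minute figure comes from the slides.

```javascript
const CURSOR_TIMEOUT_MS = 10 * 60 * 1000; // 10 minutes, as stated above

// Hypothetical model of a server-side cursor with an idle clock.
class IdleCursor {
  constructor(now) {
    this.lastUsed = now;
  }

  // A getMore resets the idle clock, provided the cursor is still alive.
  getMore(now) {
    if (this.isExpired(now)) throw new Error("cursor not found");
    this.lastUsed = now;
  }

  isExpired(now) {
    return now - this.lastUsed > CURSOR_TIMEOUT_MS;
  }
}

const minutes = (n) => n * 60 * 1000;
const cur = new IdleCursor(0);
cur.getMore(minutes(8));                 // within 10 minutes: clock resets
console.log(cur.isExpired(minutes(17))); // false: only 9 idle minutes
console.log(cur.isExpired(minutes(19))); // true: 11 idle minutes
```

If the processing between two getMore calls takes longer than the timeout, the next getMore fails because the cursor is already gone.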

Disabling cursor timeouts
    cursor = db.coll.find(
      { },
      { noCursorTimeout: true }
    );
    cursor.forEach(function() {
      // lengthy processing…
    });

Executing our long query
[Diagram: find, then getMore]

A zombie cursor is born
    > db.serverStatus()
    {
      "metrics": {
        "cursor": {
          "open": {
            "noTimeout": 1,
            "total": 1
            // …

What happened last night? (from the server’s POV)
[Diagram: find, then getMore]

Avoiding zombie cursors with logical sessions
Sessions also have a timeout.
We can associate queries with a session.

Querying with a session
    session = client.startSession();
    cursor = db.coll.find(
      { },
      { session: session }
    );

Executing our long query
[Diagram: find, then the session expires]

Did we just punt on the timeout issue?

Session timeouts are non-negotiable
Idle sessions will expire.
Any operation using the session resets the clock.
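The sketch below models how tying a cursor to a session could prevent zombies: every operation refreshes the session, and when an idle session is reaped its cursors go with it. ModelSession and its methods are hypothetical, and the 30-minute timeout is an assumed value for illustration since the slides do not state one.

```javascript
// Assumed value for illustration; the slides do not state the timeout.
const SESSION_TIMEOUT_MS = 30 * 60 * 1000;

// Hypothetical model: cursors belong to a session and die with it.
class ModelSession {
  constructor(now) {
    this.lastUsed = now;
    this.openCursors = new Set();
  }

  // Any operation using the session resets its idle clock.
  touch(now) {
    this.lastUsed = now;
  }

  openCursor(now, cursorId) {
    this.touch(now);
    this.openCursors.add(cursorId);
  }

  // Called periodically by the (modeled) server to reap idle sessions.
  reapIfIdle(now) {
    if (now - this.lastUsed > SESSION_TIMEOUT_MS) {
      this.openCursors.clear(); // the session's cursors are cleaned up too
      return true;
    }
    return false;
  }
}

const min = (n) => n * 60 * 1000;
const sess = new ModelSession(0);
sess.openCursor(min(1), "cursor-1");
console.log(sess.reapIfIdle(min(20)), sess.openCursors.size); // false 1
console.log(sess.reapIfIdle(min(40)), sess.openCursors.size); // true 0
```

In this model a client that crashes simply stops touching the session, so even a cursor opened with no timeout of its own is eventually released.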

Two down, one to go
Retryable writes
Zombie cursor cleanup
Cluster-wide killOp

Cluster-wide killOp

You’re running an operation that may never complete
    cursor = db.coll.find(
      { … } // table scans for days
    );

You’ve made a terrible mistake

Step 1: Find the operation ID
    > db.currentOp()
    {
      "inprog" : [
        {
          "desc" : "conn2",
          "threadId" : "140181791471360",
          "connectionId" : 2,
          "client" : "127.0.0.1:49456",
          "appName" : "MongoDB Shell",
          "active" : true,
          "opid" : 132921,
          // …

Step 2: Kill the operation ID
    > db.killOp(132921)
    { "info": "attempting to kill op", "ok": 1 }

Lather, rinse, repeat
    > connect("mongodb://shard-2.example.com")
    > db.currentOp()
    {
      "inprog" : [
        // …
      ]
    }
    > db.killOp(…)

How did this happen?
[Diagram: mongos, shard 1, shard 2, shard 3]

Cluster-wide killOp with logical sessions
Any operation may be associated with a session.
Terminating a session will end all of its associated operations.

Terminating a session
    session = client.startSession();
    cursor = db.coll.find(
      { … }, // table scans for days
      { session: session }
    );
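To contrast this with the per-shard currentOp and killOp routine, here is a small model in which one query fans out to three shards under different local operation IDs but a single session ID. ModelShard and killSession are hypothetical names for this sketch and do not correspond to shell or driver functions.

```javascript
// Hypothetical model of a sharded cluster; not a MongoDB API.
class ModelShard {
  constructor(name) {
    this.name = name;
    this.ops = []; // each op: { opid, sessionId }
  }

  killBySession(sessionId) {
    const before = this.ops.length;
    this.ops = this.ops.filter((op) => op.sessionId !== sessionId);
    return before - this.ops.length; // number of operations ended
  }
}

// Ends every operation tied to the session on every shard.
function killSession(shards, sessionId) {
  return shards.reduce((n, shard) => n + shard.killBySession(sessionId), 0);
}

const shards = ["shard 1", "shard 2", "shard 3"].map((n) => new ModelShard(n));
// One query fans out: each shard runs it under its own local opid.
shards.forEach((shard, i) => {
  shard.ops.push({ opid: 1000 + i, sessionId: "session-a" });
});
shards[0].ops.push({ opid: 2000, sessionId: "session-b" }); // unrelated work

console.log(killSession(shards, "session-a")); // 3
console.log(shards[0].ops.length);             // 1: session-b is untouched
```

Because the session ID is the same everywhere, there is no need to look up a different opid on each shard.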

Querying with sessions
[Diagram: mongos, shard 1, shard 2, shard 3]

That’s a wrap
Retryable writes
Zombie cursor cleanup
Cluster-wide killOp

One last point

Resilience is primarily the driver’s domain
Server discovery and monitoring
Elections and failover recovery
Load-balancing mongos connections
Routing queries by read preference

Addressing resilience on the server side
Tracking operation state
Cluster-wide sessions

Providing a relatively easy upgrade path
No need to rewrite applications
Opting in to retryable writes
New API for client session objects
Pass session option as needed

Inside the spec process
mongodb/specifications
/sessions
/retryable-writes
/causal-consistency
/retryable-reads

In the meantime…
How To Write Resilient MongoDB Applications

Thanks!
