It's 10pm: Do You Know Where Your Writes Are?

Presented October 12, 2017 at MongoDB.local San Francisco.

Jeremy Mikola

October 12, 2017

Transcript

  1. It’s 10PM: Do you know where your writes are? Jeremy Mikola (jmikola)

  2. It’s 11AM: Do you know where your writes are? Jeremy Mikola (jmikola)

  3. On the roadmap: Retryable writes, Zombie cursor cleanup, Cluster-wide killOp

  4. Retryable Writes

  5. Retryable Writes

  6. Retryable Writes

  7. You’re updating a document

  8. You’re updating a document

    db.coll.updateOne( { _id: 16 }, { $inc: { count: 1 } } );

  9. Murphy’s Law kicks in

  10. Is it safe to retry the update?

  11. Did our message never make it to the server?

  12. Did we lose the server’s reply?

  13. Did something else happen?

  14. Did something else happen?

  15. Did something else happen?

  16. Did something else happen?

  17. Did something else happen?

  18. There’s no way to retrieve an operation’s state

  19. Let’s review some best practices: How To Write Resilient MongoDB Applications

  20. This was our update…

    db.coll.updateOne( { _id: 16 }, { $inc: { count: 1 } } );

  21–30. Errors and retry strategies (slides 21 to 30 fill in the table one cell at a time)

    | Strategy     | Transient network error | Persistent outage | Command error |
    |--------------|-------------------------|-------------------|---------------|
    | Never retry  | May undercount          | OK                | OK            |
    | Always retry | May overcount           | Wastes time       | Wastes time   |
    | Retry once   | May overcount           | OK                | OK            |

  31. There’s no good solution for transient network errors
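
The "Retry once" overcount from the retry strategies above can be reproduced with a small simulation. This is only a sketch: `makeServer` and `incrementWithRetryOnce` are hypothetical names for an in-memory stand-in, not MongoDB or driver APIs, and the "lost reply" is modeled by throwing after the write has been applied.

```javascript
// Hypothetical in-memory stand-in for a server; not a MongoDB API.
function makeServer() {
  const doc = { _id: 16, count: 0 };
  return {
    doc,
    // Applies { $inc: { count: 1 } }. When dropReply is true the write is
    // applied but the reply is "lost", which the client sees as a network error.
    incrementCount(dropReply) {
      doc.count += 1;
      if (dropReply) {
        throw new Error("network error: reply lost");
      }
      return { matchedCount: 1, modifiedCount: 1 };
    },
  };
}

// "Retry once" strategy: on the first failure, blindly send the write again.
function incrementWithRetryOnce(server, dropFirstReply) {
  try {
    return server.incrementCount(dropFirstReply);
  } catch (err) {
    return server.incrementCount(false);
  }
}

const flaky = makeServer();
incrementWithRetryOnce(flaky, true);
console.log(flaky.doc.count); // 2: the counter was incremented twice

const healthy = makeServer();
incrementWithRetryOnce(healthy, false);
console.log(healthy.doc.count); // 1
```

The client cannot tell the lost-reply case apart from a lost request, which is why neither retrying nor giving up is always correct for a non-idempotent update.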

  32–35. We can safely retry idempotent operations (slides 32 to 35 fill in the table one cell at a time)

    | Strategy   | Transient network error | Persistent outage | Command error |
    |------------|-------------------------|-------------------|---------------|
    | Retry once | OK                      | OK                | OK            |

  36. Safe-to-retry inserts

    db.coll.insertOne( { _id: 18, name: "Alice" } );

  37. Safe-to-retry deletes

    db.coll.deleteOne( { _id: 20 } );
    db.coll.deleteMany( { status: "inactive" } );

  38. Safe-to-retry updates

    db.coll.updateOne( { _id: 22 }, { $set: { status: "active" } } );

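
What makes these operations safe to retry is that applying them twice leaves the document in the same state as applying them once. A rough way to check that property, assuming simplified stand-ins for `$set` and `$inc` on plain objects (`applySet`, `applyInc`, and `isIdempotent` are illustrative names, not server code):

```javascript
// Minimal, hypothetical versions of two update operators on plain objects.
function applySet(doc, fields) {
  return { ...doc, ...fields };
}

function applyInc(doc, fields) {
  const next = { ...doc };
  for (const [key, amount] of Object.entries(fields)) {
    next[key] = (next[key] || 0) + amount;
  }
  return next;
}

// An update is idempotent if applying it twice gives the same document
// as applying it once.
function isIdempotent(apply, doc, fields) {
  const once = apply(doc, fields);
  const twice = apply(once, fields);
  return JSON.stringify(once) === JSON.stringify(twice);
}

const original = { _id: 22, status: "inactive", count: 0 };
console.log(isIdempotent(applySet, original, { status: "active" })); // true
console.log(isIdempotent(applyInc, original, { count: 1 })); // false
```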
  39. Why can’t we retrieve an operation’s state?

  40. In MongoDB 3.4, state is tied to connection objects

  41. MongoDB 3.6 introduces logical sessions

  42. MongoDB 3.6 introduces logical sessions

    Sessions allow us to maintain cluster-wide state about the user and their operations.

  43. MongoDB 3.6 introduces logical sessions

    Sessions allow us to maintain cluster-wide state about the user and their operations. Sessions are not tied to connections.

  44. Retrying writes with a session

  45. Retrying writes with a session

  46. Retrying writes with a session

  47. Retrying writes with a session update

  48. Retrying writes with a session

  49. Retrying writes with a session update

  50. Retrying writes with a session

  51. We can trust the server to Do the Right Thing™

  52. We can trust the server to Do the Right Thing™

    If the write already executed, return the result we missed.
  53. We can trust the server to Do the Right Thing™

    If the write already executed, return the result we missed. If the write never executed, do it now and return its result.
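
The two rules above can be sketched as a server that remembers the result of each write under an identifier the client repeats on retry. This is a deliberately simplified model, not the server's implementation: `RetryAwareServer`, `sessionId`, and `txnNumber` are illustrative names, and the bookkeeping is assumed to be a plain in-memory map.

```javascript
// Hypothetical, simplified model of a server that remembers write results
// per session. All names here are illustrative.
class RetryAwareServer {
  constructor() {
    this.doc = { _id: 16, count: 0 };
    this.completed = new Map(); // "sessionId:txnNumber" -> saved result
  }

  incrementCount(sessionId, txnNumber) {
    const key = `${sessionId}:${txnNumber}`;
    if (this.completed.has(key)) {
      // The write already executed: return the result the client missed.
      return this.completed.get(key);
    }
    // The write never executed: do it now and remember its result.
    this.doc.count += 1;
    const result = { matchedCount: 1, modifiedCount: 1 };
    this.completed.set(key, result);
    return result;
  }
}

const server = new RetryAwareServer();
const first = server.incrementCount("session-A", 1); // assume this reply is lost
const retry = server.incrementCount("session-A", 1); // same identifiers on retry
console.log(server.doc.count); // 1
console.log(retry === first); // true
```

Because the lookup key is not tied to a connection, the retry can arrive on a different socket and still be recognized.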
  54. Sessions are cluster-wide

  55. Sessions are cluster-wide update

  56. Sessions are cluster-wide update

  57. Sessions are cluster-wide

  58. Sessions are cluster-wide

  59. Sessions are cluster-wide update

  60. Taking advantage of retryable writes: mongodb://…?retryWrites=true
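
Opting in is a matter of adding one option to the connection string. As a sketch, a small helper could append it when it is missing; `withRetryableWrites` is a hypothetical function (not part of any driver), and the host and database names are placeholders.

```javascript
// Hypothetical helper (not part of any driver): returns the URI with
// retryWrites=true appended unless the option is already present.
function withRetryableWrites(uri) {
  if (/[?&]retryWrites=/.test(uri)) {
    return uri;
  }
  const separator = uri.includes("?") ? "&" : "?";
  return uri + separator + "retryWrites=true";
}

// The host and database names below are placeholders.
console.log(withRetryableWrites("mongodb://localhost:27017/test"));
// mongodb://localhost:27017/test?retryWrites=true
console.log(withRetryableWrites("mongodb://localhost:27017/test?w=majority"));
// mongodb://localhost:27017/test?w=majority&retryWrites=true
```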

  61. One down, two to go: Retryable writes, Zombie cursor cleanup, Cluster-wide killOp

  62. Zombie Cursor Cleanup

  63. Zombie Cursor Cleanup

  64. You’re running a long query

  65. You’re running a long query

    cursor = db.coll.find();
    cursor.forEach(function() {
      // lengthy processing…
    });

  66. You’re running a long query

    cursor = db.coll.find();
    cursor.forEach(function() {
      // lengthy processing…
    });

  67. Cursors have a timeout

  68. Cursors have a timeout

    After 10 minutes, the server will close a cursor due to inactivity.

  69. Cursors have a timeout

    After 10 minutes, the server will close a cursor due to inactivity. Issuing a getMore resets the clock.

  70. Disabling cursor timeouts

    cursor = db.coll.find( { }, { noCursorTimeout: true } );
    cursor.forEach(function() {
      // lengthy processing…
    });

  71. Disabling cursor timeouts

    cursor = db.coll.find( { }, { noCursorTimeout: true } );
    cursor.forEach(function() {
      // lengthy processing…
    });

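
The idle clock and the effect of disabling it can be modeled in a few lines. This is a sketch with an injected clock rather than real time; `ServerCursor` is a hypothetical class, and only the 10-minute figure comes from the slides.

```javascript
const MINUTE_MS = 60 * 1000;
const CURSOR_IDLE_TIMEOUT_MS = 10 * MINUTE_MS; // 10 minutes, per the slides

// Hypothetical model of the server's view of one cursor.
class ServerCursor {
  constructor(openedAt, noCursorTimeout = false) {
    this.lastUsedAt = openedAt;
    this.noCursorTimeout = noCursorTimeout;
  }

  // A cursor expires once it has been idle longer than the timeout,
  // unless the timeout was disabled when it was opened.
  isExpired(now) {
    return !this.noCursorTimeout && now - this.lastUsedAt > CURSOR_IDLE_TIMEOUT_MS;
  }

  // Issuing a getMore resets the clock.
  getMore(now) {
    if (this.isExpired(now)) {
      throw new Error("cursor not found");
    }
    this.lastUsedAt = now;
  }
}

const cursor = new ServerCursor(0);
cursor.getMore(9 * MINUTE_MS); // within 10 minutes, so the clock resets
console.log(cursor.isExpired(18 * MINUTE_MS)); // false (9 idle minutes)
console.log(cursor.isExpired(20 * MINUTE_MS)); // true (11 idle minutes)

const zombie = new ServerCursor(0, true);
console.log(zombie.isExpired(24 * 60 * MINUTE_MS)); // false, even a day later
```

The last line is the zombie case: with the timeout disabled, nothing closes the cursor if the client disappears.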
  72. Executing our long query

  73. Executing our long query find

  74. Executing our long query

  75. Executing our long query getMore

  76. Executing our long query

  77. Executing our long query getMore

  78. Executing our long query

  79. Executing our long query getMore

  80. Executing our long query getMore

  81. Executing our long query getMore

  82. Executing our long query getMore

  83. Executing our long query getMore

  84. Executing our long query getMore

  85. Executing our long query getMore

  86. None
  87. None
  88. None
  89. None
  90. A zombie cursor is born

    > db.serverStatus()
    { "metrics": { "cursor": { "open": { "noTimeout": 1, "total": 1

  91. What happened last night?

  92. What happened last night? (from the server’s POV)

  93. What happened last night? (from the server’s POV)

  94. What happened last night? (from the server’s POV) find

  95. What happened last night? (from the server’s POV)

  96. What happened last night? (from the server’s POV) getMore

  97. What happened last night? (from the server’s POV)

  98. What happened last night? (from the server’s POV) getMore

  99. What happened last night? (from the server’s POV)

  100. What happened last night? (from the server’s POV)

  101. What happened last night? (from the server’s POV)

  102. Avoiding zombie cursors with logical sessions

  103. Avoiding zombie cursors with logical sessions

    Sessions also have a timeout.

  104. Avoiding zombie cursors with logical sessions

    Sessions also have a timeout. We can associate queries with a session.

  105. Querying with a session

    session = client.startSession();
    cursor = db.coll.find( { }, { session: session } );

  106. Executing our long query

  107. Executing our long query

  108. Executing our long query

  109. Executing our long query find

  110. Executing our long query

  111. Executing our long query getMore

  112. Executing our long query

  113. Executing our long query getMore

  114. Executing our long query

  115. Executing our long query

  116. Executing our long query session expires

  117. Executing our long query

  118. Did we just punt on the timeout issue?

  119. Session timeouts are non-negotiable

  120. Session timeouts are non-negotiable

    Idle sessions will expire.

  121. Session timeouts are non-negotiable

    Idle sessions will expire. Any operation using the session resets the clock.

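
One way to picture this: the server tracks the last use of each session, and when a session expires its cursors go with it. The sketch below is a simplified model, not the server's implementation. `SessionRegistry` and its methods are hypothetical names, and the 30-minute timeout is an assumed value chosen for illustration.

```javascript
const MIN_MS = 60 * 1000;
const SESSION_TIMEOUT_MS = 30 * MIN_MS; // assumed value for illustration

// Hypothetical registry of sessions and the cursors opened under them.
class SessionRegistry {
  constructor() {
    this.sessions = new Map(); // sessionId -> { lastUsedAt, cursors }
  }

  // Any operation using the session resets its clock.
  touch(sessionId, now) {
    let session = this.sessions.get(sessionId);
    if (!session) {
      session = { lastUsedAt: now, cursors: new Set() };
      this.sessions.set(sessionId, session);
    }
    session.lastUsedAt = now;
    return session;
  }

  openCursor(sessionId, cursorId, now) {
    this.touch(sessionId, now).cursors.add(cursorId);
  }

  getMore(sessionId, cursorId, now) {
    const session = this.sessions.get(sessionId);
    if (!session || !session.cursors.has(cursorId)) {
      throw new Error("cursor not found");
    }
    session.lastUsedAt = now;
  }

  // Periodic cleanup: expired sessions take their cursors with them.
  reap(now) {
    const killedCursors = [];
    for (const [sessionId, session] of this.sessions) {
      if (now - session.lastUsedAt > SESSION_TIMEOUT_MS) {
        killedCursors.push(...session.cursors);
        this.sessions.delete(sessionId);
      }
    }
    return killedCursors;
  }
}

const registry = new SessionRegistry();
registry.openCursor("session-A", 101, 0);
registry.getMore("session-A", 101, 20 * MIN_MS);
console.log(registry.reap(40 * MIN_MS)); // [] (idle for 20 minutes)
// The client goes away and never issues another operation.
console.log(registry.reap(51 * MIN_MS)); // [ 101 ] (idle for 31 minutes)
```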
  122. Two down, one to go: Retryable writes, Zombie cursor cleanup, Cluster-wide killOp

  123. Cluster-wide killOp

  124. Cluster-wide killOp

  125. You’re running an operation that may never complete

  126. You’re running an operation that may never complete

    cursor = db.coll.find(
      { … } // table scans for days
    );

  127. You’ve made a terrible mistake

  128. Step 1: Find the operation ID

    > db.currentOp()
    { "inprog" : [ { "desc" : "conn2", "threadId" : "140181791471360", "connectionId" : 2, "client" : "127.0.0.1:49456", "appName" : "MongoDB Shell", "active" : true, "opid" : 132921,

  129. Step 2: Kill the operation ID

    > db.killOp(132921)
    { "info": "attempting to kill op", "ok": 1 }

  130. Lather, rinse, repeat

  131. Lather, rinse, repeat

    > connect("mongodb://shard-2.example.com")

  132. Lather, rinse, repeat

    > connect("mongodb://shard-2.example.com")
    > db.currentOp()
    { "inprog" : [ // … ] }

  133. Lather, rinse, repeat

    > connect("mongodb://shard-2.example.com")
    > db.currentOp()
    { "inprog" : [ // … ] }
    > db.killOp(…)

  134. Lather, rinse, repeat

    > connect("mongodb://shard-2.example.com")
    > db.currentOp()
    { "inprog" : [ // … ] }
    > db.killOp(…)

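
The "find the operation ID" step, repeated on each shard, amounts to scanning the `inprog` array for the runaway operation. A minimal sketch of that filter over a plain object shaped like the `db.currentOp()` output is shown here; `findLongRunningOpIds` is a hypothetical helper, and the sample values (including `secs_running`) are made up for illustration.

```javascript
// Hypothetical helper: returns the opids of active operations that have
// been running for at least minSeconds.
function findLongRunningOpIds(currentOpResult, minSeconds) {
  return currentOpResult.inprog
    .filter((op) => op.active && op.secs_running >= minSeconds)
    .map((op) => op.opid);
}

// Sample data shaped like db.currentOp() output; the values are invented.
const sample = {
  inprog: [
    { opid: 132921, active: true, appName: "MongoDB Shell", secs_running: 86400 },
    { opid: 132950, active: true, appName: "MongoDB Shell", secs_running: 2 },
    { opid: 132951, active: false, appName: "MongoDB Shell" },
  ],
};

console.log(findLongRunningOpIds(sample, 3600)); // [ 132921 ]
```

Each returned opid is only meaningful on the server that reported it, which is why the manual procedure has to be repeated per shard.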
  135. How did this happen? mongos shard 1 shard 2 shard 3

  136. How did this happen? mongos shard 1 shard 2 shard 3

  137. How did this happen? mongos shard 1 shard 2 shard 3

  138. Cluster-wide killOp with logical sessions

  139. Cluster-wide killOp with logical sessions

    Any operation may be associated with a session.

  140. Cluster-wide killOp with logical sessions

    Any operation may be associated with a session. Terminating a session will end all of its associated operations.

  141. Terminating a session

    session = client.startSession();
    cursor = db.coll.find(
      { … }, // table scans for days
      { session: session }
    );

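
Why a session helps here: a query routed through mongos becomes a separate operation on every shard, each with its own operation ID, but all of them carry the same session. The toy model below illustrates that fan-out. `Shard` and `Router` are hypothetical classes standing in for the shards and mongos, not real server components.

```javascript
// Hypothetical stand-in for a shard that tracks its in-progress operations.
class Shard {
  constructor(name) {
    this.name = name;
    this.nextOpId = 1;
    this.ops = [];
  }

  startOp(sessionId) {
    const opid = this.nextOpId++;
    this.ops.push({ opid, sessionId });
    return opid;
  }

  // Ends every operation associated with the session; returns how many.
  killSession(sessionId) {
    const before = this.ops.length;
    this.ops = this.ops.filter((op) => op.sessionId !== sessionId);
    return before - this.ops.length;
  }
}

// Hypothetical stand-in for mongos.
class Router {
  constructor(shards) {
    this.shards = shards;
  }

  // A scatter-gather query starts one operation per shard, all tagged
  // with the same session.
  find(sessionId) {
    return this.shards.map((shard) => shard.startOp(sessionId));
  }

  // Terminating the session is forwarded to every shard.
  killSession(sessionId) {
    return this.shards.reduce((total, shard) => total + shard.killSession(sessionId), 0);
  }
}

const shards = [new Shard("shard 1"), new Shard("shard 2"), new Shard("shard 3")];
const mongos = new Router(shards);
mongos.find("session-A"); // the runaway query
shards[1].startOp("session-B"); // unrelated work on shard 2

console.log(mongos.killSession("session-A")); // 3
console.log(shards.map((shard) => shard.ops.length)); // [ 0, 1, 0 ]
```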
  142. Querying with sessions mongos shard 1 shard 2 shard 3

  143. Querying with sessions mongos shard 1 shard 2 shard 3

  144. Querying with sessions mongos shard 1 shard 2 shard 3

  145. Querying with sessions mongos shard 1 shard 2 shard 3

  146. Querying with sessions mongos shard 1 shard 2 shard 3

  147. Querying with sessions mongos shard 1 shard 2 shard 3

  148. Querying with sessions mongos shard 1 shard 2 shard 3

  149. Querying with sessions mongos shard 1 shard 2 shard 3

  150. Querying with sessions mongos shard 1 shard 2 shard 3

  151. Querying with sessions mongos shard 1 shard 2 shard 3

  152. That’s a wrap: Retryable writes, Zombie cursor cleanup, Cluster-wide killOp

  153. One last point

  154. Resilience is primarily the driver’s domain

  155. Resilience is primarily the driver’s domain

    Server discovery and monitoring

  156. Resilience is primarily the driver’s domain

    Server discovery and monitoring; Elections and failover recovery

  157. Resilience is primarily the driver’s domain

    Server discovery and monitoring; Elections and failover recovery; Load-balancing mongos connections

  158. Resilience is primarily the driver’s domain

    Server discovery and monitoring; Elections and failover recovery; Load-balancing mongos connections; Routing queries by read preference

  159. Addressing resilience on the server-side

  160. Addressing resilience on the server-side

    Tracking operation state

  161. Addressing resilience on the server-side

    Tracking operation state; Cluster-wide sessions

  162. Providing a relatively easy upgrade path

  163. Providing a relatively easy upgrade path

    No need to rewrite applications

  164. Providing a relatively easy upgrade path

    No need to rewrite applications; Opting in to retryable writes

  165. Providing a relatively easy upgrade path

    No need to rewrite applications; Opting in to retryable writes; New API for client session objects

  166. Providing a relatively easy upgrade path

    No need to rewrite applications; Opting in to retryable writes; New API for client session objects; Pass session option as needed

  167. Inside the spec process: mongodb/specifications

  168. Inside the spec process: mongodb/specifications /sessions

  169. Inside the spec process: mongodb/specifications /sessions /retryable-writes

  170. Inside the spec process: mongodb/specifications /sessions /retryable-writes /causal-consistency

  171. Inside the spec process: mongodb/specifications /sessions /retryable-writes /causal-consistency /retryable-reads

  172. In the meantime… How To Write Resilient MongoDB Applications

  173. Thanks!