It's 10pm: Do You Know Where Your Writes Are?

Presented October 12, 2017 at MongoDB.local San Francisco.

Jeremy Mikola

October 12, 2017

Transcript

  1. It’s 10PM: Do you know
    where your writes are?
    Jeremy Mikola
    jmikola

    View Slide

  2. It’s 11AM: Do you know
    where your writes are?
    Jeremy Mikola
    jmikola

    View Slide

  3. On the roadmap
    Retryable writes
    Zombie cursor cleanup
    Cluster-wide killOp

    View Slide

  4. Retryable Writes

    View Slide

  5. Retryable Writes

    View Slide

  6. Retryable Writes

    View Slide

  7. You’re updating a document

    View Slide

  8. You’re updating a document
    db.coll.updateOne(
    { _id: 16 },
    { $inc: { count: 1 }}
    );

    View Slide

  9. Murphy’s Law kicks in

    View Slide

  10. Is it safe to retry the update?

    View Slide

  11. Did our message never
    make it to the server?

    View Slide

  12. Did we lose the server’s reply?

    View Slide

  13. Did something else happen?

    View Slide

  14. Did something else happen?

    View Slide

  15. Did something else happen?

    View Slide

  16. Did something else happen?

    View Slide

  17. Did something else happen?

    View Slide

  18. There’s no way to retrieve
    an operation’s state

    View Slide

  19. Let’s review some
    best practices
    How To Write Resilient
    MongoDB Applications

    View Slide

  20. This was our update…
    db.coll.updateOne(
    { _id: 16 },
    { $inc: { count: 1 }}
    );

    View Slide

  21. Errors and retry strategies
                  | Transient network error | Persistent outage | Command error
    Never retry   |                         |                   |
    Always retry  |                         |                   |
    Retry once    |                         |                   |

    View Slide

  22. Errors and retry strategies
                  | Transient network error | Persistent outage | Command error
    Never retry   | May undercount          |                   |
    Always retry  |                         |                   |
    Retry once    |                         |                   |

    View Slide

  23. Errors and retry strategies
                  | Transient network error | Persistent outage | Command error
    Never retry   | May undercount          | OK                |
    Always retry  |                         |                   |
    Retry once    |                         |                   |

    View Slide

  24. Errors and retry strategies
                  | Transient network error | Persistent outage | Command error
    Never retry   | May undercount          | OK                | OK
    Always retry  |                         |                   |
    Retry once    |                         |                   |

    View Slide

  25. Errors and retry strategies
                  | Transient network error | Persistent outage | Command error
    Never retry   | May undercount          | OK                | OK
    Always retry  | May overcount           |                   |
    Retry once    |                         |                   |

    View Slide

  26. Errors and retry strategies
                  | Transient network error | Persistent outage | Command error
    Never retry   | May undercount          | OK                | OK
    Always retry  | May overcount           | Wastes time       |
    Retry once    |                         |                   |

    View Slide

  27. Errors and retry strategies
                  | Transient network error | Persistent outage | Command error
    Never retry   | May undercount          | OK                | OK
    Always retry  | May overcount           | Wastes time       | Wastes time
    Retry once    |                         |                   |

    View Slide

  28. Errors and retry strategies
                  | Transient network error | Persistent outage | Command error
    Never retry   | May undercount          | OK                | OK
    Always retry  | May overcount           | Wastes time       | Wastes time
    Retry once    | May overcount           |                   |

    View Slide

  29. Errors and retry strategies
                  | Transient network error | Persistent outage | Command error
    Never retry   | May undercount          | OK                | OK
    Always retry  | May overcount           | Wastes time       | Wastes time
    Retry once    | May overcount           | OK                |

    View Slide

  30. Errors and retry strategies
                  | Transient network error | Persistent outage | Command error
    Never retry   | May undercount          | OK                | OK
    Always retry  | May overcount           | Wastes time       | Wastes time
    Retry once    | May overcount           | OK                | OK

    View Slide

  31. There’s no good solution for
    transient network errors

    View Slide
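
To make the "may overcount" cell concrete, here is a minimal sketch of the "retry once" strategy applied to the `$inc` update from the slides. It is a toy in-memory model, not driver code: `makeServer`, `incrementCount`, and `incrementWithRetryOnce` are hypothetical names, and the "network error" is simulated by throwing after the write has already been applied.

```javascript
// Toy in-memory "server" holding the single counter document from the slides.
function makeServer() {
  const doc = { _id: 16, count: 0 };
  return {
    doc,
    // Applies { $inc: { count: 1 } }. When loseReply is true, the write is
    // applied but the reply never reaches the client (simulated by a throw).
    incrementCount(loseReply) {
      doc.count += 1;
      if (loseReply) {
        throw new Error("network error: reply lost");
      }
      return { matchedCount: 1, modifiedCount: 1 };
    },
  };
}

// "Retry once": on any error, blindly send the same update one more time.
function incrementWithRetryOnce(server, loseFirstReply) {
  try {
    return server.incrementCount(loseFirstReply);
  } catch (err) {
    return server.incrementCount(false);
  }
}

const server = makeServer();
incrementWithRetryOnce(server, true);
console.log(server.doc.count); // 2: one logical increment was applied twice
```

The client cannot tell a lost request from a lost reply, so the same retry that repairs one case double-counts in the other.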

  32. We can safely retry
    idempotent operations
                  | Transient network error | Persistent outage | Command error
    Retry once    |                         |                   |

    View Slide

  33. We can safely retry
    idempotent operations
                  | Transient network error | Persistent outage | Command error
    Retry once    | OK                      |                   |

    View Slide

  34. We can safely retry
    idempotent operations
                  | Transient network error | Persistent outage | Command error
    Retry once    | OK                      | OK                |

    View Slide

  35. We can safely retry
    idempotent operations
                  | Transient network error | Persistent outage | Command error
    Retry once    | OK                      | OK                | OK

    View Slide

  36. Safe-to-retry inserts
    db.coll.insertOne(
    { _id: 18, name: "Alice" }
    );

    View Slide

  37. Safe-to-retry deletes
    db.coll.deleteOne(
    { _id: 20 }
    );
    db.coll.deleteMany(
    { status: "inactive" }
    );

    View Slide

  38. Safe-to-retry updates
    db.coll.updateOne(
    { _id: 22 },
    { $set: { status: "active" }}
    );

    View Slide
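
One way to see why the `$set` update is safe to retry while the earlier `$inc` is not: apply each operation twice and compare against applying it once. The sketch below models both operators on plain objects; `applySet`, `applyInc`, and `isIdempotent` are hypothetical helpers for illustration, not MongoDB APIs.

```javascript
// Minimal stand-in for $set: overwrite the given fields.
function applySet(doc, fields) {
  return { ...doc, ...fields };
}

// Minimal stand-in for $inc: add each delta to the existing value.
function applyInc(doc, fields) {
  const next = { ...doc };
  for (const [key, delta] of Object.entries(fields)) {
    next[key] = (next[key] || 0) + delta;
  }
  return next;
}

// An operation is idempotent if applying it twice equals applying it once.
function isIdempotent(apply, doc, fields) {
  const once = apply(doc, fields);
  const twice = apply(once, fields);
  return JSON.stringify(once) === JSON.stringify(twice);
}

const original = { _id: 22, status: "inactive", count: 0 };
const setIsSafe = isIdempotent(applySet, original, { status: "active" });
const incIsSafe = isIdempotent(applyInc, original, { count: 1 });
console.log(setIsSafe, incIsSafe); // true false
```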

  39. Why can’t we retrieve
    an operation’s state?

    View Slide

  40. In MongoDB 3.4, state is
    tied to connection objects

    View Slide

  41. MongoDB 3.6 introduces
    logical sessions

    View Slide

  42. MongoDB 3.6 introduces
    logical sessions
    Sessions allow us to maintain cluster-wide
    state about the user and their operations.

    View Slide

  43. MongoDB 3.6 introduces
    logical sessions
    Sessions allow us to maintain cluster-wide
    state about the user and their operations.
    Sessions are not tied to connections.

    View Slide

  44. Retrying writes with a session

    View Slide

  45. Retrying writes with a session

    View Slide

  46. Retrying writes with a session

    View Slide

  47. Retrying writes with a session
    update

    View Slide

  48. Retrying writes with a session

    View Slide

  49. Retrying writes with a session
    update

    View Slide

  50. Retrying writes with a session

    View Slide

  51. We can trust the server to
    Do the Right Thing™

    View Slide

  52. We can trust the server to
    Do the Right Thing™
    If the write already executed,
    return the result we missed.

    View Slide

  53. We can trust the server to
    Do the Right Thing™
    If the write already executed,
    return the result we missed.
    If the write never executed,
    do it now and return its result.

    View Slide
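
The two rules above can be sketched as a toy server that remembers the result of each write under an identifier the client reuses on retry. This is only an illustration of the idea: the `ToyServer` class and its "session id plus transaction number" key are assumptions made for the example, and the real server-side bookkeeping is more involved.

```javascript
// Toy server that remembers write results per (session id, transaction number).
class ToyServer {
  constructor() {
    this.doc = { _id: 16, count: 0 };
    this.history = new Map(); // "sessionId:txnNumber" -> stored result
  }

  // Models { $inc: { count: 1 } } sent within a session.
  increment(sessionId, txnNumber) {
    const key = `${sessionId}:${txnNumber}`;
    if (this.history.has(key)) {
      // The write already executed: return the result the client missed.
      return this.history.get(key);
    }
    // The write never executed: do it now and remember its result.
    this.doc.count += 1;
    const result = { matchedCount: 1, modifiedCount: 1 };
    this.history.set(key, result);
    return result;
  }
}

const toyServer = new ToyServer();
const firstResult = toyServer.increment("session-A", 1);   // pretend this reply was lost
const retriedResult = toyServer.increment("session-A", 1); // retry with the same identifiers
console.log(toyServer.doc.count, retriedResult === firstResult); // 1 true
```

Because the retry carries the same identifiers, the non-idempotent `$inc` is applied exactly once no matter which half of the round trip failed.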

  54. Sessions are cluster-wide

    View Slide

  55. Sessions are cluster-wide
    update

    View Slide

  56. Sessions are cluster-wide
    update

    View Slide

  57. Sessions are cluster-wide

    View Slide

  58. Sessions are cluster-wide

    View Slide

  59. Sessions are cluster-wide
    update

    View Slide

  60. Taking advantage of
    retryable writes
    ?retryWrites=true
    mongodb://…

    View Slide
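
As a small illustration of opting in, the helper below appends `retryWrites=true` to a connection string if it is not already present. `withRetryWrites` is a hypothetical helper, and the host `db.example.com` is a placeholder; the sketch also assumes the URI already contains the `/` that separates hosts from options.

```javascript
// Adds retryWrites=true to a MongoDB connection string unless already set.
function withRetryWrites(uri) {
  if (/[?&]retryWrites=/.test(uri)) {
    return uri; // leave an explicit setting untouched
  }
  const separator = uri.includes("?") ? "&" : "?";
  return uri + separator + "retryWrites=true";
}

const plainUri = withRetryWrites("mongodb://db.example.com:27017/test");
const withOptions = withRetryWrites("mongodb://db.example.com:27017/test?w=majority");
console.log(plainUri);
console.log(withOptions);
```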

  61. One down, two to go
    Retryable writes
    Zombie cursor cleanup
    Cluster-wide killOp

    View Slide

  62. Zombie Cursor Cleanup

    View Slide

  63. Zombie Cursor Cleanup

    View Slide

  64. You’re running a long query

    View Slide

  65. You’re running a long query
    cursor = db.coll.find();
    cursor.forEach(function() {
    // lengthy processing…
    });

    View Slide

  66. You’re running a long query
    cursor = db.coll.find();
    cursor.forEach(function() {
    // lengthy processing…
    });

    View Slide

  67. Cursors have a timeout

    View Slide

  68. Cursors have a timeout
    After 10 minutes, the server will
    close a cursor due to inactivity.

    View Slide

  69. Cursors have a timeout
    After 10 minutes, the server will
    close a cursor due to inactivity.
    Issuing a getMore
    resets the clock.

    View Slide
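
The idle-timeout rule can be modeled in a few lines. The sketch below is a toy registry, not server code: `ToyCursorRegistry` and its methods are hypothetical, and timestamps are passed in explicitly so the behavior can be followed without waiting.

```javascript
const TEN_MINUTES_MS = 10 * 60 * 1000;

// Toy model of server-side cursor bookkeeping with an idle timeout.
class ToyCursorRegistry {
  constructor(timeoutMs = TEN_MINUTES_MS) {
    this.timeoutMs = timeoutMs;
    this.lastActivity = new Map(); // cursorId -> time of last find/getMore
  }

  // A find opens the cursor and starts the clock.
  open(cursorId, now) {
    this.lastActivity.set(cursorId, now);
  }

  // A getMore resets the clock for that cursor.
  getMore(cursorId, now) {
    if (!this.lastActivity.has(cursorId)) {
      throw new Error("cursor not found");
    }
    this.lastActivity.set(cursorId, now);
  }

  // Periodic cleanup: close cursors idle for at least the timeout.
  reap(now) {
    for (const [cursorId, last] of this.lastActivity) {
      if (now - last >= this.timeoutMs) {
        this.lastActivity.delete(cursorId);
      }
    }
  }

  isOpen(cursorId) {
    return this.lastActivity.has(cursorId);
  }
}

const minutes = (n) => n * 60 * 1000;
const registry = new ToyCursorRegistry();
registry.open("c1", minutes(0));
registry.getMore("c1", minutes(9));  // activity at minute 9 resets the clock
registry.reap(minutes(15));          // only 6 minutes idle: cursor survives
const openAfter15 = registry.isOpen("c1");
registry.reap(minutes(20));          // 11 minutes idle: cursor is closed
const openAfter20 = registry.isOpen("c1");
console.log(openAfter15, openAfter20); // true false
```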

  70. Disabling cursor timeouts
    cursor = db.coll.find(
    { },
    { noCursorTimeout: true }
    );
    cursor.forEach(function() {
    // lengthy processing…
    });

    View Slide

  71. Disabling cursor timeouts
    cursor = db.coll.find(
    { },
    { noCursorTimeout: true }
    );
    cursor.forEach(function() {
    // lengthy processing…
    });

    View Slide

  72. Executing our long query

    View Slide

  73. Executing our long query
    find

    View Slide

  74. Executing our long query

    View Slide

  75. Executing our long query
    getMore

    View Slide

  76. Executing our long query

    View Slide

  77. Executing our long query
    getMore

    View Slide

  78. Executing our long query

    View Slide

  79. Executing our long query
    getMore

    View Slide

  80. Executing our long query
    getMore

    View Slide

  81. Executing our long query
    getMore

    View Slide

  82. Executing our long query
    getMore

    View Slide

  83. Executing our long query
    getMore

    View Slide

  84. Executing our long query
    getMore

    View Slide

  85. Executing our long query
    getMore

    View Slide

  86. View Slide

  87. View Slide

  88. View Slide

  89. View Slide

  90. A zombie cursor is born
    > db.serverStatus()
    {
    "metrics": {
    "cursor": {
    "open": {
    "noTimeout": 1,
    "total": 1

    View Slide

  91. What happened last night?

    View Slide

  92. What happened last night?
    (from the server’s POV)

    View Slide

  93. What happened last night?
    (from the server’s POV)

    View Slide

  94. What happened last night?
    (from the server’s POV)
    find

    View Slide

  95. What happened last night?
    (from the server’s POV)

    View Slide

  96. What happened last night?
    (from the server’s POV)
    getMore

    View Slide

  97. What happened last night?
    (from the server’s POV)

    View Slide

  98. What happened last night?
    (from the server’s POV)
    getMore

    View Slide

  99. What happened last night?
    (from the server’s POV)

    View Slide

  100. What happened last night?
    (from the server’s POV)

    View Slide

  101. What happened last night?
    (from the server’s POV)

    View Slide

  102. Avoiding zombie cursors
    with logical sessions

    View Slide

  103. Avoiding zombie cursors
    with logical sessions
    Sessions also have a timeout.

    View Slide

  104. Avoiding zombie cursors
    with logical sessions
    Sessions also have a timeout.
    We can associate
    queries with a session

    View Slide

  105. Querying with a session
    session = client.startSession();
    cursor = db.coll.find(
    { },
    { session: session }
    );

    View Slide

  106. Executing our long query

    View Slide

  107. Executing our long query

    View Slide

  108. Executing our long query

    View Slide

  109. Executing our long query
    find

    View Slide

  110. Executing our long query

    View Slide

  111. Executing our long query
    getMore

    View Slide

  112. Executing our long query

    View Slide

  113. Executing our long query
    getMore

    View Slide

  114. Executing our long query

    View Slide

  115. Executing our long query

    View Slide

  116. Executing our long query
    session expires

    View Slide

  117. Executing our long query

    View Slide

  118. Did we just punt on
    the timeout issue?

    View Slide

  119. Session timeouts are
    non-negotiable

    View Slide

  120. Session timeouts are
    non-negotiable
    Idle sessions will expire.

    View Slide

  121. Session timeouts are
    non-negotiable
    Idle sessions will expire.
    Any operation using the
    session resets the clock.

    View Slide
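
A toy model of why tying cursors to a session avoids zombies: when the session goes idle and expires, every cursor it owns goes with it, regardless of any per-cursor timeout setting. `ToySessionServer` and its methods are hypothetical names, and the 30-minute timeout used in the example is an assumption chosen for illustration.

```javascript
// Toy model: cursors live inside sessions, and sessions expire when idle.
class ToySessionServer {
  constructor(sessionTimeoutMs) {
    this.sessionTimeoutMs = sessionTimeoutMs;
    this.sessions = new Map(); // sessionId -> { lastUse, cursors: Set }
  }

  // A find creates the session record if needed and registers the cursor.
  find(sessionId, cursorId, now) {
    let session = this.sessions.get(sessionId);
    if (!session) {
      session = { lastUse: now, cursors: new Set() };
      this.sessions.set(sessionId, session);
    }
    session.lastUse = now;
    session.cursors.add(cursorId);
  }

  // Any operation using the session resets its clock.
  getMore(sessionId, cursorId, now) {
    const session = this.sessions.get(sessionId);
    if (!session || !session.cursors.has(cursorId)) {
      throw new Error("cursor not found");
    }
    session.lastUse = now;
  }

  // Expired sessions are dropped together with all of their cursors.
  reap(now) {
    for (const [sessionId, session] of this.sessions) {
      if (now - session.lastUse >= this.sessionTimeoutMs) {
        this.sessions.delete(sessionId);
      }
    }
  }

  openCursorCount() {
    let count = 0;
    for (const session of this.sessions.values()) {
      count += session.cursors.size;
    }
    return count;
  }
}

const mins = (n) => n * 60 * 1000;
const sessionServer = new ToySessionServer(mins(30));
sessionServer.find("session-A", "c1", mins(0));
sessionServer.getMore("session-A", "c1", mins(10)); // last sign of life from the client
sessionServer.reap(mins(35));                        // 25 minutes idle: still alive
const cursorsAt35 = sessionServer.openCursorCount();
sessionServer.reap(mins(45));                        // 35 minutes idle: session expires
const cursorsAt45 = sessionServer.openCursorCount();
console.log(cursorsAt35, cursorsAt45); // 1 0
```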

  122. Two down, one to go
    Retryable writes
    Zombie cursor cleanup
    Cluster-wide killOp

    View Slide

  123. Cluster-wide killOp

    View Slide

  124. Cluster-wide killOp

    View Slide

  125. You’re running an operation
    that may never complete

    View Slide

  126. You’re running an operation
    that may never complete
    cursor = db.coll.find(
    { … } // table scans for days
    );

    View Slide

  127. You’ve made a
    terrible mistake

    View Slide

  128. Step 1: Find the operation ID
    > db.currentOp()
    {
    "inprog" : [
    {
    "desc" : "conn2",
    "threadId" : "140181791471360",
    "connectionId" : 2,
    "client" : "127.0.0.1:49456",
    "appName" : "MongoDB Shell",
    "active" : true,
    "opid" : 132921,

    View Slide

  129. Step 2: Kill the operation ID
    > db.killOp(132921)
    {
    "info": "attempting to kill op",
    "ok": 1
    }

    View Slide

  130. Lather, rinse, repeat

    View Slide

  131. Lather, rinse, repeat
    > connect("mongodb://shard-2.example.com")

    View Slide

  132. Lather, rinse, repeat
    > connect("mongodb://shard-2.example.com")
    > db.currentOp()
    {
    "inprog" : [
    // …
    ]
    }

    View Slide

  133. Lather, rinse, repeat
    > connect("mongodb://shard-2.example.com")
    > db.currentOp()
    {
    "inprog" : [
    // …
    ]
    }
    > db.killOp(…)

    View Slide

  134. Lather, rinse, repeat
    > connect("mongodb://shard-2.example.com")
    > db.currentOp()
    {
    "inprog" : [
    // …
    ]
    }
    > db.killOp(…)

    View Slide
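
Step 1 amounts to scanning the `inprog` array by eye on each shard. A small filter like the one below can pick out candidate operation IDs from a `currentOp`-shaped result. `findLongRunningOpIds` is a hypothetical helper, and the sample document is made up for the example (only the first `opid` comes from the slides); it relies on the `active`, `secs_running`, and `opid` fields.

```javascript
// Returns the opids of active operations running for at least minSeconds.
function findLongRunningOpIds(currentOpResult, minSeconds) {
  return currentOpResult.inprog
    .filter((op) => op.active && (op.secs_running || 0) >= minSeconds)
    .map((op) => op.opid);
}

// Hypothetical currentOp-style output for illustration only.
const sampleCurrentOp = {
  inprog: [
    { opid: 132921, active: true, secs_running: 7200, op: "query" },
    { opid: 132950, active: true, secs_running: 1, op: "update" },
    { opid: 132960, active: false },
  ],
};

const longRunning = findLongRunningOpIds(sampleCurrentOp, 60);
console.log(longRunning); // [ 132921 ]
```

Each returned ID would still need its own `db.killOp(…)` call, and the whole exercise repeats per shard, which is the tedium the slides are pointing at.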

  135. How did this happen?
    mongos
    shard 1
    shard 2
    shard 3

    View Slide

  136. How did this happen?
    mongos
    shard 1
    shard 2
    shard 3

    View Slide

  137. How did this happen?
    mongos
    shard 1
    shard 2
    shard 3

    View Slide

  138. Cluster-wide killOp
    with logical sessions

    View Slide

  139. Cluster-wide killOp
    with logical sessions
    Any operation may be
    associated with a session.

    View Slide

  140. Cluster-wide killOp
    with logical sessions
    Any operation may be
    associated with a session.
    Terminating a session will end
    all of its associated operations.

    View Slide

  141. Terminating a session
    session = client.startSession();
    cursor = db.coll.find(
    { … }, // table scans for days
    { session: session }
    );

    View Slide
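
The following toy model sketches why a session makes the cleanup a single step: mongos fans the query out to every shard, each shard tags its piece of the work with the session, and ending the session removes all of those pieces at once. `ToyShard` and `ToyMongos` are hypothetical classes for illustration; in a real deployment this role is played by server commands such as `killSessions`.

```javascript
// Toy shard: tracks in-progress operations tagged with a session id.
class ToyShard {
  constructor(name) {
    this.name = name;
    this.ops = []; // { opid, sessionId }
  }

  start(opid, sessionId) {
    this.ops.push({ opid, sessionId });
  }

  // Ends every operation belonging to the session; returns how many ended.
  killSession(sessionId) {
    const before = this.ops.length;
    this.ops = this.ops.filter((op) => op.sessionId !== sessionId);
    return before - this.ops.length;
  }
}

// Toy router: scatters a query to all shards and forwards session kills.
class ToyMongos {
  constructor(shards) {
    this.shards = shards;
    this.nextOpId = 1;
  }

  find(sessionId) {
    for (const shard of this.shards) {
      shard.start(this.nextOpId++, sessionId);
    }
  }

  killSession(sessionId) {
    return this.shards.reduce(
      (total, shard) => total + shard.killSession(sessionId),
      0
    );
  }
}

const shards = [new ToyShard("shard 1"), new ToyShard("shard 2"), new ToyShard("shard 3")];
const mongos = new ToyMongos(shards);
mongos.find("session-A"); // the runaway table scan
mongos.find("session-B"); // an unrelated, healthy query
const killed = mongos.killSession("session-A");
const remaining = shards.map((shard) => shard.ops.length);
console.log(killed, remaining); // 3 [ 1, 1, 1 ]
```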

  142. Querying with sessions
    mongos
    shard 1
    shard 2
    shard 3

    View Slide

  143. Querying with sessions
    mongos
    shard 1
    shard 2
    shard 3

    View Slide

  144. Querying with sessions
    mongos
    shard 1
    shard 2
    shard 3

    View Slide

  145. Querying with sessions
    mongos
    shard 1
    shard 2
    shard 3

    View Slide

  146. Querying with sessions
    mongos
    shard 1
    shard 2
    shard 3

    View Slide

  147. Querying with sessions
    mongos
    shard 1
    shard 2
    shard 3

    View Slide

  148. Querying with sessions
    mongos
    shard 1
    shard 2
    shard 3

    View Slide

  149. Querying with sessions
    mongos
    shard 1
    shard 2
    shard 3

    View Slide

  150. Querying with sessions
    mongos
    shard 1
    shard 2
    shard 3

    View Slide

  151. Querying with sessions
    mongos
    shard 1
    shard 2
    shard 3

    View Slide

  152. That’s a wrap
    Retryable writes
    Zombie cursor cleanup
    Cluster-wide killOp

    View Slide

  153. One last point

    View Slide

  154. Resilience is primarily
    the driver’s domain

    View Slide

  155. Resilience is primarily
    the driver’s domain
    Server discovery and monitoring

    View Slide

  156. Resilience is primarily
    the driver’s domain
    Server discovery and monitoring
    Elections and failover recovery

    View Slide

  157. Resilience is primarily
    the driver’s domain
    Server discovery and monitoring
    Elections and failover recovery
    Load-balancing mongos connections

    View Slide

  158. Resilience is primarily
    the driver’s domain
    Server discovery and monitoring
    Elections and failover recovery
    Load-balancing mongos connections
    Routing queries by read preference

    View Slide

  159. Addressing resilience
    on the server-side

    View Slide

  160. Addressing resilience
    on the server-side
    Tracking operation state

    View Slide

  161. Addressing resilience
    on the server-side
    Tracking operation state
    Cluster-wide sessions

    View Slide

  162. Providing a relatively
    easy upgrade path

    View Slide

  163. Providing a relatively
    easy upgrade path
    No need to rewrite applications

    View Slide

  164. Providing a relatively
    easy upgrade path
    No need to rewrite applications
    Opting in to retryable writes

    View Slide

  165. Providing a relatively
    easy upgrade path
    No need to rewrite applications
    Opting in to retryable writes
    New API for client session objects

    View Slide

  166. Providing a relatively
    easy upgrade path
    No need to rewrite applications
    Opting in to retryable writes
    New API for client session objects
    Pass session option as needed

    View Slide

  167. Inside the spec process
    mongodb/specifications

    View Slide

  168. Inside the spec process
    /sessions
    mongodb/specifications

    View Slide

  169. Inside the spec process
    /sessions
    /retryable-writes
    mongodb/specifications

    View Slide

  170. Inside the spec process
    /sessions
    /retryable-writes
    /causal-consistency
    mongodb/specifications

    View Slide

  171. Inside the spec process
    /sessions
    /retryable-writes
    /causal-consistency
    /retryable-reads
    mongodb/specifications

    View Slide

  172. In the meantime…
    How To Write Resilient
    MongoDB Applications

    View Slide

  173. Thanks!

    View Slide