It's 10pm: Do You Know Where Your Writes Are?

Presented October 12, 2017 at MongoDB.local San Francisco.

Jeremy Mikola

October 12, 2017

Transcript

  1. It’s 10PM: Do you know
    where your writes are?
    Jeremy Mikola
    jmikola

  2. It’s 11AM: Do you know
    where your writes are?
    Jeremy Mikola
    jmikola

  3. On the roadmap
    Retryable writes
    Zombie cursor cleanup
    Cluster-wide killOp

  4. Retryable Writes

  5. Retryable Writes

  6. Retryable Writes

  7. You’re updating a document

  8. You’re updating a document
    db.coll.updateOne(
    { _id: 16 },
    { $inc: { count: 1 }}
    );

  9. Murphy’s Law kicks in

  10. Is it safe to retry the update?

  11. Did our message never
    make it to the server?

  12. Did we lose the server’s reply?

  13. Did something else happen?

  14. Did something else happen?

  15. Did something else happen?

  16. Did something else happen?

  17. Did something else happen?

  18. There’s no way to retrieve
    an operation’s state

  19. Let’s review some
    best practices
    How To Write Resilient
    MongoDB Applications

  20. This was our update…
    db.coll.updateOne(
    { _id: 16 },
    { $inc: { count: 1 }}
    );

  21. Errors and retry strategies
    Strategy     | Transient network error | Persistent outage | Command error
    Never retry  |                         |                   |
    Always retry |                         |                   |
    Retry once   |                         |                   |

  22. Errors and retry strategies
    Strategy     | Transient network error | Persistent outage | Command error
    Never retry  | May undercount          |                   |
    Always retry |                         |                   |
    Retry once   |                         |                   |

  23. Errors and retry strategies
    Strategy     | Transient network error | Persistent outage | Command error
    Never retry  | May undercount          | OK                |
    Always retry |                         |                   |
    Retry once   |                         |                   |

  24. Errors and retry strategies
    Strategy     | Transient network error | Persistent outage | Command error
    Never retry  | May undercount          | OK                | OK
    Always retry |                         |                   |
    Retry once   |                         |                   |

  25. Errors and retry strategies
    Strategy     | Transient network error | Persistent outage | Command error
    Never retry  | May undercount          | OK                | OK
    Always retry | May overcount           |                   |
    Retry once   |                         |                   |

  26. Errors and retry strategies
    Strategy     | Transient network error | Persistent outage | Command error
    Never retry  | May undercount          | OK                | OK
    Always retry | May overcount           | Wastes time       |
    Retry once   |                         |                   |

  27. Errors and retry strategies
    Strategy     | Transient network error | Persistent outage | Command error
    Never retry  | May undercount          | OK                | OK
    Always retry | May overcount           | Wastes time       | Wastes time
    Retry once   |                         |                   |

  28. Errors and retry strategies
    Strategy     | Transient network error | Persistent outage | Command error
    Never retry  | May undercount          | OK                | OK
    Always retry | May overcount           | Wastes time       | Wastes time
    Retry once   | May overcount           |                   |

  29. Errors and retry strategies
    Strategy     | Transient network error | Persistent outage | Command error
    Never retry  | May undercount          | OK                | OK
    Always retry | May overcount           | Wastes time       | Wastes time
    Retry once   | May overcount           | OK                |

  30. Errors and retry strategies
    Strategy     | Transient network error | Persistent outage | Command error
    Never retry  | May undercount          | OK                | OK
    Always retry | May overcount           | Wastes time       | Wastes time
    Retry once   | May overcount           | OK                | OK

  31. There’s no good solution for
    transient network errors
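
The transient-network-error column of the retry table can be reproduced with a small simulation. This is only a sketch: FlakyServer, incrementWithRetries, and the two failure modes are hypothetical names, and nothing here talks to a real MongoDB server. It models the $inc update from the earlier slides under a single glitch, where either the request or the reply is lost.

```javascript
// In-memory stand-in for a server; all names below are hypothetical.
class NetworkError extends Error {}

class FlakyServer {
  constructor(failureMode) {
    this.count = 0;                 // the document's "count" field
    this.failureMode = failureMode; // "lostRequest" | "lostReply" | null
  }

  // Emulates { $inc: { count: 1 } } with at most one transient failure.
  inc() {
    const mode = this.failureMode;
    this.failureMode = null; // the glitch only happens once
    if (mode === "lostRequest") throw new NetworkError("request never arrived");
    this.count += 1;
    if (mode === "lostReply") throw new NetworkError("reply was lost");
  }
}

// Runs one logical increment, retrying up to maxRetries times on network
// errors. The client sees the same error in both failure modes.
function incrementWithRetries(server, maxRetries) {
  for (let attempt = 0; ; attempt++) {
    try {
      server.inc();
      return;
    } catch (err) {
      if (!(err instanceof NetworkError) || attempt >= maxRetries) return;
    }
  }
}

function finalCount(failureMode, maxRetries) {
  const server = new FlakyServer(failureMode);
  incrementWithRetries(server, maxRetries);
  return server.count;
}

console.log("never retry, lost request:", finalCount("lostRequest", 0)); // 0
console.log("retry once, lost request: ", finalCount("lostRequest", 1)); // 1
console.log("never retry, lost reply:  ", finalCount("lostReply", 0));   // 1
console.log("retry once, lost reply:   ", finalCount("lostReply", 1));   // 2
```

The intended result is always 1. Never retrying undercounts when the request was lost, and retrying once overcounts when only the reply was lost, which is why neither strategy is safe for this update.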

  32. We can safely retry
    idempotent operations
    Strategy   | Transient network error | Persistent outage | Command error
    Retry once |                         |                   |

  33. We can safely retry
    idempotent operations
    Strategy   | Transient network error | Persistent outage | Command error
    Retry once | OK                      |                   |

  34. We can safely retry
    idempotent operations
    Strategy   | Transient network error | Persistent outage | Command error
    Retry once | OK                      | OK                |

  35. We can safely retry
    idempotent operations
    Strategy   | Transient network error | Persistent outage | Command error
    Retry once | OK                      | OK                | OK

  36. Safe-to-retry inserts
    db.coll.insertOne(
    { _id: 18, name: "Alice" }
    );

  37. Safe-to-retry deletes
    db.coll.deleteOne(
    { _id: 20 }
    );
    db.coll.deleteMany(
    { status: "inactive" }
    );

  38. Safe-to-retry updates
    db.coll.updateOne(
    { _id: 22 },
    { $set: { status: "active" }}
    );
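
To see why the $set update is safe to retry while the earlier $inc update is not, it may help to apply each one twice. The helpers below (applySet, applyInc) are hypothetical in-memory functions that only mimic the update operators; they are not driver or server calls.

```javascript
// Hypothetical helpers that mimic $set and $inc on a plain object.
function applySet(doc, fields) {
  return { ...doc, ...fields };
}

function applyInc(doc, field, by) {
  return { ...doc, [field]: (doc[field] || 0) + by };
}

const original = { _id: 22, status: "inactive", count: 5 };

// Applying $set a second time (a retry) leaves the document unchanged.
const setOnce = applySet(original, { status: "active" });
const setTwice = applySet(setOnce, { status: "active" });

// Applying $inc a second time changes the document again.
const incOnce = applyInc(original, "count", 1);
const incTwice = applyInc(incOnce, "count", 1);

console.log(JSON.stringify(setOnce) === JSON.stringify(setTwice)); // true
console.log(incOnce.count, incTwice.count); // 6 7
```

An operation is idempotent when running it twice yields the same state as running it once, which is the property the insert, delete, and $set examples above rely on.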

  39. Why can’t we retrieve
    an operation’s state?

  40. In MongoDB 3.4, state is
    tied to connection objects

  41. MongoDB 3.6 introduces
    logical sessions

  42. MongoDB 3.6 introduces
    logical sessions
    Sessions allow us to maintain cluster-wide
    state about the user and their operations.

  43. MongoDB 3.6 introduces
    logical sessions
    Sessions allow us to maintain cluster-wide
    state about the user and their operations.
    Sessions are not tied to connections.

  44. Retrying writes with a session

  45. Retrying writes with a session

  46. Retrying writes with a session

  47. Retrying writes with a session
    update

  48. Retrying writes with a session

  49. Retrying writes with a session
    update

  50. Retrying writes with a session

  51. We can trust the server to
    Do the Right Thing™

  52. We can trust the server to
    Do the Right Thing™
    If the write already executed,
    return the result we missed.

  53. We can trust the server to
    Do the Right Thing™
    If the write already executed,
    return the result we missed.
    If the write never executed,
    do it now and return its result.
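
The two rules above can be sketched as a server that remembers the result of each write. This is an illustrative in-memory model, not the server's implementation: SessionAwareServer, the string session ID, and the per-write number (called txnNumber here) are assumptions made for the example.

```javascript
// Hypothetical model of a server that tracks writes per session.
class SessionAwareServer {
  constructor() {
    this.count = 0;             // the document's "count" field
    this.completed = new Map(); // "sessionId:txnNumber" -> stored result
  }

  // Applies the increment at most once per (sessionId, txnNumber) pair.
  inc(sessionId, txnNumber) {
    const key = `${sessionId}:${txnNumber}`;
    if (this.completed.has(key)) {
      // The write already executed: return the result the client missed.
      return this.completed.get(key);
    }
    // The write never executed: do it now and remember its result.
    this.count += 1;
    const result = { matchedCount: 1, modifiedCount: 1 };
    this.completed.set(key, result);
    return result;
  }
}

const server = new SessionAwareServer();
const first = server.inc("session-A", 1); // imagine this reply is lost
const retry = server.inc("session-A", 1); // retry with the same identifiers
const next = server.inc("session-A", 2);  // a separate, later write

console.log(server.count);    // 2
console.log(retry === first); // true
```

Because the retry carries the same identifiers as the original attempt, the non-idempotent increment is applied only once, even though the client sent it twice.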

  54. Sessions are cluster-wide

  55. Sessions are cluster-wide
    update

  56. Sessions are cluster-wide
    update

  57. Sessions are cluster-wide

  58. Sessions are cluster-wide

  59. Sessions are cluster-wide
    update

  60. Taking advantage of
    retryable writes
    mongodb://…?retryWrites=true

  61. One down, two to go
    Retryable writes
    Zombie cursor cleanup
    Cluster-wide killOp

  62. Zombie Cursor Cleanup

  63. Zombie Cursor Cleanup

  64. You’re running a long query

  65. You’re running a long query
    cursor = db.coll.find();
    cursor.forEach(function() {
    // lengthy processing…
    });

  66. You’re running a long query
    cursor = db.coll.find();
    cursor.forEach(function() {
    // lengthy processing…
    });

  67. Cursors have a timeout

  68. Cursors have a timeout
    After 10 minutes, the server will
    close a cursor due to inactivity.

  69. Cursors have a timeout
    After 10 minutes, the server will
    close a cursor due to inactivity.
    Issuing a getMore
    resets the clock.
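
The idle-timeout rule can be modeled with a fake clock. The sketch below is hypothetical (CursorRegistry and its methods are invented for illustration and are not how the server is implemented); time is measured in minutes, and the 10-minute limit comes from the slide.

```javascript
// Hypothetical model of idle cursor cleanup, using a fake clock in minutes.
const CURSOR_TIMEOUT_MINUTES = 10;

class CursorRegistry {
  constructor() {
    this.lastUsed = new Map(); // cursorId -> time of last activity
  }

  open(cursorId, now) {
    this.lastUsed.set(cursorId, now);
  }

  // A getMore touches the cursor and resets its idle clock.
  getMore(cursorId, now) {
    if (!this.lastUsed.has(cursorId)) {
      throw new Error(`cursor ${cursorId} not found`);
    }
    this.lastUsed.set(cursorId, now);
  }

  // Periodic sweep: close cursors that have been idle past the timeout.
  reap(now) {
    for (const [cursorId, at] of this.lastUsed) {
      if (now - at > CURSOR_TIMEOUT_MINUTES) this.lastUsed.delete(cursorId);
    }
  }

  isOpen(cursorId) {
    return this.lastUsed.has(cursorId);
  }
}

const registry = new CursorRegistry();
registry.open("busy", 0);
registry.open("idle", 0);
registry.getMore("busy", 8); // activity at minute 8 resets the clock
registry.reap(11);           // "idle" has been inactive for 11 minutes

console.log(registry.isOpen("busy"), registry.isOpen("idle")); // true false
```

A client whose per-document processing takes longer than the timeout never issues the next getMore in time, so its cursor is closed underneath it.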

  70. Disabling cursor timeouts
    cursor = db.coll.find(
    { },
    { noCursorTimeout: true }
    );
    cursor.forEach(function() {
    // lengthy processing…
    });

  71. Disabling cursor timeouts
    cursor = db.coll.find(
    { },
    { noCursorTimeout: true }
    );
    cursor.forEach(function() {
    // lengthy processing…
    });

  72. Executing our long query

  73. Executing our long query
    find

  74. Executing our long query

  75. Executing our long query
    getMore

  76. Executing our long query

  77. Executing our long query
    getMore

  78. Executing our long query

  79. Executing our long query
    getMore

  80. Executing our long query
    getMore

  81. Executing our long query
    getMore

  82. Executing our long query
    getMore

  83. Executing our long query
    getMore

  84. Executing our long query
    getMore

  85. Executing our long query
    getMore

  86. A zombie cursor is born
    > db.serverStatus()
    {
    "metrics": {
    "cursor": {
    "open": {
    "noTimeout": 1,
    "total": 1

  87. What happened last night?

  88. What happened last night?
    (from the server’s POV)

  89. What happened last night?
    (from the server’s POV)

  90. What happened last night?
    (from the server’s POV)
    find

  91. What happened last night?
    (from the server’s POV)

  92. What happened last night?
    (from the server’s POV)
    getMore

  93. What happened last night?
    (from the server’s POV)

  94. What happened last night?
    (from the server’s POV)
    getMore

  95. What happened last night?
    (from the server’s POV)

  96. What happened last night?
    (from the server’s POV)

  97. What happened last night?
    (from the server’s POV)

  98. Avoiding zombie cursors
    with logical sessions

  99. Avoiding zombie cursors
    with logical sessions
    Sessions also have a timeout.

  100. Avoiding zombie cursors
    with logical sessions
    Sessions also have a timeout.
    We can associate
    queries with a session

  101. Querying with a session
    session = client.startSession();
    cursor = db.coll.find(
    { },
    { session: session }
    );

  102. Executing our long query

  103. Executing our long query

  104. Executing our long query

  105. Executing our long query
    find

  106. Executing our long query

  107. Executing our long query
    getMore

  108. Executing our long query

  109. Executing our long query
    getMore

  110. Executing our long query

  111. Executing our long query

  112. Executing our long query
    session expires

  113. Executing our long query

  114. Did we just punt on
    the timeout issue?

  115. Session timeouts are
    non-negotiable

  116. Session timeouts are
    non-negotiable
    Idle sessions will expire.

  117. Session timeouts are
    non-negotiable
    Idle sessions will expire.
    Any operation using the
    session resets the clock.
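
The interaction between session expiry and a noCursorTimeout cursor can be sketched with another fake-clock model. Everything here is hypothetical: SessionServer and its methods are invented for the example, and the 30-minute value is an assumption for illustration, since the slides do not state the session timeout.

```javascript
// Hypothetical model: cursors are owned by sessions, and sessions expire.
// The timeout value is an assumption; the slides do not give a number.
const SESSION_TIMEOUT_MINUTES = 30;

class SessionServer {
  constructor() {
    this.sessions = new Map(); // sessionId -> time of last use
    this.cursors = new Map();  // cursorId -> owning sessionId
  }

  // Any operation that names a session resets that session's clock.
  touch(sessionId, now) {
    this.sessions.set(sessionId, now);
  }

  // Opens a cursor that has no idle timeout of its own.
  find(sessionId, cursorId, now) {
    this.touch(sessionId, now);
    this.cursors.set(cursorId, sessionId);
  }

  getMore(cursorId, now) {
    const sessionId = this.cursors.get(cursorId);
    if (sessionId === undefined) throw new Error("cursor not found");
    this.touch(sessionId, now);
  }

  // Expire idle sessions and close every cursor they own.
  reap(now) {
    for (const [sessionId, lastUse] of this.sessions) {
      if (now - lastUse > SESSION_TIMEOUT_MINUTES) {
        this.sessions.delete(sessionId);
        for (const [cursorId, owner] of this.cursors) {
          if (owner === sessionId) this.cursors.delete(cursorId);
        }
      }
    }
  }
}

const sessionServer = new SessionServer();
sessionServer.find("s1", "c1", 0);
sessionServer.getMore("c1", 20); // resets the session clock at minute 20
sessionServer.reap(45);          // 25 minutes idle: session survives
const openAt45 = sessionServer.cursors.has("c1");
sessionServer.reap(60);          // 40 minutes idle: session expires
const openAt60 = sessionServer.cursors.has("c1");

console.log(openAt45, openAt60); // true false
```

If the client disappears, nothing touches the session any more, so the cursor is eventually cleaned up along with it instead of lingering as a zombie.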

  118. Two down, one to go
    Retryable writes
    Zombie cursor cleanup
    Cluster-wide killOp

  119. Cluster-wide killOp

  120. Cluster-wide killOp

  121. You’re running an operation
    that may never complete

  122. You’re running an operation
    that may never complete
    cursor = db.coll.find(
    { … } // table scans for days
    );

  123. You’ve made a
    terrible mistake

  124. Step 1: Find the operation ID
    > db.currentOp()
    {
    "inprog" : [
    {
    "desc" : "conn2",
    "threadId" : "140181791471360",
    "connectionId" : 2,
    "client" : "127.0.0.1:49456",
    "appName" : "MongoDB Shell",
    "active" : true,
    "opid" : 132921,

  125. Step 2: Kill the operation ID
    > db.killOp(132921)
    {
    "info": "attempting to kill op",
    "ok": 1
    }

  126. Lather, rinse, repeat

  127. Lather, rinse, repeat
    > connect("mongodb://shard-2.example.com")

  128. Lather, rinse, repeat
    > connect("mongodb://shard-2.example.com")
    > db.currentOp()
    {
    "inprog" : [
    // …
    ]
    }

  129. Lather, rinse, repeat
    > connect("mongodb://shard-2.example.com")
    > db.currentOp()
    {
    "inprog" : [
    // …
    ]
    }
    > db.killOp(…)

  130. Lather, rinse, repeat
    > connect("mongodb://shard-2.example.com")
    > db.currentOp()
    {
    "inprog" : [
    // …
    ]
    }
    > db.killOp(…)
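
The connect, currentOp, killOp routine above amounts to a loop over every shard. A minimal sketch of that loop follows, using hypothetical stand-ins: FakeShard is not a driver or shell API, although its currentOp() and killOp() methods mirror the shell helpers shown on the slides, and the opid values and the second appName are made up.

```javascript
// Hypothetical stand-in for a connection to one shard.
class FakeShard {
  constructor(name, ops) {
    this.name = name;
    this.ops = ops; // [{ opid, appName, active }]
  }

  currentOp() {
    return { inprog: this.ops.filter((op) => op.active) };
  }

  killOp(opid) {
    const op = this.ops.find((candidate) => candidate.opid === opid);
    if (op) op.active = false;
    return { info: "attempting to kill op", ok: 1 };
  }
}

// Visits every shard, finds the matching operations, and kills each one.
function killOnEveryShard(shards, appName) {
  let killed = 0;
  for (const shard of shards) {
    for (const op of shard.currentOp().inprog) {
      if (op.appName === appName) {
        shard.killOp(op.opid);
        killed += 1;
      }
    }
  }
  return killed;
}

const shards = [
  new FakeShard("shard 1", [{ opid: 11, appName: "MongoDB Shell", active: true }]),
  new FakeShard("shard 2", [{ opid: 21, appName: "MongoDB Shell", active: true },
                            { opid: 22, appName: "reporting", active: true }]),
  new FakeShard("shard 3", [{ opid: 31, appName: "MongoDB Shell", active: true }]),
];

console.log(killOnEveryShard(shards, "MongoDB Shell")); // 3
```

Each shard assigns its own operation IDs, so the operator has to look up and kill the operation separately on every shard.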

  131. How did this happen?
    mongos
    shard 1
    shard 2
    shard 3

  132. How did this happen?
    mongos
    shard 1
    shard 2
    shard 3

  133. How did this happen?
    mongos
    shard 1
    shard 2
    shard 3

  134. Cluster-wide killOp
    with logical sessions

  135. Cluster-wide killOp
    with logical sessions
    Any operation may be
    associated with a session.

  136. Cluster-wide killOp
    with logical sessions
    Any operation may be
    associated with a session.
    Terminating a session will end
    all of its associated operations.

  137. Terminating a session
    session = client.startSession();
    cursor = db.coll.find(
    { … }, // table scans for days
    { session: session }
    );
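
Once every operation carries a session ID, ending the session can end its work on all shards at once. The sketch below is a hypothetical in-memory model (makeCluster and killSession are invented names, and the opid values are made up); it shows only the bookkeeping idea, not the server's implementation.

```javascript
// Hypothetical model: each shard lists running operations tagged with the
// session that started them.
function makeCluster() {
  return {
    "shard 1": [{ opid: 101, sessionId: "s1" }, { opid: 102, sessionId: "s2" }],
    "shard 2": [{ opid: 201, sessionId: "s1" }],
    "shard 3": [{ opid: 301, sessionId: "s1" }],
  };
}

// Ends every operation associated with sessionId, on every shard.
function killSession(cluster, sessionId) {
  let killed = 0;
  for (const shard of Object.keys(cluster)) {
    const before = cluster[shard].length;
    cluster[shard] = cluster[shard].filter((op) => op.sessionId !== sessionId);
    killed += before - cluster[shard].length;
  }
  return killed;
}

const cluster = makeCluster();
console.log(killSession(cluster, "s1")); // 3
console.log(cluster["shard 1"].length);  // 1
```

The session ID is the same on every shard, unlike per-shard operation IDs, so one request is enough and unrelated sessions are left alone.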

  138. Querying with sessions
    mongos
    shard 1
    shard 2
    shard 3

  139. Querying with sessions
    mongos
    shard 1
    shard 2
    shard 3

  140. Querying with sessions
    mongos
    shard 1
    shard 2
    shard 3

  141. Querying with sessions
    mongos
    shard 1
    shard 2
    shard 3

  142. Querying with sessions
    mongos
    shard 1
    shard 2
    shard 3

  143. Querying with sessions
    mongos
    shard 1
    shard 2
    shard 3

  144. Querying with sessions
    mongos
    shard 1
    shard 2
    shard 3

  145. Querying with sessions
    mongos
    shard 1
    shard 2
    shard 3

  146. Querying with sessions
    mongos
    shard 1
    shard 2
    shard 3

  147. Querying with sessions
    mongos
    shard 1
    shard 2
    shard 3

  148. That’s a wrap
    Retryable writes
    Zombie cursor cleanup
    Cluster-wide killOp

  149. One last point

  150. Resilience is primarily
    the driver’s domain

  151. Resilience is primarily
    the driver’s domain
    Server discovery and monitoring

  152. Resilience is primarily
    the driver’s domain
    Server discovery and monitoring
    Elections and failover recovery

  153. Resilience is primarily
    the driver’s domain
    Server discovery and monitoring
    Elections and failover recovery
    Load-balancing mongos connections

  154. Resilience is primarily
    the driver’s domain
    Server discovery and monitoring
    Elections and failover recovery
    Load-balancing mongos connections
    Routing queries by read preference

  155. Addressing resilience
    on the server-side

  156. Addressing resilience
    on the server-side
    Tracking operation state

  157. Addressing resilience
    on the server-side
    Tracking operation state
    Cluster-wide sessions

  158. Providing a relatively
    easy upgrade path

  159. Providing a relatively
    easy upgrade path
    No need to rewrite applications

  160. Providing a relatively
    easy upgrade path
    No need to rewrite applications
    Opting in to retryable writes

  161. Providing a relatively
    easy upgrade path
    No need to rewrite applications
    Opting in to retryable writes
    New API for client session objects

  162. Providing a relatively
    easy upgrade path
    No need to rewrite applications
    Opting in to retryable writes
    New API for client session objects
    Pass session option as needed

  163. Inside the spec process
    mongodb/specifications

  164. Inside the spec process
    /sessions
    mongodb/specifications

  165. Inside the spec process
    /sessions
    /retryable-writes
    mongodb/specifications

  166. Inside the spec process
    /sessions
    /retryable-writes
    /causal-consistency
    mongodb/specifications

  167. Inside the spec process
    /sessions
    /retryable-writes
    /causal-consistency
    /retryable-reads
    mongodb/specifications

  168. In the meantime…
    How To Write Resilient
    MongoDB Applications
