
Hash-Based Sharding in MongoDB 2.4 (MongoDB SF 2013)

Presentation on the hash-based sharding features of MongoDB given at MongoDB SF 2013.

Brandon Black

May 10, 2013

Transcript

  1. Hash-Based Sharding in MongoDB 2.4
    Brandon Black (@brandonmblack), Software Engineer, 10gen. #MongoDBDays
  2. Agenda
    •  Mechanics of Sharding
        –  Key space
        –  Chunks
        –  Balancing
    •  Request Routing
    •  Hashed Shard Keys
        –  Why use hashed shard keys
        –  How to enable hashed shard keys
        –  Limitations
  3. Sharded Cluster
    [Diagram labels: App Server and Mongos (three each), Shard (three), Config Server (three), Node 1, Secondary]
  4. Sharding Your Data

  5. What Is A Shard Key?
    •  Shard key is used to partition your collection
    •  Shard key must exist in every document
    •  Shard key is immutable
    •  Shard key values are immutable
    •  Shard key must be indexed
    •  Shard key is used to route requests to shards
  6. The Key Space
    [Diagram: key space from -∞ to +∞ with sample shard key values {x: 10}, {x: -5}, {x: -9}, {x: 7}, {x: 6}, {x: 0}]
  7. Inserting Data
    [Diagram: documents {x: 0}, {x: 6}, {x: 7}, {x: -5}, {x: 10}, {x: -9} placed in the key space from -∞ to +∞]
  9. Chunk Range and Size
    [Diagram: key space from -∞ to +∞ divided into chunks, each bounded by a minKey and a maxKey; boundary values shown: {x: -20}, {x: 13}, {x: 25}, {x: 100,000}; chunk size: 64MB; documents {x: 0}, {x: 6}, {x: 7}, {x: -5}, {x: 10}, {x: -9}]
  10. Inserting Further Data
    [Diagram: key space from -∞ to +∞; new documents {x: 9}, {x: -7}, {x: 3} join {x: 0}, {x: 6}, {x: 7}, {x: -5}, {x: 10}, {x: -9}]
  11. Chunk Splitting
    [Diagram: a chunk spanning minKey to maxKey in the key space is split in two; labels shown: 13, 14, 0]
    •  A chunk is split once it exceeds the maximum size
    •  There is no split point if all documents have the same shard key
    •  A chunk split is a logical operation (no data is moved)
    •  If a split leaves chunk counts too uneven across the cluster, a balancing round starts
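The splitting rules on this slide can be sketched in a few lines of JavaScript. This is a toy model, not the server's code: the `splitChunk` function, the chunk object shape, the per-document `bytes` field, and the choice of the median key as split point are all assumptions made for illustration.

```javascript
// Hypothetical in-memory chunk: covers the half-open range [min, max) on shard key x.
// `bytes` is a made-up per-document size used only to trigger the split.
function splitChunk(chunk, maxSizeBytes) {
  const size = chunk.docs.reduce((sum, doc) => sum + doc.bytes, 0);
  if (size <= maxSizeBytes) return [chunk]; // under the limit: nothing to do

  const keys = chunk.docs.map((doc) => doc.x).sort((a, b) => a - b);
  let splitKey = keys[Math.floor(keys.length / 2)]; // start from the median key
  if (splitKey === keys[0]) {
    // Splitting at the smallest key would leave an empty lower chunk,
    // so move to the first larger key, if there is one.
    splitKey = keys.find((k) => k > keys[0]);
    if (splitKey === undefined) return [chunk]; // one distinct key: no split point
  }

  // Logical split: both halves stay on the same shard; only the ranges change.
  return [
    { ...chunk, max: splitKey, docs: chunk.docs.filter((doc) => doc.x < splitKey) },
    { ...chunk, min: splitKey, docs: chunk.docs.filter((doc) => doc.x >= splitKey) },
  ];
}

const chunk = {
  min: -Infinity,
  max: Infinity,
  shard: "Shard 1",
  docs: [0, 6, 7, -5, 10, -9].map((x) => ({ x, bytes: 20 })),
};
const halves = splitChunk(chunk, 64);
console.log(halves.map((c) => [c.min, c.max]));
```

A chunk whose documents all share one key value comes back unsplit, which is the "no split point" case from the slide.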
  12. Data Distribution
    •  MinKey to 0 lives on Shard 1
    •  0 to MaxKey lives on Shard 2
    •  Mongos routes queries appropriately
  13. Mongos Routes Data
    [Diagram: Shard 1 holds minKey → 0; Shard 2 holds 0 → maxKey]

        db.test.insert({ x: -1000 })
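The routing decision for that insert can be sketched as a range lookup. The chunk table below copies the two ranges from the diagram; `routeByShardKey` is a hypothetical name, and `-Infinity` / `Infinity` stand in for minKey / maxKey.

```javascript
// Chunk table as a mongos might cache it, using the two ranges from the slide.
const chunks = [
  { min: -Infinity, max: 0, shard: "Shard 1" },
  { min: 0, max: Infinity, shard: "Shard 2" },
];

// Each chunk owns the half-open range [min, max): the lower bound is inclusive.
function routeByShardKey(chunkTable, x) {
  const owner = chunkTable.find((c) => x >= c.min && x < c.max);
  if (!owner) throw new Error(`no chunk covers x=${x}`);
  return owner.shard;
}

console.log(routeByShardKey(chunks, -1000)); // Shard 1
```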
  15. Unbalanced Shards
    [Diagram: Shard 1 holds minKey → 0; Shard 2 holds 0 → maxKey]
  16. Balancing
    •  Migration threshold
        –  Fewer than 20 chunks: migration threshold of 2
        –  21-80 chunks: migration threshold of 4
        –  More than 80 chunks: migration threshold of 8
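As a rough sketch, the threshold table can be written as a function. The slide does not say which band a count of exactly 20 belongs to; the code below assumes the middle band. `needsBalancing` is a hypothetical helper that compares the gap between the fullest and emptiest shard with the threshold.

```javascript
// Thresholds as listed on the slide. Assumption: a total of exactly 20 chunks
// falls into the middle band, since the slide leaves that case open.
function migrationThreshold(totalChunks) {
  if (totalChunks < 20) return 2;
  if (totalChunks <= 80) return 4;
  return 8;
}

// Hypothetical helper: a balancing round is warranted once the difference between
// the most and least loaded shards reaches the threshold.
function needsBalancing(chunkCounts) {
  const total = chunkCounts.reduce((a, b) => a + b, 0);
  const gap = Math.max(...chunkCounts) - Math.min(...chunkCounts);
  return gap >= migrationThreshold(total);
}

console.log(needsBalancing([3, 1]));   // 4 chunks in total, gap of 2
console.log(needsBalancing([12, 10])); // 22 chunks in total, gap of 2
```

The same gap of two chunks triggers a round on a small cluster but not on a larger one, which keeps the balancer from migrating data over minor differences.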
  17. Moving the Chunk
    •  One chunk of data is copied from Shard 1 to Shard 2
  18. Committing Migration
    •  Once everyone agrees the data has moved, that chunk gets deleted from Shard 1
  19. Cleanup
    •  Other mongos instances have to find out about the new configuration
  20. Effects of Migrations
    •  Expensive
    •  Can take a long time
    •  Competes for limited resources
  21. Picking A Shard Key
    •  Cardinality
    •  Optimize routing
    •  Minimize (unnecessary) traffic
    •  Allow best scaling
  22. Routing Requests

  23. Cluster Request Routing
    •  Targeted Queries
    •  Scatter Gather Queries
    •  Scatter Gather Queries with Sort
  24. Cluster Request Routing: Targeted Query
    [Diagram: one mongos in front of three shards; numbered markers trace the request through the following slides]
  25. Routable Request Received (step 1)
  26. Request routed to appropriate shard (step 2)
  27. Shard returns results (step 3)
  28. Mongos returns results to client (step 4)
  29. Cluster Request Routing: Non-Targeted Query
    [Diagram: one mongos in front of three shards; numbered markers trace the request through the following slides]
  30. Non-Targeted Request Received (step 1)
  31. Request sent to all shards (step 2)
  32. Shards return results to mongos (step 3)
  33. Mongos returns results to client (step 4)
  34. Cluster Request Routing: Non-Targeted Query with Sort
    [Diagram: one mongos in front of three shards; numbered markers trace the request through the following slides]
  35. Non-Targeted request with sort received (step 1)
  36. Request sent to all shards (step 2)
  37. Query and sort performed locally (step 3)
  38. Shards return results to mongos (step 4)
  39. Mongos merges sorted results (step 5)
  40. Mongos returns results to client (step 6)
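The merge step in the sorted scatter-gather flow can be illustrated with a small k-way merge. This is a sketch of the general technique, not the mongos implementation; `mergeSorted` and the sample per-shard results are made up for the example.

```javascript
// Merge per-shard result arrays that are each already sorted ascending by `key`.
function mergeSorted(shardResults, key) {
  const next = shardResults.map(() => 0); // index of the next unread doc per shard
  const merged = [];
  for (;;) {
    let best = -1;
    for (let s = 0; s < shardResults.length; s++) {
      if (next[s] >= shardResults[s].length) continue; // this shard is exhausted
      if (best === -1 || shardResults[s][next[s]][key] < shardResults[best][next[best]][key]) {
        best = s;
      }
    }
    if (best === -1) return merged; // every shard is exhausted
    merged.push(shardResults[best][next[best]++]);
  }
}

// Each shard has sorted its own matches locally before replying.
const perShard = [
  [{ x: 1 }, { x: 7 }],
  [{ x: -5 }, { x: 6 }],
  [{ x: 0 }, { x: 10 }],
];
const mergedDocs = mergeSorted(perShard, "x");
console.log(mergedDocs.map((doc) => doc.x));
```

Because every shard returns its results already sorted, the merge only ever compares the head of each shard's list and never has to re-sort the combined set.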
  41. What About ObjectId?
    ObjectId("51597ca8e28587b86528edfd")
    •  Used for _id
    •  12 byte value
    •  Generated by the driver if not specified
    •  Theoretically globally unique
  42. What About ObjectId?
    ObjectId("51597ca8e28587b86528edfd")
    12 bytes: Timestamp, MAC, PID, Counter
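To make the four fields concrete, the example ObjectId can be pulled apart by hand. The field widths used below (4-byte timestamp, 3-byte machine identifier, 2-byte process id, 3-byte counter) are the ObjectId layout of that era; the slide names the fields but not their sizes, and `parseObjectId` is a hypothetical helper.

```javascript
// Split a 24-character ObjectId hex string into the four fields named on the slide.
// Widths assumed: 4 bytes timestamp, 3 bytes machine, 2 bytes pid, 3 bytes counter.
function parseObjectId(hex) {
  if (!/^[0-9a-f]{24}$/i.test(hex)) throw new Error("expected 24 hex characters");
  const seconds = parseInt(hex.slice(0, 8), 16); // seconds since the Unix epoch
  return {
    timestamp: new Date(seconds * 1000),
    machine: hex.slice(8, 14),
    pid: parseInt(hex.slice(14, 18), 16),
    counter: parseInt(hex.slice(18, 24), 16),
  };
}

const parsed = parseObjectId("51597ca8e28587b86528edfd");
console.log(parsed.timestamp.toISOString()); // 2013-04-01T12:25:12.000Z
```

The timestamp occupies the leading bytes, so ObjectIds generated later sort after earlier ones. That ordering is what matters once _id is used as a shard key.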

  43. Sharding on ObjectId

        // enabling sharding on test database
        mongos> sh.enableSharding("test")
        { "ok" : 1 }

        // sharding the test collection
        mongos> sh.shardCollection("test.test", {_id: 1})
        { "collectionsharded" : "test.test", "ok" : 1 }

        // create a loop inserting data
        mongos> for (x=0; x<10000; x++) {
        ... db.test.insert({value:x})
        ... }
  44. ObjectId Chunk Distribution

        shards:
            { "_id" : "shard0000", "host" : "localhost:30000" }
            { "_id" : "shard0001", "host" : "localhost:30001" }
        databases:
            { "_id" : "test", "partitioned" : true, "primary" : "shard0001" }
                test.test
                    shard key: { "_id" : 1 }
                    chunks:
                        shard0001  3
                    { "_id" : { "$minKey" : 1 } } -->> { "_id" : ObjectId("...") } on : shard0001 { "t" : 1000, "i" : 1 }
                    { "_id" : ObjectId("...") } -->> { "_id" : { "$maxKey" : 1 } } on : shard0001 { "t" : 1000, "i" : 2 }
  45. ObjectId Results In A “Hot Shard”
    [Diagram: Shard 1 holds minKey → 0; Shard 2 holds 0 → maxKey]
  46. Sharding on incrementing values such as timestamps is not optimal for even distribution
  47. Hashed Shard Keys

  48. Hashed Shard Keys
    {x:2} → md5 → c81e728d9d4c2f636f067f89cc14862c
    {x:3} → md5 → eccbc87e4b5ce2fe28308fd9f2a7baf3
    {x:1} → md5 → c4ca4238a0b923820dcc509a6f75849b
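The three digests on this slide are the MD5 of the strings "1", "2" and "3", which can be reproduced with Node's built-in `crypto` module. This only mirrors the slide's illustration: the server hashes a BSON element rather than a plain string, so the values it stores differ. `md5Hex` and `first64Bits` are hypothetical helpers, and the big-endian byte order in `first64Bits` is an assumption.

```javascript
const crypto = require("crypto");

// MD5 of the decimal string form of a key value, as on the slide.
function md5Hex(value) {
  return crypto.createHash("md5").update(String(value)).digest("hex");
}

// First 64 bits of the digest read as a signed integer. Big-endian order is an
// assumption of this sketch, so treat the number as illustrative only.
function first64Bits(value) {
  return crypto.createHash("md5").update(String(value)).digest().readBigInt64BE(0);
}

for (const x of [1, 2, 3]) {
  console.log(x, md5Hex(x), first64Bits(x).toString());
}
```

Adjacent inputs produce digests with no visible relationship to each other, which is what spreads monotonically increasing keys across the whole key space.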
  49. Hashed Shard Key Eliminates “Hot Shard”
    [Diagram: Shard 1 holds minKey → 0; Shard 2 holds 0 → maxKey]
  50. Under the Hood
    •  Create a hashed index used for sharding
    •  Uses the first 64 bits of the md5 hash of the field
    •  Hashes both data and BSON type
    •  Represented as a NumberLong in the shell
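One way to picture "hashes both data and BSON type" is to feed a type tag into the hash ahead of the value. The sketch below is an illustration of that idea only: the tag values are placeholders borrowed from BSON's int32 and string type codes, `hashWithType` is a hypothetical name, and the resulting numbers will not match what the server computes.

```javascript
const { createHash } = require("crypto");

// Placeholder type tags (borrowed from the BSON codes for int32 and string).
const TYPE_TAG = { number: 0x10, string: 0x02 };

// Hash the type tag together with the value, then keep the first 64 bits.
function hashWithType(value) {
  const tag = TYPE_TAG[typeof value];
  if (tag === undefined) throw new Error(`unsupported type: ${typeof value}`);
  return createHash("md5")
    .update(Buffer.from([tag])) // the type goes into the hash first
    .update(String(value)) // followed by the data itself
    .digest()
    .readBigInt64LE(0);
}

console.log(hashWithType(1).toString());
console.log(hashWithType("1").toString());
```

The number 1 and the string "1" print different hashes even though their text form is identical, so documents that store the same value under different types can land in different chunks.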
  51. Hash on both data and BSON type

        // hash on 1 as an integer
        > db.runCommand({_hashBSONElement: 1})
        { "key" : 1, "seed" : 0, "out" : NumberLong("5902408780260971510"), "ok" : 1 }

        // hash on "1" as a string
        > db.runCommand({_hashBSONElement: "1"})
        { "key" : "1", "seed" : 0, "out" : NumberLong("-2448670538483119681"), "ok" : 1 }
  52. Enabling Hashed Indexes
    •  Create index: db.collection.ensureIndex({field: "hashed"})

  53. Using Hash Shard Keys
    •  Enable sharding on collection: sh.shardCollection("test.collection", {field: "hashed"})
  54. Sharding on Hashed ObjectId

        // enabling sharding on test database
        mongos> sh.enableSharding("test")
        { "ok" : 1 }

        // shard by hashed _id field
        mongos> sh.shardCollection("test.hash", {_id: "hashed"})
        { "collectionsharded" : "test.hash", "ok" : 1 }
  55. Pre-Splitting the Data

        databases:
            { "_id" : "test", "partitioned" : true, "primary" : "shard0001" }
                test.hash
                    shard key: { "_id" : "hashed" }
                    chunks:
                        shard0000  2
                        shard0001  2
                    { "_id" : { "$minKey" : 1 } } -->> { "_id" : NumberLong("-4611686018427387902") } on : shard0000 { "t" : 2000, "i" : 2 }
                    { "_id" : NumberLong("-4611686018427387902") } -->> { "_id" : NumberLong(0) } on : shard0000 { "t" : 2000, "i" : 3 }
                    { "_id" : NumberLong(0) } -->> { "_id" : NumberLong("4611686018427387902") } on : shard0001 { "t" : 2000, "i" : 4 }
                    { "_id" : NumberLong("4611686018427387902") } -->> { "_id" : { "$maxKey" : 1 } } on : shard0001 { "t" : 2000, "i" : 5 }
  56. Inserting Into Hashed Shard Key Collection

        // create a loop inserting data
        mongos> for (x=0; x<10000; x++) {
        ... db.hash.insert({value:x})
        ... }
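Before this loop runs, the empty collection already has four chunks with boundaries at -4611686018427387902, 0 and 4611686018427387902. Those are close to, but not exactly, ±2^62 (which ends in 904). One calculation that reproduces them is sketched below with BigInt arithmetic. It is offered as a guess at the arithmetic rather than a statement of how the server computes its split points; `preSplitPoints` and its `numChunks` parameter are hypothetical.

```javascript
const INT64_MAX = 2n ** 63n - 1n;

// Evenly spaced split points across the signed 64-bit hash space, mirrored around 0.
function preSplitPoints(numChunks) {
  const n = BigInt(numChunks);
  if (n < 2n) return []; // a single chunk needs no split point
  const interval = (INT64_MAX / n) * 2n; // BigInt division truncates
  const points = [];
  let current;
  if (n % 2n === 0n) {
    points.push(0n); // an even chunk count splits at zero
    current = interval;
  } else {
    current = interval / 2n; // an odd chunk count straddles zero
  }
  for (let i = 0n; i < (n - 1n) / 2n; i++) {
    points.push(current, -current);
    current += interval;
  }
  return points.sort((a, b) => (a < b ? -1 : a > b ? 1 : 0));
}

// Two shards with two chunks each, as in the listing on the previous slide.
console.log(preSplitPoints(4).map(String));
```

The truncating division is what yields a boundary ending in 902 rather than 904.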
  57. Even Distribution of Chunks

        test.hash
            shard key: { "_id" : "hashed" }
            chunks:
                shard0000  4
                shard0001  4
            { "_id" : { "$minKey" : 1 } } -->> { "_id" : NumberLong("-7374407069602479355") } on : shard0000 { "t" : 2000, "i" : 8 }
            { "_id" : NumberLong("-7374407069602479355") } -->> { "_id" : NumberLong("-4611686018427387902") } on : shard0000 { "t" : 2000, "i" : 9 }
            { "_id" : NumberLong("-4611686018427387902") } -->> { "_id" : NumberLong("-2456929743513174890") } on : shard0000 { "t" : 2000, "i" : 6 }
            { "_id" : NumberLong("-2456929743513174890") } -->> { "_id" : NumberLong(0) } on : shard0000 { "t" : 2000, "i" : 7 }
            { "_id" : NumberLong(0) } -->> { "_id" : NumberLong("1483539935376971743") } on : shard0001 { "t" : 2000, "i" : 12 }
  58. Hash Keys Are Great for Equality Queries
    •  Equality queries directed to a specific shard
    •  Will use the index
    •  Most efficient query possible
  59. Explain Plan of an Equality Query

        mongos> db.hash.find({x:1}).explain()
        {
            "cursor" : "BtreeCursor x_hashed",
            "n" : 1,
            "nscanned" : 1,
            "nscannedObjects" : 1,
            "millisShardTotal" : 0,
            "numQueries" : 1,
            "numShards" : 1,
            "indexBounds" : {
                "x" : [
                    [ NumberLong("5902408780260971510"), NumberLong("5902408780260971510") ]
                ]
            },
            "millis" : 0
        }
  60. Not So Good for a Range Query
    •  Range queries are scatter gather
    •  They don’t use the index
    •  Inefficient query
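A small simulation shows why a range on the original key cannot be targeted once the key is hashed. It assumes the two-chunk layout from the earlier diagrams (negative hashes on Shard 1, the rest on Shard 2) and uses the first 64 bits of a plain-string MD5 as a stand-in hash; the server's real hash values differ, and all function names here are hypothetical.

```javascript
const nodeCrypto = require("crypto");

// Stand-in hash: first 64 bits of MD5 over the key's string form (an assumption;
// the server hashes the BSON element instead).
function hashKey(x) {
  return nodeCrypto.createHash("md5").update(String(x)).digest().readBigInt64BE(0);
}

// Assumed two-chunk layout: minKey → 0 on Shard 1, 0 → maxKey on Shard 2.
function shardForHash(h) {
  return h < 0n ? "Shard 1" : "Shard 2";
}

// Equality: one value, one hash, one shard.
function shardsForEquality(x) {
  return new Set([shardForHash(hashKey(x))]);
}

// Open range (lo, hi): each value in between hashes to an unrelated position.
function shardsForRange(lo, hi) {
  const shards = new Set();
  for (let x = lo + 1; x < hi; x++) shards.add(shardForHash(hashKey(x)));
  return shards;
}

console.log([...shardsForEquality(1)]);
console.log([...shardsForRange(1, 99)].sort());
```

Neighbouring key values end up on both shards, so no contiguous run of chunks covers the range and the query has to be sent everywhere.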
  61. Explain Plan of a Range Query

        mongos> db.hash.find({x:{$gt:1, $lt:99}}).explain()
        {
            "cursor" : "BasicCursor",
            "n" : 97,
            "nChunkSkips" : 0,
            "nYields" : 0,
            "nscanned" : 1000,
            "nscannedAllPlans" : 1000,
            "nscannedObjects" : 1000,
            "nscannedObjectsAllPlans" : 1000,
            "millisShardTotal" : 0,
            "millisShardAvg" : 0,
            "numQueries" : 2,
            "numShards" : 2,
            "millis" : 3
        }
  62. Limitations
    •  Cannot use a compound key
    •  Key cannot have an array value
    •  Incompatible with tag aware sharding
        –  Tags would be assigned the value of the hash, not the value of the underlying key
    •  Key with poor cardinality is going to give a hash with poor cardinality
        –  Floating point numbers are squashed, e.g. 100.4 will be hashed as 100
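The floating point limitation is easy to demonstrate in isolation. The sketch models only the truncation step that happens before hashing; `hashInput` is a hypothetical name and the hash itself is left out.

```javascript
// Model of the truncation step only: numbers lose their fractional part
// before being hashed, so distinct floats can collapse onto one value.
function hashInput(value) {
  return typeof value === "number" ? Math.trunc(value) : value;
}

const readings = [100, 100.4, 100.9];
const distinctInputs = new Set(readings.map(hashInput));
console.log(distinctInputs.size); // 1
```

Three distinct key values collapse to a single hash input, so they always share a chunk. A field made up mostly of fractional values within a narrow range therefore gives the hashed key poor cardinality.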
  63. Summary
    •  There are 3 different approaches for sharding
    •  Hash shard keys give great distribution
    •  Hash shard keys are good for equality
    •  Pick the right shard key for your application
  64. Thank You
    Brandon Black (@brandonmblack), Software Engineer, 10gen. #MongoDBDays