MongoNYC 2012: Sharding Best Practices (Advanced Session)

mongodb
May 29, 2012

Antoine Girbal, 10gen. Sharding allows you to distribute load across multiple servers and keep your data balanced across those servers. This session reviews MongoDB’s sharding support, including an architectural overview, design principles, and automation. It also presents several scenarios for handling large amounts of data, along with the best sharding strategy for each.

Transcript

  1. Sharding best practices: Ways to scale

    • By optimization
      • schema design: embed and duplicate
      • indices: critical for performance
      • hardware: RAM, RAID10, SSD (super fast!)
    • By replication
      • scales reads only
      • application must accept eventual consistency
  2. To shard or not to shard

    • Reasons to shard:
      • whole data does not fit on one server’s storage
      • working set does not fit in one server’s RAM
      • write volume is too high for one server’s disk
    • Make sure you account for indices and fragmentation!
    • If none of those will apply any time soon, keep your setup simple: no need to shard!
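
The three reasons above can be turned into a rough capacity check. This is only a sketch: the function name, the field names, and the default fragmentation factor of 1.3 are illustrative assumptions, not MongoDB settings or measured values.

```javascript
// Hypothetical helper: every name and the 1.3 factor are assumptions.
function reasonsToShard(load, server) {
  // Storage must hold documents plus indices, inflated by fragmentation.
  const storageNeeded =
    (load.dataGB + load.indexGB) * (load.fragmentationFactor || 1.3);
  const reasons = [];
  if (storageNeeded > server.diskGB) reasons.push("storage");
  if (load.workingSetGB > server.ramGB) reasons.push("working set");
  if (load.writesPerSec > server.maxWritesPerSec) reasons.push("write volume");
  return reasons; // an empty array means: keep the setup simple
}

const server = { diskGB: 1000, ramGB: 64, maxWritesPerSec: 5000 };
console.log(reasonsToShard(
  { dataGB: 300, indexGB: 50, workingSetGB: 40, writesPerSec: 1200 }, server));
// → []
console.log(reasonsToShard(
  { dataGB: 700, indexGB: 120, workingSetGB: 90, writesPerSec: 1200 }, server));
// → [ 'storage', 'working set' ]
```

The numbers fed in should come from projected growth, not only from current usage, since the slide asks whether the limits will apply "any time soon".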
  3. Sharding is ...

    • a complex operation for the db cluster...
    • made easy and transparent to the app ...
    • but must be well understood for best results!
  4. Dos and Don’ts

    • Do ...
      • run mongoS on each app server as a proxy
      • run config dbs on a micro instance or a mongod server
      • use 3+ node replica sets for each shard
    • Don’t ...
      • run many mongoS behind a load balancer
      • run only 1 config db, or run 3 on the same server
  5. Picking a shard key

    • The following aspects must be considered:
      • Cardinality: how much data for a single value?
      • Write distribution: how many shards are written to?
      • Query isolation: how many shards will be hit?
      • Reliability: how much of the system is affected by a shard failure?
      • Index locality: how much of the key’s index needs to be in RAM?
  6. Example: email storage

    • Most common scenario, can be applied to 90% of cases
    • Each document can be up to 16MB
    • Each user may have GBs of storage
    • Most common query: get user emails sorted by time
    • Index on {_id}, {user, time}, {recipients}

      { _id: ObjectId(), user: 123, time: Date(), subject: "...",
        recipients: [], body: "...", attachments: [] }
  7. Example: email storage

    | Shard key  | Cardinality | Write scaling | Query isolation                     | Reliability         | Index locality |
    |------------|-------------|---------------|-------------------------------------|---------------------|----------------|
    | _id        | Doc level   | 1 shard       | all shards; index sort + merge sort | all users affected  | Great          |
    | hash(_id)  | Hash level  | All shards    | all shards; index sort + merge sort | all users affected  | Poor           |
    | user       | Many docs   | All shards    | 1 shard; index sort                 | some users affected | So-so          |
    | user, time | Doc level   | All shards    | 1 shard; index sort                 | some users affected | Good           |
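
The "Write scaling" difference between an increasing key such as `_id` and a hashed key in the comparison above can be illustrated with a toy model. This is a sketch under stated assumptions: FNV-1a is a stand-in hash (MongoDB's own hashing differs), placing a hashed key by modulo is a simplification of real chunk placement, and the split points, shard count, and function names are all made up for the example.

```javascript
// Stand-in 32-bit FNV-1a hash; NOT the hash MongoDB uses.
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h = Math.imul(h ^ str.charCodeAt(i), 0x01000193) >>> 0;
  }
  return h;
}

// Range sharding: chunk i covers [splits[i-1], splits[i]); in this toy
// model chunks are assigned to shards round-robin.
function rangeShard(key, splits, shardCount) {
  let chunk = 0;
  while (chunk < splits.length && key >= splits[chunk]) chunk++;
  return chunk % shardCount;
}

// Simplified hashed placement: hash modulo the number of shards.
function hashedShard(key, shardCount) {
  return fnv1a(String(key)) % shardCount;
}

// Number of distinct shards that receive at least one write.
function shardsWritten(keys, route) {
  return new Set(keys.map(route)).size;
}

// New, ever-increasing keys all exceed the last split point, so every
// insert lands in the final chunk.
const splits = [1000, 2000, 3000];
const newKeys = Array.from({ length: 100 }, (_, i) => 5000 + i);
console.log(shardsWritten(newKeys, k => rangeShard(k, splits, 4)));  // → 1
console.log(shardsWritten(newKeys, k => hashedShard(k, 4)));         // → 4
```

The same model also hints at the cost of hashing: neighbouring keys end up on different shards, so a range query can no longer be isolated to one shard.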
  8. Query routing

    • Process is optimized for low latency:
      • mongoS decides which shards are involved
      • sends the query to all those shards, pulling the 1st batch
      • pulls one shard at a time, or in parallel for merge sort
    • Try to have queries use as few shards as possible
    • Geo-index queries will use all shards
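
The two routing steps above (choosing the shards, then merging sorted results) can be sketched in a few lines. This is an illustrative model, not mongoS code: the chunk map, the shard names, and both function names are assumptions made for the example.

```javascript
// Toy chunk map: each chunk covers [min, max) of the shard key.
const chunks = [
  { min: -Infinity, max: 100, shard: "S1" },
  { min: 100, max: 200, shard: "S2" },
  { min: 200, max: Infinity, shard: "S3" },
];

// Shards owning at least one chunk that overlaps the range [lo, hi].
function targetShards(chunkMap, lo, hi) {
  const hit = chunkMap
    .filter(c => c.min <= hi && lo < c.max)
    .map(c => c.shard);
  return [...new Set(hit)];
}

// Merge batches that each shard has already sorted using its own index.
function mergeSorted(batches, compare) {
  const pos = batches.map(() => 0);
  const out = [];
  for (;;) {
    let best = -1;
    for (let i = 0; i < batches.length; i++) {
      if (pos[i] >= batches[i].length) continue; // this batch is exhausted
      if (best < 0 ||
          compare(batches[i][pos[i]], batches[best][pos[best]]) < 0) {
        best = i;
      }
    }
    if (best < 0) return out; // every batch is exhausted
    out.push(batches[best][pos[best]++]);
  }
}

console.log(targetShards(chunks, 120, 130));  // → [ 'S2' ]
console.log(targetShards(chunks, 50, 250));   // → [ 'S1', 'S2', 'S3' ]
console.log(mergeSorted([[1, 4, 9], [2, 3], [5]], (a, b) => a - b));
// → [ 1, 2, 3, 4, 5, 9 ]
```

A query that does not constrain the shard key corresponds to the range from -Infinity to Infinity here, which is why it reaches every shard.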
  9. Routing tips

    • Use slaveOk to route read queries to the secondary servers
    • The secondary is not yet selected by latency, but will be in v2.2. This will allow distant DCs to do reads locally.
    • Use multi=true for update operations: all matching objects are updated
    • Updates either use a single shard, or are broadcast to all shards
  10. Splitting

    • Splitting is automatically triggered by a mongoS if a chunk goes beyond the max size
    • It is a fairly inexpensive operation that scans a portion of the index
    • Target chunk size (64MB default) can be changed, with progressive effect
    • Splitting can be triggered manually using split
    • Pre-splitting is strongly encouraged if the shard key is mostly increasing
  11. Pre-splitting

    • Should be done before a large data import, when the key range is known
    • Should be done if the key is mostly increasing, to spread writes across shards
    • Either let the balancer move the new chunks, or combine with a moveChunk call

      mongos> for (var x = 97; x < 97 + 26; x++) {
      ...   var prefix = String.fromCharCode(x);
      ...   db.adminCommand({split: "test.sharded", middle: {email: prefix}});
      ... }
      { "ok" : 1 }
      mongos> db.getSiblingDB("config").chunks.find({ns: "test.sharded"})
      { "_id" : "test.sharded-email_MinKey", "lastmod" : { "t" : 2000, "i" : 0 }, "ns" : "test.sharded", "min" : { "email" : { $minKey : 1 } }, "max" : { "email" : "a" }, "shard" : "shard0001" }
      { "_id" : "test.sharded-email_\"a\"", "lastmod" : { "t" : 3000, "i" : 0 }, "ns" : "test.sharded", "min" : { "email" : "a" }, "max" : { "email" : "b" }, "shard" : "shard0001" }
      { "_id" : "test.sharded-email_\"b\"", "lastmod" : { "t" : 4000, "i" : 0 }, "ns" : "test.sharded", "min" : { "email" : "b" }, "max" : { "email" : "c" }, "shard" : "shard0001" }
      ...
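
When pre-splitting is combined with moveChunk, it can help to build the command documents first and review them before running anything. The sketch below does only that: it generates the same `split` documents as the shell session above, plus `moveChunk` documents that spread the chunks round-robin. The helper names and the shard name `shard0000` are assumptions; nothing here talks to a cluster.

```javascript
// Build split command documents for a known list of split points.
function presplitCommands(ns, field, splitPoints) {
  return splitPoints.map(point => ({ split: ns, middle: { [field]: point } }));
}

// Build moveChunk command documents, assigning the chunk that starts at
// each split point to the shards in round-robin order.
function moveChunkCommands(ns, field, splitPoints, shards) {
  return splitPoints.map((point, i) => ({
    moveChunk: ns,
    find: { [field]: point },
    to: shards[i % shards.length],
  }));
}

// One split point per lowercase letter, "a" through "z".
const letters = Array.from({ length: 26 }, (_, i) => String.fromCharCode(97 + i));
const splitCmds = presplitCommands("test.sharded", "email", letters);
const moveCmds = moveChunkCommands("test.sharded", "email", letters,
                                   ["shard0000", "shard0001"]);

console.log(splitCmds.length);              // → 26
console.log(JSON.stringify(splitCmds[0]));
// → {"split":"test.sharded","middle":{"email":"a"}}
console.log(JSON.stringify(moveCmds[1]));
// → {"moveChunk":"test.sharded","find":{"email":"b"},"to":"shard0001"}
```

In the mongo shell, each generated document would be passed to `db.adminCommand(...)`, splits first and moves afterwards.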
  12. Balancing

    • Triggered automatically by a mongoS if there is a chunk imbalance (8+ chunks)
    • Usually involves moving the oldest chunk from the largest shard to the smallest shard (page faults)
    • Migration can be fairly expensive:
      • non-sequential disk scan
      • duplication of writes on 2 shards
      • large number of inserts and deletes
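
The trigger described above amounts to comparing chunk counts between the fullest and the emptiest shard. A minimal sketch, assuming the only input is a per-shard chunk count and using the 8-chunk figure from the slide; the function name and the shard names are hypothetical, and the real balancer takes more into account.

```javascript
// Decide whether a migration is due, and between which shards.
function pickMigration(chunkCounts, threshold = 8) {
  const shards = Object.keys(chunkCounts);
  if (shards.length === 0) return null;
  let from = shards[0];
  let to = shards[0];
  for (const s of shards) {
    if (chunkCounts[s] > chunkCounts[from]) from = s; // most chunks
    if (chunkCounts[s] < chunkCounts[to]) to = s;     // fewest chunks
  }
  // Below the threshold the cluster counts as balanced.
  if (chunkCounts[from] - chunkCounts[to] < threshold) return null;
  return { from, to };
}

console.log(pickMigration({ S1: 20, S2: 11, S3: 14 }));
// → { from: 'S1', to: 'S2' }
console.log(pickMigration({ S1: 12, S2: 10, S3: 14 }));
// → null
```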
  13. Balancing tips

    • Run the balancer at low traffic times:

      db.settings.update({_id: "balancer"}, {$set: {activeWindow: {start: "9:00", stop: "21:00"}}})

    • Can be triggered manually using moveChunk
    • Feel free to write your own smart balancer!
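
For anyone writing their own balancer, a first building block is checking whether the current time falls inside a window like the activeWindow setting above. The sketch below assumes "H:MM" strings, an inclusive start and an exclusive stop, and support for windows that wrap past midnight; those boundary rules and the function names are assumptions, not documented MongoDB behaviour.

```javascript
// Convert "H:MM" or "HH:MM" to minutes since midnight.
function toMinutes(hhmm) {
  const [h, m] = hhmm.split(":").map(Number);
  return h * 60 + m;
}

// True when `now` lies in [start, stop), including windows such as
// 23:00 to 6:00 that wrap past midnight.
function inActiveWindow(now, start, stop) {
  const t = toMinutes(now);
  const a = toMinutes(start);
  const b = toMinutes(stop);
  return a <= b ? (t >= a && t < b) : (t >= a || t < b);
}

console.log(inActiveWindow("10:30", "9:00", "21:00"));  // → true
console.log(inActiveWindow("22:15", "9:00", "21:00"));  // → false
console.log(inActiveWindow("2:00", "23:00", "6:00"));   // → true
```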
  14. Indexing

    • Each mongod applies normal indexing on owned data
    • Each mongoS has an in-memory btree of the chunks
    • Secondary indexing is mostly intact! Cannot enforce uniqueness, though.
  15. Indexing

    [Diagram: a mongoS holding the chunks on a, in front of four shards S1, S2, S3 and S4; each shard has an index on { a: 1 } and an index on { b: 1 }]
    • Shard key on a
    • Secondary index on b
  16. Adding capacity

    • To add a shard, use the addshard command via mongos
    • Chunks will be moved to the new shard one at a time - can take a while!
    • To add a replica, administer that replica directly
    • Do
      • add each shard as a replica set
      • have shards of the same capacity
      • use hostnames instead of IPs
  17. Removing capacity

    • To remove a shard:
      • use the removeshard command via mongos. This will drain all the chunks.
      • Once the chunks are moved, use movePrimary on the databases that use this shard as their default (primary) shard.
      • Run a final removeshard to remove all metadata.
    • To remove a replica, administer that replica directly
  18. Backup Process

    • Process is:
      • Stop the balancer: sh.setBalancerState(false)
      • Check there is no current migration: sh.isBalancerRunning()
      • Back up 1 config server (they’re identical!)
      • Back up each shard, following the replica set procedure
      • Restart the balancer
    • For large clusters, do the backup at the end of balancing rounds
  19. Gotchas

    • Deleting and recreating a sharded collection does not work well. Either:
      • plan on restarting all mongos
      • version the collection name (e.g. foo_123)
  20. The future

    • Shard key hashing (v2.2)
      • index is now created with hashed flag
      • shard key is declared with hashed flag
      • content will be optimally distributed across shards
      • cannot enforce unique, and range queries are not optimized
    • Geo-aware sharding (v2.2)
      • a shard is marked as belonging to a region (e.g. US)
  21. Q & A

    • Thanks for your attention!
    • Download and scale MongoDB!
    • Come see 10gen at “ask the experts”