MongoDB UK 2012: Sharding Best Practices

mongodb
June 20, 2012

Daniel Roberts, 10gen
Sharding allows you to distribute load across multiple servers and keep your data balanced across those servers. This session will review MongoDB’s sharding support, including an architectural overview, design principles, and automation. It will also present varied scenarios for handling large amounts of data, along with the best sharding strategy for each.

Transcript

  1. Sharding best practices: Ways to scale

    • By optimisation
      • schema design: embed and duplicate
      • indices: critical for performance
      • hardware: RAM, RAID10, SSD (super fast!)
    • By replication
      • scaling reads only
      • application must accept eventual consistency
  2. Sharding best practices: To shard or not to shard

    • Reasons to shard:
      • whole data does not fit on one server’s storage
      • working set does not fit in one server’s RAM
      • write volume is too high for one server’s disk
    • Make sure you account for indices and fragmentation!
    • If those won’t apply any time soon, keep your setup simple, no need to shard!
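
The three reasons to shard on the slide above can be read as a rough checklist. The sketch below is plain JavaScript, not a MongoDB API: every field name and figure is a hypothetical input that you would take from your own monitoring, and the fragmentation factor is an assumed multiplier used to "account for indices and fragmentation".

```javascript
// Rough "should we shard?" checklist based on the three reasons on the slide.
// All field names and figures are hypothetical inputs, not MongoDB APIs.
function reasonsToShard(server, workload) {
  // Account for indices and fragmentation, as the slide advises.
  const onDiskGB = (workload.dataGB + workload.indexGB) * workload.fragmentationFactor;
  const reasons = [];
  if (onDiskGB > server.storageGB) {
    reasons.push("data does not fit on one server's storage");
  }
  if (workload.workingSetGB > server.ramGB) {
    reasons.push("working set does not fit in one server's RAM");
  }
  if (workload.writesPerSec > server.maxWritesPerSec) {
    reasons.push("write volume is too high for one server's disk");
  }
  return reasons;
}

// Made-up capacity and workload numbers for illustration.
const server = { storageGB: 1000, ramGB: 64, maxWritesPerSec: 5000 };
const smallWorkload = { dataGB: 200, indexGB: 20, fragmentationFactor: 1.2, workingSetGB: 30, writesPerSec: 800 };
const largeWorkload = { dataGB: 900, indexGB: 150, fragmentationFactor: 1.2, workingSetGB: 90, writesPerSec: 800 };

console.log(reasonsToShard(server, smallWorkload)); // empty list: keep the setup simple
console.log(reasonsToShard(server, largeWorkload)); // storage and RAM limits exceeded
```

An empty list corresponds to the slide's advice to keep the setup simple.
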
  3. Sharding best practices: What are the metrics saying?

    • Use monitoring tools
      • mongostat
      • db.serverStatus()
      • iostat
      • MMS - http://mms.10gen.com/
    • Working set and indexes
      • page faults and BTree index misses
  4. Sharding best practices: Dos and Don’ts

    • Do ...
      • run mongoS on each app server as a proxy
      • run config dbs on a micro instance or a mongod server
      • use 3+ node replica sets for each shard
    • Don’t ...
      • run many mongoS behind a load balancer
      • run only 1 config db, or run 3 on the same server
  5. Sharding best practices: Picking a shard key

    The following aspects must be considered:
    • Cardinality: how much data for a single value?
    • Write distribution: how many shards are written to?
    • Query isolation: how many shards will be hit?
    • Reliability: how much of the system is affected by a shard failure?
    • Index locality: how much of the key’s index needs to be in RAM?
  6. Sharding best practices: Incremental, right-balanced access

    Key types: time-based, ObjectId, auto-increment
    • Only have to keep a small portion in RAM
    • Right shard "hot"
  7. Sharding best practices: Random distribution

    Key type: hash
    • Have to keep entire index in RAM
    • All shards "warm"
  8. Sharding best practices: Segmented access

    Key type: month + hash
    • Have to keep entire index in RAM
    • Some shards "warm"
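
The difference between an incremental key (right shard "hot") and a hashed key (all shards "warm") can be illustrated with a toy model. The sketch below assumes four shards that each own one contiguous range of the key space. The hash function is a simple stand-in chosen for the demo, not the hash MongoDB uses, and all names are hypothetical.

```javascript
// Toy model: 4 shards, each owning one contiguous range of the shard key space.
const SHARD_COUNT = 4;

// Route a key in [0, keySpace) to the shard owning its range. Keys beyond the
// last boundary land on the last shard (its range is open-ended).
function shardForRange(key, keySpace) {
  return Math.min(SHARD_COUNT - 1, Math.floor(key / (keySpace / SHARD_COUNT)));
}

// Multiplicative hash onto 32 bits (illustrative only, not MongoDB's hash).
function toyHash(n) {
  return Math.imul(n, 2654435761) >>> 0;
}

// Count how many writes each shard receives under a given routing function.
function countWrites(keys, route) {
  const counts = new Array(SHARD_COUNT).fill(0);
  for (const k of keys) counts[route(k)] += 1;
  return counts;
}

// 1000 new documents whose keys keep increasing past the existing data,
// which is assumed to occupy keys 0..9999.
const newKeys = Array.from({ length: 1000 }, (_, i) => 10000 + i);

const incrementalWrites = countWrites(newKeys, (k) => shardForRange(k, 10000));
const hashedWrites = countWrites(newKeys, (k) => shardForRange(toyHash(k), 2 ** 32));

console.log("incremental key:", incrementalWrites); // every write lands on the last shard
console.log("hashed key:     ", hashedWrites);      // writes spread over all shards
```

The trade-off from the slides still applies: the hashed key spreads writes, but its index is touched everywhere, so the whole index has to stay in RAM.
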
  9. Sharding best practices: Example: email storage

    • Most common scenario, can be applied to 90% of cases
    • Each document can be up to 16MB
    • Each user may have GBs of storage
    • Most common query: get user emails sorted by time
    • Index on {_id}, {user, time}, {recipients}

    { _id: ObjectId(), user: 123, time: Date(), subject: "...", recipients: [], body: "...", attachments: [] }
  10. Sharding best practices: Example: email storage

    Shard key    Cardinality   Write scaling   Query isolation                         Reliability           Index locality
    _id          Doc level     1 shard         all shards, index sort + merge sort     all users affected    Great
    hash(_id)    Hash level    All shards      all shards, index sort + merge sort     all users affected    Poor
    user         Many docs     All shards      1 shard, index sort                     some users affected   So-so
    user, time   Doc level     All shards      1 shard, index sort                     some users affected   Good
  11. Sharding best practices: Query routing

    • Process is optimized for low latency:
      • mongoS decides which shards are involved
      • sends query to all those shards, pulling 1st batch
      • pulls one shard at a time, or in parallel for merge sort
    • Try to have queries use as few shards as possible
    • Geo-index queries will use all shards
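
To make "use as few shards as possible" concrete, here is a toy version of the routing decision, assuming the email collection is sharded on a key that starts with user. The chunk table, shard names and ranges are made up for illustration, and the function is a simplification of what a mongoS does, not its actual logic.

```javascript
// Toy chunk table: each chunk covers a range of the "user" part of the shard
// key and lives on one shard. Names and ranges are hypothetical.
const chunkTable = [
  { min: -Infinity, max: 100, shard: "shard0" },
  { min: 100, max: 200, shard: "shard1" },
  { min: 200, max: 300, shard: "shard2" },
  { min: 300, max: Infinity, shard: "shard3" },
];

// Return the shards a query must visit.
// A query that pins the shard key is targeted at the owning shard; any other
// query is broadcast to every shard (scatter-gather).
function targetShards(query) {
  if (typeof query.user !== "number") {
    return [...new Set(chunkTable.map((c) => c.shard))];
  }
  return chunkTable
    .filter((c) => query.user >= c.min && query.user < c.max)
    .map((c) => c.shard);
}

const byUser = targetShards({ user: 123 });
const byRecipient = targetShards({ recipients: "someone@example.com" });

console.log("query by user:      ", byUser);      // a single shard
console.log("query by recipients:", byRecipient); // every shard
```

This is why the "user" and "user, time" keys score "1 shard" for query isolation in the email example, while lookups by recipients hit all shards.
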
  12. Impact on Schema Design

    { _id: "alvin", display: "jonnyeight", addresses: [ { state: "CA", country: "USA" }, { country: "UK" } ] }

    • Shard on { _id: 1 }
    • Lookup by _id hits 1 node
    • Index on { "addresses.country": 1 }
  13. Multiple Identities - Example

    User can have multiple identities:
    • twitter name
    • email address
    • facebook name
    • etc.
    What is the best sharding key & schema design?
  14. Multiple Identities - Solution 1

    {
      _id: "alvin",
      display: "jonnyeight",
      fb: "alvin.richards",    // facebook
      li: "alvin.j.richards",  // linkedin
      addresses: [ { state: "CA", country: "USA" }, { country: "UK" } ]
    }

    • Shard on { _id: 1 }
    • Lookup by _id hits 1 node
    • Lookup by li or fb is scatter gather
    • Cannot create a unique index on li or fb
  15. Multiple Identities - Solution 2

    identities
    { type: "_id", val: "alvin", info: "1200-42" }
    { type: "fb", val: "alvin.richards", info: "1200-42" }
    { type: "li", val: "alvin.j.richards", info: "1200-42" }

    info
    { _id: "1200-42", addresses: [ { state: "CA", country: "USA" }, { country: "UK" } ] }

    • Shard identities on { type: 1, val: 1 }
    • Lookup by type & val hits 1 node
    • Can create unique index on type & val
    • Shard info on { _id: 1 }
    • Lookup info on _id hits one node
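
Solution 2 trades one scatter-gather query for two targeted lookups. The sketch below mimics that with in-memory maps standing in for the identities and info collections. The findUser helper and the map keys are hypothetical names for illustration; a real application would issue two queries through a driver instead.

```javascript
// In-memory stand-ins for the two collections on the slide.
// identities is keyed by (type, val), matching its shard key { type: 1, val: 1 }.
const identities = new Map([
  ["_id:alvin", { type: "_id", val: "alvin", info: "1200-42" }],
  ["fb:alvin.richards", { type: "fb", val: "alvin.richards", info: "1200-42" }],
  ["li:alvin.j.richards", { type: "li", val: "alvin.j.richards", info: "1200-42" }],
]);

// info is keyed by _id, matching its shard key { _id: 1 }.
const info = new Map([
  ["1200-42", { _id: "1200-42", addresses: [{ state: "CA", country: "USA" }, { country: "UK" }] }],
]);

// Two targeted lookups: the identity first, then the info document it points to.
function findUser(type, val) {
  const identity = identities.get(`${type}:${val}`);
  if (!identity) return null;
  return info.get(identity.info) || null;
}

console.log(findUser("fb", "alvin.richards")); // the shared info document
console.log(findUser("li", "nobody"));         // null, no such identity
```

Because each (type, val) pair maps to exactly one entry, the same structure also explains why a unique index on type & val is possible here.
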
  16. Sharding best practices: Splitting

    • Splitting is automatically triggered by a mongoS if a chunk goes beyond the max size
    • It is a fairly inexpensive operation that scans a portion of the index
    • Target chunk size (64MB default) can be changed with progressive effect
    • Splitting can be triggered manually using split
    • Pre-splitting is strongly encouraged if shard key is mostly increasing
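
A toy model helps show why a mostly increasing shard key benefits from pre-splitting. The sketch below splits a chunk at its median key once it exceeds a maximum size. It counts documents rather than megabytes, the threshold of 4 is a hypothetical demo value, and the split-at-median rule is an assumption for illustration rather than MongoDB's exact algorithm.

```javascript
// Toy model of chunk splitting. Size is counted in documents here, whereas
// MongoDB measures chunk size in MB (64MB default per the slide).
const MAX_DOCS_PER_CHUNK = 4; // hypothetical threshold for the demo

function insertKey(chunks, key) {
  // Find the chunk whose range [min, max) contains the key.
  const i = chunks.findIndex((c) => key >= c.min && key < c.max);
  const chunk = chunks[i];
  chunk.keys.push(key);
  chunk.keys.sort((a, b) => a - b);
  if (chunk.keys.length > MAX_DOCS_PER_CHUNK) {
    // Split at the median key: the lower half keeps the original min bound.
    const mid = chunk.keys[Math.floor(chunk.keys.length / 2)];
    const left = { min: chunk.min, max: mid, keys: chunk.keys.filter((k) => k < mid) };
    const right = { min: mid, max: chunk.max, keys: chunk.keys.filter((k) => k >= mid) };
    chunks.splice(i, 1, left, right);
  }
  return chunks;
}

// Start with a single chunk covering the whole key space, then insert
// strictly increasing keys.
const splitChunks = [{ min: -Infinity, max: Infinity, keys: [] }];
for (let key = 1; key <= 10; key++) insertKey(splitChunks, key);

console.log(splitChunks.map((c) => `[${c.min}, ${c.max}) holds ${c.keys.length} docs`));
```

With increasing keys every insert and every split happens in the last, open-ended chunk, which is the situation pre-splitting is meant to avoid.
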
  17. Sharding best practices: Balancing

    • Triggered automatically by a mongoS if there is a chunk imbalance (8+ chunks)
    • Usually involves moving the oldest chunk from the largest shard to the smallest shard (page faults..)
    • Migration can be fairly expensive:
      • non-sequential disk scan
      • large number of inserts & deletes
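
The balancing rule can be sketched as a loop that moves one chunk at a time from the most loaded shard to the least loaded one. The threshold of 8 comes from the slide. The stopping condition (stop as soon as the gap drops below the threshold) and the shard names are simplifying assumptions, not a description of the real balancer.

```javascript
// Toy balancer: while the gap between the shard with the most chunks and the
// shard with the fewest is at least the threshold, migrate one chunk across.
const IMBALANCE_THRESHOLD = 8; // "8+ chunks" from the slide

function balance(chunkCounts) {
  const counts = { ...chunkCounts }; // do not mutate the caller's object
  let migrations = 0;
  for (;;) {
    const names = Object.keys(counts);
    const largest = names.reduce((a, b) => (counts[a] >= counts[b] ? a : b));
    const smallest = names.reduce((a, b) => (counts[a] <= counts[b] ? a : b));
    if (counts[largest] - counts[smallest] < IMBALANCE_THRESHOLD) break;
    counts[largest] -= 1;  // chunk leaves the most loaded shard
    counts[smallest] += 1; // and arrives on the least loaded one
    migrations += 1;
  }
  return { counts, migrations };
}

// Hypothetical cluster with one overloaded shard.
const balanced = balance({ shard0: 20, shard1: 4, shard2: 6 });
console.log(balanced.counts, "after", balanced.migrations, "migrations");
```

Each iteration stands for one migration, which is the expensive part: the slide's disk scans, inserts and deletes happen once per moved chunk.
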
  18. Sharding best practices: Balancing tips

    • Run the balancer at low traffic times (run against the config database):
      db.settings.update({ _id: "balancer" }, { $set: { activeWindow: { start: "9:00", stop: "21:00" } } })
    • Can be triggered manually using moveChunk
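
When choosing an activeWindow it helps to check which times of day fall inside it, especially for windows that wrap past midnight. The helper below is a hypothetical sketch of that check in plain JavaScript. It mirrors the idea of the setting and is not MongoDB's own code, and the night-time window is an assumed example value.

```javascript
// Convert an "H:MM" or "HH:MM" string to minutes since midnight.
function toMinutes(hhmm) {
  const [h, m] = hhmm.split(":").map(Number);
  return h * 60 + m;
}

// Decide whether a time of day falls inside a window with start and stop
// times. Handles windows that wrap past midnight (e.g. 23:00 to 6:00).
function inActiveWindow(window, now) {
  const start = toMinutes(window.start);
  const stop = toMinutes(window.stop);
  const t = toMinutes(now);
  if (start <= stop) return t >= start && t < stop; // same-day window
  return t >= start || t < stop;                    // window wraps midnight
}

const slideWindow = { start: "9:00", stop: "21:00" }; // values from the slide
const nightWindow = { start: "23:00", stop: "6:00" }; // hypothetical low-traffic window

console.log(inActiveWindow(slideWindow, "12:30"));
console.log(inActiveWindow(nightWindow, "2:15"));
```
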
  19. Sharding best practices: Indexing

    • Each mongod applies normal indexing on owned data
    • Each mongoS has an in-memory btree of the chunks
    • Secondary indexing is mostly intact! Cannot enforce unique though.
  20. Sharding best practices: Indexing

    [Diagram: a mongoS holding the chunks on a, in front of shards S1, S2, S3 and S4; each shard has an index on { a: 1 } and an index on { b: 1 }]
    • Shard key on a
    • Secondary index on b
  21. Sharding best practices: Removing capacity

    • To remove a shard:
      • Use the removeshard command via mongos. This will drain all the chunks.
      • Once the chunks have been moved, use movePrimary for any database that uses this shard as its default (primary) shard.
      • Run a final removeshard to remove all metadata.
    • To remove a replica, administer that replica directly
  22. Sharding best practices: Backup Process

    Process is:
    • Stop the balancer
      • sh.setBalancerState(false)
    • Check no current migration
      • sh.isBalancerRunning()
    • Backup 1 config server (they’re identical!)
    • Backup each shard, following the replica set procedure
    • Restart the balancer
    Do backup at the end of balancing rounds, for large clusters
  23. Sharding best practices: The future

    • shard key hashing (v2.2)
      • content will be optimally distributed across shards
    • geo-aware sharding (v2.2)
      • a shard is marked as belonging to a region (e.g. US)
  24. Sharding best practices: Summary

    • Shard if you need to - what do the metrics say?
    • Choose shard key wisely - not easily changed
    • Shard early as resources will be required