Slide 1

Slide 1 text

Sharding Best Practices June 2012

Slide 2

Slide 2 text

Solution Architect, based in London
http://www.10gen.com/
@dmroberts
[email protected]

Slide 3

Slide 3 text

Ways to scale
• by optimisation
  • schema design: embed and duplicate
  • indices: critical for performance
  • hardware: RAM, RAID10, SSD (super fast!)
• by replication
  • scaling reads only
  • application must accept eventual consistency

Slide 4

Slide 4 text

To shard or not to shard
• 3 reasons to shard:
  • whole data does not fit on one server’s storage
  • working set does not fit in one server’s RAM
  • write volume is too high for one server’s disk
• Make sure you account for indices and fragmentation!
• If those won’t apply any time soon, keep your setup simple, no need to shard!
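The three reasons above can be turned into a quick capacity check. This is only a sketch: the function name, the field names and every number below are hypothetical, and the fragmentation allowance is an assumed ratio rather than a measured one.

```javascript
// Hypothetical capacity check: all field names and numbers are illustrative.
function reasonsToShard(server, workload) {
  const reasons = [];
  // Total footprint includes indexes plus an allowance for fragmentation.
  const totalBytes =
    (workload.dataBytes + workload.indexBytes) * (1 + workload.fragmentationRatio);
  if (totalBytes > server.storageBytes) reasons.push("storage");
  if (workload.workingSetBytes > server.ramBytes) reasons.push("ram");
  if (workload.writesPerSec > server.maxWritesPerSec) reasons.push("writes");
  return reasons;
}

const GB = 1024 ** 3;
const server = { storageBytes: 500 * GB, ramBytes: 64 * GB, maxWritesPerSec: 5000 };
const workload = {
  dataBytes: 300 * GB,
  indexBytes: 40 * GB,
  fragmentationRatio: 0.2, // assumed 20% overhead
  workingSetBytes: 80 * GB,
  writesPerSec: 1200,
};

// Here only the working set exceeds what one server offers.
console.log(reasonsToShard(server, workload));
```

An empty result would correspond to the "keep your setup simple" case on the slide.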

Slide 5

Slide 5 text

What are the metrics saying?
• Use monitoring tools
  • mongostat
  • db.serverStatus()
  • iostat
  • MMS - http://mms.10gen.com/
• Working set and indexes
  • page faults and BTree index misses

Slide 6

Slide 6 text

Sharding overview

Slide 7

Slide 7 text

Dos and Don’ts
• Do ...
  • run mongoS on each app server as a proxy
  • run config dbs on a micro instance or a mongod server
  • use 3+ node replica sets for each shard
• Don’t ...
  • run many mongoS behind a load balancer
  • run only 1 config db, or run 3 on the same server

Slide 8

Slide 8 text

Picking a shard key
• The following aspects must be considered:
  • Cardinality: how much data for a single value?
  • Write distribution: how many shards are written to?
  • Query isolation: how many shards will be hit?
  • Reliability: how much of the system is affected by a shard failure?
  • Index locality: how much of the key’s index needs to be in RAM?

Slide 9

Slide 9 text

Incremental: right-balanced access
• Only have to keep a small portion of the index in RAM
• Right shard "hot"
Typical keys:
• Time based
• ObjectId
• Auto increment

Slide 10

Slide 10 text

Random distribution
• Have to keep entire index in RAM
• All shards "warm"
Typical key:
• Hash
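A small simulation can make the contrast between incremental and hashed keys concrete. It is a sketch under assumptions: four shards each own one contiguous quarter of a 32-bit key space, and FNV-1a is used purely as a stand-in hash, not as the hash MongoDB uses.

```javascript
const SHARDS = 4;
const KEY_SPACE = 2 ** 32;

// Range-based placement: each shard owns one contiguous quarter of the key space.
function shardForKey(key) {
  return Math.min(SHARDS - 1, Math.floor(key / (KEY_SPACE / SHARDS)));
}

// FNV-1a 32-bit hash over the decimal digits; a stand-in for a real hashing scheme.
function fnv1a(n) {
  let h = 0x811c9dc5;
  for (const ch of String(n)) {
    h ^= ch.charCodeAt(0);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Counts how many writes land on each shard.
function countWrites(keys) {
  const counts = new Array(SHARDS).fill(0);
  for (const k of keys) counts[shardForKey(k)] += 1;
  return counts;
}

// 10,000 new documents with an ever-increasing key near the top of the range.
const start = KEY_SPACE - 20000;
const incremental = Array.from({ length: 10000 }, (_, i) => start + i);
const hashed = incremental.map(fnv1a);

console.log("incremental:", countWrites(incremental)); // every write hits the last shard
console.log("hashed:     ", countWrites(hashed));      // writes spread over the shards
```

The incremental run shows the "hot" right-hand shard; the hashed run trades that for writes (and index pages) touched everywhere.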

Slide 11

Slide 11 text

Segmented access
• Have to keep entire index in RAM
• Some shards "warm"
Typical key:
• Month + Hash

Slide 12

Slide 12 text

Example: email storage
• Most common scenario, can be applied to 90% of cases
• Each document can be up to 16MB
• Each user may have GBs of storage
• Most common query: get user emails sorted by time
• Index on {_id}, {user, time}, {recipients}

{
  _id: ObjectId(),
  user: 123,
  time: Date(),
  subject: "...",
  recipients: [],
  body: "...",
  attachments: []
}

Slide 13

Slide 13 text

Example: email storage

Shard key    Cardinality   Write scaling   Query isolation                        Reliability           Index locality
_id          Doc level     1 shard         all shards, index sort + merge sort    all users affected    Great
hash(_id)    Hash level    All shards      all shards, index sort + merge sort    all users affected    Poor
user         Many docs     All shards      1 shard, index sort                    some users affected   So-so
user, time   Doc level     All shards      1 shard, index sort                    some users affected   Good

Slide 14

Slide 14 text

Query routing
• Process is optimized for low latency:
  • mongoS decides which shards are involved
  • sends query to all those shards, pulling 1st batch
  • pulls one shard at a time, or in parallel for merge sort
• Try to have queries use as few shards as possible
• Geo-index queries will use all shards
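The merge sort step can be sketched as a k-way merge over per-shard results that are already sorted by each shard's own index. The function name and the sample documents are hypothetical; this illustrates the technique and is not mongoS code.

```javascript
// Each input array stands for one shard's result, already sorted by `time`.
function mergeSorted(shardResults, compare) {
  const cursors = shardResults.map(() => 0); // next unread position per shard
  const merged = [];
  for (;;) {
    let best = -1;
    for (let s = 0; s < shardResults.length; s++) {
      if (cursors[s] >= shardResults[s].length) continue; // this shard is exhausted
      if (
        best === -1 ||
        compare(shardResults[s][cursors[s]], shardResults[best][cursors[best]]) < 0
      ) {
        best = s;
      }
    }
    if (best === -1) return merged; // every shard is exhausted
    merged.push(shardResults[best][cursors[best]++]);
  }
}

const byTime = (a, b) => a.time - b.time;
const shardA = [{ time: 1, subject: "a1" }, { time: 4, subject: "a2" }];
const shardB = [{ time: 2, subject: "b1" }, { time: 3, subject: "b2" }, { time: 5, subject: "b3" }];

const result = mergeSorted([shardA, shardB], byTime);
console.log(result.map((d) => d.subject).join(",")); // a1,b1,b2,a2,b3
```

The fewer shards a query touches, the fewer inputs this merge has, which is the motivation for the "as few shards as possible" advice.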

Slide 15

Slide 15 text

Impact on Schema Design

{
  _id: "alvin",
  display: "jonnyeight",
  addresses: [
    { state: "CA", country: "USA" },
    { country: "UK" }
  ]
}

Shard on { _id: 1 }
Lookup by _id hits 1 node
Index on { "addresses.country": 1 }

Slide 16

Slide 16 text

Multiple Identities - Example
User can have multiple identities
• twitter name
• email address
• facebook name
• etc.
What is the best sharding key & schema design?

Slide 17

Slide 17 text

Multiple Identities - Solution 1

{
  _id: "alvin",
  display: "jonnyeight",
  fb: "alvin.richards",   // facebook
  li: "alvin.j.richards", // linkedin
  addresses: [
    { state: "CA", country: "USA" },
    { country: "UK" }
  ]
}

Shard on { _id: 1 }
Lookup by _id hits 1 node
Lookup by li or fb is scatter gather
Cannot create a unique index on li or fb

Slide 18

Slide 18 text

Multiple Identities - Solution 2

identities
{ type: "_id", val: "alvin", info: "1200-42" }
{ type: "fb", val: "alvin.richards", info: "1200-42" }
{ type: "li", val: "alvin.j.richards", info: "1200-42" }

info
{
  _id: "1200-42",
  addresses: [
    { state: "CA", country: "USA" },
    { country: "UK" }
  ]
}

Shard identities on { type: 1, val: 1 }
Lookup by type & val hits 1 node
Can create unique index on type & val
Shard info on { _id: 1 }
Lookup info on _id hits one node
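The two-step lookup in Solution 2 can be sketched with in-memory maps standing in for the two collections. The documents are the ones from the slide; the helper name findUser and the Map-based "indexes" are hypothetical illustrations, not driver code.

```javascript
// In-memory stand-ins for the two collections; values copied from the slide.
const identities = [
  { type: "_id", val: "alvin", info: "1200-42" },
  { type: "fb", val: "alvin.richards", info: "1200-42" },
  { type: "li", val: "alvin.j.richards", info: "1200-42" },
];
const info = [
  { _id: "1200-42", addresses: [{ state: "CA", country: "USA" }, { country: "UK" }] },
];

// Key each collection by its shard key; that is what keeps each lookup on one node.
const identityKey = (type, val) => JSON.stringify([type, val]);
const identityByKey = new Map(identities.map((d) => [identityKey(d.type, d.val), d]));
const infoById = new Map(info.map((d) => [d._id, d]));

function findUser(type, val) {
  const identity = identityByKey.get(identityKey(type, val)); // lookup 1: { type, val }
  if (!identity) return null;
  return infoById.get(identity.info) || null; // lookup 2: { _id }
}

console.log(findUser("li", "alvin.j.richards"));
console.log(findUser("fb", "nobody")); // null
```

The cost of this design is two targeted lookups per identity query instead of one scatter-gather query.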

Slide 19

Slide 19 text

Splitting
• Splitting is automatically triggered by a mongoS if a chunk goes beyond the max size
• It is a fairly inexpensive operation that scans a portion of the index
• Target chunk size (64MB default) can be changed, with progressive effect
• Splitting can be triggered manually using split
• Pre-splitting is strongly encouraged if the shard key is mostly increasing
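For pre-splitting, the split points have to be chosen before the data arrives. A minimal sketch, assuming a numeric shard key whose expected range is known in advance; the helper name and the range below are hypothetical, and each computed point would then be passed to a manual split.

```javascript
// Returns the boundaries that divide [min, max) into `chunks` equal ranges.
function splitPoints(min, max, chunks) {
  const points = [];
  const step = (max - min) / chunks;
  for (let i = 1; i < chunks; i++) {
    points.push(Math.floor(min + i * step));
  }
  return points;
}

// Hypothetical key range 0..1000 pre-split into 4 chunks.
console.log(splitPoints(0, 1000, 4)); // [ 250, 500, 750 ]
```

Evenly spaced points only make sense if the keys are expected to be spread evenly over the range; a skewed key would need points chosen from the expected distribution instead.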

Slide 20

Slide 20 text

Balancing
• Triggered automatically by a mongoS if there is a chunk imbalance (8+ chunks)
• Usually involves moving the oldest chunk from the largest shard to the smallest shard (page faults...)
• Migration can be fairly expensive:
  • non-sequential disk scan
  • large number of inserts & deletes
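The trigger rule can be sketched as a loop that moves one chunk at a time from the largest shard to the smallest while the difference is at least 8. This is a simplification built only from the numbers on the slide: the shard names and chunk counts are made up, and the real balancer's stop condition may differ.

```javascript
const MIGRATION_THRESHOLD = 8; // chunk-count difference quoted on the slide

// One balancing round: move a single chunk if the imbalance is large enough.
function balanceOnce(chunkCounts) {
  const names = Object.keys(chunkCounts);
  let largest = names[0];
  let smallest = names[0];
  for (const n of names) {
    if (chunkCounts[n] > chunkCounts[largest]) largest = n;
    if (chunkCounts[n] < chunkCounts[smallest]) smallest = n;
  }
  if (chunkCounts[largest] - chunkCounts[smallest] < MIGRATION_THRESHOLD) return false;
  chunkCounts[largest] -= 1;
  chunkCounts[smallest] += 1;
  return true;
}

// Hypothetical cluster with one overloaded shard.
const cluster = { shard0: 20, shard1: 4, shard2: 6 };
let migrations = 0;
while (balanceOnce(cluster)) migrations += 1;

console.log(migrations, cluster); // 6 { shard0: 14, shard1: 8, shard2: 8 }
```

Each iteration of the loop corresponds to one chunk migration, which is the expensive part described above.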

Slide 21

Slide 21 text

Balancing tips
• Run the balancer at low traffic times
• In the config database: db.settings.update({ _id: "balancer" }, { $set: { activeWindow: { start: "9:00", stop: "21:00" } } })
• A migration can be triggered manually using moveChunk
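To reason about which hours a given activeWindow covers, a small helper can test a time of day against it. This is a sketch, not the balancer's own logic: the function names are hypothetical, the window is assumed to be half-open (start included, stop excluded), and the night-time window is a made-up example.

```javascript
// Converts "H:MM" or "HH:MM" to minutes since midnight.
function toMinutes(hhmm) {
  const [h, m] = hhmm.split(":").map(Number);
  return h * 60 + m;
}

// True when `now` falls inside [start, stop); handles windows that wrap past midnight.
function inActiveWindow(now, window) {
  const t = toMinutes(now);
  const start = toMinutes(window.start);
  const stop = toMinutes(window.stop);
  return start <= stop ? t >= start && t < stop : t >= start || t < stop;
}

const slideWindow = { start: "9:00", stop: "21:00" }; // value from the slide
const nightWindow = { start: "23:00", stop: "6:00" }; // hypothetical low-traffic window

console.log(inActiveWindow("12:30", slideWindow)); // true
console.log(inActiveWindow("22:00", slideWindow)); // false
console.log(inActiveWindow("2:15", nightWindow));  // true
```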

Slide 22

Slide 22 text

Indexing
• Each mongod applies normal indexing on owned data
• Each mongoS has an in-memory btree of the chunks
• Secondary indexing is mostly intact, but a secondary index cannot enforce uniqueness across shards

Slide 23

Slide 23 text

Indexing
mongoS: chunks on a
S1: { a: 1 }, { b: 1 }
S2: { a: 1 }, { b: 1 }
S3: { a: 1 }, { b: 1 }
S4: { a: 1 }, { b: 1 }
• Shard key on a
• Secondary index on b
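The difference between a query on the shard key a and a query on the secondary index b can be sketched as a routing function. The chunk boundaries are made up, only equality queries are handled, and the function name is hypothetical; it illustrates the idea rather than how mongoS is implemented.

```javascript
// Chunk ranges on shard key `a`, as a mongoS might hold them in memory (made-up boundaries).
const chunks = [
  { min: -Infinity, max: 100, shard: "S1" },
  { min: 100, max: 200, shard: "S2" },
  { min: 200, max: 300, shard: "S3" },
  { min: 300, max: Infinity, shard: "S4" },
];
const allShards = [...new Set(chunks.map((c) => c.shard))];

// Equality queries only: a query containing the shard key is targeted, others are broadcast.
function shardsForQuery(query) {
  if (!("a" in query)) return allShards; // e.g. { b: 7 }: each shard checks its own index on b
  const chunk = chunks.find((c) => query.a >= c.min && query.a < c.max);
  return [chunk.shard];
}

console.log(shardsForQuery({ a: 150 })); // [ 'S2' ]
console.log(shardsForQuery({ b: 7 }));   // [ 'S1', 'S2', 'S3', 'S4' ]
```

Because each shard only sees its own slice of b, no single shard can tell whether a value of b is unique across the whole collection.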

Slide 24

Slide 24 text

Removing capacity
• To remove a shard:
  • Use the removeshard command via mongos. This will drain all the chunks.
  • When the chunks have been moved, use movePrimary on the databases that use the shard as their primary (default) shard.
  • Run a final removeshard to remove all metadata.
• To remove a replica, administer that replica directly

Slide 25

Slide 25 text

Backup process
• Process is:
  • Stop the balancer: sh.setBalancerState(false)
  • Check no current migration: sh.isBalancerRunning()
  • Back up 1 config server (they’re identical!)
  • Back up each shard, following the replica set procedure
  • Restart the balancer
• Do backup at the end of balancing rounds, for large clusters

Slide 26

Slide 26 text

The future
• shard key hashing (v2.2)
  • content will be optimally distributed across shards
• geo-aware sharding (v2.2)
  • a shard is marked as belonging to a region (e.g. US)

Slide 27

Slide 27 text

Summary
• Shard if you need to - what do the metrics say?
• Choose shard key wisely - not easily changed
• Shard early as resources will be required