Slide 1

Slide 1 text

High Availability and Scaling MongoDB

Slide 2

Slide 2 text

Agenda
•  Replica Set Lifecycle
•  Developing with Replica Sets
•  Operational Considerations
•  How Replication Works

Slide 3

Slide 3 text

Why Replication?
•  How many have faced node failures?
•  How many have been woken up from sleep to perform a failover?
•  How many have experienced issues due to network latency?
•  Different uses for data
   –  Normal processing
   –  Simple analytics

Slide 4

Slide 4 text

Replica Set Lifecycle

Slide 5

Slide 5 text

Replica Set – Creation (diagram: Node 1, Node 2, Node 3)

Slide 6

Slide 6 text

Replica Set – Initialize (diagram: Node 3 is Primary; Node 1 and Node 2 are Secondaries; replication and heartbeats flow between the members)

Slide 7

Slide 7 text

Replica Set – Failure (diagram: Node 3, the Primary, fails; Node 1 and Node 2 keep exchanging heartbeats and hold a primary election)

Slide 8

Slide 8 text

Replica Set – Failover (diagram: Node 2 is elected Primary; Node 1 remains a Secondary replicating from it; Node 3 is still down)

Slide 9

Slide 9 text

Replica Set – Recovery (diagram: Node 3 comes back and catches up by replicating; Node 2 remains Primary, Node 1 remains Secondary)

Slide 10

Slide 10 text

Replica Set – Recovered (diagram: Node 2 is Primary; Node 1 and Node 3 are Secondaries replicating from it)

Slide 11

Slide 11 text

Replica Set Roles & Configuration

Slide 12

Slide 12 text

Replica Set Roles (diagram: Node 3 is Primary, Node 1 is Secondary, Node 2 is an Arbiter; heartbeats between all members, replication only to the Secondary)

Slide 13

Slide 13 text

Configuration Options

> conf = {
    _id : "mySet",
    members : [
      {_id : 0, host : "A", priority : 3},
      {_id : 1, host : "B", priority : 2},
      {_id : 2, host : "C"},
      {_id : 3, host : "D", hidden : true},
      {_id : 4, host : "E", hidden : true, slaveDelay : 3600}
    ]
  }
> rs.initiate(conf)
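A running set's configuration can also be changed later with rs.reconfig(). A minimal sketch, assuming you are connected to the primary of the set above; the member index and the new priority value are illustrative:

> cfg = rs.conf()                 // fetch the current configuration
> cfg.members[1].priority = 5     // e.g. raise host "B" above host "A"
> rs.reconfig(cfg)                // apply the change; this may trigger an election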

Slide 14

Slide 14 text

Configuration Options – Primary DC (same configuration as above; this slide highlights hosts A and B, with priority : 3 and priority : 2, as the members in the primary data center)

Slide 15

Slide 15 text

Configuration Options – Secondary DC (same configuration as above; host C sits in the secondary data center and, with no priority set, gets the default priority of 1)

Slide 16

Slide 16 text

Configuration Options – Analytics node (same configuration as above; host D is hidden : true, so it never becomes primary and is invisible to client applications, which makes it suitable for analytics workloads)

Slide 17

Slide 17 text

Configuration Options – Backup node (same configuration as above; host E is hidden with slaveDelay : 3600, so it applies operations one hour behind the primary, which makes it useful as a rolling backup)

Slide 18

Slide 18 text

Developing with Replica Sets

Slide 19

Slide 19 text

Strong Consistency (diagram: the client application's driver sends both writes and reads to the Primary; the Secondaries replicate in the background)

Slide 20

Slide 20 text

Delayed Consistency (diagram: the driver sends writes to the Primary but reads from the Secondaries, which may return slightly stale data)

Slide 21

Slide 21 text

Write Concern
•  Network acknowledgement
•  Wait for error
•  Wait for journal sync
•  Wait for replication
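In the 2.x shell these levels correspond to getLastError options (modern drivers expose them as write concern settings). A minimal sketch; the collection name and the w/wtimeout values are illustrative:

> db.blogs.insert({title : "hello"})
> db.runCommand({getLastError : 1})                         // wait for error (acknowledged)
> db.runCommand({getLastError : 1, j : true})               // also wait for journal sync
> db.runCommand({getLastError : 1, w : 2, wtimeout : 5000}) // also wait for replication to one secondary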

Slide 22

Slide 22 text

Unacknowledged (diagram: the driver sends the write and the primary applies it in memory; no acknowledgement is returned)

Slide 23

Slide 23 text

MongoDB Acknowledged / wait for error (diagram: the driver sends the write, the primary applies it in memory, and the driver calls getLastError to confirm)

Slide 24

Slide 24 text

Wait for Journal Sync (diagram: the driver issues the write followed by getLastError with j : true; the primary applies it in memory and writes it to the journal before acknowledging)

Slide 25

Slide 25 text

Wait for Replication (diagram: the driver issues the write followed by getLastError with w : 2; the primary applies it in memory and replicates it to a secondary before acknowledging)

Slide 26

Slide 26 text

Tagging
•  New in 2.0.0
•  Control where data is written to, and read from
•  Each member can have one or more tags
   –  tags: {dc: "ny"}
   –  tags: {dc: "ny", subnet: "192.168", rack: "row3rk7"}
•  Replica set defines rules for write concerns
•  Rules can change without changing app code

Slide 27

Slide 27 text

{ _id : "mySet", members : [ {_id : 0, host : "A", tags : {"dc": "ny"}}, {_id : 1, host : "B", tags : {"dc": "ny"}}, {_id : 2, host : "C", tags : {"dc": "sf"}}, {_id : 3, host : "D", tags : {"dc": "sf"}}, {_id : 4, host : "E", tags : {"dc": "cloud"}}], settings : { getLastErrorModes : { allDCs : {"dc" : 3}, someDCs : {"dc" : 2}} } } > db.blogs.insert({...}) > db.runCommand({getLastError : 1, w : "someDCs"}) Tagging Example

Slide 28

Slide 28 text

Wait for Replication (Tagging) (diagram: the driver issues the write followed by getLastError with w : "allDCs"; the SF primary applies it in memory and replicates to the NY and cloud secondaries before acknowledging)

Slide 29

Slide 29 text

Read Preference Modes
•  5 modes (new in 2.2)
   –  primary (only) – default
   –  primaryPreferred
   –  secondary
   –  secondaryPreferred
   –  nearest
•  When more than one node is possible, the closest node is used for reads (all modes but primary)
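A minimal sketch of selecting a read preference from the mongo shell; the mode and collection name are illustrative:

> db.getMongo().setReadPref("secondaryPreferred")   // connection-wide
> db.blogs.find().readPref("nearest")               // per-query override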

Slide 30

Slide 30 text

Tagged Read Preference
•  Custom read preferences
•  Control where you read from by (node) tags
   –  e.g. { "disk": "ssd", "use": "reporting" }
•  Use in conjunction with standard read preferences
   –  Except primary
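A minimal sketch combining a mode with tag sets; the tag values are illustrative, and the trailing empty document means "fall back to any eligible member":

> db.getMongo().setReadPref("secondaryPreferred", [{"disk" : "ssd", "use" : "reporting"}, {}])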

Slide 31

Slide 31 text

Operational Considerations

Slide 32

Slide 32 text

Maintenance and Upgrade
•  No downtime
•  Rolling upgrade/maintenance
   –  Start with Secondaries
   –  Primary last
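A minimal sketch of the primary-last step, assuming you are connected to the current primary; the 60-second value is illustrative:

> rs.stepDown(60)   // step down and do not seek re-election for 60 seconds
> rs.status()       // confirm another member became primary before maintaining this one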

Slide 33

Slide 33 text

Replica Set – 1 Data Center
•  Single data center
•  Single switch & power
•  Points of failure:
   –  Power
   –  Network
   –  Data center
   –  Two-node failure
•  Automatic recovery from a single node crash
(diagram: Members 1, 2 and 3 in a single data center)

Slide 34

Slide 34 text

Replica Set – 2 Data Centers
•  Multi data center
•  DR node for safety
•  Can't do a durable multi-data-center write safely, since there is only 1 node in the distant DC
(diagram: Members 1 and 2 in Datacenter 1; Member 3 in Datacenter 2)

Slide 35

Slide 35 text

Replica Set – 3 Data Centers
•  Three data centers
•  Can survive a full data center loss
•  Can do w : { dc : 2 } (with tags) to guarantee a write reaches 2 data centers
(diagram: Members 1 and 2 in Datacenter 1; Members 3 and 4 in Datacenter 2; Member 5 in Datacenter 3)

Slide 36

Slide 36 text

How does it work?

Slide 37

Slide 37 text

Implementation details
•  Heartbeat every 2 seconds
   –  Times out in 10 seconds
•  Local DB (not replicated)
   –  system.replset
   –  oplog.rs
      •  Capped collection
      •  Idempotent version of operation stored
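A minimal sketch of inspecting these collections from the shell on any member:

> use local
> db.oplog.rs.find().sort({$natural : -1}).limit(1)   // most recently applied operation
> db.system.replset.findOne()                         // the stored replica set configuration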

Slide 38

Slide 38 text

Op(erations) Log is idempotent

> db.replsettest.insert({_id : 1, value : 1})
{ "ts" : Timestamp(1350539727000, 1), "h" : NumberLong("6375186941486301201"),
  "op" : "i", "ns" : "test.replsettest", "o" : { "_id" : 1, "value" : 1 } }

> db.replsettest.update({_id : 1}, {$inc : {value : 10}})
{ "ts" : Timestamp(1350539786000, 1), "h" : NumberLong("5484673652472424968"),
  "op" : "u", "ns" : "test.replsettest", "o2" : { "_id" : 1 },
  "o" : { "$set" : { "value" : 11 } } }

Slide 39

Slide 39 text

Single operation can have many entries

> db.replsettest.update({}, {$set : {name : "foo"}}, false, true)
{ "ts" : Timestamp(1350540395000, 1), "h" : NumberLong("-4727576249368135876"),
  "op" : "u", "ns" : "test.replsettest", "o2" : { "_id" : 2 }, "o" : { "$set" : { "name" : "foo" } } }
{ "ts" : Timestamp(1350540395000, 2), "h" : NumberLong("-7292949613259260138"),
  "op" : "u", "ns" : "test.replsettest", "o2" : { "_id" : 3 }, "o" : { "$set" : { "name" : "foo" } } }
{ "ts" : Timestamp(1350540395000, 3), "h" : NumberLong("-1888768148831990635"),
  "op" : "u", "ns" : "test.replsettest", "o2" : { "_id" : 1 }, "o" : { "$set" : { "name" : "foo" } } }

Slide 40

Slide 40 text

Replica sets
•  Use replica sets
•  Easy to set up
   –  Try it on a single machine
•  Check the docs page for replica set tutorials
   –  http://docs.mongodb.org/manual/replication/#tutorials
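A minimal sketch of a three-member test set on a single machine, as suggested above; the set name, ports and data paths are illustrative:

$ mongod --replSet mySet --port 27017 --dbpath /data/rs1 --fork --logpath /data/rs1.log
$ mongod --replSet mySet --port 27018 --dbpath /data/rs2 --fork --logpath /data/rs2.log
$ mongod --replSet mySet --port 27019 --dbpath /data/rs3 --fork --logpath /data/rs3.log
$ mongo --port 27017
> rs.initiate({_id : "mySet", members : [
    {_id : 0, host : "localhost:27017"},
    {_id : 1, host : "localhost:27018"},
    {_id : 2, host : "localhost:27019"}]})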

Slide 41

Slide 41 text

Sharding

Slide 42

Slide 42 text

Agenda
•  Why shard
•  MongoDB's approach
•  Architecture
•  Configuration
•  Mechanics
•  Solutions

Slide 43

Slide 43 text

Why shard?

Slide 44

Slide 44 text

Working Set Exceeds Physical Memory

Slide 45

Slide 45 text

Read/Write Throughput Exceeds I/O

Slide 46

Slide 46 text

MongoDB's approach to sharding

Slide 47

Slide 47 text

Partition data based on ranges
•  User defines the shard key
•  Shard key defines a range of data
•  Key space is like points on a line
•  A range is a segment of that line
(diagram: the key space as a line from -∞ to +∞)

Slide 48

Slide 48 text

Distribute data in chunks across nodes
•  Initially 1 chunk
•  Default max chunk size: 64 MB
•  MongoDB automatically splits & migrates chunks when the max is reached
(diagram: mongos routers in front of a config server and two shards)

Slide 49

Slide 49 text

MongoDB manages data
•  Queries routed to specific shards
•  MongoDB balances the cluster
•  MongoDB migrates data to new nodes
(diagram: mongos in front of three shards holding chunks 1–4)

Slide 50

Slide 50 text

MongoDB Auto-Sharding
•  Minimal effort required
   –  Same interface as a single mongod
•  Two steps
   –  Enable sharding for a database
   –  Shard a collection within the database
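A minimal sketch of the two steps, run through mongos; the database, collection and shard key are illustrative:

> sh.enableSharding("mydb")
> sh.shardCollection("mydb.users", {user_id : 1})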

Slide 51

Slide 51 text

Architecture

Slide 52

Slide 52 text

Data stored in a shard
•  A shard is a node of the cluster
•  A shard can be a single mongod or a replica set
(diagram: a shard as either a replica set, primary plus two secondaries, or a single mongod)

Slide 53

Slide 53 text

Config server stores metadata
•  Config Server
   –  Stores cluster chunk ranges and locations
   –  Can have only 1 or 3 (production must have 3)
   –  Two-phase commit (not a replica set)
(diagram: either a single config server or three config servers)

Slide 54

Slide 54 text

Mongos manages the data
•  Mongos
   –  Acts as a router / balancer
   –  No local data (persists to the config database)
   –  Can have 1 or many
(diagram: one mongos per app server, or several app servers sharing a mongos)

Slide 55

Slide 55 text

Sharding infrastructure (diagram: app servers each with a mongos, three config servers, and multiple shards)

Slide 56

Slide 56 text

Configuration

Slide 57

Slide 57 text

Example cluster setup
•  Don't use this setup in production!
   –  Only one config server (no fault tolerance)
   –  Shard not in a replica set (low availability)
   –  Only one mongos and one shard (no performance improvement)
   –  Useful for development or for demonstrating configuration mechanics
(diagram: one config server, one mongos, one mongod shard)

Slide 58

Slide 58 text

Start the config server
•  "mongod --configsvr"
•  Starts a config server on the default port (27019)

Slide 59

Slide 59 text

Start the mongos router
•  "mongos --configdb <hostname>:27019"
•  For 3 config servers: "mongos --configdb <host1>:<port1>,<host2>:<port2>,<host3>:<port3>"
•  This is always how to start a new mongos, even if the cluster is already running
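For example, with three config servers on illustrative hostnames:

$ mongos --configdb cfg1.example.net:27019,cfg2.example.net:27019,cfg3.example.net:27019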

Slide 60

Slide 60 text

Start the shard database
•  "mongod --shardsvr"
•  Starts a mongod with the default shard port (27018)
•  The shard is not yet connected to the rest of the cluster
•  The shard may have already been running in production

Slide 61

Slide 61 text

Add the shard
•  On mongos: "sh.addShard('<host>:27018')"
•  Adding a replica set: "sh.addShard('<replSetName>/<seed list>')"
•  In 2.2 and later you can use "sh.addShard('<host>:<port>')"
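A minimal sketch, assuming a replica-set shard named shardA and a standalone mongod; all host names are illustrative:

> sh.addShard("shardA/node1.example.net:27018,node2.example.net:27018")
> sh.addShard("node3.example.net:27018")    // a standalone mongod as a shard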

Slide 62

Slide 62 text

Verify that the shard was added
•  db.runCommand({ listshards : 1 })
•  { "shards" : [ { "_id" : "shard0000", "host" : "<host>:27018" } ], "ok" : 1 }

Slide 63

Slide 63 text

Enabling Sharding
•  Enable sharding on a database
   –  sh.enableSharding("<dbname>")
•  Shard a collection with the given key
   –  sh.shardCollection("<dbname>.people", {"country" : 1})
•  Use a compound shard key to prevent duplicates
   –  sh.shardCollection("<dbname>.cars", {"year" : 1, "uniqueid" : 1})

Slide 64

Slide 64 text

Tag Aware Sharding
•  Tag aware sharding allows you to control the distribution of your data
•  Tag a range of shard keys
   –  sh.addTagRange(<namespace>, <min>, <max>, <tag>)
•  Tag a shard
   –  sh.addShardTag(<shard name>, <tag>)
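A minimal sketch; the shard name, namespace, range and tag are illustrative:

> sh.addShardTag("shard0000", "NYC")
> sh.addTagRange("records.users", { zipcode : "10001" }, { zipcode : "10281" }, "NYC")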

Slide 65

Slide 65 text

Mechanics

Slide 66

Slide 66 text

Partitioning
•  Remember: it's based on ranges
(diagram: the key space as a line from -∞ to +∞)

Slide 67

Slide 67 text

A chunk is a section of the entire range (diagram: the key space from minKey to maxKey divided into 64 MB chunks, with example boundaries at {x: -20}, {x: 13}, {x: 25}, {x: 100,000})

Slide 68

Slide 68 text

Chunk splitting
•  A chunk is split once it exceeds the maximum size
•  There is no split point if all documents have the same shard key
•  A chunk split is a logical operation (no data is moved)
•  If a split creates too large a discrepancy in chunk counts across the cluster, a balancing round starts
(diagram: one chunk from minKey to maxKey splitting into two at key 13/14)
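A minimal sketch of inspecting chunks from mongos; the namespace is illustrative:

> sh.status()                                    // chunk ranges and the shard each lives on
> use config
> db.chunks.find({ ns : "mydb.users" }).count()  // number of chunks for the collection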

Slide 69

Slide 69 text

Balancing
•  The balancer runs on mongos
•  Once the difference in chunks between the most dense shard and the least dense shard is above the migration threshold, a balancing round starts
(diagram: mongos routers, a config server, and two shards)

Slide 70

Slide 70 text

Acquiring the Balancer Lock
•  The balancer on mongos takes out a "balancer lock"
•  To see the status of these locks:
   –  use config
   –  db.locks.find({ _id : "balancer" })
(diagram: mongos routers, a config server, and two shards)
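Related shell helpers, shown as a sketch (availability depends on shell version):

> sh.getBalancerState()        // is the balancer enabled?
> sh.isBalancerRunning()       // is a balancing round in progress right now?
> sh.setBalancerState(false)   // disable balancing, e.g. for a maintenance window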

Slide 71

Slide 71 text

Moving the chunk
•  The mongos sends a "moveChunk" command to the source shard
•  The source shard then notifies the destination shard
•  The destination claims the chunk's shard-key range
•  The destination shard starts pulling documents from the source shard
(diagram: mongos, config server, and the source and destination shards)

Slide 72

Slide 72 text

Committing Migration
•  When complete, the destination shard updates the config server
   –  Provides the new locations of the chunks
(diagram: the destination shard reporting to the config server)

Slide 73

Slide 73 text

Cleanup
•  The source shard deletes the moved data
   –  Must wait for open cursors to either close or time out
   –  NoTimeout cursors may prevent the release of the lock
•  Mongos releases the balancer lock after the old chunks are deleted
(diagram: mongos routers, a config server, and two shards)

Slide 74

Slide 74 text

Routing Requests

Slide 75

Slide 75 text

Cluster Request Routing
•  Targeted queries
•  Scatter-gather queries
•  Scatter-gather queries with sort
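A minimal sketch of checking how a query is routed, assuming a collection sharded on {user_id : 1} (names are illustrative); explain() run through mongos shows whether one shard or all shards were queried:

> db.users.find({ user_id : 123 }).explain()   // targeted: the query includes the shard key
> db.users.find({ name : "foo" }).explain()    // scatter-gather: no shard key in the query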

Slide 76

Slide 76 text

Cluster Request Routing: Targeted Query (diagram: mongos in front of three shards)

Slide 77

Slide 77 text

Routable request received (diagram)

Slide 78

Slide 78 text

Request routed to appropriate shard (diagram)

Slide 79

Slide 79 text

Shard returns results (diagram)

Slide 80

Slide 80 text

Mongos returns results to client (diagram)

Slide 81

Slide 81 text

Cluster Request Routing: Non-Targeted Query (diagram: mongos in front of three shards)

Slide 82

Slide 82 text

Non-targeted request received (diagram)

Slide 83

Slide 83 text

Request sent to all shards (diagram)

Slide 84

Slide 84 text

Shards return results to mongos (diagram)

Slide 85

Slide 85 text

Mongos returns results to client (diagram)

Slide 86

Slide 86 text

Cluster Request Routing: Non-Targeted Query with Sort (diagram: mongos in front of three shards)

Slide 87

Slide 87 text

Non-targeted request with sort received (diagram)

Slide 88

Slide 88 text

Request sent to all shards (diagram)

Slide 89

Slide 89 text

Query and sort performed locally on each shard (diagram)

Slide 90

Slide 90 text

Shards return results to mongos (diagram)

Slide 91

Slide 91 text

Mongos merges the sorted results (diagram)

Slide 92

Slide 92 text

Mongos returns results to client (diagram)

Slide 93

Slide 93 text

Shard Key

Slide 94

Slide 94 text

Shard Key
•  Choose a field commonly used in queries
•  The shard key is immutable
•  Shard key values are immutable
•  The shard key requires an index on the fields contained in the key
•  Uniqueness of the `_id` field is only guaranteed within an individual shard
•  The shard key is limited to 512 bytes in size

Slide 95

Slide 95 text

Shard Key Considerations
•  Cardinality
•  Write distribution
•  Query isolation
•  Data distribution

Slide 96

Slide 96 text

Sharding enables scale

Slide 97

Slide 97 text

Working Set Exceeds Physical Memory

Slide 98

Slide 98 text

Read/Write throughput exceeds I/O (example: click tracking)

Slide 99

Slide 99 text

Thank You
•  #ConferenceHashTag
•  Speaker Name
•  Job Title, 10gen