replication

Replication!
Mike O’Brien, software engineer @ 10gen
Monday, March 12, 2012

What we’ll cover
• What a replica set is and why you want it
• The mechanics of how a replica set works
• How to set it up
• How to handle it with drivers
• Some dos + don’ts for deployment

Replica Set
• DB nodes whose goal is to create a complete copy of data on each node
• Only one primary at a time
• All other nodes are secondaries
• If the primary fails, a secondary is chosen to take over
[Diagram: PRIMARY, secondary1, secondary2]

Replica Set
• Only one primary at any time
• Only the primary accepts writes (i.e. writes are strongly consistent)
• Secondaries are read-only
• Secondaries talk to the primary to keep their own copies in sync
[Diagram: PRIMARY, secondary1, secondary2]

Why
• App can survive a database node failure
• Extra copies of data = redundancy
• Scaling reads: sending .find() queries to secondaries
• Makes backups easier
• Use hidden replicas for secondary workloads: analytics, integration with other systems, etc.
• Data-center awareness: survive an entire data center outage

What happens when a node fails?
• Replica set members monitor each other with heartbeats - ping every 2 seconds
• If the primary can’t be reached, an election is triggered - each node gets a vote and knows the total # of available votes
• If no node can reach a majority, the replica set becomes read-only
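
The majority rule in the last two bullets can be sketched in a few lines of Python. This is an illustration only, not MongoDB's election code: `has_majority` and its parameters are hypothetical names, and every member is assumed to hold exactly one vote.

```python
def has_majority(reachable_votes: int, total_votes: int) -> bool:
    """Return True if the reachable votes form a strict majority of all votes."""
    return reachable_votes > total_votes // 2


# 3-member set, primary down: the 2 survivors still hold 2 of 3 votes.
print(has_majority(2, 3))  # True
# 4-member set split 2/2: neither half has a majority, so the set is read-only.
print(has_majority(2, 4))  # False
```

Note that the check is against the total number of votes in the set, not the number of members currently reachable, which is why a partitioned minority cannot elect its own primary.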

[Diagram: m0 (primary), m2, m1]

Primary no longer visible
[Diagram: m2, m1, m0 (down), ? ?]

Election is triggered
[Diagram: m2 (primary), m1, m0 (down), ? ?]

Recovery is automatic
[Diagram: m2 (primary), m1, m0 (RECOVERING)]

Replica set reëstablished
[Diagram: m2 (primary), m1, m0]

How it Works
• Change operations are written to an oplog (capped collection) on the primary
• Secondaries query the oplog and apply the changes
• All replicas get their own oplog

oplog is fixed size
• capped collection - like a circular queue
• defaults to 5% of free disk space (on 64-bit); this is usually plenty
• eventually it will fill up, and the oldest entries get overwritten
• if a secondary falls too far behind, it will need to resync
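
The circular-queue behavior above can be modeled with a bounded deque. This is a toy sketch, not how mongod stores its oplog: `CappedOplog`, the integer timestamps, and the catch-up rule are simplifying assumptions made for illustration.

```python
from collections import deque


class CappedOplog:
    """Toy fixed-size oplog: once full, each append evicts the oldest entry."""

    def __init__(self, capacity: int) -> None:
        self.entries = deque(maxlen=capacity)
        self.next_ts = 0  # hypothetical, monotonically increasing timestamp

    def append(self, op: str) -> None:
        self.entries.append((self.next_ts, op))
        self.next_ts += 1

    def can_catch_up(self, last_applied_ts: int) -> bool:
        """A secondary can resume only if the entry after its last applied one still exists."""
        if not self.entries:
            return True
        oldest_ts = self.entries[0][0]
        return last_applied_ts + 1 >= oldest_ts


oplog = CappedOplog(capacity=3)
for i in range(5):
    oplog.append(f"op{i}")

print([ts for ts, _ in oplog.entries])  # [2, 3, 4]
print(oplog.can_catch_up(1))  # True: entry 2 is still in the oplog
print(oplog.can_catch_up(0))  # False: entry 1 was overwritten, so a resync is needed
```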

An oplog entry looks like this:

    {
      "ts" : { "t" : 1329197785000, "i" : 1 },     // timestamp
      "h" : NumberLong("4816916911793111057"),     // unique identifier
      "op" : "i",                                  // operation type
      "ns" : "test.stuff",                         // namespace
      "o" : { "_id" : ObjectId("4f39f2d91b645b4d80fb2e86"), "a" : 1 }   // operation
    }

oplog entries are idempotent
i.e.: replaying the same entry yields the same result

One update cmd:

    update({age:{$lt:5}}, {$inc:{x:1}})

becomes multiple oplog entries:

    set x=1 for ObjectId("4f3bf37bdbb51e2beb325867")
    set x=1 for ObjectId("4f3bf37ddbb51e2beb325868")
    set x=1 for ObjectId("4f3bf37ddbb51e2beb325869")
    etc...
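
To see why rewriting an increment as per-document "set" entries makes replay safe, here is a small Python sketch. The helper names (`to_oplog_entries`, `apply_entry`), the entry layout, and the sample documents are all hypothetical; real oplog entries look like the one shown on the previous slide.

```python
def to_oplog_entries(docs, matches, field, amount):
    """Turn one $inc-style update into one absolute 'set' entry per matched document."""
    entries = []
    for doc in docs:
        if matches(doc):
            new_value = doc.get(field, 0) + amount
            entries.append({"_id": doc["_id"], "set": {field: new_value}})
    return entries


def apply_entry(store, entry):
    """Apply a 'set' entry; applying it again leaves the document unchanged."""
    store[entry["_id"]].update(entry["set"])


docs = [
    {"_id": 1, "age": 3, "x": 0},
    {"_id": 2, "age": 4, "x": 0},
    {"_id": 3, "age": 9, "x": 0},
]
entries = to_oplog_entries(docs, lambda d: d["age"] < 5, "x", 1)

store = {d["_id"]: dict(d) for d in docs}
for entry in entries + entries:  # replay every entry twice
    apply_entry(store, entry)

print([store[i]["x"] for i in (1, 2, 3)])  # [1, 1, 0]
```

Replaying the original `$inc` twice would have left x at 2; replaying the "set" entries twice still leaves x at 1.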

Launching a Replica Set
Start your mongod processes with --replSet <name>, then:

    > var config = {
        _id: "austin",
        members: [
          {_id: 0, host: "host1.weylandyutani.com"},
          {_id: 1, host: "host2.weylandyutani.com"},
          {_id: 2, host: "host3.weylandyutani.com"}
        ]
      }
    > use admin;
    > rs.initiate(config);

Replication Utilities
• rs.add("hostname:port") - add a new member
• rs.remove("hostname:port") - remove a member
• rs.status() - get an overview of replica set health
• rs.stepDown() - step down as primary
• rs.reconfig(config) - update the replica set config
• rs.slaveOk() - on a secondary, enable read queries

Replica Set Options
• {arbiterOnly: true} - makes this node an arbiter: votes in elections, but stores no data
• {priority: p} - sets a preference for election as primary; priority: 0 means the node can never become primary (useful for backups, reporting, etc.)
• {slaveDelay: <seconds>} - number of seconds to remain behind the primary; useful for accident recovery, rolling backups, etc.
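
As a rough illustration of how these options combine, the sketch below writes a config document as a plain Python dict and works out which members could ever become primary. The host names come from the launch example; which member gets which option, the one-hour delay, and the `electable_hosts` helper are assumptions made up for this example.

```python
config = {
    "_id": "austin",
    "members": [
        {"_id": 0, "host": "host1.weylandyutani.com"},
        # Delayed copy for accident recovery; priority 0 keeps it from becoming primary.
        {"_id": 1, "host": "host2.weylandyutani.com", "priority": 0, "slaveDelay": 3600},
        # Arbiter: votes in elections but stores no data.
        {"_id": 2, "host": "host3.weylandyutani.com", "arbiterOnly": True},
    ],
}


def electable_hosts(cfg):
    """Hosts that hold data and have a non-zero priority (default priority is 1)."""
    return [
        m["host"]
        for m in cfg["members"]
        if not m.get("arbiterOnly", False) and m.get("priority", 1) > 0
    ]


print(electable_hosts(config))  # ['host1.weylandyutani.com']
```

With this layout only one member can ever be primary, so it trades failover for a cheap delayed copy; a production set would normally keep at least two electable members.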

Drivers are replica-set aware!
By passing options to getLastError(), we can get a guarantee of successful replication:

    from pymongo import ReplicaSetConnection
    db = ReplicaSetConnection().test

    # Ensure write on >=2 nodes
    db.u.update({"name": "bob"}, {"$inc": {"age": 1}}, safe=True, w=2)

    # Ensure write reaches majority of nodes
    db.u.update({"name": "bob"}, {"$inc": {"age": 1}}, safe=True, w="majority")

Scaling reads with Secondary Nodes
• .slaveOk() enables read queries on secondary nodes
• Good for read-heavy situations
• Not necessarily helpful for write-heavy situations
• This does not increase your working set size (need sharding)

Drivers can handle sending read queries to secondaries

    from pymongo import ReplicaSetConnection, ReadPreference
    c = ReplicaSetConnection(read_preference=ReadPreference.SECONDARY)
    db = c.test
    db.u.find_one({"name": "bob"})

These reads are eventually consistent.
If you need strong consistency, stick with ReadPreference.PRIMARY.
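
The difference between the two read preferences can be shown with a toy in-memory model. Nothing here talks to MongoDB: `ToyReplicaSet`, its methods, and the explicit `replicate()` step are assumptions that stand in for replication lag.

```python
class ToyReplicaSet:
    """One primary and one secondary, with replication as an explicit step."""

    def __init__(self):
        self.primary = {}
        self.secondary = {}
        self.pending = []  # writes not yet applied on the secondary

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))

    def replicate(self):
        """Apply all pending writes to the secondary, in order."""
        for key, value in self.pending:
            self.secondary[key] = value
        self.pending.clear()

    def read(self, key, from_secondary=False):
        node = self.secondary if from_secondary else self.primary
        return node.get(key)


rs = ToyReplicaSet()
rs.write("bob", 30)
print(rs.read("bob"))                       # 30   (primary: strongly consistent)
print(rs.read("bob", from_secondary=True))  # None (secondary has not caught up yet)
rs.replicate()
print(rs.read("bob", from_secondary=True))  # 30   (eventually consistent)
```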

Strong Consistency

Eventual Consistency

Deployment Strategies
• Odd # of members for elections
• Minimum of 3 members

• 5 nodes (good): On loss of <=2 nodes, survivors can elect new primary.
• 4 nodes (bad): Survives 1 failure. On 2 failures, remaining 2 nodes become read-only.
• 3 nodes (good): Survives 1 failure. On 1 failure, elects new primary.
• 2 nodes (bad): Becomes read-only on loss of a single member.
• 1 node (bad!): This isn’t even a replica set, actually.
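
The verdicts above follow from simple arithmetic, sketched here in Python. `tolerated_failures` is a hypothetical helper, and it assumes one vote per member with no arbiters.

```python
def tolerated_failures(members: int) -> int:
    """Most members that can fail while the survivors still hold a strict majority."""
    majority = members // 2 + 1
    return members - majority


for n in range(1, 6):
    print(f"{n} node(s): survives {tolerated_failures(n)} failure(s)")
```

Going from 3 to 4 members (or from 1 to 2) raises the majority threshold without raising the number of failures the set can survive, which is why odd member counts are recommended.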

Network Setup
• Each member should have its own machine
• Use arbiters for a more lightweight setup
• If sharding, each shard should be a complete replica set
• Up to 12 replica set members, 7 of which are allowed to vote

Single Data Center
• Single switch and power
• 2 node failure

Multi Data-Center
• v2.0+: Tagging for Data-Center Awareness

Using Replica Set for backing up data
[Diagram: primary]

fsync + lock a secondary
[Diagram: primary, locked secondary]

dump to backup
[Diagram: primary, locked secondary]

unlock - the secondary catches up automatically (make sure your oplog size is big enough)
[Diagram: primary, recovering secondary]

similar idea: build indexes on secondaries
for each server in secondaries:
  - shut down the server
  - restart server as standalone
  - log in to server and build the index
  - shut down the server
  - restart server as a secondary again
then step down the primary and repeat (rolling update)

thanks! questions?
also, we’re hiring! 10gen.com/jobs
[email protected] @mpobrien
