Slide 1

Replication and Replica Sets
Ross Lawley - [email protected]
Twitter: @RossC0

Slide 2

Today's Talk
• Replication, elections and configuration
• High availability scenarios

Slide 3

Use cases
• High Availability (auto-failover)
• Read Scaling (extra copies to read from)
• Backups
  • Online, Delayed Copy (fat finger)
  • Point in Time (PiT) backups
• Use (hidden) replica for secondary workload
  • Analytics
  • Data processing
  • Integration with external systems

Slide 4

Types of outage
Planned
• Hardware upgrade
• O/S or file-system tuning
• Relocation of data to new file-system / storage
• Software upgrade
Unplanned
• Hardware failure
• Data center failure
• Region outage
• Human error
• Application corruption

Slide 5

Replica Set features
• A cluster of N servers
• All writes go to the primary
• Reads can go to the primary (default) or a secondary
• Any (one) node can be primary
• Consensus election of the primary
• Automatic failover
• Automatic recovery

Slide 6

How MongoDB Replication works
• A set is made up of 2 or more nodes
[Diagram: Member 1, Member 2, Member 3]

Slide 7

How MongoDB Replication works
• Election establishes the PRIMARY
• Data replication from PRIMARY to SECONDARY
[Diagram: Member 1, Member 2 (Primary), Member 3]

Slide 8

How MongoDB Replication works
• PRIMARY may fail
• Automatic election of a new PRIMARY if a majority exists
[Diagram: Member 2 is DOWN; Members 1 and 3 negotiate a new master]

Slide 9

How MongoDB Replication works
• New PRIMARY elected
• Replica Set re-established
[Diagram: Member 2 is DOWN; Member 3 is the new Primary]

Slide 10

How MongoDB Replication works
• Automatic recovery
[Diagram: Member 3 is Primary; Member 2 is Recovering]

Slide 11

How MongoDB Replication works
• Replica Set re-established
[Diagram: Member 3 is Primary; Member 2 has rejoined]

Slide 12

How does Replication work?
• Change operations are written to the oplog
• The oplog is a capped collection (fixed size)
  • Must have enough space to allow new secondaries to catch up after copying from a primary
  • Must have enough space to cope with any applicable slaveDelay
• Secondaries query the primary's oplog and apply what they find
• All replicas contain an oplog
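As a minimal sketch (not from the slides), the oplog can be inspected from the shell; it lives in the local database, and db.printReplicationInfo() reports the oplog's size and the time window it covers:

> use local
> db.oplog.rs.find().sort({$natural : -1}).limit(1)   // the most recent oplog entry
> db.printReplicationInfo()                           // oplog size and time window covered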

Slide 13

Creating a Replica Set

mongod --replSet <name> --oplogSize <size in MB>

> cfg = { _id : "myset",
          members : [
            { _id : 0, host : "berlin1.acme.com" },
            { _id : 1, host : "berlin2.acme.com" },
            { _id : 2, host : "berlin3.acme.com" } ] }
> use admin
> db.runCommand( { replSetInitiate : cfg } )
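Once initiated, a quick sanity check (not on the slide, but standard shell helpers):

> rs.status()   // state of every member, from this node's point of view
> rs.conf()     // the configuration that was just applied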

Slide 14

Managing a Replica Set
• rs.conf()
  • Shell helper: get current configuration
• rs.initiate()
  • Shell helper: initiate the replica set
• rs.reconfig(cfg)
  • Shell helper: reconfigure a replica set
• rs.add("hostname:<port>")
  • Shell helper: add a new member
• rs.remove("hostname:<port>")
  • Shell helper: remove a member
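A short example session using these helpers (the hostname is illustrative):

> rs.add("berlin4.acme.com:27017")      // add a new member
> rs.remove("berlin4.acme.com:27017")   // remove it again
> cfg = rs.conf()                       // fetch the current configuration...
> cfg.members[1].priority = 5           // ...modify it...
> rs.reconfig(cfg)                      // ...and apply the change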

Slide 15

Managing a Replica Set
• rs.status()
  • Reports status of the replica set from one node's point of view
• rs.stepDown()
  • Request the primary to step down
• rs.freeze()
  • Prevents this member from seeking election, keeping the current primary/secondary roles fixed
  • Use during backups
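For example (both helpers take a duration in seconds):

> rs.stepDown(60)   // primary steps down and won't seek re-election for 60s
> rs.freeze(120)    // run on a secondary: it won't seek election for 120s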

Slide 16

Priorities
(Image credit: http://www.flickr.com/photos/21975042@N02/3866780999)

Slide 17

Priorities
• Priority: a floating point number between 0 and 100
• Used during an election; the new primary is the member that is:
  • Most up to date
  • Highest priority
  • Less than 10s behind the failed Primary
• Allows weighting of members during failover

Slide 18

Priorities - example
• Members: A (p:10), B (p:10), C (p:1), D (p:1), E (p:0)
• Assuming all members are up to date:
  • Members A or B will be chosen first
    • Highest priority
  • Members C or D will be chosen when:
    • A and B are unavailable
    • A and B are not up to date
  • Member E is never chosen
    • priority:0 means it cannot be elected
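A configuration matching this example might look like the following sketch (hostnames are illustrative):

> cfg = { _id : "myset",
          members : [
            { _id : 0, host : "a.acme.com", priority : 10 },
            { _id : 1, host : "b.acme.com", priority : 10 },
            { _id : 2, host : "c.acme.com", priority : 1 },
            { _id : 3, host : "d.acme.com", priority : 1 },
            { _id : 4, host : "e.acme.com", priority : 0 } ] }   // E can never become primary
> rs.initiate(cfg)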

Slide 19

Write Concerns
(Image credit: http://www.guardian.co.uk/gpc/gallery/view-the-print-site#/?picture=346957405)

Slide 20

Write Concerns
db.runCommand({getLastError: 1, w : 1})
• Ensures the write is synchronous
• Command returns after the primary has written to memory
w: n or w: 'majority'
• n is the number of nodes the data must be replicated to
• The driver will always send writes to the Primary
w: 'my_tag'
• Each member is "tagged", e.g. "allDCs"
• Ensures the write is executed in each tagged "region"
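For example, blocking until a majority of members have the write (wtimeout, in milliseconds, bounds how long getLastError will wait):

> db.post.insert({ title : "hello" })
> db.runCommand({ getLastError : 1, w : "majority", wtimeout : 5000 })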

Slide 21

Write Concerns
fsync: true
• Ensures changed disk blocks are flushed to disk
j: true
• Ensures changes are flushed to the Journal
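Both are used with getLastError in the same way as w:

> db.runCommand({ getLastError : 1, j : true })       // wait for the journal write
> db.runCommand({ getLastError : 1, fsync : true })   // flush data files to disk first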

Slide 22

Tagging
(Image credit: http://www.flickr.com/photos/dalvenjah/3827183653)

Slide 23

Tagging
• Control over where data is written to
• Each member can have one or more tags:
  tags: {dc: "ber"}
  tags: {dc: "ber", ip: "192.168", rack: "row3-rk7"}
• The replica set defines rules for where data resides
• Rules can change without changing application code

Slide 24

Tagging - example

> cfg = { _id : "mySet",
          members : [
            {_id : 0, host : "A", tags : {"dc": "ber"}},
            {_id : 1, host : "B", tags : {"dc": "ber"}},
            {_id : 2, host : "C", tags : {"dc": "lon"}},
            {_id : 4, host : "E", tags : {"dc": "nyc"}} ],
          settings : {
            getLastErrorModes : {
              allDCs : {"dc" : 3},
              someDCs : {"dc" : 2}}} }
> db.post.insert({...})
> db.runCommand({getLastError : 1, w : "allDCs"})
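With the same settings block, requiring the write in only two of the three data centers:

> db.runCommand({getLastError : 1, w : "someDCs"})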

Slide 25

Eventual Consistency
(Image credit: http://www.flickr.com/photos/26095468@N04/3779692985)

Slide 26

Using Replicas for Reads
• Read Preference / Slave Okay
  • The driver will always send writes to the Primary
  • The driver will send read requests to Secondaries
• Python examples:
  Connection(read_preference=ReadPreference.PRIMARY)
  db.read_preference = ReadPreference.SECONDARY_ONLY
  db.test.read_preference = ReadPreference.SECONDARY
  db.test.find(read_preference=ReadPreference.SECONDARY)
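The shell equivalent of allowing secondary reads (a sketch; rs.slaveOk() is the shell helper of this era, doing for the shell what the read preference APIs above do for the driver):

> rs.slaveOk()     // allow this connection to read from secondaries
> db.test.find()   // may now be served by a secondary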

Slide 27

Using Replicas for Reads
• Warning!
  • Secondaries may be out of date
  • Not applicable for all applications
• Sharding provides consistent scaling of reads

Slide 28

Replication features
• Reads from the Primary are always consistent
• Reads from Secondaries are eventually consistent
• Automatic failover if a Primary fails
• Automatic recovery when a node joins the set
• Full control of where writes occur

Slide 29

The examples we run out of time for...

Slide 30

High Availability Scenarios
(Image credit: http://www.flickr.com/photos/mag3737/452820590)

Slide 31

Single Node
• Will have downtime
• If the node crashes, human intervention might be needed

Slide 32

Replica Set 1
[Diagram: a replica set with an Arbiter]
• Single datacenter
• Single switch & power
• One node failure
• Automatic recovery of single node crash
• Points of failure:
  • Power
  • Network
  • Datacenter

Slide 33

Replica Set 2
[Diagram: a replica set with an Arbiter]
• Single datacenter
• Multiple power/network zones
• Automatic recovery of single node crash
• w=2 not viable, as losing 1 node means no writes
• Points of failure:
  • Datacenter
  • Two node failure

Slide 34

Replica Set 3
• Single datacenter
• Multiple power/network zones
• Automatic recovery of single node crash
• w=2 viable, as 2/3 remain online
• Points of failure:
  • Datacenter
  • Two node failure

Slide 35

When disaster strikes
(Image credit: http://www.calgaryherald.com)

Slide 36

Replica Set 4
• Multi datacenter
• DR node for safety
• Can't do a multi-data-center durable write safely, since there is only 1 node in the distant DC

Slide 37

Replica Set 5
• Three data centers
• Can survive full data center loss
• Can do w = { dc : 2 } to guarantee a write in 2 data centers

Slide 38

Typical Deployments

Use? | Set size | Data Protection | High Availability | Notes
 X   | One      | No              | No                | Must use --journal to protect against crashes
     | Two      | Yes             | No                | On loss of one member, surviving member is read only
     | Three    | Yes             | Yes - 1 failure   | On loss of one member, surviving two members can elect a new primary
 X   | Four     | Yes             | Yes - 1 failure*  | * On loss of two members, surviving two members are read only
     | Five     | Yes             | Yes - 2 failures  | On loss of two members, surviving three members can elect a new primary

Slide 39

@mongodb
conferences, appearances, and meetups: http://www.10gen.com/events
Facebook: http://bit.ly/mongofb
LinkedIn: http://linkd.in/joinmongo
download at mongodb.org
support, training, and this talk brought to you by 10gen