Slide 1

Slide 1 text

MongoDB Replication

Slide 2

Slide 2 text

Philipp Krenn @xeraa ecosio & ViennaDB

Slide 3

Slide 3 text

Motivation
Availability & data safety
Read scalability
Helping backups

Slide 4

Slide 4 text

Data migration
Delayed members
Oplog tailing (Meteor.js)
https://meteorhacks.com/mongodb-oplog-and-meteor.html

Slide 5

Slide 5 text

Basics

Slide 6

Slide 6 text

Terminology
Primary + Secondaries
Master + Slaves: problematic, renamed
Arbiter

Slide 7

Slide 7 text

http://docs.mongodb.org

Slide 8

Slide 8 text

http://docs.mongodb.org

Slide 9

Slide 9 text

http://docs.mongodb.org
> rs.addArb("arbiter.example.com:3000")

Slide 10

Slide 10 text

http://docs.mongodb.org

Slide 11

Slide 11 text

Limits
50 replica set members (12 before 2.7.8)
7 voting members

Slide 12

Slide 12 text

Example

Slide 14

Slide 14 text

Single instance
$ mkdir 1
$ mongod --dbpath 1 --port 27001 --logpath log1
$ mongo --port 27001
> db.test.insert({ name: "Philipp", city: "Wien" })
> db.test.find()
Stop instance

Slide 15

Slide 15 text

Add replication
$ mkdir 2
$ mkdir 3
$ mongod --replSet javantura --dbpath 1 --port 27001 --logpath log1 --oplogSize 20
$ mongod --replSet javantura --dbpath 2 --port 27002 --logpath log2 --oplogSize 20
$ mongod --replSet javantura --dbpath 3 --port 27003 --logpath log3 --oplogSize 20

Slide 16

Slide 16 text

Connect
$ hostname
$ mongo --port 27001
> db.test.find()

Slide 17

Slide 17 text

Configure replication
Start on the old instance, otherwise its data is lost
> rs.initiate()
> rs.status()
> rs.add("PK-MBP:27002")
> rs.add("PK-MBP:27003")
> rs.status()
> db.isMaster()
> db.test.find()
> db.test.insert({ name: "Peter", city: "Steyr" })
> db.test.find()
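
As an alternative to calling rs.initiate() and then rs.add() for each member, rs.initiate() also accepts a complete configuration document. The sketch below only builds such a document in plain JavaScript (it runs in Node.js); the helper name buildReplSetConfig is hypothetical, while the set name and hosts are the ones used on the slides.

```javascript
// Build a replica set configuration document: one entry per member,
// each with a unique numeric _id and a "host:port" string.
// buildReplSetConfig is a hypothetical helper, not part of MongoDB.
function buildReplSetConfig(setName, hosts) {
  return {
    _id: setName,
    members: hosts.map((host, index) => ({ _id: index, host: host })),
  };
}

const cfg = buildReplSetConfig("javantura", [
  "PK-MBP:27001",
  "PK-MBP:27002",
  "PK-MBP:27003",
]);

console.log(JSON.stringify(cfg, null, 2));
```

In the mongo shell connected to port 27001, the resulting document would be passed as rs.initiate(cfg).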

Slide 18

Slide 18 text

Read from secondaries
$ mongo --port 27002
> db.test.find()
> rs.slaveOk()
> db.test.find()
> db.test.insert({ name: "Dieter", city: "Graz" })
The insert is rejected: secondaries do not accept writes
slaveOk is only valid for the current connection

Slide 19

Slide 19 text

Failover
Kill primary with [Ctrl]+[C]
Write to new primary
> rs.status()
> db.test.insert({ name: "Dieter", city: "Graz" })
> db.test.find()

Slide 20

Slide 20 text

Restart old primary
$ mongod --replSet javantura --dbpath 1 --port 27001 --logpath log1 --oplogSize 20
$ mongo --port 27001
> rs.status()
> rs.slaveOk()
> db.test.find()

Slide 21

Slide 21 text

Election

Slide 22

Slide 22 text

Heartbeat
2s interval
10s until election

Slide 23

Slide 23 text

Election rules
1. Priority
2. Optime
3. Connections
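
The three rules can be read as successive tie-breakers. The following is a simplified model of that ordering, not the server's election code; the member objects and the field name reachable (number of members a node can connect to) are assumptions made for the illustration, and the values are made up.

```javascript
// Order candidates by the three rules from the slide:
// higher priority first, then more recent optime, then more connections.
function compareCandidates(a, b) {
  if (a.priority !== b.priority) return b.priority - a.priority;
  if (a.optime !== b.optime) return b.optime - a.optime;
  return b.reachable - a.reachable;
}

// Hypothetical state of a three-member set.
const members = [
  { host: "PK-MBP:27001", priority: 1, optime: 1411026642, reachable: 2 },
  { host: "PK-MBP:27002", priority: 1, optime: 1411026650, reachable: 2 },
  { host: "PK-MBP:27003", priority: 0, optime: 1411026650, reachable: 2 },
];

// Members with priority 0 can never become primary.
const eligible = members.filter((m) => m.priority > 0);
const preferred = eligible.slice().sort(compareCandidates)[0];

console.log(preferred.host);
```

Here the two eligible members have the same priority, so the more recent optime decides.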

Slide 24

Slide 24 text

Priority
> cfg = rs.conf()
> cfg.members[0].priority = 0
> cfg.members[1].priority = 1
> cfg.members[2].priority = 2
> rs.reconfig(cfg)

Slide 25

Slide 25 text

Optime

Slide 26

Slide 26 text

Connections

Slide 27

Slide 27 text

Election
Candidate node asks for a vote
Others can veto

Slide 28

Slide 28 text

Election
Each node votes yes for at most one candidate within 30s
A majority of yes votes elects a new primary
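
The majority requirement is plain arithmetic: more than half of the voting members. A minimal sketch, with hypothetical function names, that also shows why the minority side of a network partition cannot elect a primary:

```javascript
// Smallest number of votes that is more than half of the voting members.
function majority(votingMembers) {
  return Math.floor(votingMembers / 2) + 1;
}

// A group of nodes can elect a primary only if it contains a majority.
function canElectPrimary(reachableVoters, votingMembers) {
  return reachableVoters >= majority(votingMembers);
}

console.log(majority(3), majority(4), majority(7));
console.log(canElectPrimary(1, 3), canElectPrimary(2, 3));
```

With an even number of voters, a clean split leaves neither side with a majority, which is one reason to add an arbiter.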

Slide 30

Slide 30 text

Issues

Slide 31

Slide 31 text

CAP
Select Availability or Consistency
Partition tolerance is a prerequisite for distributed systems
"The network is reliable": http://aphyr.com/posts/288-the-network-is-reliable

Slide 32

Slide 32 text

Rollback
Old primary rolls back unreplicated changes once it rejoins the replica set
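
Which changes are rolled back can be sketched by comparing two oplogs. This is a deliberately simplified model: entries are reduced to { ts, op } and matched by ts only, whereas real oplog entries carry more fields. The function name and the sample entries are hypothetical.

```javascript
// Return the entries of the old primary's oplog that the new primary
// never received: everything from the first unreplicated entry onwards.
function findRolledBackOps(oldPrimaryOplog, newPrimaryOplog) {
  const replicated = new Set(newPrimaryOplog.map((entry) => entry.ts));
  let firstUnreplicated = oldPrimaryOplog.length;
  for (let i = 0; i < oldPrimaryOplog.length; i++) {
    if (!replicated.has(oldPrimaryOplog[i].ts)) {
      firstUnreplicated = i;
      break;
    }
  }
  return oldPrimaryOplog.slice(firstUnreplicated);
}

const oldPrimaryOplog = [
  { ts: 1, op: "insert Philipp" },
  { ts: 2, op: "insert Peter" },
  { ts: 3, op: "insert accepted just before the crash" }, // never replicated
];
const newPrimaryOplog = [
  { ts: 1, op: "insert Philipp" },
  { ts: 2, op: "insert Peter" },
  { ts: 4, op: "insert written after failover" },
];

const rolledBack = findRolledBackOps(oldPrimaryOplog, newPrimaryOplog);
console.log(rolledBack.map((entry) => entry.op));
```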

Slide 33

Slide 33 text

Rollback file
rollback/ in data folder
File name: <database>.<collection>.<timestamp>.bson

Slide 34

Slide 34 text

Election time
At times 5 to 7 minutes
http://www.tokutek.com/2014/07/explaining-ark-part-2-how-elections-and-failover-currently-work/

Slide 35

Slide 35 text

Missing synchronization during election
Old primary sends last changes to a single node
If that node does not become the new primary: rollback

Slide 36

Slide 36 text

Remember Replication is asynchronous

Slide 37

Slide 37 text

Multiple primaries
Unlikely but possible
Bugs: https://jira.mongodb.org/browse/SERVER-9765
Test script with no replies: https://groups.google.com/forum/#!topic/mongodb-dev/-mH6BOYyzeI

Slide 38

Slide 38 text

Kyle Kingsbury @aphyr: Call Me Maybe
http://aphyr.com/tags/jepsen
PostgreSQL, Redis, MongoDB, Riak, Zookeeper, RabbitMQ, etcd + Consul, ElasticSearch

Slide 39

Slide 39 text

http://aphyr.com/posts/284-call-me-maybe-mongodb
05/2013, version 2.4
Up to 42% data lost
Data written to old primary: rollback

Slide 41

Slide 41 text

WriteConcern
Configure durability vs performance
https://github.com/mongodb/mongo-java-driver/blob/master/src/main/com/mongodb/WriteConcern.java

Slide 42

Slide 42 text

WriteConcern.UNACKNOWLEDGED
w=0, j=0
Fire and forget
Default until 11/2012

Slide 44

Slide 44 text

WriteConcern.ACKNOWLEDGED
w=1, j=0
Current default
Operation successful in memory

Slide 45

Slide 45 text

WriteConcern.JOURNALED
w=1, j=1
Operation written to the journal file
Since 1.8, single server durability

Slide 46

Slide 46 text

WriteConcern.FSYNCED
w=1, fsync=true
Operation written to disk

Slide 47

Slide 47 text

WriteConcern.REPLICA_ACKNOWLEDGED
w=2, j=0
Acknowledged by primary and at least one secondary
w is the number of servers that must acknowledge the write

Slide 48

Slide 48 text

WriteConcern.MAJORITY
w=majority, j=0
Acknowledgement by the majority of nodes
wtimeout recommended

Slide 49

Slide 49 text

WriteConcern.MAJORITY
Almost no data loss, but high overhead
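
How the w value translates into a number of required acknowledgements can be sketched as follows. This is a simplified model that ignores j, fsync and wtimeout; the function names are hypothetical.

```javascript
// Number of acknowledgements a write needs before it is reported as successful.
// w is either a number or the string "majority".
function requiredAcks(w, members) {
  return w === "majority" ? Math.floor(members / 2) + 1 : w;
}

// acks counts the nodes (primary included) that have applied the write.
function isWriteConcernSatisfied(w, acks, members) {
  return acks >= requiredAcks(w, members);
}

console.log(isWriteConcernSatisfied(0, 0, 3));          // UNACKNOWLEDGED
console.log(isWriteConcernSatisfied(1, 1, 3));          // ACKNOWLEDGED
console.log(isWriteConcernSatisfied(2, 1, 3));          // REPLICA_ACKNOWLEDGED, primary only
console.log(isWriteConcernSatisfied("majority", 2, 3)); // MAJORITY
```

In the mongo shell (MongoDB 2.6 or later), a write concern can be passed per operation, for example db.test.insert({ name: "Dieter", city: "Graz" }, { writeConcern: { w: "majority", wtimeout: 5000 } }); the timeout of 5000 ms is an arbitrary value chosen for the example.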

Slide 50

Slide 50 text

Write concern performance
https://blog.serverdensity.com/mongodb-on-google-compute-engine-tips-and-benchmarks/
3 x 1,000 inserts on GCE
Local 10GB system disk
Dedicated 200GB disk
Dedicated 200GB for data and journal

Slide 51

Slide 51 text

n1-standard-2

Slide 52

Slide 52 text

n1-highmem-8

Slide 53

Slide 53 text

Thanks! Questions? Now, later today, or @xeraa

Slide 54

Slide 54 text

Backup Slides

Slide 55

Slide 55 text

Oplog

Slide 56

Slide 56 text

Replication via logs
MongoDB: Operations log (Oplog)
MySQL: Binary log (Binlog)

Slide 57

Slide 57 text

Naive approach: transmit the original query
Statement-Based Replication (SBR)
DELETE FROM test.table WHERE quantity > 20 LIMIT 1
db.collection.remove({ quantity: { $gt: 20 }}, true) // justOne: true

Slide 58

Slide 58 text

Unambiguous representation
Row-Based Replication (RBR): Oplog
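
Why replaying the original statement is ambiguous can be shown with two in-memory copies of the same data. The sketch uses plain arrays instead of a database, and the documents and helper names are made up: both replicas hold the same two documents, but in a different storage order.

```javascript
// Same documents on both nodes, stored in a different order.
const primaryDocs = [{ _id: 1, quantity: 25 }, { _id: 2, quantity: 30 }];
const secondaryDocs = [{ _id: 2, quantity: 30 }, { _id: 1, quantity: 25 }];

// Statement-based: replay "remove ONE document matching the predicate".
function removeOneMatching(docs, predicate) {
  const index = docs.findIndex(predicate);
  return docs.filter((_, i) => i !== index);
}

// Row-based: replay "remove the document with exactly this _id".
function removeById(docs, id) {
  return docs.filter((doc) => doc._id !== id);
}

const overTwenty = (doc) => doc.quantity > 20;
const sbrPrimary = removeOneMatching(primaryDocs, overTwenty);     // removes _id 1
const sbrSecondary = removeOneMatching(secondaryDocs, overTwenty); // removes _id 2

// The primary records which document it actually removed (_id 1).
const rbrSecondary = removeById(secondaryDocs, 1);

console.log(sbrPrimary, sbrSecondary, rbrSecondary);
```

With the statement-based replay the two nodes end up with different documents; replaying the removal by _id keeps them identical.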

Slide 59

Slide 59 text

MongoDB
Asynchronous replication
Secondaries can get the Oplog from:
  their primary
  a secondary with more recent data

Slide 60

Slide 60 text

Oplog size
32bit: 48MB
64bit OS X: 183MB
64bit *nix, Windows: 1GB to 50GB (5% free disk)
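
For the 64-bit *nix and Windows case, the default is 5% of the free disk space, kept within the bounds given on the slide. A minimal sketch of that rule; the function name is hypothetical and the numbers are the ones from the slide, not read from the server source.

```javascript
// Default oplog size in MB for 64-bit *nix and Windows:
// 5% of free disk space, but at least 1GB and at most 50GB.
function defaultOplogSizeMB(freeDiskMB) {
  const MIN_MB = 1024;
  const MAX_MB = 50 * 1024;
  const fivePercent = Math.floor(freeDiskMB / 20);
  return Math.min(MAX_MB, Math.max(MIN_MB, fivePercent));
}

console.log(defaultOplogSizeMB(10 * 1024));   // small disk: lower bound applies
console.log(defaultOplogSizeMB(200 * 1024));  // 5% of 200GB
console.log(defaultOplogSizeMB(2000 * 1024)); // large disk: upper bound applies
```

The demo above overrides the default with --oplogSize 20, which is why oplog.rs is only 20MB there.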

Slide 61

Slide 61 text

Inner details

Slide 62

Slide 62 text

Capped collection in oplog.rs of the local database
> use local
> show collections
me                0.000MB / 0.008MB
oplog.rs          0.000MB / 20.000MB
replset.minvalid  0.000MB / 0.008MB
slaves            0.000MB / 0.008MB
startup_log       0.003MB / 10.000MB
system.indexes    0.001MB / 0.008MB
system.replset    0.000MB / 0.008MB

Slide 63

Slide 63 text

> db.oplog.rs.find()
{
  "h": NumberLong("-265486071808715859"),
  "ns": "test.test",
  "o": {
    "_id": ObjectId("541a8ed285ea5f8ae059d530"),
    "name": "Dieter",
    "city": "Graz"
  },
  "op": "i",
  "ts": Timestamp(1411026642, 1),
  "v": 2
}
...
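
The fields of such an entry can be decoded with a few lines of plain JavaScript. In this sketch the shell types (Timestamp, ObjectId, NumberLong) are replaced by plain values so that it runs in Node.js; the function name describeOplogEntry and the shape of the ts object are assumptions made for the example.

```javascript
// Meaning of the "op" field in an oplog entry.
const OP_TYPES = { i: "insert", u: "update", d: "delete", c: "command", n: "no-op" };

// The entry from the slide, with shell types replaced by plain values.
const entry = {
  ns: "test.test",
  o: { name: "Dieter", city: "Graz" },
  op: "i",
  ts: { seconds: 1411026642, increment: 1 },
  v: 2,
};

function describeOplogEntry(e) {
  // "ns" is "<database>.<collection>"; the collection name may contain dots.
  const [database, ...rest] = e.ns.split(".");
  return {
    operation: OP_TYPES[e.op],
    database: database,
    collection: rest.join("."),
    // The first part of the timestamp is seconds since the Unix epoch.
    time: new Date(e.ts.seconds * 1000).toISOString(),
  };
}

const description = describeOplogEntry(entry);
console.log(description);
```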