MongoDB UK 2012: MongoDB Oplog Magic

Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

Mongo Replica Set
● asynchronous replication
● multiple nodes that are copies of each other
● one primary, multiple secondaries (slaves)
● automatic election
● writes are handled by the primary only!

Slide 3

Slide 3 text

Mongo Replica Set

Slide 4

Slide 4 text

Oplog
● oplog - special collection (capped)
● oplog - records each write operation
● replicas “tail the oplog” for new updates
● new ops are replayed on secondaries

Slide 5

Slide 5 text

Oplog example: insert

    {u'h': -1469300750073380169L,
     u'ns': u'mydb.tweets',
     u'o': {u'_id': ObjectId('4e95ae77a20e6164850761cd'),
            u'content': u'Lorem ipsum',
            u'nr': 16},
     u'op': u'i',
     u'ts': Timestamp(1318432375, 1)}

Slide 6

Slide 6 text

Oplog example: update

    {u'h': -5295451122737468990L,
     u'ns': u'mydb.tweets',
     u'o': {u'$set': {u'content': u'Lorem ipsum'}},
     u'o2': {u'_id': ObjectId('4e95ae3616692111bb000001')},
     u'op': u'u',
     u'ts': Timestamp(1318432339, 1)}
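
As a rough sketch of how entries like the two above could be replayed, the snippet below applies them to an in-memory dict standing in for the destination. The `apply_entry` helper and the dict-based store are assumptions made for illustration, not OplogReplay's actual code. Only `$set` updates are handled, and ObjectIds are replaced with plain strings.

```python
def apply_entry(store, entry):
    """Apply one oplog-style entry to `store` (namespace -> {_id: document})."""
    collection = store.setdefault(entry["ns"], {})
    op = entry["op"]
    if op == "i":
        # Insert: 'o' holds the full document.
        doc = dict(entry["o"])
        collection[doc["_id"]] = doc
    elif op == "u":
        # Update: 'o2' selects the document, 'o' holds the modifier.
        target = collection.get(entry["o2"]["_id"])
        if target is not None:
            target.update(entry["o"].get("$set", {}))
    elif op == "d":
        # Delete: 'o' holds the _id of the document to remove.
        collection.pop(entry["o"]["_id"], None)
    return store


store = {}
apply_entry(store, {"ns": "mydb.tweets", "op": "i",
                    "o": {"_id": "4e95ae77", "content": "Lorem", "nr": 16}})
apply_entry(store, {"ns": "mydb.tweets", "op": "u",
                    "o2": {"_id": "4e95ae77"},
                    "o": {"$set": {"content": "Lorem ipsum"}}})
```

After both calls, `store["mydb.tweets"]` holds a single document whose `content` is the updated value.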

Slide 7

Slide 7 text

Custom oplog replay
Q: What do we want?
A: Cross-cluster oplog replay (replay oplogs from one mongo cluster to another).

Slide 8

Slide 8 text

Custom oplog replay
Q: How do we do that?
A: Using OplogReplay (now open source):

    ./oplog-replay localhost:27017 localhost:27018

Slide 9

Slide 9 text

How does it work?
A: Very similar to MongoDB's internal oplog replay :)
tail the oplog for new entries:
    apply oplog entry
    save timestamp of last entry

Slide 10

Slide 10 text

How does it work?
● last timestamp is persisted on the destination

    > oplogreplay.settings.findOne()
    { "_id" : "misc-lastts", "value" : { "t" : 1335960424000, "i" : 770 } }

● restarting will replay entries newer than the last timestamp
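
A minimal sketch of the restart behaviour, assuming the settings document has the shape shown above. A dict stands in for the settings collection, and the helper names (`save_last_ts`, `load_last_ts`, `entries_to_replay`) are hypothetical. Timestamps are compared as `(t, i)` pairs, so the increment breaks ties between entries with the same `t`.

```python
SETTINGS_ID = "misc-lastts"  # same _id as in the settings document above


def save_last_ts(settings, t, i):
    """Store the last replayed timestamp as {"t": ..., "i": ...}."""
    settings[SETTINGS_ID] = {"_id": SETTINGS_ID, "value": {"t": t, "i": i}}


def load_last_ts(settings):
    """Return the saved (t, i) pair, or None on a first run."""
    doc = settings.get(SETTINGS_ID)
    if doc is None:
        return None
    return (doc["value"]["t"], doc["value"]["i"])


def entries_to_replay(oplog, last_ts):
    """Entries strictly newer than last_ts, comparing (t, i) pairs in order."""
    if last_ts is None:
        return list(oplog)
    return [e for e in oplog if (e["t"], e["i"]) > last_ts]


settings = {}
save_last_ts(settings, 1335960424000, 770)
oplog = [
    {"t": 1335960424000, "i": 769},
    {"t": 1335960424000, "i": 770},
    {"t": 1335960424000, "i": 771},
    {"t": 1335960425000, "i": 1},
]
pending = entries_to_replay(oplog, load_last_ts(settings))
```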

Slide 11

Slide 11 text

Other features?
TODO - explain what else it can do
- also db & collection regexp filters, start from point-in-time
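
One way a db and collection regexp filter could work is to match against the `ns` field of each oplog entry. The sketch below is an assumption about the approach, not the tool's real option handling; `make_ns_filter` and its parameters are hypothetical names.

```python
import re


def make_ns_filter(db_pattern=None, coll_pattern=None):
    """Build a predicate over oplog entries based on their 'ns' field."""
    db_re = re.compile(db_pattern) if db_pattern else None
    coll_re = re.compile(coll_pattern) if coll_pattern else None

    def accept(entry):
        # 'ns' is "<database>.<collection>"; the collection may contain dots.
        db, _, coll = entry["ns"].partition(".")
        if db_re and not db_re.search(db):
            return False
        if coll_re and not coll_re.search(coll):
            return False
        return True

    return accept


accept = make_ns_filter(db_pattern=r"^mydb$", coll_pattern=r"^tweets")
kept = [e["ns"] for e in [
    {"ns": "mydb.tweets"},
    {"ns": "mydb.users"},
    {"ns": "otherdb.tweets"},
    {"ns": "mydb.tweets.archive"},
] if accept(e)]
```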

Slide 12

Slide 12 text

Want more?
TODO - show how it can be easily extended: inheritance + skip deletes
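
The "inheritance + skip deletes" idea might look like the sketch below. The `Replayer` base class and its method names are hypothetical and do not reflect the real OplogReplay class layout. The point is only that a subclass can override the delete handler to keep documents on the destination.

```python
class Replayer:
    """Hypothetical base class: dispatches each entry to a per-op handler."""

    def __init__(self):
        self.applied = []

    def process(self, entry):
        handlers = {"i": self.insert, "u": self.update, "d": self.delete}
        handler = handlers.get(entry["op"])
        if handler is not None:
            handler(entry)

    def insert(self, entry):
        self.applied.append(("insert", entry["ns"]))

    def update(self, entry):
        self.applied.append(("update", entry["ns"]))

    def delete(self, entry):
        self.applied.append(("delete", entry["ns"]))


class SkipDeletesReplayer(Replayer):
    """Keeps everything on the destination by ignoring delete entries."""

    def delete(self, entry):
        pass  # intentionally do nothing


replayer = SkipDeletesReplayer()
for op in ("i", "u", "d"):
    replayer.process({"op": op, "ns": "mydb.tweets"})
```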

Slide 13

Slide 13 text

Inverted pyramid
Recent data is more important
TODO - add picture here (see notes)

Slide 14

Slide 14 text

Inverted pyramid
Q: How can we store historical data more cheaply?
A: Keep data in two distinct mongo clusters:
● recent - only last 30 days, more resources
● historical - all data, but fewer resources
(or even more clusters...)

Slide 15

Slide 15 text

Advantages
Q: Why bother with distinct mongo clusters?
A: Several reasons:
● different # of shards
● different # of replica sets
● more / less RAM
● adjust storage size

Slide 16

Slide 16 text

Implementation (1)
Set up an oplog-replay between clusters:
TODO - add picture
( recent ) ---oplogreplay---> ( historical )

Slide 17

Slide 17 text

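Implementation (2)
Modify your code to know about the separation:

    def get_data(since, until):
        results = []
        # T splits the timeline: older data lives in the historical cluster
        T = compute_time_threshold()
        if since <= T:
            # clamp the upper bound so the historical query stops at T
            results += get_hist_data(since, min(until, T))
        if until > T:
            # clamp the lower bound so the recent query starts at T
            results += get_rcnt_data(max(since, T), until)
        return results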
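
A self-contained version of the same idea, with in-memory lists standing in for the two clusters. Everything here is an assumption for illustration: the 30-day window is taken from the "recent" cluster description, `query` is a hypothetical stand-in for the real data access functions, and intervals are treated as half-open (`start <= ts < end`) so a row sitting exactly on the threshold is returned once.

```python
from datetime import datetime, timedelta

RECENT_WINDOW = timedelta(days=30)  # assumption: matches "last 30 days"


def compute_time_threshold(now):
    """Oldest instant still served by the recent cluster."""
    return now - RECENT_WINDOW


def query(rows, start, end):
    """Rows with start <= timestamp < end from an in-memory stand-in."""
    return [r for r in rows if start <= r[0] < end]


def get_data(historical, recent, since, until, now):
    """Merge results across both stores without overlap at the threshold."""
    threshold = compute_time_threshold(now)
    results = []
    if since < threshold:
        results += query(historical, since, min(until, threshold))
    if until > threshold:
        results += query(recent, max(since, threshold), until)
    return results


now = datetime(2012, 6, 20)
old_row = (datetime(2012, 3, 1), "old tweet")
new_row = (datetime(2012, 6, 10), "new tweet")
historical = [old_row, new_row]   # holds all data
recent = [new_row]                # holds only the last 30 days
rows = get_data(historical, recent, datetime(2012, 1, 1), now, now)
```

Clamping both bounds to the threshold keeps a query that lies entirely on one side from reaching into the other cluster's range.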

Slide 18

Slide 18 text

Slide 19

Slide 19 text

Splitting Replica Sets
● split one mongo cluster in two
● with no downtime

Slide 20

Slide 20 text

Splitting Replica Sets
Q: How is it done?
A:

Slide 21

Slide 21 text

Splitting Replica Sets
i. create a new node (Secondary), wait for it to catch up
ii. stop the node, remove it from the ReplicaSet
iii. hack its internal state to look like a NEW replica set
iv. start oplogreplay from point-in-time
v. redirect your app code (all at once or one at a time, depending on your application needs)

Slide 22

Slide 22 text

Limitations
TODO - mongos → one oplogreplay per shard, but balancer deletes are an issue!

Slide 23

Slide 23 text

Limitations
TODO - how to overcome the balancer issue?
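
One possible direction, offered only as a sketch: it assumes the source MongoDB version marks oplog entries produced by chunk migration with a `fromMigrate: true` field. The slides do not say this, so it should be verified against the deployment. Under that assumption, migration traffic can be filtered out before replay; `is_balancer_noise` is a hypothetical helper name.

```python
def is_balancer_noise(entry):
    """True for entries assumed to be produced by chunk migration."""
    # Assumption: migration-generated entries carry fromMigrate=True.
    return bool(entry.get("fromMigrate", False))


entries = [
    {"op": "d", "ns": "mydb.tweets", "o": {"_id": 1}, "fromMigrate": True},
    {"op": "d", "ns": "mydb.tweets", "o": {"_id": 2}},
    {"op": "i", "ns": "mydb.tweets", "o": {"_id": 3}},
]
to_replay = [e for e in entries if not is_balancer_noise(e)]
```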

Slide 24

Slide 24 text

Limitations
TODO - if the oplogreplay falls behind for too long, there is no recovery procedure
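
Because the oplog is a capped collection, old entries are eventually overwritten. A replayer could at least detect that it has fallen off the end by comparing its saved timestamp with the oldest entry still present in the source oplog. The check below is a hypothetical sketch using `(seconds, increment)` tuples; it detects the condition but does not recover from it.

```python
def has_fallen_off(last_ts, oldest_source_ts):
    """True if entries after last_ts may already be gone from the oplog.

    If the oldest surviving entry is newer than the last replayed one,
    there is no guarantee that nothing in between was overwritten.
    """
    return last_ts < oldest_source_ts


still_ok = has_fallen_off((1318432375, 1), (1318432339, 1))
too_late = has_fallen_off((1318432339, 1), (1318432375, 1))
```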