Replicating MongoDB

...what could go wrong?
Presentation for the Javantura conference in Zagreb.

Philipp Krenn

November 15, 2014

Transcript

  1. MongoDB Replication

  2. Philipp Krenn, @xeraa, ecosio & ViennaDB

  3. Motivation: availability and data safety; read scalability; helping with backups

  4. Data migration; delayed members; Oplog tailing (Meteor.js): https://meteorhacks.com/mongodb-oplog-and-meteor.html

  5. Basics

  6. Terminology: Primary + Secondaries (the older terms Master + Slaves are problematic and were renamed); Arbiter

  7. http://docs.mongodb.org

  8. http://docs.mongodb.org

  9. http://docs.mongodb.org
    > rs.addArb("arbiter.example.com:3000")

  10. http://docs.mongodb.org

  11. Limits: 50 replica set members (12 before 2.7.8); 7 voting members

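The two limits can be checked mechanically before a reconfiguration. A minimal sketch in plain JavaScript (Node.js); `checkReplicaSetLimits` is a hypothetical helper, and the `{ members: [{ host, votes }] }` shape is an assumption loosely modelled on what `rs.conf()` returns, not an official API.

```javascript
// Limits as stated on the slide: 50 members in total, at most 7 of them voting.
const MAX_MEMBERS = 50;
const MAX_VOTING_MEMBERS = 7;

// Hypothetical helper: returns the list of violated limits (empty if none).
// Assumption: a member counts as voting unless its votes field is exactly 0.
function checkReplicaSetLimits(cfg) {
  const problems = [];
  const voting = cfg.members.filter((m) => m.votes !== 0).length;
  if (cfg.members.length > MAX_MEMBERS) {
    problems.push(`too many members: ${cfg.members.length} > ${MAX_MEMBERS}`);
  }
  if (voting > MAX_VOTING_MEMBERS) {
    problems.push(`too many voting members: ${voting} > ${MAX_VOTING_MEMBERS}`);
  }
  return problems;
}
```

An eighth member would therefore have to be added with `votes: 0` to pass the check.
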
  12. Example

  14. Single instance
    $ mkdir 1
    $ mongod --dbpath 1 --port 27001 --logpath log1
    $ mongo --port 27001
    > db.test.insert({ name: "Philipp", city: "Wien" })
    > db.test.find()
    Stop instance

  15. Add replication
    $ mkdir 2
    $ mkdir 3
    $ mongod --replSet javantura --dbpath 1 --port 27001 --logpath log1 --oplogSize 20
    $ mongod --replSet javantura --dbpath 2 --port 27002 --logpath log2 --oplogSize 20
    $ mongod --replSet javantura --dbpath 3 --port 27003 --logpath log3 --oplogSize 20

  16. Connect
    $ hostname
    $ mongo --port 27001
    > db.test.find()

  17. Configure replication
    Start on the old instance, otherwise its data is lost
    rs.initiate()
    rs.status()
    rs.add("PK-MBP:27002")
    rs.add("PK-MBP:27003")
    rs.status()
    db.isMaster()
    db.test.find()
    db.test.insert({ name: "Peter", city: "Steyr" })
    db.test.find()

  18. Read from secondaries
    $ mongo --port 27002
    > db.test.find()
    > rs.slaveOk()
    > db.test.find()
    > db.test.insert({ name: "Dieter", city: "Graz" })
    slaveOk is only valid for the current connection

  19. Failover
    Kill the primary with [Ctrl]+[C]
    Write to the new primary
    > rs.status()
    > db.test.insert({ name: "Dieter", city: "Graz" })
    > db.test.find()

  20. Restart old primary
    $ mongod --replSet javantura --dbpath 1 --port 27001 --logpath log1 --oplogSize 20
    $ mongo --port 27001
    > rs.status()
    > rs.slaveOk()
    > db.test.find()

  21. Election

  22. Heartbeat: 2s interval; 10s until election
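The two numbers on this slide can be read as a simple timeout check. A minimal sketch in plain JavaScript, assuming that "10s until election" means a member is treated as unreachable once no heartbeat has been answered for 10 seconds; both function names are hypothetical and this is not the server's actual implementation.

```javascript
// Values from the slide.
const HEARTBEAT_INTERVAL_MS = 2000;
const ELECTION_TIMEOUT_MS = 10000;

// Hypothetical helper: has the primary been silent long enough to start an election?
function primaryConsideredDown(lastHeartbeatMs, nowMs) {
  return nowMs - lastHeartbeatMs >= ELECTION_TIMEOUT_MS;
}

// Hypothetical helper: how many heartbeat intervals have passed without an answer?
function missedHeartbeats(lastHeartbeatMs, nowMs) {
  return Math.floor((nowMs - lastHeartbeatMs) / HEARTBEAT_INTERVAL_MS);
}
```

Under this reading, five consecutive heartbeats go unanswered before the remaining members react.
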

  23. Election rules: (1) Priority, (2) Optime, (3) Connections
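One way to read the three rules is as a filter followed by an ordering. The sketch below is a deliberately simplified model in plain JavaScript, not MongoDB's election code: `pickCandidate` and the member fields `priority`, `optime`, and `seesMajority` are assumptions made for illustration.

```javascript
// Simplified, hypothetical model of the slide's rules:
//   Priority:    priority 0 can never become primary; higher priority wins.
//   Optime:      among equal priorities, the most recent optime wins.
//   Connections: a candidate must be able to reach a majority of the set.
function pickCandidate(members) {
  const eligible = members.filter((m) => m.priority > 0 && m.seesMajority);
  // Sort by priority (descending), then by optime (descending).
  eligible.sort((a, b) => b.priority - a.priority || b.optime - a.optime);
  return eligible.length > 0 ? eligible[0].host : null;
}
```

With priorities 0, 1, and 1, the priority-0 member is never chosen, and the tie between the other two is broken by optime.
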

  24. Priority
    cfg = rs.conf()
    cfg.members[0].priority = 0
    cfg.members[1].priority = 1
    cfg.members[2].priority = 2
    rs.reconfig(cfg)

  25. Optime

  26. Connections

  27. Election: the candidate node asks for votes; others can veto

  28. Election: each node gives one yes to one node within 30s; a majority of yes votes elects a new primary

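The majority rule is plain arithmetic. A short sketch in JavaScript; the function names are hypothetical.

```javascript
// Strict majority of the voting members: more than half.
function majority(votingMembers) {
  return Math.floor(votingMembers / 2) + 1;
}

// A partition can elect a primary only if it contains a majority of all voters.
function canElectPrimary(votingMembers, reachableMembers) {
  return reachableMembers >= majority(votingMembers);
}
```

A set of 3 needs 2 votes and a set of 4 needs 3, so the fourth member does not let the set survive any additional failure. That is the usual argument for an odd member count, or for adding an arbiter.
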
  30. Issues

  31. CAP: select Availability or Consistency; Partition-tolerance is a prerequisite for distributed systems
    "The network is reliable": http://aphyr.com/posts/288-the-network-is-reliable

  32. Rollback: the old primary rolls back unreplicated changes once it rejoins the replica set

  33. Rollback file: rollback/ in the data folder; file name: <database>.<collection>.<timestamp>.bson

  34. Election time: at times 5 to 7 minutes
    http://www.tokutek.com/2014/07/explaining-ark-part-2-how-elections-and-failover-currently-work/

  35. Missing synchronization during election: the old primary sends its last changes to a single node; if that node does not become the new primary: rollback

  36. Remember: replication is asynchronous

  37. Multiple primaries: unlikely but possible
    Bugs: https://jira.mongodb.org/browse/SERVER-9765
    Test script with no replies: https://groups.google.com/forum/#!topic/mongodb-dev/-mH6BOYyzeI

  38. Kyle Kingsbury @aphyr: Call Me Maybe
    http://aphyr.com/tags/jepsen
    PostgreSQL, Redis, MongoDB, Riak, Zookeeper, RabbitMQ, etcd + Consul, ElasticSearch

  39. http://aphyr.com/posts/284-call-me-maybe-mongodb
    05/2013, version 2.4: up to 42% of data lost
    Data written to the old primary: rollback

  41. WriteConcern: configure durability vs performance
    https://github.com/mongodb/mongo-java-driver/blob/master/src/main/com/mongodb/WriteConcern.java

  42. WriteConcern.UNACKNOWLEDGED: w=0, j=0; fire and forget; default until 11/2012

  44. WriteConcern.ACKNOWLEDGED: w=1, j=0; current default; operation successful in memory

  45. WriteConcern.JOURNALED: w=1, j=1; operation written to the journal file; since 1.8, single server durability

  46. WriteConcern.FSYNCED: w=1, fsync=true; operation written to disk

  47. WriteConcern.REPLICA_ACKNOWLEDGED: w=2, j=0; acknowledged by the primary and at least one secondary; w is the number of servers

  48. WriteConcern.MAJORITY: w=majority, j=0; acknowledgement by the majority of nodes; wtimeout recommended

  49. WriteConcern.MAJORITY: nearly no data lost, but high overhead
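The named levels from the preceding slides map onto small option documents. A sketch in plain JavaScript that builds them; the `WRITE_CONCERNS` table and the `writeConcern` helper are hypothetical and only restate the w/j values from the slides. FSYNCED is left out because it relies on the separate fsync flag.

```javascript
// w and j values as listed on the slides (j=0 shown as false, j=1 as true).
const WRITE_CONCERNS = {
  UNACKNOWLEDGED: { w: 0, j: false },
  ACKNOWLEDGED: { w: 1, j: false },
  JOURNALED: { w: 1, j: true },
  REPLICA_ACKNOWLEDGED: { w: 2, j: false },
  MAJORITY: { w: "majority", j: false },
};

// Hypothetical helper: returns a copy of the named level, optionally with a
// wtimeout in milliseconds so that a write does not block forever while
// waiting for secondaries that are down.
function writeConcern(name, wtimeoutMs) {
  const base = WRITE_CONCERNS[name];
  if (!base) throw new Error(`unknown write concern: ${name}`);
  return wtimeoutMs === undefined ? { ...base } : { ...base, wtimeout: wtimeoutMs };
}
```

The result of `writeConcern("MAJORITY", 5000)` has the shape that, as far as I know, the 2.6 shell accepts as the `writeConcern` option, for example `db.test.insert(doc, { writeConcern: { w: "majority", wtimeout: 5000 } })`. The 5000 ms value is an arbitrary example.
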

  50. Write concern performance
    https://blog.serverdensity.com/mongodb-on-google-compute-engine-tips-and-benchmarks/
    3 x 1,000 inserts on GCE
    Local 10GB system disk
    Dedicated 200GB disk
    Dedicated 200GB for data and journal

  51. n1-standard-2

  52. n1-highmem-8

  53. Thanks! Questions? Now, later today, or @xeraa

  54. Backup Slides

  55. Oplog

  56. Replication via logs
    MongoDB: Operations log (Oplog)
    MySQL: Binary log (Binlog)

  57. Naive approach: transmit the original query, Statement-Based Replication (SBR)
    DELETE FROM test.table WHERE quantity > 20 LIMIT 1
    db.collection.remove({ quantity: { $gt: 20 }}, true) // justOne: true

  58. Unambiguous representation: Row-Based Replication (RBR), as in the Oplog
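Why the statement-based approach is ambiguous can be shown with two in-memory arrays. This is a toy simulation in plain JavaScript, not how either database stores data: the arrays stand in for a primary and a secondary that hold the same documents in a different physical order, and all names are hypothetical.

```javascript
// Statement-based: each node re-runs "remove ONE document matching the filter"
// against its own storage order.
function removeOneMatching(docs, predicate) {
  const index = docs.findIndex(predicate);
  if (index === -1) return { docs, removedId: null };
  return { docs: docs.filter((_, i) => i !== index), removedId: docs[index]._id };
}

// Row-based (oplog style): the primary logs WHICH document it removed,
// and the secondary removes exactly that _id.
function removeById(docs, id) {
  return docs.filter((d) => d._id !== id);
}

const primary = [{ _id: 1, quantity: 30 }, { _id: 2, quantity: 40 }];
const secondary = [{ _id: 2, quantity: 40 }, { _id: 1, quantity: 30 }];
const overTwenty = (d) => d.quantity > 20;

const onPrimary = removeOneMatching(primary, overTwenty);       // removes _id 1
const sbrSecondary = removeOneMatching(secondary, overTwenty);  // removes _id 2
const rbrSecondary = removeById(secondary, onPrimary.removedId);

const ids = (docs) => docs.map((d) => d._id).sort();
console.log(ids(onPrimary.docs), ids(sbrSecondary.docs), ids(rbrSecondary));
```

With statement-based replay the two nodes end up holding different documents; replaying the logged `_id` keeps them identical.
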

  59. MongoDB: asynchronous replication
    Secondaries can get the Oplog from their primary or from a secondary with more recent data

  60. Oplog size
    32bit: 48MB
    64bit OS X: 183MB
    64bit *nix, Windows: 1GB to 50GB (5% of free disk)

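The last rule (5% of free disk, bounded to 1GB through 50GB) is a clamp. A minimal sketch in JavaScript for the 64bit *nix/Windows case only; `defaultOplogSizeBytes` is a hypothetical name and the rounding is an assumption.

```javascript
const GB = 1024 ** 3;

// 5% of the free disk space, but never less than 1GB or more than 50GB.
function defaultOplogSizeBytes(freeDiskBytes) {
  const fivePercent = Math.floor(freeDiskBytes / 20);
  return Math.min(50 * GB, Math.max(1 * GB, fivePercent));
}
```

With 200GB free this gives 10GB; with 10GB free the lower bound of 1GB applies. The demo earlier in the talk overrides the default with `--oplogSize 20` (megabytes).
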
  61. Inner details

  62. Capped collection oplog.rs in the local database
    > use local
    > show collections
    me                0.000MB / 0.008MB
    oplog.rs          0.000MB / 20.000MB
    replset.minvalid  0.000MB / 0.008MB
    slaves            0.000MB / 0.008MB
    startup_log       0.003MB / 10.000MB
    system.indexes    0.001MB / 0.008MB
    system.replset    0.000MB / 0.008MB

  63. > db.oplog.rs.find()
    {
      "h": NumberLong("-265486071808715859"),
      "ns": "test.test",
      "o": {
        "_id": ObjectId("541a8ed285ea5f8ae059d530"),
        "name": "Dieter",
        "city": "Graz"
      },
      "op": "i",
      "ts": Timestamp(1411026642, 1),
      "v": 2
    }
    ...
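
The fields of such an entry can be decoded with a few lines. A sketch in plain JavaScript: `describeOplogEntry` is a hypothetical helper, and the `{ seconds, increment }` object is an assumed plain-object stand-in for the shell's `Timestamp(seconds, increment)` type.

```javascript
// Single-letter operation codes used in the "op" field.
const OP_NAMES = { i: "insert", u: "update", d: "delete", c: "command", n: "no-op" };

// Hypothetical helper: one-line summary of an oplog entry.
// ts.seconds is seconds since the Unix epoch; ns is "<database>.<collection>".
function describeOplogEntry(entry) {
  const when = new Date(entry.ts.seconds * 1000).toISOString();
  const opName = OP_NAMES[entry.op] || `unknown (${entry.op})`;
  return `${when} ${opName} on ${entry.ns}`;
}

console.log(
  describeOplogEntry({
    ns: "test.test",
    op: "i",
    ts: { seconds: 1411026642, increment: 1 },
  })
);
```

For the entry above, `"op": "i"` is an insert into `test.test`, and the `o` field carries the complete inserted document rather than the original statement.
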