Cloning the Cloud - Riak and Multi Data Center Replication (RICON2012)

Replicating data inside a Riak cluster and outside of it may sound like the same problem, but they aren't. Latency, bandwidth, security, and other factors make it a significantly different challenge. Basho has invested significant time and effort into building the masterless multi datacenter replication support that is part of Riak Enterprise. This talk will cover the problems, solutions, and evolution of Riak's Multi Data Center support as well as our plans for continued development and enhancement.

Basho Technologies

October 11, 2012

Transcript

1. About the author
• OpenACD/mod_erlang_event/gen_smtp
• egitd
• @ Basho since March 2011
• Lager
• Lead replication dev since November 2011
• Quickchecked poolboy

2. Why?
• Clusters span a LAN, replication spans the WAN
• Riak doesn’t like intra-cluster latency
• Riak doesn’t support rack-aware claim (yet)
• Rack distribution != datacenter distribution

3. KISS replication
• Post-commit hook fires on every PUT
• Relays the PUT object to a node in another cluster (see the sketch below)
• Replicated PUTs do not increment vector clocks, change last-modified, or fire post-commit hooks
• An open source implementation of this idea exists

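To make the hook-and-relay mechanism concrete, here is a minimal sketch of the idea, assuming a hypothetical registered relay process (repl_sink) on a node in the sink cluster; it is illustrative only and is not Basho's riak_repl implementation.

    %% Minimal sketch: a post-commit hook that forwards each written object
    %% to a relay process in another cluster. The module, process and node
    %% names are hypothetical.
    -module(repl_hook_sketch).
    -export([postcommit/1]).

    %% Riak calls post-commit hooks with the written object; the return value
    %% is ignored, so this is fire-and-forget.
    postcommit(RiakObject) ->
        {repl_sink, 'riak@cluster-b.example.com'} ! {replicate, RiakObject},
        ok.
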
4. Periodic Fullsync
• Every N hours, or when 2 clusters connect
• Compute a Merkle tree of all the keys in each vnode
• Send the Merkle tree to the sink cluster, compute the differences there, and send the difference list back to the source, which returns the source’s version of each value

5. Problems...
• How do you coordinate this? Need to elect a ‘leader’ to coordinate fullsync and proxy all realtime traffic through it as well
• Merkle trees based on couch_btree are not actually Merkle trees
• Mixing realtime and fullsync on the same connection causes back-pressure problems

6. More problems...
• Latency on the network can cause Erlang mailboxes to overflow with realtime messages
• Customers have already deployed this, so now we need to be backwards compatible

7. Solutions
• Add a bounded queue for realtime that will drop objects in an overflow situation (sketched below)
• Add ACKs to realtime messages so that we don’t fill the TCP send buffer
• Once the client receives a Merkle tree, convert it to a key/object-hash list, sort, and compare

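A minimal sketch of the bounded-queue idea from the first bullet, assuming a purely functional queue in which the oldest object is dropped on overflow; the shape is the point here, not riak_repl's actual queue.

    %% Sketch of a bounded realtime queue: append new objects and drop the
    %% oldest entry when the bound is exceeded instead of letting memory grow.
    -module(bounded_rtq_sketch).
    -export([new/1, push/2, pop/1]).

    new(Max) -> {Max, 0, queue:new()}.

    push(Obj, {Max, Len, Q}) when Len < Max ->
        {Max, Len + 1, queue:in(Obj, Q)};
    push(Obj, {Max, Max, Q0}) ->
        %% Overflow: drop the oldest entry, keep the newest.
        {_, Q1} = queue:out(Q0),
        {Max, Max, queue:in(Obj, Q1)}.

    pop({Max, Len, Q0}) ->
        case queue:out(Q0) of
            {{value, Obj}, Q1} -> {Obj, {Max, Len - 1, Q1}};
            {empty, _}         -> empty
        end.
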
8. More problems...
• After Riak 1.0, I was put in charge of replication
• Computing differences on the client and storing that list in RAM until you’re done is bad for large difference sets
• Computing differences on the sink requires more round trips

9. Even more problems
• Fullsync and realtime objects are sent the same way over the wire, greatly confusing the ACK strategy
• No fullsync back-pressure
• All replicated PUTs are done in a spawn()

10. Solutions
• Split code for fullsync and realtime and allow the fullsync behaviour to be negotiated at connection time
• Implement a new Merkle-free strategy that compares keylists on the source cluster and streams differences to the sink (sketched below)

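The compare step can be pictured as a merge of two sorted lists. The sketch below assumes both clusters have already folded their vnodes into sorted [{Key, Hash}] keylists; the real strategy streams its result rather than building a list, so this only illustrates the comparison.

    %% Sketch: walk two sorted key/hash lists in lock-step and collect every
    %% key the sink is missing or holds with a different object hash.
    -module(keylist_diff_sketch).
    -export([diff/2]).

    diff([], _Sink) ->
        [];
    diff(Source, []) ->
        %% Everything left on the source is missing from the sink.
        [K || {K, _H} <- Source];
    diff([{K, H} | Src], [{K, H} | Snk]) ->
        %% Same key, same hash: already in sync.
        diff(Src, Snk);
    diff([{K, _} | Src], [{K, _} | Snk]) ->
        %% Same key, different hash: the sink's copy is stale.
        [K | diff(Src, Snk)];
    diff([{SrcK, _} | Src], [{SnkK, _} | _] = Snk) when SrcK < SnkK ->
        %% Key exists only on the source.
        [SrcK | diff(Src, Snk)];
    diff(Source, [_ | Snk]) ->
        %% Key exists only on the sink; nothing to push from the source.
        diff(Source, Snk).
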
11. Solutions, continued
• Differentiate between realtime and fullsync objects
• Worker pool for replicated PUTs
• Fullsync back-pressure
• Riak 1.1.0

12. Achievements Unlocked
• SSL support
• NAT support
• More control over how buckets replicate
• Offload some work from the leader nodes
• Riak 1.2.0 & more engineers

13. Inevitably, more problems
• Blocking while waiting for ACKs is bad when network latency is high (> 60ms)
• Disk order != keylist order -> random disk seeks
• Enqueueing and dequeueing in the same process causes Erlang mailbox ordering problems

14. Solutions
• Instead of GETting keys during compare, put the differing keys in a bloom filter and then re-fold the vnode, so traversal happens in a better order (sketched below)
• Allow multiple ACKs for realtime/fullsync to be in flight at once
• Enqueue realtime in another process

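The bloom-filter trick can be sketched as follows. The bit-mask bloom filter below is a toy stand-in for a real bloom library, and fold_differences/3 assumes the vnode contents are available as a {Key, Object} listing in their natural fold order; none of these names come from riak_repl.

    %% Toy bloom filter over an integer bitmap, plus a fold that keeps only
    %% the objects whose keys are probably in the difference set, so the
    %% source can stream them without one random-access GET per key.
    -module(bloom_fold_sketch).
    -export([new/1, add/2, maybe_member/2, fold_differences/3]).

    new(Bits) -> {Bits, 0}.

    hash_bits(Key, Bits) ->
        [erlang:phash2({Seed, Key}, Bits) || Seed <- [1, 2, 3]].

    add(Key, {Bits, Bitmap}) ->
        {Bits, lists:foldl(fun(B, Acc) -> Acc bor (1 bsl B) end,
                           Bitmap, hash_bits(Key, Bits))}.

    maybe_member(Key, {Bits, Bitmap}) ->
        lists:all(fun(B) -> (Bitmap band (1 bsl B)) =/= 0 end,
                  hash_bits(Key, Bits)).

    %% KeyObjs is the vnode fold output in its natural order; SendFun is
    %% whatever ships an object to the sink.
    fold_differences(KeyObjs, Bloom, SendFun) ->
        [SendFun(Obj) || {Key, Obj} <- KeyObjs, maybe_member(Key, Bloom)].
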
15. Riak CS replication
• Different data model: blocks and manifests
• Replicating a manifest before all of its blocks is bad
• Want new files to be visible on other clusters quickly

16. And yet...
• Realtime replication hits a throughput wall; proxying through a single node on each cluster is a big bottleneck
• Building keylists for fullsync can take a long, long time on large vnodes, and they can’t be reused by other/later fullsyncs

17. Woe, continued
• Fullsync also proxies everything through the leader node AND on the same connection as realtime, making the bottleneck even worse
• Replication terminology/configuration is very confusing (listener, site, client, server)
• Realtime is prone to dropping things

18. Goals
• Build a new architecture for replication that scales to higher throughputs and larger keyspaces
• Make configuration simpler and more understandable
• Make replication more extensible and flexible for the future

19. BNW - Realtime
• Every node in cluster A connects to a node in cluster B and streams realtime updates
• Realtime queues are per-node, not per-connection, and are multi-consumer & bounded (sketched below)
• Queue contents are handed off on node shutdown, not lost

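A much-simplified sketch of the per-node, multi-consumer queue idea: objects get sequence numbers and each sink keeps its own cursor into the shared, bounded queue. The bound here simply drops the oldest entry, handoff on shutdown is not shown, and none of the names are riak_repl2_rtq's API.

    %% Sketch: one queue per node, shared by any number of sink connections,
    %% each tracked by its own cursor (last sequence number delivered).
    -module(rtq_sketch).
    -export([new/1, register_consumer/2, push/2, pull/2]).

    new(Max) ->
        #{max => Max, seq => 0, entries => [], cursors => #{}}.

    %% A new sink starts from the current end of the queue.
    register_consumer(Name, Q = #{seq := Seq, cursors := C}) ->
        Q#{cursors := maps:put(Name, Seq, C)}.

    push(Obj, Q = #{max := Max, seq := Seq, entries := Es}) ->
        Next = Seq + 1,
        Es1 = Es ++ [{Next, Obj}],
        %% Bound the queue: drop the oldest entry when over the limit.
        Es2 = case length(Es1) > Max of
                  true  -> tl(Es1);
                  false -> Es1
              end,
        Q#{seq := Next, entries := Es2}.

    %% Deliver the next undelivered object for one consumer and advance only
    %% that consumer's cursor.
    pull(Name, Q = #{entries := Es, cursors := C}) ->
        Cursor = maps:get(Name, C),
        case [E || {S, _} = E <- Es, S > Cursor] of
            [] ->
                empty;
            [{Seq, Obj} | _] ->
                {Obj, Q#{cursors := maps:put(Name, Seq, C)}}
        end.
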
20. BNW - Realtime, continued
• No need for ACKs/back-pressure; we can use TCP for congestion control because there are no other messages on the connection

21. BNW - Fullsync
• Fullsync is done over a node -> node connection for the nodes that own the vnode being synced
• Coordination/scheduling is done via a separate connection on the leader
• Use AAE trees for exchange, not keylists (much cheaper to build and compare; sketched below)

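A toy illustration of why hash-tree exchange is cheaper than keylists: matching hashes prune whole subtrees, so an in-sync partition costs a single comparison. The tree representation and function names are invented for the sketch (not riak_kv's hashtree API), and both trees are assumed to have the same shape.

    %% Toy hash-tree compare: an interior node is {Hash, Children}, a leaf is
    %% {Hash, leaf, [{Key, KeyHash}]}. Equal hashes prune the whole subtree.
    -module(aae_exchange_sketch).
    -export([compare/2]).

    compare({H, leaf, _}, {H, leaf, _}) ->
        %% Leaf hashes match: this segment is in sync.
        [];
    compare({_, leaf, SrcKeys}, {_, leaf, SnkKeys}) ->
        %% Leaf hashes differ: return source keys missing or stale on the sink.
        [K || {K, KH} <- SrcKeys, lists:keyfind(K, 1, SnkKeys) =/= {K, KH}];
    compare({H, _}, {H, _}) ->
        %% Interior hashes match: skip the entire subtree.
        [];
    compare({_, SrcChildren}, {_, SnkChildren}) ->
        %% Interior hashes differ: descend only into the children.
        lists:append([compare(S, D) || {S, D} <- lists:zip(SrcChildren, SnkChildren)]).
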
22. BNW - Configuration
• Use ‘source’ and ‘sink’ terminology consistently
• Auto-discover as much as possible to reduce configuration overhead
• Better per-connection and per-cluster reporting

23. Special Thanks
• Andy Gross
• Jon Meredith
• Chris Tilt
• Dave Parfitt
• Everyone at Basho
• Our patient replication customers