Cloning the Cloud - Riak and Multi Data Center Replication (RICON2012)

Replicating data inside a Riak cluster and outside of it may sound like the same problem, but they aren't. Latency, bandwidth, security, and other factors make it a significantly different challenge. Basho has invested significant time and effort into building the masterless multi datacenter replication support that is part of Riak Enterprise. This talk will cover the problems, solutions, and evolution of Riak's Multi Data Center support as well as our plans for continued development and enhancement.

Basho Technologies

October 11, 2012

Transcript

1. About the author
• OpenACD/mod_erlang_event/gen_smtp
• egitd
• @ Basho since March 2011
• Lager
• Lead replication dev since November 2011
• Quickchecked poolboy

2. Why?
• Clusters span a LAN, replication spans the WAN
• Riak doesn’t like intra-cluster latency
• Riak doesn’t support rack-aware claim (yet)
• Rack distribution != datacenter distribution

3. KISS replication
• Post-commit hook fires on every PUT
• Relays the PUT object to a node in another cluster (see the sketch below)
• Replicated PUTs do not increment vector clocks, change last-modified, or fire post-commit hooks
• An open source implementation of this idea exists

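To make the hook-and-relay mechanism concrete, here is a minimal sketch of the idea, assuming a hypothetical registered relay process (repl_sink) on a node in the sink cluster; it is illustrative only and is not Basho's riak_repl implementation.

    %% Minimal sketch: a post-commit hook that forwards each written object
    %% to a relay process in another cluster. The module, process and node
    %% names are hypothetical.
    -module(repl_hook_sketch).
    -export([postcommit/1]).

    %% Riak calls post-commit hooks with the written object; the return value
    %% is ignored, so this is fire-and-forget.
    postcommit(RiakObject) ->
        {repl_sink, 'riak@cluster-b.example.com'} ! {replicate, RiakObject},
        ok.
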
4. Periodic Fullsync
• Every N hours, or when 2 clusters connect
• Compute a Merkle tree of all the keys in each vnode
• Send the Merkle tree to the sink cluster, compute the differences there, and send the difference list back to the source, which returns the source’s version of each value

5. Problems...
• How do you coordinate this? Need to elect a ‘leader’ to coordinate fullsync and proxy all realtime traffic through it as well
• Merkle trees based on couch_btree are not actually Merkle trees
• Mixing realtime and fullsync on the same connection causes back-pressure problems

6. More problems...
• Latency on the network can cause Erlang mailboxes to overflow with realtime messages
• Customers have already deployed this, so now we need to be backwards compatible

7. Solutions
• Add a bounded queue for realtime that will drop objects in an overflow situation (sketched below)
• Add ACKs to realtime messages so that we don’t fill the TCP send buffer
• Once the client receives a Merkle tree, convert it to a key/object-hash list, sort, and compare

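A minimal sketch of the bounded-queue idea from the first bullet, assuming a purely functional queue in which the oldest object is dropped on overflow; the shape is the point here, not riak_repl's actual queue.

    %% Sketch of a bounded realtime queue: append new objects and drop the
    %% oldest entry when the bound is exceeded instead of letting memory grow.
    -module(bounded_rtq_sketch).
    -export([new/1, push/2, pop/1]).

    new(Max) -> {Max, 0, queue:new()}.

    push(Obj, {Max, Len, Q}) when Len < Max ->
        {Max, Len + 1, queue:in(Obj, Q)};
    push(Obj, {Max, Max, Q0}) ->
        %% Overflow: drop the oldest entry, keep the newest.
        {_, Q1} = queue:out(Q0),
        {Max, Max, queue:in(Obj, Q1)}.

    pop({Max, Len, Q0}) ->
        case queue:out(Q0) of
            {{value, Obj}, Q1} -> {Obj, {Max, Len - 1, Q1}};
            {empty, _}         -> empty
        end.
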
8. More problems...
• After Riak 1.0, I was put in charge of replication
• Computing differences on the client and storing that list in RAM until you’re done is bad for large difference sets
• Computing differences on the sink requires more round trips

9. Even more problems
• Fullsync and realtime objects are sent the same way over the wire, greatly confusing the ACK strategy
• No fullsync back-pressure
• All replicated PUTs are done in a spawn()

10. Solutions
• Split code for fullsync and realtime and allow the fullsync behaviour to be negotiated at connection time
• Implement a new Merkle-free strategy that compares keylists on the source cluster and streams differences to the sink (sketched below)

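The compare step can be pictured as a merge of two sorted lists. The sketch below assumes both clusters have already folded their vnodes into sorted [{Key, Hash}] keylists; the real strategy streams its result rather than building a list, so this only illustrates the comparison.

    %% Sketch: walk two sorted key/hash lists in lock-step and collect every
    %% key the sink is missing or holds with a different object hash.
    -module(keylist_diff_sketch).
    -export([diff/2]).

    diff([], _Sink) ->
        [];
    diff(Source, []) ->
        %% Everything left on the source is missing from the sink.
        [K || {K, _H} <- Source];
    diff([{K, H} | Src], [{K, H} | Snk]) ->
        %% Same key, same hash: already in sync.
        diff(Src, Snk);
    diff([{K, _} | Src], [{K, _} | Snk]) ->
        %% Same key, different hash: the sink's copy is stale.
        [K | diff(Src, Snk)];
    diff([{SrcK, _} | Src], [{SnkK, _} | _] = Snk) when SrcK < SnkK ->
        %% Key exists only on the source.
        [SrcK | diff(Src, Snk)];
    diff(Source, [_ | Snk]) ->
        %% Key exists only on the sink; nothing to push from the source.
        diff(Source, Snk).
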
11. Solutions, continued
• Differentiate between realtime and fullsync objects
• Worker pool for replicated PUTs
• Fullsync back-pressure
• Riak 1.1.0

12. Achievements Unlocked
• SSL support
• NAT support
• More control over how buckets replicate
• Offload some work from the leader nodes
• Riak 1.2.0 & more engineers

13. Inevitably, more problems
• Blocking while waiting for ACKs is bad when network latency is high (> 60ms)
• Disk order != keylist order -> random disk seeks
• Enqueueing and dequeueing in the same process causes Erlang mailbox ordering problems

14. Solutions
• Instead of GETting keys during compare, put the differing keys in a bloom filter and then re-fold the vnode, so traversal happens in a better order (sketched below)
• Allow multiple ACKs for realtime/fullsync to be in flight at once
• Enqueue realtime in another process

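The bloom-filter trick can be sketched as follows. The bit-mask bloom filter below is a toy stand-in for a real bloom library, and fold_differences/3 assumes the vnode contents are available as a {Key, Object} listing in their natural fold order; none of these names come from riak_repl.

    %% Toy bloom filter over an integer bitmap, plus a fold that keeps only
    %% the objects whose keys are probably in the difference set, so the
    %% source can stream them without one random-access GET per key.
    -module(bloom_fold_sketch).
    -export([new/1, add/2, maybe_member/2, fold_differences/3]).

    new(Bits) -> {Bits, 0}.

    hash_bits(Key, Bits) ->
        [erlang:phash2({Seed, Key}, Bits) || Seed <- [1, 2, 3]].

    add(Key, {Bits, Bitmap}) ->
        {Bits, lists:foldl(fun(B, Acc) -> Acc bor (1 bsl B) end,
                           Bitmap, hash_bits(Key, Bits))}.

    maybe_member(Key, {Bits, Bitmap}) ->
        lists:all(fun(B) -> (Bitmap band (1 bsl B)) =/= 0 end,
                  hash_bits(Key, Bits)).

    %% KeyObjs is the vnode fold output in its natural order; SendFun is
    %% whatever ships an object to the sink.
    fold_differences(KeyObjs, Bloom, SendFun) ->
        [SendFun(Obj) || {Key, Obj} <- KeyObjs, maybe_member(Key, Bloom)].
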
15. Riak CS replication
• Different data model: blocks and manifests
• Replicating a manifest before all of its blocks is bad
• Want new files to be visible on other clusters quickly

16. And yet...
• Realtime replication hits a throughput wall; proxying through a single node on each cluster is a big bottleneck
• Building keylists for fullsync can take a long, long time on large vnodes, and they can’t be reused by other/later fullsyncs

17. Woe, continued
• Fullsync also proxies everything through the leader node AND on the same connection as realtime, making the bottleneck even worse
• Replication terminology/configuration is very confusing (listener, site, client, server)
• Realtime is prone to dropping things

18. Goals
• Build a new architecture for replication that scales to higher throughputs and larger keyspaces
• Make configuration simpler and more understandable
• Make replication more extensible and flexible for the future

19. BNW - Realtime
• Every node in cluster A connects to a node in cluster B and streams realtime updates
• Realtime queues are per-node, not per-connection, and are multi-consumer & bounded (sketched below)
• Queue contents are handed off on node shutdown, not lost

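A much-simplified sketch of the per-node, multi-consumer queue idea: objects get sequence numbers and each sink keeps its own cursor into the shared, bounded queue. The bound here simply drops the oldest entry, handoff on shutdown is not shown, and none of the names are riak_repl2_rtq's API.

    %% Sketch: one queue per node, shared by any number of sink connections,
    %% each tracked by its own cursor (last sequence number delivered).
    -module(rtq_sketch).
    -export([new/1, register_consumer/2, push/2, pull/2]).

    new(Max) ->
        #{max => Max, seq => 0, entries => [], cursors => #{}}.

    %% A new sink starts from the current end of the queue.
    register_consumer(Name, Q = #{seq := Seq, cursors := C}) ->
        Q#{cursors := maps:put(Name, Seq, C)}.

    push(Obj, Q = #{max := Max, seq := Seq, entries := Es}) ->
        Next = Seq + 1,
        Es1 = Es ++ [{Next, Obj}],
        %% Bound the queue: drop the oldest entry when over the limit.
        Es2 = case length(Es1) > Max of
                  true  -> tl(Es1);
                  false -> Es1
              end,
        Q#{seq := Next, entries := Es2}.

    %% Deliver the next undelivered object for one consumer and advance only
    %% that consumer's cursor.
    pull(Name, Q = #{entries := Es, cursors := C}) ->
        Cursor = maps:get(Name, C),
        case [E || {S, _} = E <- Es, S > Cursor] of
            [] ->
                empty;
            [{Seq, Obj} | _] ->
                {Obj, Q#{cursors := maps:put(Name, Seq, C)}}
        end.
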
20. BNW - Realtime, continued
• No need for ACKs/back-pressure; we can use TCP for congestion control because there are no other messages on the connection

21. BNW - Fullsync
• Fullsync is done over a node -> node connection for the nodes that own the vnode being synced
• Coordination/scheduling is done via a separate connection on the leader
• Use AAE trees for exchange, not keylists (much cheaper to build and compare; sketched below)

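A toy illustration of why hash-tree exchange is cheaper than keylists: matching hashes prune whole subtrees, so an in-sync partition costs a single comparison. The tree representation and function names are invented for the sketch (not riak_kv's hashtree API), and both trees are assumed to have the same shape.

    %% Toy hash-tree compare: an interior node is {Hash, Children}, a leaf is
    %% {Hash, leaf, [{Key, KeyHash}]}. Equal hashes prune the whole subtree.
    -module(aae_exchange_sketch).
    -export([compare/2]).

    compare({H, leaf, _}, {H, leaf, _}) ->
        %% Leaf hashes match: this segment is in sync.
        [];
    compare({_, leaf, SrcKeys}, {_, leaf, SnkKeys}) ->
        %% Leaf hashes differ: return source keys missing or stale on the sink.
        [K || {K, KH} <- SrcKeys, lists:keyfind(K, 1, SnkKeys) =/= {K, KH}];
    compare({H, _}, {H, _}) ->
        %% Interior hashes match: skip the entire subtree.
        [];
    compare({_, SrcChildren}, {_, SnkChildren}) ->
        %% Interior hashes differ: descend only into the children.
        lists:append([compare(S, D) || {S, D} <- lists:zip(SrcChildren, SnkChildren)]).
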
22. BNW - Configuration
• Use ‘source’ and ‘sink’ terminology consistently
• Auto-discover as much as possible to reduce configuration overhead
• Better per-connection and per-cluster reporting

23. Special Thanks
• Andy Gross
• Jon Meredith
• Chris Tilt
• Dave Parfitt
• Everyone at Basho
• Our patient replication customers