Slide 1

Riak Replication
October 12th, 2012
Andrew Thompson
[email protected]
github.com/Vagabond

Slide 2

About the author
• OpenACD/mod_erlang_event/gen_smtp
• egitd
• @ Basho since March 2011
• Lager
• Lead replication dev since November 2011
• QuickChecked poolboy

Slide 3

Not a sales talk

Slide 4

Why?
• Clusters span a LAN, replication spans the WAN
• Riak doesn’t like intra-cluster latency
• Riak doesn’t support rack-aware claim (yet)
• Rack distribution != datacenter distribution

Slide 5

Replication is...
• Unidirectional or bidirectional
• Realtime and/or fullsync
• Eventually, eventually consistent
• Closed source, part of EDS

Slide 6

The evolution of replication

Slide 7

KISS replication
• Post-commit hook fires on every PUT
• Relays the PUT object to a node in another cluster
• Replicated PUTs do not increment vector clocks, change last-modified, or fire post-commit hooks
• An open source implementation of this idea exists
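
To make the mechanism concrete, here is a minimal sketch of such a relaying hook; the module, function, and node names are assumptions for illustration, not the actual riak_repl code.

    %% Hypothetical post-commit relay; names are illustrative only.
    -module(repl_hook_sketch).
    -export([postcommit/1]).

    %% Riak invokes post-commit hooks with the object that was just written.
    postcommit(Object) ->
        %% A node in the other cluster, statically configured in this sketch.
        SinkNode = 'riak@sink1.example.com',
        %% Fire-and-forget relay. The sink must treat this as a replicated
        %% PUT (no vclock increment, no last-modified change, no hooks),
        %% otherwise the clusters would relay the object back and forth.
        rpc:cast(SinkNode, repl_sink_sketch, put_replicated, [Object]),
        ok.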

Slide 8

Problems...
• Network glitches drop objects
• Node -> node connections statically configured
• ‘Offline’ PUTs also dropped

Slide 9

Periodic Fullsync
• Every N hours, or when 2 clusters connect
• Compute a Merkle tree of all the keys in each vnode
• Send the Merkle tree to the sink cluster, compute the differences there, and send the difference list back to the source, which returns the source’s version of each value
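
As a rough illustration of what each side ends up comparing per vnode, here is a hedged sketch that reduces a vnode fold to a key/hash list; the real code of this era built a couch_btree-based tree, and the shapes and names below are assumptions.

    %% Sketch only: reduce one vnode's data to {Key, Hash} pairs so the two
    %% clusters can compare what they hold.
    -module(fullsync_hash_sketch).
    -export([vnode_key_hashes/1]).

    vnode_key_hashes(FoldedObjects) ->
        %% FoldedObjects is assumed to be the result of folding a vnode's
        %% backend: a list of {{Bucket, Key}, Value} pairs.
        [{BK, erlang:phash2(Value)} || {BK, Value} <- FoldedObjects].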

Slide 10

Problems...
• How do you coordinate this? Need to elect a ‘leader’ to coordinate fullsync, and to proxy all realtime through it as well
• Merkle trees based on couch_btree are not Merkle trees
• Mixing realtime and fullsync on the same connection causes back-pressure problems

Slide 11

More problems...
• Latency on the network can cause Erlang mailboxes to overflow with realtime messages
• Customers have already deployed this, so now we need to be backwards compatible

Slide 12

Realtime

Slide 13

Fullsync

Slide 14

Solutions
• Add a bounded queue for realtime that will drop objects in an overflow situation
• Add ACKs to realtime messages so that we don’t fill the TCP send buffer
• Once the client receives a Merkle tree, convert it to a key/object-hash list, sort and compare
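
A hedged sketch of that sort-and-compare step, assuming both sides have already been reduced to {Key, Hash} lists (module and function names are illustrative):

    %% Compare two key/hash lists and collect the keys the sink is missing
    %% or holds with a different hash. Sketch only; keys present solely on
    %% the sink (deletes) are ignored here.
    -module(keylist_diff_sketch).
    -export([diff/2]).

    diff(SourceList, SinkList) ->
        diff_sorted(lists:sort(SourceList), lists:sort(SinkList), []).

    diff_sorted([], _Sink, Acc) ->
        lists:reverse(Acc);
    diff_sorted([{K, _H} | Src], [], Acc) ->
        %% Key exists only on the source: the sink needs it.
        diff_sorted(Src, [], [K | Acc]);
    diff_sorted([{K, H} | Src], [{K, H} | Snk], Acc) ->
        %% Same key, same hash: already in sync.
        diff_sorted(Src, Snk, Acc);
    diff_sorted([{K, _} | Src], [{K, _} | Snk], Acc) ->
        %% Same key, different hash: the sink's copy is stale.
        diff_sorted(Src, Snk, [K | Acc]);
    diff_sorted([{K1, _} | _] = Src, [{K2, _} | Snk], Acc) when K1 > K2 ->
        %% The sink has a key the source lacks; skip it in this sketch.
        diff_sorted(Src, Snk, Acc);
    diff_sorted([{K, _H} | Src], Snk, Acc) ->
        %% Source key not yet present on the sink.
        diff_sorted(Src, Snk, [K | Acc]).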

Slide 15

More problems...
• After Riak 1.0, I was put in charge of replication
• Computing differences on the client and storing that list in RAM until you’re done is bad for large difference sets
• Computing differences on the sink means an extra round trip

Slide 16

Even more problems
• Fullsync and realtime objects are sent the same way over the wire, greatly confusing the ACK strategy
• No fullsync back-pressure
• All replicated PUTs are done in a spawn()

Slide 17

Solutions
• Split code for fullsync and realtime and allow the fullsync behaviour to be negotiated at connection time
• Implement a new Merkle-free strategy that compares keylists on the source cluster and streams differences to the sink

Slide 18

Solutions, continued
• Differentiate between realtime and fullsync objects
• Worker pool for replicated PUTs (sketched below)
• Fullsync back-pressure
• Riak 1.1.0
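
A minimal sketch of pushing replicated PUTs through a fixed-size pool instead of a spawn() per object, using poolboy; the pool name and the worker's call protocol are assumptions for illustration.

    %% Bound concurrent replicated PUTs with a worker pool rather than
    %% spawning a process per object. Pool and worker names are hypothetical.
    -module(repl_put_pool_sketch).
    -export([put_replicated/1]).

    put_replicated(Object) ->
        %% poolboy:transaction/2 checks a worker out, runs the fun, and
        %% checks it back in; the pool size caps how many PUTs run at once.
        poolboy:transaction(repl_put_pool,
                            fun(Worker) ->
                                    gen_server:call(Worker, {put, Object})
                            end).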

Slide 19

Achievements Unlocked
• SSL support
• NAT support
• More control over how buckets replicate
• Offload some work from the leader nodes
• Riak 1.2.0 & more engineers

Slide 20

Inevitably, more problems
• Blocking while waiting for ACKs is bad when network latency is high (> 60ms)
• Disk order != keylist order -> random disk seeks
• Enqueueing and dequeueing in the same process causes Erlang mailbox ordering problems

Slide 21

Solutions
• Instead of GETting keys during the compare, put the differing keys in a bloom filter and then re-fold the vnode - the traversal is in a better order (sketched below)
• Allow multiple ACKs for realtime/fullsync to be in flight at once
• Enqueue realtime in another process
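
A hedged sketch of that fold: rather than issuing a random GET per differing key, remember the keys and stream matching objects out of a sequential re-fold of the vnode. A plain set stands in for the bloom filter here, and all names are assumptions.

    %% Membership-filtered re-fold of a vnode. The real approach uses a
    %% compact bloom filter (tiny memory footprint, rare false positives);
    %% a set keeps this sketch short.
    -module(bloom_fold_sketch).
    -export([send_differences/2]).

    send_differences(DiffKeys, FoldedObjects) ->
        Filter = sets:from_list(DiffKeys),
        %% FoldedObjects is assumed to be {{Bucket, Key}, Object} pairs in
        %% whatever order the backend returns them (cheap sequential reads)
        %% rather than the sorted keylist order that caused random seeks.
        lists:foreach(
          fun({BK, Obj}) ->
                  case sets:is_element(BK, Filter) of
                      true  -> send_to_sink(Obj);
                      false -> ok
                  end
          end, FoldedObjects).

    send_to_sink(_Obj) ->
        %% Placeholder for streaming the object to the sink cluster.
        ok.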

Slide 22

Riak CS replication
• Different data model: blocks and manifests
• Replicating a manifest before all of its blocks is bad
• Want new files to be visible on other clusters quickly

Slide 23

Solutions
• Realtime replicate only manifests, not blocks
• Fullsync blocks
• Add ‘proxy get’ to fetch missing blocks on-demand
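
A minimal sketch of the ‘proxy get’ idea under hypothetical helper names: serve the block locally if it has already arrived, otherwise fetch it from the source cluster on demand.

    %% Proxy get for CS blocks. local_get/2 and source_cluster_get/2 are
    %% hypothetical placeholders for a local read and a request over the
    %% replication connection to the source cluster.
    -module(proxy_get_sketch).
    -export([get_block/2]).

    get_block(Bucket, Key) ->
        case local_get(Bucket, Key) of
            {ok, Block} ->
                {ok, Block};
            {error, notfound} ->
                %% Block not replicated yet (manifests arrive via realtime
                %% before their blocks), so pull it from the source cluster.
                source_cluster_get(Bucket, Key)
        end.

    %% Stubs so the sketch compiles; a real system would read the local
    %% backend and call across the repl connection respectively.
    local_get(_Bucket, _Key) -> {error, notfound}.
    source_cluster_get(_Bucket, _Key) -> {error, notfound}.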

Slide 24

And yet...
• Realtime replication hits a throughput wall; proxying through a single node on each cluster is a big bottleneck
• Building keylists for fullsync can take a long, long time on large vnodes, and they can’t be reused by other/later fullsyncs

Slide 25

Woe, continued
• Fullsync also proxies everything through the leader node, AND on the same connection as realtime, which makes the bottleneck even worse
• Replication terminology/configuration is very confusing (listener, site, client, server)
• Realtime is prone to dropping things

Slide 26

Brave new world
Replication’s not-so-dystopian future

Slide 27

Goals
• Build a new architecture for replication that scales to higher throughputs and larger keyspaces
• Make configuration simpler and more understandable
• Make replication more extensible and flexible for the future

Slide 28

BNW - Realtime
• Every node in cluster A connects to a node in cluster B and streams realtime updates
• Realtime queues are per-node, not per-connection, and are multi-consumer & bounded (sketched below)
• Queue contents are handed off on node shutdown, not lost
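
A hedged sketch of the bounded per-node queue idea; the record shape and the drop-oldest-on-overflow policy are illustrative assumptions, not the actual realtime queue implementation.

    %% Per-node bounded realtime queue: overflow drops the oldest entry so
    %% the queue (not the Erlang mailbox) absorbs bursts, and multiple sink
    %% connections can consume from the same queue.
    -module(rtq_sketch).
    -export([new/1, push/2, pop/1]).

    -record(rtq, {max, len = 0, q}).

    new(Max) -> #rtq{max = Max, q = queue:new()}.

    %% Enqueue an object; when full, drop the oldest entry to stay bounded.
    push(Obj, #rtq{max = Max, len = Len, q = Q} = RTQ) when Len >= Max ->
        {_, Q1} = queue:out(Q),
        RTQ#rtq{q = queue:in(Obj, Q1)};
    push(Obj, #rtq{len = Len, q = Q} = RTQ) ->
        RTQ#rtq{len = Len + 1, q = queue:in(Obj, Q)}.

    %% Dequeue for one of possibly several consumers.
    pop(#rtq{len = 0} = RTQ) ->
        {empty, RTQ};
    pop(#rtq{len = Len, q = Q} = RTQ) ->
        {{value, Obj}, Q1} = queue:out(Q),
        {{ok, Obj}, RTQ#rtq{len = Len - 1, q = Q1}}.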

Slide 29

BNW - Realtime, continued
• No need for ACKs/back-pressure; we can use TCP for congestion control because there are no other messages on the connection

Slide 30

Before

Slide 31

After

Slide 32

BNW - Fullsync
• Fullsync is done over a node -> node connection for the nodes that own the vnode being synced
• Coordination/scheduling is done via a separate connection on the leader
• Use AAE trees for the exchange, not keylists (much cheaper to build and compare)
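
A hedged sketch of why hash-tree exchanges are cheaper: matching subtree hashes are skipped wholesale, so the work scales with the number of differences rather than with the keyspace. The tree shape below is a simplified assumption, not Riak's actual hashtree code, and it assumes both trees have the same structure.

    %% A tree here is either {leaf, [{Key, Hash}]} or
    %% {node, SubtreeHash, [ChildTree]}. One-directional: keys present
    %% only on the sink are ignored in this sketch.
    -module(aae_exchange_sketch).
    -export([diff/2]).

    diff({node, H, _}, {node, H, _}) ->
        %% Identical subtree hashes: nothing below this point differs.
        [];
    diff({node, _, SourceChildren}, {node, _, SinkChildren}) ->
        %% Hashes differ: descend only into the children, pairwise.
        lists:append([diff(S, K) || {S, K} <- lists:zip(SourceChildren, SinkChildren)]);
    diff({leaf, SourceKeys}, {leaf, SinkKeys}) ->
        %% At the leaves, compare the key hashes directly.
        [Key || {Key, Hash} <- SourceKeys,
                proplists:get_value(Key, SinkKeys) =/= Hash].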

Slide 33

Before

Slide 34

After

Slide 35

BNW - Configuration
• Use ‘source’ and ‘sink’ terminology consistently
• Auto-discover as much as possible to reduce configuration overhead
• Better per-connection and per-cluster reporting

Slide 36

Tomorrow’s World
• Strong consistency support
• Generic ‘proxy get’, not just for CS
• Replicate between any ring sizes

Slide 37

Special Thanks
• Andy Gross
• Jon Meredith
• Chris Tilt
• Dave Parfitt
• Everyone at Basho
• Our patient replication customers

Slide 38

Questions?