Slide 1

Riak & Distributed Systems Tradeoffs

Mark Phillips
Dir., Community and Developer Evangelism
Tuesday, June 26, 2012

Slide 2

Engineering is about tradeoffs.

Slide 3

Examples

• Compilers trade space for speed when inlining code
• Image and audio compression trade CPU/fidelity loss for space
• Databases trade consistency for availability in failure scenarios

Slide 4

"FooDB: The end of the tradeoff game!" - Someone lying to you.

"DERPBase beats the CAP Theorem!" - Someone lying to you.

Slide 5

Consistency, Availability, Latency

Slide 6

CAP

• The fundamental, most-discussed tradeoff
• When a network partition (message loss) occurs, laws of physics make you choose:
  • Consistency, OR
  • Availability
• No system can "beat the CAP theorem"

Slide 7

PACELC

• Nuance added by Daniel Abadi
• When Partitioned, trade off Availability vs. Consistency
• Else, trade off Latency vs. Consistency

Slide 8

Data Distribution
Spread vs. Locality

Slide 9

• Location of data is determined by a hash of the key
• Provides even distribution of storage and query load
• Trades off the advantages gained from locality:
  • range queries
  • aggregates
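The hash-based placement described above can be sketched in a few lines of Python. The partition count and the way the bucket/key pair is hashed are illustrative assumptions, not Riak's exact implementation:

```python
import hashlib

RING_SIZE = 2 ** 160      # a ring keyed by 160-bit SHA-1 values
NUM_PARTITIONS = 64       # assumed ring size; a power of two

def key_to_partition(bucket: str, key: str) -> int:
    """Hash (bucket, key) onto the ring and return the owning partition index."""
    digest = hashlib.sha1(f"{bucket}/{key}".encode()).digest()
    position = int.from_bytes(digest, "big")
    partition_size = RING_SIZE // NUM_PARTITIONS
    return position // partition_size

# Keys scatter evenly across partitions regardless of any natural ordering,
# which is exactly why range queries over adjacent keys lose locality.
print(key_to_partition("users", "alice"))
print(key_to_partition("users", "alicf"))  # adjacent key, usually a distant partition
```

Because placement depends only on the hash, no central directory is needed to find a key, but neighboring keys land on unrelated partitions.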

Slide 10

Consistent Hashing

Slide 11

Virtual Nodes

• The unit of addressing and concurrency in Riak
• Each host manages many vnodes
• Riak *could* manage all host-local storage as a unit and gain efficiency, but would lose:
  • simplicity in cluster resizing
  • failure isolation
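A toy sketch of the vnode idea, assuming a simple round-robin claim function (Riak's actual claim algorithm over the consistent-hash ring is more involved):

```python
from collections import Counter
from itertools import cycle

def assign_vnodes(hosts, num_partitions=64):
    """Claim partitions round-robin so each physical host manages many vnodes."""
    owners = {}
    host_cycle = cycle(hosts)
    for partition in range(num_partitions):
        owners[partition] = next(host_cycle)
    return owners

owners = assign_vnodes(["node1", "node2", "node3"])
# Each host ends up with roughly 64/3 vnodes. Resizing the cluster means
# handing off individual vnodes, not splitting one monolithic storage unit,
# and a failed host takes out many small vnodes rather than one big store.
print(Counter(owners.values()))
```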

Slide 12

Append-Only Stores, Bitcask
Crash safety and speed vs. periodic I/O spikes

Slide 13

Append-Only Stores

• All writes are appends to a file
• This provides crash safety and fast writes
• Tradeoff: files must periodically be compacted/merged to reclaim space
• Compaction causes periodic pauses that must be masked or mitigated
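The append/compact cycle above can be sketched with an in-memory toy (real stores append to disk files; the class and method names here are invented for illustration):

```python
class AppendOnlyStore:
    """Toy append-only store: puts always append, compaction rewrites live data."""

    def __init__(self):
        self.log = []     # (key, value) entries, append-only
        self.index = {}   # key -> offset of the latest entry for that key

    def put(self, key, value):
        self.index[key] = len(self.log)  # on real disk: append, then update index
        self.log.append((key, value))

    def get(self, key):
        return self.log[self.index[key]][1]

    def compact(self):
        """Drop superseded entries; on disk this pass is the periodic I/O spike."""
        live = [(k, self.log[off][1]) for k, off in self.index.items()]
        self.log, self.index = [], {}
        for k, v in live:
            self.put(k, v)

store = AppendOnlyStore()
for i in range(3):
    store.put("k", i)                   # three versions of the same key
print(len(store.log))                   # 3 entries before compaction
store.compact()
print(len(store.log), store.get("k"))   # 1 entry; the latest value survives
```

Writes never seek, which is what makes them fast and crash-safe, but dead versions accumulate until a compaction pass pays the deferred I/O cost.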

Slide 14

Bitcask

After the append completes, an in-memory structure called a "keydir" is updated. A keydir is simply a hash table that maps every key in a Bitcask to a fixed-size structure giving the file, offset, and size of the most recently written entry for that key. When a write occurs, the keydir is atomically updated with the location of the newest data. The old data is still present on disk, but any new reads will use the latest version available in the keydir. As we'll see later, the merge process will eventually remove the old value.

Reading a value is simple, and doesn't ever require more than a single disk seek. We look up the key in our keydir, and from there we read the data using the file id, position, and size that are returned from that lookup. In many cases, the operating system's filesystem read-ahead cache makes this a much faster operation than would be otherwise expected.

Tradeoff: the index must fit in memory.

Low latency:
• All reads = hash lookup + 1 seek
• All writes = append to a file
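The keydir mechanics quoted above can be sketched like this, assuming a single data file and raw values with no record headers (real Bitcask records also carry a CRC, timestamp, and key):

```python
import os
import tempfile

class TinyBitcask:
    """Toy Bitcask: append records to one data file, keep (offset, size) in RAM."""

    def __init__(self, path):
        self.f = open(path, "ab+")
        self.keydir = {}                       # key -> (offset, size); must fit in memory

    def put(self, key: bytes, value: bytes):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(value)                    # the only disk write is an append
        self.f.flush()
        self.keydir[key] = (offset, len(value))  # point all new reads at the newest data

    def get(self, key: bytes) -> bytes:
        offset, size = self.keydir[key]        # hash lookup...
        self.f.seek(offset)                    # ...plus a single seek
        return self.f.read(size)

fd, path = tempfile.mkstemp()
os.close(fd)
store = TinyBitcask(path)
store.put(b"user", b"v1")
store.put(b"user", b"v2")          # old bytes stay on disk until a merge
print(store.get(b"user"))          # b'v2'
print(os.path.getsize(path))       # 4: both versions still occupy space
```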


Slide 16

Handoff and Rebalancing

• When nodes are added to a cluster, data must be rebalanced
• Rebalancing causes disk and network load
• Tradeoff: speed of convergence vs. effects on cluster performance

Slide 17

Vector Clocks

• Provide happened-before relationship between events
• Riak tags each object with vector clock
• Tradeoff: space, speed, complexity for safety
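A minimal vector-clock sketch (dict-based; node names are invented) showing the happened-before check and why concurrent updates surface as conflicts to resolve:

```python
def increment(clock, node):
    """Record an event at `node`; clocks are maps of node -> counter."""
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def descends(a, b):
    """True if clock `a` happened after (or equals) clock `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def merge(a, b):
    """Pointwise max: the smallest clock that descends from both inputs."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

v1 = increment({}, "nodeA")          # {'nodeA': 1}
v2 = increment(v1, "nodeB")          # descends from v1
sibling = increment(v1, "nodeC")     # concurrent with v2
print(descends(v2, v1))              # True: v2 supersedes v1
print(descends(v2, sibling), descends(sibling, v2))  # False False: a true conflict
```

The per-object clock is the "space, speed, complexity" cost; the safety payoff is that concurrent writes are detected rather than silently overwritten.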

Slide 18

Gossip Protocol

• Nodes "gossip" their view of cluster state to each other
• Tradeoffs:
  • atomic modifications of cluster state for no SPOF
  • complexity for fault tolerance
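A toy push-gossip sketch, assuming a simple version counter on the cluster state (Riak's actual ring gossip is richer than this):

```python
import random

def gossip_round(nodes):
    """One round: each node exchanges state with a random peer; newer version wins."""
    for name in list(nodes):
        peer = random.choice([n for n in nodes if n != name])
        a, b = nodes[name], nodes[peer]
        winner = a if a["version"] >= b["version"] else b
        nodes[name] = nodes[peer] = dict(winner)

random.seed(0)
nodes = {f"n{i}": {"version": 0, "members": ["n0"]} for i in range(5)}
nodes["n0"] = {"version": 1, "members": ["n0", "n5"]}  # n0 learns of a new member

rounds = 0
while any(v["version"] < 1 for v in nodes.values()):
    gossip_round(nodes)
    rounds += 1
print(rounds)  # converges after a few rounds, with no coordinator and so no SPOF
```

Any node can be the source of an update and any subset can fail, but there is no single moment when the whole cluster atomically agrees, which is the tradeoff named above.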

Slide 19

Sane Defaults

• Speed vs. Safety
• Riak ships with N=3, R=W=2
  • Bad for microbenchmarks; good for production use and durability
• Mongo ships with W=0
  • Good for benchmarks; horrible and insane for durability and production use
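A sketch of why N=3, R=W=2 is a safe default: with R + W > N, every read quorum overlaps every write quorum, so a read always contacts at least one replica holding the latest write. The replica list and versioning here are invented for illustration:

```python
N, R, W = 3, 2, 2   # the shipped Riak defaults

def write(replicas, key, value, version):
    """Succeed once W replicas acknowledge (all N are up in this sketch)."""
    acks = 0
    for replica in replicas[:N]:
        replica[key] = (version, value)
        acks += 1
        if acks >= W:
            return True
    return False

def read(replicas, key):
    """Collect R responses and return the highest-versioned value seen."""
    responses = [replica[key] for replica in replicas[:N] if key in replica][:R]
    if len(responses) < R:
        raise RuntimeError("read quorum not met")
    return max(responses)[1]

replicas = [{}, {}, {}]
write(replicas, "k", "v1", version=1)
print(read(replicas, "k"))   # 'v1': R + W > N forces the quorums to intersect
```

A W=0 default, by contrast, returns success before any replica has durably stored the write, which is exactly the "good for benchmarks" failure mode the slide calls out.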

Slide 20

Erlang

• Best language ever:
  • for distributed systems glue code
  • for safety and fault tolerance
• Sometimes you want:
  • destructive operations
  • shared memory

Slide 21

NIFs to the rescue?

• Use NIFs for speed and for interfacing with native code, but:
  • you make the Erlang VM only as reliable as your C code
  • NIFs block the scheduler

Slide 22

General Tradeoffs

We don't rush to add new features, even popular ones that everyone wants, until we know they won't force us to screw up one of our fundamental tradeoffs!

Slide 23

Thanks!

http://www.basho.com
http://github.com/basho
@pharkmillups
@basho