Slide 1

Riak Use Cases: Dissecting the Solutions to Hard Problems
Andy Gross <@argv0>
Chief Architect, Basho Technologies
Tuesday, June 5, 2012

Slide 2

Riak
Dynamo-inspired key/value database with full-text search, MapReduce, secondary indexes, link traversal, commit hooks, HTTP and binary interfaces, and pluggable backends
- Written in Erlang and C/C++
- Open source, Apache 2 licensed
- Enterprise features (multi-datacenter replication) and support available from Basho

Slide 3

Choosing a NoSQL Database
- At small scale, everything works.
- NoSQL databases trade off traditional features to better support new and emerging use cases.
- Knowledge of the underlying system is essential.
- A lot of NoSQL marketing is bullshit.

Slide 4

Tradeoffs
If you're evaluating Mongo vs. Riak, or CouchDB vs. Cassandra, you don't understand your problem.
By choosing Riak, you've already made tradeoffs:
- Consistency for availability in failure scenarios
- A rich data/query model for a simple, scalable one
- A mature technology for a young one

Slide 5

Distributed Systems: Desirable Properties
- Highly available
- Low latency
- Scalable
- Fault tolerant
- Ops-friendly
- Predictable

Slide 6

1000s of Deployments

Slide 7

User/Metadata Store: Comcast
- User profile storage for the xfinityTV mobile application
- Storage of metadata on content providers and content licensing info
- Strict latency requirements

Slide 8

Notification Service: Yammer

Slide 9

Session Store: Mochi Media
- First Basho customer (late 2009)
- Every hit to a Mochi web property = one read, and maybe one write, to Riak
- Unavailability or high latency = lost ad revenue

Slide 10

Document Store: GitHub Pages / Git.io
- Riak as a web server for GitHub Pages
- Webmachine is an awesome HTTP server!
- Git.io URL shortener

Slide 11

Walkie-Talkie: Voxer

Slide 12

Voxer: Initial Stats
- 11 Riak nodes
- ~500 GB dataset
- ~20k peak concurrent users
- ~4MM daily requests
Then something happened...

Slide 13


Slide 14

Voxer: Current Stats
- > 100 nodes
- ~1 TB of incoming data / day
- > 200k concurrent users
- > 2 billion requests / day
- Grew from 11 to 80 nodes between December and January

Slide 15

Distributed Systems: Desirable Properties
- High availability
- Low latency
- Horizontal scalability
- Fault tolerance
- Ops-friendliness
- Predictability

Slide 16

High Availability
Failure to accept a read/write results in:
- lost revenue
- lost users
Availability and latency are intertwined.

Slide 17

Low Latency
- Sometimes a late answer is useless or wrong
- Users perceive slow sites as unavailable
- SLA violations
- SOA approaches magnify SLA failures

Slide 18

SOA
Who cares about latency?

Slide 19

Who cares about latency?
Sometimes high latency looks like an outage to the end user.

Slide 20

Fault Tolerance
Everything fails, especially in the cloud.
When a host/disk/network fails, what is the impact on:
- availability?
- latency?
- operations staff?

Slide 21

Predictability
"It's a piece of plumbing; it has never been a root cause of any of our problems."
(Coda Hale, Yammer)

Slide 22

Operational Costs
Sound familiar?
- "We chose a bad shard key..."
- "The master node went down..."
- "The failover script did not run as expected..."
- "The root cause was traced to a configuration error..."
Staying up all night fighting your database does not make you a hero.

Slide 23

Consistency, Availability, Latency

Slide 24

CAP
The fundamental, most-discussed tradeoff.
When a network partition (message loss) occurs, the laws of physics make you choose: consistency OR availability.
No system can "beat the CAP theorem."

Slide 25

Data Distribution

Slide 26

The location of data is determined by a hash of the key.
This provides even distribution of storage and query load.
Tradeoff: the advantages gained from locality are lost:
- range queries
- aggregates
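The placement rule above can be sketched as follows. This is a simplified stand-in, not Riak's implementation: Riak hashes the bucket/key pair with SHA-1 onto a 160-bit ring divided into partitions, while the 64-partition `partition_for` helper here is purely illustrative.

```python
import hashlib

RING_SIZE = 64  # illustrative partition count; a power of two

def partition_for(bucket: bytes, key: bytes) -> int:
    """Place a bucket/key pair on the ring by hashing it."""
    digest = hashlib.sha1(bucket + b"/" + key).digest()
    return int.from_bytes(digest, "big") % RING_SIZE

# Lexically adjacent keys hash to unrelated partitions: storage and
# query load spread evenly, but range scans would touch every node.
print(partition_for(b"users", b"alice"))
print(partition_for(b"users", b"alicf"))
```

Because placement depends only on the hash, no node is a hotspot for sequential keys, which is exactly why range queries and aggregates lose their locality advantage.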

Slide 27

Consistent Hashing

Slide 28

Virtual Nodes
- The unit of addressing and concurrency in Riak
- Each host manages many vnodes
- Riak *could* manage all host-local storage as a unit and gain efficiency, but would lose simplicity in:
  - cluster resizing
  - failure isolation
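A toy picture of vnodes (the round-robin `claim` helper and host names are hypothetical, not Riak's actual claim algorithm): each host owns many small partitions rather than one big local store, so ownership can move in small units.

```python
def claim(num_partitions: int, hosts: list) -> dict:
    """Assign each partition (vnode) to a host, round-robin."""
    return {p: hosts[p % len(hosts)] for p in range(num_partitions)}

ring = claim(16, ["node1", "node2", "node3", "node4"])
node1_vnodes = [p for p, h in ring.items() if h == "node1"]
print(node1_vnodes)  # → [0, 4, 8, 12]: many small vnodes per host
```

Resizing the cluster or isolating a failed disk then means reassigning individual vnodes, not splitting one monolithic per-host store.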

Slide 29

Append-Only Stores, Bitcask

Slide 30

Append-Only Stores
- All writes are appends to a file
- This provides crash safety and fast writes
- Tradeoff: files must periodically be compacted/merged to reclaim space
- Compaction causes periodic pauses that must be masked/mitigated
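A minimal sketch of the append/compact tradeoff; the `AppendOnlyStore` class is illustrative and uses an in-memory list where a real store appends to files on disk.

```python
class AppendOnlyStore:
    """Toy log-structured store: writes append, compaction reclaims space."""

    def __init__(self):
        self.log = []  # (key, value) entries, oldest first

    def put(self, key, value):
        self.log.append((key, value))      # never overwrite in place

    def get(self, key):
        for k, v in reversed(self.log):    # newest entry wins
            if k == key:
                return v
        return None

    def compact(self):
        # Keep only the newest entry per key. In a real store this merge
        # runs periodically, and its pauses must be masked or mitigated.
        latest = {}
        for k, v in self.log:
            latest[k] = v
        self.log = list(latest.items())

store = AppendOnlyStore()
store.put("a", 1)
store.put("a", 2)
store.put("b", 3)
print(len(store.log))  # → 3: the stale ("a", 1) still occupies space
store.compact()
print(len(store.log), store.get("a"))  # → 2 2
```

The append path never seeks, which is what makes writes fast and crash-safe; the cost shows up later as compaction work.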

Slide 31

Bitcask
After the append completes, an in-memory structure called a "keydir" is updated. A keydir is simply a hash table that maps every key in a Bitcask to a fixed-size structure giving the file, offset, and size of the most recently written entry for that key.
When a write occurs, the keydir is atomically updated with the location of the newest data. The old data is still present on disk, but any new reads will use the latest version available in the keydir. As we'll see later, the merge process will eventually remove the old value.
Reading a value is simple, and doesn't ever require more than a single disk seek. We look up the key in our keydir, and from there we read the data using the file id, position, and size that are returned from that lookup. In many cases, the operating system's filesystem read-ahead cache makes this a much faster operation than would otherwise be expected.
Tradeoff: the index must fit in memory.
Low latency:
- all reads = hash lookup + 1 seek
- all writes = append to a file
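The write and read paths above can be sketched against a real file. The single data file and bare-value record format here are simplified stand-ins for Bitcask's actual on-disk layout (which also stores CRCs, timestamps, and key data):

```python
import os
import tempfile

keydir = {}  # key -> (file_id, offset, size) of the newest entry

data_path = os.path.join(tempfile.mkdtemp(), "1.data")  # file_id 1

def put(key: bytes, value: bytes) -> None:
    with open(data_path, "ab") as f:       # all writes are appends
        offset = f.tell()
        f.write(value)
    keydir[key] = (1, offset, len(value))  # atomic keydir update

def get(key: bytes) -> bytes:
    file_id, offset, size = keydir[key]    # in-memory hash lookup...
    with open(data_path, "rb") as f:
        f.seek(offset)                     # ...plus at most one seek
        return f.read(size)

put(b"k", b"v1")
put(b"k", b"v2")        # b"v1" stays on disk until a merge reclaims it
print(get(b"k"))        # → b'v2'
```

Note the tradeoff in miniature: `keydir` holds an entry per key, so the whole index must fit in memory, but every read is one dictionary lookup and one seek.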

Slide 32


Slide 33

Handoff and Rebalancing
- When nodes are added to a cluster, data must be rebalanced
- Rebalancing causes disk and network load
- Tradeoff: speed of convergence vs. effects on cluster performance
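A sketch of why only part of the data moves when a node joins. This uses simplified successor-style consistent hashing with a single ring position per node; Riak actually uses a fixed partition ring and a claim algorithm, but the property is the same: keys only hand off to the newcomer.

```python
import hashlib

def ring_position(name: str) -> int:
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big")

def owner(key_hash: int, nodes: list) -> str:
    """A key belongs to the first node clockwise from its ring position."""
    points = sorted((ring_position(n), n) for n in nodes)
    for pos, node in points:
        if key_hash <= pos:
            return node
    return points[0][1]  # wrapped past the largest position

key_hashes = [ring_position(f"key{i}") for i in range(1000)]
before = [owner(h, ["n1", "n2", "n3"]) for h in key_hashes]
after = [owner(h, ["n1", "n2", "n3", "n4"]) for h in key_hashes]

# Only keys claimed by the new node are handed off; everything else
# stays put, which bounds the disk and network load of rebalancing.
moved = sum(b != a for b, a in zip(before, after))
print(f"{moved} of 1000 keys handed off, all to n4")
```

How fast those handoffs run is the tunable part: drain them quickly and the cluster converges sooner at the cost of foreground latency.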

Slide 34

Vector Clocks
- Provide a happened-before relationship between events
- Riak tags each object with a vector clock
- Tradeoff: space, speed, and complexity for safety
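A minimal vector clock sketch (function names are illustrative; Riak's implementation additionally prunes old clock entries to bound their size):

```python
def increment(clock: dict, node: str) -> dict:
    """Record an event observed at `node`."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated

def descends(a: dict, b: dict) -> bool:
    """True if `a` happened after (or is equal to) `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def merge(a: dict, b: dict) -> dict:
    """Combine sibling clocks by taking the pointwise maximum."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}

v1 = increment({}, "client_a")
v2 = increment(v1, "client_b")        # causally after v1
sibling = increment(v1, "client_c")   # concurrent with v2
print(descends(v2, v1))                                # → True
print(descends(v2, sibling) or descends(sibling, v2))  # → False
```

When neither clock descends from the other, the writes were concurrent, and the store must keep both siblings rather than silently discard one; that bookkeeping is the space/speed/complexity cost paid for safety.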

Slide 35

Gossip Protocol
Nodes "gossip" their view of cluster state to each other.
Tradeoffs:
- atomic modification of cluster state for no SPOF
- complexity for fault tolerance
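A toy gossip simulation under an assumed model (each round, every node merges views with one random peer; real gossip protocols differ in pacing and payload) showing convergence with no coordinator:

```python
import random

def gossip_round(views: dict, rng: random.Random) -> None:
    """One round: every node exchanges and merges state with a random peer."""
    for node in list(views):
        peer = rng.choice([n for n in views if n != node])
        merged = views[node] | views[peer]  # union of known members
        views[node] = merged
        views[peer] = merged

nodes = [f"n{i}" for i in range(8)]
views = {n: {n} for n in nodes}  # initially, each node knows only itself

rng = random.Random(42)  # seeded so the run is repeatable
for rounds in range(1, 50):
    gossip_round(views, rng)
    if all(v == set(nodes) for v in views.values()):
        break
print(f"all nodes converged on full membership after {rounds} rounds")
```

There is no node whose failure stops the protocol, which is the point; the price is that cluster-state changes spread eventually rather than atomically.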

Slide 36

Sane Defaults: Speed vs. Safety
Riak ships with N=3, R=W=2
- Bad for microbenchmarks; good for production use and durability
Mongo ships with W=0
- Good for benchmarks; horrible and insane for durability and production use

Slide 37

Erlang
Best language ever:
- for distributed-systems glue code
- for safety and fault tolerance
Sometimes you want:
- destructive operations
- shared memory

Slide 38

NIFs to the Rescue?
Use NIFs for speed and for interfacing with native code, but:
- you make the Erlang VM only as reliable as your C code
- NIFs block the scheduler

Slide 39

Conclusions
Over time, operational costs dominate.
Predictability in:
- latency
- scalability
- failure scenarios
...is essential for managing operational costs.
When choosing a database, raw throughput is often the least important metric.

Slide 40

Thanks!
- Visit us at http://www.basho.com
- Check out our open source code at http://github.com/basho
- Follow us on Twitter: @basho
- We're hiring!