Slide 1

Riak Overload
Thursday, September 13, 2012

Slide 2

Me
• Mark Phillips
• @pharkmillups
• [email protected]

Slide 3

Slide 4

What’s in store?
• At a High Level
• For Developers
• Under the Hood
• When and Why
• Riak and NoSQL
• Etc.

Slide 5

At a High Level

Slide 6

Riak
• Dynamo-inspired key/value store
• with some extras: search, MapReduce, 2i, links, pre- and post-commit hooks, pluggable backends, HTTP and binary interfaces
• Written in Erlang with C/C++
• Open source under the Apache 2 License

Slide 7

Riak History
• Started internally at Basho in 2007
• Deployed in production the same year
• Used as the data store for Basho’s SaaS
• Open sourced in August 2009; Basho “pivots”
• Hit v1.0 in September 2011
• Now being used by 1000s in production
• Basho sells commercial extensions to Riak

Slide 8

Riak’s Design Goals (1)
• High Availability
• Low Latency
• Horizontal Scalability
• Fault Tolerance
• Ops Friendliness
• Predictability

Slide 9

Riak’s Design Goals (2)
• Design informed by Brewer’s CAP Theorem and Amazon’s Dynamo paper
• Riak is tuned to offer availability above all else
• Developers can tune for consistency (more on this later)

Slide 10

Masterless; deployed as a cluster of nodes

Slide 11

For Developers

Slide 12

Riak is a database that stores keys against values. Keys are grouped into a higher-level namespace called buckets.

Slide 13

Riak doesn’t care what you store. It will accept any data type; things are stored on disk as binaries.

Slide 14

key

Slide 15

key value

Slide 16

key value bucket

Slide 17

(diagram: a bucket containing several key/value pairs)

Slide 18

Two APIs
1. HTTP (just like the web)
2. Protocol Buffers (thank you, Google)
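To make the HTTP interface concrete, here is a small sketch in Python. The `/riak/<bucket>/<key>` path layout matches Riak's classic HTTP API of this era; the `riak_url` helper itself is hypothetical, shown only to illustrate how a key and bucket map onto a URL that you can GET, PUT (with a Content-Type header and body), or DELETE.

```python
from urllib.parse import quote

def riak_url(host, port, bucket, key):
    """Build a URL for Riak's classic HTTP interface (illustrative helper).

    GET fetches the object, PUT with a body stores it, DELETE removes it.
    Bucket and key are percent-encoded so arbitrary names are safe.
    """
    return "http://%s:%d/riak/%s/%s" % (
        host, port, quote(bucket, safe=""), quote(key, safe=""))

print(riak_url("localhost", 8098, "meetups", "nycdevops"))
# http://localhost:8098/riak/meetups/nycdevops
```

Port 8098 is Riak's default HTTP listener; the Protocol Buffers interface listens on a separate port and carries the same operations in a binary framing.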

Slide 19

Querying
• GET/PUT/DELETE
• MapReduce
• Full-Text Search
• Secondary Indexes (2i)
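As a sketch of the MapReduce option: a job is posted as JSON to Riak's `/mapred` HTTP endpoint, naming the inputs and a list of phases. The shape below follows the documented request format of this era; the bucket name and the JavaScript map function are illustrative only.

```python
import json

# Illustrative MapReduce job body: walk every object in the "meetups"
# bucket and return each object's key from a JavaScript map phase.
job = {
    "inputs": "meetups",
    "query": [
        {"map": {"language": "javascript",
                 "source": "function(v) { return [v.key]; }"}},
    ],
}
body = json.dumps(job)  # POST this to http://<node>:8098/mapred
```

Listing a whole bucket as `inputs` is expensive on large clusters; in practice, 2i or search results are commonly fed in as MapReduce inputs instead.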

Slide 20

Tunable Consistency
• n_val - number of replicas to store; bucket-level setting. Defaults to 3.
• w - number of replica acks required for a successful write; request-level setting. Defaults to 2.
• r - number of replica acks required for a successful read; request-level setting. Defaults to 2.
• Tweak consistency vs. availability
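The arithmetic behind these knobs can be sketched in a few lines: when r + w > n, every read quorum overlaps every write quorum, so a successful read is guaranteed to touch at least one replica holding the latest successful write. This is the standard quorum-overlap rule, shown here as a minimal sketch rather than anything Riak-specific.

```python
def quorums_overlap(n, r, w):
    """True when every read quorum intersects every write quorum,
    i.e. a read of r replicas must see at least one of the w
    replicas that acknowledged the latest write."""
    return r + w > n

# The slide's defaults: n_val=3, r=2, w=2 -> overlapping quorums
print(quorums_overlap(3, 2, 2))  # True
# Trading consistency for availability: r=1, w=1 with n=3
print(quorums_overlap(3, 1, 1))  # False
```

Lowering r or w makes requests succeed with fewer live replicas (more available, possibly stale); raising them does the reverse.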

Slide 21

Client Libraries
Ruby, Node.js, Java, Python, Perl, OCaml, Erlang, PHP, C, Squeak, Smalltalk, Pharo, Clojure, Scala, Haskell, Lisp, Go, .NET, Play, and more (supported by either Basho or the community).

Slide 22

Under the Hood

Slide 23

• Consistent Hashing and Replicas
• Virtual Nodes
• Vector Clocks
• Gossiping
• Append-only stores
• Handoff and Rebalancing
• Erlang/OTP

Slides 24–28 (build)

Consistent Hashing
• 160-bit integer keyspace
• divided into a fixed number of evenly-sized partitions
• partitions are claimed by nodes in the cluster
• replicas go to the N partitions following the key
(diagram: a 32-partition ring labeled 0, 2^160/4, 2^160/2, claimed by node 0 through node 3; hash(“meetups/nycdevops”) placed on the ring with N=3)
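The ring walk above can be sketched in a few lines of Python. SHA-1 supplies the 160-bit keyspace from the slide; the 32 equal partitions and the "start at the partition covering the hash, then take the next N" rule are a simplification of Riak's actual preference-list construction, shown for illustration only.

```python
import hashlib

RING_SIZE = 2 ** 160          # 160-bit integer keyspace
NUM_PARTITIONS = 32           # fixed number of evenly-sized partitions
PARTITION_SIZE = RING_SIZE // NUM_PARTITIONS

def key_hash(bucket, key):
    """Hash a bucket/key pair onto the 160-bit ring (SHA-1)."""
    digest = hashlib.sha1((bucket + "/" + key).encode()).digest()
    return int.from_bytes(digest, "big")

def preference_list(bucket, key, n=3):
    """Indices of the N consecutive partitions that hold replicas,
    starting at the partition covering the key's hash."""
    first = key_hash(bucket, key) // PARTITION_SIZE
    return [(first + i) % NUM_PARTITIONS for i in range(n)]

plist = preference_list("meetups", "nycdevops", n=3)
```

Because each node claims several partitions, the three entries of `plist` normally land on three distinct physical nodes; when a node is down, requests simply continue around the ring to a fallback.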

Slides 29–34 (build)

Disaster Scenario
• node fails
• requests go to fallback
• node comes back
• “Handoff” - data returns to recovered node
• normal operations resume
(diagram: the ring with the failed node’s partitions marked, and hash(“meetups/nycdevops”) routed to fallback partitions)

Slide 35

Virtual Nodes
• Each physical machine runs a certain number of vnodes
• Unit of addressing and concurrency in Riak
• Storage not tied to physical assets
• Enables dynamic rebalancing of data when cluster topology changes

Slide 36

Vector Clocks
• Data structure used to reason about causality at the object level
• Provides a happened-before relationship between events
• Each object in Riak has a vector clock*
• Trade off space, speed, and complexity for safety
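A minimal dict-based sketch of the idea (Riak's actual clocks carry actor/counter pairs with pruning, but the comparison logic is the same in spirit): each actor bumps its own counter on update, and two clocks where neither "descends" the other represent concurrent writes, which Riak surfaces as siblings.

```python
def vc_increment(vc, actor):
    """Return a copy of clock vc with this actor's counter bumped."""
    vc = dict(vc)
    vc[actor] = vc.get(actor, 0) + 1
    return vc

def vc_descends(a, b):
    """True if clock a has seen everything clock b has
    (b happened-before, or equals, a)."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())

def vc_concurrent(a, b):
    """Neither descends the other: a genuine conflict."""
    return not vc_descends(a, b) and not vc_descends(b, a)

base = {}
alice = vc_increment(base, "alice")   # {"alice": 1}
bob = vc_increment(base, "bob")       # {"bob": 1}   - concurrent with alice
merged = vc_increment(vc_increment(alice, "bob"), "alice")
```

`alice` and `bob` both edited the empty object independently, so their clocks are concurrent; `merged` descends both, so it can safely replace them.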

Slide 37

Handoff and Rebalancing
• When cluster topology changes, data must be rebalanced
• Handoff and rebalancing happen in the background; no manual intervention required*
• Trade off speed of convergence vs. effects on cluster performance

Slide 38

Gossip Protocol
• Nodes “gossip” their view of cluster state
• Enables nodes to store minimal cluster state
• Can lead to network chattiness; in OTP, all nodes are fully connected

Slide 39

Append-only Stores
• Riak has a pluggable backend architecture
• Bitcask and LevelDB are used the most in production, depending on use case
• All writes are appends to a file
• This provides crash safety and fast writes
• Tradeoff - periodic, background compaction is required
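The append-plus-compaction cycle can be shown with a toy Bitcask-style store: every write appends the value to a log file, an in-memory index maps each key to the offset of its latest value, and compaction rewrites only the live values. This is an illustrative sketch under simplified assumptions, not Bitcask's real on-disk format (which also records keys, CRCs, and timestamps in the log).

```python
import os
import tempfile

class TinyAppendStore:
    """Toy Bitcask-style store: append-only data file plus an
    in-memory index of key -> (offset, length) of the latest value."""

    def __init__(self, path):
        self.path = path
        self.index = {}
        open(path, "ab").close()  # ensure the log file exists

    def put(self, key, value):
        with open(self.path, "ab") as f:
            offset = f.tell()     # appends never overwrite old data
            f.write(value)
        self.index[key] = (offset, len(value))

    def get(self, key):
        offset, length = self.index[key]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length)

    def compact(self):
        """Rewrite only live values; superseded versions are dropped."""
        new_path = self.path + ".compact"
        new_index = {}
        with open(new_path, "wb") as f:
            for key in self.index:
                value = self.get(key)
                new_index[key] = (f.tell(), len(value))
                f.write(value)
        os.replace(new_path, self.path)
        self.index = new_index

path = os.path.join(tempfile.mkdtemp(), "log")
store = TinyAppendStore(path)
store.put(b"k", b"v1")
store.put(b"k", b"v2")   # v1 stays on disk until compaction
store.put(b"j", b"x")
size_before = os.path.getsize(path)
store.compact()          # log shrinks; reads still see latest values
```

Crash safety falls out of the design: since old data is never overwritten in place, a crash mid-write can at worst lose the tail of the log, never corrupt earlier entries.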

Slide 40

Erlang/OTP
• Shared-nothing, immutable, message-passing, functional, concurrent
• Distributed systems primitives in the core language
• OTP (Open Telecom Platform)
• Ericsson AXD-301: 99.9999999% uptime (31ms downtime/year)

Slide 41

Riak: when and why

Slide 42

When Might Riak Make Sense?
• When you have enough data to require >1 physical machine (preferably >4)
• When availability is more important than consistency (think “critical data” on “big data”)
• When your data can be modeled as keys and values; don’t be afraid to denormalize

Slide 43

User/Metadata Store
• User profile storage for the xfinityTV Mobile app
• Storage of metadata on content providers and licensing
• Strict latency requirements

Slide 44

Notifications

Slide 45

Session Storage
• First Basho customer in 2009
• Every hit to a Mochi web property results in at least one read, and maybe a write, to Riak
• Unavailability or high latency = lost ad revenue

Slide 46

Ad Serving
• OpenX will serve ~4T ads in 2012
• Started with CouchDB and Cassandra for various parts of the infrastructure
• Now consolidating on Riak and Riak Core

Slide 47

Riak for All Storage: Voxer

Slide 48

Voxer: Initial Stats
• 11 Riak nodes (switched from CouchDB)
• 100s of GBs
• ~20k peak concurrent users
• ~4MM daily requests

Slide 49

Slide 50

Voxer: Post-Growth
• ~60 nodes total in prod
• 100s of TBs of data (>1TB daily)
• ~400k concurrent users
• Billions of daily requests

Slide 51

Riak and NoSQL

Slide 52

Choosing a NoSQL Database
• At small scale, everything works
• NoSQL DBs trade off traditional features to better support new and emerging use cases
• Knowledge of the underlying system is essential
• A lot of NoSQL marketing is still bullshit

Slide 53

NoSQL by Data Model
• Key/Value - Riak, Redis, Voldemort, Cassandra*
• Document - MongoDB, CouchDB
• Column(esque) - HBase*
• Graph - Neo4j

Slide 54

NoSQL by Distribution
• Masterless - Riak, Voldemort, Cassandra
• Master/Slave - MongoDB, HBase*, CouchDB, Redis*

Slide 55

Etc...

Slide 56

New in Riak 1.2
• LevelDB improvements
• FreeBSD support
• New cluster admin tools
• Folsom for stats
• KV and Search repair work
• Much, much more

Slide 57

What needs fixing in Riak?
• Active AE (anti-entropy)
• Object compactness
• Rack awareness
• Ring sizing

Slide 58

Future Work
• Active Anti-Entropy
• Bona fide data types
• Deeper Solr integration
• Consistency
• Lots of other hotness

Slide 59

http://ricon2012.com
When and where? Wednesday, October 10 through Thursday, October 11, at the W Hotel in downtown San Francisco.

Slide 60

Riak
• wiki.basho.com/Riak.html
• @basho
• github.com/basho

Slide 61

Questions?
• Mark Phillips
• @pharkmillups
• [email protected]