Slide 1

Slide 1 text

Searching and Accessing Data in Riak September 20, 2012

Slide 2

Slide 2 text

Us •  Andy Gross •  Architect •  @argv0 on Twitter •  [email protected] •  Shanley Kane •  Director of Product Management •  @shanley on Twitter •  [email protected]

Slide 3

Slide 3 text

How Do You Get Data Out? •  Key/object operations •  Riak Search •  Secondary indexing •  Map Reduce

Slide 4

Slide 4 text

At a High Level

Slide 5

Slide 5 text

Object / Key Operations •  Key/value pairs stored in buckets •  Any data type; objects are stored as binaries on disk •  The majority of operations in Riak are reading, writing and deleting objects •  [Diagram: a bucket containing several key/value pairs]
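As a rough illustration of key/object operations over HTTP, the sketch below only builds request descriptions and sends nothing. The `/buckets/<bucket>/keys/<key>` path follows the curl example later in this deck; the helper names, the default content type, and the `localhost:8098` address are assumptions made for the example.

```python
from urllib.parse import quote

BASE_URL = "http://localhost:8098"  # assumption: a local node on port 8098


def object_url(bucket, key, base_url=BASE_URL):
    """Return the URL addressing one object inside one bucket."""
    return f"{base_url}/buckets/{quote(bucket, safe='')}/keys/{quote(key, safe='')}"


def build_request(method, bucket, key, body=None,
                  content_type="application/octet-stream"):
    """Describe a read (GET), write (PUT) or delete (DELETE) of one object.

    Hypothetical helper: it returns a plain dict instead of talking to a node.
    """
    if method not in ("GET", "PUT", "DELETE"):
        raise ValueError(f"unsupported method: {method}")
    if (method == "PUT") != (body is not None):
        raise ValueError("a body is required for PUT and not allowed otherwise")
    # The value is opaque bytes; the content type is just metadata about it.
    headers = {"Content-Type": content_type} if body is not None else {}
    return {
        "method": method,
        "url": object_url(bucket, key),
        "headers": headers,
        "body": body,
    }


write = build_request("PUT", "users", "john_smith", body=b"\x00\x01binary")
read = build_request("GET", "users", "john_smith")
delete = build_request("DELETE", "users", "john_smith")
```

All three operations address the same URL; only the HTTP method and the presence of a body differ.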

Slide 6

Slide 6 text

Interfaces •  HTTP API •  Protobufs API •  Client libraries: •  Ruby, Node.js, Java, Python, Perl, OCaml, Erlang, PHP, C, Squeak, Smalltalk, Pharo, Clojure, Scala, Haskell, Lisp, Go, .NET, and more •  Supported by either Basho or the community

Slide 7

Slide 7 text

Options for Searching / Aggregating Data •  Riak Search •  Full-text search index for Riak •  Automatically extracts, analyzes and indexes data •  Robust, Lucene-like query language •  Scoring and ranking

Slide 8

Slide 8 text

Options for Searching / Aggregating Data •  Secondary Indexes •  Tag objects in Riak with key/value metadata •  Query by exact match or range

Slide 9

Slide 9 text

Options for Searching / Aggregating Data •  Map Reduce •  Designed for small, low-latency jobs, not huge batch jobs like Hadoop •  JavaScript and Erlang support

Slide 10

Slide 10 text

Riak Search… GOOD for: •  Text, text, text •  User bios, blog posts, articles, other documents •  Getting information fast and easily •  Indexing JSON data

Slide 11

Slide 11 text

Riak Search Features •  Easy, robust query language returns list of matching bucket/key pairs •  Exact matches •  Wildcards •  Inclusive / exclusive ranges •  AND / OR / NOT •  Grouping •  Proximity searches
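The query features listed above can be sketched as small string builders. The syntax shown is the usual Lucene-style notation (quotes for exact phrases, `*` for wildcards, `[a TO b]` and `{a TO b}` for inclusive and exclusive ranges, `~n` for proximity); it is assumed here that Riak Search accepts it in this form, and the field names and helper functions are hypothetical.

```python
def exact(field, phrase):
    """Exact phrase match, e.g. title:"See spot run"."""
    return f'{field}:"{phrase}"'


def wildcard(field, prefix):
    """Prefix wildcard, e.g. title:spo*."""
    return f"{field}:{prefix}*"


def inclusive_range(field, low, high):
    """Range including both endpoints, e.g. year:[1990 TO 2000]."""
    return f"{field}:[{low} TO {high}]"


def exclusive_range(field, low, high):
    """Range excluding both endpoints, e.g. year:{1990 TO 2000}."""
    return f"{field}:{{{low} TO {high}}}"


def proximity(field, words, distance):
    """Words within `distance` positions of each other."""
    phrase = " ".join(words)
    return f'{field}:"{phrase}"~{distance}'


def group(clauses, op="AND"):
    """Parenthesised group of clauses joined by AND or OR."""
    if op not in ("AND", "OR"):
        raise ValueError("op must be AND or OR")
    return "(" + f" {op} ".join(clauses) + ")"


def negate(clause):
    """Exclude documents matching the clause."""
    return f"NOT {clause}"


query = group([exact("title", "See spot run"), wildcard("author", "sm")], "OR") \
    + " AND " + negate(inclusive_range("year", 1990, 2000))
```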

Slide 12

Slide 12 text

Riak Search Features •  Support for various mime types •  Custom extractors •  HTTP and Protobufs API support •  Use search query instead of key list or bucket as input for M/R job

Slide 13

Slide 13 text

Riak Search: NOT GOOD for •  Anything non-text •  Cases where you just need some basic tagging

Slide 14

Slide 14 text

Riak Search User Example •  Saving/sharing application •  Searches users’ clips and tags •  A number of optimizations gave Search a 100x performance increase on some of their queries •  Read their blog!

Slide 15

Slide 15 text

Riak Search: How it Works •  Enable Search in app.config and then on a per-bucket basis •  Riak creates a set of virtual nodes to handle search requests

Slide 16

Slide 16 text

Riak Search: How it Works •  Objects are indexed as they are written using a pre-commit hook •  Default or custom schemas •  Indexes are replicated around the cluster

Slide 17

Slide 17 text

Riak Search: How it Works •  Indexes are stored by term (term-based partitioning) •  Ideal for short queries: if you are querying on a single term, you only have to go to one machine •  Downsides: AND queries are slower, index entries can be unbounded in size, and latencies are higher for large result sets
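To make the trade-off of term-based partitioning concrete, here is a toy in-memory model, not Riak's implementation. Each term's posting list lives on exactly one partition chosen by hashing the term, so a single-term lookup touches one partition while an AND query must visit the partition of every term. The partition count, the CRC32 hash and the class name are assumptions for the example.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 8  # hypothetical, far smaller than a real ring


def partition_for(term):
    """Pick the single partition that owns a term's posting list."""
    return zlib.crc32(term.encode("utf-8")) % NUM_PARTITIONS


class TermPartitionedIndex:
    def __init__(self):
        # One {term: set of document keys} map per partition.
        self.partitions = [defaultdict(set) for _ in range(NUM_PARTITIONS)]

    def index(self, doc_key, text):
        """Add every distinct term of a document to the partition owning that term."""
        for term in set(text.lower().split()):
            self.partitions[partition_for(term)][term].add(doc_key)

    def lookup(self, term):
        """Return (matching keys, partitions contacted) for one term."""
        p = partition_for(term)
        return set(self.partitions[p].get(term, set())), {p}

    def lookup_all(self, terms):
        """AND query: fetch one posting list per term, then intersect them."""
        contacted, postings = set(), []
        for term in terms:
            keys, parts = self.lookup(term)
            postings.append(keys)
            contacted |= parts
        matches = set.intersection(*postings) if postings else set()
        return matches, contacted


idx = TermPartitionedIndex()
idx.index("books/1", "See spot run")
idx.index("books/2", "See spot jump")
idx.index("books/3", "Run away")
```

A popular term also shows why index entries can grow without bound: every document containing it lands in the same posting list on the same partition.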

Slide 18

Slide 18 text

Riak Search: How it Works •  New repair mechanism for Search partitions •  Indexes are replicated across the cluster •  Indexing and querying can be run from any node •  Add new Riak Search nodes easily

Slide 19

Slide 19 text

Riak Search: Query Example •  Command Line: bin/search-cmd search books "title:\"See spot run\"" •  HTTP (Solr) Interface http://hostname:8098/solr/select? •  Also use Riak clients, Erlang command line or protobufs API
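The Solr-style URL on this slide is cut off after `select?`, so the sketch below is only one plausible way to finish such a request. It assumes the index name appears in the path as `/solr/<index>/select` and that the query and response format are passed as the standard Solr parameters `q` and `wt`; the function name is hypothetical, and all of this should be checked against the Riak Search documentation.

```python
from urllib.parse import urlencode


def solr_select_url(host, index, query, port=8098, wt="json"):
    """Build a Solr-style select URL (assumed layout, nothing is sent)."""
    # urlencode percent-escapes the colon and quotes and turns spaces into '+'.
    params = urlencode({"q": query, "wt": wt})
    return f"http://{host}:{port}/solr/{index}/select?{params}"


url = solr_select_url("hostname", "books", 'title:"See spot run"')
```

Encoding the query with `urlencode` avoids the shell-quoting problems that the command-line form has with nested quotes.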

Slide 20

Slide 20 text

Secondary Indexes… GOOD for: •  Tagging objects with queryable values •  Querying on exact or range value for integers and strings •  Storing information about opaque blobs •  Super easy to use, simple searching

Slide 21

Slide 21 text

Secondary Indexes: NOT good for •  Composite queries (but coming soon) •  Pagination (coming soon) •  Totally ordered sets •  Large ring sizes…. More on that in a minute

Slide 22

Slide 22 text

Secondary Indexes: How it Works •  Tags objects with additional key/value metadata at write time (think: HTTP header) •  Document-based partitioning: the index is stored with the document

Slide 23

Slide 23 text

Secondary Indexes: How it Works •  N copies of the index data are stored, one with each replica of an object •  A query is sent to 1/N of the partitions; index data is read there, and a list of matching keys is sent back to the requesting node •  Poor performance on ring sizes of over 512 partitions
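A toy model of document-based partitioning may help here; it is a sketch, not Riak's actual coverage algorithm. Each object is written, together with its index entries, to N consecutive partitions, so a query only needs to visit every N-th partition to see at least one replica of every object. The ring size, N value, hash function and class name are all assumptions.

```python
import zlib

RING_SIZE = 8  # hypothetical number of partitions
N_VAL = 3      # hypothetical number of replicas per object


def preference_list(key):
    """The N consecutive partitions that hold replicas of a key."""
    first = zlib.crc32(key.encode("utf-8")) % RING_SIZE
    return [(first + i) % RING_SIZE for i in range(N_VAL)]


class DocPartitionedStore:
    def __init__(self):
        # Each partition maps object key -> that object's index entries.
        self.partitions = [dict() for _ in range(RING_SIZE)]

    def put(self, key, indexes):
        """Store the index entries alongside every replica of the object."""
        for p in preference_list(key):
            self.partitions[p][key] = dict(indexes)

    def query(self, index_name, value):
        """Scan a covering set of partitions; return (keys, partitions contacted)."""
        # Every window of N consecutive partitions contains a multiple of N
        # (or wraps past partition 0), so this set sees each object at least once.
        covering = list(range(0, RING_SIZE, N_VAL))
        matches = set()
        for p in covering:
            for key, entries in self.partitions[p].items():
                if entries.get(index_name) == value:
                    matches.add(key)  # the set removes duplicate replicas
        return sorted(matches), covering


store = DocPartitionedStore()
store.put("john_smith", {"twitter_bin": "jsmith123"})
store.put("jane_doe", {"twitter_bin": "jdoe"})
store.put("john_alt", {"twitter_bin": "jsmith123"})
```

The number of partitions contacted grows with the ring size, which is consistent with the warning above about large rings.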

Slide 24

Slide 24 text

Secondary Indexes: How it Works •  2i has anti-entropy built in: piggybacks off read repair

Slide 25

Slide 25 text

Secondary Indexes: Query Example
•  Insert object:
curl -X POST \
  -H 'x-riak-index-twitter_bin: jsmith123' \
  -H 'x-riak-index-email_bin: [email protected]' \
  -d '...user data...' \
  http://localhost:8098/buckets/users/keys/john_smith
•  Query:
curl localhost:8098/buckets/users/index/twitter_bin/jsmith123
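The same requests can be assembled programmatically. The sketch below mirrors the header and URL shapes from the curl example and adds a range query, assumed to take the form `/index/<field>/<start>/<end>`. The `age_int` field, the helper names and the base URL are illustrative assumptions; nothing is sent over the network.

```python
BASE_URL = "http://localhost:8098"  # assumption: same local node as the curl example


def index_headers(indexes):
    """Turn {"twitter_bin": "jsmith123"} into x-riak-index-* request headers."""
    for name in indexes:
        # String indexes end in _bin and integer indexes in _int.
        if not name.endswith(("_bin", "_int")):
            raise ValueError(f"index name needs a _bin or _int suffix: {name}")
    return {f"x-riak-index-{name}": str(value) for name, value in indexes.items()}


def exact_query_url(bucket, index, value, base_url=BASE_URL):
    """URL for an exact-match index query."""
    return f"{base_url}/buckets/{bucket}/index/{index}/{value}"


def range_query_url(bucket, index, start, end, base_url=BASE_URL):
    """URL for a range index query (assumed path layout)."""
    return f"{base_url}/buckets/{bucket}/index/{index}/{start}/{end}"


headers = index_headers({"twitter_bin": "jsmith123", "age_int": 32})
exact_url = exact_query_url("users", "twitter_bin", "jsmith123")
range_url = range_query_url("users", "age_int", 30, 40)
```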

Slide 26

Slide 26 text

Map Reduce •  Spreads query across many vnodes to take advantage of parallel processing •  Moves computation (JS/Erlang functions) to data (vnodes)

Slide 27

Slide 27 text

Map Reduce: GOOD for •  Returning actual objects, not just keys •  Filtering by tags, counting words, extracting links, analyzing log files, aggregation tasks •  When you know the set of objects
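When the set of objects is known, a job can name its inputs as explicit bucket/key pairs instead of a whole bucket. The sketch below builds such a job description as JSON. The overall shape (`inputs` plus a `query` list of phases) and the built-in `Riak.reduceSum` reduce function are given as best recalled from the Riak HTTP MapReduce interface and should be verified against its documentation; the bucket, keys and map source are made up.

```python
import json

# JavaScript map phase: emit the number of space-separated words in one object.
MAP_SOURCE = "function(v) { return [v.values[0].data.split(' ').length]; }"


def build_job(inputs, map_source, reduce_name):
    """Describe a two-phase job over an explicit list of (bucket, key) pairs."""
    return {
        "inputs": [[bucket, key] for bucket, key in inputs],
        "query": [
            {"map": {"language": "javascript", "source": map_source}},
            {"reduce": {"language": "javascript", "name": reduce_name}},
        ],
    }


job = build_job(
    [("logs", "2012-09-19"), ("logs", "2012-09-20")],
    MAP_SOURCE,
    "Riak.reduceSum",
)
payload = json.dumps(job)  # body that would be POSTed with Content-Type: application/json
```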

Slide 28

Slide 28 text

Map Reduce: BAD for •  Querying an entire bucket, which can be slow (requires list-keys) •  Workloads that need predictable latency

Slide 29

Slide 29 text

Map Reduce: How it Works MAP PHASE: take one piece of data as input, and produce zero or more results as output.

Slide 30

Slide 30 text

Map Reduce: How it Works REDUCE PHASE: combine the output of many “map step” evaluations into one result

Slide 31

Slide 31 text

Map Reduce: How it Works 1.  Client makes M/R request 2.  Receiving node becomes the coordinator 3.  Request is routed to vnodes by the coordinator 4.  Results are sent back to the coordinating node 5.  Coordinator executes the reduce phase
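The five steps above can be sketched as a small in-memory simulation. This is a toy model under stated assumptions, not Riak's implementation: keys are placed on vnodes with a simple CRC32 hash, the map phase is a word count that emits zero or one result per object, and the coordinator runs the reduce phase over everything the vnodes send back. All class and function names are hypothetical.

```python
import zlib
from collections import Counter


def map_word_count(value):
    """Map phase: one object in, zero or more results out."""
    return [Counter(value.lower().split())] if value else []


def reduce_merge(partials):
    """Reduce phase: combine many map outputs into one result."""
    total = Counter()
    for counts in partials:
        total.update(counts)
    return [total]


class Vnode:
    def __init__(self):
        self.data = {}

    def run_map(self, keys, map_fn):
        """Run the map function next to the data this vnode holds."""
        results = []
        for key in keys:
            results.extend(map_fn(self.data[key]))
        return results


def vnode_for(key, vnodes):
    return vnodes[zlib.crc32(key.encode("utf-8")) % len(vnodes)]


def coordinate(vnodes, keys, map_fn, reduce_fn):
    """Steps 2 to 5: route keys to vnodes, gather map output, then reduce."""
    by_vnode = {}
    for key in keys:
        by_vnode.setdefault(id(vnode_for(key, vnodes)), []).append(key)
    gathered = []
    for vnode in vnodes:
        gathered.extend(vnode.run_map(by_vnode.get(id(vnode), []), map_fn))
    return reduce_fn(gathered)


vnodes = [Vnode() for _ in range(4)]
docs = {"doc1": "see spot run", "doc2": "run spot run", "doc3": ""}
for key, value in docs.items():
    vnode_for(key, vnodes).data[key] = value

result = coordinate(vnodes, list(docs), map_word_count, reduce_merge)
```

Because the map function runs where the data lives, only the per-object counters travel back to the coordinator, not the objects themselves.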

Slide 32

Slide 32 text

How It Works REMEMBER: Map Reduce has a different request pattern than production I/O. This might place excessive load on the cluster or cause performance issues.

Slide 33

Slide 33 text

Riak In Parallel With Other Solutions •  Riak alongside Elastic Search (via post commit hooks) •  Backup clusters for M/R and heavy analytics •  Riak -> Hadoop Connector

Slide 35

Slide 35 text

Riak •  wiki.basho.com/Riak.html •  @basho •  github.com/basho

Slide 36

Slide 36 text

Questions?