Searching and Accessing Data in Riak

September 20, 2012

Us
•  Andy Gross, Architect, @argv0 on Twitter
•  Shanley Kane, Director of Product Management, @shanley on Twitter

How Do You Get Data Out?
•  Key/object operations
•  Riak Search
•  Secondary indexing
•  Map Reduce

At a High Level

Object / Key Operations
•  Key/value pairs stored in buckets
•  Any data type; objects are stored as binaries on disk
•  The majority of operations in Riak are reading, writing and deleting objects
[Diagram: a bucket containing key/value pairs]
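The read, write and delete operations above can be sketched in Python against the HTTP interface. This is a minimal sketch that only builds the requests and sends nothing. The `/buckets/<bucket>/keys/<key>` layout is the one used in the secondary-index example later in this deck; the host, port, bucket, key, JSON payload and the helper name `object_request` are assumptions for illustration.

```python
import json
import urllib.request
from urllib.parse import quote

BASE = "http://localhost:8098"  # assumed host and port


def object_request(method, bucket, key, value=None):
    """Build (but do not send) an HTTP request for one key/object operation."""
    url = f"{BASE}/buckets/{quote(bucket, safe='')}/keys/{quote(key, safe='')}"
    data = None
    headers = {}
    if value is not None:
        # Riak stores the body as an opaque binary; JSON is just one choice.
        data = json.dumps(value).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(url, data=data, headers=headers, method=method)


write = object_request("PUT", "users", "john_smith", {"name": "John Smith"})
read = object_request("GET", "users", "john_smith")
delete = object_request("DELETE", "users", "john_smith")

for req in (write, read, delete):
    print(req.get_method(), req.full_url)
```

Passing any of these to `urllib.request.urlopen` would run it against a live node.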

Interfaces
•  HTTP API
•  Protobufs API
•  Client libraries: Ruby, Node.js, Java, Python, Perl, OCaml, Erlang, PHP, C, Squeak, Smalltalk, Pharo, Clojure, Scala, Haskell, Lisp, Go, .NET, and more
•  Supported by either Basho or the community

Options for Searching / Aggregating Data
•  Riak Search
•  Full-text search index for Riak
•  Automatically extracts, analyzes and indexes data
•  Robust, Lucene-like query language
•  Scoring and ranking

Options for Searching / Aggregating Data
•  Secondary Indexes
•  Tag objects in Riak with key/value metadata
•  Query by exact match or range

Options for Searching / Aggregating Data
•  Map Reduce
•  Designed for small, low-latency jobs, not huge batch jobs like Hadoop
•  JavaScript and Erlang support

Riak Search… GOOD for:
•  Text, text, text
•  User bios, blog posts, articles, other documents
•  Getting information fast and easily
•  Indexing JSON data

Riak Search Features
•  Easy, robust query language returns list of matching bucket/key pairs
•  Exact matches
•  Wildcards
•  Inclusive / exclusive ranges
•  AND / OR / NOT
•  Grouping
•  Proximity searches
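The query features listed above can be illustrated with a few small Python helpers that compose query strings. This is a sketch that assumes Lucene-style syntax (quoted phrases, `*` wildcards, `[a TO b]` and `{a TO b}` ranges, `~n` proximity); the helper names and the field names are hypothetical, and the exact syntax accepted should be checked against the Riak Search documentation.

```python
def field(name, term):
    """Exact match on a field; terms containing spaces are quoted as a phrase."""
    if " " in term:
        term = f'"{term}"'
    return f"{name}:{term}"


def inclusive_range(name, low, high):
    """Range that includes both endpoints."""
    return f"{name}:[{low} TO {high}]"


def exclusive_range(name, low, high):
    """Range that excludes both endpoints."""
    return f"{name}:{{{low} TO {high}}}"


def proximity(name, words, distance):
    """Words that must appear within `distance` positions of each other."""
    phrase = " ".join(words)
    return f'{name}:"{phrase}"~{distance}'


def group(*clauses, op="AND"):
    """Parenthesized group of clauses joined by AND or OR."""
    return "(" + f" {op} ".join(clauses) + ")"


queries = [
    field("title", "See spot run"),           # exact phrase
    field("title", "spo*"),                   # wildcard
    inclusive_range("year", 1990, 2000),      # inclusive range
    exclusive_range("year", 1990, 2000),      # exclusive range
    group(field("author", "smith"), field("author", "jones"), op="OR")
    + " AND NOT " + field("title", "draft"),  # boolean operators and grouping
    proximity("body", ["spot", "run"], 5),    # proximity search
]

for q in queries:
    print(q)
```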

Riak Search Features
•  Support for various MIME types
•  Custom extractors
•  HTTP and Protobufs API support
•  Use a search query instead of a key list or bucket as input for an M/R job
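Feeding a search query into a Map Reduce job might look like the following Python sketch, which only builds the JSON job description. The `inputs` shape (`riak_search` / `mapred_search`) and the built-in `Riak.mapValuesJson` map function are written from memory of the Riak 1.x documentation and should be treated as assumptions to verify; the bucket name, query and helper name are hypothetical.

```python
import json


def search_mapreduce_job(bucket, query):
    """Describe an M/R job whose inputs come from a search query, not a key list."""
    return {
        # Assumed input form: let Riak Search produce the bucket/key pairs.
        "inputs": {
            "module": "riak_search",
            "function": "mapred_search",
            "arg": [bucket, query],
        },
        # One map phase that returns each matching object's JSON value.
        "query": [
            {"map": {"language": "javascript", "name": "Riak.mapValuesJson"}},
        ],
    }


job = search_mapreduce_job("books", 'title:"See spot run"')
body = json.dumps(job, indent=2)
print(body)
```

The resulting body would be sent as `application/json` to the cluster's Map Reduce endpoint.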

Riak Search: NOT GOOD for
•  Anything non-text
•  Cases where you just need some basic tagging

Riak Search User Example
•  Saving/sharing application
•  Searches users’ clips and tags
•  A number of optimizations gave Search a 100x performance increase for some of their queries
•  Read their blog!

Riak Search: How it Works
•  Enable Search in app.config and then on a per-bucket basis
•  Riak creates a set of virtual nodes to handle search requests

Riak Search: How it Works
•  Objects are indexed as they are written using a pre-commit hook
•  Default or custom schemas
•  Indexes are replicated around the cluster
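Installing the indexing pre-commit hook on a single bucket might look like this Python sketch, which builds the request without sending it. The hook module and function (`riak_search_kv_hook`, `precommit`) and the `/buckets/<bucket>/props` endpoint are recalled from the Riak 1.x documentation and are assumptions to verify; the host, port, bucket name and helper name are hypothetical.

```python
import json
import urllib.request


def enable_search_request(bucket, base="http://localhost:8098"):
    """Build a bucket-properties update that installs the search pre-commit hook."""
    props = {
        "props": {
            # Assumed hook: indexes each object before the write is committed.
            "precommit": [{"mod": "riak_search_kv_hook", "fun": "precommit"}]
        }
    }
    return urllib.request.Request(
        f"{base}/buckets/{bucket}/props",
        data=json.dumps(props).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = enable_search_request("books")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```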

Riak Search: How it Works
•  Indexes are stored by term (term-based partitioning)
•  Ideal for short queries: if you are querying on a single term, you only have to go to one machine
•  Downside: AND queries are slower, index entries can be unbounded in size, and latencies are higher for large result sets
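Term-based partitioning can be illustrated with a toy Python model. This is a sketch, not how Riak Search is implemented: it hashes each term to one of a fixed number of partitions with a simple modulo (Riak uses a consistent-hashing ring), ignores replication, and all class and function names are hypothetical.

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 8  # assumed toy ring size


def partition_for(term):
    """Pick the partition that owns a term's posting list."""
    digest = hashlib.sha1(term.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS


class TermPartitionedIndex:
    def __init__(self):
        # partition -> term -> set of object keys containing that term
        self.partitions = defaultdict(lambda: defaultdict(set))

    def index(self, key, text):
        for term in set(text.lower().split()):
            self.partitions[partition_for(term)][term].add(key)

    def search_all(self, terms):
        """AND query: returns (matching keys, set of partitions contacted)."""
        touched = set()
        result = None
        for term in terms:
            p = partition_for(term)
            touched.add(p)
            postings = self.partitions[p].get(term, set())
            result = set(postings) if result is None else result & postings
        return (result or set()), touched


idx = TermPartitionedIndex()
idx.index("book1", "see spot run")
idx.index("book2", "see jane run")

keys, touched = idx.search_all(["spot"])        # single term: one partition
both, touched_and = idx.search_all(["see", "run"])  # AND: up to one partition per term
print(keys, len(touched))
print(both, len(touched_and))
```

A single-term query reads one posting list from one partition, while an AND query has to fetch a full posting list per term and intersect them, which is where the extra cost comes from.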

Riak Search: How it Works
•  New repair mechanism for Search partitions
•  Indexes are replicated across the cluster
•  Indexing and querying can be run from any node
•  Add new Riak Search nodes easily

Riak Search: Query Example
•  Command line: bin/search-cmd search books "title:\"See spot run\""
•  HTTP (Solr) interface: http://hostname:8098/solr/select?
•  Also available through Riak clients, the Erlang command line or the protobufs API
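The HTTP form of the same query can be sketched in Python by URL-encoding the query string onto the Solr-style endpoint shown above. The parameter names `index`, `q`, `wt` and `rows` are assumptions based on the Solr-compatible interface and should be checked against the Riak Search documentation; `hostname` is the placeholder from the slide and the helper name is hypothetical.

```python
from urllib.parse import urlencode


def solr_select_url(host, index, query, rows=10):
    """Build a Solr-style select URL for a Riak Search query."""
    params = {
        "index": index,  # assumed: which search index (bucket) to query
        "q": query,      # the query string itself
        "wt": "json",    # assumed: response format
        "rows": rows,    # assumed: maximum number of results
    }
    return f"http://{host}:8098/solr/select?" + urlencode(params)


url = solr_select_url("hostname", "books", 'title:"See spot run"')
print(url)
```

Encoding through `urlencode` avoids the shell-quoting problems that come with embedding quoted phrases in a URL by hand.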

Secondary Indexes… GOOD for:
•  Tagging objects with queryable values
•  Querying on exact or range value for integers and strings
•  Storing information about opaque blobs
•  Super easy to use, simple searching

Secondary Indexes: NOT good for
•  Composite queries (but coming soon)
•  Pagination (coming soon)
•  Totally ordered sets
•  Large ring sizes (more on that in a minute)

Secondary Indexes: How it Works
•  Tags objects with additional key/value metadata at write time (think: HTTP header)
•  Document-based partitioning: the index is stored with the document

Secondary Indexes: How it Works
•  N number of indexes stored
•  Query is sent to 1/n partitions, index data is read, and a list of matching keys is sent back to the requesting node
•  Poor performance on ring sizes of over 512 partitions
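Document-based partitioning can be sketched with a toy Python model in which every partition indexes only the objects it stores. This is an illustration under simplifying assumptions, not Riak's implementation: objects are placed by a simple modulo hash, there is no replication (so the query visits every partition rather than roughly 1/n of them), and all names are hypothetical.

```python
import hashlib
from collections import defaultdict


def partition_for(key, ring_size):
    """Pick the partition that stores an object, and therefore its index entries."""
    return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16) % ring_size


class DocPartitionedIndex:
    def __init__(self, ring_size=8):
        self.ring_size = ring_size
        # One local index per partition: (field, value) -> keys stored there.
        self.local = [defaultdict(set) for _ in range(ring_size)]

    def put(self, key, tags):
        p = partition_for(key, self.ring_size)
        for field, value in tags.items():
            self.local[p][(field, value)].add(key)

    def query(self, field, value):
        """Ask each partition for local matches and merge the key lists."""
        keys, contacted = set(), 0
        for index in self.local:
            contacted += 1
            keys |= index.get((field, value), set())
        return keys, contacted


idx = DocPartitionedIndex(ring_size=8)
idx.put("john_smith", {"twitter_bin": "jsmith123"})
idx.put("jane_doe", {"twitter_bin": "jdoe"})

keys, contacted = idx.query("twitter_bin", "jsmith123")
print(keys, contacted)
```

Because the number of partitions contacted grows with the ring size, the model also hints at why large rings make these queries slower.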

Secondary Indexes: How it Works
•  2i has anti-entropy built in: it piggybacks off read repair

Secondary Indexes: Query Example
•  Insert object:
curl -X POST \
  -H 'x-riak-index-twitter_bin: jsmith123' \
  -H 'x-riak-index-email_bin: [email protected]' \
  -d '...user data...' \
  http://localhost:8098/buckets/users/keys/john_smith
•  Query:
curl localhost:8098/buckets/users/index/twitter_bin/jsmith123
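The same insert and query can be sketched in Python with the standard library, building the requests without sending them. The `x-riak-index-*` headers and URL layout follow the curl example; the range form (`/index/<field>/<start>/<end>`), the `age_int` field, the content type and the helper names are assumptions added for illustration.

```python
import urllib.request
from urllib.parse import quote

BASE = "http://localhost:8098"  # same host and port as the curl example


def put_with_indexes(bucket, key, body, indexes, content_type="text/plain"):
    """Build a write request that tags the object with secondary-index headers."""
    headers = {"Content-Type": content_type}  # content type is an assumption
    for name, value in indexes.items():
        headers[f"x-riak-index-{name}"] = str(value)
    return urllib.request.Request(
        f"{BASE}/buckets/{bucket}/keys/{key}",
        data=body,
        headers=headers,
        method="POST",
    )


def index_query_url(bucket, index, value, end=None):
    """Exact-match URL, or a range URL when `end` is given."""
    url = f"{BASE}/buckets/{quote(bucket)}/index/{quote(index)}/{quote(str(value), safe='')}"
    if end is not None:
        url += "/" + quote(str(end), safe="")
    return url


req = put_with_indexes(
    "users", "john_smith", b"...user data...",
    {"twitter_bin": "jsmith123", "age_int": 27},  # age_int is hypothetical
)
exact = index_query_url("users", "twitter_bin", "jsmith123")
ranged = index_query_url("users", "age_int", 20, 30)
print(req.get_method(), req.full_url)
print(exact)
print(ranged)
```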

Map Reduce
•  Spreads query across many vnodes to take advantage of parallel processing
•  Moves computation (JS/Erlang functions) to data (vnodes)

Map Reduce: GOOD for
•  Returning actual objects, not just keys
•  Filtering by tags, counting words, extracting links, analyzing log files, aggregation tasks
•  When you know the set of objects

Map Reduce: BAD for
•  Querying an entire bucket, which can be slow (list-keys)
•  When you need predictable latency

Map Reduce: How it Works
MAP PHASE: take one piece of data as input, and produce zero or more results as output.

Map Reduce: How it Works
REDUCE PHASE: combine the output of many “map step” evaluations into one result.

Map Reduce: How it Works
1.  Client makes M/R request
2.  Receiving node becomes the coordinator
3.  Request is routed to vnodes by the coordinator
4.  Results are sent back to the coordinating node
5.  Coordinator executes the reduce phase
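The request flow above can be mimicked with a small in-process Python simulation. It is a toy model under stated assumptions: the "vnodes" are plain dictionaries, the job is a word count, and the function names are hypothetical. It is meant only to show where the map and reduce phases run, not how Riak schedules them.

```python
from collections import Counter
from functools import reduce

# Toy cluster: each "vnode" holds some objects (key -> text).
vnodes = [
    {"log1": "error timeout error"},
    {"log2": "ok"},
    {"log3": "error ok"},
]


def map_phase(value):
    """Map: one object in, zero or more results out (here, one word count)."""
    return [Counter(value.split())]


def reduce_phase(partials):
    """Reduce: combine the output of many map evaluations into one result."""
    return [reduce(lambda a, b: a + b, partials, Counter())]


def coordinator(vnodes, wanted_keys):
    """Steps 2 to 5: route the request, gather map output, then reduce."""
    partials = []
    for vnode in vnodes:                 # step 3: request routed to vnodes
        for key in wanted_keys:
            if key in vnode:             # map runs where the data lives
                partials.extend(map_phase(vnode[key]))
    # step 4: map results are now back at the coordinator
    return reduce_phase(partials)        # step 5: coordinator reduces


result = coordinator(vnodes, ["log1", "log2", "log3"])  # step 1: client request
print(result[0])
```

Passing an explicit key list mirrors the advice that Map Reduce works best when the set of objects is known up front.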

How It Works
REMEMBER: Map Reduce has a different request pattern than production I/O. This might place excessive load on the cluster or cause performance issues.

Riak In Parallel With Other Solutions
•  Riak alongside Elastic Search (via post-commit hooks)
•  Backup clusters for M/R and heavy analytics
•  Riak -> Hadoop Connector

Riak
•  wiki.basho.com/Riak.html
•  @basho
•  github.com/basho

Questions?
