Searching and Accessing Data in Riak

An overview of methods for searching and aggregating data in Riak, covering Riak Search, secondary indexes and MapReduce. Reviews use cases and features for each method, when to use which, and the limitations and advantages of each approach. In addition, it covers query examples and the high-level architecture of each method.

Basho Technologies

September 20, 2012

Transcript

  1. •  Andy Gross •  Architect •  @argv0 on Twitter •  [email protected]
    •  Shanley Kane •  Director of Product Management •  @shanley on Twitter •  [email protected]
  2. How Do You Get Data Out?
    •  Key/object operations
    •  Riak Search
    •  Secondary indexing
    •  Map Reduce
  3. Object / Key Operations
    •  Key/value pairs stored in buckets
    •  Any data type; objects are stored as binaries on disk
    •  The majority of operations in Riak are reading, writing and deleting objects
    [Diagram: a bucket containing key/value pairs]
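
The key/object operations above can be sketched as plain HTTP requests. This is a minimal illustration, not a client library: it only builds the requests and never sends them. The `/buckets/<bucket>/keys/<key>` path follows the curl example later in this deck; the base URL, the helper names and the choice of PUT for writes are assumptions made for the sketch.

```python
from urllib.parse import quote
from urllib.request import Request

# Assumption: a local node listening on the HTTP port used elsewhere in this deck.
BASE_URL = "http://localhost:8098"


def object_url(bucket: str, key: str) -> str:
    """Build the /buckets/<bucket>/keys/<key> URL for one object."""
    return f"{BASE_URL}/buckets/{quote(bucket, safe='')}/keys/{quote(key, safe='')}"


def read_request(bucket: str, key: str) -> Request:
    """Fetch an object by bucket and key."""
    return Request(object_url(bucket, key), method="GET")


def write_request(bucket: str, key: str, value: bytes,
                  content_type: str = "application/octet-stream") -> Request:
    """Store an opaque binary value under bucket/key (hypothetical helper)."""
    return Request(object_url(bucket, key), data=value, method="PUT",
                   headers={"Content-Type": content_type})


def delete_request(bucket: str, key: str) -> Request:
    """Delete an object by bucket and key."""
    return Request(object_url(bucket, key), method="DELETE")


if __name__ == "__main__":
    for req in (write_request("users", "john_smith", b"...user data..."),
                read_request("users", "john_smith"),
                delete_request("users", "john_smith")):
        print(req.get_method(), req.full_url)
```

Passing any of these objects to `urllib.request.urlopen` would perform the call against a running node.
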
  4. Interfaces
    •  HTTP API
    •  Protobufs API
    •  Client libraries: Ruby, Node.js, Java, Python, Perl, OCaml, Erlang, PHP, C, Squeak, Smalltalk, Pharo, Clojure, Scala, Haskell, Lisp, Go, .NET, and more
    •  Supported by either Basho or the community
  5. Options for Searching / Aggregating Data: Riak Search
    •  Full-text search index for Riak
    •  Automatically extracts, analyzes and indexes data
    •  Robust, Lucene-like query language
    •  Scoring and ranking
  6. Options for Searching / Aggregating Data: Secondary Indexes
    •  Tag objects in Riak with key/value metadata
    •  Query by exact match or range
  7. Options for Searching / Aggregating Data: Map Reduce
    •  Designed for small, low-latency jobs, not huge Hadoop-style batch jobs
    •  JavaScript and Erlang support
  8. Riak Search… GOOD for:
    •  Text, text, text
    •  User bios, blog posts, articles, other documents
    •  Getting information quickly and easily
    •  Indexing JSON data
  9. Riak Search Features
    •  Easy, robust query language that returns a list of matching bucket/key pairs
    •  Exact matches
    •  Wildcards
    •  Inclusive / exclusive ranges
    •  AND / OR / NOT
    •  Grouping
    •  Proximity searches
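
To make the query features above concrete, here is a small sketch that composes query strings in the Lucene-style syntax the slides refer to. The helper names are hypothetical, as are all field names other than `title`; the syntax shown (quoted phrases, `*` wildcards, `[a TO b]` and `{a TO b}` ranges, `~n` proximity) is standard Lucene notation and is assumed here to match what Riak Search accepts.

```python
def exact(field: str, phrase: str) -> str:
    """Exact phrase match, e.g. title:"See spot run"."""
    return f'{field}:"{phrase}"'


def wildcard(field: str, prefix: str) -> str:
    """Prefix wildcard match, e.g. title:spo*."""
    return f"{field}:{prefix}*"


def inclusive_range(field: str, low: str, high: str) -> str:
    """Range that includes both endpoints: field:[low TO high]."""
    return f"{field}:[{low} TO {high}]"


def exclusive_range(field: str, low: str, high: str) -> str:
    """Range that excludes both endpoints: field:{low TO high}."""
    return f"{field}:{{{low} TO {high}}}"


def proximity(field: str, words: list, distance: int) -> str:
    """Words within `distance` positions of each other."""
    phrase = " ".join(words)
    return f'{field}:"{phrase}"~{distance}'


def negate(clause: str) -> str:
    """Boolean NOT of a clause."""
    return f"NOT {clause}"


def group(*clauses: str, op: str = "AND") -> str:
    """Parenthesised group joined by AND or OR."""
    return "(" + f" {op} ".join(clauses) + ")"


if __name__ == "__main__":
    query = group(
        exact("title", "See spot run"),
        group(wildcard("tags", "kid"), inclusive_range("year", "1990", "2000"), op="OR"),
    )
    print(query)
```
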
  10. Riak Search Features
    •  Support for various MIME types
    •  Custom extractors
    •  HTTP and Protobufs API support
    •  Use a search query instead of a key list or bucket as input for an M/R job
  11. Riak Search User Example
    •  Saving/sharing application
    •  Searches users’ clips and tags
    •  A number of optimizations gave Search a 100x performance increase for some of their queries
    •  Read their blog!
  12. Riak Search: How it Works
    •  Enable Search in app.config and then on a per-bucket basis
    •  Riak creates a set of virtual nodes to handle search requests
  13. Riak Search: How it Works
    •  Objects are indexed as they are written, using a pre-commit hook
    •  Default or custom schemas
    •  Indexes are replicated around the cluster
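
The index-on-write behaviour described above can be illustrated with a toy model. This is a sketch only, assuming JSON values and a whitespace/word tokenizer; `ToyBucket` and `ToySearchIndex` are hypothetical names, not Riak components, and a real pre-commit hook runs inside the Riak node rather than in client code.

```python
import json
import re
from collections import defaultdict


class ToySearchIndex:
    """Toy full-text index: maps (field, term) to the keys containing it."""

    def __init__(self, schema_fields):
        self.fields = set(schema_fields)      # stand-in for a custom schema
        self.postings = defaultdict(set)

    def index_hook(self, key: str, value: str) -> None:
        """Extract, analyze and index a JSON value before it is stored."""
        doc = json.loads(value)
        for field, text in doc.items():
            if field in self.fields and isinstance(text, str):
                for term in re.findall(r"\w+", text.lower()):
                    self.postings[(field, term)].add(key)

    def search(self, field: str, term: str) -> list:
        return sorted(self.postings.get((field, term.lower()), set()))


class ToyBucket:
    """Toy bucket that runs every pre-commit hook before accepting a write."""

    def __init__(self, precommit_hooks=()):
        self.objects = {}
        self.precommit_hooks = list(precommit_hooks)

    def put(self, key: str, value: str) -> None:
        for hook in self.precommit_hooks:
            hook(key, value)                  # a hook that raises rejects the write
        self.objects[key] = value


if __name__ == "__main__":
    index = ToySearchIndex(schema_fields=["title"])
    books = ToyBucket(precommit_hooks=[index.index_hook])
    books.put("book1", json.dumps({"title": "See Spot Run", "isbn": "123"}))
    print(index.search("title", "spot"))
```

The sketch does not remove stale postings when an object is overwritten or deleted, which a real index has to handle.
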
  14. Riak Search: How it Works
    •  Indexes are stored by term (term-based partitioning)
    •  Ideal for short queries: a single-term query only has to go to one machine
    •  Downsides: AND queries are slower, index entries can be unbounded in size, and large result sets have higher latencies
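
A toy model helps show why term-based partitioning favours short queries. In this sketch each term hashes to exactly one partition, so a single-term query touches one partition while an AND query has to fetch one posting list per term and intersect them. The partition count, hash function and class name are illustrative assumptions, not Riak's actual implementation.

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 8  # toy value, chosen only for illustration


def partition_for(term: str) -> int:
    """Map a term to the single partition that owns its posting list."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS


class TermPartitionedIndex:
    def __init__(self):
        # One {term: set of keys} map per partition.
        self.partitions = [defaultdict(set) for _ in range(NUM_PARTITIONS)]

    def add(self, key: str, terms) -> None:
        """Index a document: each term's entry lives on that term's partition."""
        for term in terms:
            self.partitions[partition_for(term)][term].add(key)

    def query(self, *terms: str):
        """AND query. Returns (matching keys, set of partitions contacted)."""
        contacted = {partition_for(t) for t in terms}
        postings = [self.partitions[partition_for(t)].get(t, set()) for t in terms]
        matches = set.intersection(*postings) if postings else set()
        return sorted(matches), contacted


if __name__ == "__main__":
    index = TermPartitionedIndex()
    index.add("book1", ["see", "spot", "run"])
    index.add("book2", ["run", "far"])
    print(index.query("run"))          # one term, one partition
    print(index.query("run", "spot"))  # one posting list per term, then intersect
```

Because every key containing a term lands in the same posting list, a very common term produces a very large entry on one partition, which is the unbounded-size downside noted above.
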
  15. Riak Search: How it Works
    •  New repair mechanism for Search partitions
    •  Indexes are replicated across the cluster
    •  Indexing and querying can be run from any node
    •  Add new Riak Search nodes easily
  16. Riak Search: Query Example
    •  Command line: bin/search-cmd search books "title:\"See spot run\""
    •  HTTP (Solr) interface: http://hostname:8098/solr/select?
    •  Also available through Riak clients, the Erlang command line, or the Protobufs API
  17. Secondary Indexes… GOOD for:
    •  Tagging objects with queryable values
    •  Querying on exact or range values for integers and strings
    •  Storing information about opaque blobs
    •  Super easy to use, simple searching
  18. Secondary Indexes: NOT good for
    •  Composite queries (but coming soon)
    •  Pagination (coming soon)
    •  Totally ordered sets
    •  Large ring sizes… more on that in a minute
  19. Secondary Indexes: How it Works
    •  Tags objects with additional key/value metadata at write time (think: HTTP header)
    •  Document-based partitioning: the index is stored with the document
  20. Secondary Indexes: How it Works
    •  N copies of each index entry are stored, one with each replica of the object
    •  A query is sent to 1/N of the partitions, index data is read, and a list of matching keys is sent back to the requesting node
    •  Poor performance on ring sizes of over 512 partitions
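
The document-based partitioning described above can be sketched with a toy ring. Each object's index entries live on the same partitions as the object's replicas, so a query has to visit enough partitions to see every object at least once, which works out to 1/N of the ring. The ring size, N value, hash function and names below are assumptions chosen to keep the example small; the coverage plan is a simplification of what a real cluster computes.

```python
import hashlib
from collections import defaultdict

RING_SIZE = 8  # toy value; must be divisible by N_VAL for this simple coverage plan
N_VAL = 2      # toy replication factor


def replica_partitions(key: str) -> list:
    """Partitions holding a key: the hashed partition plus the next N_VAL - 1."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    first = int.from_bytes(digest[:4], "big") % RING_SIZE
    return [(first + i) % RING_SIZE for i in range(N_VAL)]


class DocPartitionedIndex:
    def __init__(self):
        # Per partition: {(index name, value): set of keys stored on that partition}.
        self.partitions = [defaultdict(set) for _ in range(RING_SIZE)]

    def put(self, key: str, tags: dict) -> None:
        """Write index entries alongside every replica of the object."""
        for p in replica_partitions(key):
            for name, value in tags.items():
                self.partitions[p][(name, value)].add(key)

    def coverage_plan(self) -> list:
        """Every N_VAL-th partition sees at least one replica of each object."""
        return list(range(0, RING_SIZE, N_VAL))

    def query(self, name: str, value: str):
        """Scatter to the coverage plan, gather matching keys."""
        plan = self.coverage_plan()
        keys = set()
        for p in plan:
            keys |= self.partitions[p].get((name, value), set())
        return sorted(keys), plan


if __name__ == "__main__":
    index = DocPartitionedIndex()
    index.put("john_smith", {"twitter_bin": "jsmith123"})
    print(index.query("twitter_bin", "jsmith123"))
```

The number of partitions contacted grows with the ring size regardless of how many keys match, which is consistent with the slide's warning about large rings.
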
  21. Secondary Indexes: Query Example
    •  Insert object:
         curl -X POST \
           -H 'x-riak-index-twitter_bin: jsmith123' \
           -H 'x-riak-index-email_bin: [email protected]' \
           -d '...user data...' \
           http://localhost:8098/buckets/users/keys/john_smith
    •  Query:
         curl localhost:8098/buckets/users/index/twitter_bin/jsmith123
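
The same two calls can be expressed in Python. This sketch only constructs the requests and URLs, mirroring the curl commands above; the helper names are hypothetical. The range form `/index/<field>/<low>/<high>` and the `age_int` field are assumptions based on the slides' statement that range queries are supported for integers and strings.

```python
from urllib.parse import quote
from urllib.request import Request

BASE_URL = "http://localhost:8098"  # host and port taken from the curl example


def tagged_write(bucket: str, key: str, body: bytes, indexes: dict) -> Request:
    """Write an object and tag it with secondary-index headers.

    Each index becomes an x-riak-index-<name> header. urllib normalises header
    capitalisation, which is harmless because HTTP header names are
    case-insensitive.
    """
    headers = {f"x-riak-index-{name}": str(value) for name, value in indexes.items()}
    url = f"{BASE_URL}/buckets/{quote(bucket, safe='')}/keys/{quote(key, safe='')}"
    return Request(url, data=body, method="POST", headers=headers)


def exact_match_url(bucket: str, index: str, value) -> str:
    """URL for an exact-match index query."""
    return f"{BASE_URL}/buckets/{bucket}/index/{index}/{quote(str(value), safe='')}"


def range_url(bucket: str, index: str, low, high) -> str:
    """URL for a range index query (assumed form: .../<low>/<high>)."""
    return (f"{BASE_URL}/buckets/{bucket}/index/{index}/"
            f"{quote(str(low), safe='')}/{quote(str(high), safe='')}")


if __name__ == "__main__":
    req = tagged_write("users", "john_smith", b"...user data...",
                       {"twitter_bin": "jsmith123"})
    print(req.get_method(), req.full_url, req.header_items())
    print(exact_match_url("users", "twitter_bin", "jsmith123"))
    print(range_url("users", "age_int", 20, 30))  # age_int is a hypothetical index
```
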
  22. Map Reduce
    •  Spreads a query across many vnodes to take advantage of parallel processing
    •  Moves computation (JS/Erlang functions) to the data (vnodes)
  23. Map Reduce: GOOD for
    •  Returning actual objects, not just keys
    •  Filtering by tags, counting words, extracting links, analyzing log files, aggregation tasks
    •  When you know the set of objects up front
  24. Map Reduce: BAD for
    •  Querying an entire bucket, which can be slow (list-keys)
    •  When you need predictable latency
  25. Map Reduce: How it Works
    MAP PHASE: take one piece of data as input, and produce zero or more results as output.
  26. Map Reduce: How it Works
    REDUCE PHASE: combine the output of many “map step” evaluations into one result.
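
The two phases can be illustrated with the word-counting task mentioned earlier in the deck. This is a plain-Python sketch of the idea, not Riak's JavaScript or Erlang API; the function names are chosen for the example.

```python
from collections import Counter


def map_phase(text: str) -> list:
    """Take one object's value; emit zero or more (word, 1) pairs."""
    return [(word, 1) for word in text.lower().split()]


def reduce_phase(pairs) -> dict:
    """Combine the output of many map evaluations into one result."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(objects) -> dict:
    """Run the map phase over each object, then reduce everything once."""
    mapped = [pair for obj in objects for pair in map_phase(obj)]
    return reduce_phase(mapped)


if __name__ == "__main__":
    # The empty string shows a map step that produces zero results.
    print(word_count(["see spot run", "run spot run", ""]))
```
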
  27. Map Reduce: How it Works
    1.  Client makes an M/R request
    2.  Receiving node becomes the coordinator
    3.  Request is routed to vnodes by the coordinator
    4.  Results are sent back to the coordinating node
    5.  Coordinator executes the reduce phase
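
The five steps above can be sketched as a small in-process simulation. `ToyVnode` and `ToyCoordinator` are hypothetical stand-ins: in a real cluster the vnodes are spread across machines and ownership comes from the ring, whereas here ownership is found by simply checking which toy vnode holds the key.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyVnode:
    """Holds a slice of the data and runs map functions next to it."""

    def __init__(self, objects: dict):
        self.objects = dict(objects)

    def run_map(self, keys, map_fn) -> list:
        results = []
        for key in keys:
            if key in self.objects:
                results.extend(map_fn(self.objects[key]))
        return results


class ToyCoordinator:
    """The node that received the client's request (steps 1 and 2)."""

    def __init__(self, vnodes):
        self.vnodes = list(vnodes)

    def owner_of(self, key):
        for i, vnode in enumerate(self.vnodes):
            if key in vnode.objects:
                return i
        return None

    def map_reduce(self, keys, map_fn, reduce_fn):
        # Step 3: route each input key to the vnode that holds it.
        plan = {}
        for key in keys:
            owner = self.owner_of(key)
            if owner is not None:
                plan.setdefault(owner, []).append(key)
        # Step 4: run the map phase on the vnodes in parallel and gather results.
        with ThreadPoolExecutor() as pool:
            futures = [pool.submit(self.vnodes[i].run_map, ks, map_fn)
                       for i, ks in plan.items()]
            mapped = [item for f in futures for item in f.result()]
        # Step 5: the coordinator runs the reduce phase itself.
        return reduce_fn(mapped)


if __name__ == "__main__":
    cluster = ToyCoordinator([ToyVnode({"a": "xx"}), ToyVnode({"b": "yyy", "c": "z"})])
    total_length = cluster.map_reduce(["a", "b", "c"], lambda v: [len(v)], sum)
    print(total_length)
```

Only the map phase is distributed in this sketch; the reduce step runs on a single node, which is one reason large jobs put pressure on the coordinator.
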
  28. How It Works
    REMEMBER: Map Reduce has a different request pattern than production I/O. This might place excessive load on the cluster or cause performance issues.
  29. Riak In Parallel With Other Solutions
    •  Riak alongside Elasticsearch (via post-commit hooks)
    •  Backup clusters for M/R and heavy analytics
    •  Riak -> Hadoop Connector