Beyond the Basics: Lucene and Solr

If you are already using Lucene and/or Solr (or even ElasticSearch), then this is the talk for you. We will go beyond the basics of these brilliant open source search platforms. There are many ways to customize Solr through the standard configuration file, but there is so much more. Payloads offer many possibilities for customization, including the ability to tag words with part-of-speech information. There are also many ways to extend Lucene and Solr by creating your own filters, query parsers, tokenizers, token filters, and even highlighters with some simple Java code. If search is a core feature of your application, then you need to be using these advanced features to set yourself apart.

Scott Smerchek

May 03, 2013

Transcript

  1. Know your analyzers • a few analyzers will probably fit your needs • use the appropriate set of analyzers for your data • don’t analyze your key field • char filters, tokenizers, token filters
  2. Char Filters • straight up character replacement • can be chained • MappingCharFilter • PatternReplaceCharFilter • HTMLStripCharFilter
  3. Tokenizers • WhitespaceTokenizer • StandardTokenizer • KeywordTokenizer • UAX29URLEmailTokenizer • PatternTokenizer • PathHierarchyTokenizer • ancestors • descendants • LetterTokenizer • splits a stream of characters into a series of tokens • can only use one per analyzer
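To make the PathHierarchyTokenizer bullet concrete, here is a minimal Python sketch of the idea: one token per ancestor path, so a search on a parent path can also match everything beneath it. The function name `path_hierarchy_tokens` is hypothetical, and this is not Lucene's implementation, which has additional options that are not modeled here.

```python
def path_hierarchy_tokens(path, delimiter="/"):
    """Emit one token for each ancestor of `path`, then the full path itself."""
    tokens = []
    for i, ch in enumerate(path):
        # Every delimiter after the first character closes an ancestor path.
        if ch == delimiter and i > 0:
            tokens.append(path[:i])
    if path:
        tokens.append(path)
    return tokens


tokens = path_hierarchy_tokens("/usr/local/lib")
print(tokens)
```

Indexing all ancestors this way means a filter on `/usr/local` can be answered with a plain term match instead of a prefix scan.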
  4. Token Filters • StopWord • KeepWord • LowerCase • Trim • PatternReplace • ASCIIFolding • Phonetic • ICU* • tokens output by the tokenizer are passed through a series of filters • easily composable; order matters
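The "order matters" point is easy to see in a toy analysis chain. The Python sketch below is only an illustration of the tokenizer-then-filters flow; the function names and the tiny stop word set are assumptions for the example, not Lucene or Solr APIs.

```python
STOP_WORDS = {"the", "a", "of"}  # hypothetical, lowercase-only stop list


def whitespace_tokenizer(text):
    """Split the character stream into tokens on whitespace."""
    return text.split()


def lowercase_filter(tokens):
    return [t.lower() for t in tokens]


def stop_filter(tokens):
    """Drop tokens that exactly match a stop word (case-sensitive)."""
    return [t for t in tokens if t not in STOP_WORDS]


def analyze(text, tokenizer, filters):
    """Run one tokenizer, then each token filter in the order given."""
    tokens = tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


text = "The Ghost of Christmas"
lower_then_stop = analyze(text, whitespace_tokenizer, [lowercase_filter, stop_filter])
stop_then_lower = analyze(text, whitespace_tokenizer, [stop_filter, lowercase_filter])
print(lower_then_stop)
print(stop_then_lower)
```

With the stop filter first, the capitalized "The" slips past the lowercase-only stop list and is then lowercased into the index, which is why the two orderings produce different token streams.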
  5. Stemmers • reduce words to their common form • -ed, -ing, -s, -es and other English suffixes, including plural forms
  6. N-Grams and Shingles • gram => [g] [gr] [gra] [gram] • this is a shingle => [this] [this is] [this is a] [this is a shingle] • more efficient prefix queries (good for autocomplete) • bloats the index
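The two expansions on this slide can be reproduced with a short Python sketch. Both function names are hypothetical. The character version corresponds to edge n-grams (prefixes only), and the word version mirrors the prefix-style example on the slide; Lucene's ShingleFilter itself emits word n-grams starting at every position within a configured size range, which is not modeled here.

```python
def edge_ngrams(term, min_gram=1):
    """All prefixes of `term` from `min_gram` characters up to the full term."""
    return [term[:n] for n in range(min_gram, len(term) + 1)]


def prefix_shingles(text):
    """Word-level prefixes of `text`, as in the slide's shingle example."""
    words = text.split()
    return [" ".join(words[:n]) for n in range(1, len(words) + 1)]


print(edge_ngrams("gram"))
print(prefix_shingles("this is a shingle"))
```

Each indexed term becomes several terms, which is where both the fast prefix lookups and the index bloat come from.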
  7. Synonym Filter • normalize synonyms for better matching • specify mappings in synonym.txt • i-pod, i pod => ipod • sea biscuit, sea biscit => seabiscuit
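As a rough illustration of what the `=>` mappings mean, the Python sketch below parses the two rules from the slide and normalizes a piece of text with them. It works on raw lowercase text with regular expressions, which is a simplification chosen for this example; the real filter operates on the token stream, and the helper names `parse_rules` and `normalize` are hypothetical.

```python
import re


def parse_rules(lines):
    """Map every left-hand variant of 'a, b => c' to its replacement c."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        left, right = line.split("=>")
        for variant in left.split(","):
            mapping[variant.strip()] = right.strip()
    return mapping


def normalize(text, mapping):
    """Replace each variant with its normalized form, longest variants first."""
    for variant in sorted(mapping, key=len, reverse=True):
        pattern = r"\b" + re.escape(variant) + r"\b"
        text = re.sub(pattern, mapping[variant], text)
    return text


rules = parse_rules([
    "i-pod, i pod => ipod",
    "sea biscuit, sea biscit => seabiscuit",
])
print(normalize("my i pod plays sea biscuit", rules))
```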
  8. Payloads • any information you want encoded as bytes on each token • can be used at query time to affect scoring • if you don’t need the metadata after indexing, then just use token attributes
  9. DelimitedPayload Filter • pass payloads from an external source • must use Whitespace or some tokenizer that doesn’t consume your delimiter • The quick|1.4 brown fox|5.2 jumped over|3.4 the lazy|0.0 dog.
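The example sentence on this slide can be unpacked with a small Python sketch. It splits on whitespace so the `|` delimiter survives tokenization, then separates each token from its payload. Encoding the float as 4 big-endian bytes is an assumption made for this illustration, and `parse_delimited_payloads` is a hypothetical name, not a Lucene or Solr call.

```python
import struct


def parse_delimited_payloads(text, delimiter="|"):
    """Return (token, payload_bytes) pairs; payload is None when absent."""
    result = []
    for raw in text.split():  # whitespace split keeps the delimiter intact
        token, sep, value = raw.partition(delimiter)
        # Assumption: payloads are floats stored as 4 big-endian bytes.
        payload = struct.pack(">f", float(value)) if sep else None
        result.append((token, payload))
    return result


pairs = parse_delimited_payloads(
    "The quick|1.4 brown fox|5.2 jumped over|3.4 the lazy|0.0 dog."
)
for token, payload in pairs:
    weight = struct.unpack(">f", payload)[0] if payload else None
    print(token, weight)
```

A tokenizer that treats `|` as punctuation would throw the weights away before the filter ever saw them, which is why the slide insists on a whitespace-style tokenizer.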
  10. Testing • a must for custom plugins • great for ensuring your schema works as intended • ensure future versions don’t break you
  11. Codecs • flexible indexing • customize how fields are stored/retrieved • SimpleText • Speed up Primary Key lookups
  12. Ranking: tf/idf • tf(t in d) = term frequency • # of times t appears in doc • idf(t) = inverse document frequency • based on the # of docs in which t appears (rarer terms score higher)
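A small Python sketch shows how the two factors behave. It assumes the classic Lucene similarity formulas of that era, tf as the square root of the term count and idf as 1 + ln(numDocs / (docFreq + 1)); the three toy documents and the function names are made up for the example, and the full scoring formula has further factors (norms, boosts, coord) that are left out.

```python
import math


def tf(term, doc_tokens):
    """Term frequency factor: square root of the raw count in the document."""
    return math.sqrt(doc_tokens.count(term))


def idf(term, docs):
    """Inverse document frequency: higher for terms found in fewer documents."""
    doc_freq = sum(1 for doc in docs if term in doc)
    return 1.0 + math.log(len(docs) / (doc_freq + 1))


docs = [
    ["quick", "brown", "fox"],
    ["lazy", "dog"],
    ["quick", "dog", "dog"],
]
print("idf(fox):", idf("fox", docs))  # appears in 1 of 3 docs
print("idf(dog):", idf("dog", docs))  # appears in 2 of 3 docs
print("tf(dog in doc 3):", tf("dog", docs[2]))
```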
  13. Know your parsers • Only a couple of parsers are actually suitable for users • Some are great for programmatic searches • Make it easy for your users, but also allow advanced functionality
  14. DisMax Query Parser • user-friendly • query multiple fields and take the best score • instead of author:“football” OR title:“football” OR category:“football”, the user just types football • you can also boost certain fields
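The "take the best score" rule can be sketched in a few lines of Python. The per-field scores below are invented numbers, and `dismax_score` is a hypothetical helper. The `tie` parameter follows the disjunction-max idea of adding a fraction of the non-winning field scores; with `tie=0.0` only the best field counts.

```python
def dismax_score(field_scores, boosts=None, tie=0.0):
    """Best boosted field score, plus `tie` times the sum of the other fields."""
    boosts = boosts or {}
    boosted = [score * boosts.get(field, 1.0) for field, score in field_scores.items()]
    if not boosted:
        return 0.0
    best = max(boosted)
    return best + tie * (sum(boosted) - best)


# Hypothetical per-field scores for the query "football" on one document.
scores = {"author": 0.2, "title": 0.5, "category": 0.1}

print(dismax_score(scores))                          # title wins
print(dismax_score(scores, boosts={"author": 3.0}))  # boosted author wins
print(dismax_score(scores, tie=0.1))                 # other fields contribute a little
```

Taking the maximum rather than the sum keeps a document from ranking highly just because a common word shows up in many fields at once.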
  15. Query Debugging • ‘explain’ - how the score was calculated • ‘explainOther’ - what about this other document? • examine query performance • visualize - explain.solr.pl • in results - fl=[explain style=nl|text|html]
  16. Spell Checker • suggest a different query with corrected spellings • previously required separate index; now can be built directly
  17. Analyzing Suggester • use for phrase suggestions • ghost chr.. => “The Ghost of Christmas Past” • plug in to the spellcheck component • requires external suggestions • build with query logs • also see: Fuzzy
  18. Merge Factor • Determines how frequently segments will be merged during indexing • Number is the maximum # of segments allowed before a merge is triggered • Lower # => faster search; slower indexing • Higher # => slower search; faster indexing
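The trade-off on this slide can be seen in a toy Python simulation. It assumes a simplified model in which every flush creates one small segment and any `merge_factor` segments of the same level are merged into one segment of the next level. This is not Lucene's actual merge policy, only a sketch of why a lower merge factor leaves fewer segments to search but does more merge work while indexing.

```python
def simulate_merges(num_flushes, merge_factor):
    """Return (segments_remaining, merges_performed) for a toy level-based merge."""
    levels = [0]  # levels[i] = number of segments currently at level i
    merges = 0
    for _ in range(num_flushes):
        levels[0] += 1  # each flush adds one level-0 segment
        level = 0
        # Cascade: a full level collapses into one segment on the next level.
        while levels[level] == merge_factor:
            levels[level] = 0
            if level + 1 == len(levels):
                levels.append(0)
            levels[level + 1] += 1
            merges += 1
            level += 1
    return sum(levels), merges


for factor in (2, 10):
    segments, merges = simulate_merges(95, factor)
    print(f"mergeFactor={factor}: {segments} segments, {merges} merges")
```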
  19. Caching • Filter Cache • Caches unordered sets of doc ids that match a key (query) • Used for results of fq filter queries and faceting • Field Value Cache • Primarily used by faceting • Query Result Cache • Stores ordered sets of doc ids • Document Cache • Stores Lucene Document objects • User/Generic Caches • Generic object cache for custom Solr plugins
  20. Security • implement via your servlet container • basic auth • As of SOLR-4470, distributed requests can also be authenticated • or just restrict to your app server • require authorization for update requests with some special sauce
  21. How to Contribute • Get the latest code with SVN or Git • Create a branch and make your change • Don’t forget tests • Create a patch file • Open a JIRA issue describing the problem you’ve solved or the feature you’ve added and upload your patch • Hope that they accept it