Beyond the Basics: Lucene and Solr

Beyond the Basics: Lucene and Solr Scott Smerchek @smerchek scottsmerchek.com

Indexing and Analysis

Inverted Index
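
The slide only names the structure, so here is a minimal sketch of the idea: each term points to the list of documents that contain it. The function name, the sample documents, and the lowercase/whitespace "analysis" are assumptions made for illustration; Lucene's real index also stores positions, frequencies, and more.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Toy analysis step (assumption): lowercase and split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "the quick brown fox", 2: "the lazy dog", 3: "quick dog"}
index = build_inverted_index(docs)
```

Looking up a term is then a dictionary access (`index["quick"]`) rather than a scan over every document.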

Know your analyzers
• a few analyzers will probably fit your needs
• use the appropriate set of analyzers for your data
• don’t analyze your key field
• char filters, tokenizers, token filters

Char Filters
• straight up character replacement
• can be chained
• MappingCharFilter
• PatternReplaceCharFilter
• HTMLStripCharFilter

Tokenizers
• splits a stream of characters into a series of tokens
• can only use one per analyzer
• WhitespaceTokenizer
• StandardTokenizer
• KeywordTokenizer
• UAX29URLEmailTokenizer
• PatternTokenizer
• PathHierarchyTokenizer (ancestors, descendants)
• LetterTokenizer

Token Filters
• tokens output by the tokenizer are passed through a series of filters
• easily composable; order matters
• StopWord
• KeepWord
• LowerCase
• Trim
• PatternReplace
• ASCIIFolding
• Phonetic
• ICU*
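
To make the chain concrete, here is a plain-Python sketch of the three stages: a char filter, one tokenizer, then token filters in order. The function names echo the Lucene components but are stand-ins written for this example, not the real classes; the mapping and stopword list are assumptions.

```python
def mapping_char_filter(text, mapping):
    """Char filter: straight character replacement before tokenizing."""
    for source, target in mapping.items():
        text = text.replace(source, target)
    return text

def whitespace_tokenizer(text):
    """Tokenizer: split the character stream into tokens."""
    return text.split()

def lowercase_filter(tokens):
    return [token.lower() for token in tokens]

def stop_filter(tokens, stopwords):
    return [token for token in tokens if token not in stopwords]

def analyze(text):
    # Assumed chain: char filter -> tokenizer -> lowercase -> stopwords.
    text = mapping_char_filter(text, {"&": " and "})
    tokens = whitespace_tokenizer(text)
    tokens = lowercase_filter(tokens)
    return stop_filter(tokens, {"the", "and"})

tokens = analyze("The Fox & The Hound")
```

Order matters here: if the stop filter ran before the lowercase filter, the capitalized "The" would not match the lowercase stopword and would survive.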

Stemmers
• reduce words to their common form
• strip suffixes such as -ed, -ing, -s, -es (English inflections and plural forms)

N-Grams and Shingles
• gram => [g] [gr] [gra] [gram]
• this is a shingle => [this] [this is] [this is a] [this is a shingle]
• more efficient prefix queries (good for autocomplete)
• bloats the index
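
A small sketch of both ideas in plain Python. `edge_ngrams` reproduces the "gram" example from the slide. `shingles` is an assumption about the general technique: it emits word n-grams starting at every position, whereas the slide's example lists only those anchored at the first word. Function and parameter names are made up for this example.

```python
def edge_ngrams(token, min_size=1, max_size=None):
    """Prefixes of a token, e.g. for autocomplete-style prefix matching."""
    longest = len(token) if max_size is None else min(max_size, len(token))
    return [token[:size] for size in range(min_size, longest + 1)]

def shingles(tokens, min_size=2, max_size=2):
    """Word n-grams ("shingles") built from adjacent tokens."""
    result = []
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                result.append(" ".join(tokens[start:start + size]))
    return result

grams = edge_ngrams("gram")
pairs = shingles(["this", "is", "a", "shingle"])
```

Every extra gram or shingle is another indexed term, which is where the index bloat comes from.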

WordDelimiter Filter
• similar to standard tokenizer
• strip punctuation and handle word parts (hyphens, camelCase, etc.)

Synonym Filter
• normalize synonyms for better matching
• specify mappings in synonym.txt
i-pod, i pod => ipod
sea biscuit, sea biscit => seabiscuit
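
The mapping lines above can be read as "everything left of => is rewritten to what is on the right". A hedged sketch of that reading, using a hypothetical parser that handles only the explicit `=>` form and skips blank or `#` comment lines:

```python
def parse_synonym_rules(lines):
    """Turn 'a, b => c' lines into a {variant: normalized} mapping."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        left, right = line.split("=>")
        target = right.strip()
        for variant in left.split(","):
            mapping[variant.strip()] = target
    return mapping

rules = parse_synonym_rules([
    "i-pod, i pod => ipod",
    "sea biscuit, sea biscit => seabiscuit",
])
```

Note that "sea biscit" is a deliberate misspelling in the slide: synonym rules can double as a cheap way to catch common typos.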

DelimitedPayload Filter

Payloads
• any information you want encoded as bytes on each token
• can be used at query time to affect scoring
• if you don’t need the metadata after indexing, then just use token attributes

DelimitedPayload Filter
• pass payloads from an external source
• must use WhitespaceTokenizer or another tokenizer that doesn’t consume your delimiter
The quick|1.4 brown fox|5.2 jumped over|3.4 the lazy|0.0 dog.
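
A sketch of what the filter conceptually does with the example sentence: split on whitespace, then peel the payload off each token at the delimiter. This is plain Python for illustration, not the Lucene filter itself; representing the payload as a float (rather than encoded bytes) is an assumption.

```python
def parse_delimited_payloads(text, delimiter="|"):
    """Return (token, payload) pairs; payload is None when absent."""
    result = []
    for raw in text.split():  # whitespace split keeps the delimiter intact
        token, found, payload = raw.partition(delimiter)
        result.append((token, float(payload) if found else None))
    return result

pairs = parse_delimited_payloads(
    "The quick|1.4 brown fox|5.2 jumped over|3.4 the lazy|0.0 dog."
)
```

A tokenizer that strips punctuation would throw away the `|1.4` suffix before the filter ever saw it, which is why the choice of tokenizer matters.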

Create Your Own Token Filter

Aside: Testing

Testing
• a must for custom plugins
• great for ensuring your schema works as intended
• ensure future versions don’t break you

Codecs
• flexible indexing
• customize how fields are stored/retrieved
• SimpleText
• Speed up Primary Key lookups

Questions?

Querying

Ranking: tf/idf
• tf(t in d) = term frequency: # of times t appears in doc d
• idf(t) = inverse document frequency: falls as the # of docs in which t appears rises
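
A rough sketch of the two quantities. The exact formulas (square root for tf, 1 + ln(N / (df + 1)) for idf) are one common variant and an assumption here; the slide does not specify them, and a real Lucene score includes further factors such as boosts and length norms. The sample corpus is made up.

```python
import math

def tf(term, doc_tokens):
    """Term frequency: grows with the number of times term is in the doc."""
    return math.sqrt(doc_tokens.count(term))

def idf(term, corpus):
    """Inverse document frequency: rarer terms get a higher weight."""
    doc_freq = sum(1 for doc in corpus if term in doc)
    return 1.0 + math.log(len(corpus) / (doc_freq + 1))

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

corpus = [
    ["solr", "lucene", "solr"],
    ["lucene", "index"],
    ["index", "search"],
]
solr_weight = tf_idf("solr", corpus[0], corpus)
lucene_weight = tf_idf("lucene", corpus[0], corpus)
```

In the first document "solr" outweighs "lucene": it appears more often there (higher tf) and in fewer documents overall (higher idf).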

Know your parsers
• Only a couple of parsers are actually suitable for users
• Some are great for programmatic searches
• Make it easy for your users, but also allow advanced functionality

• {!lucene}
• {!field}
• {!term}
• {!boost}
• {!func}
• {!dismax}
• {!edismax}

DisMax Query Parser
• user-friendly
• query multiple fields and take the best score
• the user types football instead of author:"football" OR title:"football" OR category:"football"
• you can also boost certain fields
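
"Take the best score" can be sketched as a disjunction-max combination: the highest per-field score wins, optionally plus a small tie-breaker share of the other fields. The per-field scores and boosts below are invented inputs, and the function names are hypothetical.

```python
def apply_boosts(field_scores, boosts):
    """Multiply each field's score by its boost (default boost is 1.0)."""
    return {field: score * boosts.get(field, 1.0)
            for field, score in field_scores.items()}

def dismax_score(field_scores, tie=0.0):
    """Best field score plus tie * (sum of the remaining field scores)."""
    if not field_scores:
        return 0.0
    best = max(field_scores.values())
    others = sum(field_scores.values()) - best
    return best + tie * others

# Hypothetical scores for the query "football" against three fields.
scores = {"author": 0.2, "title": 0.9, "category": 0.4}
plain = dismax_score(scores)
with_tie = dismax_score(scores, tie=0.5)
boosted = dismax_score(apply_boosts(scores, {"category": 3.0}))
```

With `tie=0.0` a document matching in several fields scores no higher than one matching only in its best field; raising `tie` rewards multi-field matches.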

Create Your Own Query Parser

Query Debugging
• ‘explain’ - how the score was calculated
• ‘explainOther’ - what about this other document?
• examine query performance
• visualize - explain.solr.pl
• in results - fl=[explain style=nl|text|html]
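
A sketch of how these options might be put on a request URL. `debugQuery`, `explainOther`, and the `[explain]` entry in `fl` are Solr request parameters; the host, core name, `/select` handler path, and the helper function are assumptions for illustration.

```python
from urllib.parse import urlencode, parse_qs

def debug_query_url(base_url, query, explain_other=None, explain_style="nl"):
    """Build a select URL that asks Solr for score explanations."""
    params = {
        "q": query,
        "debugQuery": "true",  # include the debug/explain section
        "fl": f"id,score,[explain style={explain_style}]",  # per-doc explain
    }
    if explain_other:
        params["explainOther"] = explain_other  # explain a doc outside the hits
    return f"{base_url}/select?{urlencode(params)}"

# Hypothetical host and core name.
url = debug_query_url("http://localhost:8983/solr/books", "title:football",
                      explain_other="id:42")
```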

Questions?

Suggestions

Facet Query
• just facet.prefix
• does pretty well out-of-the-box
• just single words
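
Conceptually, prefix faceting counts the indexed terms that start with what the user has typed and returns the most frequent ones. A plain-Python sketch of that idea, with a made-up term list and a hypothetical function name (this is not how Solr implements it internally):

```python
from collections import Counter

def facet_prefix(indexed_terms, prefix, limit=5):
    """Return (term, count) pairs for terms starting with prefix."""
    counts = Counter(term for term in indexed_terms if term.startswith(prefix))
    return counts.most_common(limit)

terms = ["solr", "solr", "solr", "solaris", "solaris", "search", "lucene"]
suggestions = facet_prefix(terms, "sol")
```

Because it works on individual indexed terms, this approach only suggests single words, which matches the limitation noted on the slide.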

Spell Checker
• suggest a different query with corrected spellings
• previously required separate index; now can be built directly from the main index

Analyzing Suggester
• use for phrase suggestions
• ghost chr.. => “The Ghost of Christmas Past”
• plug in to the spellcheck component
• requires external suggestions
• build with query logs
• also see: Fuzzy

Performance

Merge Factor
• Determines how frequently segments will be merged during indexing
• Number is the maximum # of segments
• Lower # => faster search; slower indexing
• Higher # => slower search; faster indexing

Caching
• Filter Cache: caches unordered sets of doc ids that match a key (query); used for results of fq filter queries and faceting
• Field Value Cache: primarily used by faceting
• Query Result Cache: stores ordered sets of doc ids
• Document Cache: stores Lucene Document objects
• User/Generic Caches: generic object cache for custom Solr plugins
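
To illustrate the filter cache idea, here is a minimal least-recently-used cache that maps a filter query string to an unordered set of doc ids. The class, the eviction policy, and the sample filters are assumptions for illustration; Solr's own cache implementations are configurable and more sophisticated.

```python
from collections import OrderedDict

class LRUCache:
    """Tiny LRU cache: evicts the least recently used entry when full."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._entries = OrderedDict()

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as recently used
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # drop the oldest entry

# Key: the fq string; value: the unordered set of matching doc ids.
filter_cache = LRUCache(max_size=2)
filter_cache.put("category:books", frozenset({1, 4, 7}))
filter_cache.put("in_stock:true", frozenset({1, 2, 3}))
filter_cache.get("category:books")            # refresh this entry
filter_cache.put("author:smith", frozenset({4}))  # evicts in_stock:true
```

Reusing the same fq string across requests is what makes the cache pay off: the doc id set is computed once and then intersected with each new query.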

Security
• implement via your servlet container
• basic auth
• As of SOLR-4470, distributed requests can also be authenticated
• or just restrict to your app server
• require authorization for update requests with some special sauce

How to Contribute
• Get the latest code with SVN or Git
• Create a branch and make your change
• Don’t forget tests
• Create a patch file
• Open a JIRA issue describing the issue you’ve solved or the feature you’ve added, and upload your patch
• Hope that they accept it

What else?
