Find me quickly... and simply.

Find me quickly… … and easily. Jernej Virag @jernejv

– Johnny, newbie developer “Eh, I’ll just use LIKE %…%!”

– Johnny, newbie developer
“What do you mean with ‘different words can mean the same things?’”
“It takes HOW long to find results?”
“It takes HOW much work to group results?”

– Johnny, newbie developer “Human reasoning is weird.”

Apache Solr

News Buddy http://news.virag.si https://bitbucket.org/mavrik/news-buddy

Apache Solr
• “NoSQL” database for text search only
• Faaaaaast… if configured properly
• Compensates for human weirdness with language awareness
• Categorises results (faceting)
• Can find similar documents

Apache Solr
[Diagram: Users, Website backend, Database, and Solr (running on the JVM); arrows labelled “Search queries”, “Results”, and “Updates”.]

– Johnny, newbie developer “Ooooo! So how do I use it?”

Anatomy
[Diagram: Solr HOME holds three cores: /products-si, /products-en and /articles. Each core contains:]
• index: your data
• schema.xml: data type definitions, schema definition (optional)
• solrconfig.xml: REST endpoints (handlers), defaults

Schema.xml: your CREATE TABLES
• Datatypes in your index - <fieldType>
• (!) Defines how language gets processed - <analyzer> / <query>
• Fields (columns) and their types - <fields>
• Defines which data is kept
• See NewsBuddy/solr/config/schema.xml

Solrconfig.xml: your search defaults
• Sets handlers which handle REST endpoints
• Sets default parameters so you don’t have to send them with each request
• Here you:
  • Set field importance
  • Enable highlighting, faceting, more-like-this and other functionality
  • Customise other search parameters
• See NewsBuddy/solr/config/solrconfig.xml
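
To make "default parameters" concrete, here is a minimal sketch of the query string a backend would otherwise have to assemble on every request. The host, port, core name (`articles`, borrowed from the Anatomy slide) and the field names `title` and `category` are assumptions for illustration only; `q`, `rows`, `wt`, `hl`, `hl.fl`, `facet` and `facet.field` are standard Solr query parameters. The sketch only builds URLs and sends nothing.

```python
from urllib.parse import urlencode

# Assumed values for illustration; adjust to your own installation.
SOLR_BASE = "http://localhost:8983/solr"
CORE = "articles"


def build_select_url(query, **extra):
    """Build a URL for the core's /select handler without sending it."""
    params = {"q": query, "wt": "json", "rows": 10}
    params.update(extra)
    # doseq=True lets a parameter such as facet.field repeat.
    return f"{SOLR_BASE}/{CORE}/select?{urlencode(params, doseq=True)}"


# Everything spelled out on each request...
verbose = build_select_url(
    "univerza",
    **{"hl": "true", "hl.fl": "title", "facet": "true", "facet.field": ["category"]},
)

# ...versus leaning on defaults configured once in solrconfig.xml.
minimal = build_select_url("univerza")

print(verbose)
print(minimal)
```

If highlighting and faceting are set as handler defaults, the shorter URL can return the same kind of response as the longer one.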

Basic setup https://github.com/izacus/solr_example

– Johnny, newbie developer “Woo it runs! … Why are search results bad?”

Text fields
[Diagram: text is analysed both on insert and at query time.]
• INSERT - ANALYSIS: “… Ljubljanske univerze.” → “ljubljanske univerze” → “ljubljana univerza” → Index
• SEARCH QUERY: “univerza v ljubljani” → “univerza v ljubljana” → “ljubljana univerza”
• MATCH - RESULT!
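
The flow on this slide can be imitated in a few lines of Python. This is a toy sketch and not how Solr is implemented: the lemma table and the stop-word list are assumptions made up for this example, and a real setup would get both from the analyzer configured in schema.xml. The point it illustrates is that the same analysis runs on inserted text and on the query, so the two end up as the same terms.

```python
# Toy lemma table and stop-word list, assumed for this example only.
LEMMAS = {
    "ljubljanske": "ljubljana",
    "ljubljani": "ljubljana",
    "univerze": "univerza",
}
STOP_WORDS = {"v"}


def analyze(text):
    """Tokenize, lowercase, drop stop words, and map tokens to lemmas."""
    tokens = [t.strip(".,!?…").lower() for t in text.split()]
    tokens = [t for t in tokens if t and t not in STOP_WORDS]
    return {LEMMAS.get(t, t) for t in tokens}


index = {}  # maps each term to the set of document ids containing it


def insert(doc_id, text):
    """Analyse the text and record the document under each term."""
    for term in analyze(text):
        index.setdefault(term, set()).add(doc_id)


def search(query):
    """Return ids of documents containing every analysed query term."""
    postings = [index.get(term, set()) for term in analyze(query)]
    return set.intersection(*postings) if postings else set()


insert(1, "… Ljubljanske univerze.")
print(search("univerza v ljubljani"))
```

Without the shared `analyze` step the query terms "univerza" and "ljubljani" would never equal the stored "Ljubljanske" and "univerze.", which is the kind of mismatch that produces bad results.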

“Shoe” ≠ “Shoes”, “Ljubljana” ≠ “Ljubljanski”
• Use a stemmer for your language
• Porter for English, Lemmagen for Slovene
• More language tips at the wiki: https://cwiki.apache.org/confluence/display/solr/Language+Analysis
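
As a rough illustration of what a stemmer does, the sketch below strips a couple of English plural endings so that “Shoe” and “Shoes” reduce to the same term. It is deliberately crude and is not the Porter algorithm; the function name and rules are made up for this example.

```python
def naive_stem(word):
    """Lowercase a word and strip a few English plural endings.

    Deliberately crude: this is not the Porter algorithm.
    """
    word = word.lower()
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


print(naive_stem("Shoe") == naive_stem("Shoes"))
```

Inflected forms such as Ljubljana/Ljubljanski need language-specific rules or dictionaries that this sketch does not attempt.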

“zagreb” ≠ “Zagreb”
• Solr is case-sensitive
• To make it case-insensitive, use LowerCaseFilterFactory
• To boost correct case, generate lower AND proper case tokens in analysis and at query time
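
The "lower AND proper case" idea can be sketched as follows. The token-overlap count here is only a stand-in for Solr's real scoring, and both function names are hypothetical; the sketch just shows why emitting both forms lets any casing match while an exact-case query matches more tokens.

```python
def case_tokens(text):
    """Return each token in its original case plus a lowercased copy."""
    tokens = set()
    for token in text.split():
        tokens.add(token)
        tokens.add(token.lower())
    return tokens


def score(document, query):
    """Count shared tokens; an exact-case hit shares more of them."""
    return len(case_tokens(document) & case_tokens(query))


print(score("Zagreb", "Zagreb"), score("Zagreb", "zagreb"))
```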

Sanitize your text
• Break compound words at analysis and query time (e.g. DropBox -> Drop box so both “dropbox” and “drop box” match)
• Clean HTML with HTMLStripCharFilterFactory
• Collapse special characters to ASCII with ASCIIFoldingFilter (e.g. Sežana => Sezana so “Sezana” and “Sežana” will be hits)
• More tools:
  https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions
  https://cwiki.apache.org/confluence/display/solr/CharFilterFactories
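
The three clean-up steps above can be approximated in plain Python to see what each one does to the text. These helpers are illustrative stand-ins written for this example, not the Solr filters themselves, and they cover fewer cases: the folding step, for instance, only handles characters that decompose into a base letter plus a combining mark.

```python
import html
import re
import unicodedata


def strip_html(text):
    """Remove tags, decode entities, and collapse whitespace."""
    without_tags = re.sub(r"<[^>]+>", " ", text)
    return " ".join(html.unescape(without_tags).split())


def split_compound(word):
    """Insert a space at each lowercase-to-uppercase boundary."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", word)


def fold_to_ascii(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


print(strip_html("<p>Hello <b>world</b></p>"))
print(split_compound("DropBox"))
print(fold_to_ascii("Sežana"))
```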

Analyze, analyze, analyze!
• Analyze the queries users are making on your site
• Pay special attention to queries with no results
• Use SynonymFilterFactory to map common “wrong” words (e.g. “kitty” => “cat”, “aple” => “Apple”, …)
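
One way to act on this advice is to count the queries that came back empty and turn the frequent ones into synonym entries. The sketch below assumes a hypothetical log of (query, result count) pairs; how you actually collect such a log depends on your backend. The synonym table reuses the examples from the slide and only mimics the effect of a synonym mapping in plain Python.

```python
from collections import Counter

# Hypothetical log entries: (query text, number of results returned).
QUERY_LOG = [
    ("aple", 0),
    ("kitty", 0),
    ("apple", 42),
    ("aple", 0),
    ("cat", 17),
]

# Synonym table using the examples from the slide.
SYNONYMS = {"kitty": "cat", "aple": "Apple"}


def zero_result_queries(log):
    """Count how often each query came back empty, most frequent first."""
    counts = Counter(query.lower() for query, hits in log if hits == 0)
    return counts.most_common()


def apply_synonyms(query):
    """Replace each known "wrong" word with its mapped term."""
    return " ".join(SYNONYMS.get(word.lower(), word) for word in query.split())


print(zero_result_queries(QUERY_LOG))
print(apply_synonyms("aple pie"))
```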

The holy bible
• Apache Solr Reference Guide https://cwiki.apache.org/confluence/display/solr/Getting+Started
• (!) Most Google hits will lead you to http://wiki.apache.org/solr, which is significantly worse and commonly outdated!

What else?
• Highlighting
• Faceting
• Similar documents
• Autocomplete suggestions

Cool stuff
• Solr small core example https://github.com/izacus/solr_example
• Novičar news search system https://bitbucket.org/mavrik/news-buddy
• Solr lemmatizer for Slovenian, Serbian, Romanian, Bulgarian and some other languages
  https://bitbucket.org/mavrik/slovene_lemmatizer
  https://www.virag.si/2013/12/solr-slovenian-lemmatizer-updated/

Cool stuff
• PySolarized Python library with multi-language support https://www.virag.si/2014/04/project-spotlight-pysolarized/
• List of Solr client libraries https://cwiki.apache.org/confluence/display/solr/Client+APIs

– Johnny, newbie developer “Amahgad, I have to learn something new!”

?    @jernejv  http://www.virag.si
