Apache Solr

Friday Talk @ Softonic

Santiago Lizardo

July 15, 2011

Transcript

  1. EZ installation
     • Download and install Tomcat
     • Download the Solr WAR and copy it to webapps
     • Define the Solr home variable -Dsolr.solr.home=… conf\catalina\localhost\solrconfig.xml
  2. Solrconfig.xml
     • Lucene indexing parameters
     • Cache settings
     • Request handler configuration
     • HTTP cache settings
     • Search components, response writers, query parsers
  3. Solr Schema
     • Lucene has no notion of schema
       – Sorting: string vs numeric
       – No ranges
     • Defines fields, types and properties
     • Defines unique key field, default search field
     • schema.xml
       – Defines types used in the webapp
       – Defines fields and types
       – Defines copyFields
  4. Solr data model
     • Solr maintains a collection of documents
     • A document is a collection of fields & values
     • A field can occur multiple times in a document
     • Documents are immutable
       – They can be deleted, and a new version added, however.
  5. Solr data model
     • A document is not a database row!
     • A Solr index stores only ONE kind of document definition
     • A document has typed properties: string, date, integer
     • Static definition or dynamic type
     • May be indexed or stored
     • De-normalize your database into a structured document optimized for the search requirements
  6. Multivalued field
     • The property is similar to an array
     • Neat solution for storing a set of categories linked to a product or permissions linked to a document
  7. copyField
     • Copies one field to another at index time
     • Use case 1: Analyze the same field in different ways
       – Copy into a field with a different analyzer
       – Boost exact-case, exact-punctuation matches
       – Language translations, thesaurus, soundex
     • Use case 2: Index multiple fields into a single searchable field
  8. Copy fields
     • Two main uses
       – To concatenate fields
       – To analyze a field in two different ways
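The two copyField uses above can be modelled in a few lines. This is only a toy simulation of the behaviour, not Solr code: the function name `apply_copy_fields`, the `(source, dest)` rule format and the sample field names are assumptions made for illustration.

```python
def apply_copy_fields(doc, rules):
    """Return a copy of doc with every (source, dest) copy rule applied.

    doc maps a field name to a list of values; rules is a list of
    (source, dest) pairs, loosely mimicking <copyField source=... dest=.../>.
    """
    result = {name: list(values) for name, values in doc.items()}
    for source, dest in rules:
        # Always read from the original document, so rules never chain.
        result.setdefault(dest, []).extend(doc.get(source, []))
    return result


doc = {"title": ["scooter"], "description": ["red electric scooter"]}

# Use 1: concatenate several fields into one catch-all search field.
# Use 2: keep a second copy of "title" that could get a different analyzer.
rules = [("title", "text"), ("description", "text"), ("title", "title_exact")]

indexed = apply_copy_fields(doc, rules)
print(indexed["text"])         # both source values end up in "text"
print(indexed["title_exact"])  # untouched copy of the title
```

In a real schema the destination field's type decides how the copied text is analyzed; the sketch only shows where the values end up.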
  9. Adding data (indexing)
     An HTTP POST request to /update:
     <add>
       <doc>
         <field name="title">scooter</field>
         <field name="price">42.30</field>
       </doc>
     </add>
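A minimal sketch of how a client might build that `<add>` message, using only Python's standard library. The helper name `build_add_xml` is hypothetical; actually sending the payload (an HTTP POST to /update with an XML content type) is left out.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Build a Solr <add> message from a list of dicts.

    Each dict maps a field name to a single value or to a list of
    values (a list produces one <field> element per value).
    """
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            values = value if isinstance(value, (list, tuple)) else [value]
            for v in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(v)  # ElementTree escapes &, < and > for us
    return ET.tostring(add, encoding="unicode")


payload = build_add_xml([{"title": "scooter", "price": "42.30"}])
print(payload)
```

Building the message with an XML library rather than string concatenation means field values containing `&` or `<` are escaped correctly.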
  10. Query parameters
     • Query arguments for HTTP GET/POST to /select (defaults in parentheses)
       – “q”: the query
       – “start” (0): offset
       – “rows” (10): number of docs
       – “fl” (*): fields to return
       – “qt” (standard): query type, maps to query handler
       – “df” (schema): default field to search
       – “wt”: writer type (response writer)
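As a rough illustration, the parameters above can be assembled into a /select URL like this. The helper name `select_url` is hypothetical, and the base URL is the localhost address used elsewhere in these slides.

```python
from urllib.parse import urlencode


def select_url(base, q, start=0, rows=10, fl="*", **extra):
    """Build a /select URL; start, rows and fl use the defaults listed above."""
    params = {"q": q, "start": start, "rows": rows, "fl": fl}
    params.update(extra)  # e.g. wt="json", df="text", debugQuery="on"
    return base + "/select?" + urlencode(params)


url = select_url("http://localhost:8983/solr", "city:paris", rows=5, wt="json")
print(url)
```

`urlencode` percent-encodes characters such as `:` and `*`, which matters once queries contain spaces, brackets or quotes.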
  11. Solr Query syntax
     • Similar to Lucene
     • Include (+), exclude (-)
     • Field-specific searching: <fieldname>:<fieldvalue>
     • Wildcard searching: “*” or “?”
       Ip?d
       Belk*
       *deo
  12. Solr Query syntax
     • paris
     • city:paris
     • title:"The Right Way" AND text:go
     • price:[100 TO 300]
     • -type:sale
     • te?t
     • theat*
     • te*t
     • test~
  13. Solr Query syntax
     • Range searching
       – Timestamp:[2006-01-01 TO *]
     • Proximity searching: “~”
       – "video ipod"~3 (up to 3 words apart)
     • Fuzzy searches: “~”
       – Ipod~ (will find ipod and ipods)
       – Belkin~0.8 (will find words with close spellings)
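Because characters such as `:`, `~`, `*` and `[` carry meaning in this syntax, user input usually has to be escaped before it is placed in a query. A small sketch, assuming the classic Lucene special-character list (`+ - && || ! ( ) { } [ ] ^ " ~ * ? : \`); the helper names `escape_term` and `range_query` are hypothetical.

```python
import re

# Single special characters, plus the two-character operators && and ||.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\]|&&|\|\|)')


def escape_term(text):
    """Backslash-escape query syntax characters in user-supplied text."""
    return _SPECIAL.sub(r"\\\1", text)


def range_query(field, low=None, high=None):
    """Build an inclusive range clause; None means an open end (*)."""
    lo = "*" if low is None else low
    hi = "*" if high is None else high
    return f"{field}:[{lo} TO {hi}]"


print(escape_term("title:(1+1)?"))
print(range_query("price", 100, 300))
print(range_query("Timestamp", "2006-01-01"))
```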
  14. Debugging query
     • Add &debugQuery=on to request params
     • &debugQuery=true is your friend
     • Returns scoring information
     • Returns parsed form of query
     • Includes parsed query, explanations, and search component timings in response
  15. Deleting data
     • Delete by id
       – <delete><id>1</id></delete>
     • Delete by query
       – <delete><query>city:paris</query></delete>
  16. Committing
     • Nothing shows up in the index until you commit
     • Post <commit /> to /solr/update
     • <optimize />: same as commit, but also merges all index segments
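The delete and commit messages can be generated the same way as any other update message. A minimal sketch with hypothetical helper names; posting the messages to /solr/update is not shown.

```python
import xml.etree.ElementTree as ET


def _message(tag, child, text):
    """Build <tag><child>text</child></tag> with proper XML escaping."""
    root = ET.Element(tag)
    ET.SubElement(root, child).text = str(text)
    return ET.tostring(root, encoding="unicode")


def delete_by_id(doc_id):
    return _message("delete", "id", doc_id)


def delete_by_query(query):
    return _message("delete", "query", query)


# A typical sequence: several deletes, then one commit to make them visible.
messages = [
    delete_by_id(1),
    delete_by_query("city:paris"),
    "<commit/>",
]
for message in messages:
    print(message)
```

Committing once after a batch of changes, rather than after every message, keeps the expensive part of the update to a minimum.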
  17. Solr clients (APIs)
     • HTTP GET/POST (curl or any other HTTP client)
     • SolrJ (embedded or HTTP - Java)
     • Ruby: solr-ruby, RSolr
     • Python
     • C++
     • Solrsharp
     • PHP!
  18.
     • Roll your own classes
       – Not difficult, it’s REST after all
       – Some cURL, XML, JSON or native PHP array parsing
     • Using existing libraries
       – PECL: http://us.php.net/manual/en/book.solr.php
       – Solr-php-client (follows ZF Coding Standards)
       – eZ Components ezcSearch
  19.
     include "bootstrap.php";

     $options = array(
         'hostname' => SOLR_SERVER_HOSTNAME,
         'login'    => SOLR_SERVER_USERNAME,
         'password' => SOLR_SERVER_PASSWORD,
         'port'     => SOLR_SERVER_PORT,
     );

     $client = new SolrClient($options);

     // 'cat' is added twice, so it ends up as a multivalued field
     $doc = new SolrInputDocument();
     $doc->addField('id', 334455);
     $doc->addField('cat', 'Software');
     $doc->addField('cat', 'Lucene');

     $updateResponse = $client->addDocument($doc);
     print_r($updateResponse->getResponse());
  20.
     include "bootstrap.php";

     $options = array(
         'hostname' => SOLR_SERVER_HOSTNAME,
         'login'    => SOLR_SERVER_USERNAME,
         'password' => SOLR_SERVER_PASSWORD,
         'port'     => SOLR_SERVER_PORT,
     );

     $client = new SolrClient($options);

     // Delete every document matching the query
     $updateResponse = $client->deleteByQuery('city:Barcelona');
     print_r($updateResponse->getResponse());
  21.
     include "bootstrap.php";

     $options = array(
         'hostname' => SOLR_SERVER_HOSTNAME,
         'login'    => SOLR_SERVER_USERNAME,
         'password' => SOLR_SERVER_PASSWORD,
         'port'     => SOLR_SERVER_PORT,
     );

     $client = new SolrClient($options);

     // Search for "lucene", first 50 results, returning four fields
     $query = new SolrQuery();
     $query->setQuery('lucene');
     $query->setStart(0);
     $query->setRows(50);
     $query->addField('cat')->addField('features')->addField('id')->addField('timestamp');

     $query_response = $client->query($query);
     $response = $query_response->getResponse();
     print_r($response);
  22. SolrJ
     SolrServer solr = new CommonsHttpSolrServer(
         new URL("http://localhost:8983/solr"));
     SolrInputDocument doc = new SolrInputDocument();
     doc.addField("id", "EXAMPLEDOC01");
     doc.addField("title", "NOVAJUG SolrJ Example");
     solr.add(doc);
     solr.commit();    // after a batch, not per document
     solr.optimize();  // periodically, if/when needed
  23. Highlighting parameters
     • hl: true/false to enable/disable highlighting
     • hl.fl: fields to highlight (comma- or space-separated)
     • hl.snippets: maximum number of snippets
     http://localhost:8983/solr/select?q=apple&hl=on&hl.fl=*
  24. Faceting
     • Facet on: field terms, queries, date ranges
       &facet=on
       &facet.field=cat
       &facet.query=price:[0 TO 100]
     • SimpleFacetParameters
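A short sketch of both halves of a facet request: building the repeated `facet.*` parameters, and reading the counts back. It assumes the JSON response writer's default layout, where each facet field comes back as a flat `[term, count, term, count, ...]` list; the helper names and the sample counts are made up for illustration.

```python
from urllib.parse import urlencode


def facet_params(fields, queries=()):
    """Encode facet parameters; facet.field and facet.query may repeat."""
    params = [("facet", "on")]
    params += [("facet.field", f) for f in fields]
    params += [("facet.query", q) for q in queries]
    return urlencode(params)


def pair_up(flat):
    """Turn a flat [term, count, term, count, ...] list into a dict."""
    return dict(zip(flat[0::2], flat[1::2]))


print(facet_params(["cat"], ["price:[0 TO 100]"]))

# Illustrative data only, shaped like a facet_fields entry.
sample_counts = ["electronics", 14, "memory", 3]
print(pair_up(sample_counts))
```

A list of pairs is passed to `urlencode` instead of a dict because the same parameter name can legitimately appear several times.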
  25. Spell checking
     • File- or index-based dictionaries
     • Dictionary lookup
     • Uses the indexed words themselves
     • Supports pluggable distance algorithms:
       – Levenshtein and Jaro-Winkler
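To make the idea of a distance algorithm concrete, here is a plain Levenshtein edit distance and a naive suggester built on it. This is a sketch of the general technique, not Solr's spell checker; `suggest`, its `max_distance` cut-off and the tiny dictionary are assumptions.

```python
def levenshtein(a, b):
    """Number of single-character inserts, deletes or substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def suggest(word, dictionary, max_distance=2):
    """Return dictionary words within max_distance edits, closest first."""
    scored = sorted((levenshtein(word, w), w) for w in dictionary)
    return [w for distance, w in scored if distance <= max_distance]


print(levenshtein("ipod", "ipods"))
print(suggest("belkn", ["belkin", "ipod", "apple"]))
```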
  26. Multi core
     Using a solr.xml file, you can configure Solr to manage several different indexes.
     <solr persistent="true" sharedLib="lib">
       <cores adminPath="/core-admin/">
         <core name="books" instanceDir="books" />
         <core name="games" instanceDir="games" />
       </cores>
     </solr>
  27. Data import handler
     • Indexes relational database, XML data, and email sources
     • Supports full and incremental/delta indexing
     • Highly extensible with custom data sources, transformers, etc.
  28. Solr Cell aka ExtractingRequestHandler
     Leveraging Tika, extracts and indexes rich documents such as Word, PDF, HTML, and many other types
     curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true' -F [email protected]
  29. Architecture
     • Scales from
       – Single Solr server
       – Master/replicants (slaves)
       – Distributed shards
     • Each Solr instance can also have multiple cores
  30. Relevance
     • Term frequency (TF): number of times a term appears in a document
     • Inverse document frequency (IDF): one over the number of documents in the index that contain the term (1/df)
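The two quantities can be computed on a toy corpus as follows. The sketch uses the slide's simplified 1/df form; Lucene's actual similarity dampens both factors (a square root on TF, a logarithm on IDF), so real scores differ. The function names and the three-document corpus are illustrative assumptions.

```python
from collections import Counter


def term_frequency(term, doc_tokens):
    """TF: how many times the term occurs in one document."""
    return Counter(doc_tokens)[term]


def document_frequency(term, corpus):
    """df: how many documents contain the term at least once."""
    return sum(1 for doc_tokens in corpus if term in doc_tokens)


def inverse_document_frequency(term, corpus):
    """IDF in the simplified 1/df form; 0.0 if the term never occurs."""
    df = document_frequency(term, corpus)
    return 1.0 / df if df else 0.0


corpus = [
    ["solr", "search", "solr"],
    ["lucene", "search"],
    ["tomcat"],
]

for term in ("solr", "search"):
    tf = term_frequency(term, corpus[0])
    idf = inverse_document_frequency(term, corpus)
    print(term, tf, idf, tf * idf)
```

In the first document the rarer, repeated term "solr" scores higher than "search", which appears in two of the three documents.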
  31. Request handlers
     • Define how the query is processed
     • Two main types
       – StandardRequestHandler
         • Simple queries
       – DisMaxRequestHandler
         • Boost functions
         • Boost fields
         • Span query across many fields
  32. Request handler
     • Mini-“servlets”
     • SearchHandler extensions chain search components
     • Flexible response formatting:
       &wt=[json, ruby, xslt, php, phps, javabin, python, velocity]
  33. Dump
     • http://localhost:8983/solr/debug/dump
     • Echoes parameters, content streams, and Solr web context
     • Careful: with content streams enabled, a client could retrieve the contents of any file on the server or accessible network! [Solution: disable the dump request handler]
  34. Ping
     • http://localhost:8983/solr/admin/ping
     • If a healthcheck file is configured and not available, an error is reported
     • Executes a single configured request and reports failure or OK
  35. Dismax
     • Minimum match (mm): for optional clauses
     • Default: 100% (pure AND)
     • Examples:
       – Pure OR: mm=0 or mm=0%
       – At least two should match: mm=2
       – At least 75% should match: mm=75%
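How an mm value translates into a required number of matching clauses can be sketched as below. This handles only the two forms shown on the slide (a non-negative integer or a non-negative percentage, with percentages rounded down); negative values and conditional expressions are deliberately left out, and the function name is hypothetical.

```python
def required_matches(mm, optional_clauses):
    """Number of optional clauses that must match for a given mm value.

    mm may be an int such as 2 or a string such as "2" or "75%".
    """
    spec = str(mm).strip()
    if spec.endswith("%"):
        # Percentages are applied to the clause count and rounded down.
        needed = optional_clauses * int(spec[:-1]) // 100
    else:
        needed = int(spec)
    # Never require more clauses than the query actually has.
    return max(0, min(needed, optional_clauses))


for mm in (0, "0%", 2, "75%", "100%"):
    print(mm, required_matches(mm, 4))
```

With four optional clauses, `75%` requires three of them; with three clauses it requires two, because 2.25 is rounded down.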
  36. Search components
     • Default Components That Power SearchHandler: QueryComponent, HighlightComponent, FacetComponent, MoreLikeThisComponent, StatsComponent, DebugComponent
     • Additional Components You Can Configure: SpellCheckComponent, QueryElevationComponent, TermsComponent, TermVectorComponent, ClusteringComponent
  37. Boost functions
     • Allow you to influence scoring at runtime
     • Computationally expensive!
     • Really useful for tuning scoring
  38. Stemming
     • Reduce terms to their root form
     • Language specific
     • Many specialised stemmers available
       – Most European languages
  39.
     • Inject synonyms for certain terms
     • Language specific
     • Best used for query time analysis
     • May inflate the search index too much
     • Decreases relevancy
  40. Tokenizers And TokenFilters
     • Analyzers Are Typically Composed Of Tokenizers And TokenFilters
     • Tokenizer: Controls How Your Text Is Tokenized
     • TokenFilter: Mutates And Manipulates The Stream Of Tokens
     • Solr Lets You Mix And Match Tokenizers And TokenFilters In Your schema.xml To Define Analyzers On The Fly
     • OOTB Solr Has Factories For 17 Tokenizers And 45 TokenFilters
     • Many Factories Have Customization Options
       – Limitless Combinations
  41. Tokenizers And TokenFilters
     <fieldType name="text" class="solr.TextField">
       <analyzer type="index">
         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
         <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
       </analyzer>
       <analyzer type="query">
         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" expand="true"/>
         ...
  42. Notable Token(izers|Filters)
     • StandardTokenizerFactory
     • WhitespaceTokenizerFactory
     • KeywordTokenizerFactory
     • NGramTokenizerFactory
     • PatternTokenizerFactory
     • EnglishPorterFilterFactory
     • SynonymFilterFactory
     • StopFilterFactory
     • ASCIIFoldingFilterFactory
     • PatternReplaceFilterFactory
  43. Character filters
     • Used to clean up text before tokenizing
       – HTMLStripCharFilter (strips HTML, XML, JS, CSS)
       – MappingCharFilter (normalisation of characters, removing accents)
       – Regular expression filter
  44. Web admin interface
     • Show config, schema, distribution info
     • Query interface
     • Statistics
       – Caches: lookups, hits, hitratio, inserts, evictions, size
       – RequestHandlers: requests, errors
       – UpdateHandler: adds, deletes, commits, optimizes
       – Indexreader: opentime, indexversion, numdocs, maxdocs
     • Analysis debugger
       – Show tokens after each analyzer stage
       – Show token matches for query vs index
  45. Analysis Tool
     • HTML Form Allowing You To Feed In Text And See How It Would Be Analyzed For A Given Field (Or Field Type)
     • Displays Step By Step Information For Analyzers Configured Using Solr Factories...
       – Token Stream Produced By The Tokenizer
       – How The Token Stream Is Modified By Each TokenFilter
       – How The Tokens Produced When Indexing Compare With The Tokens Produced When Querying
     • Helpful In Deciding Which Tokenizer/TokenFilters You Want To Use For Each Field Based On Your Goals
  46. Analyzing the analyzer
     • The quick brown fox jumps over the lazy dog.
     • WhitespaceAnalyzer: simplest built-in analyzer
       [The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog.]
  47. Analyzing the analyzer
     • The quick brown fox jumps over the lazy dog.
     • SimpleAnalyzer: lowercases, splits at non-letter boundaries
       [the] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog]
  48. Analyzing the analyzer
     • The quick brown fox jumps over the lazy dog.
     • StopAnalyzer: lowercases and removes stop words
       [quick] [brown] [fox] [jumps] [over] [lazy] [dog]
  49. Analyzing the analyzer
     • The quick brown fox jumps over the lazy dog.
     • SnowballAnalyzer: stemming algorithm
       [the] [quick] [brown] [fox] [jump] [over] [the] [lazi] [dog]
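The first three analyzers are simple enough to imitate in a few lines of Python, which makes the token streams above easy to reproduce. These are rough stand-ins, not the Lucene classes: the stop-word set is a tiny illustrative list, and the Snowball stemmer is left out because it needs a real stemming library.

```python
import re

# Tiny illustrative stop list; a real analyzer ships a longer one.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}


def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are kept."""
    return text.split()


def simple_analyzer(text):
    """Lowercase, then keep maximal runs of letters as tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def stop_analyzer(text):
    """simple_analyzer followed by stop-word removal."""
    return [token for token in simple_analyzer(text) if token not in STOP_WORDS]


sentence = "The quick brown fox jumps over the lazy dog."
for analyzer in (whitespace_analyzer, simple_analyzer, stop_analyzer):
    print(analyzer.__name__, analyzer(sentence))
```

Note how the trailing period survives whitespace splitting (`dog.`) but disappears once tokens are cut at non-letter boundaries.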
  50.
     • Do I find “cheval” when searching for “chevaux”?
     • Is document 93345 found when searching for “+montreux -casino AND role:story”?
  51. Indexing performance tips
     • Tricks of the trade:
       – multithread/multiprocess
       – batch documents
       – separate Solr server and indexers
       – Indexing master + replicants
       – StreamingUpdateSolrServer + javabin
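Batching is the easiest of these tips to apply from any client: group documents into chunks, send one update request per chunk, and commit once at the end. A minimal sketch of the chunking step; the `batches` helper and the batch size are assumptions, and the actual HTTP calls are only indicated in comments.

```python
from itertools import islice


def batches(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# A generator stands in for documents streamed from a database.
documents = ({"id": i, "title": f"doc {i}"} for i in range(7))

sent = []
for chunk in batches(documents, 3):
    # Here a client would POST one <add> message containing the whole chunk.
    sent.append([doc["id"] for doc in chunk])
# ...and send a single <commit/> after the loop, not once per chunk.

print(sent)
```

Because `batches` consumes its input lazily, the full document set never has to be held in memory at once.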
  52. Search performance tips
     • Searching performance
       – javabin: binary protocol for Java clients
       – caches: filterCache most relevant here
       – Autowarm
       – FastLRUCache
       – warming queries: firstSearcher, newSearcher
       – sorting, faceting
  53. First round! Similarities
     • They're fast and designed to index and search large bodies of data efficiently.
     • Both have a long list of high-traffic sites using them.
     • Both offer commercial support.
     • Both offer client API bindings for several platforms/languages.
     • Both can be distributed to increase speed and capacity.