Apache Solr

Friday Talk @ Softonic

Santiago Lizardo

July 15, 2011

Transcript

  1. Santiago Lizardo Friday Talk (15/07/2011) 101 (now in bad English!)

  2. Search server

  3. Why not a RDBMS?

  4. SELECT * FROM post WHERE topic LIKE '%foobar%' OR author LIKE '%foobar%' ORDER BY id DESC
  5. SELECT * FROM articles WHERE MATCH (title, body) AGAINST ('+MySQL -YourSQL' IN BOOLEAN MODE)
  6. Conclusion so far: RDBMSs aren't designed for searching.

  10. fast • open • highlighting • replication • faceting • spellchecker • similars • flexible

  13. EZ installation
    • Download and install Tomcat
    • Download the Solr WAR and copy it to webapps
    • Define the Solr home variable -Dsolr.solr.home=… conf\catalina\localhost\solrconfig.xml
  14. Directory layout
    • ${solr.home}
      – conf
        • schema.xml
        • solrconfig.xml
      – data
      – logs
      – bin
  15. Solrconfig.xml
    • Lucene indexing parameters
    • Cache settings
    • Request handler configuration
    • HTTP cache settings
    • Search components, response writers, query parsers
  16. Solr Schema
    • Lucene has no notion of schema
      – Sorting: string vs numeric
      – No ranges
    • Defines fields, types and properties
    • Defines unique key field, default search field
    • schema.xml
      – Defines types used in the webapp
      – Defines fields and types
      – Defines copyFields
  17. Solr data model
    • Solr maintains a collection of documents
    • A document is a collection of fields & values
    • A field can occur multiple times in a document
    • Documents are immutable
      – They can be deleted, and a new version added, however.
  18. Solr data model
    • A document is not a database row!
    • A Solr index stores only ONE kind of document definition
    • A document has typed properties: string, date, integer
    • Static definition or dynamic type
    • May be indexed or stored
    • De-normalize your database into a structured document optimized for the search requirements
  19. Types
    • How are the words split? (whitespace, punctuation) CIA != C.I.A?
    • Stemming
    • Case folding
  20. Multivalued field
    • The property is similar to an array
    • Neat solution for storing a set of categories linked to a product or permissions linked to a document
  21. copyField
    • Copies one field to another at index time
    • Use case: Analyze same field different ways
      – Copy into a field with a different analyzer
      – Boost exact-case, exact-punctuation matches
      – Language translations, thesaurus, soundex
    • Use case 2: Index multiple fields into single searchable field
  22. Copy fields
    • Two main uses
      – To concatenate fields
      – To analyze a field in two different ways
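The concatenation use case can be sketched outside Solr. The snippet below is only an illustration, assuming documents are plain dicts whose values are lists; the function name `apply_copy_fields`, the sample document and the destination field `text` are all hypothetical and not part of the talk. Copies are read from the original document, so one copy never feeds into another.

```python
def apply_copy_fields(doc, copy_rules):
    """Return a new document with each source field's values appended to its destination."""
    # Copy the lists so the caller's document is left untouched.
    result = {name: list(values) for name, values in doc.items()}
    for source, dest in copy_rules:
        if source in doc:
            # Read from the original doc, not from result, so copies do not chain.
            result.setdefault(dest, []).extend(doc[source])
    return result


# Hypothetical document: every field holds a list, so multivalued fields work the same way.
doc = {"title": ["Red scooter"], "description": ["Foldable, with two wheels"]}
# Hypothetical rules: concatenate title and description into one searchable "text" field.
rules = [("title", "text"), ("description", "text")]

indexed = apply_copy_fields(doc, rules)
print(indexed["text"])
```

In Solr itself this is declared in schema.xml rather than coded; the sketch only shows the effect on a document at index time.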
  23. Adding data (indexing)
    An HTTP POST request to /update
    <add>
      <doc>
        <field name="title">scooter</field>
        <field name="price">42.30</field>
      </doc>
    </add>
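As a rough sketch of how a client could produce the update message above, the following builds the `<add>` XML with Python's standard library. The helper name `build_add_message` is an assumption for illustration; sending the result to /update over HTTP is left out.

```python
import xml.etree.ElementTree as ET


def build_add_message(documents):
    """Build an <add> update message from a list of {field name: value} dicts."""
    add = ET.Element("add")
    for document in documents:
        doc = ET.SubElement(add, "doc")
        for name, value in document.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


message = build_add_message([{"title": "scooter", "price": "42.30"}])
print(message)
```

Building the XML with a library instead of string concatenation avoids broken messages when a field value contains markup characters.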
  24. Querying
    • HTTP request
    • http://localhost:8080/comix/select/?q=data&indent=on

  25. Command line with curl
    • curl URL -H "Content-type: text/xml" --data-binary "<commit />"
  26. Query parameters
    • Query arguments for HTTP GET/POST to /select
      – "q": the query
      – "start" (0): offset
      – "rows" (10): number of docs
      – "fl" (*): fields to return
      – "qt" (standard): query type, maps to query handler
      – "df" (schema): default field to search
      – "wt": writer type (response writer)
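A minimal sketch of assembling these parameters into a /select URL, using only the standard library. The function name `build_select_url` and the sample field list are assumptions; the base URL is the one used elsewhere in the slides.

```python
from urllib.parse import urlencode


def build_select_url(base_url, query, start=0, rows=10, fields="*", writer="json"):
    """Assemble a /select URL; urlencode percent-escapes characters such as ':' and ','."""
    params = {"q": query, "start": start, "rows": rows, "fl": fields, "wt": writer}
    return base_url.rstrip("/") + "/select?" + urlencode(params)


url = build_select_url("http://localhost:8983/solr", "city:paris", fields="id,title")
print(url)
```

Paging follows from "start" and "rows": with 10 rows per page, page 3 would start at offset 20.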
  27. Response writers
    • XML (Standard)
    • Python
    • PHP
    • JSON
    • Ruby
    • XSLT (output)
  28. &start=0 (default 0) &rows=10 (default 10)

  29. Solr Query syntax
    • Similar to Lucene
    • Include (+), exclude (-)
    • Field-specific searching: <fieldname>:<fieldvalue>
    • Wildcard searching: "*" or "?"
      – Ip?d
      – Belk*
      – *deo
  30. Solr Query syntax
    • paris
    • city:paris
    • title:"The Right Way" AND text:go
    • price:[100 TO 300]
    • -type:sale
    • te?t
    • theat*
    • te*t
    • test~
  31. Solr Query syntax
    • Range searching
      – Timestamp:[2006-01-01 TO *]
    • Proximity searching: "~"
      – "video ipod"~3 (up to 3 words apart)
    • Fuzzy searches: "~"
      – Ipod~ (will find ipod and ipods)
      – Belkin~0.8 (will find words with close spellings)
  32. Debugging query
    • Add &debugQuery=on to request params
    • &debugQuery=true is your friend
    • Returns scoring information
    • Returns parsed form of query
    • Includes parsed query, explanations, and search component timings in response
  33. Deleting data
    • Delete by id
      – <delete><id>1</id></delete>
    • Delete by query
      – <delete><query>city:paris</query></delete>
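The two delete messages can be generated the same way as any other update XML. This is a sketch using the standard library; the helper names `delete_by_id` and `delete_by_query` are assumptions, and posting the result to /update is not shown.

```python
import xml.etree.ElementTree as ET


def delete_by_id(doc_id):
    """Build a <delete> message that removes one document by its unique key."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = str(doc_id)
    return ET.tostring(delete, encoding="unicode")


def delete_by_query(query):
    """Build a <delete> message that removes every document matching the query."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = query  # special characters are escaped on output
    return ET.tostring(delete, encoding="unicode")


print(delete_by_id(1))
print(delete_by_query("city:paris"))
```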
  34. Committing
    • Nothing shows up in the index until you commit
    • POST <commit /> to /solr/update
    • <optimize /> is the same as commit, but also merges all index segments
  35. Rollback • <rollback/> to last commit point

  36. Solr clients (APIs)
    • HTTP GET/POST (curl or any other HTTP client)
    • SolrJ (embedded or HTTP - Java)
    • Ruby: solr-ruby, RSolr
    • Python
    • C++
    • Solrsharp
    • PHP!
  37. • Roll your own classes
      – Not difficult, it's REST after all
      – Some curl, XML, JSON or native PHP array parsing
    • Using existing libraries
      – PECL – http://us.php.net/manual/en/book.solr.php
      – Solr-php-client (follows ZF Coding Standards)
      – eZ Components ezcSearch
  38. include "bootstrap.php";

    // Connection settings (constants expected from bootstrap.php)
    $options = array(
        'hostname' => SOLR_SERVER_HOSTNAME,
        'login'    => SOLR_SERVER_USERNAME,
        'password' => SOLR_SERVER_PASSWORD,
        'port'     => SOLR_SERVER_PORT,
    );
    $client = new SolrClient($options);

    // Build a document; 'cat' is added twice, so it ends up with two values
    $doc = new SolrInputDocument();
    $doc->addField('id', 334455);
    $doc->addField('cat', 'Software');
    $doc->addField('cat', 'Lucene');

    // Send the document to the server and dump the response
    $updateResponse = $client->addDocument($doc);
    print_r($updateResponse->getResponse());
  39. include "bootstrap.php";

    // Connection settings (constants expected from bootstrap.php)
    $options = array(
        'hostname' => SOLR_SERVER_HOSTNAME,
        'login'    => SOLR_SERVER_USERNAME,
        'password' => SOLR_SERVER_PASSWORD,
        'port'     => SOLR_SERVER_PORT,
    );
    $client = new SolrClient($options);

    // Delete every document whose city field matches Barcelona
    $updateResponse = $client->deleteByQuery('city:Barcelona');
    print_r($updateResponse->getResponse());
  40. include "bootstrap.php";

    // Connection settings (constants expected from bootstrap.php)
    $options = array(
        'hostname' => SOLR_SERVER_HOSTNAME,
        'login'    => SOLR_SERVER_USERNAME,
        'password' => SOLR_SERVER_PASSWORD,
        'port'     => SOLR_SERVER_PORT,
    );
    $client = new SolrClient($options);

    // Search for "lucene", returning the first 50 matches and only four fields
    $query = new SolrQuery();
    $query->setQuery('lucene');
    $query->setStart(0);
    $query->setRows(50);
    $query->addField('cat')->addField('features')->addField('id')->addField('timestamp');

    $query_response = $client->query($query);
    $response = $query_response->getResponse();
    print_r($response);
  41. SolrJ
    SolrServer solr = new CommonsHttpSolrServer(new URL("http://localhost:8983/solr"));
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "EXAMPLEDOC01");
    doc.addField("title", "NOVAJUG SolrJ Example");
    solr.add(doc);
    solr.commit();   // after a batch, not per document
    solr.optimize(); // periodically, if/when needed
  43. HighlightingParameters
    • hl => true/false to enable/disable highlighting
    • hl.fl => fields in which to apply the highlighting (comma/space separated)
    • hl.snippets => max number of snippets
    • http://localhost:8983/solr/select?q=apple&hl=on&hl.fl=*
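To make the idea concrete, here is a toy highlighter, not Solr's implementation: it wraps whole-word, case-insensitive matches of a single term in markers. The function name `highlight` and the `<em>` markers are assumptions chosen for the illustration; snippet selection (hl.snippets) is not modelled.

```python
import re


def highlight(text, term, pre="<em>", post="</em>"):
    """Wrap every whole-word, case-insensitive occurrence of term in pre/post markers."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    # Keep the original casing of each match by reusing the matched text.
    return pattern.sub(lambda match: pre + match.group(0) + post, text)


print(highlight("Apple pie made with one apple", "apple"))
```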
  44. FACETING
    • Group the results by category
    • Can do multiple facets at once
    • Returns matching count
  45. Faceting
    • Facet on: field terms, queries, date ranges
      – &facet=on
      – &facet.field=cat
      – &facet.query=price:[0 TO 100]
    • SimpleFacetParameters
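What a field facet and a query facet return can be imitated over an in-memory result set. This is a sketch under the assumption that documents are plain dicts; the helper names and the sample data are invented for the example.

```python
from collections import Counter


def facet_counts(documents, field):
    """Count how many documents carry each value of a (possibly multivalued) field."""
    counts = Counter()
    for document in documents:
        values = document.get(field, [])
        if not isinstance(values, list):
            values = [values]
        counts.update(set(values))  # a document counts once per distinct value
    return counts


def facet_query(documents, predicate):
    """Count the documents matching an arbitrary condition, like facet.query does."""
    return sum(1 for document in documents if predicate(document))


# Hypothetical result set
docs = [
    {"id": 1, "cat": ["electronics", "music"], "price": 79.0},
    {"id": 2, "cat": ["electronics"], "price": 249.0},
    {"id": 3, "cat": "music", "price": 15.5},
]

print(facet_counts(docs, "cat"))
print(facet_query(docs, lambda d: 0 <= d["price"] <= 100))
```

The first call mirrors &facet.field=cat and the second mirrors &facet.query=price:[0 TO 100].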
  47. Spell checking
    • File or index-based dictionaries
    • Dictionary lookup
    • Uses the indexed words themselves
    • Supports pluggable distance algorithms: Levenshtein and JaroWinkler
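For readers unfamiliar with edit distance, the sketch below implements the classic Levenshtein dynamic programme and uses it to rank suggestions from a word list. It is an illustration only: the `suggest` helper, its `max_distance` cut-off and the sample dictionary are assumptions, and Solr's spellchecker is more involved.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def suggest(word, dictionary, max_distance=2):
    """Return dictionary words within max_distance edits, closest first, excluding the word itself."""
    scored = sorted((levenshtein(word, candidate), candidate) for candidate in dictionary)
    return [candidate for distance, candidate in scored if 0 < distance <= max_distance]


print(suggest("solar", ["solr", "solar", "sonar", "lucene"]))
```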
  48. More like this

  49. Query elevation

  50. • Configurable through the "elevate.xml" config file to boost/exclude specific documents
    • Based on the QueryElevationComponent
  52. DEDUPLICATION
    • Duplicates detection
    • Adds a signature field
    • Exact or Fuzzy duplicate detection
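The exact-duplicate case can be sketched as hashing a chosen set of fields into a signature and keeping one document per signature. The choice of MD5, the helper names and the sample documents are assumptions for illustration; fuzzy detection is not covered.

```python
import hashlib


def signature(document, fields):
    """Hash the chosen fields into a hex digest usable as a signature field."""
    digest = hashlib.md5()
    for name in fields:
        digest.update(name.encode("utf-8"))
        digest.update(b"\x00")  # separator so adjacent values cannot run together
        digest.update(str(document.get(name, "")).encode("utf-8"))
        digest.update(b"\x00")
    return digest.hexdigest()


def deduplicate(documents, fields):
    """Keep one document per signature; a later duplicate overwrites an earlier one."""
    seen = {}
    for document in documents:
        seen[signature(document, fields)] = document
    return list(seen.values())


# Hypothetical documents: ids differ, but the first two share title and body.
docs = [
    {"id": 1, "title": "Solr 101", "body": "Search server"},
    {"id": 2, "title": "Solr 101", "body": "Search server"},
    {"id": 3, "title": "Sphinx", "body": "Search server"},
]
unique = deduplicate(docs, ["title", "body"])
print(len(unique))
```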
  53. Solr CORE
    • Single primary index
      – Cars
    • Exclusive configuration files
      – schema.xml, solrconfig.xml
  54. Multi core
    http://localhost:8983/solr/core0-cars/select?q=ford+fiesta
    http://localhost:8983/solr/core1-jobs/select?q=php+developer
    http://localhost:8983/solr/core0-cars/admin/
    http://localhost:8983/solr/core1-jobs/admin/

  55. Multi core
    Using a solr.xml file, you can configure Solr to manage several different indexes.
    <solr persistent="true" sharedLib="lib">
      <cores adminPath="/core-admin/">
        <core name="books" instanceDir="books" />
        <core name="games" instanceDir="games" />
      </cores>
    </solr>
  56. Data import handler
    • Indexes relational database, XML data, and email sources
    • Supports full and incremental/delta indexing
    • Highly extensible with custom data sources, transformers, etc.
  57. Solr Cell aka ExtractingRequestHandler
    • Leveraging Tika, extracts and indexes rich documents such as Word, PDF, HTML, and many other types
    • curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true' -F "myfile=@tutorial.html"
  58. CLUSTERING

  59. Architecture
    • Scales from
      – Single Solr server
      – Master/replicants (slaves)
      – Distributed shards
    • Each Solr instance can also have multiple cores
  60. Caching

  65. Replication

  66. Relevance
    • Term frequency (TF): number of times a term appears in a document
    • Inverse document frequency (IDF): one over the number of documents in the index that contain the term (1/df)
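A small worked example of these two quantities, reading df as the number of documents that contain the term. This is the textbook form from the slide, not Lucene's actual scoring, which uses damped variants and extra normalisation; the function names and the toy corpus are assumptions.

```python
def term_frequency(term, tokens):
    """TF: how many times the term occurs in one tokenized document."""
    return tokens.count(term)


def inverse_document_frequency(term, corpus):
    """IDF as 1/df, where df is the number of documents containing the term."""
    df = sum(1 for tokens in corpus if term in tokens)
    return 1.0 / df if df else 0.0


def score(term, tokens, corpus):
    """Naive relevance: frequent in this document, rare across the index."""
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus)


# Hypothetical corpus of already-tokenized documents
corpus = [
    ["solr", "is", "a", "search", "server"],
    ["lucene", "is", "a", "search", "library"],
    ["solr", "uses", "lucene", "and", "solr", "scales"],
]

print(score("solr", corpus[2], corpus))    # tf = 2, df = 2
print(score("search", corpus[0], corpus))  # tf = 1, df = 2
print(score("server", corpus[0], corpus))  # tf = 1, df = 1
```

A term that is rare in the index ("server") outweighs a common one ("search") at equal term frequency, which is the effect IDF is meant to produce.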
  67. Request handlers
    • Defines how the query is processed
    • Two main types
      – StandardRequestHandler
        • Simple queries
      – DisMaxRequestHandler
        • Boost functions
        • Boost fields
        • Span query to many fields
  68. Request handler
    • mini-"servlets"
    • SearchHandler extensions chain search components
    • Flexible response formatting:
      – &wt=[json, ruby, xslt, php, phps, javabin, python, velocity]
  69. Useful request handlers
    • Dump, ping, system, plugins, threads, properties, file
  70. Dump
    • http://localhost:8983/solr/debug/dump
    • Echoes parameters, content streams, and Solr web context
    • Careful: with content streams enabled, a client could retrieve the contents of any file on the server or accessible network! [Solution: disable dump request handler]
  71. Ping
    • http://localhost:8983/solr/admin/ping
    • If healthcheck configured and file not available, error is reported
    • Executes single configured request and reports failure or OK
  72. System
    • http://localhost:8983/solr/admin/system
    • Core info, Lucene version, JVM details, uptime, operating system info
  73. Plugins
    • http://localhost:8983/solr/admin/plugins
    • Configuration details of Solr core, available query and update handlers, cache settings
  74. Threads • http://localhost:8983/solr/admin/threads • JVM thread details

  75. Properties
    • http://localhost:8983/solr/admin/properties
    • All JVM system properties, or single property value (?name=os.arch)
  76. File
    • http://localhost:8983/solr/admin/file?file=/
    • See fetchable directory tree
    • http://localhost:8983/solr/admin/file?file=schema.xml&contentType=text/plain

  77. Dismax
    • Minimum match (mm): for optional clauses
    • Default: 100% (pure AND)
    • Examples:
      – Pure OR: mm=0 or mm=0%
      – At least two should match: mm=2
      – At least 75% should match: mm=75%
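The arithmetic behind these mm values can be sketched as follows. This handles only the two forms shown on the slide, a non-negative integer or a percentage (rounded down); negative and conditional mm expressions are deliberately left out, and the function name `required_matches` is an assumption.

```python
def required_matches(mm, optional_clauses):
    """Return how many optional clauses must match for an integer or percentage mm."""
    spec = str(mm).strip()
    if spec.endswith("%"):
        # Percentages are applied to the clause count and rounded down.
        needed = optional_clauses * int(spec[:-1]) // 100
    else:
        needed = int(spec)
    # Never require more clauses than the query has, nor fewer than zero.
    return max(0, min(needed, optional_clauses))


for mm in ("100%", "0%", 2, "75%"):
    print(mm, required_matches(mm, 4))
```

For a four-term query, mm=75% therefore requires three terms to match, while mm=100% behaves like a pure AND.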
  78. Search components
    • Default Components That Power SearchHandler
      – QueryComponent, HighlightComponent, FacetComponent, MoreLikeThisComponent, StatsComponent, DebugComponent
    • Additional Components You Can Configure
      – SpellCheckComponent, QueryElevationComponent, TermsComponent, TermVectorComponent, ClusteringComponent
  79. Boost functions
    • Allow you to influence scoring at runtime
    • Computationally expensive!
    • Really useful for tuning scoring
  80. Term
    • Enumerates terms from specified fields
    • http://localhost:8983/solr/terms?terms.fl=name&terms.sort=index&terms.prefix=vi

  81. What's in a token?

  82. Text analysis

  83. Stemming
    • Reduce terms to their root form
    • Language specific
    • Many specialised stemmers available
      – Most European languages
  84. • Inject synonyms for certain terms
    • Language specific
    • Best used for query time analysis
    • May inflate the search index too much
    • Decreases relevancy
  85. Tokenizer Analysis

  86. Tokenizers And TokenFilters
    • Analyzers Are Typically Composed Of Tokenizers And TokenFilters
    • Tokenizer: Controls How Your Text Is Tokenized
    • TokenFilter: Mutates And Manipulates The Stream Of Tokens
    • Solr Lets You Mix And Match Tokenizers And TokenFilters In Your schema.xml To Define Analyzers On The Fly
    • OOTB Solr Has Factories For 17 Tokenizers And 45 TokenFilters
    • Many Factories Have Customization Options – Limitless Combinations
  87. Tokenizers And TokenFilters
    <fieldType name="text" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" expand="true"/>
        ...
  88. Notable Token(izers|Filters)
    • StandardTokenizerFactory
    • WhitespaceTokenizerFactory
    • KeywordTokenizerFactory
    • NGramTokenizerFactory
    • PatternTokenizerFactory
    • EnglishPorterFilterFactory
    • SynonymFilterFactory
    • StopFilterFactory
    • ASCIIFoldingFilterFactory
    • PatternReplaceFilterFactory
  89. Character filters
    • Used to clean up text before tokenizing
      – HTMLStripCharFilter (strips HTML, XML, JS, CSS)
      – MappingCharFilter (normalisation of characters, removing accents)
      – Regular expression filter
  90. Web admin interface
    • Show config, schema, distribution info
    • Query interface
    • Statistics
      – Caches: lookups, hits, hitratio, inserts, evictions, size
      – RequestHandlers: requests, errors
      – UpdateHandler: adds, deletes, commits, optimizes
      – IndexReader: opentime, indexversion, numdocs, maxdocs
    • Analysis debugger
      – Show tokens after each analyzer stage
      – Show token matches for query vs index
  91. Analysis Tool
    • HTML Form Allowing You To Feed In Text And See How It Would Be Analyzed For A Given Field (Or Field Type)
    • Displays Step By Step Information For Analyzers Configured Using Solr Factories...
      – Token Stream Produced By The Tokenizer
      – How The Token Stream Is Modified By Each TokenFilter
      – How The Tokens Produced When Indexing Compare With The Tokens Produced When Querying
    • Helpful In Deciding Which Tokenizer/TokenFilters You Want To Use For Each Field Based On Your Goals
  93. Analyzing the analyzer
    • The quick brown fox jumps over the lazy dog.
  94. Analyzing the analyzer
    • The quick brown fox jumps over the lazy dog.
    • WhitespaceAnalyzer
      – Simplest built-in analyzer
      – [The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog.]
  95. Analyzing the analyzer
    • The quick brown fox jumps over the lazy dog.
    • SimpleAnalyzer
      – Lowercases, splits at non-letter boundaries
      – [the] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog]
  96. Analyzing the analyzer
    • The quick brown fox jumps over the lazy dog.
    • StopAnalyzer
      – Lowercases and removes stop words
      – [quick] [brown] [fox] [jumps] [over] [lazy] [dog]
  97. Analyzing the analyzer
    • The quick brown fox jumps over the lazy dog.
    • SnowballAnalyzer
      – Stemming algorithm
      – [the] [quick] [brown] [fox] [jump] [over] [the] [lazi] [dog]
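The first three analyzers in this series are simple enough to imitate in a few lines. The sketch below is an approximation for illustration, not Lucene code: the function names are invented, the stop-word set is a small assumed subset, and the stemming step of SnowballAnalyzer is not reproduced.

```python
import re

# Assumption: a tiny stop-word list for illustration; real stop lists are longer.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}


def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are kept."""
    return text.split()


def simple_analyzer(text):
    """Lowercase, then keep runs of letters (splits at every non-letter)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def stop_analyzer(text):
    """Same as simple_analyzer, with stop words removed."""
    return [token for token in simple_analyzer(text) if token not in STOP_WORDS]


sentence = "The quick brown fox jumps over the lazy dog."
for analyzer in (whitespace_analyzer, simple_analyzer, stop_analyzer):
    print(analyzer.__name__, analyzer(sentence))
```

Running it shows the same differences as the slides: only the whitespace version keeps "The" capitalised and "dog." with its full stop, and only the stop version drops "the".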
  98. • Do I find "cheval" when searching for "chevaux"?
    • Is document 93345 found when searching for "+montreux -casino AND role:story"?
  100. Indexing performance tips
    • Tricks of the trade:
      – multithread/multiprocess
      – batch documents
      – separate Solr server and indexers
      – Indexing master + replicants
      – StreamingUpdateSolrServer + javabin
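The "batch documents" tip amounts to sending many documents per update request instead of one request per document. A minimal batching helper might look like this; the name `batches` and the batch size are assumptions, and the actual POST to /update is left out.

```python
from itertools import islice


def batches(documents, size):
    """Yield lists of at most `size` documents from any iterable, in order."""
    iterator = iter(documents)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Each batch would become one <add> request, followed by a single commit at the end.
for batch in batches(range(5), 2):
    print(batch)
```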
  101. Search performance tips
    • Searching Performance
      – javabin: binary protocol for Java clients
      – caches: filterCache most relevant here
      – Autowarm
      – FastLRUCache
      – warming queries: firstSearcher, newSearcher
      – sorting, faceting
  102. First round! Similarities
    • Both are fast and designed to index and search large bodies of data efficiently
    • Both have a long list of high-traffic sites using them
    • Both offer commercial support
    • Both offer client API bindings for several platforms/languages
    • Both can be distributed to increase speed and capacity
  103. Second round! Differences
    • Foundation vs company
    • Language
    • Licenses
  104. Third round! Conclusion
    • Sphinx as a complementary service
    • Solr as the main feature
  108. Questions?