Searching with Solr: An Introduction

Tyler Harms
November 10, 2012

A brief introduction to using Apache Solr for implementing search for your website.

Transcript

  1. SEARCHING WITH SOLR
    Why Implement Solr?
    • Does your site need search?
    • Is Google enough?
    • Do you need/want to control rankings?
    • Just text, or Structured Data?
    Saturday, November 10, 12
  2. What is Solr?
    Solr is a standalone enterprise search server with a REST-like API. You put documents in it [...] over HTTP. You query it via HTTP GET and receive [...] results.
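
The "query it via HTTP GET" half of that description can be sketched in a few lines of Python. This is a minimal sketch rather than anything from the talk: it assumes the example server is listening on localhost:8983 and answers searches at /solr/select, and `select_url` is a hypothetical helper name. It only builds the URL, so it runs without a server.

```python
from urllib.parse import urlencode

# Assumption: the bundled example server, reachable at localhost:8983,
# answering searches at /solr/select.
SOLR_SELECT = "http://localhost:8983/solr/select"


def select_url(query, **params):
    """Build the HTTP GET URL for a search; keyword arguments become extra parameters."""
    return SOLR_SELECT + "?" + urlencode({"q": query, **params})


url = select_url("solr", wt="json")
print(url)
```

Fetching that URL with any HTTP client (for example `urllib.request.urlopen(url)`) would return the results; `wt=json` asks for a JSON response instead of XML.
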
  3. Solr Versions
    • Current Version(s)
      • Solr 3.6.1
      • Solr 4
    • Released Versions are always stable
  4.
    $ wget http://(...)/3.6.1/apache-solr-3.6.1.tgz
    $ tar -xzf apache-solr-3.6.1.tgz
    $ cd apache-solr-3.6.1/example/
    $ java -jar start.jar
    (a lot of java log...)
  5. Search Alternatives
    • Google
    • Lucene
    • elasticsearch
    • Whoosh
    • Xapian
    • Many Others
  6. NOT a Database Replacement
    • Solr is designed to live alongside your website as a separate web app
  7. Scaling Solr
    • Master/Slave Architecture
      • Write to master -> Read from slaves
    • Multicore Setup
      • Multiple Solr ‘cores’ running alongside each other within the same install
  8. Solr’s Data Model
    • Solr maintains a collection of documents
    • A document is a collection of fields and values
    • A field can occur multiple times in a doc
    • Documents are immutable
      • They can be deleted and replaced by new versions, however.
  9. Solr Query Syntax
    • blend (value)
    • company:blend (field:value)
    • title:"Searching with Solr" AND text:apache
    • id:[* TO *]
    • *:* (all fields : all values)
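
Query strings like the ones above are easy to get subtly wrong when assembled by hand, so here is a small sketch of building them in Python. The helper names `clause` and `all_of` are hypothetical, and the sketch only handles phrase quoting; it does not escape other special characters that the query syntax reserves.

```python
def clause(field, value):
    """Format a field:value clause, quoting the value as a phrase if it contains whitespace."""
    if any(ch.isspace() for ch in value):
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        value = f'"{escaped}"'
    return f"{field}:{value}"


def all_of(*clauses):
    """Require every clause to match by joining them with AND."""
    return " AND ".join(clauses)


query = all_of(clause("title", "Searching with Solr"), clause("text", "apache"))
print(query)
```

Note that the phrase is wrapped in straight double quotes; typographic quotes pasted from a slide or word processor are not treated as phrase delimiters.
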
  10. Using Solr
    • Getting Data into Solr
    • Getting Data out of Solr
  11. Getting Data into Solr
    • POST it
      <add>
        <doc>
          <field name="abstract">Lorem ipsum</field>
          <field name="company">Blend Interactive</field>
          <field name="text">Lorem Ipsum</field>
          <field name="title">Some Title</field>
        </doc>
        [<doc> ... </doc>[<doc> ... </doc>]]
      </add>
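
The `<add>` message above can be generated rather than typed. The following is a sketch using Python's standard library; `build_add` is a hypothetical helper, and the `tag` field in the usage line is made up to show a field occurring more than once. Building the XML through ElementTree means characters such as `&` and `<` in field values are escaped automatically.

```python
import xml.etree.ElementTree as ET


def build_add(docs):
    """Serialize a list of dicts as a Solr <add> message, one <doc> per dict."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            # A list or tuple becomes several <field> elements with the same name.
            values = value if isinstance(value, (list, tuple)) else [value]
            for item in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(item)
    return ET.tostring(add, encoding="unicode")


payload = build_add([{"title": "Some Title", "company": "Blend Interactive"}])
print(payload)
print(build_add([{"tag": ["search", "solr"]}]))
```

The resulting string is what would be sent as the body of the POST.
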
  14. Committing
    • Nothing shows up in the index until you commit
    • You can just POST <commit/> to:
      • http://<host>:<port>/solr/update
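
As a sketch of what "just POST `<commit/>`" looks like from code, the snippet below prepares the request with Python's standard library. The host and port defaults, the `commit_request` name, and the `text/xml` content type are assumptions for illustration; the request is only constructed here, not sent.

```python
from urllib.request import Request


def commit_request(host="localhost", port=8983):
    """Prepare (but do not send) a POST of <commit/> to the update URL."""
    url = f"http://{host}:{port}/solr/update"
    return Request(
        url,
        data=b"<commit/>",
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


req = commit_request()
print(req.get_method(), req.full_url, req.data)
```

Passing `req` to `urllib.request.urlopen` would perform the commit against a running server.
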
  15.
    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">19</int>
        <lst name="params">
          <str name="q">solr</str>
        </lst>
      </lst>
      <result name="response" numFound="1" start="0">
        <doc>
          <str name="abstract">A brief introduction to using Apache Solr for implementing search for your website.</str>
          <str name="django_ct">codecamp.session</str>
          <str name="django_id">19</str>
          <str name="id">codecamp.session.19</str>
          <str name="text">Searching with Solr: An Introduction A brief introduction to using Apache Solr for implementing search for your website.</str>
          <str name="title">Searching with Solr: An Introduction</str>
        </doc>
      </result>
    </response>
  18.
    {
      "responseHeader": {
        "status":0,
        "QTime":0,
        "params": {
          "wt":"json",
          "q":"solr"
        }
      },
      "response": {
        "numFound":1,
        "start":0,
        "docs":[{
          "django_id":"19",
          "title":"Searching with Solr: An Introduction",
          "text":"Searching with Solr: An Introduction\nA brief introduction to using Apache Solr for implementing search for your website.",
          "abstract":"A brief introduction to using Apache Solr for implementing search for your website.",
          "django_ct":"codecamp.session",
          "id":"codecamp.session.19"
        }]
      }
    }
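
Consuming a JSON response like this one takes very little code. The sketch below parses a trimmed copy of the response from the slide; `summarize` is a hypothetical helper, and treating a non-zero `status` as an error is an assumption about how a caller might want to behave.

```python
import json

# Trimmed copy of the JSON response shown on the slide.
RAW = """
{
  "responseHeader": {"status": 0, "QTime": 0, "params": {"wt": "json", "q": "solr"}},
  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [
      {"django_id": "19",
       "title": "Searching with Solr: An Introduction",
       "django_ct": "codecamp.session",
       "id": "codecamp.session.19"}
    ]
  }
}
"""


def summarize(raw):
    """Return (numFound, [(id, title), ...]) from a Solr JSON response body."""
    body = json.loads(raw)
    if body["responseHeader"]["status"] != 0:
        raise RuntimeError("Solr reported a non-zero status")
    result = body["response"]
    hits = [(doc["id"], doc.get("title")) for doc in result["docs"]]
    return result["numFound"], hits


found, hits = summarize(RAW)
print(found, hits)
```
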
  19. Deleting Data from Solr
    • POST it
      <delete><id>codecamp.session.19</id></delete>
      <delete><query>company:blend</query></delete>
  20. The Solr Schema
    • schema.xml
      • Defines ‘types’ used in the webapp
      • Defines the fields
      • Defines ‘copyfields’
      • Read the schema inside the example project for more
  21. The Solr Schema
    • Types
      • Define how a field and query should be processed
        • Word Stemming
        • Case Folding
        • How would you handle a search for ‘C.I.A.’?
      • Dates, ints, floats, etc. are defined here as well
      • 2 Modes
        • Index Time
        • Query Time
  22.
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      </analyzer>
    </fieldType>
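
To make the index-time and query-time chains in that field type concrete, here is a toy Python model of them. It is emphatically not how Solr implements analysis: the functions only imitate the order of the tokenizer and filters, and the stop word and synonym lists are hypothetical stand-ins for stopwords.txt and synonyms.txt.

```python
# Hypothetical stand-ins for stopwords.txt and synonyms.txt.
STOPWORDS = {"a", "for", "to", "with"}
SYNONYMS = {"lookup": ["lookup", "search"]}


def whitespace_tokenize(text):
    """Split on whitespace only, like a whitespace tokenizer."""
    return text.split()


def stop_filter(tokens):
    """Drop stop words, comparing case-insensitively (ignoreCase="true")."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


def synonym_filter(tokens):
    """Replace a token with its synonym list when one is defined (expand="true")."""
    expanded = []
    for token in tokens:
        expanded.extend(SYNONYMS.get(token.lower(), [token]))
    return expanded


def analyze_index(text):
    """Index-time chain: tokenizer, then stop filter."""
    return stop_filter(whitespace_tokenize(text))


def analyze_query(text):
    """Query-time chain: tokenizer, then synonyms, then stop filter."""
    return stop_filter(synonym_filter(whitespace_tokenize(text)))


print(analyze_index("Searching with Solr"))
print(analyze_query("lookup with solr"))
```

In this model the indexed token is `Solr` while the query token is `solr`. Neither chain on the slide includes a lowercasing step, which is the kind of mismatch a case-folding filter exists to remove.
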
  25. Fields
    • The elements of a document
    • Both Predefined and Dynamic
    • Fields may occur multiple times
    • May be indexed and/or stored
  26.
    <fields>
      <!-- general -->
      <field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true"/>
      <field name="django_ct" type="string" indexed="true" stored="true" multiValued="false" />
      <field name="django_id" type="string" indexed="true" stored="true" multiValued="false" />
      <!-- dynamic -->
      <dynamicField name="*_i" type="sint" indexed="true" stored="true"/>
      <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
      <dynamicField name="*_l" type="slong" indexed="true" stored="true"/>
      <dynamicField name="*_t" type="text" indexed="true" stored="true"/>
      <dynamicField name="*_b" type="boolean" indexed="true" stored="true"/>
      <dynamicField name="*_f" type="sfloat" indexed="true" stored="true"/>
      <dynamicField name="*_d" type="sdouble" indexed="true" stored="true"/>
      <dynamicField name="*_dt" type="date" indexed="true" stored="true"/>
      <!-- app -->
      <field name="bio" type="text" indexed="true" stored="true" multiValued="false" />
      <field name="title" type="text" indexed="true" stored="true" multiValued="false" />
      <field name="text" type="text" indexed="true" stored="true" multiValued="false" />
      <field name="abstract" type="text" indexed="true" stored="true" multiValued="false" />
      <field name="full_name" type="text" indexed="true" stored="true" multiValued="false" />
      <field name="company" type="text" indexed="true" stored="true" multiValued="false" />
    </fields>
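
The difference between predefined and dynamic fields can be illustrated with a short lookup sketch. It copies a few names and patterns from the schema above into Python and resolves a field name to a type; checking explicit fields first and then trying the patterns in listed order is an assumption of this sketch, not a statement of Solr's exact resolution rules, and `field_type` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# A subset of the explicit fields from the schema on the slide.
EXPLICIT = {"id": "string", "django_ct": "string", "title": "text", "company": "text"}

# The dynamic field patterns from the schema on the slide.
DYNAMIC = [
    ("*_i", "sint"), ("*_s", "string"), ("*_l", "slong"), ("*_t", "text"),
    ("*_b", "boolean"), ("*_f", "sfloat"), ("*_d", "sdouble"), ("*_dt", "date"),
]


def field_type(name):
    """Resolve a field name to its type: explicit fields first, then dynamic patterns."""
    if name in EXPLICIT:
        return EXPLICIT[name]
    for pattern, type_name in DYNAMIC:
        if fnmatchcase(name, pattern):
            return type_name
    return None  # no explicit or dynamic field matches this name


for name in ("title", "price_f", "created_dt", "nickname"):
    print(name, "->", field_type(name))
```

The practical upshot is that a document can carry a field such as `price_f` without the schema naming it in advance, while a name that matches nothing is not a valid field.
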
  29. Copy Fields
    • Two Main Uses
      • Analyze fields in different ways
      • Concatenate Fields
  30.
    <copyField source="bio" dest="df_text" />
    <copyField source="year" dest="century" maxChars="2"/>
    2000 would be stored as 20
    Useful for custom faceting
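
The effect of those two `copyField` rules can be imitated in plain Python. This is a toy model for illustration only: `apply_copy_fields` is a hypothetical function, it handles a single source per destination, and it models `maxChars` as simple truncation of the copied value.

```python
# (source, dest, max_chars) triples mirroring the two copyField rules on the slide.
RULES = [("bio", "df_text", None), ("year", "century", 2)]


def apply_copy_fields(doc, rules):
    """Return a copy of doc with each rule's source value copied (and truncated) into dest."""
    result = dict(doc)
    for source, dest, max_chars in rules:
        if source in doc:
            value = str(doc[source])
            result[dest] = value if max_chars is None else value[:max_chars]
    return result


indexed = apply_copy_fields({"bio": "Search enthusiast", "year": "2000"}, RULES)
print(indexed)
```

The source fields are left untouched; the destination simply receives its own copy, which can then be analyzed or faceted differently.
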
  31. The Solr Config File
    • solrconfig.xml
      • Defines request handlers, defaults, & caches
      • Read the solrconfig.xml inside the example project for more
  32. Other Solr Tools
    • Debug Query
    • Boost Functions
    • Search Faceting
    • Search Filters
    • Search Highlighting
    • Solr Admin
  33. Debug Query Option
    • Add &debugQuery=on to request parameters
    • Returns a parsed form of the query
  34.
    <lst name="debug">
      <str name="rawquerystring">solr</str>
      <str name="querystring">solr</str>
      <str name="parsedquery">text:solr</str>
      <str name="parsedquery_toString">text:solr</str>
      <lst name="explain">
        <str name="codecamp.session.19">
          1.2147729 = (MATCH) fieldWeight(text:solr in 17), product of:
            1.4142135 = tf(termFreq(text:solr)=2)
            3.9267395 = idf(docFreq=2, maxDocs=56)
            0.21875 = fieldNorm(field=text, doc=17)
        </str>
      </lst>
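
The numbers in that explain output can be reproduced by hand. The sketch below assumes Lucene's classic TF-IDF similarity, where tf is the square root of the term frequency and idf is 1 + ln(maxDocs / (docFreq + 1)); the figures on the slide are consistent with those formulas. The fieldNorm value is taken directly from the slide rather than derived.

```python
import math


def tf(term_freq):
    """Term frequency factor: square root of the raw count."""
    return math.sqrt(term_freq)


def idf(doc_freq, max_docs):
    """Inverse document frequency: rarer terms score higher."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


FIELD_NORM = 0.21875  # copied from the explain output on the slide

score = tf(2) * idf(2, 56) * FIELD_NORM
print(tf(2), idf(2, 56), score)
```

Small differences in the last digits relative to the slide are expected, since Python computes in double precision while the explain output shows single-precision values.
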
  38. Boost Function
    • Allows you to influence results at query time
    • Really useful for tuning scoring
    • You can also boost at index time
      q=blend&qf=text^2 company
    More information available - http://wiki.apache.org/solr/SolrRelevancyFAQ
    Can use both dismax and standard query handlers; I use dismax
  39. Boost Function
      &bq=text:blend^2
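
A `qf` value such as `text^2 company` is just field names with optional `^weight` suffixes, so it can be generated from a mapping. The sketch below is one way to do that in Python; `format_qf` is a hypothetical helper, and selecting dismax per request with `defType=dismax` is an assumption about the setup rather than something shown on the slides.

```python
from urllib.parse import urlencode


def format_qf(weights):
    """Turn {"text": 2, "company": 1} into "text^2 company"; a weight of 1 is left implicit."""
    parts = []
    for field, weight in weights.items():
        parts.append(field if weight == 1 else f"{field}^{weight}")
    return " ".join(parts)


params = {
    "q": "blend",
    "defType": "dismax",  # assumption: dismax is chosen per request
    "qf": format_qf({"text": 2, "company": 1}),
}
query_string = urlencode(params)
print(query_string)
```

Keeping the weights in a dict makes it easy to experiment with different boosts while watching the debug output for the effect on scores.
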
  40. Solr Faceting
    • What is a facet?
      • “Interaction style where users filter a set of items by progressively selecting from only valid values of a faceted classification system” - Keith Instone, SOASIS&T, July 8, 2004
    • What does it look like?
    • Make sure to use an untokenized field (e.g. string)
      • “San Jose” != “san”+“jose”
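
As a sketch of what faceting looks like from client code: the request uses the `facet` and `facet.field` parameters, and the JSON response carries the counts under `facet_counts`. The sketch assumes the flat `[value, count, value, count, ...]` layout for each facet field. The `city_s` field name and the sample counts are hypothetical, chosen so the field would be an untokenized string via the `*_s` dynamic field pattern.

```python
def facet_counts(body, field):
    """Pair up the flat [value, count, ...] list returned for one facet field."""
    flat = body["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


# Request parameters; `city_s` is a hypothetical untokenized string field.
params = {"q": "*:*", "facet": "true", "facet.field": "city_s", "wt": "json"}

# Hypothetical, hand-written response fragment (not from the talk).
sample = {"facet_counts": {"facet_fields": {"city_s": ["San Jose", 3, "Sioux Falls", 2]}}}

counts = facet_counts(sample, "city_s")
print(counts)
```

Because the field is untokenized, "San Jose" comes back as one facet value instead of separate counts for "san" and "jose".
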
  42. Solr Filter Query
    • Used to narrow your search query
    • Restrict the super set of documents that can be returned
    • ‘fq’ parameter (short for Filter Query)
      q=*:*
      fq=company:blend
  43. Search Highlighting
    • Allow Solr to generate your highlight
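
A sketch of using Solr-generated highlights from Python follows. Highlighting is requested with the `hl` and `hl.fl` parameters, and the JSON response includes a `highlighting` section keyed by document id. The sample response below is hand-written for illustration (only the document id comes from the talk), and `attach_highlights` is a hypothetical helper.

```python
def attach_highlights(body, field):
    """Return (id, snippet) pairs, with None when no snippet exists for that document."""
    highlighting = body.get("highlighting", {})
    pairs = []
    for doc in body["response"]["docs"]:
        snippets = highlighting.get(doc["id"], {}).get(field, [])
        pairs.append((doc["id"], snippets[0] if snippets else None))
    return pairs


# Request parameters asking Solr to highlight matches in the `text` field.
params = {"q": "solr", "hl": "true", "hl.fl": "text", "wt": "json"}

# Hypothetical, hand-written response fragment.
sample = {
    "response": {"docs": [{"id": "codecamp.session.19"}]},
    "highlighting": {
        "codecamp.session.19": {"text": ["Searching with <em>Solr</em>: An Introduction"]}
    },
}

pairs = attach_highlights(sample, "text")
print(pairs)
```

The snippets arrive with the matched terms already wrapped in markup, so the application only has to display them.
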
  45. Solr Admin
    • http://localhost:8983/solr/admin/
    • Built-in app for testing all search options
      • Field Analysis
      • Schema Browser
      • Full Query Interface
      • Solr Statistics
      • Solr Information
      • Many More Options
  46. Solr/Browse
    • Test your search configuration using the /browse requestHandler
  47. Resources
    • Apache Solr Website
      • http://lucene.apache.org/solr/
      • Wiki, mailing list, bugs/features
    • Books