Getting The Most Out Of Haystack

An introduction to search fundamentals & Haystack (pluggable search for Django).

daniellindsley

September 26, 2011

Transcript

  1. “Engine” • The actual search engine • Here be interesting computer science problems • Examples: Solr, Xapian, Whoosh • Tuesday, December 28, 2010
  2. “Document” • A single record in the index • Usually accompanied by 1+ fields of metadata • Heavily processed
  3. “Stemming” • Find the root of the word • Part of the “magic” of search • More on this later...
  4. “Relevance” • A metric of how well a document matches the query • Search’s killer feature • Hard to get 100% right
  5. “Faceting” • Count of docs meeting certain criteria within your result set • Drill down! • Think Amazon/eBay • More on this later...
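
A minimal sketch of what faceting computes, using plain Python instead of a real engine. The result set, field names and helper functions below are all hypothetical; a real engine computes these counts inside the index rather than by looping over hits.

```python
from collections import Counter

# Hypothetical result set: each search hit is a flat dict of stored fields.
results = [
    {"title": "Intro to Solr", "author": "daniel", "year": 2010},
    {"title": "Whoosh tips", "author": "sally", "year": 2010},
    {"title": "Xapian notes", "author": "daniel", "year": 2009},
]


def facet_counts(hits, field):
    """Count how many hits carry each distinct value of `field`."""
    return Counter(hit[field] for hit in hits if field in hit)


def drill_down(hits, field, value):
    """Narrow the result set to the hits matching one facet value."""
    return [hit for hit in hits if hit.get(field) == value]


author_facets = facet_counts(results, "author")
narrowed = drill_down(results, "author", "daniel")
```

Here `author_facets` holds the per-author counts a sidebar would display, and `narrowed` is the result set after the user clicks one of them.
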
  6. “Boost” • A way to artificially increase the relevance of a document • Types: Document/Field/Term
  7. Search != RDBMS • The sooner you get over that, the easier everything that follows will be. • Think “document store”.
  8. Stemming • Porter-Stemmer or Snowball • The engine takes terms & hacks them down to the root word. • Examples: “testing” → “test”, “searchers” → “searcher” → “search”
  9. Inverted Index • The power of the engine starts here • Basically a reverse mapping from the stemmed form of a term to the collection of documents containing that term: ...“search”: [3, 104, 238],...
  10. Inverted Index • Very fast lookups • NOT a “contains” or “like” lookup unless you say so (slower)
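
The stemming and inverted-index slides can be sketched together in a few lines of Python. This is a toy under stated assumptions: `toy_stem` is a crude suffix stripper (not Porter or Snowball), and the document IDs and texts are invented to mirror the “search”: [3, 104, 238] example.

```python
from collections import defaultdict


def toy_stem(word):
    """Crude suffix stripping; NOT the Porter or Snowball algorithm."""
    word = word.lower()
    for suffix in ("ing", "ers", "er", "s"):
        # Only strip when a stem of at least three letters remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stemmed term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


def lookup(index, term):
    """Exact match on the stemmed term: one dictionary lookup, no scanning."""
    return sorted(index.get(toy_stem(term), set()))


# Hypothetical documents keyed by ID.
docs = {
    3: "search engines are fun",
    104: "testing the searchers",
    238: "a searcher at work",
}
index = build_index(docs)
```

With this data, `lookup(index, "search")` finds all three documents because every variant stems to the same key, while `lookup(index, "sear")` finds nothing: the lookup is by whole stemmed term, not by substring.
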
  11. Document Store • Flat structure • Generally free-form/schema-less • Easiest to think about each record as a dictionary • No relations built-in
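
To make the "record as a dictionary" idea concrete, here is a small sketch of flattening related data into one document. The `authors` and `entry` records and the `to_document` helper are hypothetical; the point is only that related values are copied in, since the index has no joins.

```python
# Hypothetical relational-style records: an entry points at its author by ID.
authors = {7: {"username": "johndoe", "full_name": "John Doe"}}
entry = {"id": 5, "title": "Hello", "content": "First post", "author_id": 7}


def to_document(entry, authors):
    """Denormalize an entry: copy related data into one flat dict."""
    author = authors[entry["author_id"]]
    return {
        "id": "entry.%d" % entry["id"],
        "title": entry["title"],
        "content": entry["content"],
        "author": author["username"],
        "author_name": author["full_name"],
    }


document = to_document(entry, authors)
```
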
  15. Why custom search? • You control what is (and is not) indexed • Better quality data goes into the index • Information-specific handling • Provide context-specific search
  16. What is Haystack? At its simplest, Haystack is an abstraction layer for integrating Django with a search engine.
  17. Why Haystack? • Pluggable Backends • Support Solr & Whoosh out of the box, Xapian with a third-party backend (boo GPL!) • Your code stays the same regardless of backend.
  18. Why Haystack? • Advanced Features • Faceting • More Like This • Highlighting • Boost
  19. Why Haystack? • Integration with third-party apps • No need to fork their code • Put the indexes in your code & register them • Applies to django.contrib as well.
  20. Why Haystack? • Real Live Documentation™! • http://docs.haystacksearch.org/dev/ • Test Coverage! • Decent coverage • No new commits without tests
  21. Two Phase Approach • The “Data In” is SearchIndex • The “Data Out” is SearchQuerySet • Note: There’s a disconnect between your database & the search index
  22. SearchIndex • Provides the means to get data into the index • Something of a cross between a Form (the data preparation aspects) and Model (the persistence)
  23. SearchIndex

        import datetime

        from haystack import indexes, site
        from myapp.models import Entry


        class EntrySearchIndex(indexes.SearchIndex):
            # The main document field, rendered from a template.
            text = indexes.CharField(document=True, use_template=True)
            # Pulled straight from the related user via model_attr.
            author = indexes.CharField(model_attr='user__username')
            created = indexes.DateTimeField()

            def get_queryset(self):
                # Only published entries go into the index.
                return Entry.objects.published()

            def prepare_created(self, obj):
                # Fall back to "now" when the entry has no pub_date.
                return obj.pub_date or datetime.datetime.now()


        site.register(Entry, EntrySearchIndex)
  24. `use_template=True`? • Use Django templates to prep the data • Example:

        # search/indexes/myapp/entry_text.txt
        {{ object.title }}
        {{ object.author.get_full_name }}
        {{ object.tease }}
        {{ object.content }}
  25. SearchQuerySet • The reason to use Haystack • Very powerful • Forget views, forms, etc. They’re all thin wrappers around SearchQuerySet
  26. SearchQuerySet • Fetches data from the index • Very similar to QuerySet • Intentional, to reduce conceptual overhead • Lazily evaluated • Chain methods
  27. SearchQuerySet • By default, searches across all models • Can limit using SearchQuerySet.models • Caches where possible
  28. SearchQuerySet

        >>> import datetime
        >>> from haystack.query import SearchQuerySet
        >>> from myapp.models import Entry
        >>> sqs = SearchQuerySet().models(Entry)
        >>> sqs = sqs.filter(created__lte=datetime.datetime.now())
        >>> sqs = sqs.exclude(author='daniel')

        # The query runs lazily, only when results are asked for.
        >>> sqs
        [<SearchResult: myapp.entry (pk=u'5')>, <SearchResult: myapp.entry (pk=u'3')>, <SearchResult: myapp.entry (pk=u'2')>]

        # Iterable interface.
        # Still hasn't hit the DB.
        >>> [result.author for result in sqs]
        ['johndoe', 'sally1982', 'bob_the_third']

  29. SearchQuerySet

        # Hits the database once per result.
        >>> [result.object.user.first_name for result in sqs]
        ['John', 'Sally', 'Bob']

        # More efficient loading from database (one query total).
        >>> [result.object.user.first_name for result in sqs.load_all()]
        ['John', 'Sally', 'Bob']
  30. SearchView • Class-based view • Covers 80% of regular usage • A guideline for more advanced use • Relies heavily on SearchForm
  31. SearchForm • Apart from using SearchQuerySet, it’s a standard Django form • Defines a search method that does the necessary actions
  32. SearchForm

        from django import forms
        from haystack.forms import SearchForm


        class EntrySearchForm(SearchForm):
            # Additional fields go here.
            author = forms.CharField(max_length=255, required=False)

            def search(self):
                # Start from the results of the standard keyword search.
                sqs = super(EntrySearchForm, self).search()

                # Narrow by author only when one was supplied.
                if self.cleaned_data.get('author'):
                    sqs = sqs.filter(author=self.cleaned_data['author'])

                return sqs
  33. SearchSite • Registry pattern • Collects all registered SearchIndex classes • Used by SearchQuerySet to limit results to only things Haystack knows about • Think django.contrib.admin.site.
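
For readers unfamiliar with the registry pattern, a stripped-down sketch follows. It is not Haystack's implementation: `ToySite`, `AlreadyRegistered`, and the stand-in `Entry` and `EntryIndex` classes are all hypothetical and only illustrate the shape of the idea.

```python
class AlreadyRegistered(Exception):
    """Raised when a model is registered twice (hypothetical name)."""


class ToySite:
    """A toy registry mapping model classes to index instances."""

    def __init__(self):
        self._registry = {}

    def register(self, model, index_class):
        if model in self._registry:
            raise AlreadyRegistered(model.__name__)
        self._registry[model] = index_class()

    def get_index(self, model):
        return self._registry[model]

    def get_indexed_models(self):
        # Search code can use this list to restrict results to known models.
        return list(self._registry)


class Entry:  # stand-in for a Django model
    pass


class EntryIndex:  # stand-in for a SearchIndex subclass
    pass


site = ToySite()
site.register(Entry, EntryIndex)
```

Because registration happens in your own code, a third-party model can be registered the same way without touching its source.
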
  34. Common Fields • Try to find common fields as much as possible • Reuse where it makes sense • But don’t shoehorn if it doesn’t work
  35. It’s Just Python • When an out-of-the-box solution doesn't work for you, use SearchQuerySet & write what you need. • It's just Django & Python.
  36. load_all • Appropriate use of SearchQuerySet.load_all • One hit to the DB per content type • But do you need to hit the DB?
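
The "one hit per content type" behaviour can be illustrated without Django. The sketch below is an assumption-laden stand-in, not Haystack's code: `FAKE_DB`, `fetch_many` and this `load_all` function are hypothetical, and `query_log` simply records each simulated database query.

```python
from collections import defaultdict

# Hypothetical stand-in for a database: {content_type: {pk: object}}.
FAKE_DB = {
    "myapp.entry": {5: "Entry #5", 3: "Entry #3"},
    "myapp.note": {1: "Note #1"},
}
query_log = []


def fetch_many(content_type, pks):
    """Simulate one bulk query (think `pk__in`) for a single content type."""
    query_log.append((content_type, sorted(pks)))
    table = FAKE_DB[content_type]
    return {pk: table[pk] for pk in pks}


def load_all(hits):
    """Load objects for (content_type, pk) hits, one query per content type."""
    pks_by_type = defaultdict(list)
    for content_type, pk in hits:
        pks_by_type[content_type].append(pk)
    loaded = {ct: fetch_many(ct, pks) for ct, pks in pks_by_type.items()}
    # Reassemble in the original relevance order.
    return [loaded[ct][pk] for ct, pk in hits]


hits = [("myapp.entry", 5), ("myapp.note", 1), ("myapp.entry", 3)]
objects = load_all(hits)
```

Three hits across two content types cost two simulated queries here. If the fields you display are already stored in the index, even those can be skipped.
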
  37. More Like This • Cheap & very worth it • LJWorld saw a 30% jump in traffic by adding it solely on story detail views. • Cache it!
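
One way to read "Cache it!" is a simple time-based cache keyed by the story. The sketch below uses only the standard library; `TTLCache` and `fake_more_like_this` are hypothetical stand-ins (in a Django project the cache framework and the real More Like This query would take their place).

```python
import time


class TTLCache:
    """Tiny time-based cache (hypothetical helper, not a Django API)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}

    def get_or_compute(self, key, compute):
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: skip the expensive call
        value = compute()
        self._store[key] = (now + self._ttl, value)
        return value


calls = []


def fake_more_like_this(story_id):
    """Stand-in for an expensive More Like This query against the engine."""
    calls.append(story_id)
    return ["story-%d" % (story_id + offset) for offset in (1, 2, 3)]


cache = TTLCache(ttl_seconds=300)
first = cache.get_or_compute(("mlt", 42), lambda: fake_more_like_this(42))
second = cache.get_or_compute(("mlt", 42), lambda: fake_more_like_this(42))
```

The second call is served from the cache, so the engine is queried once per story per five minutes rather than once per page view.
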
  38. “Third Party” Apps • queued_search • https://github.com/toastdriven/queued_search • saved_searches • https://github.com/toastdriven/saved_searches
  39. Other Ideas • Admin Integration • Integration with API • Search “grouping” • Vertical search
  40. Tomcat vs. Jetty • Very close performance-wise • Tomcat better when busy • Jetty is smaller on RAM & easier to run
  41. Tune JVM settings • -Xms (initial/minimum heap size) • -Xmx (maximum heap size) • Something close to ``java -Xms1G -Xmx12G -jar start.jar`` • -XX:+PrintGCDetails (print GC info) • -XX:+PrintGCTimeStamps (print GC info + timestamps)
  42. JMX Console • java -Dcom.sun.management.jmxremote -jar start.jar • Then jconsole • Find jetty in the process list. • Lots of instrumentation
  43. Tune solrconfig • Proper query warming • The default “solr rocks” doesn’t. • Remove unused handlers (like partition)
  44. Tune solrconfig • Tuning the mergeFactor • Not too high, not too low • Big trade-off
  45. Schema • Use omitNorms where possible • Norms are only needed on full-text fields • Same goes for indexed & stored • The fewer fields, the better
  46. Optimize! • Seriously. • Goes back through existing indexes & cleans up • Takes a while to run, so make sure your timeout is high (custom settings file)
  47. Commits • Commit as infrequently as is reasonable • Commit as much as you can at once • queued_search shines here
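
The commit advice amounts to buffering updates and sending them in batches. A minimal sketch follows, assuming a hypothetical `BatchingIndexer` class and a `backend_update` callable standing in for whatever sends documents to the engine and commits; it is not how queued_search is implemented.

```python
class BatchingIndexer:
    """Buffer documents and push them to the backend in batches."""

    def __init__(self, backend_update, batch_size=1000):
        self._backend_update = backend_update  # hypothetical: sends docs + commits
        self._batch_size = batch_size
        self._pending = []

    def add(self, doc):
        self._pending.append(doc)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self):
        """Send everything buffered so far as a single update/commit."""
        if self._pending:
            batch, self._pending = self._pending, []
            self._backend_update(batch)


commits = []  # each element records one simulated commit
indexer = BatchingIndexer(commits.append, batch_size=3)
for doc_id in range(7):
    indexer.add(doc_id)
indexer.flush()  # push the final partial batch
```

Seven documents result in three commits instead of seven; raising `batch_size` trades freshness for fewer, larger commits.
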
  48. Debugging • Use &debugQuery=on to debug queries • Use the browser interface!
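
A small sketch of building such a debug URL so it can be pasted into the browser. The host, port and `/solr/select` path are assumptions about a local install and will differ per deployment; `debug_query_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def debug_query_url(base_url, query):
    """Build a select URL with debugQuery=on appended."""
    params = {"q": query, "debugQuery": "on"}
    return "%s?%s" % (base_url, urlencode(params))


# Assumed local Solr endpoint; adjust to your own install.
url = debug_query_url("http://localhost:8983/solr/select", "author:daniel")
```

The response then includes the parsed query and a per-document scoring explanation alongside the normal results.
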
  52. Advanced Bits • Learn & love the Solr stats page • Replication • n-gram based autocomplete • Spelling suggestions • the (Haystack) documented config sucks • Dismax Handler
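
To show the idea behind n-gram based autocomplete, here is a toy version using edge n-grams (prefixes) in plain Python. The function names, the minimum and maximum gram sizes and the sample terms are all assumptions; in practice the engine's analyzer generates the grams at index time.

```python
from collections import defaultdict


def edge_ngrams(word, min_size=2, max_size=15):
    """Return the prefixes of `word` between min_size and max_size long."""
    word = word.lower()
    upper = min(len(word), max_size)
    return [word[:n] for n in range(min_size, upper + 1)]


def build_autocomplete_index(terms):
    """Index every prefix so a partial word becomes an exact-match lookup."""
    index = defaultdict(set)
    for term in terms:
        for gram in edge_ngrams(term):
            index[gram].add(term)
    return index


def suggest(index, prefix):
    return sorted(index.get(prefix.lower(), set()))


ac_index = build_autocomplete_index(["haystack", "hayride", "solr"])
```

Typing "hay" matches both "hayride" and "haystack" through a single exact lookup, which keeps the speed of the inverted index while behaving like a prefix search.
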
  53. Resources • https://gist.github.com/215331 • http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr • http://www.lucidimagination.com/blog/2009/09/19/java-garbage-collection-boot-camp-draft/ • http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html • http://wiki.apache.org/solr/SolrJmx • http://wiki.apache.org/solr/LargeIndexes • http://wiki.apache.org/solr/SolrPerformanceFactors