Getting The Most Out Of Haystack

Daniel Lindsley
Pragmatic Badger, LLC
Tuesday, December 28, 2010

Terminology

“Engine”
• The actual search engine
• Here be interesting computer science problems
• Examples: Solr, Xapian, Whoosh

“Document”
• A single record in the index
• Usually accompanied by 1+ fields of metadata
• Heavily processed

“Corpus”
• The collection of indexed documents
• Latin for “body”

“Stemming”
• Find the root of the word
• Part of the “magic” of search
• More on this later...

“Relevance”
• A metric of how well a document matches the query
• Search’s killer feature
• Hard to get 100% right

“Faceting”
• Count of docs meeting certain criteria within your result set
• Drill down!
• Think Amazon/eBay
• More on this later...
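
As a rough illustration of what a facet count is, the sketch below tallies one field over a result set in plain Python. The `results` list and its field names are made up for the example; a real engine computes these counts inside the index rather than in application code.

```python
from collections import Counter

# Hypothetical result set: each document is a flat dict of fields.
results = [
    {"id": 1, "author": "daniel", "type": "blog"},
    {"id": 2, "author": "sally", "type": "blog"},
    {"id": 3, "author": "daniel", "type": "photo"},
]


def facet_counts(docs, field):
    """Count how many documents in the result set carry each value of `field`."""
    return Counter(doc[field] for doc in docs if field in doc)


author_facets = facet_counts(results, "author")

# "Drilling down" means narrowing the result set to a single facet value.
daniel_only = [doc for doc in results if doc["author"] == "daniel"]
```

The counts are what a sidebar such as "daniel (2), sally (1)" would display, and clicking one of them applies the narrowing step.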

“Boost”
• A way to artificially increase the relevance of a document
• Types: Document/Field/Term
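
To make the idea of boosting concrete, here is a toy scoring function. It assumes relevance is nothing more than term frequency, which is far simpler than what Solr, Xapian or Whoosh actually compute; the `score` function and its parameters are invented for this sketch and only field and document boosts are shown.

```python
def score(doc, term, field_boosts=None, doc_boost=1.0):
    """Toy relevance: term frequency per field, scaled by field and document boosts."""
    field_boosts = field_boosts or {}
    total = 0.0
    for field, text in doc.items():
        hits = text.lower().split().count(term.lower())
        # A field boost multiplies the contribution of matches in that field.
        total += hits * field_boosts.get(field, 1.0)
    # A document boost scales the whole document's score.
    return total * doc_boost


doc = {"title": "Search tips", "body": "How to search a search index"}
plain = score(doc, "search")
boosted = score(doc, "search", field_boosts={"title": 2.0})
```

With the title weighted twice as heavily, the same document scores higher for the same query, which is all a boost does.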

Introduction to Search

Search != RDBMS
• The sooner you get over that, the easier everything that follows will be.
• Think “document store”.

Stemming
• Porter-Stemmer or Snowball
• The engine takes terms & hacks them down to the root word.
• Examples:
  “testing” → “test”
  “searchers” → “searcher” → “search”
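
The sketch below shows the general shape of suffix stripping. It is deliberately naive and is not the Porter or Snowball algorithm; the `SUFFIXES` list and the `toy_stem` name are assumptions made for illustration only.

```python
SUFFIXES = ("ing", "ers", "er", "s")  # hypothetical and far from complete


def toy_stem(word):
    """Strip one known suffix; real stemmers apply many ordered rules."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = [toy_stem(w) for w in ("testing", "searchers", "searcher", "search")]
```

Because the query terms are stemmed the same way as the indexed terms, a search for “searchers” can match a document that only mentions “search”.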

Inverted Index
• The power of the engine starts here
• Basically a reverse mapping from the stemmed form of a term to the collection of documents containing that term:
  ...“search”: [3, 104, 238],...

Inverted Index
• Very fast lookups
• NOT a “contains” or “like” lookup unless you say so (slower)
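
A minimal sketch of the structure, assuming whitespace tokenization and lowercasing in place of real analysis and stemming. The document ids reuse the ones from the slide; the texts are invented.

```python
from collections import defaultdict


def build_inverted_index(docs, normalize=str.lower):
    """Map each normalized term to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


docs = {
    3: "Search engines",
    104: "search tips",
    238: "Why search matters",
    7: "Unrelated",
}
index = build_inverted_index(docs)

# An exact term lookup is a single dictionary access, not a scan of every document.
hits = index.get("search", [])
```

Note that a partial term such as “sear” is simply absent from the mapping, which is why a “contains” style match needs extra work from the engine.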

Document Store
• Flat structure
• Generally free-form/schema-less
• Easiest to think about each record as a dictionary
• No relations built-in
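
One way to picture "each record as a dictionary" is to flatten related rows into a single document before indexing. The `user` and `entry` data and the `flatten` helper below are hypothetical.

```python
# Hypothetical relational-style data: an entry that points at a user row.
user = {"id": 7, "username": "daniel"}
entry = {"id": 5, "title": "Hello", "user_id": 7, "tags": ["search", "django"]}


def flatten(entry, user):
    """Denormalize related data into one flat dict, as a search document."""
    return {
        "id": "entry.%s" % entry["id"],
        "title": entry["title"],
        "author": user["username"],  # copied in; there is no join at query time
        "tags": list(entry["tags"]),
    }


document = flatten(entry, user)
```

The foreign key disappears and the related value is stored directly, so the document has to be rebuilt whenever the related data changes.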

Why custom search?
...or...
“Isn’t this what Google is for?”

Why custom search?
• You control what is (and is not) indexed
• Better quality data goes into the index
• Information-specific handling
• Provide context-specific search

Introduction to Haystack

What is Haystack?
At its simplest, Haystack is an abstraction layer for integrating Django with a search engine.

Why Haystack?
• Familiar API
• Declarative
• “Looks” like Django

Why Haystack?
• Pluggable Backends
  • Support Solr & Whoosh out of the box, Xapian with a third-party backend (boo GPL!)
  • Your code stays the same regardless of backend.

Why Haystack?
• Advanced Features
  • Faceting
  • More Like This
  • Highlighting
  • Boost

Why Haystack?
• Integration with third-party apps
  • No need to fork their code
  • Put the indexes in your code & register them
  • Applies to django.contrib as well.

Why Haystack?
• Real Live Documentation™!
  • http://docs.haystacksearch.org/dev/
• Test Coverage!
  • Decent coverage
  • No new commits without tests

Enough shameless self-promotion already!

Using Haystack

Two Phase Approach
• The “Data In” is SearchIndex
• The “Data Out” is SearchQuerySet
• Note: There’s a disconnect between your database & the search index

SearchIndex

SearchIndex
• Provides the means to get data into the index
• Something of a cross between a Form (the data preparation aspects) and Model (the persistence)

SearchIndex

    import datetime

    from haystack import indexes, site
    from myapp.models import Entry


    class EntrySearchIndex(indexes.SearchIndex):
        # The primary full-text field, built from a template.
        text = indexes.CharField(document=True, use_template=True)
        author = indexes.CharField(model_attr='user__username')
        created = indexes.DateTimeField()

        def get_queryset(self):
            # Only published entries are indexed.
            return Entry.objects.published()

        def prepare_created(self, obj):
            # Fall back to the current time when there is no pub_date.
            return obj.pub_date or datetime.datetime.now()


    site.register(Entry, EntrySearchIndex)

`use_template=True`?
• Use Django templates to prep the data
• Example (search/indexes/myapp/entry_text.txt):

    {{ object.title }}
    {{ object.user.get_full_name }}
    {{ object.tease }}
    {{ object.content }}

SearchQuerySet

SearchQuerySet
• The reason to use Haystack
• Very powerful
• Forget views, forms, etc. They’re all thin wrappers around SearchQuerySet

SearchQuerySet
• Fetches data from the index
• Very similar to QuerySet
  • Intentional, to reduce conceptual overhead
• Lazily evaluated
• Chain methods

SearchQuerySet
• By default, searches across all models
• Can limit using SearchQuerySet.models
• Caches where possible

SearchQuerySet

    >>> import datetime
    >>> from haystack.query import SearchQuerySet
    >>> from myapp.models import Entry
    >>> sqs = SearchQuerySet().models(Entry)
    >>> sqs = sqs.filter(created__lte=datetime.datetime.now())
    >>> sqs = sqs.exclude(author='daniel')

    # Lazily performs the query when asked for results.
    >>> sqs
    [<SearchResult: myapp.entry (pk=u'5')>, <SearchResult: myapp.entry (pk=u'3')>, <SearchResult: myapp.entry (pk=u'2')>]

    # Iterable interface.
    # Still hasn't hit the DB.
    >>> [result.author for result in sqs]
    ['johndoe', 'sally1982', 'bob_the_third']

SearchQuerySet

    # Hits the database once per result.
    >>> [result.object.user.first_name for result in sqs]
    ['John', 'Sally', 'Bob']

    # More efficient loading from database (one query total).
    >>> [result.object.user.first_name for result in sqs.load_all()]
    ['John', 'Sally', 'Bob']

SearchView

SearchView
• Class-based view
• Hits 80% of regular usage
• A guideline to more advanced use
• Relies heavily on SearchForm

SearchForm

SearchForm
• Outside of using SearchQuerySet, it’s a standard Django form
• Defines a search method that does the necessary actions

SearchForm

    from django import forms
    from haystack.forms import SearchForm
    from myapp.models import Entry


    class EntrySearchForm(SearchForm):
        # Additional fields go here.
        author = forms.CharField(max_length=255, required=False)

        def search(self):
            sqs = super(EntrySearchForm, self).search()

            if self.cleaned_data.get('author'):
                sqs = sqs.filter(author=self.cleaned_data['author'])

            return sqs

SearchSite

SearchSite
• Registry pattern
• Collects all registered SearchIndex classes
• Used by SearchQuerySet to limit results to only things Haystack knows about
• Think django.contrib.admin.site.
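
For readers unfamiliar with the registry pattern, a bare-bones version looks something like the following. `ToySite` is a hypothetical stand-in written for this sketch, not Haystack's SearchSite, and the two classes are placeholders for a Django model and a SearchIndex subclass.

```python
class ToySite(object):
    """Minimal registry: maps a model class to the index class that handles it."""

    def __init__(self):
        self._registry = {}

    def register(self, model, index_class):
        if model in self._registry:
            raise ValueError("%s is already registered" % model.__name__)
        self._registry[model] = index_class

    def get_indexed_models(self):
        return list(self._registry)


class Entry(object):  # stand-in for a Django model
    pass


class EntrySearchIndex(object):  # stand-in for a SearchIndex subclass
    pass


site = ToySite()
site.register(Entry, EntrySearchIndex)
```

Anything that needs to know which models are searchable asks the registry, which is how results can be limited to registered models.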

Haystack Best Practices

Common Fields
• Try to find common fields as much as possible
• Reuse where it makes sense
• But don’t shoehorn if it doesn’t work

It’s Just Python
• When an out-of-the-box component doesn’t work for you, use SearchQuerySet & write what you need.
• It’s just Django & Python.

load_all
• Appropriate use of SearchQuerySet.load_all
• One hit to the DB per content type
• But do you need to hit the DB?
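
The "one hit per content type" idea can be sketched without Django: group the results by type, then do one bulk fetch per group. Everything here (`TABLES`, `fetch_many`, `load_all_sketch`) is a made-up stand-in and not Haystack's implementation.

```python
from collections import defaultdict

# Fake "database": one table per content type, plus a log of queries issued.
TABLES = {"entry": {5: "Entry 5", 3: "Entry 3"}, "photo": {9: "Photo 9"}}
queries = []


def fetch_many(content_type, pks):
    """Stand-in for a single bulk lookup against one table."""
    queries.append(content_type)
    return {pk: TABLES[content_type][pk] for pk in pks}


def load_all_sketch(results):
    """Group results by content type, then issue one bulk fetch per type."""
    by_type = defaultdict(list)
    for content_type, pk in results:
        by_type[content_type].append(pk)
    loaded = {ct: fetch_many(ct, pks) for ct, pks in by_type.items()}
    # Return the objects in the original (relevance) order.
    return [loaded[ct][pk] for ct, pk in results]


objects = load_all_sketch([("entry", 5), ("photo", 9), ("entry", 3)])
```

Three results across two content types cost two queries here instead of three, and fields already stored in the index cost none at all.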

More Like This
• Cheap & very worth it
• LJWorld saw a 30% jump in traffic by adding it solely on story detail views.
• Cache it!
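
A minimal sketch of the "Cache it!" advice, using an in-process dictionary with a time-to-live. In a Django project the cache framework would be the natural home for this; `cached` and `fetch_similar` are hypothetical names, and `fetch_similar` stands in for whatever call asks the engine for similar documents.

```python
import time

_cache = {}  # key -> (expires_at, value)


def cached(key, compute, ttl=300, now=time.time):
    """Return the cached value for `key`, recomputing once `ttl` seconds pass."""
    entry = _cache.get(key)
    if entry is not None and entry[0] > now():
        return entry[1]
    value = compute()
    _cache[key] = (now() + ttl, value)
    return value


calls = []


def fetch_similar(pk):
    """Hypothetical stand-in for the engine call, e.g. a more_like_this query."""
    calls.append(pk)
    return ["similar-to-%s" % pk]


first = cached("mlt:5", lambda: fetch_similar(5))
second = cached("mlt:5", lambda: fetch_similar(5))
```

The second call is served from the cache, so a busy detail page asks the engine only once per time-to-live window.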

“Third Party” Apps
• queued_search
  • https://github.com/toastdriven/queued_search
• saved_searches
  • https://github.com/toastdriven/saved_searches

Other Ideas
• Admin Integration
• Integration with API
• Search “grouping”
• Vertical search

Solr Best Practices

Tomcat vs. Jetty
• Very close performance-wise
• Tomcat better when busy
• Jetty is smaller on RAM & easier to run

Tune JVM settings
• -Xms (Minimum size)
• -Xmx (Maximum size)
• Something close to: java -Xms1G -Xmx12G -jar start.jar
• -XX:+PrintGCDetails (print GC info)
• -XX:+PrintGCTimeStamps (print GC info + timestamps)

JMX Console
• java -Dcom.sun.management.jmxremote -jar start.jar
• Then jconsole
• Find jetty in the process list.
• Lots of instrumentation

Tune solrconfig
• Proper query warming
  • The default “solr rocks” doesn’t.
• Remove unused handlers (like partition)

Tune solrconfig
• Tuning the mergeFactor
  • Not too high, not too low
  • Big trade-off

Schema
• Use omitNorms where possible
  • Norms are only needed on full-text fields
• Same goes for indexed & stored
• The fewer fields, the better

Optimize!
• Seriously.
• Goes back through existing indexes & cleans up
• Takes a while to run, so make sure your timeout is high (custom settings file)

Commits
• Commit as infrequently as is reasonable
• Commit as much as you can at once
• queued_search shines here
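
The batching advice can be sketched as follows. `send_batch` is a hypothetical stand-in for "add these documents, then commit once", and the batch size of 1000 is an arbitrary choice for the example.

```python
def batches(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


commits = []


def send_batch(docs):
    """Hypothetical stand-in for 'add these documents, then commit once'."""
    commits.append(len(docs))


pending = list(range(2500))  # pretend these are queued document updates
for chunk in batches(pending, 1000):
    send_batch(chunk)
```

Here 2500 updates result in three commits rather than 2500, which is the trade the slide recommends.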

Debugging
• Use &debugQuery=on to debug queries
• Use the browser interface!

Advanced Bits
• Learn & love the Solr stats page

• Replication
• n-gram based autocomplete
• Spelling suggestions
  • The (Haystack) documented config sucks
• Dismax Handler
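
The idea behind n-gram based autocomplete can be shown in plain Python: index every leading substring of a term so that a typed prefix becomes an exact lookup. This sketch assumes edge n-grams and made-up size limits; in Solr the equivalent work is done by analyzers configured in the schema.

```python
def edge_ngrams(term, min_size=2, max_size=15):
    """Return the leading substrings ("edge n-grams") of a term."""
    term = term.lower()
    upper = min(len(term), max_size)
    return [term[:n] for n in range(min_size, upper + 1)]


def build_autocomplete(terms):
    """Index every edge n-gram so a typed prefix becomes an exact lookup."""
    index = {}
    for term in terms:
        for gram in edge_ngrams(term):
            index.setdefault(gram, set()).add(term)
    return index


index = build_autocomplete(["search", "searcher", "solr"])
suggestions = sorted(index.get("sea", ()))
```

The cost is a larger index, since each term is stored once per prefix length.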

Resources
• https://gist.github.com/215331
• http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr
• http://www.lucidimagination.com/blog/2009/09/19/java-garbage-collection-boot-camp-draft/
• http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html
• http://wiki.apache.org/solr/SolrJmx
• http://wiki.apache.org/solr/LargeIndexes
• http://wiki.apache.org/solr/SolrPerformanceFactors

Resources
• http://wiki.apache.org/solr/SolrReplication
• http://www.yashh.com/blog/2010/nov/03/autocomplete-solr/
• http://charlesleifer.com/blog/search-on-djangosnippetsorg/
• http://wiki.apache.org/solr/SpellCheckComponent

Enough Talk. Let’s Go Work With It.

A Big Thanks To CMG Digital & @cmheisel For Having Me!

More Information
http://haystacksearch.org/
http://github.com/toastdriven/django-haystack
#haystack on irc.freenode.net
http://groups.google.com/group/django-haystack/
@daniellindsley on Twitter
