Slide 1

Slide 1 text

Getting The Most Out Of Haystack
Daniel Lindsley
Pragmatic Badger, LLC
Tuesday, December 28, 2010

Slide 2

Slide 2 text

Terminology

Slide 3

Slide 3 text

“Engine”
• The actual search engine
• Here be interesting computer science problems
• Examples: Solr, Xapian, Whoosh

Slide 4

Slide 4 text

“Document”
• A single record in the index
• Usually accompanied by 1+ fields of metadata
• Heavily processed

Slide 5

Slide 5 text

“Corpus”
• The collection of indexed documents
• Latin for “body”

Slide 6

Slide 6 text

“Stemming”
• Find the root of the word
• Part of the “magic” of search
• More on this later...

Slide 7

Slide 7 text

“Relevance”
• A metric of how well a document matches the query
• Search’s killer feature
• Hard to get 100% right

Slide 8

Slide 8 text

“Faceting”
• Count of docs meeting certain criteria within your result set
• Drill down!
• Think Amazon/eBay
• More on this later...
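
A rough way to picture faceting is a count of field values over the current result set, followed by a drill-down that narrows the set. The sketch below is plain Python, not Haystack or Solr code; the `results` data and the `facet_counts` and `drill_down` helpers are made up for illustration.

```python
from collections import Counter

# Hypothetical result set: each document is a flat dict of fields.
results = [
    {"title": "Django search basics", "author": "johndoe", "year": 2009},
    {"title": "Tuning Solr", "author": "sally1982", "year": 2010},
    {"title": "Whoosh in practice", "author": "johndoe", "year": 2010},
]


def facet_counts(docs, field):
    """Count how many documents in the result set carry each value of `field`."""
    return Counter(doc[field] for doc in docs if field in doc)


def drill_down(docs, field, value):
    """Narrow the result set to documents whose `field` equals `value`."""
    return [doc for doc in docs if doc.get(field) == value]


author_facets = facet_counts(results, "author")
narrowed = drill_down(results, "author", "johndoe")
year_facets = facet_counts(narrowed, "year")
```

A real engine computes these counts inside the index rather than by looping over documents, which is why faceting stays cheap on large result sets.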

Slide 9

Slide 9 text

“Boost”
• A way to artificially increase the relevance of a document
• Types: Document/Field/Term

Slide 10

Slide 10 text

Introduction to Search

Slide 11

Slide 11 text

Search != RDBMS
• The sooner you get over that, the easier everything that follows will be.
• Think “document store”.

Slide 12

Slide 12 text

Stemming
• Porter-Stemmer or Snowball
• The engine takes terms & hacks them down to the root word.
• Examples:
    “testing” → “test”
    “searchers” → “searcher” → “search”
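
To make the "hack terms down to the root" idea concrete, here is a deliberately naive suffix stripper. It is not the Porter or Snowball algorithm (those apply many more rules and conditions); the `SUFFIXES` list and `toy_stem` function are assumptions chosen only so that the two examples from the slide come out as shown.

```python
# Toy suffix list, checked in order. Real stemmers use far richer rule sets.
SUFFIXES = ("ing", "er", "s")


def toy_stem(term, min_stem_length=3):
    """Repeatedly strip one known suffix until none applies."""
    term = term.lower()
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            # Keep at least `min_stem_length` characters so short words survive.
            if term.endswith(suffix) and len(term) - len(suffix) >= min_stem_length:
                term = term[: -len(suffix)]
                changed = True
                break
    return term


stem_testing = toy_stem("testing")
stem_searchers = toy_stem("searchers")
```

Because the same function runs at index time and at query time, a query for "searchers" and a document containing "search" end up under the same index key.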

Slide 13

Slide 13 text

Inverted Index
• The power of the engine starts here
• Basically a reverse mapping from the stemmed form of a term to the collection of documents containing that term
    ..."search": [3, 104, 238],...

Slide 14

Slide 14 text

Inverted Index
• Very fast lookups
• NOT a “contains” or “like” lookup unless you say so (slower)
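
The following sketch builds a tiny inverted index in plain Python to show why exact-term lookups are fast and "contains" lookups are not. It skips stemming and every other analysis step, and the `documents` data and helper names are illustrative assumptions, not engine internals.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercased term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical corpus keyed by document id.
documents = {
    3: "search engines are fun",
    104: "full text search with Django",
    238: "search relevance is hard",
    512: "nothing to see here",
}
index = build_inverted_index(documents)

# Exact-term lookup: a single dictionary access.
exact = index.get("search", [])

# "Contains"-style lookup: has to scan every term in the index (slower).
contains = sorted(
    {doc_id for term, ids in index.items() if "eng" in term for doc_id in ids}
)
```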

Slide 15

Slide 15 text

Document Store
• Flat structure
• Generally free-form/schema-less
• Easiest to think about each record as a dictionary
• No relations built-in

Slide 16

Slide 16 text

Why custom search?
...or...
“Isn’t this what Google is for?”

Slide 17

Slide 17 text

Why custom search?
• You control what is (and is not) indexed

Slide 18

Slide 18 text

Why custom search?
• You control what is (and is not) indexed
• Better quality data goes into the index

Slide 19

Slide 19 text

Why custom search?
• You control what is (and is not) indexed
• Better quality data goes into the index
• Information-specific handling

Slide 20

Slide 20 text

Why custom search?
• You control what is (and is not) indexed
• Better quality data goes into the index
• Information-specific handling
• Provide context-specific search

Slide 21

Slide 21 text

Introduction to Haystack

Slide 22

Slide 22 text

What is Haystack?
At its simplest, Haystack is an abstraction layer for integrating Django with a search engine.

Slide 23

Slide 23 text

Why Haystack?
• Familiar API
• Declarative
• “Looks” like Django

Slide 24

Slide 24 text

Why Haystack?
• Pluggable Backends
  • Supports Solr & Whoosh out of the box, Xapian with a third-party backend (boo GPL!)
  • Your code stays the same regardless of backend.

Slide 25

Slide 25 text

Why Haystack?
• Advanced Features
  • Faceting
  • More Like This
  • Highlighting
  • Boost

Slide 26

Slide 26 text

Why Haystack?
• Integration with third-party apps
  • No need to fork their code
  • Put the indexes in your code & register them
  • Applies to django.contrib as well.

Slide 27

Slide 27 text

Why Haystack?
• Real Live Documentation™!
  • http://docs.haystacksearch.org/dev/
• Test Coverage!
  • Decent coverage
  • No new commits without tests

Slide 28

Slide 28 text

Enough shameless self-promotion already!

Slide 29

Slide 29 text

Using Haystack

Slide 30

Slide 30 text

Two Phase Approach
• The “Data In” is SearchIndex
• The “Data Out” is SearchQuerySet
• Note: There’s a disconnect between your database & the search index

Slide 31

Slide 31 text

SearchIndex

Slide 32

Slide 32 text

SearchIndex
• Provides the means to get data into the index
• Something of a cross between a Form (the data preparation aspects) and a Model (the persistence)

Slide 33

Slide 33 text

SearchIndex

    import datetime

    from haystack import indexes, site
    from myapp.models import Entry


    class EntrySearchIndex(indexes.SearchIndex):
        # The main document field, rendered from a template.
        text = indexes.CharField(document=True, use_template=True)
        author = indexes.CharField(model_attr='user__username')
        created = indexes.DateTimeField()

        def get_queryset(self):
            # Only index published entries.
            return Entry.objects.published()

        def prepare_created(self, obj):
            # Fall back to "now" when the entry has no publication date.
            return obj.pub_date or datetime.datetime.now()


    site.register(Entry, EntrySearchIndex)

Slide 34

Slide 34 text

`use_template=True`?
• Use Django templates to prep the data
• Example:

    # search/indexes/myapp/entry_text.txt
    {{ object.title }}
    {{ object.user.get_full_name }}
    {{ object.tease }}
    {{ object.content }}

Slide 35

Slide 35 text

SearchQuerySet

Slide 36

Slide 36 text

SearchQuerySet
• The reason to use Haystack
• Very powerful
• Forget views, forms, etc. They’re all thin wrappers around SearchQuerySet

Slide 37

Slide 37 text

SearchQuerySet
• Fetches data from the index
• Very similar to QuerySet
  • Intentional, to reduce conceptual overhead
• Lazily evaluated
• Chain methods
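
The two properties "lazily evaluated" and "chain methods" can be illustrated with a small stand-in class. This is not Haystack's implementation; `LazyQuery` and `fake_backend` are hypothetical names, and the sketch only shows the general pattern of accumulating filters and deferring the backend call until results are needed.

```python
class LazyQuery:
    """Toy stand-in for a lazily evaluated, chainable query object."""

    def __init__(self, backend, filters=()):
        self._backend = backend         # callable that actually runs the search
        self._filters = tuple(filters)  # accumulated (field, value) pairs
        self._cache = None              # results, filled on first evaluation

    def filter(self, **kwargs):
        # Return a new object instead of mutating, so chains can branch safely.
        return LazyQuery(self._backend, self._filters + tuple(kwargs.items()))

    def __iter__(self):
        if self._cache is None:  # only hit the backend once
            self._cache = list(self._backend(self._filters))
        return iter(self._cache)


calls = []


def fake_backend(filters):
    """Hypothetical backend: records each call and filters an in-memory list."""
    calls.append(filters)
    docs = [
        {"author": "johndoe", "year": 2010},
        {"author": "daniel", "year": 2010},
    ]
    return [d for d in docs if all(d.get(k) == v for k, v in filters)]


query = LazyQuery(fake_backend).filter(year=2010).filter(author="johndoe")
calls_before = len(calls)    # nothing has been evaluated yet
results = list(query)        # first iteration runs the backend
results_again = list(query)  # second iteration is served from the cache
```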

Slide 38

Slide 38 text

SearchQuerySet
• By default, searches across all models
• Can limit using SearchQuerySet.models
• Caches where possible

Slide 39

Slide 39 text

SearchQuerySet

    >>> import datetime
    >>> from haystack.query import SearchQuerySet
    >>> from myapp.models import Entry
    >>> sqs = SearchQuerySet().models(Entry)
    >>> sqs = sqs.filter(created__lte=datetime.datetime.now())
    >>> sqs = sqs.exclude(author='daniel')

    # Lazily performs the query when asked for results.
    >>> sqs
    [<SearchResult ...>, <SearchResult ...>, <SearchResult ...>]

    # Iterable interface.
    # Still hasn't hit the DB.
    >>> [result.author for result in sqs]
    ['johndoe', 'sally1982', 'bob_the_third']

Slide 40

Slide 40 text

SearchQuerySet

    # Hits the database once per result.
    >>> [result.object.user.first_name for result in sqs]
    ['John', 'Sally', 'Bob']

    # More efficient loading from database (one query total).
    >>> [result.object.user.first_name for result in sqs.load_all()]
    ['John', 'Sally', 'Bob']

Slide 41

Slide 41 text

SearchView

Slide 42

Slide 42 text

SearchView
• Class-based view
• Hits 80% of the regular usage
• A guideline to more advanced use
• Relies heavily on SearchForm

Slide 43

Slide 43 text

SearchForm

Slide 44

Slide 44 text

SearchForm
• Outside of using SearchQuerySet, it’s a standard Django form
• Defines a search method that does the necessary actions

Slide 45

Slide 45 text

SearchForm

    from django import forms
    from haystack.forms import SearchForm
    from myapp.models import Entry


    class EntrySearchForm(SearchForm):
        # Additional fields go here.
        author = forms.CharField(max_length=255, required=False)

        def search(self):
            sqs = super(EntrySearchForm, self).search()

            if self.cleaned_data.get('author'):
                sqs = sqs.filter(author=self.cleaned_data['author'])

            return sqs

Slide 46

Slide 46 text

SearchSite

Slide 47

Slide 47 text

SearchSite
• Registry pattern
• Collects all registered SearchIndex classes
• Used by SearchQuerySet to limit results to only things Haystack knows about
• Think django.contrib.admin.site.
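
For readers unfamiliar with the registry pattern, the sketch below shows its general shape: one object maps each model class to an index instance and can report which models it knows about. The `Registry`, `ToyIndex`, and `Entry` names are illustrative assumptions, not Haystack's own classes.

```python
class AlreadyRegistered(Exception):
    """Raised when a model is registered twice."""


class Registry:
    """Toy registry mapping model classes to index instances."""

    def __init__(self):
        self._registry = {}

    def register(self, model, index_class):
        if model in self._registry:
            raise AlreadyRegistered(f"{model.__name__} is already registered")
        # Store one index instance per model.
        self._registry[model] = index_class(model)

    def get_index(self, model):
        return self._registry[model]

    def get_indexed_models(self):
        return list(self._registry)


class Entry:
    """Hypothetical model class."""


class ToyIndex:
    """Hypothetical index class that remembers which model it serves."""

    def __init__(self, model):
        self.model = model


toy_site = Registry()
toy_site.register(Entry, ToyIndex)
known_models = toy_site.get_indexed_models()
```

Anything that is not in the registry is invisible to the query side, which is how results get limited to known models.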

Slide 48

Slide 48 text

Haystack Best Practices

Slide 49

Slide 49 text

Common Fields
• Try to find common fields as much as possible
• Reuse where it makes sense
• But don’t shoehorn if it doesn’t work

Slide 50

Slide 50 text

It’s Just Python
• When an out-of-the-box component doesn't work for you, use SearchQuerySet & write what you need.
• It's just Django & Python.

Slide 51

Slide 51 text

load_all
• Appropriate use of SearchQuerySet.load_all
• One hit to the DB per content type
• But do you need to hit the DB?
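
The "one hit per content type" behaviour can be pictured as grouping result keys by type and issuing one bulk fetch per group. The sketch below simulates that in plain Python; `load_all_objects` and `fake_fetch_many` are hypothetical helpers standing in for the database layer, not Haystack's code.

```python
from collections import defaultdict


def load_all_objects(results, fetch_many):
    """Resolve (content_type, pk) pairs with one bulk fetch per content type.

    `fetch_many(content_type, pks)` must return a dict mapping pk to object.
    """
    pks_by_type = defaultdict(list)
    for content_type, pk in results:
        pks_by_type[content_type].append(pk)

    loaded = {}
    for content_type, pks in pks_by_type.items():
        for pk, obj in fetch_many(content_type, pks).items():
            loaded[(content_type, pk)] = obj

    # Preserve the original (relevance) ordering of the results.
    return [loaded.get(key) for key in results]


queries = []


def fake_fetch_many(content_type, pks):
    """Hypothetical bulk loader that records each simulated query."""
    queries.append((content_type, tuple(pks)))
    return {pk: f"{content_type}#{pk}" for pk in pks}


search_results = [("entry", 1), ("comment", 7), ("entry", 2)]
objects = load_all_objects(search_results, fake_fetch_many)
```

Three results across two content types cost two simulated queries here instead of three. If everything the template needs is already stored in the index, even those two can be skipped.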

Slide 52

Slide 52 text

More Like This
• Cheap & very worth it
• LJWorld saw a 30% jump in traffic by adding it solely on story detail views.
• Cache it!

Slide 53

Slide 53 text

“Third Party” Apps
• queued_search
  • https://github.com/toastdriven/queued_search
• saved_searches
  • https://github.com/toastdriven/saved_searches

Slide 54

Slide 54 text

Other Ideas
• Admin Integration
• Integration with API
• Search “grouping”
• Vertical search

Slide 55

Slide 55 text

Solr Best Practices

Slide 56

Slide 56 text

Tomcat vs. Jetty
• Very close performance-wise
• Tomcat better when busy
• Jetty is smaller on RAM & easier to run

Slide 57

Slide 57 text

Tune JVM settings
• -Xms (Minimum size)
• -Xmx (Maximum size)
• Something close to...
    java -Xms1G -Xmx12G -jar start.jar
• -XX:+PrintGCDetails (print GC info)
• -XX:+PrintGCTimeStamps (print GC info + timestamps)

Slide 58

Slide 58 text

JMX Console
• java -Dcom.sun.management.jmxremote -jar start.jar
• Then jconsole
• Find jetty in the process list.
• Lots of instrumentation

Slide 59

Slide 59 text

Tune solrconfig
• Proper query warming
  • The default “solr rocks” doesn’t.
• Remove unused handlers (like partition)

Slide 60

Slide 60 text

Tune solrconfig
• Tuning the mergeFactor
  • Not too high, not too low
  • Big trade-off

Slide 61

Slide 61 text

Schema
• Use omitNorms where possible
  • Only needed on full-text fields
• Same goes for indexed & stored
• The fewer fields, the better

Slide 62

Slide 62 text

Optimize!
• Seriously.
• Goes back through existing indexes & cleans up
• Takes a while to run, so make sure your timeout is high (custom settings file)

Slide 63

Slide 63 text

Commits
• Commit as infrequently as is reasonable
• Commit as much as you can at once
• queued_search shines here
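
The advice above amounts to buffering updates and sending them in large batches. The sketch below shows that idea in plain Python. It is not how queued_search is implemented; `BatchingIndexer` and its `commit` callable are hypothetical, with `commit` standing in for whatever sends a batch to the engine.

```python
class BatchingIndexer:
    """Buffer documents and hand them to `commit` in batches."""

    def __init__(self, commit, batch_size=100):
        self._commit = commit          # callable taking a list of documents
        self._batch_size = batch_size
        self._pending = []

    def add(self, doc):
        self._pending.append(doc)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self):
        """Commit whatever is buffered, if anything."""
        if self._pending:
            self._commit(list(self._pending))
            self._pending.clear()


commits = []  # each entry is one simulated commit
indexer = BatchingIndexer(commits.append, batch_size=3)
for doc_id in range(7):
    indexer.add({"id": doc_id})
indexer.flush()  # push the final partial batch

batch_sizes = [len(batch) for batch in commits]
```

Seven updates produce three commits here instead of seven. A larger batch size trades freshness of the index for fewer, cheaper commits.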

Slide 64

Slide 64 text

Debugging
• Use &debugQuery=on to debug queries
• Use the browser interface!

Slide 65

Slide 65 text

Advanced Bits
• Learn & love the Solr stats page

Slide 66

Slide 66 text

Advanced Bits
• Learn & love the Solr stats page
• Replication

Slide 67

Slide 67 text

Advanced Bits
• Learn & love the Solr stats page
• Replication
• n-gram based autocomplete
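
One common way to get autocomplete from n-grams is to index every prefix (an "edge n-gram") of each term, so that a partially typed word becomes an exact-term lookup. The sketch below assumes that edge variant and uses made-up helper names and size limits; in Solr the equivalent work is done by an analyzer in the schema rather than in application code.

```python
from collections import defaultdict


def edge_ngrams(term, min_size=2, max_size=15):
    """Return the prefixes of `term` between `min_size` and `max_size` characters."""
    term = term.lower()
    return [term[:n] for n in range(min_size, min(len(term), max_size) + 1)]


def build_autocomplete_index(terms):
    """Map every edge n-gram to the set of full terms it could complete to."""
    index = defaultdict(set)
    for term in terms:
        for gram in edge_ngrams(term):
            index[gram].add(term)
    return index


ac_index = build_autocomplete_index(["haystack", "hayride", "solr"])
suggestions = sorted(ac_index.get("hay", ()))
```

The price is a larger index, since each term is stored once per prefix.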

Slide 68

Slide 68 text

Advanced Bits
• Learn & love the Solr stats page
• Replication
• n-gram based autocomplete
• Spelling suggestions
  • the (Haystack) documented config sucks

Slide 69

Slide 69 text

Advanced Bits
• Learn & love the Solr stats page
• Replication
• n-gram based autocomplete
• Spelling suggestions
  • the (Haystack) documented config sucks
• Dismax Handler

Slide 70

Slide 70 text

Resources
• https://gist.github.com/215331
• http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr
• http://www.lucidimagination.com/blog/2009/09/19/java-garbage-collection-boot-camp-draft/
• http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html
• http://wiki.apache.org/solr/SolrJmx
• http://wiki.apache.org/solr/LargeIndexes
• http://wiki.apache.org/solr/SolrPerformanceFactors

Slide 71

Slide 71 text

Resources
• http://wiki.apache.org/solr/SolrReplication
• http://www.yashh.com/blog/2010/nov/03/autocomplete-solr/
• http://charlesleifer.com/blog/search-on-djangosnippetsorg/
• http://wiki.apache.org/solr/SpellCheckComponent

Slide 72

Slide 72 text

Enough Talk. Let’s Go Work With It.

Slide 73

Slide 73 text

A Big Thanks To CMG Digital & @cmheisel For Having Me!

Slide 74

Slide 74 text

More Information
http://haystacksearch.org/
http://github.com/toastdriven/django-haystack
#haystack on irc.freenode.net
http://groups.google.com/group/django-haystack/
@daniellindsley on Twitter