Finding the Needle (DjangoCon US 2013)

Finding the Needle
Search and Django
DjangoCon 2013 - Ben Lopatin

➜ ~ whoami
Ben Lopatin (bennylope)
➜ ~ echo $HOME
Richmond, VA / Washington, DC
Principal and developer @ Wellfire Interactive

1. Understand the search problem
2. Role of the search engine
3. Nifty search features
4. Adding search with Haystack
5. Implementation strategies
6. Limitations and options

Search text content

Searching for & Looking for

Use Cases

Search Engines

A queryable inverted index of documents based on filtered tokens.

Tokenized
“These” “are” “not” “the” “droids” “you” “are” “looking” “for”

Filtered
“these” “droids” “you” “look”

Indexed
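
The tokenize, filter and index steps can be sketched in plain Python. This is a toy illustration only: the stop-word list and the suffix-stripping "stemmer" are assumptions chosen so the output matches the filtered tokens on the slide, and real engines such as Lucene analyze text far more carefully.

```python
from collections import defaultdict

# Assumed stop words, chosen to reproduce the filtered tokens on the slide.
STOP_WORDS = {"are", "not", "the", "for"}


def tokenize(text):
    """Split on whitespace (the simplest possible tokenizer)."""
    return text.split()


def filter_tokens(tokens):
    """Lowercase, drop stop words, and crudely stem by stripping 'ing'."""
    filtered = []
    for token in tokens:
        token = token.lower()
        if token in STOP_WORDS:
            continue
        if token.endswith("ing"):
            token = token[:-3]
        filtered.append(token)
    return filtered


def build_index(documents):
    """Map each filtered token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in filter_tokens(tokenize(text)):
            index[token].add(doc_id)
    return index


documents = {
    1: "These are not the droids you are looking for",
    2: "Looking for droids",
}
index = build_index(documents)
print(filter_tokens(tokenize(documents[1])))  # ['these', 'droids', 'you', 'look']
print(sorted(index["look"]))  # [1, 2]
```

Because queries pass through the same filters, a search for "Looking" and a search for "look" both resolve to the same index entry.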

Documents

• ElasticSearch (Java/Lucene)
• Solr (Java/Lucene)
• Whoosh (Python)
• Xapian (C++)
• Sphinx (C++)

Sounds like Mongo!

What about SQL full-text indexing?

Data in Data out

ANALYZERS

Tokenizer + Filters

Whitespace
N-grams
Word delimiters
Snowball

ASCII
Stemming
Lowercase
Stop words
Synonyms
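
An analyzer is a tokenizer followed by a chain of filters. A rough sketch of that composition, using lowercase and edge n-gram filters as examples; the function names are hypothetical and do not correspond to any engine's API.

```python
def whitespace_tokenizer(text):
    """Split text into tokens on runs of whitespace."""
    return text.split()


def lowercase_filter(tokens):
    """Normalize every token to lowercase."""
    return [token.lower() for token in tokens]


def edge_ngram_filter(tokens, min_gram=2, max_gram=4):
    """Emit the leading substrings of each token, min_gram to max_gram characters long."""
    grams = []
    for token in tokens:
        upper = min(max_gram, len(token))
        for size in range(min_gram, upper + 1):
            grams.append(token[:size])
    return grams


def make_analyzer(tokenizer, *filters):
    """Compose a tokenizer and any number of filters into a single callable."""
    def analyze(text):
        tokens = tokenizer(text)
        for token_filter in filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze


analyzer = make_analyzer(whitespace_tokenizer, lowercase_filter, edge_ngram_filter)
print(analyzer("Django Search"))
# ['dj', 'dja', 'djan', 'se', 'sea', 'sear']
```

Indexing the leading fragments of each word is what lets a partially typed query such as "dja" match a document containing "Django".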

Language specific

Querying

Faceting
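
Faceting amounts to counting the values of a field across the matching documents so that results can be narrowed by category. A toy sketch with made-up results; the documents and field names are hypothetical.

```python
from collections import Counter

# Hypothetical search results; in practice these come back from the engine.
results = [
    {"title": "Intro to Haystack", "category": "talk", "year": 2013},
    {"title": "Solr in production", "category": "talk", "year": 2012},
    {"title": "Whoosh quickstart", "category": "tutorial", "year": 2013},
]


def facet_counts(docs, field):
    """Count how many documents carry each value of the given field."""
    return Counter(doc[field] for doc in docs if field in doc)


def narrow(docs, field, value):
    """Keep only the documents whose field equals the chosen facet value."""
    return [doc for doc in docs if doc.get(field) == value]


print(facet_counts(results, "category").most_common())
# [('talk', 2), ('tutorial', 1)]
print([doc["title"] for doc in narrow(results, "year", 2013)])
# ['Intro to Haystack', 'Whoosh quickstart']
```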

Spell chexking

Geospatial search

Auto complete compile zone

Highlighting

Documents

Let’s talk Django

Haystack

Just Pythonic abstraction

ORM oriented

Haystack components

SearchIndex SearchQuerySet SearchForm SearchView

Take away: SearchIndex = data mapping

SearchIndex SearchQuerySet SearchForm SearchView

ElasticSearch

Mac: brew install elasticsearch
http://mxcl.github.com/homebrew/
Ubuntu/Debian: Download .deb
http://www.elasticsearch.org/download/
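
Once Elasticsearch is running, Haystack 2.x is pointed at it from the Django settings module. A minimal sketch, assuming a local node on the default port; the URL and index name are assumptions to adjust per project.

```python
# settings.py (fragment)
INSTALLED_APPS = [
    # ... your other apps ...
    "haystack",
]

HAYSTACK_CONNECTIONS = {
    "default": {
        # Elasticsearch engine class shipped with Haystack 2.x.
        "ENGINE": "haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine",
        # Assumed: Elasticsearch listening locally on its default HTTP port.
        "URL": "http://127.0.0.1:9200/",
        # Assumed index name; pick whatever suits the project.
        "INDEX_NAME": "haystack",
    },
}
```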

Let searching guide index modeling

Insert, update, remove

Haystack management commands

clear_index
update_index
rebuild_index
flags: remove, age, batch-size, using, and more

Indexing Strategies
One time
Real time
Real time-ish (queued)
Periodic

Building indexed content

Model attribute
Templates
Field method

Model Queryset interfaces

SearchIndex: default manager

    def get_model(self):
        return MyModel

SearchIndex: defined queryset

    def get_queryset(self):
        return MyModel.objects.filter(active=True)

One method to rule them all: .indexable()

Geospatial querying

    def make_point(point_string):
        """Return a Point or None from a coords string."""

    def spatial_search_view(request):
        # Clean form, build initial searchqueryset
        bottom_left = make_point(request.GET.get('bl', ''))
        top_right = make_point(request.GET.get('tr', ''))
        if bottom_left and top_right:
            # The bounding box is defined by its two opposite corners.
            queryset = queryset.within('coords', bottom_left, top_right)
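
The body of make_point is not shown on the slide. One possible sketch of the parsing step, assuming the query string carries coordinates as "lat,lng". The helper name parse_coords is hypothetical, and it returns a plain (longitude, latitude) tuple so that it runs without GeoDjango; a real view would wrap the two values in a Point.

```python
def parse_coords(point_string):
    """Parse 'lat,lng' into a (lng, lat) tuple of floats, or None if invalid."""
    try:
        lat_text, lng_text = point_string.split(",")
        lat, lng = float(lat_text), float(lng_text)
    except ValueError:
        # Wrong number of parts or non-numeric values.
        return None
    if not (-90 <= lat <= 90 and -180 <= lng <= 180):
        return None
    # GEOS-style points take x (longitude) first, then y (latitude).
    return (lng, lat)


print(parse_coords("37.54,-77.43"))  # (-77.43, 37.54)
print(parse_coords(""))              # None
print(parse_coords("not,a point"))   # None
```

Returning None for anything unparseable matches the view's `if bottom_left and top_right` guard, so a malformed query string simply skips the spatial filter.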

    queryset = queryset.distance('coords', center_point).order_by('distance')

Improving results quality

Adjusting relevance

Field boosting
Document boosting
Term boosting

Field boosting

    somefield = indexes.CharField(boost=1.2)

Document boosting

    def prepare(self, obj):
        data = super(ThisIndex, self).prepare(obj)
        data['_boost'] = 1.2
        return data

Term boosting

    sqs = SearchQuerySet().boost('banana', 1.1)
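
To make the three kinds of boost concrete, here is a toy scorer in which each boost is simply a multiplier on a term-frequency count. It is an illustration only: real engines fold boosts into considerably more involved relevance formulas, and every name below is hypothetical.

```python
def score(document, query_terms, field_boosts=None, term_boosts=None):
    """Sum term frequencies per field, scaled by field, term and document boosts."""
    field_boosts = field_boosts or {}
    term_boosts = term_boosts or {}
    total = 0.0
    for field, text in document["fields"].items():
        tokens = text.lower().split()
        for term in query_terms:
            frequency = tokens.count(term)
            total += frequency * field_boosts.get(field, 1.0) * term_boosts.get(term, 1.0)
    # The document-level boost scales the whole score.
    return total * document.get("boost", 1.0)


doc = {
    "fields": {"title": "Banana bread", "body": "Mash one banana then bake the bread"},
    "boost": 2.0,
}

plain = score({"fields": doc["fields"]}, ["banana"])
boosted = score(doc, ["banana"], field_boosts={"title": 3.0}, term_boosts={"banana": 1.5})
print(plain)    # 2.0
print(boosted)  # 12.0
```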

Log searches, results & success

Search engine as cache*

Doing More with Haystack

ElasticSearch analysis settings

    {
      "index" : {
        "analysis" : {
          "analyzer" : {
            "synonym" : {
              "tokenizer" : "whitespace",
              "filter" : ["synonyms"]
            }
          },
          "filter" : {
            "synonyms" : {
              "type" : "synonym",
              "synonyms_path" : "analysis/synonym.txt"
            }
          }
        }
      }
    }

analysis/synonym.txt

    droid, robot, android => robot
    shag, carpet, rug, wookie => rug
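
What the synonym filter does with those rules can be approximated in a few lines of Python. This sketch handles only explicit "a, b => c" rules with a single target, as in the file above; it is not how Elasticsearch implements the filter.

```python
def parse_synonyms(lines):
    """Build a token -> replacement map from 'a, b => c' rules."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=>" not in line:
            continue
        left, right = line.split("=>", 1)
        target = right.strip()
        for source in left.split(","):
            mapping[source.strip()] = target
    return mapping


def synonym_filter(tokens, mapping):
    """Replace each token by its mapped synonym, leaving others unchanged."""
    return [mapping.get(token, token) for token in tokens]


rules = [
    "droid, robot, android => robot",
    "shag, carpet, rug, wookie => rug",
]
mapping = parse_synonyms(rules)
print(synonym_filter(["these", "droid", "wookie"], mapping))
# ['these', 'robot', 'rug']
```

Since the same filter runs at index time and at query time, a search for "android" finds documents that only ever mention "droid".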

How does Haystack map the index?

    if current_mapping != self.existing_mapping:
        try:
            # Make sure the index is there first.
            self.conn.create_index(self.index_name, self.DEFAULT_SETTINGS)
            self.conn.put_mapping(self.index_name, 'modelresult', current_mapping)
            self.existing_mapping = current_mapping
        except Exception:
            if not self.silently_fail:
                raise

    if field_class.field_type in ['date', 'datetime']:
        field_mapping['type'] = 'date'
    elif field_class.field_type == 'integer':
        field_mapping['type'] = 'long'
    elif field_class.field_type == 'float':
        field_mapping['type'] = 'float'
    elif field_class.field_type == 'boolean':
        field_mapping['type'] = 'boolean'
    elif field_class.field_type == 'ngram':
        field_mapping['analyzer'] = "ngram_analyzer"
    elif field_class.field_type == 'edge_ngram':
        field_mapping['analyzer'] = "edgengram_analyzer"
    elif field_class.field_type == 'location':
        field_mapping['type'] = 'geo_point'

    # ... code skipped here

    if field_mapping['type'] == 'string' and field_class.indexed:
        field_mapping["term_vector"] = "with_positions_offsets"

        if not hasattr(field_class, 'facet_for') and not \
                field_class.field_type in ('ngram', 'edge_ngram'):
            field_mapping["analyzer"] = "snowball"

This page intentionally blank

Use Your Own Backend

New default analyzer

    # Default to None so a missing setting does not raise AttributeError.
    user_analyzer = getattr(settings, 'ELASTICSEARCH_DEFAULT_ANALYZER', None)
    if user_analyzer:
        setattr(self, 'DEFAULT_ANALYZER', user_analyzer)

    def build_schema(self, fields):
        content_field_name, mapping = super(ConfigurableElasticBackend,
                                            self).build_schema(fields)

        for field_name, field_class in fields.items():
            field_mapping = mapping[field_class.index_fieldname]

            if field_mapping['type'] == 'string' and field_class.indexed:
                if not hasattr(field_class, 'facet_for') and not \
                        field_class.field_type in ('ngram', 'edge_ngram'):
                    # Use the configured analyzer in place of the stock one.
                    field_mapping['analyzer'] = self.DEFAULT_ANALYZER
            mapping.update({field_class.index_fieldname: field_mapping})
        return (content_field_name, mapping)

New search mapping

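    from django.conf import settings
    from haystack.backends.elasticsearch_backend import ElasticsearchSearchBackend

    class ConfigurableElasticBackend(ElasticsearchSearchBackend):

        def __init__(self, connection_alias, **connection_options):
            super(ConfigurableElasticBackend, self).__init__(
                connection_alias, **connection_options)
            # Default to None so a missing setting does not raise AttributeError.
            user_settings = getattr(settings, 'ELASTICSEARCH_INDEX_SETTINGS', None)
            if user_settings:
                setattr(self, 'DEFAULT_SETTINGS', user_settings)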
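
With the backend reading these two settings, the synonym analysis configuration shown earlier can live in the Django settings module. A sketch using the setting names from the snippets above. The nesting under a top-level "settings" key mirrors the shape of Haystack's built-in DEFAULT_SETTINGS; treat it as an assumption to check against the installed Haystack and Elasticsearch versions.

```python
# settings.py (fragment)

# Replaces the backend's DEFAULT_SETTINGS when the index is created.
ELASTICSEARCH_INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "analyzer": {
                "synonym": {
                    "tokenizer": "whitespace",
                    "filter": ["synonyms"],
                },
            },
            "filter": {
                "synonyms": {
                    "type": "synonym",
                    "synonyms_path": "analysis/synonym.txt",
                },
            },
        },
    },
}

# Analyzer the backend applies to indexed string fields in place of snowball.
ELASTICSEARCH_DEFAULT_ANALYZER = "synonym"
```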

Pick analyzers by field

    def build_schema(self, fields):
        content_field_name, mapping = super(ConfigurableElasticBackend,
                                            self).build_schema(fields)

        for field_name, field_class in fields.items():
            field_mapping = mapping[field_class.index_fieldname]

            if field_mapping['type'] == 'string' and field_class.indexed:
                if not hasattr(field_class, 'facet_for') and not \
                        field_class.field_type in ('ngram', 'edge_ngram'):
                    # Fall back to the default when the field has no analyzer
                    # attribute or the attribute is None.
                    field_mapping['analyzer'] = getattr(
                        field_class, 'analyzer', None) or self.DEFAULT_ANALYZER
            mapping.update({field_class.index_fieldname: field_mapping})
        return (content_field_name, mapping)

    from haystack.fields import CharField as BaseCharField

    class ConfigurableFieldMixin(object):

        def __init__(self, **kwargs):
            self.analyzer = kwargs.pop('analyzer', None)
            super(ConfigurableFieldMixin, self).__init__(**kwargs)

    class CharField(ConfigurableFieldMixin, BaseCharField):
        pass
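
The mixin works because it removes its own keyword before the base field's __init__ ever sees it. A runnable sketch of the same pattern, using a stand-in base class instead of Haystack's CharField so that it runs without Haystack installed; StubCharField and its arguments are hypothetical.

```python
class StubCharField(object):
    """Stand-in for haystack.fields.CharField; accepts only known keywords."""

    def __init__(self, model_attr=None, boost=1.0):
        self.model_attr = model_attr
        self.boost = boost


class ConfigurableFieldMixin(object):
    def __init__(self, **kwargs):
        # Pop 'analyzer' first so the base class never sees an unknown keyword.
        self.analyzer = kwargs.pop("analyzer", None)
        super(ConfigurableFieldMixin, self).__init__(**kwargs)


class CharField(ConfigurableFieldMixin, StubCharField):
    pass


title = CharField(model_attr="title", analyzer="synonym")
body = CharField(model_attr="body")
print(title.analyzer, title.model_attr)  # synonym title
print(body.analyzer)                     # None
```

The mixin must come first in the base class list: Python's method resolution order then runs its __init__ before the field's own.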

SearchView: Non-CBV CBV

Some gotchas

Debugging search issues

Search is hard

THE END
Search and Django
DjangoCon 2013 - Ben Lopatin
ciafactbook.herokuapp.com
tinyurl.com/finding-the-needle
