Finding the Needle (DjangoCon US 2013)

Ben Lopatin
September 03, 2013

The ability to find content on a site is important to users, and there are some great tools that make a sometimes tricky problem a lot simpler. This talk will address the search problem, introduce a few of the tools at your disposal, and provide a way to get started using Django Haystack.

The goal is for developers new to search to understand what's different about using a search engine as an additional service, to be aware of some of the "gotchas", and to know not just what's possible but how to get started.

Transcript

  1. Finding the Needle: Search and Django. DjangoCon 2013 - Ben Lopatin
  2. ➜ ~ whoami
     Ben Lopatin (bennylope)
     ➜ ~ echo $HOME
     Richmond, VA / Washington, DC
     Principal and developer @ Wellfire Interactive
  3. 1. Understand the search problem
     2. Role of the search engine
     3. Nifty search features
     4. Adding search with Haystack
     5. Implementation strategies
     6. Limitations and options
  4. Search text content

  5. Searching for & Looking for

  6. Use Cases

  7. None
  8. None
  9. None
  10. Search Engines

  11. A queryable inverted index of documents based on filtered tokens.

  12. Tokenized “These” “are” “not” “the” “droids” “you” “are” “looking” “for”

  13. Filtered “these” “droids” “you” “look” “These” “are” “not” “the” “droids”

    “you” “are” “looking” “for”
  14. Indexed
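
The tokenize, filter, and index steps on the preceding slides can be sketched in plain Python. This is only a toy illustration, not how any of the engines in this talk are implemented: the stop-word list, the crude `naive_stem` suffix stripper, and all function names are assumptions made for the example.

```python
import re
from collections import defaultdict

# Assumed stop-word list, chosen to reproduce the slide's filtered output.
STOP_WORDS = {"are", "not", "the", "for"}


def tokenize(text):
    """Split raw text into word tokens."""
    return re.findall(r"[A-Za-z0-9]+", text)


def naive_stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer such as Snowball."""
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]
    return token


def analyze(text):
    """Lowercase, drop stop words, then stem."""
    tokens = [t.lower() for t in tokenize(text)]
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


def build_index(documents):
    """Map each filtered token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every analyzed query term."""
    terms = analyze(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

Because the query goes through the same analysis as the documents, a search for "looking" matches a document that was indexed under "look".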

  15. Documents

  16. •ElasticSearch (Java/Lucene) •Solr (Java/Lucene) •Whoosh (Python) •Xapian (C++) •Sphinx (C++)

  17. Sounds like Mongo!

  18. What about SQL full-text indexing?

  19. Data in Data out

  20. ANALYZERS
  21. Tokenizer + Filters

  22. Whitespace N-grams Word delimiters Snowball

  23. ASCII Stemming Lowercase Stop words Synonyms
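
One way to picture "tokenizer + filters" is as function composition: a tokenizer produces tokens and each filter transforms the token list in turn. The sketch below is a simplified, assumed model (the names `make_analyzer`, `edge_ngram_filter`, and so on are invented for illustration and are not an Elasticsearch or Haystack API).

```python
def whitespace_tokenizer(text):
    """Tokenizer: split on runs of whitespace."""
    return text.split()


def lowercase_filter(tokens):
    """Filter: normalize case."""
    return [t.lower() for t in tokens]


def stop_word_filter(stop_words):
    """Build a filter that drops the given stop words."""
    def apply(tokens):
        return [t for t in tokens if t not in stop_words]
    return apply


def edge_ngram_filter(min_gram, max_gram):
    """Build a filter that emits prefixes of each token (useful for autocomplete)."""
    def apply(tokens):
        grams = []
        for token in tokens:
            for n in range(min_gram, min(max_gram, len(token)) + 1):
                grams.append(token[:n])
        return grams
    return apply


def make_analyzer(tokenizer, *filters):
    """Compose one tokenizer with an ordered chain of filters."""
    def analyze(text):
        tokens = tokenizer(text)
        for token_filter in filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze


autocomplete_analyzer = make_analyzer(
    whitespace_tokenizer,
    lowercase_filter,
    stop_word_filter({"the"}),
    edge_ngram_filter(2, 4),
)
```

Filter order matters here: lowercasing must run before the stop-word filter, or "The" would slip through.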

  24. Language specific

  25. Querying

  26. Faceting

  27. Spell chexking

  28. Geospatial search

  29. Auto complete compile zone

  30. Highlighting

  31. Documents

  32. Let’s talk Django

  33. Haystack

  34. Just Pythonic abstraction

  35. ORM oriented

  36. Haystack components

  37. SearchIndex SearchQuerySet SearchForm SearchView

  39. SearchIndex = data mapping Take Away

  43. ElasticSearch

  44. Mac: brew install elasticsearch (http://mxcl.github.com/homebrew/)
      Ubuntu/Debian: download the .deb from http://www.elasticsearch.org/download/

  45. Let searching guide index modeling

  46. Insert, update, remove

  47. Haystack management commands

  48. clear_index update_index rebuild_index flags: remove, age, batch-size, using, and more

  49. Indexing Strategies One time Real time Real time-ish (queued) Periodic
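
The queued ("real time-ish") strategy can be sketched as a buffer of pending changes that a worker drains in batches. This is a minimal sketch under stated assumptions: `IndexQueue` is a hypothetical class, and a plain dict stands in for the search backend; a real setup would typically hand this work to a task queue.

```python
from collections import deque


class IndexQueue:
    """Hypothetical buffer of index changes, flushed in batches."""

    def __init__(self, backend, batch_size=100):
        self.backend = backend        # dict standing in for the search index
        self.batch_size = batch_size
        self.pending = deque()

    def enqueue(self, action, doc_id):
        """Record an 'update' or 'remove' instead of touching the index now."""
        self.pending.append((action, doc_id))

    def flush(self):
        """Apply up to batch_size pending changes; return how many were applied."""
        processed = 0
        while self.pending and processed < self.batch_size:
            action, doc_id = self.pending.popleft()
            if action == "update":
                self.backend[doc_id] = f"indexed:{doc_id}"
            elif action == "remove":
                self.backend.pop(doc_id, None)
            processed += 1
        return processed
```

The request that saves an object only pays for `enqueue`; the cost of talking to the search engine moves to whatever process calls `flush`.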

  54. Building indexed content

  55. Model attribute Templates Field method

  58. Model Queryset interfaces

  59. SearchIndex: default manager

        def get_model(self):
            return MyModel

  60. SearchIndex: defined queryset

        def get_queryset(self):
            return MyModel.objects.filter(active=True)

  61. One method to rule them all: .indexable()

  62. Geospatial querying

  63. def make_point(point_string):
          # Returns a Point or None from coords string

      def spatial_search_view(request):
          # Clean form, build initial searchqueryset
          bottom_left = make_point(request.GET.get('bl', ''))
          top_right = make_point(request.GET.get('tr', ''))
          if bottom_left and top_right:
              queryset = queryset.within('coords', bottom_left, top_right)
  64. queryset = queryset.distance('coords', center_point).order_by('distance')
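
What a bounding-box filter and a distance ordering compute can be shown without a search engine. The sketch below is a plain-Python approximation under stated assumptions: points are hypothetical `(name, lat, lng)` tuples, the box test ignores boxes that cross the antimeridian, and distance uses the haversine formula on a spherical Earth. It is not how Haystack or its backends implement these queries.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def within(points, bottom_left, top_right):
    """Keep (name, lat, lng) tuples inside the box given by two (lat, lng) corners."""
    min_lat, min_lng = bottom_left
    max_lat, max_lng = top_right
    return [
        p for p in points
        if min_lat <= p[1] <= max_lat and min_lng <= p[2] <= max_lng
    ]


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometers between two (lat, lng) pairs."""
    dlat = radians(lat2 - lat1)
    dlng = radians(lng2 - lng1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlng / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def order_by_distance(points, center):
    """Sort points by distance from a (lat, lng) center, nearest first."""
    return sorted(
        points,
        key=lambda p: haversine_km(center[0], center[1], p[1], p[2]),
    )
```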

  65. Improving results quality

  66. Adjusting relevance

  67. Field boosting Document boosting Term boosting

  68. Field boosting Document boosting Term boosting

        somefield = indexes.CharField(boost=1.2)

  69. Field boosting Document boosting Term boosting

        def prepare(self, obj):
            data = super(ThisIndex, self).prepare(obj)
            data['_boost'] = 1.2
            return data
  70. Field boosting Document boosting Term boosting

        sqs = SearchQuerySet().boost('banana', 1.1)
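
To see how the three kinds of boost interact, here is a deliberately simplified scoring model: term frequency multiplied by a field boost and a term boost, with the total scaled by a document boost. Real engines score very differently (Lucene's relevance formula has more factors), and the `score` function and its document layout are assumptions made for this sketch.

```python
def score(doc, query_terms, field_boosts=None, term_boosts=None):
    """Toy relevance score: boosted term counts times the document boost."""
    field_boosts = field_boosts or {}
    term_boosts = term_boosts or {}
    total = 0.0
    for field, text in doc["fields"].items():
        tokens = text.lower().split()
        for term in query_terms:
            frequency = tokens.count(term)
            total += (frequency
                      * field_boosts.get(field, 1.0)
                      * term_boosts.get(term, 1.0))
    # Document boost scales the whole document up or down.
    return total * doc.get("boost", 1.0)


doc_a = {"fields": {"title": "banana bread", "body": "a bread recipe"}}
doc_b = {"fields": {"title": "bread", "body": "banana banana"}}
```

Without boosts `doc_b` outranks `doc_a` for "banana"; boosting the `title` field by 3 reverses that order, which is the effect field boosting is meant to have.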

  71. Log searches, results & success

  72. Search engine as cache*

  73. Doing More with Haystack

  74. ElasticSearch analysis settings

        {
          "index" : {
            "analysis" : {
              "analyzer" : {
                "synonym" : {
                  "tokenizer" : "whitespace",
                  "filter" : ["synonyms"]
                }
              },
              "filter" : {
                "synonyms" : {
                  "type" : "synonym",
                  "synonyms_path" : "analysis/synonym.txt"
                }
              }
            }
          }
        }
  79. analysis/synonym.txt

        droid, robot, android => robot
        shag, carpet, rug, wookie => rug
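
The effect of the synonym rules above can be sketched with a small parser and token filter. This is an illustrative approximation only: `parse_synonyms` and `synonym_filter` are hypothetical helpers, and the sketch handles just single-word `a, b => c` rules, not the full synonym file format that Elasticsearch accepts.

```python
def parse_synonyms(lines):
    """Turn 'a, b, c => x' rules into a {term: replacement} mapping."""
    mapping = {}
    for line in lines:
        line = line.strip()
        # Skip blanks, comments, and rule forms this sketch does not handle.
        if not line or line.startswith("#") or "=>" not in line:
            continue
        left, right = line.split("=>", 1)
        target = right.strip()
        for term in left.split(","):
            mapping[term.strip()] = target
    return mapping


def synonym_filter(tokens, mapping):
    """Replace each token with its mapped synonym, if one exists."""
    return [mapping.get(token, token) for token in tokens]


rules = [
    "droid, robot, android => robot",
    "shag, carpet, rug, wookie => rug",
]
synonyms = parse_synonyms(rules)
```

With these rules both "droid" and "android" are indexed and queried as "robot", so either word finds the same documents.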
  84. How does Haystack map the index?

  85. if current_mapping != self.existing_mapping:
          try:
              # Make sure the index is there first.
              self.conn.create_index(self.index_name, self.DEFAULT_SETTINGS)
              self.conn.put_mapping(self.index_name, 'modelresult', current_mapping)
              self.existing_mapping = current_mapping
          except Exception:
              if not self.silently_fail:
                  raise
  86. if field_class.field_type in ['date', 'datetime']:
          field_mapping['type'] = 'date'
      elif field_class.field_type == 'integer':
          field_mapping['type'] = 'long'
      elif field_class.field_type == 'float':
          field_mapping['type'] = 'float'
      elif field_class.field_type == 'boolean':
          field_mapping['type'] = 'boolean'
      elif field_class.field_type == 'ngram':
          field_mapping['analyzer'] = "ngram_analyzer"
      elif field_class.field_type == 'edge_ngram':
          field_mapping['analyzer'] = "edgengram_analyzer"
      elif field_class.field_type == 'location':
          field_mapping['type'] = 'geo_point'

      # ... code skipped here

      if field_mapping['type'] == 'string' and field_class.indexed:
          field_mapping["term_vector"] = "with_positions_offsets"

          if not hasattr(field_class, 'facet_for') and not \
                  field_class.field_type in ('ngram', 'edge_ngram'):
              field_mapping["analyzer"] = "snowball"
  87. This page intentionally blank

  88. Use Your Own Backend

  89. New default analyzer

  90. # Default to None so a missing setting does not raise AttributeError.
      user_analyzer = getattr(settings, 'ELASTICSEARCH_DEFAULT_ANALYZER', None)
      if user_analyzer:
          setattr(self, 'DEFAULT_ANALYZER', user_analyzer)

  91. def build_schema(self, fields):
          content_field_name, mapping = super(
              ConfigurableElasticBackend, self).build_schema(fields)

          for field_name, field_class in fields.items():
              field_mapping = mapping[field_class.index_fieldname]

              if field_mapping['type'] == 'string' and field_class.indexed:
                  if not hasattr(field_class, 'facet_for') and not \
                          field_class.field_type in ('ngram', 'edge_ngram'):
                      field_mapping['analyzer'] = self.DEFAULT_ANALYZER
              mapping.update({field_class.index_fieldname: field_mapping})
          return (content_field_name, mapping)
  92. New search mapping

  93. class ConfigurableElasticBackend(ElasticsearchSearchBackend):

          def __init__(self, connection_alias, **connection_options):
              super(ConfigurableElasticBackend, self).__init__(
                  connection_alias, **connection_options)
              # Default to None so a missing setting does not raise AttributeError.
              user_settings = getattr(settings, 'ELASTICSEARCH_INDEX_SETTINGS', None)
              if user_settings:
                  setattr(self, 'DEFAULT_SETTINGS', user_settings)
  94. Pick analyzers by field

  95. def build_schema(self, fields):
          content_field_name, mapping = super(
              ConfigurableElasticBackend, self).build_schema(fields)

          for field_name, field_class in fields.items():
              field_mapping = mapping[field_class.index_fieldname]

              if field_mapping['type'] == 'string' and field_class.indexed:
                  if not hasattr(field_class, 'facet_for') and not \
                          field_class.field_type in ('ngram', 'edge_ngram'):
                      # Fall back to the default when the field has no analyzer
                      # attribute or the attribute is None.
                      field_mapping['analyzer'] = getattr(
                          field_class, 'analyzer', None) or self.DEFAULT_ANALYZER
              mapping.update({field_class.index_fieldname: field_mapping})
          return (content_field_name, mapping)
  96. from haystack.fields import CharField as BaseCharField

      class ConfigurableFieldMixin(object):

          def __init__(self, **kwargs):
              self.analyzer = kwargs.pop('analyzer', None)
              super(ConfigurableFieldMixin, self).__init__(**kwargs)

      class CharField(ConfigurableFieldMixin, BaseCharField):
          pass
  97. SearchView: Non-CBV CBV

  98. Some gotchas

  99. Debugging search issues

  100. Search is hard

  101. THE END. Search and Django. DjangoCon 2013 - Ben Lopatin
       ciafactbook.herokuapp.com
       tinyurl.com/finding-the-needle