Finding the Needle (DjangoCon US 2013)

Ben Lopatin
September 03, 2013

The ability to find content on a site is important to users, and there are some great tools that make a sometimes tricky problem a lot simpler. This talk will address the search problem, introduce a few of the tools at your disposal, and provide a way to get started using Django Haystack.

The goal is for developers new to search to understand what's different about using a search engine as an additional service, to be aware of some of the "gotchas", and to know not just what's possible but how to get started.

Transcript

  1. Finding the Needle
    Search and Django
    DjangoCon 2013 - Ben Lopatin

  2. ➜ ~ whoami
    Ben Lopatin (bennylope)
    ➜ ~ echo $HOME
    Richmond, VA / Washington, DC
    Principal and developer @
    Wellfire Interactive

  3. 1. Understand the search problem
    2. Role of the search engine
    3. Nifty search features
    4. Adding search with Haystack
    5. Implementation strategies
    6. Limitations and options

  4. Search text
    content

  5. Searching for
    &
    Looking for

  6. Use Cases

  10. Search Engines

  11. A queryable inverted index of
    documents based on filtered
    tokens.

  12. Tokenized
    “These” “are” “not” “the” “droids” “you” “are”
    “looking” “for”

  13. Filtered
    “these” “droids” “you” “look”
    “These” “are” “not” “the” “droids” “you” “are”
    “looking” “for”
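
The tokenize-then-filter steps on these two slides can be sketched in plain Python. This is a minimal illustration, not how any particular engine implements analysis: the stop-word list and the `stem` helper are assumptions chosen only to reproduce the slide's example output.

```python
import re

# Hypothetical stop-word list, chosen only to reproduce the slide's example.
STOP_WORDS = {"are", "not", "the", "for"}


def tokenize(text):
    """Split text into word tokens on non-word characters."""
    return re.findall(r"\w+", text)


def stem(token):
    """Toy stemmer: strip a trailing 'ing'. Real engines use e.g. Snowball."""
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]
    return token


def analyze(text):
    """Tokenize, lowercase, drop stop words, then stem what is left."""
    tokens = [token.lower() for token in tokenize(text)]
    return [stem(token) for token in tokens if token not in STOP_WORDS]


print(analyze("These are not the droids you are looking for"))
```

The same `analyze` function has to run on both the indexed text and the query, otherwise the filtered tokens on each side would not line up.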

  14. Indexed
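
A minimal sketch of what "indexed" means here, assuming a whitespace tokenizer and AND semantics for multi-word queries (both simplifications): an inverted index maps each token to the documents that contain it. The function names and sample documents are illustrative.

```python
from collections import defaultdict


def build_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every token in the query."""
    postings = [index.get(token, set()) for token in query.lower().split()]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "these are not the droids",
    2: "the droids are in the desert",
    3: "looking for a needle",
}
index = build_index(docs)
print(search(index, "the droids"))
```

Looking up a token is a dictionary access rather than a scan over every document, which is the point of inverting the index.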

  15. Documents

  16. • ElasticSearch (Java/Lucene)
    • Solr (Java/Lucene)
    • Whoosh (Python)
    • Xapian (C++)
    • Sphinx (C++)

  17. Sounds like
    Mongo!

  18. What about
    SQL full-text
    indexing?

  19. Data in
    Data out

  20. ANALYZERS

  21. Tokenizer +
    Filters

  22. Whitespace
    N-grams
    Word delimiters
    Snowball

  23. ASCII
    Stemming
    Lowercase
    Stop words
    Synonyms
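
As a rough illustration of "tokenizer + filters", an analyzer can be modeled as a tokenizer followed by a chain of token filters. The helper names below (`make_analyzer`, `ascii_fold`, and so on) are hypothetical, and only a whitespace tokenizer, a lowercase filter, and an ASCII folding filter are sketched.

```python
import unicodedata


def whitespace_tokenizer(text):
    """Tokenizer: split on runs of whitespace."""
    return text.split()


def lowercase(tokens):
    """Filter: normalize case so 'Django' matches 'django'."""
    return [token.lower() for token in tokens]


def ascii_fold(tokens):
    """Filter: strip accents so 'brûlée' matches 'brulee'."""
    folded = []
    for token in tokens:
        decomposed = unicodedata.normalize("NFKD", token)
        folded.append("".join(ch for ch in decomposed
                              if not unicodedata.combining(ch)))
    return folded


def make_analyzer(tokenizer, filters):
    """Build an analyzer that runs the tokenizer, then each filter in order."""
    def analyze(text):
        tokens = tokenizer(text)
        for token_filter in filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze


analyzer = make_analyzer(whitespace_tokenizer, [lowercase, ascii_fold])
print(analyzer("Crème Brûlée"))
```

Filter order matters in a chain like this: a stop-word filter placed before lowercasing would miss "The", for example.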

  24. Language
    specific

  25. Querying

  26. Faceting
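
Faceting amounts to counting how many matching documents fall under each value of a field. A minimal sketch, assuming results are plain dictionaries and using a hypothetical `category` field:

```python
from collections import Counter


def facet_counts(results, field):
    """Count how many results carry each value of the given field."""
    return Counter(doc[field] for doc in results if field in doc)


results = [
    {"title": "Intro to Haystack", "category": "django"},
    {"title": "Custom backends", "category": "django"},
    {"title": "Analyzers explained", "category": "search"},
]
print(facet_counts(results, "category"))
```

A search engine computes these counts alongside the query, so the application does not have to fetch every result to build "django (2), search (1)" style navigation.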

  27. Spell chexking
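
One way to picture "did you mean" suggestions is fuzzy matching a query term against known terms. This sketch uses the standard library's `difflib` and a hypothetical vocabulary; search engines typically build suggestions from the indexed terms with their own algorithms.

```python
import difflib

# Hypothetical vocabulary; a real engine would draw on the indexed terms.
VOCABULARY = ["checking", "search", "droids", "django"]


def suggest(word, vocabulary=VOCABULARY, cutoff=0.6):
    """Return the closest known term, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(word.lower(), vocabulary,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(suggest("chexking"))
```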

  28. Geospatial
    search

  29. Auto
    complete
    compile
    zone
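
Autocomplete is commonly backed by edge n-grams: every prefix of a term is indexed, so a partial query matches directly. A minimal sketch with hypothetical helper names and size limits:

```python
from collections import defaultdict


def edge_ngrams(token, min_size=2, max_size=15):
    """Return the prefixes of token from min_size up to max_size characters."""
    upper = min(len(token), max_size)
    return [token[:size] for size in range(min_size, upper + 1)]


def build_prefix_index(terms):
    """Map every edge n-gram to the set of terms that start with it."""
    index = defaultdict(set)
    for term in terms:
        for gram in edge_ngrams(term.lower()):
            index[gram].add(term)
    return index


prefix_index = build_prefix_index(["complete", "compile", "zone"])
print(sorted(prefix_index["comp"]))
```

The trade-off is index size: each term is stored once per prefix, which is why engines cap the minimum and maximum gram length.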

  30. Highlighting
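
Highlighting wraps the matched terms in markup so the user can see why a result was returned. A minimal sketch, assuming a single literal term and `<em>` tags (both assumptions; engines highlight analyzed tokens and return fragments rather than whole fields):

```python
import re


def highlight(text, term, before="<em>", after="</em>"):
    """Wrap every case-insensitive occurrence of term in the given markup."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return pattern.sub(lambda match: before + match.group(0) + after, text)


print(highlight("These are not the droids you are looking for", "droids"))
```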

  31. Documents

  32. Let’s talk
    Django

  33. Haystack

  34. Just Pythonic
    abstraction

  35. ORM oriented

  36. Haystack
    components

  37. SearchIndex
    SearchQuerySet
    SearchForm
    SearchView

  39. Takeaway: SearchIndex = data mapping

  43. ElasticSearch

  44. Mac:
    brew install elasticsearch
    http://mxcl.github.com/homebrew/
    Ubuntu/Debian:
    Download .deb
    http://www.elasticsearch.org/download/

  45. Let searching
    guide index
    modeling

  46. Insert, update,
    remove

  47. Haystack
    management
    commands

  48. clear_index
    update_index
    rebuild_index
    flags:
    remove, age, batch-size, using,
    and more

  49. Indexing Strategies
    One time
    Real time
    Real time-ish (queued)
    Periodic
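
The "real time-ish (queued)" strategy can be pictured as recording index changes when models are saved or deleted and applying them later in a batch. This is a sketch only: the `QueuedIndexer` class is hypothetical, a dictionary stands in for the search index, and a production setup would use a task queue rather than an in-process one.

```python
import queue


class QueuedIndexer:
    """Collect index changes now, apply them to the index later."""

    def __init__(self, index):
        self.index = index            # dict standing in for the search index
        self.pending = queue.Queue()  # changes waiting to be applied

    def enqueue(self, action, doc_id, document=None):
        """Record an 'update' or 'delete' without touching the index."""
        self.pending.put((action, doc_id, document))

    def flush(self):
        """Apply all pending changes in order; return how many were applied."""
        processed = 0
        while True:
            try:
                action, doc_id, document = self.pending.get_nowait()
            except queue.Empty:
                return processed
            if action == "update":
                self.index[doc_id] = document
            elif action == "delete":
                self.index.pop(doc_id, None)
            processed += 1


search_index = {}
indexer = QueuedIndexer(search_index)
indexer.enqueue("update", 1, {"title": "Finding the Needle"})
indexer.enqueue("update", 2, {"title": "Search and Django"})
indexer.enqueue("delete", 1)
```

Until `flush()` runs the index is stale, which is the cost of keeping indexing work out of the request that saved the model.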

  54. Building
    indexed
    content

  55. Model attribute
    Templates
    Field method
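
In Haystack these three options correspond to a field's `model_attr`, a `use_template=True` field rendered from a template, and a `prepare_<fieldname>` method on the index. The plain-Python sketch below only imitates the idea so it runs without Django: the `article` object, the template string, and the helper names are all assumptions.

```python
from string import Template
from types import SimpleNamespace

# Hypothetical stand-in for a Django model instance.
article = SimpleNamespace(
    title="Finding the Needle",
    body="Search and Django",
    tags=["search", "django"],
)


def from_attribute(obj, name):
    """Model attribute: copy a single attribute straight into the field."""
    return getattr(obj, name)


DOCUMENT_TEMPLATE = Template("$title\n$body")


def from_template(obj):
    """Template: render several attributes into one block of text."""
    return DOCUMENT_TEMPLATE.substitute(title=obj.title, body=obj.body)


def prepare_tags(obj):
    """Field method: compute the value with arbitrary code."""
    return ", ".join(sorted(obj.tags))


document = {
    "title": from_attribute(article, "title"),
    "text": from_template(article),
    "tags": prepare_tags(article),
}
print(document)
```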

  58. Model
    Queryset
    interfaces

  59. SearchIndex
    def get_model(self):
        return MyModel
    Default manager

  60. SearchIndex
    def get_queryset(self):
        return MyModel.objects.filter(active=True)
    Defined queryset

  61. One method to
    rule them all:
    .indexable()

  62. Geospatial
    querying

  63. def make_point(point_string):
        # Returns a Point or None from coords string
    def spatial_search_view(request):
        # Clean form, build initial searchqueryset
        bottom_left = make_point(request.GET.get('bl', ''))
        top_right = make_point(request.GET.get('tr', ''))
        if bottom_left and top_right:
            queryset = queryset.within('coords', bottom_left, top_right)
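
For readers without GeoDjango at hand, the bounding-box idea can be sketched in plain Python. Here `make_point` parses a hypothetical `lat,lng` string into a tuple rather than a `Point`, and `within` is a stand-in for the engine-side filter that ignores boxes crossing the antimeridian.

```python
def make_point(point_string):
    """Parse 'lat,lng' into a (lat, lng) tuple of floats, or None."""
    try:
        lat, lng = (float(part) for part in point_string.split(","))
    except ValueError:
        return None
    return (lat, lng)


def within(point, bottom_left, top_right):
    """True if point lies inside the box spanned by the two corners."""
    return (bottom_left[0] <= point[0] <= top_right[0]
            and bottom_left[1] <= point[1] <= top_right[1])


bottom_left = make_point("36.0,-79.0")
top_right = make_point("39.0,-76.0")
richmond = make_point("37.54,-77.44")
new_york = make_point("40.71,-74.01")
print(within(richmond, bottom_left, top_right))
```

Returning `None` for missing or malformed input is what lets the view skip the spatial filter when the `bl` and `tr` parameters are absent.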

  64. queryset = queryset.distance('coords', center_point).order_by('distance')
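
Ordering by distance means computing each document's distance from a center point and sorting on it. The search engine does this itself; the sketch below only illustrates the idea with the haversine formula, using approximate city coordinates and hypothetical helper names.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lng) points."""
    lat1, lng1 = map(math.radians, a)
    lat2, lng2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlng = lng2 - lng1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlng / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def order_by_distance(places, center):
    """Return place names sorted from nearest to farthest from center."""
    return sorted(places, key=lambda name: haversine_km(places[name], center))


# Approximate coordinates, for illustration only.
places = {
    "New York": (40.7128, -74.0060),
    "Washington": (38.9072, -77.0369),
    "Richmond": (37.5407, -77.4360),
}
print(order_by_distance(places, center=places["Richmond"]))
```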

  65. Improving
    results quality

  66. Adjusting
    relevance

  67. Field boosting
    Document boosting
    Term boosting

  68. Field boosting
    Document boosting
    Term boosting
    somefield = indexes.CharField(boost=1.2)

  69. Field boosting
    Document boosting
    Term boosting
    def prepare(self, obj):
        data = super(ThisIndex, self).prepare(obj)
        data['_boost'] = 1.2
        return data

  70. Field boosting
    Document boosting
    Term boosting
    sqs = SearchQuerySet().boost('banana', 1.1)
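
To see how field and term boosts shift ranking, here is a deliberately naive scorer: it counts term matches per field and multiplies by the boosts. Real engines compute relevance quite differently (for example with TF-IDF style scoring), so treat the function and the sample documents as assumptions for illustration.

```python
def score(document, query_terms, field_boosts=None, term_boosts=None):
    """Sum term matches per field, weighted by field and term boosts."""
    field_boosts = field_boosts or {}
    term_boosts = term_boosts or {}
    total = 0.0
    for field, text in document.items():
        tokens = text.lower().split()
        for term in query_terms:
            matches = tokens.count(term)
            total += (matches
                      * field_boosts.get(field, 1.0)
                      * term_boosts.get(term, 1.0))
    return total


doc_a = {"title": "banana bread", "body": "a simple recipe"}
doc_b = {"title": "fruit salad", "body": "banana banana apple"}

# Without boosts the body-heavy document wins; boosting the title flips it.
print(score(doc_a, ["banana"]), score(doc_b, ["banana"]))
print(score(doc_a, ["banana"], field_boosts={"title": 3.0}),
      score(doc_b, ["banana"], field_boosts={"title": 3.0}))
```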

  71. Log searches,
    results &
    success
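
A minimal sketch of search logging, assuming "success" is approximated by whether a query returned any results (an assumption; click-throughs would be a better signal). The `SearchLog` class is hypothetical and keeps everything in memory.

```python
from collections import Counter


class SearchLog:
    """Track how often each query is run and how often it finds nothing."""

    def __init__(self):
        self.queries = Counter()
        self.zero_results = Counter()

    def record(self, query, result_count):
        """Normalize the query and record whether it returned results."""
        normalized = query.strip().lower()
        self.queries[normalized] += 1
        if result_count == 0:
            self.zero_results[normalized] += 1

    def failing_queries(self, limit=10):
        """Return the queries that most often produced no results."""
        return self.zero_results.most_common(limit)


log = SearchLog()
log.record("Droids", 3)
log.record("droids ", 0)
log.record("wookie", 0)
log.record("wookie", 0)
print(log.failing_queries(1))
```

Queries that repeatedly return nothing are good candidates for synonyms or spelling suggestions.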

  72. Search engine
    as cache*

  73. Doing More
    with Haystack

  74. ElasticSearch analysis settings
    {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "synonym" : {
                        "tokenizer" : "whitespace",
                        "filter" : ["synonyms"]
                    }
                },
                "filter" : {
                    "synonyms" : {
                        "type" : "synonym",
                        "synonyms_path" : "analysis/synonym.txt"
                    }
                }
            }
        }
    }

  79. analysis/synonym.txt
    droid, robot, android => robot
    shag, carpet, rug, wookie => rug
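
What the `a, b => c` lines in this file do can be sketched as a token filter: every term on the left is rewritten to the term on the right. The parser below is an illustration under assumptions, handling only explicit `=>` mappings with single-word targets; the function names are hypothetical.

```python
def parse_synonyms(lines):
    """Build a {source term: replacement} map from 'a, b => c' lines."""
    mapping = {}
    for line in lines:
        line = line.strip()
        # This sketch skips blanks, comments, and lines without '=>'.
        if not line or line.startswith("#") or "=>" not in line:
            continue
        sources, _, target = line.partition("=>")
        for source in sources.split(","):
            mapping[source.strip()] = target.strip()
    return mapping


def apply_synonyms(tokens, mapping):
    """Replace each token that has a mapping; leave the rest untouched."""
    return [mapping.get(token, token) for token in tokens]


SYNONYM_LINES = [
    "droid, robot, android => robot",
    "shag, carpet, rug, wookie => rug",
]
synonyms = parse_synonyms(SYNONYM_LINES)
print(apply_synonyms(["droid", "wookie", "ship"], synonyms))
```

Because the filter runs at both index time and query time, a search for "android" and a document mentioning "droid" both end up as the token "robot".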

  84. How does
    Haystack map
    the index?

  85. if current_mapping != self.existing_mapping:
        try:
            # Make sure the index is there first.
            self.conn.create_index(self.index_name,
                                   self.DEFAULT_SETTINGS)
            self.conn.put_mapping(self.index_name, 'modelresult',
                                  current_mapping)
            self.existing_mapping = current_mapping
        except Exception:
            if not self.silently_fail:
                raise

  86. if field_class.field_type in ['date', 'datetime']:
        field_mapping['type'] = 'date'
    elif field_class.field_type == 'integer':
        field_mapping['type'] = 'long'
    elif field_class.field_type == 'float':
        field_mapping['type'] = 'float'
    elif field_class.field_type == 'boolean':
        field_mapping['type'] = 'boolean'
    elif field_class.field_type == 'ngram':
        field_mapping['analyzer'] = "ngram_analyzer"
    elif field_class.field_type == 'edge_ngram':
        field_mapping['analyzer'] = "edgengram_analyzer"
    elif field_class.field_type == 'location':
        field_mapping['type'] = 'geo_point'
    # ... code skipped here
    if field_mapping['type'] == 'string' and field_class.indexed:
        field_mapping["term_vector"] = "with_positions_offsets"
        if not hasattr(field_class, 'facet_for') and not \
                field_class.field_type in ('ngram', 'edge_ngram'):
            field_mapping["analyzer"] = "snowball"

  87. This page intentionally blank

  88. Use Your Own
    Backend

  89. New default
    analyzer

  90. user_analyzer = getattr(settings, 'ELASTICSEARCH_DEFAULT_ANALYZER', None)
    if user_analyzer:
        setattr(self, 'DEFAULT_ANALYZER', user_analyzer)

  91. def build_schema(self, fields):
        content_field_name, mapping = super(ConfigurableElasticBackend,
                                            self).build_schema(fields)
        for field_name, field_class in fields.items():
            field_mapping = mapping[field_class.index_fieldname]
            if field_mapping['type'] == 'string' and field_class.indexed:
                if not hasattr(field_class, 'facet_for') and not \
                        field_class.field_type in ('ngram', 'edge_ngram'):
                    field_mapping['analyzer'] = self.DEFAULT_ANALYZER
            mapping.update({field_class.index_fieldname: field_mapping})
        return (content_field_name, mapping)

  92. New search
    mapping

  93. class ConfigurableElasticBackend(ElasticsearchSearchBackend):
        def __init__(self, connection_alias, **connection_options):
            super(ConfigurableElasticBackend, self).__init__(
                connection_alias, **connection_options)
            user_settings = getattr(settings,
                                    'ELASTICSEARCH_INDEX_SETTINGS', None)
            if user_settings:
                setattr(self, 'DEFAULT_SETTINGS', user_settings)
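
A settings fragment along these lines could feed the custom backend. It is a sketch under assumptions: `ELASTICSEARCH_INDEX_SETTINGS` and `ELASTICSEARCH_DEFAULT_ANALYZER` are the setting names read in the slides, the dictionary simply mirrors the analysis JSON from the earlier slides (the top-level shape the backend expects may differ), and the `ENGINE` path is a hypothetical module holding an engine class that uses `ConfigurableElasticBackend`.

```python
# settings.py (fragment)

# Mirrors the analysis JSON from the earlier slides.
ELASTICSEARCH_INDEX_SETTINGS = {
    "index": {
        "analysis": {
            "analyzer": {
                "synonym": {
                    "tokenizer": "whitespace",
                    "filter": ["synonyms"],
                },
            },
            "filter": {
                "synonyms": {
                    "type": "synonym",
                    "synonyms_path": "analysis/synonym.txt",
                },
            },
        },
    },
}

# Name of the analyzer the backend should apply to string fields by default.
ELASTICSEARCH_DEFAULT_ANALYZER = "synonym"

HAYSTACK_CONNECTIONS = {
    "default": {
        # Hypothetical dotted path to your own engine class.
        "ENGINE": "myproject.search_backends.ConfigurableElasticSearchEngine",
        "URL": "http://127.0.0.1:9200/",
        "INDEX_NAME": "haystack",
    },
}
```

Keeping the analysis settings in Django settings means the index definition lives in version control next to the code that depends on it.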

  94. Pick analyzers
    by field

  95. def build_schema(self, fields):
        content_field_name, mapping = super(ConfigurableElasticBackend,
                                            self).build_schema(fields)
        for field_name, field_class in fields.items():
            field_mapping = mapping[field_class.index_fieldname]
            if field_mapping['type'] == 'string' and field_class.indexed:
                if not hasattr(field_class, 'facet_for') and not \
                        field_class.field_type in ('ngram', 'edge_ngram'):
                    field_mapping['analyzer'] = getattr(field_class, 'analyzer',
                                                        self.DEFAULT_ANALYZER)
            mapping.update({field_class.index_fieldname: field_mapping})
        return (content_field_name, mapping)

  96. from haystack.fields import CharField as BaseCharField
    class ConfigurableFieldMixin(object):
        def __init__(self, **kwargs):
            self.analyzer = kwargs.pop('analyzer', None)
            super(ConfigurableFieldMixin, self).__init__(**kwargs)
    class CharField(ConfigurableFieldMixin, BaseCharField):
        pass

  97. SearchView:
    Non-CBV CBV

  98. Some gotchas

  99. Debugging
    search issues

  100. Search is hard

  101. THE END
    Search and Django
    DjangoCon 2013 - Ben Lopatin
    ciafactbook.herokuapp.com
    tinyurl.com/finding-the-needle
