
From __icontains to search


Talk from DjangoCon Europe 2014 describing the anatomy of a search engine

Elasticsearch Inc

May 21, 2014


Transcript

  1. Find documents about Django

        >>> docs = Document.objects.filter(
        ...     Q(title__icontains="django") |
        ...     Q(body__icontains="django")
        ... )

        $ grep -ri django *

    Read all books in your library. Set aside those that mention Django.
  3. Bible concordance

    A simple form lists Biblical words alphabetically, with indications to enable the inquirer to find the passages of the Bible where the words occur.

    The first concordance, completed in 1230, was undertaken under the guidance of Hugo de Saint-Cher (Hugo de Sancto Charo), assisted by fellow Dominicans.
  4. Building an inverted index

    "Django is a high-level Python Web framework that encourages rapid development and clean, pragmatic design."

    django, high, level, python, web, framework, encourag, rapid, develop, clean, pragmat, design, fast
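The step from the sentence to the list of terms can be sketched in a few lines of Python. This is only an illustration: the stop-word list and suffix rules are toy assumptions chosen to reproduce the terms on the slide (a real analyzer uses a proper stemmer), the second document is made up, and the slide's extra term "fast" is not produced here.

```python
import re
from collections import defaultdict

# Toy analysis rules (assumptions for illustration, not a real stemmer).
STOP_WORDS = {"is", "a", "that", "and", "the"}
SUFFIXES = ("ment", "es", "ic", "s")


def analyze(text):
    """Lowercase, split on non-letters, drop stop words, strip one suffix."""
    terms = []
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in STOP_WORDS:
            continue
        for suffix in SUFFIXES:
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                token = token[: -len(suffix)]
                break
        terms.append(token)
    return terms


def build_index(docs):
    """Map every term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every analyzed query term."""
    postings = [index.get(term, set()) for term in analyze(query)]
    return set.intersection(*postings) if postings else set()


DOCS = {
    1: "Django is a high-level Python Web framework that encourages "
       "rapid development and clean, pragmatic design.",
    2: "Flask is a micro framework for Python.",  # made-up second document
}
INDEX = build_index(DOCS)
```

Because the query goes through the same analysis as the documents, `search(INDEX, "Python frameworks")` finds both documents, which a plain substring match on "frameworks" would not.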
  5. Relevancy

    Lucene Similarity. Annotations on the scoring formula:

    • Can be ignored (was an attempt to make query scores comparable across indices, it’s there for backward compatibility)
    • Core TF/IDF weight
    • Score of a document for a given query
    • Normalized doc length, shorter docs are more likely to be relevant than longer docs
    • Boost of query term t
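The formula these annotations point at did not survive in the transcript. Assuming the slide showed the practical scoring function documented for Lucene's classic TF/IDF similarity (`TFIDFSimilarity`), it would read as follows, with boost(t) standing for the boost of query term t:

```latex
\[
\mathrm{score}(q,d) \;=\; \mathrm{queryNorm}(q) \cdot \mathrm{coord}(q,d) \cdot
  \sum_{t \in q} \Big( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^{2} \cdot
  \mathrm{boost}(t) \cdot \mathrm{norm}(t,d) \Big)
\]
```

Under that assumption, "can be ignored" would refer to queryNorm(q), the core TF/IDF weight to tf(t in d) · idf(t)², and the normalized document length to norm(t,d). The coord(q,d) factor, which rewards documents matching more of the query terms, has no annotation in the transcript.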
  6. Facets

    Buckets:
    • by term - list of possible values
    • by time interval
    • ...

    Aggregates: count, sum/min/avg/...(field)
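To make the bucket/aggregate split concrete, here is a minimal pure-Python sketch of a "by term" facet. The documents, the `tags` and `price` field names, and the `facet_by_term` helper are all made up for illustration; a search engine computes the same kind of result over the documents matching a query.

```python
from collections import defaultdict

# Made-up documents, for illustration only.
DOCS = [
    {"title": "Intro to Django", "tags": ["django", "python"], "price": 10},
    {"title": "Search basics", "tags": ["search"], "price": 30},
    {"title": "Django and search", "tags": ["django", "search"], "price": 20},
]


def facet_by_term(docs, field, metric_field):
    """Bucket docs by each value of `field`, then aggregate `metric_field`."""
    buckets = defaultdict(list)
    for doc in docs:
        for value in doc[field]:
            buckets[value].append(doc[metric_field])
    return {
        term: {
            "count": len(values),
            "sum": sum(values),
            "min": min(values),
            "avg": sum(values) / len(values),
        }
        for term, values in buckets.items()
    }


FACETS = facet_by_term(DOCS, "tags", "price")
```

A document with several tags lands in several buckets, so the bucket counts can add up to more than the number of documents.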
  7. Filtering

    • No analysis - exact values
    • Does not impact score
    • Fast/cacheable
    • Perfect with facets for navigation
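As a rough sketch of how a scored query and unscored filters combine, the helper below builds a request body in which the full-text part decides the ranking and the tag filters only narrow the result set. It assumes a later Elasticsearch version (2.0 or newer) where the `bool` query accepts a `filter` clause; at the time of the talk the equivalent was the `filtered` query. The `build_search_body` helper is hypothetical, and the `title` and `tags` field names are borrowed from the deck's Python example.

```python
def build_search_body(text, tags):
    """Scored full-text match on title plus exact-value, unscored tag filters."""
    return {
        "query": {
            "bool": {
                # Analyzed and scored: contributes to relevancy.
                "must": [{"match": {"title": text}}],
                # Exact values, no scoring: one term filter per selected tag.
                "filter": [{"term": {"tags": tag}} for tag in tags],
            }
        }
    }


BODY = build_search_body("django", ["python", "web"])
```

The filter list maps naturally onto facet navigation: each facet value a user clicks becomes one more `term` filter.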
  8. Python, you say?

        from django.conf import settings  # assumed: Django settings with an ES_INDEX entry
        from elasticsearch import Elasticsearch

        es = Elasticsearch()

        # Match on the title and facet the hits by tag.
        result = es.search(
            index=settings.ES_INDEX,
            body={
                "query": {"match": {"title": "django"}},
                "facets": {"per_tag": {"terms": {"field": "tags"}}},
            })

        # soon (github.com/elasticsearch/elasticsearch-dsl-py)
        from elasticsearch_dsl import Search

        s = Search(using=es).query("match", title="django")
        s.aggs.bucket("per_tag", "terms", field="tags")

        result = s.execute()
  9. Models, you say?

        # Index a single instance (the signature fits a Django signal receiver).
        def sync_to_es(instance, **kwargs):
            es.index(
                index=settings.ES_INDEX,
                doc_type=str(instance._meta),
                id=instance.pk,
                body=instance.to_json())

        # Index a whole table in bulk.
        from operator import methodcaller
        from elasticsearch.helpers import bulk

        models = map(methodcaller('to_json'),
                     Model.objects.iterator())

        bulk(
            es, models,
            index=settings.ES_INDEX,
            doc_type=str(Model._meta))
  10. auto-complete

    • Search is a bad match
    • relevancy makes no sense
    • speed is an issue (several ms is too slow)
    • Completion suggester
    • based on FST
    • score and text supplied by user
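The idea of completing from a precomputed structure, with the score supplied up front rather than computed per query, can be sketched as below. Note the swap: this uses a sorted list with binary search, not an FST, so it only illustrates the behaviour and not the data structure the completion suggester is built on. The `PrefixCompleter` class and the sample suggestions are made up.

```python
from bisect import bisect_left


class PrefixCompleter:
    """Prefix lookup over a sorted list (a stand-in for an FST)."""

    def __init__(self, suggestions):
        # suggestions: mapping of suggestion text -> score supplied at index time
        self._items = sorted(
            (text.lower(), score) for text, score in suggestions.items()
        )
        self._keys = [text for text, _ in self._items]

    def complete(self, prefix, size=5):
        prefix = prefix.lower()
        # All entries sharing the prefix are contiguous in sorted order.
        start = bisect_left(self._keys, prefix)
        matches = []
        for text, score in self._items[start:]:
            if not text.startswith(prefix):
                break
            matches.append((text, score))
        # Highest supplied score first, ties broken alphabetically.
        matches.sort(key=lambda item: (-item[1], item[0]))
        return [text for text, _ in matches[:size]]


COMPLETER = PrefixCompleter({
    "django": 10,
    "django rest framework": 7,
    "djangocon": 9,
    "elasticsearch": 8,
})
```

No text relevancy is computed at lookup time: the ordering comes entirely from the scores stored alongside the suggestions.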
  11. custom score

    • Influence the scoring
    • by supplying a script
    • by taking a field value into account
    • boost documents
    • Much better than plain sorting
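Why folding a field value into the score beats plain sorting can be seen in a small sketch. The hits, their relevance scores, and the `popularity` values are made up, and multiplying by `log1p(popularity)` is just one plausible way to combine the two, not a formula taken from the talk.

```python
import math

# Made-up hits: (title, text relevance score, popularity field value).
HITS = [
    ("Obscure but exact match", 3.0, 2),
    ("Popular and relevant", 2.5, 500),
    ("Very popular, barely relevant", 0.2, 10000),
]


def custom_score(relevance, popularity):
    """Dampen popularity with a logarithm so it nudges the text score."""
    return relevance * math.log1p(popularity)


# Plain sorting by the field ignores relevance entirely.
BY_POPULARITY = sorted(HITS, key=lambda hit: hit[2], reverse=True)

# A custom score keeps relevance in play while still rewarding popularity.
BY_CUSTOM_SCORE = sorted(
    HITS, key=lambda hit: custom_score(hit[1], hit[2]), reverse=True
)
```

Sorting by popularity puts the barely relevant hit first, while the combined score ranks the popular and relevant one on top.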
  12. geo

    • one way to influence score is with distance
    • also use geo in queries and filters
    • aggregate based on geo
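Influencing the score with distance can be sketched as a decay applied to the text relevance. Both pieces here are assumptions for illustration: the great-circle distance uses the standard haversine formula with a mean Earth radius of 6371 km, and the decay (halving the score every `scale_km` kilometres) is one possible shape, not necessarily the one a given search engine applies.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_boosted_score(relevance, distance_km, scale_km=10.0):
    """Halve the relevance score for every `scale_km` of distance."""
    return relevance * 0.5 ** (distance_km / scale_km)
```

With this shape a nearby document keeps almost all of its text score, and a distant one needs a much better text match to outrank it.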
  13. tracking and training

    • track what people search for
    • auto-complete those queries
    • number of click-throughs as score
    • track relevancy
    • train boosts to improve
  14. thank you!

    Honza Král
    twitter: @honzakral
    email: [email protected]

    • Support: http://elasticsearch.com/support
    • Training: http://training.elasticsearch.com/
    • We are hiring: http://elasticsearch.com/about/jobs/