From __icontains to search

Talk from DjangoCon Europe 2014 describing the anatomy of a search engine

Elasticsearch Inc

May 21, 2014

Transcript

  1. from __icontains to search Honza Král @honzakral

  2. What is this "search" ?

  3. search is interface to data

  4. Find documents about Django

    >>> docs = Document.objects.filter(
    ...     Q(title__icontains="django") |
    ...     Q(body__icontains="django")
    ... )

    $ grep -ri django *

    Read all books in your library.
    Set aside those that mention Django.
  6. Didn't we already solve this?

  7. Bible concordance

    A simple form lists Biblical words alphabetically, with indications to enable the inquirer to find the passages of the Bible where the words occur.

    The first concordance, completed in 1230, was undertaken under the guidance of Hugo de Saint-Cher (Hugo de Sancto Charo), assisted by fellow Dominicans.
  8. Inverted index

    (diagram) The terms python, web, django, flask and jazz, each linked to the files that contain it, for example:
    python: file_1.txt, file_2.txt, file_3.txt
    web: file_2.txt, file_3.txt
  9. search(python AND django)

    (diagram) The same inverted index, with the file lists for python and django combined to answer the query.
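The idea behind the two slides above can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's code: the file contents in `DOCS` are made up, and `build_index` and `search_and` are hypothetical helper names.

```python
from collections import defaultdict

# Hypothetical corpus: the file contents are invented for illustration.
DOCS = {
    "file_1.txt": "python is a programming language",
    "file_2.txt": "python web development",
    "file_3.txt": "django is a python web framework",
    "file_4.txt": "django jazz guitar",
}


def build_index(docs):
    """Map every term to the sorted list of files that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for term in text.lower().split():
            index[term].add(name)
    return {term: sorted(names) for term, names in index.items()}


def search_and(index, *terms):
    """Return the files that contain all of the given terms."""
    postings = [set(index.get(term, ())) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


index = build_index(DOCS)
print(index["python"])                        # files mentioning "python"
print(search_and(index, "python", "django"))  # files mentioning both
```

Unlike `__icontains` or `grep`, the query never reads the documents themselves; it only looks up and combines the precomputed lists.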
  10. Phrase search

    python: file_1.txt (4), file_2.txt (1, 3), file_3.txt (11, 42)
    web: file_2.txt (2), file_3.txt (10)
  11. search("python web")

    python: file_1.txt (4), file_2.txt (1, 13), file_3.txt (11, 42)
    web: file_2.txt (2), file_3.txt (10)
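A phrase query can be answered from the stored positions alone: the second term has to appear exactly one position after the first, in the same file. The sketch below uses the positions from the slide above; `phrase_search` is a hypothetical helper, not an Elasticsearch API.

```python
# Positional postings copied from the slide: term -> {file: [positions]}.
POSITIONS = {
    "python": {"file_1.txt": [4], "file_2.txt": [1, 13], "file_3.txt": [11, 42]},
    "web": {"file_2.txt": [2], "file_3.txt": [10]},
}


def phrase_search(index, phrase):
    """Return files in which the terms of `phrase` occur at consecutive positions."""
    terms = phrase.split()
    if not terms:
        return []
    first, rest = terms[0], terms[1:]
    matches = []
    for doc, starts in index.get(first, {}).items():
        for start in starts:
            # Every following term must sit at start + 1, start + 2, ...
            if all(
                start + offset in index.get(term, {}).get(doc, [])
                for offset, term in enumerate(rest, start=1)
            ):
                matches.append(doc)
                break
    return sorted(matches)


print(phrase_search(POSITIONS, "python web"))  # ['file_2.txt']
```

file_3.txt contains both words, but "web" (position 10) comes before "python" (position 11), so it does not match the phrase.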
  12. Merging sorted lists
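Because each term's file list is kept sorted, an AND query reduces to walking two lists in step. A minimal sketch of that merge, with `intersect_sorted` as a hypothetical name:

```python
def intersect_sorted(left, right):
    """Intersect two sorted lists in a single pass over both."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            result.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1  # left is behind, advance it
        else:
            j += 1  # right is behind, advance it
    return result


python_files = ["file_1.txt", "file_2.txt", "file_3.txt"]
web_files = ["file_2.txt", "file_3.txt"]
print(intersect_sorted(python_files, web_files))  # ['file_2.txt', 'file_3.txt']
```

Each list is read at most once, so the cost grows with the combined length of the lists rather than with the size of the documents.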

  13. Building an inverted index

    "Django is a high-level Python Web framework that encourages rapid development and clean, pragmatic design."

    django high level python web framework encourag rapid develop clean pragmat design fast
  14. Analysis

    split into tokens
    drop stop words
    normalise
    lowercase
    stemming
    synonyms
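The analysis steps listed above can be sketched as one small function. This is a toy: the stop-word list, the synonym table and the suffix-stripping `stem` are assumptions chosen so the example sentence works, and a real analyzer would use a proper stemmer.

```python
import re

# Illustrative tables, not taken from any real analyzer.
STOP_WORDS = {"a", "an", "and", "is", "that", "the"}
SYNONYMS = {"rapid": ["fast"]}
SUFFIXES = ("ment", "es", "ic", "s")


def stem(token):
    """Very crude stemmer: strip the first matching suffix."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, split into tokens, drop stop words, stem, add synonyms."""
    terms = []
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if token in STOP_WORDS:
            continue
        terms.append(stem(token))
        terms.extend(SYNONYMS.get(token, []))
    return terms


SENTENCE = ("Django is a high-level Python Web framework that encourages "
            "rapid development and clean, pragmatic design.")
print(analyze(SENTENCE))
```

Running the same `analyze` over the query text is what lets a search for "Encourages" find the indexed term `encourag`.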
  15. All happens at index time

  16. Used for queries too!

  17. Relevancy?

  18. Relevancy! Lucene Similarity

    (annotations on the scoring formula)
    Score of a document for a given query
    Core TF/IDF weight
    Boost of query term t
    Normalized doc length: shorter docs are more likely to be relevant than longer docs
    Can be ignored: was an attempt to make query scores comparable across indices, it’s there for backward compatibility
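The formula itself did not survive in the transcript. Assuming the slide showed the practical scoring function documented for Lucene's classic TF/IDF similarity, it would read roughly as follows:

```latex
\mathrm{score}(q, d) =
  \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q) \cdot
  \sum_{t \in q} \Big( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^{2}
  \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t, d) \Big)
```

Here tf and idf form the core TF/IDF weight, `t.getBoost()` is the boost of query term t, norm(t, d) carries the length normalization, and queryNorm(q) is the factor the slide says can be ignored.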
  19. Relevancy, for humans

    Positive factors (per term and field):
    rare term
    repeat occurrence
    short field
    ...
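The three positive factors can be seen in a toy scorer. This is a simplified sketch loosely modelled on TF/IDF, not Lucene's exact formula; `tf_idf_score` and the corpus are made up for illustration.

```python
import math

# Hypothetical, already analyzed documents.
CORPUS = [
    ["python", "web", "framework"],
    ["python", "django", "web"],
    ["python", "python", "jazz"],
    ["python", "snake", "long", "document", "with", "many", "other", "words"],
]


def tf_idf_score(query_terms, doc_terms, corpus):
    """Sum a simplified TF/IDF weight over the query terms."""
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf == 0:
            continue
        df = sum(1 for doc in corpus if term in doc)
        idf = 1.0 + math.log(len(corpus) / df)       # rare terms weigh more
        length_norm = 1.0 / math.sqrt(len(doc_terms))  # short fields weigh more
        score += math.sqrt(tf) * idf ** 2 * length_norm  # repeats weigh more
    return score


rare = tf_idf_score(["django"], CORPUS[1], CORPUS)
common = tf_idf_score(["python"], CORPUS[1], CORPUS)
print(rare > common)  # True: "django" is rarer than "python"
```

The same function also ranks the document that repeats "python" above one that mentions it once, and the three-word document above the eight-word one.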
  20. Congratulations, we have built full-text search

  21. Search

  22. Highlighting

  23. Highlighting

    (diagram) Terms stored with the character offsets of the original words:
    encourag [49:59]
    rapid [60:65]
    fast [60:65]
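The offsets on the slide are character positions in the example sentence from the indexing slide, which is what makes highlighting possible without re-analyzing the text. A minimal sketch, with `tokens_with_offsets` and `highlight` as hypothetical helpers that match on the raw word rather than the stem:

```python
import re

TEXT = ("Django is a high-level Python Web framework that encourages "
        "rapid development and clean, pragmatic design.")


def tokens_with_offsets(text):
    """Return (lowercased token, start, end) for every word in the text."""
    return [(m.group().lower(), m.start(), m.end())
            for m in re.finditer(r"\w+", text)]


def highlight(text, word, open_tag="<em>", close_tag="</em>"):
    """Wrap the first occurrence of `word` in tags, using stored offsets."""
    for token, start, end in tokens_with_offsets(text):
        if token == word.lower():
            return text[:start] + open_tag + text[start:end] + close_tag + text[end:]
    return text


print(("encourages", 49, 59) in tokens_with_offsets(TEXT))  # True
print(highlight(TEXT, "rapid"))
```

A synonym such as "fast" shares the offsets of "rapid", so a query for "fast" would highlight the word "rapid" in the original text.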

  24. Facets & Filtering

  25. Facets

    Buckets:
    by term - list of possible values
    by time interval
    ...

    Aggregates: count, sum/min/avg/...(field)
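Buckets and aggregates can be illustrated in plain Python. The documents, the `tags` and `votes` fields, and both helper functions are hypothetical; Elasticsearch computes the equivalent on the server.

```python
from collections import Counter, defaultdict

# Hypothetical search hits.
HITS = [
    {"tags": ["django", "python"], "votes": 10},
    {"tags": ["python"], "votes": 4},
    {"tags": ["django", "search"], "votes": 6},
]


def terms_buckets(hits, field):
    """Bucket by term: count how many hits carry each value of `field`."""
    counts = Counter()
    for hit in hits:
        counts.update(hit.get(field, []))
    return dict(counts)


def avg_per_bucket(hits, bucket_field, value_field):
    """Aggregate: average `value_field` inside each term bucket."""
    values = defaultdict(list)
    for hit in hits:
        for term in hit.get(bucket_field, []):
            values[term].append(hit[value_field])
    return {term: sum(vals) / len(vals) for term, vals in values.items()}


print(terms_buckets(HITS, "tags"))
print(avg_per_bucket(HITS, "tags", "votes"))
```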
  26. Kibana

  27. Filtering

    No analysis - exact values
    Does not impact score
    Fast/cacheable
    Perfect with facets for navigation
  28. Did you mean?

  29. Phrase suggestions

    Based on terms
    Suggests "closest" terms
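A rough feel for "closest" terms can be had with the standard library's `difflib`. This swaps in difflib's similarity ratio for whatever distance measure the search engine uses; the term list and `did_you_mean` are assumptions for illustration.

```python
import difflib

# Hypothetical list of terms known to the index.
TERMS = ["django", "flask", "jazz", "python", "web"]


def did_you_mean(word, terms=TERMS):
    """Return the indexed term closest to `word`, or None if nothing is close."""
    matches = difflib.get_close_matches(word.lower(), terms, n=1, cutoff=0.6)
    return matches[0] if matches else None


print(did_you_mean("djnago"))  # django
print(did_you_mean("xyzzy"))   # None
```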

  30. Enough theory!

  31. Python, you say?

    from django.conf import settings  # assumed source of ES_INDEX
    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    result = es.search(
        index=settings.ES_INDEX,
        body={
            "query": {"match": {"title": "django"}},
            "facets": {"per_tag": {"terms": {"field": "tags"}}}
        })

    # soon (github.com/elasticsearch/elasticsearch-dsl-py)
    from elasticsearch_dsl import Search

    s = Search(using=es).query("match", title="django")
    s.aggs.bucket("per_tag", "terms", field="tags")

    result = s.execute()
  32. Models, you say?

    # index a single instance, e.g. from a save signal handler
    def sync_to_es(instance, **kwargs):
        es.index(
            index=settings.ES_INDEX,
            doc_type=str(instance._meta),
            id=instance.pk,
            body=instance.to_json())

    # index all instances of a model in bulk
    from operator import methodcaller
    from elasticsearch.helpers import bulk

    models = map(methodcaller('to_json'),
                 Model.objects.iterator())

    bulk(
        es, models,
        index=settings.ES_INDEX,
        doc_type=str(Model._meta))
  33. Extra bits

    auto-complete
    sorting
    custom scoring
    geo-aware
    multi-fields
    relevancy tracking/training

  34. auto-complete

    Search is a bad match:
    relevancy makes no sense
    speed is an issue (several ms is too slow)

    Completion suggester:
    based on FST
    score and text supplied by user
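The point that the text and score come from the application rather than from relevancy can be shown with a simple prefix lookup. This sketch swaps the FST for a sorted list searched with `bisect`; `PrefixSuggester` and its data are hypothetical, and the `"\uffff"` upper bound assumes suggestions contain no characters above U+FFFF.

```python
import bisect


class PrefixSuggester:
    """Prefix lookup over a sorted list (a stand-in for an FST)."""

    def __init__(self, weighted_terms):
        # weighted_terms: {suggestion text: score}, both supplied by the application
        self._weights = dict(weighted_terms)
        self._terms = sorted(self._weights)

    def suggest(self, prefix, size=5):
        # All terms starting with `prefix` form one contiguous slice.
        lo = bisect.bisect_left(self._terms, prefix)
        hi = bisect.bisect_left(self._terms, prefix + "\uffff")
        candidates = self._terms[lo:hi]
        # Order by the supplied score, highest first; ties alphabetically.
        candidates.sort(key=lambda term: (-self._weights[term], term))
        return candidates[:size]


suggester = PrefixSuggester({
    "django": 10,
    "django rest framework": 7,
    "djangocon": 3,
    "flask": 5,
})
print(suggester.suggest("dj"))
```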
  35. sorting

    Requires access to all field values
    Destroys relevancy

  36. custom score

    Influence the scoring by supplying a script, by taking a field value into account, or by boosting documents.
    Much better than plain sorting
  37. geo

    one way to influence score is with distance
    also use geo in queries and filters
    aggregate based on geo
  38. multi-fields

    one source field, analyzed multiple times:
    untouched for filtering/faceting
    english
    french
    NGram
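A multi-field keeps several analyzed views of one source value. The sketch below shows only the untouched and NGram variants; `ngrams`, `multi_field` and the sub-field names `raw` and `ngram` are hypothetical, and the language analyzers are left out.

```python
def ngrams(token, min_n=2, max_n=3):
    """Return all character n-grams of `token` with min_n <= n <= max_n."""
    return [
        token[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(token) - n + 1)
    ]


def multi_field(value):
    """Index one source value in several ways."""
    return {
        "raw": value,  # untouched, for filtering and faceting
        "ngram": [g for word in value.lower().split() for g in ngrams(word)],
    }


print(ngrams("web"))  # ['we', 'eb', 'web']
print(multi_field("Web Framework")["raw"])
```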
  39. tracking and training

    track what people search for:
    auto-complete those queries
    number of click-throughs as score

    track relevancy:
    train boosts to improve
  40. thank you!

    Honza Král
    twitter: @honzakral
    email: [email protected]

    Support: http://elasticsearch.com/support
    Training: http://training.elasticsearch.com/
    We are hiring: http://elasticsearch.com/about/jobs/