From __icontains to search

from __icontains to search Honza Král @honzakral

What is this "search"?

Search is an interface to data

Find documents about Django

    >>> docs = Document.objects.filter(
    ...     Q(title__icontains="django")
    ...     | Q(body__icontains="django")
    ... )

    $ grep -ri django *

Read all books in your library
Set aside those that mention Django

Didn't we already solve this?

Bible concordance

A simple form lists Biblical words alphabetically, with indications to enable the inquirer to find the passages of the Bible where the words occur.

The first concordance, completed in 1230, was undertaken under the guidance of Hugo de Saint-Cher (Hugo de Sancto Charo), assisted by fellow Dominicans.

Inverted index

python: file_1.txt, file_2.txt, file_3.txt
web: file_2.txt, file_3.txt
file_2.txt file_4.txt django file_3.txt flask jazz file_4.txt

search(python AND django)

python: file_1.txt, file_2.txt, file_3.txt
file_2.txt file_4.txt django file_3.txt flask jazz file_4.txt
web: file_2.txt, file_3.txt
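A minimal sketch of the same idea in Python: map every term to the sorted list of documents that contain it, then answer an AND query by intersecting those lists. The corpus, the file names and the helper names `build_index` and `search_and` are invented for illustration; they are not the files from the slide.

```python
from collections import defaultdict

# Hypothetical toy corpus; the texts are made up for illustration.
DOCS = {
    "a.txt": "python web framework",
    "b.txt": "python django web",
    "c.txt": "jazz flask",
}


def build_index(docs):
    """Map each term to the sorted list of documents containing it."""
    postings = defaultdict(set)
    for name, text in docs.items():
        for term in text.lower().split():
            postings[term].add(name)
    return {term: sorted(names) for term, names in postings.items()}


def search_and(index, *terms):
    """Return the documents that contain every one of the given terms."""
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


INDEX = build_index(DOCS)
print(search_and(INDEX, "python", "django"))
```

The lookup cost depends on the length of the posting lists involved, not on the size of the whole corpus, which is the point of the inverted index.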

Phrase search

python: file_1.txt (4), file_2.txt (1, 3), file_3.txt (11, 42)
web: file_2.txt (2), file_3.txt (10)

search("python web")

python: file_1.txt (4), file_2.txt (1, 13), file_3.txt (11, 42)
web: file_2.txt (2), file_3.txt (10)
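Storing token positions next to each document is what makes phrase queries possible: a phrase matches when its terms sit at consecutive positions in the same document. A small sketch under that assumption, with an invented corpus and hypothetical helper names:

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {document name -> list of token positions}."""
    index = defaultdict(dict)
    for name, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(name, []).append(position)
    return index


def phrase_search(index, phrase):
    """Return documents where the phrase terms occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return []
    # Only documents that contain all terms can match the phrase.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    hits = []
    for name in sorted(candidates):
        for start in index[terms[0]][name]:
            if all(start + i in index[t][name] for i, t in enumerate(terms)):
                hits.append(name)
                break
    return hits


# Hypothetical documents, made up for illustration.
DOCS = {
    "x.txt": "a python web framework",
    "y.txt": "web scraping with python",
}
POSITIONAL = build_positional_index(DOCS)
print(phrase_search(POSITIONAL, "python web"))
```

Both documents contain both words, but only `x.txt` has "web" directly after "python", so only it matches the phrase.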

Merging sorted lists
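Because posting lists are kept sorted, an AND query can walk two lists with one cursor each instead of comparing every pair. A sketch of that two-pointer merge (the function name is illustrative), using the python and web posting lists from the slides:

```python
def intersect_sorted(left, right):
    """Intersect two sorted lists in O(len(left) + len(right)) steps."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1  # left cursor is behind, advance it
        else:
            j += 1  # right cursor is behind, advance it
    return out


python_docs = ["file_1.txt", "file_2.txt", "file_3.txt"]
web_docs = ["file_2.txt", "file_3.txt"]
print(intersect_sorted(python_docs, web_docs))
```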

Building an inverted index

"Django is a high-level Python Web framework that encourages rapid development and clean, pragmatic design."

django high level python web framework encourag rapid develop clean pragmat design fast

Analysis

split into tokens
drop stop words
normalise
lowercase
stemming
synonyms

All happens at index time

Used for queries too!
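A toy version of such an analysis chain, to show why the same function has to run at index time and at query time. The stop-word list, the suffix list (a crude stand-in for a real stemmer) and the synonym table are all assumptions chosen so that the example sentence from the earlier slide produces the same terms.

```python
import re

STOP_WORDS = {"a", "and", "is", "that"}   # tiny illustrative list
SUFFIXES = ("ment", "es", "ic")           # crude stand-in for a real stemmer
SYNONYMS = {"rapid": ["fast"]}            # hypothetical synonym table


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Tokenize, lowercase, drop stop words, stem and expand synonyms."""
    tokens = []
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if token in STOP_WORDS:
            continue
        token = stem(token)
        tokens.append(token)
        tokens.extend(SYNONYMS.get(token, []))
    return tokens


TEXT = ("Django is a high-level Python Web framework that encourages "
        "rapid development and clean, pragmatic design.")
print(analyze(TEXT))
print(analyze("Encourages"))  # the query goes through the same chain
```

A query for "Encourages" only finds the document because it is reduced to `encourag`, the same term that was stored in the index.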

Relevancy?

Relevancy! Lucene Similarity

Can be ignored (was an attempt to make query scores comparable across indices, it’s there for backward compatibility)
Core TF/IDF weight
Score of a document for a given query
Normalized doc length, shorter docs are more likely to be relevant than longer docs
Boost of query term t

Relevancy, for humans

Positive factors (per term and field):
rare term
repeat occurrence
short field
...
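The three positive factors can be sketched as a simplified TF/IDF score. This assumes the textbook forms used by Lucene's classic similarity (square root of the term frequency, a logarithmic inverse document frequency, and a length norm of one over the square root of the field length) and leaves out the query normalisation factor that the slide says can be ignored. The function names and the corpus are illustrative.

```python
import math


def tf(freq):
    """Repeat occurrence: more occurrences help, with diminishing returns."""
    return math.sqrt(freq)


def idf(num_docs, doc_freq):
    """Rare term: terms found in few documents weigh more."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def length_norm(num_terms):
    """Short field: a match in a short field counts for more."""
    return 1.0 / math.sqrt(num_terms)


def score(query_terms, doc_tokens, corpus, boosts=None):
    """Sum tf * idf^2 * boost * norm over the query terms found in the doc."""
    boosts = boosts or {}
    total = 0.0
    for term in query_terms:
        freq = doc_tokens.count(term)
        if freq == 0:
            continue
        doc_freq = sum(1 for tokens in corpus if term in tokens)
        total += (tf(freq) * idf(len(corpus), doc_freq) ** 2
                  * boosts.get(term, 1.0) * length_norm(len(doc_tokens)))
    return total


# Hypothetical analyzed documents (lists of terms).
CORPUS = [
    ["python", "web"],
    ["python", "django", "django"],
    ["python", "jazz", "music", "history"],
]
for doc in CORPUS:
    print(doc, round(score(["python", "django"], doc, CORPUS), 3))
```

In this corpus "django" is rarer than "python", so it contributes more to the score, and a "python" match in the two-term document outranks the same match in the four-term one.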

Congratulations, we have built full-text search

Search

Highlighting

Highlighting

encourag [49:59]
rapid [60:65]
fast [60:65]
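Highlighting works because each indexed term remembers the character offsets of the original word it came from. A minimal sketch that records those offsets and wraps matching words; it skips stemming and synonyms, and the `<em>` markers and function names are assumptions:

```python
import re

TEXT = ("Django is a high-level Python Web framework that encourages "
        "rapid development and clean, pragmatic design.")


def tokens_with_offsets(text):
    """Yield (lowercased token, start, end) for each word in the text."""
    for match in re.finditer(r"[A-Za-z0-9]+", text):
        yield match.group().lower(), match.start(), match.end()


def highlight(text, query_terms, pre="<em>", post="</em>"):
    """Wrap every token found in query_terms, using the recorded offsets."""
    out, last = [], 0
    for token, start, end in tokens_with_offsets(text):
        if token in query_terms:
            out.append(text[last:start])
            out.append(pre + text[start:end] + post)
            last = end
    out.append(text[last:])
    return "".join(out)


OFFSETS = {token: (start, end) for token, start, end in tokens_with_offsets(TEXT)}
print(OFFSETS["encourages"], OFFSETS["rapid"])
print(highlight(TEXT, {"rapid"}))
```

The offsets for "encourages" and "rapid" agree with the [49:59] and [60:65] ranges on the slide; a synonym such as "fast" simply reuses the offsets of the word it was derived from.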

Facets & Filtering

Facets

Buckets:
by term - list of possible values
by time interval
...

Aggregates:
count, sum/min/avg/...(field)
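In plain Python, a terms facet is a count per distinct field value, and an aggregate is a statistic computed inside each of those buckets. The documents and the `tags` and `votes` fields below are invented for illustration:

```python
from collections import Counter

# Hypothetical documents; the field names are made up for illustration.
DOCS = [
    {"title": "Intro to Django", "tags": ["python", "django"], "votes": 10},
    {"title": "Flask in anger", "tags": ["python", "flask"], "votes": 4},
    {"title": "Jazz standards", "tags": ["jazz"], "votes": 7},
]


def terms_facet(docs, field):
    """Bucket documents by each value of a multi-valued field and count them."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.get(field, [])))
    return counts


def stats_per_bucket(docs, bucket_field, value_field):
    """Compute count/sum/min/avg of value_field within each term bucket."""
    buckets = {}
    for doc in docs:
        for term in set(doc.get(bucket_field, [])):
            buckets.setdefault(term, []).append(doc[value_field])
    return {
        term: {"count": len(v), "sum": sum(v), "min": min(v), "avg": sum(v) / len(v)}
        for term, v in buckets.items()
    }


print(terms_facet(DOCS, "tags").most_common())
print(stats_per_bucket(DOCS, "tags", "votes")["python"])
```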

Kibana

Filtering

No analysis - exact values
Does not impact score
Fast/cacheable
Perfect with facets for navigation

Did you mean?

Phrase suggestions

Based on terms
Suggests "closest" terms
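A rough illustration of "closest term" suggestions using only the standard library. `difflib` measures similarity differently from a search engine's suggester, so treat it as a stand-in; the term list, the cutoff and the function name are assumptions.

```python
import difflib

# Hypothetical list of terms known to the index.
TERMS = ["python", "django", "flask", "jazz", "web"]


def did_you_mean(query, known_terms, cutoff=0.7):
    """Replace each unknown word by its closest known term, if one is close enough."""
    suggestion = []
    for word in query.lower().split():
        if word in known_terms:
            suggestion.append(word)
            continue
        close = difflib.get_close_matches(word, known_terms, n=1, cutoff=cutoff)
        suggestion.append(close[0] if close else word)
    return " ".join(suggestion)


print(did_you_mean("pyhton djnago", TERMS))
```

Words with no sufficiently close term are left untouched, so the suggestion never drifts far from what the user typed.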

Enough theory!

Python, you say?

    # assumed: ES_INDEX is defined in the Django settings
    from django.conf import settings
    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    result = es.search(
        index=settings.ES_INDEX,
        body={
            "query": {"match": {"title": "django"}},
            "facets": {"per_tag": {"terms": {"field": "tags"}}},
        },
    )

    # soon (github.com/elasticsearch/elasticsearch-dsl-py)
    from elasticsearch_dsl import Search

    s = Search(using=es).query("match", title="django")
    s.aggs.bucket("per_tag", "terms", field="tags")

    result = s.execute()

Models, you say?

    # index a single instance, e.g. whenever it is saved
    def sync_to_es(instance, **kwargs):
        es.index(
            index=settings.ES_INDEX,
            doc_type=str(instance._meta),
            id=instance.pk,
            body=instance.to_json())

    # index all existing instances in bulk
    from operator import methodcaller
    from elasticsearch.helpers import bulk

    models = map(methodcaller('to_json'),
                 Model.objects.iterator())

    bulk(
        es, models,
        index=settings.ES_INDEX,
        doc_type=str(Model._meta))

Extra bits

auto-complete
sorting
custom scoring
geo-aware
multi-fields
relevancy tracking/training

auto-complete

Search is a bad match
relevancy makes no sense
speed is an issue (several ms is too slow)

Completion suggester
based on FST
score and text supplied by user
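The shape of a completion suggester can be sketched without an FST: keep the user-supplied inputs sorted, jump to the prefix with a binary search, and order the matches by the supplied weight. The sorted list is only a stand-in for the FST, and the class name, inputs and weights are invented.

```python
import bisect


class PrefixCompleter:
    """Prefix lookup over (text, weight) pairs supplied by the application."""

    def __init__(self, weighted_inputs):
        self._entries = sorted((text.lower(), weight) for text, weight in weighted_inputs)
        self._keys = [text for text, _ in self._entries]

    def complete(self, prefix, size=5):
        prefix = prefix.lower()
        # Binary search for the first key that is >= prefix.
        start = bisect.bisect_left(self._keys, prefix)
        matches = []
        for text, weight in self._entries[start:]:
            if not text.startswith(prefix):
                break  # keys are sorted, so no later key can match
            matches.append((text, weight))
        matches.sort(key=lambda pair: -pair[1])  # highest weight first
        return [text for text, _ in matches[:size]]


completer = PrefixCompleter([
    ("django", 10),
    ("django rest framework", 5),
    ("djangocon", 8),
    ("jazz", 3),
])
print(completer.complete("dj"))
```

No analysis or relevancy calculation happens at lookup time; the order comes entirely from the weights stored with each input.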

sorting

Requires access to all field values
Destroys relevancy

custom score

Influence the scoring
by supplying a script
by taking a field value into account
boost documents

Much better than plain sorting
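One way to take a field value into account is a `function_score` query with `field_value_factor`, which multiplies the relevancy score by a function of a numeric field instead of replacing it the way sorting does. The sketch below only builds the request body. The `popularity` field and the helper name are hypothetical, and whether this query type is available depends on the Elasticsearch version in use.

```python
def boosted_query(text, boost_field="popularity"):
    """Build a function_score body; boost_field is a hypothetical numeric field."""
    return {
        "query": {
            "function_score": {
                # The normal full-text query still decides what matches.
                "query": {"match": {"title": text}},
                # log1p dampens the field so huge values do not swamp relevancy.
                "field_value_factor": {
                    "field": boost_field,
                    "modifier": "log1p",
                },
                "boost_mode": "multiply",
            }
        }
    }


body = boosted_query("django")
print(body)
```

The resulting dictionary can be passed as `body=` to `es.search(...)` in the same way as the plain match query shown earlier.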

geo

one way to influence score is with distance
also use geo in queries and filters
aggregate based on geo

multi-fields

one source field, analyzed multiple times:
untouched for filtering/faceting
english
french
NGram

tracking and training

track what people search for
auto-complete those queries
number of click-throughs as score

track relevancy
train boosts to improve

thank you!

Honza Král
twitter: @honzakral
email: [email protected]

• Support: http://elasticsearch.com/support
• Training: http://training.elasticsearch.com/
• We are hiring: http://elasticsearch.com/about/jobs/
