Indexing and search tons of data with ElasticSearch and Django

ernestoarbitrio

April 09, 2017

Transcript

  1. PYCON OTTO, Florence, April 6-9 2017
    Indexing and search tons of data with ElasticSearch and Django
    ernesto arbitrio
  2. whoami
    • “Tech researcher at FBK(.eu)”
    • “Python addicted, system integrator, dad, husband, cooking enthusiast”
    twitter: __pamaron__
    email: [email protected]
    github.com/ernestoarbitrio
  3. What is ElasticSearch?
    A distributed search engine:
    • Open source
    • Document based
    • JSON over the HTTP protocol
    • Built on top of Apache Lucene
  4. What does ElasticSearch do?
    • Indexes heterogeneous data
    • Searches data fast
    • Supports your data discovery applications
  5. Parent-child
    • The parent document can be updated without reindexing the children.
    • Child documents can be added, changed, or deleted without affecting either the parent or other children. This is especially useful when child documents are large in number and need to be added or changed frequently.
    • Child documents can be returned as the results of a search request.
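As a rough sketch of how a parent-child relation could be declared and queried, the snippet below builds the request bodies as plain Python dictionaries. It assumes the `_parent` mapping syntax of ElasticSearch 5.x (later releases use a `join` field instead). The `attachment` child type and its `filename` field are hypothetical; only the `email` type name comes from the talk.

```python
import json


def parent_child_mapping(parent_type, child_type):
    """Build an index body in which child_type declares parent_type as its parent."""
    return {
        "mappings": {
            parent_type: {},
            child_type: {"_parent": {"type": parent_type}},
        }
    }


def has_child_query(child_type, field, text):
    """Build a search body that returns parents having at least one matching child."""
    return {
        "query": {
            "has_child": {
                "type": child_type,
                "query": {"match": {field: text}},
            }
        }
    }


# Hypothetical names: "attachment" and "filename" are not part of the talk.
mapping = parent_child_mapping("email", "attachment")
query = has_child_query("attachment", "filename", "report")
print(json.dumps(mapping, indent=2))
print(json.dumps(query, indent=2))
```

Because the relation lives in the child's mapping, adding or removing an `attachment` would not require touching the parent `email` document.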
  6. Queries (unstructured)
    Core methods: match, multi_match, phrase, fuzzy, regexp, wildcard
    Compound queries: bool, filtered, function_score
    NOT CACHED
  7. Filters (structured)
    Core filters: term, range, exists, geo_distance, geo_bbox, script
    Compound filters: bool, and/or/not
    FAST, CACHEABLE
    The goal of filtering is to reduce the number of documents that have to be examined by the scoring queries.
  8. When to Use Which
    As a general rule, use query clauses for full-text search or for any condition that should affect the relevance score, and use filters for everything else.
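The rule above can be sketched as a small helper that keeps scoring clauses and filter clauses apart when assembling a `bool` query. The function name `bool_query` and its parameters are made up for illustration; the field names are the ones from the demo email documents.

```python
def bool_query(scoring=None, filters=None, excluded=None):
    """Assemble a bool query body.

    scoring  -> "must":     clauses that contribute to the relevance score
    filters  -> "filter":   yes/no conditions, no scoring
    excluded -> "must_not": documents matching these are dropped
    """
    bool_body = {}
    if scoring:
        bool_body["must"] = list(scoring)
    if filters:
        bool_body["filter"] = list(filters)
    if excluded:
        bool_body["must_not"] = list(excluded)
    return {"query": {"bool": bool_body}}


body = bool_query(
    scoring=[{"match": {"Subject": "Lorem"}}],    # full text: should affect ranking
    filters=[{"match": {"Categories": "A"}}],     # structured condition: only narrows
    excluded=[{"match": {"Body": "amet"}}],
)
print(body)
```

Keeping the two lists separate in the call makes it harder to put a structured condition into the scoring part by accident.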
  9. Pure ElasticSearch query EXAMPLE … over HTTP, for instance
        curl -XPOST "http://localhost:9200/_search" -d '{
          "aggs": {
            "posted_dates": {"terms": {"field": "PostedDate"}}
          },
          "query": {
            "bool": {
              "filter": [{"match": {"Categories": "A"}}],
              "must": [{"match": {"Subject": "Lorem"}}],
              "must_not": [{"match": {"Body": "amet"}}]
            }
          }
        }'
  10. Pure python query EXAMPLE
        import requests

        query = {
            'aggs': {
                'posted_dates': {'terms': {'field': 'PostedDate'}}
            },
            'query': {
                'bool': {
                    'filter': [{'match': {'Categories': 'A'}}],
                    'must': [{'match': {'Subject': 'Lorem'}}],
                    'must_not': [{'match': {'Body': 'amet'}}]
                }
            }
        }
        uri = "http://localhost:9200/_search"
        # json= serialises the dict to JSON; data=query would form-encode it
        response = requests.get(uri, json=query)
        results = response.json()
  11. ElasticSearch
    Different deployment envs: Nginx, PaaS, Thrift
    RESTful APIs: 90+ endpoints, 600+ parameters, escaping, encoding
  12. Document Example (demo email)
        {
          "Abstract": "Cara Elena, ho rivisto la norma operativa i...",
          "Body": "Cara Elena, some text here…..",
          "Categories": "",
          "CopyTo": [ "[email protected]" ],
          "DeliveredDate": "22/11/2011 12.53.25",
          "DeliveryPriority": "N",
          "DocumentCreated": "22/11/2011 12.53.17",
          "DocumentUniqueId": "908268A52DDAA2CF727724498EC4AF08",
          "Form": "Memo",
          "From": "Riccardo Mauri <[email protected]>",
          "PostedDate": "22/11/2011 12.53.17",
          "SendTo": [ "[email protected]" ],
          "Subject": "Rev norma operativa parti correlate"
        }
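The demo document stores its dates as `22/11/2011 12.53.25`, a layout that ElasticSearch's default date parsing is unlikely to recognise. One possible way to deal with this, sketched below, is to normalise those fields to ISO 8601 before indexing. The helper name `normalise_dates` is hypothetical, and the day/month/year reading of the format is an assumption based on the sample values.

```python
from datetime import datetime

# Assumed layout of the source dates: day/month/year hour.minute.second
SOURCE_FORMAT = "%d/%m/%Y %H.%M.%S"
DATE_FIELDS = ("DeliveredDate", "DocumentCreated", "PostedDate")


def normalise_dates(doc, fields=DATE_FIELDS):
    """Return a copy of doc with the given date fields rewritten as ISO 8601 strings."""
    cleaned = dict(doc)
    for name in fields:
        value = cleaned.get(name)
        if value:  # leave missing or empty dates untouched
            cleaned[name] = datetime.strptime(value, SOURCE_FORMAT).isoformat()
    return cleaned


sample = {
    "Subject": "Rev norma operativa parti correlate",
    "PostedDate": "22/11/2011 12.53.17",
    "DeliveredDate": "22/11/2011 12.53.25",
}
normalised = normalise_dates(sample)
print(normalised["PostedDate"])  # 2011-11-22T12:53:17
```

An alternative would be to declare a custom `format` on the date fields in the mapping and index the strings unchanged.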
  13. elasticsearch-py | example
        In [1]: from elasticsearch import Elasticsearch
        In [2]: es = Elasticsearch()
        In [3]: es.indices.create(index='emails', ignore=400)
        Out[3]: {'acknowledged': True, 'shards_acknowledged': True}
        In [4]: es.index(index="emails", doc_type="email", id=4, body={})
        Out[4]: {'_id': '4', '_index': 'emails',
                 '_shards': {'failed': 0, 'successful': 1, 'total': 2},
                 '_type': 'email', '_version': 1,
                 'created': True, 'result': 'created'}
        In [5]: es.get(index="emails", doc_type="email", id=4)['_source']
        Out[5]: {'Abstract': 'Cara Elena, ho rivisto la norma operativa i...',
                 'Body': 'Cara Elena, some text here…..',
                 'Categories': '',
                 ... ,
                 'Subject': 'Rev norma operativa parti correlate'}
    The document being indexed (abbreviated):
        {
          "Abstract": "Cara Elena, ho rivisto",
          "Body": "Cara Elena, some text here…..",
          "Categories": "",
          "CopyTo": [ "[email protected]" ],
          …
        }
  14. elasticsearch-py | example
        res = es.search(body={
            'aggs': {
                'posted_dates': {
                    'terms': {'field': 'PostedDate'}
                }
            },
            'query': {
                'bool': {
                    'filter': [{'match': {'Categories': 'A'}}],
                    'must': [{'match': {'Subject': 'Lorem'}}],
                    'must_not': [{'match': {'Body': 'amet'}}]
                }
            }
        })
  15. elasticsearch-dsl
        from elasticsearch import Elasticsearch
        from elasticsearch_dsl import A, Q, Search

        client = Elasticsearch()
        s = Search(using=client, index="emails") \
            .filter("match", Categories="A") \
            .filter("match", Subject="Lorem") \
            .filter(~Q("match", Body="amet"))
        agg = A('terms', field='PostedDate')
        s.aggs.bucket('posted_dates', agg)
        response = s.execute()
  16. response
    Response object
        response = s.execute()
        if not response.success():
            print("Partial results!")
    Iterate and get hits
        for h in response:
            print(h.meta.id, h.subject)
    Access to aggregations
        agg_date = response.aggregations.posted_dates.buckets[0]
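The response object wraps the JSON that ElasticSearch sends back. For readers working with the raw dictionary (for example the return value of `es.search`), the same information can be pulled out by hand, as in this minimal sketch. The `summarise` helper and the sample response are invented for illustration; only the `hits` / `aggregations` / `buckets` layout follows the standard search response.

```python
def summarise(raw, agg_name):
    """Extract (id, subject) pairs and terms-aggregation counts from a raw search response."""
    hits = [
        (hit["_id"], hit["_source"].get("Subject"))
        for hit in raw["hits"]["hits"]
    ]
    buckets = {
        bucket["key"]: bucket["doc_count"]
        for bucket in raw["aggregations"][agg_name]["buckets"]
    }
    return hits, buckets


# Made-up response, shaped like the output of a search with a terms aggregation.
raw_response = {
    "hits": {
        "total": 2,
        "hits": [
            {"_id": "4", "_source": {"Subject": "Rev norma operativa parti correlate"}},
            {"_id": "7", "_source": {"Subject": "Lorem ipsum"}},
        ],
    },
    "aggregations": {
        "posted_dates": {
            "buckets": [
                {"key": "22/11/2011 12.53.17", "doc_count": 2},
            ]
        }
    },
}

hits, counts = summarise(raw_response, "posted_dates")
print(hits)
print(counts)
```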
  17. migration path
        query = {
            'query': {
                'bool': {
                    'filter': [{'match': {'Categories': 'A'}}],
                    'must': [{'match': {'Subject': 'Lorem'}}],
                    'must_not': [{'match': {'Body': 'amet'}}]
                }
            }
        }
        # wrap an existing raw query in a Search object
        s = Search.from_dict(query)
        …
        # and get the raw dictionary back
        query = s.to_dict()
  18. persistence example
        from elasticsearch_dsl import Date, DocType, Integer, Keyword, Text

        class email(DocType):
            subject = Text(analyzer='snowball', fields={'raw': Keyword()})
            body = Text(analyzer='snowball')
            posted_date = Date()
            lines = Integer()
            category = Text(analyzer='snowball')

            class Meta:
                index = 'emails'

            def save(self, **kwargs):
                # number of whitespace-separated tokens in the body
                self.lines = len(self.body.split())
                return super(email, self).save(**kwargs)

            def is_draft(self):
                # assumption: an email without a posted date is a draft
                return not self.posted_date
  19. persistence example
        from datetime import datetime

        # create the mappings in elasticsearch
        email.init()

        # create and save an email
        message = email(meta={'id': 42}, subject='Hello world!')
        message.body = ''' looong text '''
        message.posted_date = datetime.now()
        message.save()

        # fetch it back by id
        message = email.get(id=42)
        print(message.is_draft())
  20. Display the data
        from django.conf import settings
        from django.contrib.auth.mixins import LoginRequiredMixin
        from django.shortcuts import render
        from django.views import View
        from elasticsearch_dsl import Search

        # SenderForm and client are defined elsewhere in the project

        class SearchIndexView(LoginRequiredMixin, View):
            template_name = 'search_engine/index.html'
            form_class = SenderForm

            def get(self, request, **kwargs):
                context = {'form': self.form_class()}
                return render(request, self.template_name, context)

            def post(self, request):
                person = request.POST['person']  # the person to filter
                page = request.GET.get('page', 1)
                from_ = (int(page) - 1) * settings.ES_PAGE_SIZE
                s = Search(using=client, index="emails").filter("match", From=person)
                results = s.execute()
                context = {'form': self.form_class(request.POST),
                           'tot_results': results['hits']['total'],
                           'res': results['hits']['hits']}
                return render(request, self.template_name, context)
  21. … and what if we get thousands of results?
    Pagination is the answer
        GET /_search?size=5
        GET /_search?size=5&from=5
        GET /_search?size=5&from=10
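The three requests above follow a simple pattern: `from` is the number of hits to skip and `size` is the page length. A minimal sketch of that arithmetic, with a hypothetical helper name `page_window` and 1-based page numbers assumed:

```python
import math


def page_window(page, page_size):
    """Translate a 1-based page number into ElasticSearch from/size parameters."""
    page = max(int(page), 1)  # treat anything below 1 as the first page
    return {"from": (page - 1) * page_size, "size": page_size}


def page_count(total_hits, page_size):
    """Number of pages needed to show total_hits results."""
    return math.ceil(total_hits / page_size)


print(page_window(1, 5))   # {'from': 0, 'size': 5}
print(page_window(3, 5))   # {'from': 10, 'size': 5}
print(page_count(23, 5))   # 5
```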
  22. ES Custom Django Paginator
        from django.core.paginator import Page, Paginator

        class DSEPaginator(Paginator):
            def __init__(self, *args, **kwargs):
                super(DSEPaginator, self).__init__(*args, **kwargs)
                # take the total from ElasticSearch instead of len(object_list)
                if isinstance(self.object_list, dict):
                    self.count = self.object_list['hits']['total']
                else:
                    self.count = self.object_list.hits.total

            def page(self, number):
                # object_list is assumed to hold only the current page,
                # so it is not sliced again here
                number = self.validate_number(number)
                return Page(self.object_list, number, self)
  23. Display the data
        class SearchIndexView(LoginRequiredMixin, View):
            …
                …
                results = s.execute()
                paginator = DSEPaginator(results, settings.ES_PAGE_SIZE)
                try:
                    posts = paginator.page(page)
                except PageNotAnInteger:
                    posts = paginator.page(1)
                except EmptyPage:
                    posts = paginator.page(paginator.num_pages)
                context = {'form': self.form_class(request.POST),
                           'tot_results': results['hits']['total'],
                           'posts': posts}
                return render(request, self.template_name, context)
    settings.ES_PAGE_SIZE = 25
  24. Highlighting
        GET /_search
        {
          "query": {
            "match": { "content": "kimchy" }
          },
          "highlight": {
            "pre_tags": ["<em>"],
            "post_tags": ["</em>"],
            "fields": {
              "content": {}
            }
          }
        }
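When highlighting is requested, each hit that matched carries a `highlight` object mapping field names to lists of marked-up fragments. The sketch below collects those fragments from a raw response dictionary. The helper `collect_highlights` and the sample response are invented for illustration.

```python
def collect_highlights(raw, field):
    """Map each hit id to its highlighted fragments for one field."""
    fragments = {}
    for hit in raw["hits"]["hits"]:
        # hits without a match in this field have no entry for it
        found = hit.get("highlight", {}).get(field, [])
        if found:
            fragments[hit["_id"]] = found
    return fragments


# Made-up response with the usual hits/highlight layout.
raw_response = {
    "hits": {
        "hits": [
            {"_id": "1", "highlight": {"content": ["written by <em>kimchy</em> in 2010"]}},
            {"_id": "2"},
        ]
    }
}

snippets = collect_highlights(raw_response, "content")
print(snippets)
```

In a Django template such fragments would typically need to be marked as safe HTML so that the `<em>` tags render, which is worth doing only if the indexed text itself is trusted or escaped.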
  25.
        import elasticsearch
        from celery import shared_task

        @shared_task(bind=True, default_retry_delay=60, max_retries=3)
        def index_emails(self, pk):
            try:
                instance = Email.objects.get(pk=pk)
            except Email.DoesNotExist:
                # the row is not there (yet): try again later
                raise self.retry()
            try:
                search_email = SearchEmail.get(id=pk)
            except elasticsearch.NotFoundError:
                search_email = SearchEmail(meta={'id': pk})
            search_email.title = instance.title
            # ...
            search_email.save()
  26. What I want to see
    • Autogenerated mapping
    • Django Admin integration
    • Management commands to (re)index
    … batteries included