Slide 1

Slide 1 text

PYCON OTTO, Florence, April 6-9 2017
Indexing and searching tons of data with ElasticSearch and Django
Ernesto Arbitrio

Slide 2

Slide 2 text

whoami
“Tech researcher at FBK(.eu)”
“Python addict, system integrator, dad, husband, cooking enthusiast”
twitter: __pamaron__
email: [email protected]
github.com/ernestoarbitrio

Slide 3

Slide 3 text

the problem

Slide 4

Slide 4 text

the problem: 30 TBytes of data

Slide 5

Slide 5 text

What is ElasticSearch?
A distributed search engine
● Open source
● Document based
● JSON over HTTP protocol
● Built on top of Apache Lucene

Slide 6

Slide 6 text

What does ElasticSearch do?
● Indexes heterogeneous data
● Searches data fast
● Supports your data discovery applications

Slide 7

Slide 7 text

Document based
● JSON
● Dynamic schema / schemaless
● Relationships
  ○ Nested
  ○ Parent/child

Slide 8

Slide 8 text

Nested example
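As a rough illustration of the nested relationship, the sketch below builds the `properties` part of a mapping and a matching `nested` query as plain Python dictionaries. The `attachments` field and its sub-fields are hypothetical and not part of the talk's dataset; only the `nested` field type and the `nested` query come from Elasticsearch itself.

```python
import json

# Hypothetical mapping fragment: each email carries a list of attachments.
# Declaring the field as "nested" keeps the fields of one attachment
# together, so a query cannot match the name of one attachment and the
# size of another.
mapping = {
    "properties": {
        "Subject": {"type": "text"},
        "attachments": {
            "type": "nested",
            "properties": {
                "name": {"type": "keyword"},
                "size": {"type": "integer"},
            },
        },
    }
}


def nested_query(path, query):
    """Wrap `query` so it runs against the nested objects under `path`."""
    return {"nested": {"path": path, "query": query}}


# Emails with an attachment called report.pdf that is at least 1 KiB.
query = {
    "query": nested_query(
        "attachments",
        {
            "bool": {
                "must": [
                    {"term": {"attachments.name": "report.pdf"}},
                    {"range": {"attachments.size": {"gte": 1024}}},
                ]
            }
        },
    )
}

print(json.dumps(query, indent=2))
```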

Slide 9

Slide 9 text

Parent-child
● The parent document can be updated without reindexing the children.
● Child documents can be added, changed, or deleted without affecting either the parent or other children. This is especially useful when child documents are large in number and need to be added or changed frequently.
● Child documents can be returned as the results of a search request.
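A minimal sketch of how parent-child documents are queried, using the `has_child` and `has_parent` queries. The type names `email` and `attachment` and the `name` field are assumptions made for illustration; the talk does not say which types it used.

```python
def has_child(child_type, query):
    """Match parent documents with at least one child matching `query`."""
    return {"has_child": {"type": child_type, "query": query}}


def has_parent(parent_type, query):
    """Match child documents whose parent matches `query`."""
    return {"has_parent": {"parent_type": parent_type, "query": query}}


# Parents (emails) that own at least one matching child (attachment).
emails_with_pdf = {
    "query": has_child("attachment", {"match": {"name": "pdf"}})
}

# Children (attachments) that belong to a matching parent (email).
attachments_of_matching_emails = {
    "query": has_parent("email", {"match": {"Subject": "norma"}})
}
```

Because children are separate documents, the second query returns the attachments themselves, which is the third point on the slide.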

Slide 10

Slide 10 text

Queries (unstructured)
Core methods: match, multi_match, phrase, fuzzy, regexp, wildcard
Compound queries: bool, filtered, function score
NOT CACHED

Slide 11

Slide 11 text

Filters (structured)
Core filters: term, range, exists, geo_distance, geo_bbox, script
Compound filters: bool, and/or/not
FAST, CACHEABLE
The goal of filtering is to reduce the number of documents that have to be examined by the scoring queries.

Slide 12

Slide 12 text

When to use which
As a general rule, use query clauses for full-text search or for any condition that should affect the relevance score, and use filters for everything else.
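The rule can be sketched as a small helper that keeps scoring clauses and filters apart inside one `bool` query. The helper name `bool_query` and its parameter names are made up for this example; the field names come from the demo email used in the talk.

```python
def bool_query(scoring=(), filters=(), excluded=()):
    """Build a bool query body.

    scoring  -> "must":     clauses that contribute to the relevance score
    filters  -> "filter":   yes/no conditions, not scored and cacheable
    excluded -> "must_not": documents matching these are dropped
    """
    body = {}
    if scoring:
        body["must"] = list(scoring)
    if filters:
        body["filter"] = list(filters)
    if excluded:
        body["must_not"] = list(excluded)
    return {"query": {"bool": body}}


q = bool_query(
    # full-text condition: should influence ranking
    scoring=[{"match": {"Subject": "Lorem"}}],
    # structured condition: only narrows the candidate set
    filters=[{"exists": {"field": "PostedDate"}}],
    excluded=[{"match": {"Body": "amet"}}],
)
```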

Slide 13

Slide 13 text

Pure ElasticSearch query EXAMPLE … over HTTP, for instance
curl -XPOST "http://localhost:9200/_search" -H "Content-Type: application/json" -d '{
  "aggs": {
    "posted_dates": {"terms": {"field": "PostedDate"}}
  },
  "query": {
    "bool": {
      "filter": [{"match": {"Categories": "A"}}],
      "must": [{"match": {"Subject": "Lorem"}}],
      "must_not": [{"match": {"Body": "amet"}}]
    }
  }
}'

Slide 14

Slide 14 text

Pure python query EXAMPLE
import requests

query = {
    'aggs': {
        'posted_dates': {'terms': {'field': 'PostedDate'}}
    },
    'query': {
        'bool': {
            'filter': [{'match': {'Categories': 'A'}}],
            'must': [{'match': {'Subject': 'Lorem'}}],
            'must_not': [{'match': {'Body': 'amet'}}]
        }
    }
}
uri = "http://localhost:9200/_search"
# json= serialises the dict and sets the Content-Type header;
# data=query would send the dict form-encoded instead of as JSON
response = requests.post(uri, json=query)
results = response.json()

Slide 15

Slide 15 text

ElasticSearch
Different deployment environments: Nginx, PaaS, Thrift
RESTful APIs: 90+ endpoints, 600+ parameters, escaping, encoding

Slide 16

Slide 16 text

elasticsearch-py
● Official low-level client for Elasticsearch
● 1-to-1 REST API
● Opinion-free
● Very extendable

Slide 17

Slide 17 text

Document Example (demo email)
{
  "Abstract": "Cara Elena, ho rivisto la norma operativa i...",
  "Body": "Cara Elena, some text here…..",
  "Categories": "",
  "CopyTo": [ "[email protected]" ],
  "DeliveredDate": "22/11/2011 12.53.25",
  "DeliveryPriority": "N",
  "DocumentCreated": "22/11/2011 12.53.17",
  "DocumentUniqueId": "908268A52DDAA2CF727724498EC4AF08",
  "Form": "Memo",
  "From": "Riccardo Mauri ",
  "PostedDate": "22/11/2011 12.53.17",
  "SendTo": [ "[email protected]" ],
  "Subject": "Rev norma operativa parti correlate"
}
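The date fields in this document use a day/month/year layout with dots in the time. Unless the mapping declares a matching format, such values would likely be indexed as plain text rather than dates. One option, sketched here as an assumption and not as what the talk did, is to convert them to ISO 8601 before indexing. The helper names are hypothetical.

```python
from datetime import datetime

# Layout of the date strings in the demo email, e.g. "22/11/2011 12.53.25".
SOURCE_DATE_FORMAT = "%d/%m/%Y %H.%M.%S"
DATE_FIELDS = ("DeliveredDate", "DocumentCreated", "PostedDate")


def to_iso(value):
    """Convert one source date string to an ISO 8601 string."""
    return datetime.strptime(value, SOURCE_DATE_FORMAT).isoformat()


def normalise_dates(doc, fields=DATE_FIELDS):
    """Return a copy of `doc` with the non-empty date fields converted."""
    clean = dict(doc)
    for field in fields:
        if clean.get(field):
            clean[field] = to_iso(clean[field])
    return clean


email = {
    "Subject": "Rev norma operativa parti correlate",
    "PostedDate": "22/11/2011 12.53.17",
    "DeliveredDate": "22/11/2011 12.53.25",
}
print(normalise_dates(email))
```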

Slide 18

Slide 18 text

elasticsearch-py | example
# doc is the demo email shown earlier, abridged here:
# { "Abstract": "Cara Elena, ho rivisto", "Body": "Cara Elena, some text here…..", "Categories": "", "CopyTo": [ "[email protected]" ], … }
In [1]: from elasticsearch import Elasticsearch
In [2]: es = Elasticsearch()
In [3]: es.indices.create(index='emails', ignore=400)
Out[3]: {'acknowledged': True, 'shards_acknowledged': True}
In [4]: es.index(index="emails", doc_type="email", id=4, body=doc)
Out[4]:
{'_id': '4',
 '_index': 'emails',
 '_shards': {'failed': 0, 'successful': 1, 'total': 2},
 '_type': 'email',
 '_version': 1,
 'created': True,
 'result': 'created'}
In [5]: es.get(index="emails", doc_type="email", id=4)['_source']
Out[5]:
{'Abstract': 'Cara Elena, ho rivisto la norma operativa i...',
 'Body': 'Cara Elena, some text here…..',
 'Categories': '',
 ... ,
 'Subject': 'Rev norma operativa parti correlate'}

Slide 19

Slide 19 text

elasticsearch-py | example
res = es.search(body={
    'aggs': {
        'posted_dates': {'terms': {'field': 'PostedDate'}}
    },
    'query': {
        'bool': {
            'filter': [{'match': {'Categories': 'A'}}],
            'must': [{'match': {'Subject': 'Lorem'}}],
            'must_not': [{'match': {'Body': 'amet'}}]
        }
    }
})

Slide 20

Slide 20 text

elasticsearch-dsl
from elasticsearch_dsl import A, Q, Search

# client is an Elasticsearch() connection
s = Search(using=client, index="emails") \
    .filter("match", Categories="A") \
    .filter("match", Subject="Lorem") \
    .filter(~Q("match", Body="amet"))
agg = A('terms', field='PostedDate')
s.aggs.bucket('posted_dates', agg)
response = s.execute()

Slide 21

Slide 21 text

response
Response object
    response = s.execute()
    if not response.success():
        print("Partial results!")
Iterate and get hits
    for h in response:
        print(h.meta.id, h.subject)
Access to aggregations
    agg_date = response.aggregations.posted_dates.buckets[0]

Slide 22

Slide 22 text

migration path
query = {
    'query': {
        'bool': {
            'filter': [{'match': {'Categories': 'A'}}],
            'must': [{'match': {'Subject': 'Lorem'}}],
            'must_not': [{'match': {'Body': 'amet'}}]
        }
    }
}
# wrap an existing raw query in a Search object
s = Search.from_dict(query)
…
# and turn it back into a plain dict when needed
query = s.to_dict()

Slide 23

Slide 23 text

persistence example
from elasticsearch_dsl import Date, DocType, Integer, Keyword, Text

class email(DocType):
    subject = Text(analyzer='snowball', fields={'raw': Keyword()})
    body = Text(analyzer='snowball')
    posted_date = Date()
    lines = Integer()
    category = Text(analyzer='snowball')

    class Meta:
        index = 'emails'

    def save(self, **kwargs):
        # computed field, refreshed on every save
        self.lines = len(self.body.split())
        return super(email, self).save(**kwargs)

    def is_draft(self):
        # assumed intent: an email without a posted date is still a draft
        return self.posted_date is None

Slide 24

Slide 24 text

persistence example
from datetime import datetime

# create the mappings in elasticsearch
email.init()

# create and save an email
message = email(meta={'id': 42}, subject='Hello world!')
message.body = ''' looong text '''
message.posted_date = datetime.now()
message.save()

# fetch it back by id
message = email.get(id=42)
print(message.is_draft())

Slide 25

Slide 25 text

PYCON OTTO +

Slide 26

Slide 26 text

Display the data
class SearchIndexView(LoginRequiredMixin, View):
    template_name = 'search_engine/index.html'
    form_class = SenderForm

    def get(self, request, **kwargs):
        context = {'form': self.form_class()}
        return render(request, self.template_name, context)

    def post(self, request):
        person = request.POST['person']  # the person to filter
        page = request.GET.get('page', 1)
        from_ = (int(page) - 1) * settings.ES_PAGE_SIZE
        s = Search(using=client, index="emails").filter("match", From=person)
        # slice the search so that only the requested page is fetched
        s = s[from_:from_ + settings.ES_PAGE_SIZE]
        results = s.execute()
        context = {'form': self.form_class(request.POST),
                   'tot_results': results['hits']['total'],
                   'res': results['hits']['hits']}
        return render(request, self.template_name, context)

Slide 27

Slide 27 text

… and what if we get thousands of result records?
Pagination is the answer
GET /_search?size=5
GET /_search?size=5&from=5
GET /_search?size=5&from=10
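The three requests above follow a simple rule: `from` is the number of hits to skip, so page `n` starts at `(n - 1) * size`. A minimal sketch of that arithmetic, with hypothetical helper names:

```python
from urllib.parse import urlencode


def page_window(page, size):
    """Translate a 1-based page number into Elasticsearch size/from values."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    return {"size": size, "from": (page - 1) * size}


def search_path(page, size):
    """Build the request path for one page of results."""
    return "/_search?" + urlencode(page_window(page, size))


for page in (1, 2, 3):
    print(search_path(page, 5))
```

Note that `from` + `size` is capped by the `index.max_result_window` setting (10,000 by default), so this approach suits the first pages of a result set rather than deep scrolling.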

Slide 28

Slide 28 text

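ES Custom Django Paginator
from django.core.paginator import Page, Paginator

class DSEPaginator(Paginator):
    """Paginator over an Elasticsearch response that already holds one page of hits."""

    def __init__(self, *args, **kwargs):
        super(DSEPaginator, self).__init__(*args, **kwargs)
        # take the total from Elasticsearch instead of counting the hits in the page
        if isinstance(self.object_list, dict):
            self.count = self.object_list['hits']['total']
        else:
            self.count = self.object_list.hits.total

    def page(self, number):
        number = self.validate_number(number)
        # the response is already limited to the requested page, so no slicing here
        return Page(self.object_list, number, self)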
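The paginator depends on the total hit count reported by Elasticsearch. In the versions current at the time of the talk `hits.total` is a plain integer; from Elasticsearch 7 onwards it is an object with `value` and `relation` keys. The sketch below, with hypothetical helper names and operating on a raw response dict, reads the total in either shape and derives the page count.

```python
import math


def total_hits(response):
    """Return the total number of hits from a raw search response dict."""
    total = response["hits"]["total"]
    if isinstance(total, dict):
        # newer responses: {"value": 1234, "relation": "eq"}
        return total["value"]
    return total


def num_pages(response, page_size):
    """Number of pages needed to show every hit, page_size hits at a time."""
    return math.ceil(total_hits(response) / page_size)


old_style = {"hits": {"total": 51, "hits": []}}
new_style = {"hits": {"total": {"value": 50, "relation": "eq"}, "hits": []}}
print(num_pages(old_style, 25), num_pages(new_style, 25))
```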

Slide 29

Slide 29 text

Display the data
# settings.ES_PAGE_SIZE = 25
class SearchIndexView(LoginRequiredMixin, View):
    …
    …
        results = s.execute()
        paginator = DSEPaginator(results, settings.ES_PAGE_SIZE)
        try:
            posts = paginator.page(page)
        except PageNotAnInteger:
            # fall back to the first page
            posts = paginator.page(1)
        except EmptyPage:
            # past the end: show the last page
            posts = paginator.page(paginator.num_pages)
        context = {'form': self.form_class(request.POST),
                   'tot_results': results['hits']['total'],
                   'posts': posts}
        return render(request, self.template_name, context)

Slide 30

Slide 30 text

Highlighting
GET /_search
{
  "query" : {
    "match": { "content": "kimchy" }
  },
  "highlight" : {
    "pre_tags": [""],
    "post_tags": [""],
    "fields" : {
      "content" : {}
    }
  }
}

Slide 31

Slide 31 text

Highlighting
search().highlight('subject', 'body')
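A minimal sketch of both sides of highlighting on raw dictionaries: adding the `highlight` section to a search body, and picking the highlighted fragments out of a hit. The helper names are hypothetical, and the `<em>` tags are only the values chosen for this example.

```python
def with_highlight(body, *fields, pre_tag="<em>", post_tag="</em>"):
    """Return a copy of the search body that asks for highlighted fields."""
    out = dict(body)
    out["highlight"] = {
        "pre_tags": [pre_tag],
        "post_tags": [post_tag],
        "fields": {field: {} for field in fields},
    }
    return out


def snippet(hit, field):
    """Highlighted fragments for `field`, or the stored value if there are none."""
    fragments = hit.get("highlight", {}).get(field)
    if fragments:
        return " ... ".join(fragments)
    return hit["_source"].get(field, "")


body = with_highlight({"query": {"match": {"Subject": "norma"}}}, "Subject", "Body")

# Shape of one entry of response["hits"]["hits"] when highlighting is on.
hit = {
    "_source": {"Subject": "Rev norma operativa", "Body": "Cara Elena"},
    "highlight": {"Subject": ["Rev <em>norma</em> operativa"]},
}
print(snippet(hit, "Subject"))
print(snippet(hit, "Body"))
```

Fields without a match have no entry under `highlight`, which is why the fallback to `_source` is needed when rendering results.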

Slide 32

Slide 32 text

PROBLEMS

Slide 33

Slide 33 text

Indexing happens at request time.
Celery can help you.

Slide 34

Slide 34 text

import elasticsearch
from celery import shared_task

# Email is the Django model, SearchEmail the elasticsearch-dsl document
@shared_task(bind=True, default_retry_delay=60, max_retries=3)
def index_emails(self, pk):
    try:
        instance = Email.objects.get(pk=pk)
    except Email.DoesNotExist:
        # the row may not be committed yet: try again later
        raise self.retry()
    try:
        search_email = SearchEmail.get(id=pk)
    except elasticsearch.NotFoundError:
        search_email = SearchEmail(meta={'id': pk})
    search_email.title = instance.title
    # ...
    search_email.save()

Slide 35

Slide 35 text

What I want to see
● Autogenerated mapping
● Django Admin integration
● Management commands to (re)index
… batteries included

Slide 36

Slide 36 text

QUESTIONS?

Slide 37

Slide 37 text

THANKS!!!
Keep in touch
[email protected]
twitter.com/__pamaron__