Explore your data with Elasticsearch

Presentation at PyCon HK by Honza Král

Elasticsearch Inc

November 07, 2015

Transcript

  1. Honza Král @honzakral Explore your data with Elasticsearch

  2. Elasticsearch

  3. Distributed Search Engine: Open Source, Document-based, Based on Lucene, JSON over HTTP
  4. Document based: JSON, Dynamic Schema, Some Relationships (Nested, Parent/Child)

  5. StackOverflow Question
    {
      "id": 7635,
      "accepted_answer_id": 7641,
      "answer_count": 9,
      "title": "Are you able to close your eyes and focus/think just on your code?",
      "body": "How do I ......?",
      "comment_count": 2,
      "comments": [
        {
          "creation_date": "2010-09-27T19:31:27.200",
          "id": 9372,
          "owner": { "display_name": "sange", "id": 3092 },
          "post_id": 7635,
          "text": "I sometimes close my eyes or stare at something ....."
        },
        {......}
      ],
      "favorite_count": 2,
      "last_activity_date": "2010-09-28T00:28:08.393",
      "owner": { "display_name": "flow", "id": 3761 },
      "rating": 6,
      "tags": [ "focus", "concentration" ],
      "view_count": 368,
      "creation_date": "2010-09-27T19:16:57.757",
      "closed_date": "2011-11-13T12:12:05.937"
    }
  6. Search

  7. Full Text (unstructured): in or across fields; phrase, fuzzy, ...; scan api for data extraction; relies on analysis
  8. Filtering (structured): exact matches, ranges, geo, ...; fast; cacheable as bitsets; core filters are cached, not compound filters (bool/and/or)
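
The idea of caching core filters as bitsets can be illustrated with a minimal sketch. This is not how Lucene or Elasticsearch implement it; it assumes a tiny in-memory collection where the list position is the document id, and it uses a plain Python integer as the bitset. The documents, field names, and cache keys are all made up for illustration.

```python
# Hypothetical documents; the list position is the document id.
docs = [
    {"tag": "python", "views": 368},
    {"tag": "django", "views": 12},
    {"tag": "python", "views": 5},
    {"tag": "flask", "views": 900},
]

def to_bitset(predicate):
    """Evaluate a filter once and remember the matches as bits of an int."""
    bits = 0
    for doc_id, doc in enumerate(docs):
        if predicate(doc):
            bits |= 1 << doc_id
    return bits

def doc_ids(bits):
    """Turn a bitset back into a sorted list of document ids."""
    return [i for i in range(len(docs)) if (bits >> i) & 1]

# Each core filter is computed once and can be kept in a cache.
cache = {
    "tag:python": to_bitset(lambda d: d["tag"] == "python"),
    "views>=100": to_bitset(lambda d: d["views"] >= 100),
}

# A compound filter is a cheap bitwise operation over the cached bitsets,
# so there is little to gain from caching the compound result itself.
both = cache["tag:python"] & cache["views>=100"]
either = cache["tag:python"] | cache["views>=100"]
```

Here `doc_ids(both)` yields the documents matching both filters and `doc_ids(either)` the documents matching at least one.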
  9. Under the Hood

  10. Bible concordance: A simple form lists Biblical words alphabetically, with indications to enable the inquirer to find the passages of the Bible where the words occur. The first concordance, completed in 1230, was undertaken under the guidance of Hugo de Saint-Cher (Hugo de Sancto Charo), assisted by fellow Dominicans.
  11. Inverted Index

  12. Building an inverted index: "Django is a high-level Python Web framework that encourages rapid development and clean, pragmatic design."
    django high level python web framework encourag rapid develop clean pragmat design fast
  13. Inverted index: python file_1.txt file_2.txt file_3.txt web file_2.txt file_3.txt file_2.txt file_4.txt django file_3.txt flask jazz file_4.txt
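
A minimal sketch of how such an index could be built, assuming a very simple analysis step (lowercasing, splitting on non-alphanumerics, dropping a few stopwords) and no stemming, unlike the analyzed output shown on the previous slide. The file contents, the stopword list, and the function names are hypothetical.

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "and", "is", "that", "the"}  # tiny illustrative list

def analyze(text):
    """Lowercase, split on non-alphanumerics, drop stopwords (no stemming)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def build_index(files):
    """Map each term to the sorted list of files that contain it."""
    postings = defaultdict(set)
    for name, text in files.items():
        for term in analyze(text):
            postings[term].add(name)
    return {term: sorted(names) for term, names in postings.items()}

# Hypothetical file contents, made up for this example.
files = {
    "file_1.txt": "Python is a programming language",
    "file_2.txt": "Python web frameworks",
    "file_3.txt": "Django is a Python web framework",
    "file_4.txt": "Jazz and the blues",
}
index = build_index(files)
```

Looking up `index["python"]` then returns every file that mentions the term, without scanning the files again.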
  14. search(python AND django): python file_1.txt file_2.txt file_3.txt file_2.txt file_4.txt django file_3.txt flask jazz file_4.txt web file_2.txt file_3.txt
  15. Phrase search
    python: file_1.txt (4), file_2.txt (1, 3), file_3.txt (11, 42)
    web: file_2.txt (2), file_3.txt (10)
  16. search("python web")
    python: file_1.txt (4), file_2.txt (1, 13), file_3.txt (11, 42)
    web: file_2.txt (2), file_3.txt (10)
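
The positional postings on this slide are enough to resolve a phrase query. The following is a rough sketch, not Lucene's implementation: it copies the positions from the slide into a nested dictionary and checks that the terms of the phrase occur at consecutive positions. The function name `phrase_search` is hypothetical.

```python
# Positional postings copied from the slide: term -> file -> positions.
index = {
    "python": {"file_1.txt": [4], "file_2.txt": [1, 13], "file_3.txt": [11, 42]},
    "web": {"file_2.txt": [2], "file_3.txt": [10]},
}

def phrase_search(index, phrase):
    """Return files where the phrase terms occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Only files containing every term can match.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    hits = []
    for name in sorted(candidates):
        rest = [set(index[t][name]) for t in terms[1:]]
        # A match starts at position p if the k-th following term
        # sits at position p + k + 1.
        if any(all(p + k + 1 in s for k, s in enumerate(rest))
               for p in index[terms[0]][name]):
            hits.append(name)
    return hits

result = phrase_search(index, "python web")
```

Only file_2.txt matches: there "python" at position 1 is directly followed by "web" at position 2, while in file_3.txt "web" (10) comes before "python" (11).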
  17. Merging sorted lists.
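
Because posting lists are kept sorted, boolean queries reduce to merging sorted lists. A minimal sketch of that idea, using the "python" and "web" postings from the phrase-search slides; the function names are hypothetical and real engines add optimizations not shown here.

```python
import heapq

def intersect(a, b):
    """AND: walk two sorted posting lists in step, keeping common entries."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """OR: merge two sorted posting lists, dropping duplicates."""
    out = []
    for item in heapq.merge(a, b):
        if not out or out[-1] != item:
            out.append(item)
    return out

python = ["file_1.txt", "file_2.txt", "file_3.txt"]
web = ["file_2.txt", "file_3.txt"]

python_and_web = intersect(python, web)
python_or_web = union(python, web)
```

Both operations touch each posting at most once, so the cost grows linearly with the length of the lists.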

  18. Flexible Easily distributable

  19. Aggregations

  20. Metrics in Buckets
    Buckets: split documents into groups, can be nested
    Metrics: calculated over documents in given bucket
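
To make the split between buckets and metrics concrete, here is a small in-memory sketch in plain Python. It only mimics the concept: the student documents and the helper names (`terms_bucket`, `histogram_bucket`, `stats_metric`) are invented for this example and are not Elasticsearch APIs, and the sketch makes several passes over the data where a search engine would not need to.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical student documents, made up for this example.
docs = [
    {"state": "CA", "age": 21, "grade": 80},
    {"state": "CA", "age": 23, "grade": 90},
    {"state": "CA", "age": 27, "grade": 70},
    {"state": "NY", "age": 22, "grade": 60},
]

def terms_bucket(docs, field):
    """One bucket per distinct field value."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[field]].append(doc)
    return dict(buckets)

def histogram_bucket(docs, field, interval):
    """One bucket per interval, keyed by the lower bound of the interval."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[field] // interval * interval].append(doc)
    return dict(buckets)

def stats_metric(docs, field):
    """A metric: numbers calculated over the documents of one bucket."""
    values = [doc[field] for doc in docs]
    return {"count": len(values), "min": min(values), "max": max(values),
            "avg": mean(values), "sum": sum(values)}

# Buckets nest: states first, then age groups, with a metric at the leaves.
result = {
    state: {
        age: stats_metric(group, "grade")
        for age, group in histogram_bucket(state_docs, "age", 5).items()
    }
    for state, state_docs in terms_bucket(docs, "state").items()
}
```

`result["CA"][20]` then holds the grade statistics for students in CA aged 20 to 24.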
  21. Buckets
    terms: bucket per field value - "category"
    significant terms: terms specific for this bucket - "uncommonly common"
    range: per range - "age"
    geo_range/geohash_grid: distance ranges
    (date_)histogram: buckets per time interval - "daily"
    ...
  22. Metrics
    count/sum/avg/min/max/...
    (extended) stats: including std deviation, sum of squares etc
    top_hits
    cardinality
    percentiles
    ...
  23. Mix and Match

  24. Example
    "aggs" : {
      "states" : {
        "terms" : { "field" : "state" },
        "aggs" : {
          "age_groups" : {
            "histogram" : { "field" : "age", "interval" : 5 },
            "aggs" : {
              "grades" : { "stats" : { "field" : "grade" } },
              "gender" : {
                "terms" : { "field" : "male", "script" : "_value == 'T' ? 'M' : 'F'" },
                "aggs" : {
                  "grades" : { "stats" : { "field" : "grade" } }
                }...
    Analyze the grades per state Analyze per age_group Stats per state & age_group Stats per state, age_group & gender
  25. Example - Python DSL
    from elasticsearch_dsl import Search
    s = Search()
    s.aggs.bucket('states', 'terms', field='state') \
        .bucket('age_groups', 'histogram', field='age', interval=5) \
        .metric('grades', 'stats', field='grade')
    s.aggs['states']['age_groups'] \
        .bucket('gender', 'terms', field='gender') \
        .metric('grades', 'stats', field='grade')
    Analyze the grades per state Analyze per age_group Stats per state & age_group Stats per state, age_group & gender
  26. in near real-time Calculated in one pass

  27. Putting it all together Examples

  28. Faceted Navigation

  29. None
  30. Faceted Search - Python
    from elasticsearch_dsl import *
    class LibrarySearch(FacetedSearch):
        doc_types = [Book, Magazine]
        index = 'library'
        fields = ['tags', 'title', 'description', 'author.*']
        facets = {
            'tags': TermsFacet(field='tags'),
            'years': DateHistogramFacet(
                field='published_date',
                interval='year'
            )
        }
  31. (more @ 16:45) Log Analysis

  32. Kibana (+logstash data)

  33. Recommendations

  34. Example: recommendations
    Users represented as documents
    Artists represented as terms
    [Diagram: user A and user B, each linked by "artist likes" to artists among Artist A, Artist B, Artist C, Artist D]
  35. Simple recommendation
    s = Search()
    # get users that like the same artists
    s = s.query('terms', artists=user_likes)
    # get the most popular I don't know yet
    s.aggs.bucket('popular', 'terms', field='artists', exclude=user_likes)
    Popular != Relevant
  36. Better recommendation
    s = Search()
    # get users that like the same artists
    s = s.query('terms', artists=user_likes)
    # get the artists that are specific
    s.aggs.bucket('significant', 'significant_terms', field='artists', exclude=user_likes)
    Use the relevancy!
  37. Significant terms: Use the term stats. Compare to background. Also as nested aggregation.
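
The difference between "popular" and "significant" can be sketched in plain Python. The score below is a deliberately simple ratio (share of the term among matching users divided by its share among all users); it is an assumption chosen for illustration and not the heuristic Elasticsearch actually uses. The users, artist names, and function names are all hypothetical.

```python
from collections import Counter

# Hypothetical users (documents) and the artists they like (terms).
users = {
    "u1": {"Artist A", "Artist B", "Pop Star"},
    "u2": {"Artist A", "Artist B", "Pop Star"},
    "u3": {"Artist A", "Pop Star"},
    "u4": {"Artist C", "Pop Star"},
    "u5": {"Artist C", "Pop Star"},
    "u6": {"Pop Star"},
    "u7": {"Artist C", "Pop Star"},
    "u8": {"Pop Star"},
}

def term_counts(user_ids, exclude):
    """Count how many of the given users like each artist."""
    counts = Counter()
    for uid in user_ids:
        counts.update(users[uid] - exclude)
    return counts

def recommend(user_likes):
    # Foreground: users that like at least one of the same artists.
    foreground = [u for u, likes in users.items() if likes & user_likes]
    fg = term_counts(foreground, user_likes)
    # Background: term statistics over the whole collection.
    bg = term_counts(users, user_likes)

    popular = [artist for artist, _ in fg.most_common()]

    def score(artist):
        # How much more common is the artist here than in general?
        return (fg[artist] / len(foreground)) / (bg[artist] / len(users))

    significant = sorted(fg, key=score, reverse=True)
    return popular, significant

popular, significant = recommend({"Artist A"})
```

In this made-up data everybody likes "Pop Star", so it tops the popular list, while "Artist B" ranks first by significance because it is liked almost only by fans of "Artist A".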
  38. Super-connected nodes in graphs: We just figured out the way to surf only the meaningful connections in a graph!
    [Diagram: Concept A, Concept B, Concept C, with connections labelled "useful" and "useless"]
  39. Extras

  40. Percolator: Reversed search; "Which queries match this document?"; Classification; Language detection; Location; Alerts; Stored search; Live search
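
The "reversed search" idea can be shown with a toy sketch: queries are stored up front and each incoming document is matched against them. This only illustrates the concept and does not use the Elasticsearch percolator API; the stored queries are simple sets of required terms, and every name below is hypothetical.

```python
# Stored searches: name -> set of terms that must all appear (hypothetical).
stored_queries = {
    "python-jobs": {"python", "job"},
    "django-news": {"django"},
    "jazz-alerts": {"jazz", "concert"},
}

def percolate(text):
    """Return the names of the stored queries that match this document."""
    tokens = set(text.lower().split())
    return sorted(name for name, terms in stored_queries.items()
                  if terms <= tokens)

matches = percolate("Remote Python job working on Django")
```

Each name in `matches` could then trigger an alert or assign a class to the document.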
  41. Suggesters: terms, phrase; "Did you mean?"; context aware; completion; as-you-type; FAST!; custom score
  42. Distributed model
    Cluster: Collection of Nodes
    Index: Collection of Shards
    Shard: Unit of scale, Distributed across cluster, Primary and replica
    [Diagram: numbered shards of the "orders" and "products" indices spread across node 1, node 2 and node 3]
  43. Honza Král @honzakral Thanks!