Explore your data with Elasticsearch

Presentation at PyCon HK by Honza Král

Elasticsearch Inc

November 07, 2015

Transcript

  1. StackOverflow Question

    {
      "id": 7635,
      "accepted_answer_id": 7641,
      "answer_count": 9,
      "title": "Are you able to close your eyes and focus/think just on your code?",
      "body": "How do I ......?",
      "comment_count": 2,
      "comments": [{
        "creation_date": "2010-09-27T19:31:27.200",
        "id": 9372,
        "owner": { "display_name": "sange", "id": 3092 },
        "post_id": 7635,
        "text": "I sometimes close my eyes or stare at something ....."
      }, {......}],
      "favorite_count": 2,
      "last_activity_date": "2010-09-28T00:28:08.393",
      "owner": { "display_name": "flow", "id": 3761 },
      "rating": 6,
      "tags": [ "focus", "concentration" ],
      "view_count": 368,
      "creation_date": "2010-09-27T19:16:57.757",
      "closed_date": "2011-11-13T12:12:05.937"
    }
  2. Full Text (unstructured)

    in or across fields; phrase, fuzzy, ...; scan API for data extraction; relies on analysis
  3. Filtering (structured)

    exact matches, ranges, geo, ...; fast; cacheable as bitsets; core filters are cached, not compound filters (bool/and/or)
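To make the "cacheable as bitsets" point concrete, here is a minimal pure-Python sketch. It is only an illustration of the idea, not how Elasticsearch or Lucene implement filters internally: the documents, the cache keys and the helper names `build_bitset` and `matching_ids` are all made up for this example. Each simple ("core") filter is evaluated once into a bitset, and a compound filter is then a cheap bitwise combination of the cached bitsets.

```python
def build_bitset(docs, predicate):
    """Return an int whose bit i is set when docs[i] matches the predicate."""
    bits = 0
    for position, doc in enumerate(docs):
        if predicate(doc):
            bits |= 1 << position
    return bits


def matching_ids(docs, bits):
    """Translate a bitset back into the ids of the matching documents."""
    return [doc["id"] for position, doc in enumerate(docs) if (bits >> position) & 1]


# Hypothetical documents, loosely modelled on the question document in slide 1.
docs = [
    {"id": 1, "tags": ["focus"], "rating": 3},
    {"id": 2, "tags": ["python"], "rating": 2},
    {"id": 3, "tags": ["focus", "python"], "rating": 9},
]

# Each simple filter is evaluated once; its bitset is kept for reuse.
cache = {
    "tag:focus": build_bitset(docs, lambda d: "focus" in d["tags"]),
    "tag:python": build_bitset(docs, lambda d: "python" in d["tags"]),
    "rating>=5": build_bitset(docs, lambda d: d["rating"] >= 5),
}

# Compound filters are not cached themselves; they combine cached bitsets.
focus_and_rated = cache["tag:focus"] & cache["rating>=5"]
focus_or_python = cache["tag:focus"] | cache["tag:python"]

print(matching_ids(docs, focus_and_rated))  # [3]
print(matching_ids(docs, focus_or_python))  # [1, 2, 3]
```

Because the combination step is just a bitwise AND or OR, caching only the simple filters already makes any compound of them cheap to evaluate.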
  4. Bible concordance

    A simple form lists Biblical words alphabetically, with indications to enable the inquirer to find the passages of the Bible where the words occur. The first concordance, completed in 1230, was undertaken under the guidance of Hugo de Saint-Cher (Hugo de Sancto Charo), assisted by fellow Dominicans.
  5. Building an inverted index

    "Django is a high-level Python Web framework that encourages rapid development and clean, pragmatic design."

    django high level python web framework encourag rapid develop clean pragmat design fast
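The analysis step on this slide can be sketched in a few lines of plain Python. This is a simplified illustration, not Elasticsearch's analyzer: it lowercases, splits on non-alphanumeric characters and drops a tiny hand-picked stopword list, but it does not stem (so it keeps "encourages" where the slide shows "encourag"). The second document and the names `analyze`, `build_inverted_index` and `search` are assumptions made for the example.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption, not Elasticsearch's default).
STOPWORDS = {"a", "and", "is", "that", "the"}


def analyze(text):
    """Lowercase, split on non-alphanumerics, drop stopwords. No stemming."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]


def build_inverted_index(docs):
    """Map every term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term of the analyzed query."""
    postings = [index.get(term, set()) for term in analyze(query)]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "Django is a high-level Python Web framework that encourages "
       "rapid development and clean, pragmatic design.",
    2: "Flask is a lightweight Python web framework.",  # made-up second document
}
index = build_inverted_index(docs)

print(analyze(docs[1]))
print(search(index, "Python framework"))  # {1, 2}
print(search(index, "pragmatic Django"))  # {1}
```

The query goes through the same `analyze` function as the documents, which is why full-text search "relies on analysis": terms only match if both sides are normalized the same way.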
  6. Metrics in Buckets

    Buckets: split documents into groups; can be nested
    Metrics: calculated over documents in given bucket
  7. Buckets

    terms: bucket per field value - "category"
    significant terms: terms specific for this bucket - "uncommonly common"
    range: per range - "age"
    geo_range/geohash_grid: distance ranges
    (date_)histogram: buckets per time interval - "daily"
    ...
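The bucket and metric concepts can be imitated over an in-memory list, which may help before reading the real aggregation request. The sketch below is plain Python, not the Elasticsearch API: `terms_buckets`, `histogram_buckets` and `stats_metric` are hypothetical helpers named after the aggregation types, and the student records are invented. It nests a histogram bucket inside a terms bucket and computes a stats metric per innermost bucket.

```python
from collections import defaultdict


def terms_buckets(docs, field):
    """One bucket per distinct value of the field."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[field]].append(doc)
    return dict(buckets)


def histogram_buckets(docs, field, interval):
    """One bucket per fixed-size interval, keyed by the interval's lower bound."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[(doc[field] // interval) * interval].append(doc)
    return dict(buckets)


def stats_metric(docs, field):
    """Metric calculated over the documents of a single bucket."""
    values = [doc[field] for doc in docs]
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
        "sum": sum(values),
    }


# Invented sample data.
students = [
    {"state": "NY", "age": 11, "grade": 80},
    {"state": "NY", "age": 13, "grade": 90},
    {"state": "NY", "age": 17, "grade": 70},
    {"state": "CA", "age": 12, "grade": 60},
]

# Buckets nest: state -> age group (5-year interval) -> stats over grades.
result = {
    state: {
        age_group: stats_metric(group, "grade")
        for age_group, group in histogram_buckets(in_state, "age", 5).items()
    }
    for state, in_state in terms_buckets(students, "state").items()
}

print(result["NY"][10])  # {'count': 2, 'min': 80, 'max': 90, 'avg': 85.0, 'sum': 170}
```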
  8. Example

    "aggs" : {
      "states" : {
        "terms" : { "field" : "state" },
        "aggs" : {
          "age_groups" : {
            "histogram" : { "field" : "age", "interval" : 5 },
            "aggs" : {
              "grades" : { "stats" : { "field" : "grade" } },
              "gender" : {
                "terms" : { "field" : "male", "script" : "_value == 'T' ? 'M' : 'F'" },
                "aggs" : {
                  "grades" : { "stats" : { "field" : "grade" } }
                }...

    Analyze the grades per state
    Analyze per age_group
    Stats per state & age_group
    Stats per state, age_group & gender
  9. Example - Python DSL

    from elasticsearch_dsl import Search

    s = Search()
    s.aggs.bucket('states', 'terms', field='state') \
        .bucket('age_groups', 'histogram', field='age', interval=5) \
        .metric('grades', 'stats', field='grade')
    s.aggs['states']['age_groups'] \
        .bucket('gender', 'terms', field='gender') \
        .metric('grades', 'stats', field='grade')

    Analyze the grades per state
    Analyze per age_group
    Stats per state & age_group
    Stats per state, age_group & gender
  10. Faceted Search - Python

    from elasticsearch_dsl import *

    class LibrarySearch(FacetedSearch):
        doc_types = [Book, Magazine]
        index = 'library'
        fields = ['tags', 'title', 'description', 'author.*']
        facets = {
            'tags': TermsFacet(field='tags'),
            'years': DateHistogramFacet(
                field='published_date',
                interval='year'
            )
        }
  11. Example: recommendations

    Users represented as documents
    Artists represented as terms

    [Diagram: user A and user B, each linked by "artist likes" to artists among Artist A, Artist B, Artist C and Artist D]
  12. Simple recommendation

    s = Search()
    # get users that like the same artists
    s = s.query('terms', artists=user_likes)
    # get the most popular artists I don't know yet
    s.aggs.bucket('popular', 'terms', field='artists', exclude=user_likes)

    Popular != Relevant
  13. Better recommendation

    s = Search()
    # get users that like the same artists
    s = s.query('terms', artists=user_likes)
    # get the artists that are specific
    s.aggs.bucket('significant', 'significant_terms', field='artists', exclude=user_likes)

    Use the relevancy!
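The difference between "popular" and "significant" can be shown on a toy data set without a cluster. The sketch below is a deliberate simplification: it scores an artist by the ratio of its frequency among like-minded users (the foreground) to its frequency among all users (the background). Elasticsearch's `significant_terms` aggregation uses its own scoring heuristics, so treat the ratio, the helper names `popular` and `significant`, and the sample users as assumptions for illustration only.

```python
from collections import Counter


def popular(foreground, exclude):
    """Rank artists by raw count among the like-minded users."""
    counts = Counter(a for user in foreground for a in user if a not in exclude)
    return counts.most_common()


def significant(foreground, background, exclude):
    """Rank artists by how much more frequent they are in the foreground
    than in the background (a simplified stand-in for significant_terms)."""
    fg = Counter(a for user in foreground for a in user if a not in exclude)
    bg = Counter(a for user in background for a in user)
    scores = {
        artist: (count / len(foreground)) / (bg[artist] / len(background))
        for artist, count in fg.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Users are documents, artists are terms (invented sample data).
all_users = [
    {"Artist A", "Artist B", "Artist D"},
    {"Artist A", "Artist B", "Artist D"},
    {"Artist A", "Artist D"},
    {"Artist C", "Artist D"},
    {"Artist C", "Artist D"},
    {"Artist D"},
]
user_likes = {"Artist A"}

# Foreground: users that like at least one of the same artists.
like_minded = [user for user in all_users if user & user_likes]

print(popular(like_minded, user_likes)[0][0])                 # Artist D
print(significant(like_minded, all_users, user_likes)[0][0])  # Artist B
```

Artist D tops the popularity ranking only because everybody likes it, while Artist B is liked almost exclusively by fans of Artist A, which is the "uncommonly common" signal the better recommendation relies on.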
  14. Super-connected nodes in graphs

    We just figured out the way to surf only the meaningful connections in a graph!

    [Diagram: Concept A, Concept B and Concept C, with connections labelled "useful" and "useless"]
  15. Distributed model

    Cluster: collection of nodes
    Index: collection of shards
    Shard: unit of scale; distributed across cluster; primary and replica

    [Diagram: numbered shards of the "orders" and "products" indices spread across node 1, node 2 and node 3]
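As a rough sketch of how documents and shards could be spread out, the example below routes a document to a shard by hashing its routing key modulo the number of primary shards, and places each replica on a different node than its primary. Both parts are simplified assumptions: `zlib.crc32` is a stand-in for the hash Elasticsearch actually uses, the round-robin placement in `place_shards` is invented for illustration, and real shard allocation takes many more factors into account.

```python
import zlib


def shard_for(routing_key, number_of_primary_shards):
    """Pick a shard as hash(routing key) modulo the number of primary shards.
    crc32 is only a stand-in hash function for this sketch."""
    return zlib.crc32(routing_key.encode("utf-8")) % number_of_primary_shards


def place_shards(number_of_primary_shards, nodes):
    """Hypothetical round-robin placement with one replica per shard.
    Needs at least two nodes so a replica never shares a node with its primary."""
    placement = {node: [] for node in nodes}
    for shard in range(number_of_primary_shards):
        primary_node = nodes[shard % len(nodes)]
        replica_node = nodes[(shard + 1) % len(nodes)]
        placement[primary_node].append((shard, "primary"))
        placement[replica_node].append((shard, "replica"))
    return placement


nodes = ["node 1", "node 2", "node 3"]
placement = place_shards(4, nodes)
for node, shards in placement.items():
    print(node, shards)

# The same routing key always lands on the same shard.
print(shard_for("order-42", 4) == shard_for("order-42", 4))  # True
```

Because the shard is derived from the number of primary shards, changing that number would send existing keys to different shards, which is one way to see why the shard is the unit of scale.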