Explore your data with Elasticsearch

Presentation at PyCon HK by Honza Král

Elasticsearch Inc

November 07, 2015

Transcript

  1. Honza Král

    @honzakral
    Explore your data with Elasticsearch

  2. Elasticsearch

  3. Distributed Search Engine
    Open Source


    Document-based


    Based on Lucene


    JSON over HTTP
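
As a minimal sketch of what "JSON over HTTP" looks like from a client, the snippet below builds (but does not send) a search request using only the Python standard library. The host `localhost:9200`, the index name `stackoverflow` and the query text are assumptions made for illustration.

```python
import json
import urllib.request

# Hypothetical endpoint: a local node and an index called "stackoverflow".
url = "http://localhost:9200/stackoverflow/_search"

# The request body is plain JSON describing the query.
body = {"query": {"match": {"title": "focus"}}}

request = urllib.request.Request(
    url,
    data=json.dumps(body).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Sending the request needs a running cluster, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     hits = json.load(response)["hits"]["hits"]
print(request.get_method(), request.full_url)
```

Any HTTP client in any language can talk to the cluster the same way; the official Python clients mostly wrap this pattern.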

  4. Document based
    JSON

    Dynamic Schema

    Some Relationships

    Nested

    Parent/Child

  5. StackOverflow Question
    {
      "id": 7635,
      "accepted_answer_id": 7641,
      "answer_count": 9,
      "title": "Are you able to close your eyes and focus/think just on your code?",
      "body": "How do I ......?",
      "comment_count": 2,
      "comments": [{
        "creation_date": "2010-09-27T19:31:27.200",
        "id": 9372,
        "owner": { "display_name": "sange", "id": 3092 },
        "post_id": 7635,
        "text": "I sometimes close my eyes or stare at something ....."
      }, {......}],
      "favorite_count": 2,
      "last_activity_date": "2010-09-28T00:28:08.393",
      "owner": { "display_name": "flow", "id": 3761 },
      "rating": 6,
      "tags": [ "focus", "concentration" ],
      "view_count": 368,
      "creation_date": "2010-09-27T19:16:57.757",
      "closed_date": "2011-11-13T12:12:05.937"
    }

  6. Search

  7. Full Text (unstructured)
    in or across fields

    phrase, fuzzy, ...

    scan api for data extraction

    relies on analysis
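
The query types listed above can be sketched as request bodies. The following is an illustration only: the field names `title` and `body` come from the sample question document, while the query texts are made up.

```python
import json

# Search a single field; the query text is analyzed like the indexed text.
in_one_field = {"query": {"match": {"title": "close your eyes"}}}

# Search across several fields at once.
across_fields = {
    "query": {
        "multi_match": {
            "query": "close your eyes",
            "fields": ["title", "body"],
        }
    }
}

# Require the words to appear next to each other, in this order.
phrase = {"query": {"match_phrase": {"body": "close my eyes"}}}

# Tolerate small typos in the query text.
fuzzy = {
    "query": {"match": {"title": {"query": "concentraton", "fuzziness": "AUTO"}}}
}

for request_body in (in_one_field, across_fields, phrase, fuzzy):
    print(json.dumps(request_body))
```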

  8. Filtering (structured)
    exact matches, ranges, geo, ...

    fast

    cacheable as bitsets

    core filters are cached, not compound filters (bool/and/or)
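
A hedged sketch of a request that mixes a scored full-text query with structured filters. It assumes the `bool` query with a `filter` clause, as available in Elasticsearch 2.0 and later (older releases expressed the same idea with a `filtered` query). Field names are taken from the sample question document; the values are invented.

```python
import json

filtered_search = {
    "query": {
        "bool": {
            # Scored full-text part.
            "must": [{"match": {"title": "focus"}}],
            # Structured yes/no conditions; they do not affect the score.
            "filter": [
                {"term": {"tags": "concentration"}},
                {"range": {"view_count": {"gte": 100}}},
                {"range": {"creation_date": {"gte": "2010-01-01",
                                             "lt": "2011-01-01"}}},
            ],
        }
    }
}

print(json.dumps(filtered_search, indent=2))
```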

  9. Under the Hood

  10. Bible concordance
    A simple form lists Biblical words alphabetically, with indications
    to enable the inquirer to find the passages of the Bible where
    the words occur.
    The first concordance, completed in 1230, was undertaken
    under the guidance of Hugo de Saint-Cher (Hugo de Sancto
    Charo), assisted by fellow Dominicans.

  11. Inverted Index

  12. Building an inverted index
    "Django is a high-level Python Web framework that encourages rapid
    development and clean, pragmatic design."
    django high level python web framework encourag rapid develop clean pragmat design
    fast
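
The analysis step that turns the sentence into index terms can be approximated in a few lines. This is a toy sketch: the stopword list and the suffix rules are assumptions chosen to handle this one sentence, not the analyzer Elasticsearch ships with.

```python
import re

STOPWORDS = {"a", "and", "is", "that"}  # tiny illustrative list
SUFFIXES = ("ment", "es", "ic", "s")    # toy stemming rules, not a real stemmer


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, split on non-alphanumerics, drop stopwords, stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]


text = ("Django is a high-level Python Web framework that encourages rapid "
        "development and clean, pragmatic design.")
print(" ".join(analyze(text)))
```

The same analysis has to be applied to the query text, otherwise a search for "encourages" would never meet the indexed term "encourag".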

  13. Inverted index
    python  file_1.txt file_2.txt file_3.txt
    web     file_2.txt file_3.txt
    django  file_2.txt file_4.txt
    flask   file_3.txt
    jazz    file_4.txt
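
A minimal sketch of how such an index could be built. The document contents are hypothetical, picked so that the resulting postings agree with the slide (reading the "django" row as file_2.txt and file_4.txt, and the "flask" row as file_3.txt).

```python
from collections import defaultdict

# Hypothetical, already analyzed documents chosen to reproduce the slide.
documents = {
    "file_1.txt": ["python"],
    "file_2.txt": ["python", "web", "django"],
    "file_3.txt": ["python", "web", "flask"],
    "file_4.txt": ["django", "jazz"],
}


def build_inverted_index(docs):
    """Map each term to the sorted list of documents containing it."""
    postings = defaultdict(set)
    for name, tokens in docs.items():
        for token in tokens:
            postings[token].add(name)
    return {term: sorted(names) for term, names in sorted(postings.items())}


index = build_inverted_index(documents)
for term, names in index.items():
    print(term, *names)
```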

  14. search(python AND django)
    python  file_1.txt file_2.txt file_3.txt
    django  file_2.txt file_4.txt
    flask   file_3.txt
    jazz    file_4.txt
    web     file_2.txt file_3.txt

  15. Phrase search
    python  file_1.txt (4)
            file_2.txt (1, 3)
            file_3.txt (11, 42)
    web     file_2.txt (2)
            file_3.txt (10)

  16. search("python web")
    python  file_1.txt (4)
            file_2.txt (1, 13)
            file_3.txt (11, 42)
    web     file_2.txt (2)
            file_3.txt (10)
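
Phrase matching on a positional index can be sketched as follows, using the positions shown on this slide. The function name and data layout are illustrative assumptions, not Lucene's actual implementation.

```python
# Positional postings from the slide: term -> {document: positions}.
positions = {
    "python": {"file_1.txt": [4], "file_2.txt": [1, 13], "file_3.txt": [11, 42]},
    "web": {"file_2.txt": [2], "file_3.txt": [10]},
}


def phrase_search(index, phrase):
    """Return documents where the phrase terms occur at consecutive positions."""
    terms = phrase.split()
    if not terms:
        return []
    postings = [index.get(term, {}) for term in terms]
    # Only documents containing every term can match.
    candidates = set(postings[0]).intersection(*postings[1:])
    matches = []
    for doc in sorted(candidates):
        # The phrase starts at position p if term i sits at position p + i.
        if any(
            all(start + i in postings[i][doc] for i in range(len(terms)))
            for start in postings[0][doc]
        ):
            matches.append(doc)
    return matches


print(phrase_search(positions, "python web"))  # ['file_2.txt']
print(phrase_search(positions, "web python"))  # ['file_3.txt']
```

Both files contain both words, but only file_2.txt has "python" directly followed by "web" (positions 1 and 2); in file_3.txt the order is reversed (positions 10 and 11).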

  17. Merging sorted lists.
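
An AND search such as search(python AND django) reduces to intersecting two sorted posting lists. A minimal two-pointer sketch, assuming the "django" postings are file_2.txt and file_4.txt:

```python
def intersect_sorted(left, right):
    """Intersect two sorted posting lists in a single linear pass."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            result.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return result


python_docs = ["file_1.txt", "file_2.txt", "file_3.txt"]
django_docs = ["file_2.txt", "file_4.txt"]

print(intersect_sorted(python_docs, django_docs))  # ['file_2.txt']
```

Because each list is walked only once, the cost grows with the combined length of the lists rather than with the size of the whole collection.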

  18. Flexible
    Easily distributable

  19. Aggregations

  20. Metrics in Buckets
    Buckets

    split documents into groups

    can be nested

    Metrics

    calculated over documents in given bucket
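
The split between buckets and metrics can be illustrated in plain Python. The documents below are made up; the field names `state` and `grade` mirror the aggregation example used later in the deck.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical documents.
docs = [
    {"state": "CA", "grade": 80},
    {"state": "CA", "grade": 90},
    {"state": "NY", "grade": 70},
]

# Bucketing step: one bucket per distinct value of "state" (like a terms bucket).
buckets = defaultdict(list)
for doc in docs:
    buckets[doc["state"]].append(doc)

# Metric step: computed over the documents of each bucket.
result = {
    state: {
        "doc_count": len(members),
        "avg_grade": mean(d["grade"] for d in members),
    }
    for state, members in buckets.items()
}
print(result)
```

Nesting buckets simply repeats the bucketing step on the members of each bucket before the metrics are computed.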

  21. Buckets
    terms

    bucket per field value - "category"

    significant terms

    terms specific for this bucket - "uncommonly common"

    range

    per range - "age"

    geo_distance/geohash_grid

    distance ranges

    (date_)histogram

    buckets per time interval - "daily"

    ...

  22. Metrics
    count/sum/avg/min/max/...

    (extended) stats

    including std deviation, sum of squares etc

    top_hits

    cardinality

    percentiles

    ...

  23. Mix and Match

  24. Example
    "aggs" : {
      "states" : {
        "terms" : {
          "field" : "state"
        },
        "aggs" : {
          "age_groups" : {
            "histogram" : { "field" : "age", "interval" : 5 },
            "aggs" : {
              "grades" : {
                "stats" : { "field" : "grade" }
              },
              "gender" : {
                "terms" : {
                  "field" : "male",
                  "script" : "_value == 'T' ? 'M' : 'F'"
                },
                "aggs" : {
                  "grades" : {
                    "stats" : { "field" : "grade" }
                  }
                }...
    Analyze the grades per state
    Analyze per age_group
    Stats per state & age_group
    Stats per state, age_group & gender

  25. Example - Python DSL
    from elasticsearch_dsl import Search
    s = Search()
    s.aggs.bucket('states', 'terms', field='state') \
        .bucket('age_groups', 'histogram',
                field='age', interval=5) \
        .metric('grades', 'stats', field='grade')
    s.aggs['states']['age_groups'] \
        .bucket('gender', 'terms', field='gender') \
        .metric('grades', 'stats', field='grade')
    Analyze the grades per state
    Analyze per age_group
    Stats per state & age_group
    Stats per state, age_group & gender
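
Reading the result of such a nested aggregation means walking nested bucket lists. The response fragment below is hypothetical and heavily trimmed (only one state, no gender level); it merely follows the usual shape of an "aggregations" section.

```python
# Hypothetical, trimmed "aggregations" section of a search response.
aggregations = {
    "states": {
        "buckets": [
            {
                "key": "CA",
                "doc_count": 3,
                "age_groups": {
                    "buckets": [
                        {
                            "key": 15,
                            "doc_count": 2,
                            "grades": {"count": 2, "min": 70.0, "max": 90.0,
                                       "avg": 80.0, "sum": 160.0},
                        },
                        {
                            "key": 20,
                            "doc_count": 1,
                            "grades": {"count": 1, "min": 60.0, "max": 60.0,
                                       "avg": 60.0, "sum": 60.0},
                        },
                    ]
                },
            }
        ]
    }
}

# Flatten the nesting into (state, age group, average grade) rows.
rows = []
for state in aggregations["states"]["buckets"]:
    for age_group in state["age_groups"]["buckets"]:
        rows.append((state["key"], age_group["key"], age_group["grades"]["avg"]))

for row in rows:
    print(row)
```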

  26. Calculated in one pass
    in near real-time

  27. Putting it all together
    Examples

  28. Faceted Navigation

  29. (slide without text)

  30. Faceted Search - Python
    from elasticsearch_dsl import FacetedSearch, TermsFacet, DateHistogramFacet
    # Book and Magazine are document classes assumed to be defined elsewhere.
    class LibrarySearch(FacetedSearch):
        doc_types = [Book, Magazine]
        index = 'library'
        fields = ['tags', 'title', 'description', 'author.*']
        facets = {
            'tags': TermsFacet(field='tags'),
            'years': DateHistogramFacet(
                field='published_date',
                interval='year'
            )
        }

  31. Log Analysis
    (more @ 16:45)

  32. Kibana (+logstash data)

  33. Recommendations

  34. Example: recommendations
    [Diagram: user A and user B connected to Artist A, Artist B, Artist C and Artist D; labels: "artist", "likes"]
    Users represented as documents
    Artists represented as terms

  35. Simple recommendation
    s = Search()
    # get users that like the same artists
    s = s.query('terms', artists=user_likes)
    # get the most popular artists the user doesn't know yet
    s.aggs.bucket('popular', 'terms',
                  field='artists', exclude=user_likes)
    Popular != Relevant

  36. Better recommendation
    s = Search()
    # get users that like the same artists
    s = s.query('terms', artists=user_likes)
    # get the artists that are specific to this group of users
    s.aggs.bucket('significant', 'significant_terms',
                  field='artists', exclude=user_likes)
    Use the relevancy!

  37. Significant terms
    Use the term stats

    Compare to background

    Also as nested aggregation
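
The idea of comparing term statistics against a background can be sketched with a simple score. The formula below multiplies the absolute and the relative change in frequency, in the spirit of the JLH heuristic; it is a simplified illustration with made-up counts and artist names, not the exact computation Elasticsearch performs.

```python
def jlh_score(fg_count, fg_total, bg_count, bg_total):
    """Score a term by how much more frequent it is in the foreground set."""
    fg = fg_count / fg_total
    bg = bg_count / bg_total
    if bg == 0 or fg <= bg:
        return 0.0
    # Absolute change times relative change.
    return (fg - bg) * (fg / bg)


# Hypothetical counts: (users in the matching set, users overall) per artist.
FOREGROUND_TOTAL, BACKGROUND_TOTAL = 100, 10_000
artists = {
    "Globally Popular Band": (60, 5_000),
    "Niche Band": (30, 100),
}

scores = {
    name: jlh_score(fg, FOREGROUND_TOTAL, bg, BACKGROUND_TOTAL)
    for name, (fg, bg) in artists.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['Niche Band', 'Globally Popular Band']
```

A plain terms aggregation would rank the globally popular band first (60 likes versus 30); comparing against the background puts the niche band on top because it is "uncommonly common" among similar users.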

  38. Super-connected nodes in graphs
    We just figured out the way to surf only the meaningful connections in a graph!
    [Diagram: Concept A, Concept B and Concept C, with connections labelled "useful" and "useless"]

  39. Extras

  40. Percolator
    Reversed search

    "Which queries match this document?"

    Classification

    Language detection

    Location

    Alerts

    Stored search

    Live search
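
The "reversed search" idea can be sketched without a cluster: queries are stored up front, and each incoming document is tested against them. The query names and predicates below are hypothetical, and real percolator queries are written in the query DSL rather than as Python functions.

```python
# Hypothetical registered queries: name -> predicate over a document.
stored_queries = {
    "python-alert": lambda doc: "python" in doc["tags"],
    "popular": lambda doc: doc["view_count"] >= 1000,
    "unanswered": lambda doc: doc["answer_count"] == 0,
}


def percolate(doc):
    """Return the names of all stored queries that match the document."""
    return sorted(name for name, matches in stored_queries.items() if matches(doc))


new_doc = {"tags": ["python", "django"], "view_count": 12, "answer_count": 0}
print(percolate(new_doc))  # ['python-alert', 'unanswered']
```

Alerts and classification follow directly: each matching query name is either a subscription to notify or a label to attach to the document.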

  41. Suggesters
    terms, phrase

    "Did you mean?"

    context aware

    completion

    as-you-type FAST!

    custom score
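
Both suggester flavours can be imitated with the standard library to show what they return. This is only a behavioural sketch over a made-up vocabulary; Elasticsearch's own suggesters use different, far more efficient data structures.

```python
import bisect
import difflib

vocabulary = sorted(["django", "flask", "jazz", "pyramid", "python"])


def did_you_mean(word):
    """Suggest the closest known term for a possibly misspelled word."""
    return difflib.get_close_matches(word, vocabulary, n=1)


def complete(prefix, limit=5):
    """Return up to `limit` terms starting with `prefix`, in sorted order."""
    start = bisect.bisect_left(vocabulary, prefix)
    matches = []
    for term in vocabulary[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
        if len(matches) == limit:
            break
    return matches


print(did_you_mean("pyhton"))
print(complete("py"))
```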

  42. Distributed model
    Cluster

    Collection of Nodes

    Index

    Collection of Shards

    Shard

    Unit of scale

    Distributed across cluster

    Primary and replica
    [Diagram: node 1, node 2 and node 3, each holding numbered shards of the "orders" and "products" indices]
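
A document ends up on a shard chosen by hashing its routing value (by default its id) modulo the number of primary shards. The sketch below shows that principle with CRC32 as a stand-in hash; Elasticsearch uses its own hash function, and the shard count of 4 and the document ids are assumptions for illustration.

```python
import zlib

NUMBER_OF_PRIMARY_SHARDS = 4  # hypothetical setting for an index


def shard_for(routing_key, shard_count=NUMBER_OF_PRIMARY_SHARDS):
    """Pick a shard by hashing the routing key (CRC32 is a stand-in hash)."""
    return zlib.crc32(routing_key.encode("utf-8")) % shard_count


for doc_id in ("order-1", "order-2", "order-3"):
    print(doc_id, "-> shard", shard_for(doc_id))
```

Because the result depends on the shard count, the number of primary shards is fixed when an index is created, while replicas can be added or removed later.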

  43. Honza Král

    @honzakral
    Thanks!
