Explore your data with Elasticsearch

Honza Král @honzakral

Elasticsearch

Distributed Search Engine
  Open Source
  Document-based
  Based on Lucene
  JSON over HTTP

Document based
  JSON
  Dynamic Schema
  Some Relationships
    Nested
    Parent/Child

StackOverflow Question

{
  "id": 7635,
  "accepted_answer_id": 7641,
  "answer_count": 9,
  "title": "Are you able to close your eyes and focus/think just on your code?",
  "body": "How do I ......?",
  "comment_count": 2,
  "comments": [
    {
      "creation_date": "2010-09-27T19:31:27.200",
      "id": 9372,
      "owner": { "display_name": "sange", "id": 3092 },
      "post_id": 7635,
      "text": "I sometimes close my eyes or stare at something ....."
    },
    {......}
  ],
  "favorite_count": 2,
  "last_activity_date": "2010-09-28T00:28:08.393",
  "owner": { "display_name": "flow", "id": 3761 },
  "rating": 6,
  "tags": [ "focus", "concentration" ],
  "view_count": 368,
  "creation_date": "2010-09-27T19:16:57.757",
  "closed_date": "2011-11-13T12:12:05.937"
}

Search

Full Text (unstructured)
  in or across fields
  phrase, fuzzy, ...
  scan api for data extraction
  relies on analysis

Filtering (structured)
  exact matches, ranges, geo, ...
  fast
  cacheable as bitsets
  core filters are cached, not compound filters (bool/and/or)

Under the Hood

Bible concordance
  A simple form lists Biblical words alphabetically, with indications to enable the inquirer to find the passages of the Bible where the words occur. The first concordance, completed in 1230, was undertaken under the guidance of Hugo de Saint-Cher (Hugo de Sancto Charo), assisted by fellow Dominicans.

Inverted Index

Building an inverted index
  "Django is a high-level Python Web framework that encourages rapid development and clean, pragmatic design."
  django high level python web framework encourag rapid develop clean pragmat design fast
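The analysis step on this slide can be sketched roughly as follows. The stop word list and the suffix rules are assumptions chosen only to cover this one sentence; `toy_stem` and `analyze` are hypothetical helpers, not the analyzer Lucene actually ships.

```python
import re

# Assumed stop word list and suffix rules, chosen only to cover this sentence.
STOP_WORDS = {"is", "a", "that", "and"}
SUFFIXES = ("ment", "es", "ic")


def toy_stem(token):
    """Strip the first matching suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if token.endswith(suffix):
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, split on non-letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


text = ("Django is a high-level Python Web framework that encourages "
        "rapid development and clean, pragmatic design.")
print(analyze(text))
```

The same function would have to run on the query text at search time, so that a search for "development" is also reduced to "develop" before the index lookup.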

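Inverted index
  python  ->  file_1.txt, file_2.txt, file_3.txt
  web     ->  file_2.txt, file_3.txt
  [Diagram residue: remaining terms django, flask, jazz with posting lists (file_2.txt, file_4.txt), (file_3.txt), (file_4.txt); term-to-list pairing not recoverable]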

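search(python AND django)
  python  ->  file_1.txt, file_2.txt, file_3.txt
  web     ->  file_2.txt, file_3.txt
  [Diagram residue: remaining terms django, flask, jazz with posting lists (file_2.txt, file_4.txt), (file_3.txt), (file_4.txt); term-to-list pairing not recoverable]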

Phrase search
  python  ->  file_1.txt (4), file_2.txt (1, 3), file_3.txt (11, 42)
  web     ->  file_2.txt (2), file_3.txt (10)

search("python web")
  python  ->  file_1.txt (4), file_2.txt (1, 13), file_3.txt (11, 42)
  web     ->  file_2.txt (2), file_3.txt (10)
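A small sketch of how a phrase query could use these positions, assuming the numbers in parentheses are token offsets within each file. `positions` and `phrase_search` are hypothetical names; the data is copied from this slide.

```python
# Positional postings from the slide: term -> {file: positions of the term}.
positions = {
    "python": {"file_1.txt": [4], "file_2.txt": [1, 13], "file_3.txt": [11, 42]},
    "web": {"file_2.txt": [2], "file_3.txt": [10]},
}


def phrase_search(first, second):
    """Return files where `second` occurs directly after `first`."""
    hits = []
    for doc, first_positions in positions[first].items():
        second_positions = set(positions[second].get(doc, []))
        # The phrase matches if some occurrence of `first` at position p
        # is followed by `second` at position p + 1.
        if any(p + 1 in second_positions for p in first_positions):
            hits.append(doc)
    return hits


print(phrase_search("python", "web"))
```

With this data only file_2.txt qualifies: file_3.txt contains both words, but "web" (10) comes right before "python" (11), not after it.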

Merging sorted lists.
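A boolean AND over two terms then comes down to intersecting two sorted posting lists. The sketch below uses the python and web lists from the slides; `intersect` is a hypothetical helper, not a library call.

```python
def intersect(a, b):
    """AND two sorted posting lists by walking both in a single pass."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a is behind, advance it
        else:
            j += 1  # b is behind, advance it
    return result


# Posting lists for "python" and "web" as they appear in the slides.
index = {
    "python": ["file_1.txt", "file_2.txt", "file_3.txt"],
    "web": ["file_2.txt", "file_3.txt"],
}
print(intersect(index["python"], index["web"]))
```

Because each list is already sorted, the walk never looks at an entry twice, which is what makes the operation easy to split across shards and merge again.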

Flexible Easily distributable

Aggregations

Metrics in Buckets
  Buckets
    split documents into groups
    can be nested
  Metrics
    calculated over documents in given bucket

Buckets
  terms: bucket per field value - "category"
  significant terms: terms specific for this bucket - "uncommonly common"
  range: per range - "age"
  geo_range/geohash_grid: distance ranges
  (date_)histogram: buckets per time interval - "daily"
  ...

Metrics
  count/sum/avg/min/max/...
  (extended) stats: including std deviation, sum of squares etc
  top_hits
  cardinality
  percentiles
  ...
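To make the bucket/metric split concrete, here is a rough pure-Python sketch of a terms bucket with a stats metric inside it. The documents and the helper `terms_with_stats` are hypothetical; this only mimics the idea and is not how Elasticsearch computes aggregations.

```python
from collections import defaultdict

# Hypothetical documents, loosely modelled on a grades data set.
docs = [
    {"state": "NY", "grade": 80},
    {"state": "NY", "grade": 90},
    {"state": "CA", "grade": 70},
]


def terms_with_stats(docs, bucket_field, metric_field):
    """Group docs by a field value and keep running stats per bucket."""
    buckets = defaultdict(lambda: {"count": 0, "sum": 0, "min": None, "max": None})
    for doc in docs:  # single pass over the documents
        value = doc[metric_field]
        stats = buckets[doc[bucket_field]]
        stats["count"] += 1
        stats["sum"] += value
        stats["min"] = value if stats["min"] is None else min(stats["min"], value)
        stats["max"] = value if stats["max"] is None else max(stats["max"], value)
    # Derive the average once all documents have been seen.
    return {
        key: dict(stats, avg=stats["sum"] / stats["count"])
        for key, stats in buckets.items()
    }


result = terms_with_stats(docs, "state", "grade")
print(result["NY"])
```

Nesting another bucket would mean keying the dictionary on a tuple such as (state, age_group) instead of a single value.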

Mix and Match

Example

"aggs" : {
  "states" : {
    "terms" : { "field" : "state" },
    "aggs" : {
      "age_groups" : {
        "histogram" : { "field" : "age", "interval" : 5 },
        "aggs" : {
          "grades" : { "stats" : { "field" : "grade" } },
          "gender" : {
            "terms" : {
              "field" : "male",
              "script" : "_value == 'T' ? 'M' : 'F'"
            },
            "aggs" : {
              "grades" : { "stats" : { "field" : "grade" } }
            }...

  Analyze the grades per state
  Analyze per age_group
  Stats per state & age_group
  Stats per state, age_group & gender

Example - Python DSL

    from elasticsearch_dsl import Search

    s = Search()
    s.aggs.bucket('states', 'terms', field='state') \
        .bucket('age_groups', 'histogram', field='age', interval=5) \
        .metric('grades', 'stats', field='grade')
    s.aggs['states']['age_groups'] \
        .bucket('gender', 'terms', field='gender') \
        .metric('grades', 'stats', field='grade')

  Analyze the grades per state
  Analyze per age_group
  Stats per state & age_group
  Stats per state, age_group & gender

in near real-time Calculated in one pass

Putting it all together Examples

Faceted Navigation

Faceted Search - Python

    from elasticsearch_dsl import *

    class LibrarySearch(FacetedSearch):
        doc_types = [Book, Magazine]
        index = 'library'
        fields = ['tags', 'title', 'description', 'author.*']
        facets = {
            'tags': TermsFacet(field='tags'),
            'years': DateHistogramFacet(
                field='published_date',
                interval='year'
            )
        }

(more @ 16:45)

Log Analysis

Kibana (+logstash data)

Recommendations

Example: recommendations
  [Diagram: user A and user B connected by "artist likes" edges to Artist A, Artist B, Artist C and Artist D]
  Users represented as documents
  Artists represented as terms

Simple recommendation

    s = Search()
    # get users that like the same artists
    s = s.query('terms', artists=user_likes)
    # get the most popular I don't know yet
    s.aggs.bucket('popular', 'terms', field='artists', exclude=user_likes)

  Popular != Relevant

Better recommendation

    s = Search()
    # get users that like the same artists
    s = s.query('terms', artists=user_likes)
    # get the artists that are specific
    s.aggs.bucket('significant', 'significant_terms', field='artists', exclude=user_likes)

  Use the relevancy!

Significant terms
  Use the term stats
  Compare to background
  Also as nested aggregation
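The "compare to background" idea can be sketched with a deliberately simple score: the share of a term among matching documents divided by its share among all documents. This ratio is an assumption made for illustration and not necessarily the heuristic Elasticsearch uses; the user data, including the artist "Pop Star", is made up.

```python
from collections import Counter

# Hypothetical data: each user is a "document", each liked artist a "term".
all_users = [
    ["Artist A", "Pop Star"],
    ["Artist A", "Artist B", "Pop Star"],
    ["Artist A", "Artist B", "Pop Star"],
    ["Artist C", "Pop Star"],
    ["Artist C", "Pop Star"],
    ["Artist D", "Pop Star"],
]


def significant_terms(foreground, background, exclude=()):
    """Rank terms by foreground share divided by background share.

    Assumes every foreground document is also part of the background,
    so the background count of a foreground term is never zero.
    """
    fg = Counter(t for doc in foreground for t in set(doc))
    bg = Counter(t for doc in background for t in set(doc))
    scores = {}
    for term, fg_count in fg.items():
        if term in exclude:
            continue
        fg_share = fg_count / len(foreground)
        bg_share = bg[term] / len(background)
        scores[term] = fg_share / bg_share
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


user_likes = ["Artist A"]
# Foreground set: users that share at least one liked artist with our user.
similar = [u for u in all_users if set(u) & set(user_likes)]
ranked = significant_terms(similar, all_users, exclude=user_likes)
print(ranked)
```

A plain popularity count would put "Pop Star" first because everyone likes it; the ratio ranks "Artist B" higher because it is common among similar users and rare overall.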

Super-connected nodes in graphs
  We just figured out the way to surf only the meaningful connections in a graph!
  [Diagram: Concept A, Concept B and Concept C connected by edges labelled useful and useless]

Extras

Percolator
  Reversed search
  "Which queries match this document?"
  Classification
  Language detection
  Location
  Alerts
  Stored search
  Live search
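Reversed search can be pictured as storing the queries and running each incoming document against them. The sketch below assumes queries are just sets of required terms; `stored_queries` and `percolate` are hypothetical names and do not reflect the percolator API.

```python
# Hypothetical stored queries: name -> terms that must all appear in a document.
stored_queries = {
    "python-jobs": {"python", "job"},
    "django-news": {"django", "release"},
    "any-python": {"python"},
}


def percolate(document_text):
    """Return the names of stored queries that match the given document."""
    tokens = set(document_text.lower().split())
    return sorted(
        name for name, required in stored_queries.items() if required <= tokens
    )


print(percolate("New Python job posted in Prague"))
```

Each returned name could then trigger an alert or assign a class label to the document.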

Suggesters
  terms, phrase
  "Did you mean?"
  context aware
  completion
  as-you-type
  FAST!
  custom score
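As-you-type completion with a custom score might be approximated by a prefix lookup over sorted inputs, ordered by weight. This is only an illustration with made-up suggestions and weights; the real completion suggester relies on a different in-memory structure, and `complete` is a hypothetical helper.

```python
import bisect

# Hypothetical suggestion inputs with weights (the "custom score").
SUGGESTIONS = {"django": 10, "django rest framework": 7, "flask": 5, "jazz": 1}
SORTED_KEYS = sorted(SUGGESTIONS)


def complete(prefix, size=3):
    """Return up to `size` suggestions starting with `prefix`, best weight first."""
    # Binary search for the first key that could start with the prefix.
    start = bisect.bisect_left(SORTED_KEYS, prefix)
    matches = []
    for key in SORTED_KEYS[start:]:
        if not key.startswith(prefix):
            break  # sorted order: no later key can match either
        matches.append(key)
    return sorted(matches, key=lambda k: -SUGGESTIONS[k])[:size]


print(complete("dj"))
```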

Distributed model
  Cluster
    Collection of Nodes
  Index
    Collection of Shards
  Shard
    Unit of scale
    Distributed across cluster
    Primary and replica
  [Diagram: shards of the orders and products indices spread across node 1, node 2 and node 3]
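One way to picture how a document ends up on a particular shard is hash-based routing: hash the document id and take it modulo the number of primary shards. The sketch assumes four primary shards and uses crc32 purely as a stand-in hash; `shard_for` is a hypothetical helper.

```python
import zlib

NUM_PRIMARY_SHARDS = 4  # assumed; typically fixed when the index is created


def shard_for(doc_id, num_shards=NUM_PRIMARY_SHARDS):
    """Pick a shard by hashing the document id (crc32 is a stand-in hash)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


for doc_id in ["7635", "7641", "9372"]:
    print(doc_id, "-> shard", shard_for(doc_id))
```

The same id always maps to the same shard, so a lookup by id only needs to contact one shard, whichever node currently holds it.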

Honza Král @honzakral Thanks!
