Slide 1

Slide 1 text

Honza Král @honzakral Explore your data with Elasticsearch

Slide 2

Slide 2 text

Elasticsearch

Slide 3

Slide 3 text

Distributed Search Engine Open Source
 
 Document-based
 
 Based on Lucene 
 JSON over HTTP

Slide 4

Slide 4 text

Document based JSON
 Dynamic Schema
 Some Relationships Nested Parent/Child

Slide 5

Slide 5 text

StackOverflow Question

{
  "id": 7635,
  "accepted_answer_id": 7641,
  "answer_count": 9,
  "title": "Are you able to close your eyes and focus/think just on your code?",
  "body": "How do I ......?",
  "comment_count": 2,
  "comments": [{
    "creation_date": "2010-09-27T19:31:27.200",
    "id": 9372,
    "owner": { "display_name": "sange", "id": 3092 },
    "post_id": 7635,
    "text": "I sometimes close my eyes or stare at something ....."
  }, {......}],
  "favorite_count": 2,
  "last_activity_date": "2010-09-28T00:28:08.393",
  "owner": { "display_name": "flow", "id": 3761 },
  "rating": 6,
  "tags": [ "focus", "concentration" ],
  "view_count": 368,
  "creation_date": "2010-09-27T19:16:57.757",
  "closed_date": "2011-11-13T12:12:05.937"
}

Slide 6

Slide 6 text

Search

Slide 7

Slide 7 text

Full Text (unstructured)
 in or across fields
 phrase, fuzzy, ...
 scan api for data extraction
 relies on analysis

Slide 8

Slide 8 text

Filtering (structured)
 exact matches, ranges, geo, ...
 fast
 cacheable as bitsets
 core filters are cached, not compound filters (bool/and/or)
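To make "cacheable as bitsets" concrete, here is a minimal sketch in plain Python, not Elasticsearch's actual cache. Each filter is evaluated once into an integer whose bit i is set when document i matches; combining filters is then a bitwise operation. The documents and the helper names `make_bitset` and `doc_ids` are made up for illustration.

```python
def make_bitset(docs, predicate):
    """Evaluate a filter once and remember the result as one bit per document."""
    bits = 0
    for doc_id, doc in enumerate(docs):
        if predicate(doc):
            bits |= 1 << doc_id
    return bits


def doc_ids(bits):
    """Turn a bitset back into a sorted list of matching document ids."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


# Hypothetical documents; positions in the list act as document ids.
docs = [
    {"tags": ["focus"], "rating": 3},
    {"tags": ["python"], "rating": 2},
    {"tags": ["focus", "python"], "rating": 9},
]

# Two simple filters, each computed (and cacheable) independently.
has_focus = make_bitset(docs, lambda d: "focus" in d["tags"])
high_rating = make_bitset(docs, lambda d: d["rating"] >= 5)

# Compound filters are cheap bitwise combinations of the cached bitsets.
both = doc_ids(has_focus & high_rating)
either = doc_ids(has_focus | high_rating)
```

The compound result is not stored here, only the two simple bitsets, which mirrors the slide's point that core filters are the ones worth caching.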

Slide 9

Slide 9 text

Under the Hood

Slide 10

Slide 10 text

Bible concordance
A simple form lists Biblical words alphabetically, with indications to enable the inquirer to find the passages of the Bible where the words occur. The first concordance, completed in 1230, was undertaken under the guidance of Hugo de Saint-Cher (Hugo de Sancto Charo), assisted by fellow Dominicans.

Slide 11

Slide 11 text

Inverted Index

Slide 12

Slide 12 text

Building an inverted index
 "Django is a high-level Python Web framework that encourages rapid development and clean, pragmatic design."
 django high level python web framework encourag rapid develop clean pragmat design fast
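The analysis step on this slide can be approximated with a few lines of Python. This is a simplified sketch, not Lucene's analyzer: it lowercases, splits on non-alphanumeric characters and drops stop words, but it does no stemming, so it yields "encourages" where the slide shows "encourag". The stop-word list is an assumption chosen for this sentence.

```python
import re

# Assumed stop-word list, for illustration only.
STOP_WORDS = {"a", "and", "is", "that", "the"}


def analyze(text):
    """Lowercase, split on non-alphanumeric characters, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


sentence = (
    "Django is a high-level Python Web framework that encourages "
    "rapid development and clean, pragmatic design."
)
tokens = analyze(sentence)
print(" ".join(tokens))
```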

Slide 13

Slide 13 text

Inverted index python file_1.txt file_2.txt file_3.txt web file_2.txt file_3.txt file_2.txt file_4.txt django file_3.txt flask jazz file_4.txt

Slide 14

Slide 14 text

search(python AND django) python file_1.txt file_2.txt file_3.txt file_2.txt file_4.txt django file_3.txt flask jazz file_4.txt web file_2.txt file_3.txt
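A minimal sketch of the AND search above, in plain Python: the index maps each term to the set of files containing it, and the query is the intersection of the posting sets. The file contents are invented so that the postings resemble the slide; `build_index` and `search_and` are hypothetical helper names.

```python
from collections import defaultdict


def build_index(docs):
    """Map every term to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for term in text.lower().split():
            index[term].add(name)
    return index


def search_and(index, *terms):
    """Return documents containing all terms (intersection of postings)."""
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Invented file contents, chosen to reproduce postings like those on the slide.
docs = {
    "file_1.txt": "python",
    "file_2.txt": "python web",
    "file_3.txt": "python web django",
    "file_4.txt": "jazz",
}

index = build_index(docs)
result = search_and(index, "python", "django")
```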

Slide 15

Slide 15 text

Phrase search
 python  file_1.txt (4)  file_2.txt (1, 3)  file_3.txt (11, 42)
 web     file_2.txt (2)  file_3.txt (10)

Slide 16

Slide 16 text

search("python web")
 python  file_1.txt (4)  file_2.txt (1, 13)  file_3.txt (11, 42)
 web     file_2.txt (2)  file_3.txt (10)

Slide 17

Slide 17 text

Merging sorted lists.
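Because posting lists are kept sorted, combining two of them is a single linear walk. A minimal sketch of the intersection (the AND case), assuming documents are identified by integer ids:

```python
def intersect_sorted(a, b):
    """Intersect two sorted lists of document ids in one pass."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a is behind, advance it
        else:
            j += 1  # b is behind, advance it
    return out


# Assumed integer document ids standing in for file_1.txt .. file_3.txt.
python_postings = [1, 2, 3]
web_postings = [2, 3]
both = intersect_sorted(python_postings, web_postings)
```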

Slide 18

Slide 18 text

Flexible Easily distributable

Slide 19

Slide 19 text

Aggregations

Slide 20

Slide 20 text

Metrics in Buckets
 Buckets
  split documents into groups
  can be nested
 Metrics
  calculated over documents in given bucket

Slide 21

Slide 21 text

Buckets
 terms: bucket per field value - "category"
 significant terms: terms specific for this bucket - "uncommonly common"
 range: per range - "age"
 geo_range/geohash_grid: distance ranges
 (date_)histogram: buckets per time interval - "daily"
 ...

Slide 22

Slide 22 text

Metrics
 count/sum/avg/min/max/...
 (extended) stats including std deviation, sum of squares etc
 top_hits
 cardinality
 percentiles
 ...
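The bucket/metric split can be illustrated without a cluster. The sketch below is a plain-Python toy, not how Elasticsearch computes aggregations: `terms_buckets` groups documents by a field value and `stats` computes a metric over each group. The student records and both function names are invented for the example.

```python
from collections import defaultdict


def terms_buckets(docs, field):
    """Bucket: split documents into groups, one per distinct field value."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[field]].append(doc)
    return dict(buckets)


def stats(docs, field):
    """Metric: calculated over the documents in a given bucket."""
    values = [doc[field] for doc in docs]
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
        "sum": sum(values),
    }


# Invented records for illustration.
students = [
    {"state": "NY", "grade": 90},
    {"state": "NY", "grade": 70},
    {"state": "CA", "grade": 85},
]

per_state = {
    state: stats(group, "grade")
    for state, group in terms_buckets(students, "state").items()
}
```

Nesting works the same way: calling `terms_buckets` again on each group before computing `stats` gives a metric per pair of bucket values.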

Slide 23

Slide 23 text

Mix and Match

Slide 24

Slide 24 text

Example

"aggs" : {
  "states" : {
    "terms" : { "field" : "state" },
    "aggs" : {
      "age_groups" : {
        "histogram" : { "field" : "age", "interval" : 5 },
        "aggs" : {
          "grades" : { "stats" : { "field" : "grade" } },
          "gender" : {
            "terms" : {
              "field" : "male",
              "script" : "_value == 'T' ? 'M' : 'F'"
            },
            "aggs" : {
              "grades" : { "stats" : { "field" : "grade" } }
            }...

 Analyze the grades per state
 Analyze per age_group
 Stats per state & age_group
 Stats per state, age_group & gender

Slide 25

Slide 25 text

Example - Python DSL

from elasticsearch_dsl import Search

s = Search()
s.aggs.bucket('states', 'terms', field='state') \
    .bucket('age_groups', 'histogram', field='age', interval=5) \
    .metric('grades', 'stats', field='grade')

s.aggs['states']['age_groups'] \
    .bucket('gender', 'terms', field='gender') \
    .metric('grades', 'stats', field='grade')

 Analyze the grades per state
 Analyze per age_group
 Stats per state & age_group
 Stats per state, age_group & gender

Slide 26

Slide 26 text

Calculated in one pass
 in near real-time

Slide 27

Slide 27 text

Putting it all together Examples

Slide 28

Slide 28 text

Faceted Navigation

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

Faceted Search - Python

from elasticsearch_dsl import *

class LibrarySearch(FacetedSearch):
    doc_types = [Book, Magazine]
    index = 'library'
    fields = ['tags', 'title', 'description', 'author.*']
    facets = {
        'tags': TermsFacet(field='tags'),
        'years': DateHistogramFacet(
            field='published_date',
            interval='year'
        )
    }

Slide 31

Slide 31 text

Log Analysis (more @ 16:45)

Slide 32

Slide 32 text

Kibana (+logstash data)

Slide 33

Slide 33 text

Recommendations

Slide 34

Slide 34 text

Example: recommendations
 [Diagram: user A and user B, each with an "artist likes" list drawn from Artist A, Artist B, Artist C and Artist D]
 Users represented as documents
 Artists represented as terms

Slide 35

Slide 35 text

Simple recommendation

s = Search()
# get users that like the same artists
s = s.query('terms', artists=user_likes)
# get the most popular artists I don't know yet
s.aggs.bucket('popular', 'terms', field='artists', exclude=user_likes)

Popular != Relevant

Slide 36

Slide 36 text

Better recommendation

s = Search()
# get users that like the same artists
s = s.query('terms', artists=user_likes)
# get the artists that are specific
s.aggs.bucket('significant', 'significant_terms', field='artists', exclude=user_likes)

Use the relevancy!

Slide 37

Slide 37 text

Significant terms
 Use the term stats
 Compare to background
 Also as nested aggregation
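The "compare to background" idea can be sketched in plain Python. This is a simplified stand-in, not the scoring heuristic Elasticsearch uses: a term counts as significant when its share among matching users (foreground) is higher than its share among all users (background), ranked by the ratio of the two. The user data and the function name `significant_terms` are invented for illustration.

```python
from collections import Counter


def significant_terms(foreground, background, exclude=()):
    """Rank terms that are more frequent in the foreground than in the background."""
    fg_counts = Counter(t for doc in foreground for t in set(doc))
    bg_counts = Counter(t for doc in background for t in set(doc))
    scores = {}
    for term, fg in fg_counts.items():
        if term in exclude:
            continue
        fg_rate = fg / len(foreground)
        bg_rate = bg_counts[term] / len(background)
        # Keep only terms over-represented in the foreground.
        if bg_rate and fg_rate > bg_rate:
            scores[term] = fg_rate / bg_rate
    return sorted(scores, key=scores.get, reverse=True)


# Invented data: one list of liked artists per user.
all_users = [
    ["metallica", "iron maiden", "coldplay"],
    ["metallica", "slayer", "coldplay"],
    ["abba", "coldplay"],
    ["abba", "coldplay"],
    ["madonna", "coldplay", "iron maiden"],
    ["metallica", "slayer", "coldplay"],
]
user_likes = ["metallica"]

# Foreground: users that like the same artists.
similar_users = [u for u in all_users if set(u) & set(user_likes)]

# Popularity only: the most common artist among similar users.
popular = Counter(
    a for u in similar_users for a in u if a not in user_likes
).most_common(1)[0][0]

# Significance: artists specific to similar users.
significant = significant_terms(similar_users, all_users, exclude=user_likes)
```

Here the popular pick is an artist everybody likes, while the significant pick is the one that sets similar users apart.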

Slide 38

Slide 38 text

Super-connected nodes in graphs
 We just figured out the way to surf only the meaningful connections in a graph!
 [Diagram: Concept A, Concept B and Concept C, with connections labelled useful and useless]

Slide 39

Slide 39 text

Extras

Slide 40

Slide 40 text

Percolator
 Reversed search: "Which queries match this document?"
 Classification
 Language detection
 Location
 Alerts
 Stored search
 Live search
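Reversed search means the queries are stored and the document is what arrives. A toy sketch in plain Python, not the percolator API: each stored query is a list of required terms, and `percolate` (a hypothetical name) returns the names of the queries a new document satisfies.

```python
def percolate(stored_queries, document_text):
    """Return the names of stored queries whose terms all occur in the document."""
    tokens = set(document_text.lower().split())
    return sorted(
        name for name, terms in stored_queries.items() if set(terms) <= tokens
    )


# Invented stored searches, e.g. one per alert a user registered.
stored_queries = {
    "python-alerts": ["python"],
    "django-jobs": ["django", "jobs"],
    "jazz-news": ["jazz"],
}

matches = percolate(stored_queries, "New Django jobs for Python developers")
```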

Slide 41

Slide 41 text

Suggesters
 terms, phrase
 "Did you mean?"
 context aware
 completion
 as-you-type
 FAST!
 custom score

Slide 42

Slide 42 text

Distributed model
 Cluster
  Collection of Nodes
 Index
  Collection of Shards
 Shard
  Unit of scale
  Distributed across cluster
  Primary and replica
 [Diagram: numbered shards of the orders and products indices spread across node 1, node 2 and node 3]
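One way to picture how a document ends up in a shard is to hash a routing value and take it modulo the number of primary shards. The sketch below assumes that scheme and uses `zlib.crc32` purely as a stand-in hash; Elasticsearch uses its own hash function, and `pick_shard` is a hypothetical helper.

```python
import zlib


def pick_shard(doc_id, number_of_primary_shards):
    """Map a document id to a primary shard number (stand-in hash: crc32)."""
    digest = zlib.crc32(str(doc_id).encode("utf-8"))
    return digest % number_of_primary_shards


# Route 1000 made-up document ids into an index with 4 primary shards.
shards = [pick_shard(doc_id, 4) for doc_id in range(1000)]
```

The same id always lands on the same shard, which is why the number of primary shards in such a scheme cannot change without moving documents.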

Slide 43

Slide 43 text

Honza Král @honzakral Thanks!