Slide 1

Slide 1 text

Elasticsearch - Speed is key
Alexander Reelsen
@spinscale

Slide 2

Slide 2 text

www.elastic.co
Agenda
• Introduction
• Elasticsearch
• Speed by Example
• Search
• Aggregations
• Operating System
• Distributed aspects

Slide 3

Slide 3 text

About
• Me
  • Joined in March 2013
  • Working on Elasticsearch & Shield
  • Interested in all things scale & search
• Elastic
  • Founded in 2012
  • Behind: Elasticsearch, Logstash, Kibana, Marvel, Shield, ES for Hadoop, Elasticsearch clients
  • Support subscriptions
  • Public & private trainings

Slide 5

Slide 5 text

Introduction
Elasticsearch - High level overview

Slide 6

Slide 6 text

Elasticsearch is...
an open source, distributed, scalable, highly available, document-oriented, RESTful, full text search engine with real-time search and analytics capabilities

Slide 7

Slide 7 text

Elasticsearch is...
an open source, distributed, scalable, highly available, document-oriented, RESTful, full text search engine with real-time search and analytics capabilities

Open source: Apache 2.0 License
https://www.apache.org/licenses/LICENSE-2.0

Slide 11

Slide 11 text

Elasticsearch is...
an open source, distributed, scalable, highly available, document-oriented, RESTful, full text search engine with real-time search and analytics capabilities

Document-oriented: documents are JSON
{
  "name" : "Craft",
  "geo" : {
    "city" : "Budapest",
    "lat" : 47.49,
    "lon" : 19.04
  }
}
Source: http://json.org/

Slide 12

Slide 12 text

Elasticsearch is...
an open source, distributed, scalable, highly available, document-oriented, RESTful, full text search engine with real-time search and analytics capabilities

[Image: HTTP logo]
Source: https://httpwg.github.io/asset/http.svg

Slide 15

Slide 15 text

Getting up and running... is easy
# wget https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.5.1.zip
# unzip elasticsearch-1.5.1.zip
# cd elasticsearch-1.5.1
# ./bin/elasticsearch
# curl http://localhost:9200

Slide 16

Slide 16 text

Scaling

Slide 17

Slide 17 text

Cluster: A collection of nodes
[Diagram: a cluster containing one node]

Slide 18

Slide 18 text

Cluster: A collection of nodes
[Diagram: a cluster containing two nodes]

Slide 19

Slide 19 text

Cluster: A collection of nodes
[Diagram: a cluster containing two nodes]

Slide 20

Slide 20 text

Cluster: A collection of nodes
[Diagram: a cluster containing four nodes]

Slide 21

Slide 21 text

Shards: Unit of scale
[Diagram: index "a" split into shards a0, a1, a2, a3]
# curl -X PUT http://localhost:9200/a -d '
{ "index.number_of_shards" : 4 }'

Slide 23

Slide 23 text

Shards: Unit of scale
[Diagram: index "a" split into shards a0, a1, a2, a3]
# curl -X PUT http://localhost:9200/a/_settings -d '
{ "index.number_of_replicas" : 1 }'

Slide 24

Slide 24 text

Replication
[Diagram: shards a0, a1, a2, a3 plus one replica copy of each: a0, a3, a2, a1]
# curl -X PUT http://localhost:9200/a/_settings -d '
{ "index.number_of_replicas" : 1 }'

Slide 25

Slide 25 text

Search

Slide 29

Slide 29 text

CRUD
PUT books/book/1
{
  "name" : "Elasticsearch - The definitive guide",
  "authors" : [ "Clinton Gormley", "Zachary Tong" ],
  "pages" : 722,
  "published_at" : "2015/01/31"
}

GET books/book/1

DELETE books/book/1

GET books/book/_search?q=elasticsearch

Slide 31

Slide 31 text

Searching
GET books/book/_search
{
  "query" : {
    "filtered" : {
      "query" : { "match" : { "name" : "elasticsearch" } },
      "filter" : { "range" : { "published_at" : { "gte" : "now-1y" } } }
    }
  }
}

Response:
{
  "took": 3,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 0.15342641,
    "hits": [
      {
        "_index": "books",
        "_type": "book",
        "_id": "1",
        "_score": 0.15342641,
        "_source": {
          "name": "Elasticsearch - The definitive guide",
          "authors": [ "Clinton Gormley", "Zachary Tong" ],
          "pages": 722,
          "category": "search",
          "published_at": "2015/01/31"
        }
      }
    ]
  }
}

Slide 33

Slide 33 text

Searching
GET books/book/_search
{
  "query" : {
    "filtered" : {
      "query" : { "match" : { "name" : "elasticsearch" } },
      "filter" : { "range" : { "published_at" : { "gte" : "now-1y" } } }
    }
  },
  "aggs" : {
    "category" : { "terms" : { "field" : "category" } }
  }
}

Response:
{
  "took": 3,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 0.15342641,
    "hits": [ ... ]
  },
  "aggregations": {
    "category": {
      "buckets": [
        { "key": "search", "doc_count": 1 },
        { ... }
      ]
    }
  }
}

Slide 34

Slide 34 text

Search
• Hits all relevant shards
• Searches for top-N results per shard
• Reduces to top-N total
• Gets top-N documents/data from relevant shards
• Returns data to requesting client
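The search phases listed on this slide can be sketched in a few lines of Python. This is only an illustration of the scatter/reduce/fetch idea, not how Elasticsearch implements it; the shard names, scores, and document ids below are made up.

```python
import heapq

# Hypothetical per-shard hits as (score, doc_id) pairs, in no particular order.
SHARDS = {
    "a0": [(0.9, "doc-3"), (0.2, "doc-7"), (0.5, "doc-1")],
    "a1": [(0.8, "doc-4"), (0.7, "doc-9")],
    "a2": [(0.1, "doc-2")],
}


def query_shard(hits, n):
    """Query phase on one shard: return only the local top-n (score, id) pairs."""
    return heapq.nlargest(n, hits)


def search(shards, n):
    """Scatter to all shards, then reduce the per-shard top-n to a global top-n."""
    per_shard = [query_shard(hits, n) for hits in shards.values()]
    merged = heapq.nlargest(n, (hit for top in per_shard for hit in top))
    # Fetch phase: only these n winners would be loaded from their shards.
    return [doc_id for _score, doc_id in merged]


print(search(SHARDS, 2))  # ['doc-3', 'doc-4']
```

Each shard only ever ships N small (score, id) pairs to the coordinating node; full documents are fetched for the final N winners only.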

Slide 35

Slide 35 text

Search on a single shard
• Lucene is doing the heavy lifting
• A single shard is a Lucene index
• Each field is its own inverted index and can be searched in

term     docid
clinton  1
gormley  1
tong     1
zachary  1
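As a rough illustration of the term-to-docid table on this slide, the following Python sketch builds an inverted index for one field. The `analyze` function is a deliberately naive stand-in (lowercase, split on whitespace) and is an assumption for this example; real Lucene analyzers do considerably more.

```python
from collections import defaultdict


def analyze(text):
    """Very rough stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def build_inverted_index(docs, field):
    """Map each term of one field to the sorted list of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        values = doc[field]
        if isinstance(values, str):
            values = [values]
        for value in values:
            for term in analyze(value):
                index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}


docs = {1: {"authors": ["Clinton Gormley", "Zachary Tong"]}}
authors_index = build_inverted_index(docs, "authors")
print(authors_index)
# {'clinton': [1], 'gormley': [1], 'tong': [1], 'zachary': [1]}
```

Looking up a term is then a single dictionary access, which is what makes term queries cheap.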

Slide 36

Slide 36 text

Example: Return original JSON in search response
• It is very hard to reconstruct the original data from the inverted index
• Solution: Just store the whole document in its own field and retrieve it when returning data to the client

_source:
{
  "name" : "Elasticsearch - The definitive guide",
  "authors" : [ "Clinton Gormley", "Zachary Tong" ],
  "pages" : 722,
  "published_at" : "2015/01/31"
}

Slide 37

Slide 37 text

Example: _all field
{
  "name" : "Elasticsearch - The definitive guide",
  "authors" : [ "Clinton Gormley", "Zachary Tong" ],
  "pages" : 722,
  "published_at" : "2015/01/31"
}

_all:
Elasticsearch - The definitive guide Clinton Gormley Zachary Tong 722 2015/01/31

Slide 38

Slide 38 text

Example: copy_to field (name & authors)
{
  "name" : "Elasticsearch - The definitive guide",
  "authors" : [ "Clinton Gormley", "Zachary Tong" ],
  "pages" : 722,
  "published_at" : "2015/01/31"
}

copy_to:
Elasticsearch - The definitive guide Clinton Gormley Zachary Tong
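The difference between the _all and copy_to examples can be mimicked with a small Python helper. This is a simplification: the hypothetical `combine_fields` function only concatenates values into one string, whereas Elasticsearch indexes the combined values as a searchable field.

```python
def combine_fields(doc, source_fields):
    """Concatenate the values of the chosen fields into one searchable string."""
    parts = []
    for field in source_fields:
        value = doc[field]
        values = value if isinstance(value, list) else [value]
        parts.extend(str(v) for v in values)
    return " ".join(parts)


book = {
    "name": "Elasticsearch - The definitive guide",
    "authors": ["Clinton Gormley", "Zachary Tong"],
    "pages": 722,
    "published_at": "2015/01/31",
}

# _all-like behaviour: every field of the document is included.
all_field = combine_fields(book, book.keys())
# copy_to-like behaviour: only the fields that were explicitly chosen.
name_and_authors = combine_fields(book, ["name", "authors"])

print(all_field)
print(name_and_authors)
```

The copy_to variant keeps the combined field smaller and more targeted, which is the trade-off the two slides contrast.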

Slide 39

Slide 39 text

Search: Using filters
• Filters do not contribute to score & can be cached using a BitSet
• range filter for a date/price
• term filter for a category
• geo filter for a bounding box

0 1 1 0 0 1   { "term" : { "category" : "search" } }
0 1 0 0 1 0   { "term" : { "category" : "reduced" } }
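A cached filter can be thought of as one bit per document, and combining filters then becomes bitwise arithmetic. The sketch below uses Python integers as bitsets. It assumes that bit position equals a zero-based doc id and that the two bit patterns from the slide belong to the "search" and "reduced" filters in that order; the cache keys are hypothetical.

```python
def to_bitset(bits):
    """Pack a list of 0/1 flags (list index = doc id) into a single int."""
    value = 0
    for doc_id, bit in enumerate(bits):
        if bit:
            value |= 1 << doc_id
    return value


def doc_ids(bitset):
    """Return the doc ids whose bit is set."""
    return [i for i in range(bitset.bit_length()) if (bitset >> i) & 1]


# Cached filter results, keyed by a made-up cache key per filter.
filter_cache = {
    "category:search": to_bitset([0, 1, 1, 0, 0, 1]),
    "category:reduced": to_bitset([0, 1, 0, 0, 1, 0]),
}

both = filter_cache["category:search"] & filter_cache["category:reduced"]
either = filter_cache["category:search"] | filter_cache["category:reduced"]

print(doc_ids(both))    # [1]
print(doc_ids(either))  # [1, 2, 4, 5]
```

Because no scoring is involved, the cached bits can be reused by any later query that applies the same filter.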

Slide 40

Slide 40 text

Filters: Missing fields
• Problem: How to search in an inverted index for non-existing fields (exists & missing filter)?
• Costly: Need to merge postings lists of all existing terms (expensive for high-cardinality fields!)
• Solution: Index document field names under _field_names
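The idea behind _field_names can be sketched as follows: if the names of the fields present in each document are themselves indexed as terms, an exists filter becomes a single term lookup. The function names and the second document are assumptions made for this illustration.

```python
from collections import defaultdict


def index_field_names(docs):
    """Build postings for a _field_names-style field: field name -> doc ids."""
    postings = defaultdict(set)
    for doc_id, doc in docs.items():
        for field, value in doc.items():
            if value is not None:
                postings[field].add(doc_id)
    return postings


def exists(postings, field):
    """exists filter: one term lookup instead of merging all of the field's terms."""
    return sorted(postings.get(field, set()))


def missing(postings, field, all_doc_ids):
    """missing filter: every document that is not in the exists result."""
    return sorted(set(all_doc_ids) - postings.get(field, set()))


docs = {
    1: {"name": "Elasticsearch - The definitive guide", "category": "search"},
    2: {"name": "Some other book"},  # hypothetical document without a category
}
postings = index_field_names(docs)

print(exists(postings, "category"))         # [1]
print(missing(postings, "category", docs))  # [2]
```

The cost of the lookup no longer depends on how many distinct values the field has.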

Slide 41

Slide 41 text

Aggregations

Slide 42

Slide 42 text

Aggregations
• Aggregations: Buckets & metrics
• Aggregations cannot make use of the inverted index
• Meet Fielddata: Uninverting the index
  • Inverted index: Maps term to document id
  • Fielddata: Maps document id to terms
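"Uninverting" can be illustrated with a short Python sketch that flips a term-to-docids mapping into a docid-to-terms mapping and then counts terms for a set of matching documents. The data and function names are hypothetical, and the real fielddata structures are far more compact than Python dictionaries.

```python
from collections import defaultdict


def uninvert(inverted_index):
    """Turn term -> doc ids into doc id -> terms (a fielddata-like structure)."""
    fielddata = defaultdict(list)
    for term in sorted(inverted_index):
        for doc_id in inverted_index[term]:
            fielddata[doc_id].append(term)
    return dict(fielddata)


def terms_aggregation(fielddata, matching_doc_ids):
    """Count terms across the matching documents, like a terms bucket aggregation."""
    counts = defaultdict(int)
    for doc_id in matching_doc_ids:
        for term in fielddata.get(doc_id, []):
            counts[term] += 1
    return dict(counts)


# Hypothetical inverted index for a "category" field.
inverted = {"search": [1, 2], "reduced": [2, 3]}
fielddata = uninvert(inverted)

print(fielddata[2])                          # ['reduced', 'search']
print(terms_aggregation(fielddata, [1, 2]))  # {'search': 2, 'reduced': 1}
```

An aggregation starts from the documents that matched the query, so it needs the docid-to-terms direction; the inverted index only answers the opposite question.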

Slide 43

Slide 43 text

Aggregations

Inverted Index
term     docid
clinton  1
gormley  1
tong     1
zachary  1

Fielddata
docid  term
1      Clinton Gormley, Zachary Tong

Slide 44

Slide 44 text

Aggregations

Inverted Index
term             docid
Clinton Gormley  1
Zachary Tong     1

Fielddata
docid  term
1      Clinton Gormley, Zachary Tong

Slide 45

Slide 45 text

Aggregations: Fielddata
• Fielddata is an in-memory data structure, lazily constructed
• Easy to run out of memory (OOM) with the wrong field or too many documents
• Solution:
  • circuit breaker
  • doc_values: index-time data structure, no heap, uses the file system cache, better compression

Slide 46

Slide 46 text

Aggregations: Probabilistic data structures
• Problem: Count distinct elements
• Naive: Load all data into a set, then check the size (distributed?)
• Solution: cardinality aggregation, which uses HyperLogLog++
  • configurable precision, which allows trading memory for accuracy
  • excellent accuracy on low-cardinality sets
  • fixed memory usage: no matter if there are tens or billions of unique values, memory usage only depends on configured precision
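To make the fixed-memory claim concrete, here is a minimal sketch of plain HyperLogLog in Python. It is not the HyperLogLog++ variant that the cardinality aggregation uses (no bias correction or sparse encoding), and the class name, the SHA-1 based hash, and the default precision are assumptions chosen for the example.

```python
import hashlib
import math


class HyperLogLog:
    """Plain HyperLogLog (not the ++ variant) with 2**p one-byte registers."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = bytearray(self.m)  # memory depends only on p

    def add(self, value):
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")   # 64-bit hash
        bucket = h >> (64 - self.p)             # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[bucket] = max(self.registers[bucket], rank)

    def merge(self, other):
        """Union of two sketches (e.g. from two shards): register-wise maximum."""
        for i, r in enumerate(other.registers):
            if r > self.registers[i]:
                self.registers[i] = r

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        estimate = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return round(estimate)


sketch = HyperLogLog()
for i in range(100_000):
    sketch.add(f"user-{i % 25_000}")  # 25,000 distinct values, each seen 4 times
print(sketch.count())  # an estimate near 25000, not an exact count
```

Two properties matter for a distributed system: the sketch never grows beyond its 2**p registers, and per-shard sketches can be merged without losing accuracy, which addresses the "(distributed?)" question raised by the naive approach.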

Slide 47

Slide 47 text

Aggregations: Probabilistic data structures
• Problem: Calculate percentiles
• Naive: Maintain a sorted list of all values
• Solution: percentiles aggregation, which uses T-Digest
  • extreme percentiles are more accurate
  • small sets can be up to 100% accurate
  • while values are added to a bucket, the algorithm trades accuracy for memory savings

Slide 48

Slide 48 text

Operating system & Hardware

Slide 49

Slide 49 text

Elasticsearch can easily max out...
• CPU: Indexing, searching, highlighting
• I/O: Indexing, searching, merging
• Memory: Aggregation, indices
• Network: Relocation, Snapshot & Restore

Slide 50

Slide 50 text

Hardware
• CPU: Threadpools are sized on number of cores
• Disk: SSD
• Memory: ∞
• Network: GbE or better

Slide 51

Slide 51 text

Operating system
• file system cache
• file handles
• memory locking: bootstrap.mlockall
• don't swap, no OOM killer

Slide 52

Slide 52 text

Distributed aspects

Slide 53

Slide 53 text

Fallacies of distributed computing
• The network is reliable
• Latency is zero
• Bandwidth is infinite
• The network is secure
• Topology doesn't change
• There is one administrator
• Transport cost is zero
• The network is homogeneous
by Peter Deutsch
https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing

Slide 54

Slide 54 text

Wrapup

Slide 55

Slide 55 text

Summary
• Speed is key!
• Search is a tradeoff: Query time vs. index time
• Benchmark your use-case
• http://benchmarks.elasticsearch.org/

Slide 56

Slide 56 text

Elasticsearch 2.x
• Automatic I/O throttling
• Clusterstate incremental updates
• Faster recovery
• Aggregations 2.0
• Merge queries and filters
• Reindex API
• Changes API
• Expression scripting engine

Slide 57

Slide 57 text

Lucene 5.x
• Speed improvements in queries (must_not, sloppy phrase)
• Automated caching
• BitSet compression vastly improved (roaring bitsets)
• Index compression (on disk + memory)
• Indexing performance (adaptive merge throttling, SSD detection)
• Index safety: atomic commits, segment commit identifiers, verify integrity at merge
• ...

Slide 58

Slide 58 text

Thanks for listening! Questions?
Alexander Reelsen
@spinscale
We’re hiring: https://www.elastic.co/about/careers
We’re helping: https://www.elastic.co/subscriptions

Slide 59

Slide 59 text

References
http://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html
http://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-all-field.html
http://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-field-names-field.html
http://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-core-types.html#copy-to
http://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html
http://www.elastic.co/guide/en/elasticsearch/guide/current/cardinality.html
http://www.elastic.co/guide/en/elasticsearch/guide/current/percentiles.html
http://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-percentile-aggregation.html
http://www.elastic.co/elasticon/2015/sf/elasticsearch-architecture-amusing-algorithms-and-details-on-data-structures/
http://speakerdeck.com/elastic/all-about-aggregations
http://www.elastic.co/elasticon/2015/sf/updates-from-lucene-land
http://speakerdeck.com/elastic/resiliency-in-elasticsearch-and-lucene
http://www.elastic.co/elasticon/2015/sf/level-up-your-clusters-upgrading-elasticsearch
http://speakerdeck.com/elasticsearch/maintaining-performance-in-distributed-systems