Elasticsearch - Speed is key

Elastic Co
April 23, 2015

This talk moves beyond the standard introduction to Elasticsearch and focuses on how Elasticsearch tries to fulfill its near real-time contract. Specifically, I’ll show how Elasticsearch manages to be incredibly fast while handling huge amounts of data. After a quick introduction, we will walk through several search features and how the user can get the most out of Elasticsearch.

This talk goes under the hood, exploring features like search, aggregations, highlighting, and the (non-)use of probabilistic data structures, and gives a quick outlook on future Elasticsearch and Lucene releases.

Presented at CraftConf 2015, http://craft-conf.com

Transcript

  1. Elasticsearch - Speed is key Alexander Reelsen @spinscale

  2. Agenda
    • Introduction
    • Elasticsearch
    • Speed by Example
    • Search
    • Aggregations
    • Operating System
    • Distributed aspects
    www.elastic.co
  3. About
    • Me
      • joined in March 2013
      • working on Elasticsearch & Shield
      • Interested in all things scale & search
    • Elastic
      • Founded in 2012
      • Behind: Elasticsearch, Logstash, Kibana, Marvel, Shield, ES for Hadoop, Elasticsearch clients
      • Support subscriptions
      • Public & private trainings
  5. Introduction: Elasticsearch - High level overview
  6. Elasticsearch is...
    an open source, distributed, scalable, highly available, document-oriented, RESTful, full text search engine with real-time search and analytics capabilities
  7. Elasticsearch is...
    Apache 2.0 License https://www.apache.org/licenses/LICENSE-2.0
  11. Elasticsearch is...
    { "name" : "Craft", "geo" : { "city" : "Budapest", "lat" : 47.49, "lon" : 19.04 } }
    Source: http://json.org/
  12. Elasticsearch is...
    Source: https://httpwg.github.io/asset/http.svg
  15. Getting up and running... is easy
    # wget https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.5.1.zip
    # unzip elasticsearch-1.5.1.zip
    # cd elasticsearch-1.5.1
    # ./bin/elasticsearch
    # curl http://localhost:9200
  16. Scaling
  17. Cluster: A collection of nodes
    [diagram: nodes grouped inside a cluster]
  21. Shards: Unit of scale
    [diagram: shards a0, a1, a2, a3]
    # curl -X PUT http://localhost:9200/a -d '{ "index.number_of_shards" : 4 }'
  23. Shards: Unit of scale
    [diagram: shards a0, a1, a2, a3]
    # curl -X PUT http://localhost:9200/a/_settings -d '{ "index.number_of_replicas" : 1 }'
  24. Replication
    [diagram: shards a0, a1, a2, a3 and their copies a0, a3, a2, a1]
    # curl -X PUT http://localhost:9200/a/_settings -d '{ "index.number_of_replicas" : 1 }'
  25. Search
  26. CRUD
    PUT books/book/1
    {
      "name" : "Elasticsearch - The definitive guide",
      "authors" : [ "Clinton Gormley", "Zachary Tong" ],
      "pages" : 722,
      "published_at" : "2015/01/31"
    }
  27. CRUD
    GET books/book/1
  28. CRUD
    DELETE books/book/1
  29. CRUD
    GET books/book/_search?q=elasticsearch
  30. Searching
    GET books/book/_search
    {
      "query" : {
        "filtered" : {
          "query" : { "match" : { "name" : "elasticsearch" } },
          "filter" : { "range" : { "published_at" : { "gte" : "now-1y" } } }
        }
      }
    }
  31. Searching
    Response:
    {
      "took": 3,
      "timed_out": false,
      "_shards": { "total": 5, "successful": 5, "failed": 0 },
      "hits": {
        "total": 1,
        "max_score": 0.15342641,
        "hits": [
          {
            "_index": "books",
            "_type": "book",
            "_id": "1",
            "_score": 0.15342641,
            "_source": {
              "name": "Elasticsearch - The definitive guide",
              "authors": [ "Clinton Gormley", "Zachary Tong" ],
              "pages": 722,
              "category": "search",
              "published_at": "2015/01/31"
            }
          }
        ]
      }
    }
  32. Searching
    GET books/book/_search
    {
      "query" : {
        "filtered" : {
          "query" : { "match" : { "name" : "elasticsearch" } },
          "filter" : { "range" : { "published_at" : { "gte" : "now-1y" } } }
        }
      },
      "aggs" : {
        "category" : { "terms" : { "field" : "category" } }
      }
    }
  33. Searching
    Response:
    {
      "took": 3,
      "timed_out": false,
      "_shards": { "total": 5, "successful": 5, "failed": 0 },
      "hits": { "total": 1, "max_score": 0.15342641, "hits": [ ... ] },
      "aggregations": {
        "category": {
          "buckets": [ { "key": "search", "doc_count": 1 }, { ... } ]
        }
      }
    }
  34. Search
    • Hits all relevant shards
    • Searches for top-N results per shard
    • Reduces to top-N total
    • Gets top-N documents/data from relevant shards
    • Returns data to requesting client
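
The search steps on slide 34 can be sketched as a scatter/gather over shards. This is a minimal illustration, not Elasticsearch's implementation: the shard contents, scores, and the function names `query_shard` and `search` are all hypothetical.

```python
import heapq

# Hypothetical shard contents: each shard maps a document id to a relevance score.
SHARDS = [
    {"doc-1": 0.15, "doc-4": 0.90, "doc-7": 0.42},
    {"doc-2": 0.77, "doc-5": 0.05},
    {"doc-3": 0.61, "doc-6": 0.88, "doc-8": 0.30},
]


def query_shard(shard, n):
    """Query phase on one shard: return its local top-n as (score, doc_id) pairs."""
    return heapq.nlargest(n, ((score, doc_id) for doc_id, score in shard.items()))


def search(shards, n):
    """Ask every shard for its top-n, then reduce the partial results to a global top-n."""
    partial = [hit for shard in shards for hit in query_shard(shard, n)]
    top = heapq.nlargest(n, partial)
    # Fetch phase (simplified): only the winning ids would be loaded from their shards.
    return [doc_id for _score, doc_id in top]


print(search(SHARDS, 3))
```

Each shard only ever ships N candidates, so the coordinating step merges at most N times the number of shards entries regardless of how many documents matched.
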
  35. Search on a single shard
    • Lucene is doing the heavy lifting
    • A single shard is a Lucene index
    • Each field is its own inverted index and can be searched in
    term     docid
    clinton  1
    gormley  1
    tong     1
    zachary  1
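
A per-field inverted index like the table on slide 35 can be sketched in a few lines. The analysis step here (lowercasing and splitting on whitespace) is a simplifying assumption, and `build_inverted_index` is a hypothetical helper, not a Lucene API.

```python
from collections import defaultdict


def build_inverted_index(docs, field):
    """Map each term of `field` to the sorted ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, doc in docs.items():
        value = doc[field]
        values = value if isinstance(value, list) else [value]
        for item in values:
            # Assumed analysis: lowercase, then split on whitespace.
            for term in str(item).lower().split():
                postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


docs = {
    1: {
        "name": "Elasticsearch - The definitive guide",
        "authors": ["Clinton Gormley", "Zachary Tong"],
    }
}

authors_index = build_inverted_index(docs, "authors")
for term, doc_ids in authors_index.items():
    print(term, doc_ids)
```
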
  36. Example: Return original JSON in search response
    • It is very hard to reconstruct the original data from the inverted index
    • Solution: Just store the whole document in its own field and retrieve it when returning data to the client
    _source:
    {
      "name" : "Elasticsearch - The definitive guide",
      "authors" : [ "Clinton Gormley", "Zachary Tong" ],
      "pages" : 722,
      "published_at" : "2015/01/31"
    }
  37. Example: _all field
    {
      "name" : "Elasticsearch - The definitive guide",
      "authors" : [ "Clinton Gormley", "Zachary Tong" ],
      "pages" : 722,
      "published_at" : "2015/01/31"
    }
    _all: Elasticsearch - The definitive guide Clinton Gormley Zachary Tong 722 2015/01/31
  38. Example: copy_to field (name & authors)
    {
      "name" : "Elasticsearch - The definitive guide",
      "authors" : [ "Clinton Gormley", "Zachary Tong" ],
      "pages" : 722,
      "published_at" : "2015/01/31"
    }
    copy_to: Elasticsearch - The definitive guide Clinton Gormley Zachary Tong
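
The difference between the `_all` and `copy_to` examples on slides 37 and 38 comes down to which field values are concatenated into the extra searchable field. A rough sketch, where `flatten_values` is a hypothetical helper and plain string concatenation stands in for the real indexing step:

```python
def flatten_values(doc, fields=None):
    """Join the values of the selected fields (all fields if None) into one string."""
    parts = []
    for name, value in doc.items():
        if fields is not None and name not in fields:
            continue
        values = value if isinstance(value, list) else [value]
        parts.extend(str(v) for v in values)
    return " ".join(parts)


book = {
    "name": "Elasticsearch - The definitive guide",
    "authors": ["Clinton Gormley", "Zachary Tong"],
    "pages": 722,
    "published_at": "2015/01/31",
}

all_field = flatten_values(book)                                 # like _all
copy_to_field = flatten_values(book, fields={"name", "authors"}) # like copy_to
print(all_field)
print(copy_to_field)
```
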
  39. Search: Using filters
    • Filters do not contribute to score & can be cached using a BitSet
    • range filter for a date/price
    • term filter for a category
    • geo filter for a bounding box
    0 1 1 0 0 1   { "term" : { "category" : "search" } }
    0 1 0 0 1 0   { "term" : { "category" : "reduced" } }
  40. Filters: Missing fields
    • Problem: How to search in an inverted index for non-existing fields (exists & missing filter)?
    • Costly: Need to merge postings lists of all existing terms (expensive for high-cardinality fields!)
    • Solution: Index document field names under _field_names
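
The `_field_names` idea from slide 40 can be illustrated by indexing field names as if they were terms, so that an exists or missing check becomes a single postings lookup. The helper names and sample documents are hypothetical, and treating `None` as "no value" is an assumption of this sketch.

```python
from collections import defaultdict


def index_field_names(docs):
    """Build a postings list per field name: field name -> ids of documents that have a value."""
    postings = defaultdict(set)
    for doc_id, doc in docs.items():
        for name, value in doc.items():
            if value is not None:  # assumption: None means the field has no value
                postings[name].add(doc_id)
    return postings


def exists(postings, field):
    """Documents that have the field: one lookup instead of merging all term postings."""
    return set(postings.get(field, set()))


def missing(postings, field, all_ids):
    """Documents without the field: everything except the exists set."""
    return set(all_ids) - exists(postings, field)


docs = {
    1: {"name": "Elasticsearch - The definitive guide", "category": "search"},
    2: {"name": "A book without a category"},
    3: {"name": "Another book", "category": None},
}

field_names = index_field_names(docs)
print(exists(field_names, "category"))
print(missing(field_names, "category", docs.keys()))
```
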
  41. Aggregations
  42. Aggregations
    • Aggregations: Buckets & metrics
    • Aggregations cannot make use of the inverted index
    • Meet Fielddata: Uninverting the index
      • Inverted index: Maps term to document id
      • Fielddata: Maps document id to terms
  43. Aggregations
    Fielddata
    docid  term
    1      Clinton Gormley, Zachary Tong
    Inverted Index
    term     docid
    clinton  1
    gormley  1
    tong     1
    zachary  1
  44. Aggregations
    Fielddata
    docid  term
    1      Clinton Gormley, Zachary Tong
    Inverted Index
    term             docid
    Clinton Gormley  1
    Zachary Tong     1
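
Uninverting, as described on slides 42 to 44, means flipping the term-to-document mapping into a document-to-terms mapping that an aggregation can iterate over. A minimal sketch; the second document and the function names are hypothetical additions for illustration.

```python
from collections import Counter, defaultdict


def uninvert(inverted_index):
    """Turn term -> doc ids into doc id -> terms (the fielddata direction)."""
    fielddata = defaultdict(list)
    for term, doc_ids in inverted_index.items():
        for doc_id in doc_ids:
            fielddata[doc_id].append(term)
    return dict(fielddata)


def terms_aggregation(fielddata, matching_doc_ids):
    """Count terms across the matching documents, similar to a terms bucket aggregation."""
    counts = Counter()
    for doc_id in matching_doc_ids:
        counts.update(fielddata.get(doc_id, []))
    return counts


# Inverted index over a non-analyzed authors field; document 2 is a made-up extra.
inverted = {"Clinton Gormley": [1, 2], "Zachary Tong": [1]}

fielddata = uninvert(inverted)
buckets = terms_aggregation(fielddata, [1, 2])
print(fielddata)
print(buckets.most_common())
```

The aggregation starts from the documents that matched the query, which is why the document-to-terms direction is the one it needs.
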
  45. Aggregations: Fielddata
    • Fielddata is an in-memory data structure, lazily constructed
    • Easy to go OOM (wrong field or too many documents)
    • Solution:
      • circuit breaker
      • doc_values: index-time data structure, no heap, uses the file system cache, better compression
  46. Aggregations: Probabilistic data structures
    • Problem: Count distinct elements
    • Naive: Load all data into a set, then check the size (distributed?)
    • Solution: cardinality aggregation, which uses HyperLogLog++
      • configurable precision, allows trading memory for accuracy
      • excellent accuracy on low-cardinality sets
      • fixed memory usage: no matter if there are tens or billions of unique values, memory usage only depends on configured precision
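
To make the fixed-memory claim on slide 46 concrete, here is a sketch of plain HyperLogLog with a linear-counting correction for small sets. It is not the HyperLogLog++ variant named on the slide, and the SHA-1 based hash, the class name, and the default precision are assumptions chosen for illustration.

```python
import hashlib
import math


class HyperLogLog:
    """Plain HyperLogLog: memory is 2**precision registers, independent of input size."""

    def __init__(self, precision=12):
        self.p = precision
        self.m = 1 << precision
        self.registers = [0] * self.m

    def add(self, value):
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash
        index = h >> (64 - self.p)                   # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1 # position of the leftmost 1-bit
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        """Union of two sketches: element-wise maximum, handy for per-shard results."""
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)        # bias constant, valid for m >= 128
        estimate = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            estimate = self.m * math.log(self.m / zeros)  # linear counting for small sets
        return round(estimate)


hll = HyperLogLog(precision=12)
for i in range(100_000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")  # duplicates do not change the estimate
print(len(hll.registers), hll.count())  # estimate is approximate, close to 100000
```

Raising `precision` doubles the register count per step and lowers the expected error, which is the memory-for-accuracy trade described on the slide. The `merge` method hints at why this structure suits the distributed case that the naive set approach struggles with.
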
  47. Aggregations: Probabilistic data structures
    • Problem: Calculate percentiles
    • Naive: Maintain a sorted list of all values
    • Solution: percentiles aggregation, which uses T-Digest
      • extreme percentiles are more accurate
      • small sets can be up to 100% accurate
      • while values are added to a bucket, the algorithm trades accuracy for memory savings
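
For contrast with T-Digest, the naive approach from slide 47 can be written down directly. This sketch is only the baseline, not T-Digest itself: it is exact, but it keeps every value in memory. The class name and the nearest-rank percentile definition are assumptions.

```python
import bisect
import math


class NaivePercentiles:
    """Exact percentiles from a fully sorted list; memory grows with every added value."""

    def __init__(self):
        self.values = []

    def add(self, value):
        bisect.insort(self.values, value)  # keep the list sorted on insert

    def percentile(self, p):
        """Nearest-rank percentile for 0 < p <= 100."""
        if not self.values:
            raise ValueError("no values added")
        rank = max(1, math.ceil(p * len(self.values) / 100))
        return self.values[rank - 1]


latencies = NaivePercentiles()
for value in range(100, 0, -1):  # hypothetical response times, added out of order
    latencies.add(value)

print(latencies.percentile(50), latencies.percentile(99))
```

An approximate structure replaces the full list with a bounded summary, accepting a small error in exchange for memory that no longer scales with the number of values.
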
  48. Operating system & Hardware
  49. Elasticsearch can easily max out...
    • CPU: Indexing, searching, highlighting
    • I/O: Indexing, searching, merging
    • Memory: Aggregation, indices
    • Network: Relocation, Snapshot & Restore
  50. Hardware
    • CPU: Threadpools are sized on number of cores
    • Disk: SSD
    • Memory: ∞
    • Network: GbE or better
  51. Operating system
    • file system cache
    • file handles
    • memory locking: bootstrap.mlockall
    • don't swap, no OOM killer
  52. Distributed aspects
  53. Fallacies of distributed computing
    • The network is reliable
    • Latency is zero
    • Bandwidth is infinite
    • The network is secure
    • Topology doesn't change
    • There is one administrator
    • Transport cost is zero
    • The network is homogeneous
    by Peter Deutsch, https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing
  54. Wrapup
  55. Summary
    • Speed is key!
    • Search is a tradeoff: Query time vs. index time
    • Benchmark your use-case
    • http://benchmarks.elasticsearch.org/
  56. Elasticsearch 2.x
    • Automatic I/O throttling
    • Clusterstate incremental updates
    • Faster recovery
    • Aggregations 2.0
    • Merge queries and filters
    • Reindex API
    • Changes API
    • Expression scripting engine
  57. Lucene 5.x
    • Speed improvements in queries (must_not, sloppy phrase)
    • Automated caching
    • BitSet compression vastly improved (roaring bitsets)
    • Index compression (on disk + memory)
    • Indexing performance (adaptive merge throttling, SSD detection)
    • Index safety: atomic commits, segment commit identifiers, verify integrity at merge
    • ...
  58. Thanks for listening! Questions?
    Alexander Reelsen, alex@elastic.co, @spinscale
    We’re hiring: https://www.elastic.co/about/careers
    We’re helping: https://www.elastic.co/subscriptions
  59. References
    http://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html
    http://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-all-field.html
    http://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-field-names-field.html
    http://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-core-types.html#copy-to
    http://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html
    http://www.elastic.co/guide/en/elasticsearch/guide/current/cardinality.html
    http://www.elastic.co/guide/en/elasticsearch/guide/current/percentiles.html
    http://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-percentile-aggregation.html
    http://www.elastic.co/elasticon/2015/sf/elasticsearch-architecture-amusing-algorithms-and-details-on-data-structures/
    http://speakerdeck.com/elastic/all-about-aggregations
    http://www.elastic.co/elasticon/2015/sf/updates-from-lucene-land
    http://speakerdeck.com/elastic/resiliency-in-elasticsearch-and-lucene
    http://www.elastic.co/elasticon/2015/sf/level-up-your-clusters-upgrading-elasticsearch
    http://speakerdeck.com/elasticsearch/maintaining-performance-in-distributed-systems