
Elasticsearch - Speed is key

Elastic Co
April 23, 2015

This talk moves beyond the standard introduction to Elasticsearch and focuses on how Elasticsearch tries to fulfill its near real-time contract. Specifically, I’ll show how Elasticsearch manages to be incredibly fast while handling huge amounts of data. After a quick introduction, we will walk through several search features and how the user can get the most out of Elasticsearch.

This talk goes under the hood, exploring features like search, aggregations, highlighting, and the (non-)use of probabilistic data structures, and gives a quick outlook on future Elasticsearch and Lucene releases.

Presented at CraftConf 2015, http://craft-conf.com

Transcript

  1. Elasticsearch - Speed is key
    Alexander Reelsen
    @spinscale


  2. Agenda
    • Introduction
    • Elasticsearch
    • Speed by Example
    • Search
    • Aggregations
    • Operating System
    • Distributed aspects

  3. About
    • Me
      • Joined in March 2013
      • Working on Elasticsearch & Shield
      • Interested in all things scale & search
    • Elastic
      • Founded in 2012
      • Behind: Elasticsearch, Logstash, Kibana, Marvel, Shield, ES for Hadoop, Elasticsearch clients
      • Support subscriptions
      • Public & private trainings

  5. Introduction
    Elasticsearch - High level overview

  6. Elasticsearch is...
    an open source, distributed, scalable, highly available, document-oriented,
    RESTful, full text search engine with real-time search and analytics capabilities

  7. Elasticsearch is...
    an open source, distributed, scalable, highly available, document-oriented,
    RESTful, full text search engine with real-time search and analytics capabilities
    Apache 2.0 License
    https://www.apache.org/licenses/LICENSE-2.0

  11. Elasticsearch is...
    an open source, distributed, scalable, highly available, document-oriented,
    RESTful, full text search engine with real-time search and analytics capabilities
    {
      "name" : "Craft",
      "geo" : {
        "city" : "Budapest",
        "lat" : 47.49, "lon" : 19.04
      }
    }
    Source: http://json.org/

  12. Elasticsearch is...
    an open source, distributed, scalable, highly available, document-oriented,
    RESTful, full text search engine with real-time search and analytics capabilities
    Source: https://httpwg.github.io/asset/http.svg

  15. Getting up and running... is easy
    # wget https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.5.1.zip
    # unzip elasticsearch-1.5.1.zip
    # cd elasticsearch-1.5.1
    # ./bin/elasticsearch
    # curl http://localhost:9200

  16. Scaling

  17. Cluster: A collection of nodes
    [Diagram: cluster, 1 node]

  18. Cluster: A collection of nodes
    [Diagram: cluster, 2 nodes]

  19. Cluster: A collection of nodes
    [Diagram: cluster, 2 nodes]

  20. Cluster: A collection of nodes
    [Diagram: cluster, 4 nodes]

  21. Shards: Unit of scale
    [Diagram: shards a0, a1, a2, a3]
    # curl -X PUT http://localhost:9200/a -d '
    { "index.number_of_shards" : 4 }'

  22. Shards: Unit of scale
    [Diagram: shards a0, a1, a2, a3]
    # curl -X PUT http://localhost:9200/a -d '
    { "index.number_of_shards" : 4 }'

  23. Shards: Unit of scale
    [Diagram: shards a0, a1, a2, a3]
    # curl -X PUT http://localhost:9200/a/_settings -d '
    { "index.number_of_replicas" : 1 }'

  24. Replication
    [Diagram: shards a0, a1, a2, a3, each with one replica copy]
    # curl -X PUT http://localhost:9200/a/_settings -d '
    { "index.number_of_replicas" : 1 }'
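
The shard slides above can be illustrated with a rough sketch of how a document ends up on one of the four shards: a routing value (the document id by default) is hashed and taken modulo the number of primary shards. The function names and the CRC32 hash are assumptions made for this illustration; Elasticsearch uses its own hash function internally.

```python
import zlib


def shard_for(routing_key: str, number_of_shards: int) -> int:
    """Pick a shard for a routing key (by default the document id).

    CRC32 is only a stand-in hash for this sketch, not the hash
    Elasticsearch actually uses.
    """
    return zlib.crc32(routing_key.encode("utf-8")) % number_of_shards


def layout(doc_ids, number_of_shards):
    """Group document ids by the shard they would be routed to."""
    shards = {n: [] for n in range(number_of_shards)}
    for doc_id in doc_ids:
        shards[shard_for(doc_id, number_of_shards)].append(doc_id)
    return shards


if __name__ == "__main__":
    # Index "a" from the slides has 4 shards: a0, a1, a2, a3.
    for shard_no, ids in layout([str(i) for i in range(20)], 4).items():
        print(f"a{shard_no}: {ids}")
```

Because the target shard depends on the modulo, changing the shard count would send existing ids to different shards. That is one way to read why the slides set number_of_shards when the index is created, while number_of_replicas is changed later through _settings.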

  25. Search

  26. CRUD
    PUT books/book/1
    {
      "name" : "Elasticsearch - The definitive guide",
      "authors" : [ "Clinton Gormley", "Zachary Tong" ],
      "pages" : 722,
      "published_at" : "2015/01/31"
    }

  27. CRUD
    PUT books/book/1
    {
      "name" : "Elasticsearch - The definitive guide",
      "authors" : [ "Clinton Gormley", "Zachary Tong" ],
      "pages" : 722,
      "published_at" : "2015/01/31"
    }
    GET books/book/1

  28. CRUD
    PUT books/book/1
    {
      "name" : "Elasticsearch - The definitive guide",
      "authors" : [ "Clinton Gormley", "Zachary Tong" ],
      "pages" : 722,
      "published_at" : "2015/01/31"
    }
    GET books/book/1
    DELETE books/book/1

  29. CRUD
    PUT books/book/1
    {
      "name" : "Elasticsearch - The definitive guide",
      "authors" : [ "Clinton Gormley", "Zachary Tong" ],
      "pages" : 722,
      "published_at" : "2015/01/31"
    }
    GET books/book/1
    DELETE books/book/1
    GET books/book/_search?q=elasticsearch

  30. Searching
    GET books/book/_search
    {
      "query" : { "filtered" : {
        "query" : { "match" : { "name" : "elasticsearch" }},
        "filter" : {
          "range" : { "published_at" : { "gte" : "now-1y" } }
        }
      }}
    }

  31. Searching
    GET books/book/_search
    {
      "query" : { "filtered" : {
        "query" : { "match" : { "name" : "elasticsearch" }},
        "filter" : {
          "range" : { "published_at" : { "gte" : "now-1y" } }
        }
      }}
    }
    Response:
    {
      "took": 3, "timed_out": false,
      "_shards": { "total": 5, "successful": 5, "failed": 0 },
      "hits": {
        "total": 1, "max_score": 0.15342641,
        "hits": [ {
          "_index": "books", "_type": "book", "_id": "1",
          "_score": 0.15342641,
          "_source": {
            "name": "Elasticsearch - The definitive guide",
            "authors": [ "Clinton Gormley", "Zachary Tong" ],
            "pages": 722, "category": "search",
            "published_at": "2015/01/31"
          } } ] } }

  32. Searching
    GET books/book/_search
    {
      "query" : { "filtered" : {
        "query" : { "match" : { "name" : "elasticsearch" }},
        "filter" : {
          "range" : { "published_at" : { "gte" : "now-1y" } }
        }
      }},
      "aggs" : {
        "category" : { "terms" : { "field" : "category" } }
      }
    }

  33. Searching
    GET books/book/_search
    {
      "query" : { "filtered" : {
        "query" : { "match" : { "name" : "elasticsearch" }},
        "filter" : {
          "range" : { "published_at" : { "gte" : "now-1y" } }
        }
      }},
      "aggs" : {
        "category" : { "terms" : { "field" : "category" } }
      }
    }
    Response:
    {
      "took": 3, "timed_out": false,
      "_shards": { "total": 5, "successful": 5, "failed": 0 },
      "hits": {
        "total": 1, "max_score": 0.15342641,
        "hits": [ ... ]
      },
      "aggregations": {
        "category": {
          "buckets": [
            { "key": "search", "doc_count": 1 },
            { ... }
          ] } } }

  34. Search
    • Hits all relevant shards
    • Searches for top-N results per shard
    • Reduces to top-N total
    • Gets top-N documents/data from relevant shards
    • Returns data to requesting client
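
The five search steps listed above can be sketched as a small scatter/gather routine. This is a simplified illustration rather than Elasticsearch's implementation: shards are modelled as plain dicts, and the `score` callback and all function names are hypothetical.

```python
import heapq


def search_shard(shard_docs, score, n):
    """Per-shard step: return the top-n (score, doc_id) pairs of one shard."""
    scored = ((score(doc), doc_id) for doc_id, doc in shard_docs.items())
    return heapq.nlargest(n, scored)


def search(shards, score, n):
    """Scatter to all shards, reduce to a global top-n, then fetch the documents."""
    candidates = []
    for shard_no, shard_docs in enumerate(shards):
        # Each shard only reports ids and scores, not whole documents.
        for s, doc_id in search_shard(shard_docs, score, n):
            candidates.append((s, shard_no, doc_id))
    top = heapq.nlargest(n, candidates)
    # Fetch step: only the winning documents are loaded from their shards.
    return [(doc_id, shards[shard_no][doc_id]) for _, shard_no, doc_id in top]


if __name__ == "__main__":
    shards = [
        {"1": "elasticsearch guide", "2": "lucene in action"},
        {"3": "elasticsearch elasticsearch server"},
    ]
    # Toy relevance: how often the term occurs in the text.
    print(search(shards, lambda text: text.count("elasticsearch"), 2))
```

Each shard returns at most N candidates, so the coordinating step merges at most N times the number of shards entries before fetching the final N documents.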

  35. Search on a single shard
    • Lucene is doing the heavy lifting
    • A single shard is a Lucene index
    • Each field is its own inverted index and can be searched in
    term     docid
    clinton  1
    gormley  1
    tong     1
    zachary  1
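
The term/docid table above can be reproduced with a minimal inverted-index sketch. Lowercasing and whitespace splitting stand in for a real analyzer here, which is an assumption of this example; the function name is hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs, field):
    """Map each term of `field` to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        values = doc[field]
        if isinstance(values, str):
            values = [values]
        for value in values:
            # Simplified analysis: lowercase, then split on whitespace.
            for term in value.lower().split():
                index[term].add(doc_id)
    return index


if __name__ == "__main__":
    docs = {1: {"authors": ["Clinton Gormley", "Zachary Tong"]}}
    index = build_inverted_index(docs, "authors")
    for term in sorted(index):
        print(term, sorted(index[term]))
```

Looking up a term is then a single dictionary access, which is what makes term queries on an inverted index cheap.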

  36. Example: Return original JSON in search response
    • It is very hard to reconstruct the original data from the inverted index
    • Solution: Just store the whole document in its own field and retrieve it when returning data to the client
    _source:
    {
      "name" : "Elasticsearch - The definitive guide",
      "authors" : [ "Clinton Gormley", "Zachary Tong" ],
      "pages" : 722,
      "published_at" : "2015/01/31"
    }

  37. Example: _all field
    {
      "name" : "Elasticsearch - The definitive guide",
      "authors" : [ "Clinton Gormley", "Zachary Tong" ],
      "pages" : 722,
      "published_at" : "2015/01/31"
    }
    _all:
    Elasticsearch - The definitive guide
    Clinton Gormley Zachary Tong
    722
    2015/01/31

  38. Example: copy_to field (name & authors)
    {
      "name" : "Elasticsearch - The definitive guide",
      "authors" : [ "Clinton Gormley", "Zachary Tong" ],
      "pages" : 722,
      "published_at" : "2015/01/31"
    }
    copy_to:
    Elasticsearch - The definitive guide
    Clinton Gormley Zachary Tong
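
The _all and copy_to examples above both boil down to concatenating field values into one extra searchable field at index time. A minimal sketch of that idea follows; `combined_field` and `flatten` are hypothetical helpers for illustration, not Elasticsearch APIs.

```python
def flatten(value):
    """Turn a field value (scalar or list) into one space-separated string."""
    if isinstance(value, (list, tuple)):
        return " ".join(flatten(v) for v in value)
    return str(value)


def combined_field(doc, fields=None):
    """Concatenate all fields (like _all) or only the chosen ones (like copy_to)."""
    names = doc.keys() if fields is None else fields
    return " ".join(flatten(doc[name]) for name in names if name in doc)


if __name__ == "__main__":
    book = {
        "name": "Elasticsearch - The definitive guide",
        "authors": ["Clinton Gormley", "Zachary Tong"],
        "pages": 722,
        "published_at": "2015/01/31",
    }
    print(combined_field(book))                       # every field, as in _all
    print(combined_field(book, ["name", "authors"]))  # name & authors only
```

The trade-off is the one the talk keeps returning to: more work and more index data at index time in exchange for a single-field query at search time.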

  39. Search: Using filters
    • Filters do not contribute to score & can be cached using a BitSet
    • range filter for a date/price
    • term filter for a category
    • geo filter for a bounding box
    0 1 1 0 0 1   { "term" : { "category" : "search" } }
    0 1 0 0 1 0   { "term" : { "category" : "reduced" } }
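
The two cached filter results above can be combined with plain bitwise operations, which is what makes BitSet caching attractive. The sketch below uses Python integers as bitsets and assumes the leftmost bit on the slide belongs to document 0; both the helper names and that bit ordering are assumptions.

```python
def bitset(bits: str) -> int:
    """Parse '0 1 1 0 0 1' (document 0 first) into an int with bit i set for doc i."""
    result = 0
    for doc_id, bit in enumerate(bits.split()):
        if bit == "1":
            result |= 1 << doc_id
    return result


def doc_ids(bits: int):
    """List the document ids whose bit is set."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


# Cached results of the two term filters from the slide.
search = bitset("0 1 1 0 0 1")   # category: search
reduced = bitset("0 1 0 0 1 0")  # category: reduced

if __name__ == "__main__":
    print(doc_ids(search & reduced))  # documents matching both filters
    print(doc_ids(search | reduced))  # documents matching either filter
```

With this bit ordering, only document 1 matches both filters. No scoring is involved, so a cached bitset can be reused by any later query that contains the same filter.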

  40. Filters: Missing fields
    • Problem: How to search in an inverted index for non-existing fields (exists & missing filter)?
    • Costly: Need to merge postings lists of all existing terms (expensive for high-cardinality fields!)
    • Solution: Index document field names under _field_names
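
A sketch of the _field_names idea described above: at index time, record which documents contain which field, so that exists and missing become a single lookup instead of a merge over every term's postings list. The data layout and function names are illustrative assumptions.

```python
from collections import defaultdict


def index_field_names(docs):
    """Build a field-name index: field name -> ids of documents that have a value."""
    field_names = defaultdict(set)
    for doc_id, doc in docs.items():
        for field, value in doc.items():
            if value is not None:
                field_names[field].add(doc_id)
    return field_names


def exists(field_names, field):
    """Ids of documents in which `field` has a value: one lookup."""
    return set(field_names.get(field, ()))


def missing(field_names, all_doc_ids, field):
    """Ids of documents without `field`: the complement of exists."""
    return set(all_doc_ids) - exists(field_names, field)


if __name__ == "__main__":
    docs = {
        1: {"name": "Elasticsearch - The definitive guide", "category": "search"},
        2: {"name": "Some other book"},
    }
    names = index_field_names(docs)
    print(exists(names, "category"), missing(names, docs, "category"))
```

The cost of the lookup no longer depends on how many distinct terms the field has, which is the high-cardinality problem the slide points out.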

  41. Aggregations

  42. Aggregations
    • Aggregations: Buckets & metrics
    • Aggregations cannot make use of the inverted index
    • Meet Fielddata: Uninverting the index
    • Inverted index: Maps term to document id
    • Fielddata: Maps document id to terms

  43. Aggregations
    Inverted Index
    term     docid
    clinton  1
    gormley  1
    tong     1
    zachary  1
    Fielddata
    docid  term
    1      Clinton Gormley, Zachary Tong

  44. Aggregations
    Inverted Index
    term             docid
    Clinton Gormley  1
    Zachary Tong     1
    Fielddata
    docid  term
    1      Clinton Gormley, Zachary Tong
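
The relationship between the two tables above can be sketched as "uninverting": walking the term-to-docid mapping once to build a docid-to-terms mapping, which a terms aggregation can then count over. This is a toy in-memory model with hypothetical function names, not how Lucene stores either structure.

```python
from collections import Counter


def uninvert(inverted_index):
    """Turn term -> doc ids into doc id -> terms (the fielddata direction)."""
    fielddata = {}
    for term, ids in inverted_index.items():
        for doc_id in ids:
            fielddata.setdefault(doc_id, []).append(term)
    for terms in fielddata.values():
        terms.sort()
    return fielddata


def terms_aggregation(fielddata, matching_doc_ids):
    """Count terms over the matching documents, most frequent first."""
    counts = Counter()
    for doc_id in matching_doc_ids:
        counts.update(fielddata.get(doc_id, ()))
    return counts.most_common()


if __name__ == "__main__":
    inverted = {"Clinton Gormley": {1}, "Zachary Tong": {1, 2}}
    fielddata = uninvert(inverted)
    print(fielddata)
    print(terms_aggregation(fielddata, [1, 2]))
```

The aggregation starts from document ids (the hits of the query) and needs their terms, which is why the inverted index alone is the wrong direction for it.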

  45. Aggregations: Fielddata
    • Fielddata is an in-memory data structure, lazily constructed
    • Easy to go OOM (wrong field or too many documents)
    • Solution:
      • circuit breaker
      • doc_values: index-time data structure, no heap, uses the file system cache, better compression

  46. Aggregations: Probabilistic data structures
    • Problem: Count distinct elements
    • Naive: Load all data into a set, then check the size (distributed?)
    • Solution: cardinality aggregation, which uses HyperLogLog++
      • configurable precision, which allows trading memory for accuracy
      • excellent accuracy on low-cardinality sets
      • fixed memory usage: no matter if there are tens or billions of unique values, memory usage only depends on the configured precision
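
To make the fixed-memory claim above concrete, here is a sketch of plain HyperLogLog with a linear-counting correction for small sets. It is not the HyperLogLog++ variant the cardinality aggregation uses, and the class name, the SHA-1 based hash, and the default precision are assumptions chosen for this illustration.

```python
import hashlib
import math


class HyperLogLog:
    """Plain HyperLogLog: 2**precision one-byte registers, regardless of input size."""

    def __init__(self, precision=14):
        if not 7 <= precision <= 18:
            raise ValueError("precision must be between 7 and 18 in this sketch")
        self.p = precision
        self.m = 1 << precision
        self.registers = bytearray(self.m)

    def add(self, value):
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")      # 64-bit hash
        index = h >> (64 - self.p)                 # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)      # bias constant for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small cardinalities: fall back to linear counting.
            return round(self.m * math.log(self.m / zeros))
        return round(raw)


if __name__ == "__main__":
    hll = HyperLogLog(precision=14)
    for i in range(100_000):
        hll.add(f"user-{i}")
    print(hll.count(), "estimated distinct values using", len(hll.registers), "registers")
```

Raising the precision doubles the number of registers per step and lowers the expected error, which is the memory-for-accuracy trade the slide describes. Sketches with the same precision can also be merged by taking the per-register maximum, which is what makes the approach fit a distributed setting.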

  47. Aggregations: Probabilistic data structures
    • Problem: Calculate percentiles
    • Naive: Maintain a sorted list of all values
    • Solution: percentiles aggregation, which uses T-Digest
      • extreme percentiles are more accurate
      • small sets can be up to 100% accurate
      • while values are added to a bucket, the algorithm trades accuracy for memory savings
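
For contrast with T-Digest, the naive approach named above can be sketched in a few lines: keep every value in a sorted list and read the percentile off by rank (nearest-rank method). The class name is hypothetical; the point is that memory grows with the number of values, which is exactly what an approximate structure avoids.

```python
import bisect
import math


class NaivePercentiles:
    """Exact percentiles by keeping all values sorted (memory grows with the data)."""

    def __init__(self):
        self.values = []

    def add(self, value):
        # Insert while keeping the list sorted.
        bisect.insort(self.values, value)

    def percentile(self, p):
        """Nearest-rank percentile for 0 <= p <= 100."""
        if not self.values:
            raise ValueError("no values added")
        rank = max(1, math.ceil(p * len(self.values) / 100))
        return self.values[rank - 1]


if __name__ == "__main__":
    latencies = NaivePercentiles()
    for ms in (12, 15, 11, 250, 14, 13, 16, 12, 900, 15):
        latencies.add(ms)
    print(latencies.percentile(50), latencies.percentile(99))
```

In a distributed search every shard would have to ship its complete list to the coordinating node, so the exact approach scales poorly in both memory and network.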

  48. Operating system & Hardware

  49. Elasticsearch can easily max out...
    • CPU: Indexing, searching, highlighting
    • I/O: Indexing, searching, merging
    • Memory: Aggregation, indices
    • Network: Relocation, Snapshot & Restore

  50. Hardware
    • CPU: Threadpools are sized on number of cores
    • Disk: SSD
    • Memory: ∞
    • Network: GbE or better

  51. Operating system
    • file system cache
    • file handles
    • memory locking: bootstrap.mlockall
    • don't swap, no OOM killer

  52. Distributed aspects

  53. Fallacies of distributed computing
    • The network is reliable
    • Latency is zero
    • Bandwidth is infinite
    • The network is secure
    • Topology doesn't change
    • There is one administrator
    • Transport cost is zero
    • The network is homogeneous
    by Peter Deutsch
    https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing

  54. Wrapup

  55. Summary
    • Speed is key!
    • Search is a tradeoff: Query time vs. index time
    • Benchmark your use-case
    • http://benchmarks.elasticsearch.org/

  56. Elasticsearch 2.x
    • Automatic I/O throttling
    • Clusterstate incremental updates
    • Faster recovery
    • Aggregations 2.0
    • Merge queries and filters
    • Reindex API
    • Changes API
    • Expression scripting engine

  57. Lucene 5.x
    • Speed improvements in queries (must_not, sloppy phrase)
    • Automated caching
    • BitSet compression vastly improved (roaring bitsets)
    • Index compression (on disk + memory)
    • Indexing performance (adaptive merge throttling, SSD detection)
    • Index safety: atomic commits, segment commit identifiers, verify integrity at merge
    • ...

  58. Thanks for listening! Questions?
    Alexander Reelsen
    [email protected]
    @spinscale
    We’re hiring
    https://www.elastic.co/about/careers
    We’re helping
    https://www.elastic.co/subscriptions

  59. References
    http://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html
    http://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-all-field.html
    http://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-field-names-field.html
    http://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-core-types.html#copy-to
    http://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html
    http://www.elastic.co/guide/en/elasticsearch/guide/current/cardinality.html
    http://www.elastic.co/guide/en/elasticsearch/guide/current/percentiles.html
    http://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-percentile-aggregation.html
    http://www.elastic.co/elasticon/2015/sf/elasticsearch-architecture-amusing-algorithms-and-details-on-data-structures/
    http://speakerdeck.com/elastic/all-about-aggregations
    http://www.elastic.co/elasticon/2015/sf/updates-from-lucene-land
    http://speakerdeck.com/elastic/resiliency-in-elasticsearch-and-lucene
    http://www.elastic.co/elasticon/2015/sf/level-up-your-clusters-upgrading-elasticsearch
    http://speakerdeck.com/elasticsearch/maintaining-performance-in-distributed-systems