Resiliency in Elasticsearch & Lucene

As one of the most popular search engines built on Apache Lucene, Elasticsearch has to be resilient to hardware and network failure. This is why Elasticsearch invests heavily in enabling Elasticsearch and Apache Lucene to detect and cope with increasingly complex failures. Elasticsearch’s lead developer, Boaz Leskes, will cover the recent highlights and future plans of the company’s resiliency strategy. He will walk through every level of Elasticsearch, from the lowest level of a single file, through the network connections of a single node, all the way up to distributed failures at the cluster level. Although the talk is about possible failures and various coping strategies, participants will also get a peek under the hood and learn about the inner workings of Elasticsearch.

Talk given at the Tel Aviv Elastic Meetup

Boaz Leskes

July 08, 2015

Transcript

  1. Resiliency in Elasticsearch and Lucene
    Boaz Leskes and Igor Motov
    four months later


  2. { } CC-BY-ND 4.0
    Resiliency
    re·sil·ience, noun
    1. the power or ability to return to the original form, position, etc., after being bent, compressed, or stretched; elasticity.
    2. ability to recover readily from illness, depression, adversity, or the like; buoyancy.

  3. Failures happen

  4. Why is it (even more) important now?

  5. Average Elasticsearch cluster growth
    Large cluster circa 2011

  6. Average Elasticsearch cluster growth
    Large cluster circa 2011
    Large cluster circa 2015

  7. Failure rates × number of nodes
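
The arithmetic behind "failure rates × number of nodes" can be sketched as follows. This is an illustration only: it assumes nodes fail independently, and the 1% per-node probability in the usage lines is a made-up figure, not a number from the talk.

```python
def prob_any_node_fails(per_node_failure_prob: float, node_count: int) -> float:
    """Probability that at least one of `node_count` nodes fails.

    Assumes failures are independent, which real clusters only approximate.
    """
    if not 0.0 <= per_node_failure_prob <= 1.0:
        raise ValueError("per_node_failure_prob must be between 0 and 1")
    if node_count < 0:
        raise ValueError("node_count must not be negative")
    # P(at least one failure) = 1 - P(no node fails)
    return 1.0 - (1.0 - per_node_failure_prob) ** node_count


if __name__ == "__main__":
    # Hypothetical 1% chance that a given node fails in some time window.
    for nodes in (5, 50, 500):
        print(nodes, "nodes:", round(prob_any_node_fails(0.01, nodes), 3))
```

The larger the cluster, the closer this probability gets to 1, which is why a failure that is rare on one machine becomes routine at cluster scale.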

  8. [Diagram; axis labels: Slow, Fast, Node, Cluster]
    kill -9
    dead disk
    corruption
    long GC
    master gone
    network disconnects
    timeouts

  9. [Diagram; axis labels: Slow, Fast, Node, Cluster]
    kill -9
    dead disk
    corruption
    long GC
    master gone
    network disconnects
    timeouts
    SOFTWARE BUGS

  10. Work in progress

  11. Pulling the plug

  12. Replicas and Transaction Log
    Replica Shard: Lucene Index, Transaction Log, Lucene Buffer
    Primary Shard: Lucene Index, Transaction Log, Lucene Buffer

  16. Transaction Log
    • Transaction Log
    – stores every operation (create/update/delete)
    – fsync-ed every 5 sec (configurable), or on every request (default, coming in v2.0, #11011)
    • Lucene Segments
    – fsync-ed when the transaction log is full (every 30 min or 512 MB)
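
The two fsync policies on the transaction log slide can be sketched as a tiny append-only log. This is a minimal illustration, not Elasticsearch's translog: the `TransactionLog` class, its parameters, and the JSON-lines file format are all assumptions made for the example.

```python
import json
import os
import tempfile
import time


class TransactionLog:
    """Hypothetical append-only operation log with two fsync policies."""

    def __init__(self, path, sync_every_request=True, sync_interval_sec=5.0):
        self._file = open(path, "ab")
        self._sync_every_request = sync_every_request
        self._sync_interval_sec = sync_interval_sec
        self._last_sync = time.monotonic()
        self.sync_count = 0  # exposed so the policy is observable

    def append(self, operation: dict) -> None:
        # One JSON document per line (an assumed format for this sketch).
        self._file.write(json.dumps(operation).encode("utf-8") + b"\n")
        interval_elapsed = time.monotonic() - self._last_sync >= self._sync_interval_sec
        if self._sync_every_request or interval_elapsed:
            self.sync()

    def sync(self) -> None:
        # flush() hands data to the OS; fsync() asks the OS to put it on disk.
        self._file.flush()
        os.fsync(self._file.fileno())
        self._last_sync = time.monotonic()
        self.sync_count += 1

    def close(self) -> None:
        self.sync()
        self._file.close()


def replay(path):
    """Read back every logged operation, e.g. after a crash."""
    with open(path, "rb") as log_file:
        return [json.loads(line) for line in log_file if line.strip()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log_path = os.path.join(tmp, "translog")
        log = TransactionLog(log_path, sync_every_request=True)
        log.append({"op": "create", "id": "1"})
        log.close()
        print(replay(log_path))
```

Syncing on every request trades throughput for durability: an acknowledged operation survives pulling the plug, whereas the interval policy can lose up to one interval of operations.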

  17. Hard Disk Failures

  18. Hard Drive Failures
    • Complete failure
    • Running out of disk space
    • Data corruption

  19. Complete Disk Failures
    • Automatic shard failover
    • Replicas

  20. Multi data paths
    Disk 1, Disk 2, Disk 3, Disk 4
    shard 1
    shard 2
    shard 3
    shard 4

  23. Complete Disk Failures (WIP)
    • Current multi-path setup stripes shard data across multiple disks
    • Disk loss impacts all shards on node
    • Reduce failure impact by using one disk per shard (v2.0, #9498)
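
The difference between striping shard data across disks and keeping each shard on one disk can be sketched with a toy placement model. The function names and the round-robin placement rule are hypothetical, chosen only to show how many shards a single disk loss touches in each layout.

```python
def striped_layout(shard_files, disk_count):
    """Spread each shard's files round-robin over all disks (striping)."""
    layout = {}
    for shard, files in shard_files.items():
        for file_index, name in enumerate(files):
            layout[(shard, name)] = file_index % disk_count
    return layout


def shard_per_disk_layout(shard_files, disk_count):
    """Keep all files of a shard together on a single disk."""
    layout = {}
    for shard_index, (shard, files) in enumerate(shard_files.items()):
        for name in files:
            layout[(shard, name)] = shard_index % disk_count
    return layout


def shards_hit_by_disk_loss(layout, failed_disk):
    """Return the shards that lose at least one file when a disk dies."""
    return {shard for (shard, _), disk in layout.items() if disk == failed_disk}


if __name__ == "__main__":
    # Four shards with eight made-up segment files each, on four disks.
    files = {f"shard {n}": [f"segment_{i}" for i in range(8)] for n in range(1, 5)}
    print(sorted(shards_hit_by_disk_loss(striped_layout(files, 4), failed_disk=0)))
    print(sorted(shards_hit_by_disk_loss(shard_per_disk_layout(files, 4), failed_disk=0)))
```

In the striped layout every shard has files on the failed disk, while in the one-disk-per-shard layout only the shard living on that disk is affected.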

  24. Running out of Disk Space
    • Can lead to truncated files and thus corruption
    • Easy to anticipate by monitoring
    • Disk-space aware allocation decider
    – added in 0.90.4
    – enabled by default since 1.3.5
    • Check-pointed transaction log (v2.0, #11143)
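
A disk-space aware decision of the kind listed above can be sketched as a pair of thresholds on disk usage. This is a simplified illustration, not the Elasticsearch allocation decider: the function names, the returned labels, and the 85% / 90% thresholds are assumptions for the example.

```python
import shutil


def disk_decision(used_bytes, total_bytes, low_watermark=0.85, high_watermark=0.90):
    """Classify a disk by how full it is (thresholds are illustrative)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    used_fraction = used_bytes / total_bytes
    if used_fraction >= high_watermark:
        return "relocate_away"   # too full: try to move shards elsewhere
    if used_fraction >= low_watermark:
        return "no_new_shards"   # getting full: stop placing new shards here
    return "allow"


def decision_for_path(data_path):
    """Apply the same rule to the disk backing a real directory."""
    usage = shutil.disk_usage(data_path)
    return disk_decision(usage.used, usage.total)


if __name__ == "__main__":
    print(decision_for_path("."))
```

Acting before the disk is completely full is the point: it avoids the truncated files that a failed write on a full disk can leave behind.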

  25. Data Corruption - the bit in the haystack
    file content (10 GB)

  26. Data Corruption - Checksums
    file content (10 GB)
    footer + checksum

  27. Data Corruption - Checksums
    • Elasticsearch automatically checks checksums of
    – Small files on index open (since v1.3.0)
    – All files during replication, relocation, snapshot, and restore (since v1.3.3 - v1.4.0)
    – Transaction log (since v1.4.0)
    – Use checksum to identify entire segments to reduce chance of hash collisions (since v1.4.0)
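
The "file content + footer with checksum" layout from the slides can be sketched as below. This is a minimal illustration and not Lucene's actual file format: the 4-byte big-endian CRC32 footer and the function names are assumptions for the example.

```python
import os
import struct
import tempfile
import zlib

# Assumed footer for this sketch: a single unsigned 32-bit big-endian CRC32.
FOOTER = struct.Struct(">I")


def write_with_footer(path, content: bytes) -> None:
    """Write the content followed by a checksum footer."""
    with open(path, "wb") as out:
        out.write(content)
        out.write(FOOTER.pack(zlib.crc32(content)))


def verify_footer(path, chunk_size=1 << 20) -> bool:
    """Recompute the checksum in chunks and compare it with the footer."""
    size = os.path.getsize(path)
    if size < FOOTER.size:
        return False  # too short to even hold a footer: truncated
    remaining = size - FOOTER.size
    crc = 0
    with open(path, "rb") as source:
        while remaining > 0:
            chunk = source.read(min(chunk_size, remaining))
            if not chunk:
                return False
            crc = zlib.crc32(chunk, crc)  # running CRC over the body
            remaining -= len(chunk)
        (expected,) = FOOTER.unpack(source.read(FOOTER.size))
    return crc == expected


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        segment_path = os.path.join(tmp, "segment.bin")
        write_with_footer(segment_path, b"some segment bytes")
        print(verify_footer(segment_path))
```

Reading in chunks matters for the 10 GB case on the slide: the whole file never has to fit in memory, but every byte still has to be read, which is why full verification is reserved for moments such as replication or restore.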

  28. Data Corruption (WIP)
    • Checksums of Metadata files (coming in v1.5.0, #8010)
    • Support validation of checksum on all files when node starts (v2.0.0, #9183)
    • Make validation during merge operation more efficient (v2.0.0, LUCENE-5894)
    • Add per-segment/per-commit ids (v2.0.0, LUCENE-5895)
    • Prevent use of known-bad Java versions (coming in v1.5.0, #7580)

  29. Cluster Issues

  30. Why do nodes leave the cluster?
    • Complete node failure
    • Unresponsive nodes
    • Network failures

  31. Unresponsive Nodes

  32. Biggest Memory User - Field Data (v1.x)
    • sorting
    • aggregations
    • doc["foo"] in scripts
    • Parent-child id cache
    – has_child/has_parent queries

  33. Circuit Breakers
    • Estimate size of the field data for each query and fail the query if it tries to load too much data
    – field data (since v1.0.0)
    – parent-child (since v1.1.0)
    – some aggregation structures (since v1.4.0)
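
The circuit breaker idea (estimate first, fail the request instead of the node) can be sketched as a small memory budget. The class and method names here are hypothetical and the byte counts are arbitrary; the real breakers estimate sizes in ways this sketch does not attempt.

```python
class CircuitBreakingError(Exception):
    """Raised when a request would push memory use past the limit."""


class CircuitBreaker:
    """Hypothetical memory budget shared by all in-flight requests."""

    def __init__(self, limit_bytes: int):
        self.limit_bytes = limit_bytes
        self.used_bytes = 0

    def reserve(self, estimated_bytes: int, label: str) -> None:
        # Check the estimate before loading anything, so the failure is a
        # rejected query rather than an out-of-memory node.
        if self.used_bytes + estimated_bytes > self.limit_bytes:
            raise CircuitBreakingError(
                f"[{label}] would use {self.used_bytes + estimated_bytes} bytes, "
                f"limit is {self.limit_bytes}"
            )
        self.used_bytes += estimated_bytes

    def release(self, reserved_bytes: int) -> None:
        self.used_bytes = max(0, self.used_bytes - reserved_bytes)


if __name__ == "__main__":
    breaker = CircuitBreaker(limit_bytes=100)
    breaker.reserve(60, "field data for sort")
    try:
        breaker.reserve(50, "field data for aggregation")
    except CircuitBreakingError as error:
        print("rejected:", error)
```

A rejected reservation leaves the budget untouched, so later, smaller requests can still succeed.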

  34. Doc values
    • On-disk low memory alternative to field data
    • Significant performance improvements in v1.4.0
    • Enabled by default for all numeric and non-analyzed fields (#10209, v2.0)

  35. OOM Resiliency (WIP)
    • Add hard limit on from/size (coming in v1.5.0, #9311)
    • Add hit size circuit breaker (coming in v2.0, #9310)
    • Prevent combinatorial explosion in aggregations (TBD, #8081, #9825)
    • Smarter filter query caching (LUCENE-6303, v2.0, #10897)

  36. Dedicated Master Nodes
    Data nodes:
    node.master: false
    node.data: true
    Master nodes (master 1, master 2, master 3):
    node.master: true
    node.data: false

  37. Network Issues

  38. Partitions & partial knowledge
    [Diagram: nodes, one labeled M]

  39. Remember to set minimum_master_nodes
    • Set discovery.zen.minimum_master_nodes: (N/2 + 1) in elasticsearch.yml, where N is the number of master-eligible nodes
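
The N/2 + 1 rule can be checked with a few lines of arithmetic. This is a sketch with hypothetical helper names; it takes N to be the number of master-eligible nodes and uses integer division.

```python
def minimum_master_nodes(master_eligible_nodes: int) -> int:
    """Quorum size: a strict majority of the master-eligible nodes."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


def can_elect_master(visible_nodes: int, master_eligible_nodes: int) -> bool:
    """A partition may elect a master only if it sees a quorum."""
    return visible_nodes >= minimum_master_nodes(master_eligible_nodes)


if __name__ == "__main__":
    total = 5
    print("quorum for", total, "nodes:", minimum_master_nodes(total))
    # A 3 / 2 network split: only one side can hold a quorum.
    print("side with 3 nodes:", can_elect_master(3, total))
    print("side with 2 nodes:", can_elect_master(2, total))
```

Because two disjoint groups cannot both contain a strict majority, at most one side of a partition can elect a master, which is what prevents a split brain.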

  40. Partitions & partial knowledge
    [Diagram: nodes, one labeled M]

  41. Improved Zen Discovery
    • Significant improvements in v1.4.0
    – gossip on master loss
    – bigger ping outreach
    – resiliency to stale gossip
    – better two-masters resolution
    – faster failure detection

  42. Improving Zen Discovery (WIP)
    • Prevent setting incorrect minimum_master_nodes (coming in v1.5.0, #8321, #9051)
    • Refuse revived master (coming in v1.5.0, #9632)
    • Diff based ClusterState publishing (v2.0, #10212)
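
The idea behind diff-based state publishing can be sketched with plain dictionaries: send only what changed instead of the full state. This is a deliberately simplified illustration with hypothetical function names and a flat key/value state, not the ClusterState implementation.

```python
def diff_state(old: dict, new: dict) -> dict:
    """Describe how to get from `old` to `new` without resending everything."""
    changed = {key: value for key, value in new.items()
               if key not in old or old[key] != value}
    removed = [key for key in old if key not in new]
    return {"changed": changed, "removed": removed}


def apply_diff(old: dict, diff: dict) -> dict:
    """Rebuild the new state on a receiver that already holds `old`."""
    state = {key: value for key, value in old.items()
             if key not in diff["removed"]}
    state.update(diff["changed"])
    return state


if __name__ == "__main__":
    # Made-up state: index name -> number of shards.
    previous = {"logs-1": 5, "logs-2": 5, "users": 1}
    current = {"logs-1": 5, "logs-2": 10, "orders": 3}
    delta = diff_state(previous, current)
    print(delta)
    print(apply_diff(previous, delta) == current)
```

A receiver that does not hold the exact previous state cannot apply a diff, so a scheme like this needs a fallback to sending the full state.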

  43. External pressure

  44. External pressure
    - bounded queues and thread pools (<0.20)
    - time out long running queries (WIP, PR #9156)
    - index throttling (v1.2.0, #6066)
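
Bounded queues protect a node by rejecting work it cannot absorb instead of buffering it without limit. A minimal sketch, assuming a hypothetical `BoundedWorkQueue` wrapper around Python's standard `queue.Queue`:

```python
import queue


class BoundedWorkQueue:
    """Hypothetical request queue that rejects work once it is full."""

    def __init__(self, capacity: int):
        if capacity < 1:
            # queue.Queue treats maxsize <= 0 as "unbounded", so forbid it.
            raise ValueError("capacity must be at least 1")
        self._queue = queue.Queue(maxsize=capacity)
        self.rejected = 0

    def submit(self, task) -> bool:
        """Enqueue a task, or reject it immediately if the queue is full."""
        try:
            self._queue.put_nowait(task)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def take(self):
        """Remove and return the oldest queued task."""
        return self._queue.get_nowait()


if __name__ == "__main__":
    work = BoundedWorkQueue(capacity=2)
    for request in ("search-1", "search-2", "search-3"):
        print(request, "accepted" if work.submit(request) else "rejected")
```

Rejecting early pushes the pressure back to the client, which can retry or slow down, rather than letting queued requests pile up in memory.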

  45. Known Unknowns

  46. Known Unknowns
    - Simulate disruptions (v1.4.0, #7492)
    - Simulate corruption (v1.3.0, #5924)
    - Reproducible evil
    - Users' info is critical
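
"Reproducible evil" can be read as randomized failure injection that is driven by a seed, so a failing run can be replayed exactly. The sketch below illustrates that general idea only; the disruption names are borrowed from the earlier slides, and the function and node names are hypothetical.

```python
import random

# Disruption kinds taken from the failure slides earlier in the deck.
DISRUPTIONS = ("kill -9", "network disconnect", "long GC", "corrupt file")


def disruption_plan(seed: int, steps: int, nodes):
    """Build a random but repeatable list of (node, disruption) pairs."""
    rng = random.Random(seed)  # private generator: same seed, same plan
    return [(rng.choice(nodes), rng.choice(DISRUPTIONS)) for _ in range(steps)]


if __name__ == "__main__":
    cluster = ["node-1", "node-2", "node-3"]
    for node, disruption in disruption_plan(seed=42, steps=5, nodes=cluster):
        print(node, "<-", disruption)
```

Logging the seed with every test run is what makes the randomness useful: a failure seen once can be rerun with the same seed until it is understood.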

  47. It’s an ongoing effort
    • Check the progress on our resiliency status page
    – http://www.elasticsearch.org/guide/en/elasticsearch/resiliency/current/
    • Or search for issues labeled “resiliency” on GitHub
    – https://github.com/elasticsearch/elasticsearch/

  48. Thank you!
    @bleskes, @imotov

  49. This work is licensed under the Creative Commons Attribution-NoDerivatives 4.0 International License.
    To view a copy of this license, visit: http://creativecommons.org/licenses/by-nd/4.0/
    or send a letter to: Creative Commons, PO Box 1866, Mountain View, CA 94042, USA