Resiliency in Elasticsearch and Lucene

Elastic Co
March 19, 2015

As Elasticsearch clusters grow larger, their resilience to hardware and network failure becomes increasingly important. At Elasticsearch, we invest a lot in making Elasticsearch and Apache Lucene both detect and cope with increasingly complex failures.

In this talk we will go through recent highlights and some of the future directions we plan to follow. We will touch on all aspects of Elasticsearch, ranging from the lowest level of a single file, through the network connection of a single node, and all the way up to distributed failures at the cluster level. Even though the talk is about possible failures and various coping strategies, you will also learn about the inner workings of Elasticsearch: an interesting peek under the hood.

Transcript

  1. Resiliency in Elasticsearch and Lucene
    Boaz Leskes and Igor Motov
  2. Resiliency (re·sil·ience), noun
    1. the power or ability to return to the original form, position, etc., after being bent, compressed, or stretched; elasticity.
    2. ability to recover readily from illness, depression, adversity, or the like; buoyancy.
  3. Failures happen
  4. Why is it so important now?
  5. Average Elasticsearch cluster growth
    [Figure: large cluster circa 2011]
  6. Average Elasticsearch cluster growth
    [Figure: large cluster circa 2011; large cluster circa 2015]
  7. Failure rates and number of nodes
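
The slide carries only its title, so the following is one possible reading rather than the speakers' own numbers. Assuming node failures are independent and each node fails with the same probability in a given period (both assumptions, as is the 1% figure), the chance that at least one node fails grows quickly with cluster size:

```python
def p_any_failure(nodes: int, p_node: float) -> float:
    """Probability that at least one of `nodes` independent nodes fails."""
    return 1.0 - (1.0 - p_node) ** nodes


# Hypothetical per-node failure probability of 1% per period.
for n in (3, 30, 300):
    print(n, round(p_any_failure(n, 0.01), 3))
```

The function name and parameters are illustrative only.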
  8. [Diagram; axes: Slow/Fast and Node/Cluster] kill -9, dead disk, corruption, long GC, master gone, network disconnects, timeouts
  9. [Diagram; axes: Slow/Fast and Node/Cluster] kill -9, dead disk, corruption, long GC, master gone, network disconnects, timeouts, SOFTWARE BUGS
  10. [Diagram; axes: Slow/Fast and Node/Cluster] kill -9, dead disk, corruption, long GC, master gone, network disconnects, timeouts
  11. Work in progress
  12. Pulling the plug
  13. Replicas and Transaction Log
    [Diagram: Primary Shard (Lucene Index, Transaction Log, Lucene Buffer); Replica Shard (Lucene Index, Transaction Log, Lucene Buffer)]
  17. Transaction Log
    • Transaction Log
      – stores every operation (create/update/delete)
      – fsync-ed every 5 sec (configurable)
    • Lucene Segments
      – fsync-ed when transaction log is full (every 30 min or 200mb)
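
The interval-based fsync described on this slide can be sketched as a toy append-only log. This is not Elasticsearch's translog code: the class name `TinyTranslog`, the JSON-lines format, and the `replay` helper are all assumptions made for illustration. It shows why operations written since the last fsync are the ones at risk when the plug is pulled.

```python
import json
import os
import tempfile
import time


class TinyTranslog:
    """Toy append-only operation log with interval-based fsync (hypothetical)."""

    def __init__(self, path: str, sync_interval: float = 5.0):
        self._file = open(path, "a", encoding="utf-8")
        self._sync_interval = sync_interval
        self._last_sync = time.monotonic()
        self.unsynced_ops = 0  # operations that could be lost on power failure

    def append(self, op: str, doc_id: str, source=None) -> None:
        # Every create/update/delete is written to the log.
        record = {"op": op, "id": doc_id, "source": source}
        self._file.write(json.dumps(record) + "\n")
        self.unsynced_ops += 1
        if time.monotonic() - self._last_sync >= self._sync_interval:
            self.sync()

    def sync(self) -> None:
        # flush() hands the data to the OS; fsync() asks the OS to persist it.
        self._file.flush()
        os.fsync(self._file.fileno())
        self._last_sync = time.monotonic()
        self.unsynced_ops = 0

    def close(self) -> None:
        self.sync()
        self._file.close()


def replay(path: str):
    """Read the logged operations back, for example after a restart."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "translog.jsonl")
    log = TinyTranslog(demo_path, sync_interval=5.0)
    log.append("create", "1", {"title": "hello"})
    log.close()
    print(replay(demo_path))
```

A shorter sync interval narrows the window of operations that can be lost, at the cost of more fsync calls.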
  18. Hard Disk Failures
  19. Hard Drive Failures
    • Complete failure
    • Running out of disk space
    • Data corruption
  20. Complete Disk Failures
    • Automatic shard failover
    • Replicas
  21. Multi data paths
    [Diagram: Disk 1, Disk 2, Disk 3, Disk 4; shard 1, shard 2, shard 3, shard 4]
  24. Complete Disk Failures (WIP)
    • Current multi-path setup stripes shard data across multiple disks
    • Disk loss impacts all shards on node
    • Reduce failure impact by using one disk per shard (v2.0, #9498)
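
The difference between the two layouts can be illustrated with a small model. The disk and shard names and the `shards_hit_by_disk_loss` helper are hypothetical; the sketch only counts which shards have files on a failed disk under each layout.

```python
from typing import Dict, Set


def shards_hit_by_disk_loss(layout: Dict[str, Set[str]], failed_disk: str) -> Set[str]:
    """Return the shards that keep at least one file on the failed disk."""
    return {shard for shard, disks in layout.items() if failed_disk in disks}


disks = ["disk1", "disk2", "disk3", "disk4"]
shards = ["shard1", "shard2", "shard3", "shard4"]

# Striped layout: every shard keeps files on every disk.
striped = {shard: set(disks) for shard in shards}

# One disk per shard: each shard lives on exactly one disk.
one_per_shard = {shard: {disk} for shard, disk in zip(shards, disks)}

print(sorted(shards_hit_by_disk_loss(striped, "disk2")))
print(sorted(shards_hit_by_disk_loss(one_per_shard, "disk2")))
```

In this model, losing one disk touches every shard in the striped layout but only the single shard that lived on that disk in the one-disk-per-shard layout.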
  25. Multi data paths
    [Diagram: Disk 1, Disk 2, Disk 3, Disk 4; shard 1, shard 2, shard 3, shard 4]
  26. Running out of Disk Space
    • Can lead to truncated files and thus corruption
    • Easy to anticipate by monitoring
    • Disk-space aware allocation decider
      – added in 0.90.4
      – enabled by default since 1.3.5
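
A disk-space aware decision can be sketched roughly as follows. This is not the Elasticsearch allocation decider: the function names and the 85% threshold are assumptions chosen for illustration. The idea is simply to refuse a shard if placing it would push disk usage past a configured fraction.

```python
import shutil


def can_allocate(total: int, used: int, shard_bytes: int,
                 max_used_fraction: float = 0.85) -> bool:
    """Allow the shard only if the disk stays at or below the threshold afterwards."""
    return (used + shard_bytes) / total <= max_used_fraction


def can_allocate_on(path: str, shard_bytes: int,
                    max_used_fraction: float = 0.85) -> bool:
    """Same check, using the live usage of the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return can_allocate(usage.total, usage.used, shard_bytes, max_used_fraction)


print(can_allocate(total=100, used=50, shard_bytes=30))
print(can_allocate(total=100, used=80, shard_bytes=10))
```

Checking before allocation is what makes the problem "easy to anticipate": the shard is placed elsewhere instead of filling the disk and truncating files.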
  27. Data Corruption - the bit in the haystack
    [Diagram: file content (10GB)]
  28. Data Corruption - Checksums
    [Diagram: file content (10gb); footer + checksum]
  29. Data Corruption
    • Elasticsearch automatically checks checksums of
      – Small files on index open (since v1.3.0)
      – All files during replication, relocation, snapshot, and restore (since v1.3.3 - v1.4.0)
      – Transaction log (since v1.4.0)
      – Use checksum to identify entire segments to reduce chance of hash collisions (since v1.4.0)
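
The "footer + checksum" idea can be sketched with a CRC32 appended to the file content. The 4-byte big-endian layout and the function names here are illustrative assumptions, not Lucene's actual footer format.

```python
import struct
import zlib

FOOTER = struct.Struct(">I")  # 4-byte big-endian CRC32 (illustrative layout)


def with_footer(content: bytes) -> bytes:
    """Append a checksum footer computed over the content."""
    return content + FOOTER.pack(zlib.crc32(content))


def verify(blob: bytes) -> bool:
    """Recompute the checksum and compare it with the stored footer."""
    if len(blob) < FOOTER.size:
        return False
    content, footer = blob[:-FOOTER.size], blob[-FOOTER.size:]
    (expected,) = FOOTER.unpack(footer)
    return zlib.crc32(content) == expected


blob = with_footer(b"segment data " * 100)
print(verify(blob))

# Flip a single bit somewhere in the content: the bit in the haystack.
damaged = bytearray(blob)
damaged[10] ^= 0x01
print(verify(bytes(damaged)))
```

Verification needs a full read of the file, which is one reason such checks are tied to operations that already read everything, such as replication or snapshot.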
  30. Data Corruption (WIP)
    • Checksums of Metadata files (coming in v1.5.0)
    • Support validation of checksum on all files when node starts (v2.0.0, #9183)
    • Make validation during merge operation more efficient (v2.0.0, LUCENE-5894)
    • Add per-segment/per-commit ids (v2.0.0, LUCENE-5895)
    • Prevent use of known-bad java versions (v1.5.0, #7580)
  31. Cluster Issues
  32. Why do nodes leave the cluster?
    • Complete node failure
    • Unresponsive nodes
    • Network Failures
  33. Unresponsive Nodes
  34. Biggest Memory User - Field Data
    • sorting
    • aggregations
    • doc["foo"] in scripts
    • Parent-child id cache
      – has_child/has_parent queries
  35. Circuit Breakers
    • Estimate size of the field data for each query and fail the query if it tries to load too much data
      – field data (since v1.0.0)
      – parent-child (since v1.1.0)
      – some aggregation structures (since v1.4.0)
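
The estimate-then-fail behaviour can be sketched as a small accounting object. The class and exception names are hypothetical and the byte figures are made up; the point is that the request is rejected before any memory is actually loaded.

```python
class CircuitBreakingError(Exception):
    """Raised when a request would exceed the configured memory budget."""


class MemoryCircuitBreaker:
    """Toy circuit breaker that tracks estimated memory use (hypothetical)."""

    def __init__(self, limit_bytes: int):
        self.limit_bytes = limit_bytes
        self.used_bytes = 0

    def reserve(self, estimated_bytes: int, label: str) -> None:
        # Check the estimate first, so the query fails instead of the node.
        wanted = self.used_bytes + estimated_bytes
        if wanted > self.limit_bytes:
            raise CircuitBreakingError(
                f"[{label}] would use {wanted} bytes, limit is {self.limit_bytes}"
            )
        self.used_bytes = wanted

    def release(self, nbytes: int) -> None:
        self.used_bytes = max(0, self.used_bytes - nbytes)


breaker = MemoryCircuitBreaker(limit_bytes=100)
breaker.reserve(60, "sort on field a")
try:
    breaker.reserve(50, "aggregation on field b")
except CircuitBreakingError as err:
    print("query rejected:", err)
```

A rejected query leaves the accounted usage unchanged, so other requests that fit within the budget can still proceed.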
  36. Doc values
    • On-disk low memory alternative to field data
    • Significant performance improvements in v1.4.0
  37. OOM Resiliency (WIP)
    • Add hard limit on from/size (v1.5.0, #9311)
    • Add hit size circuit breaker (v1.5.0, #9310)
    • Prevent combinatorial explosion in aggregations (v2.0.0, #8081)
    • Smarter filter caching (LUCENE-6303, ES 2.0)
  38. Dedicated Master Nodes
    [Diagram: data nodes (node.master: false, node.data: true); master 1, master 2, master 3 (node.master: true, node.data: false)]
  39. Network Issues
  40. Partitions & partial knowledge
    [Diagram: node M, node, node, node, node]
  41. Remember to set minimum_master_nodes
    • Set discovery.zen.minimum_master_nodes: (N/2 + 1) in elasticsearch.yml
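
The N/2 + 1 rule is a majority quorum, with N read here as the number of master-eligible nodes and the division taken as integer division (both are assumptions about the slide's shorthand). A short sketch with hypothetical function names shows why at most one side of a partition can elect a master:

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Majority quorum: N/2 + 1, using integer division."""
    return master_eligible // 2 + 1


def can_elect_master(visible_master_eligible: int, minimum: int) -> bool:
    """A partition may elect a master only if it sees a quorum."""
    return visible_master_eligible >= minimum


quorum = minimum_master_nodes(5)

# A 5-node cluster split into partitions of 3 and 2 master-eligible nodes.
for side in (3, 2):
    print(side, can_elect_master(side, quorum))
```

Because two disjoint partitions cannot both hold a majority, the smaller side stays without a master instead of forming a second cluster.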
  42. Partitions & partial knowledge
    [Diagram: node M, node, node, node, node]
  43. Improved Zen Discovery
    • Significant improvements in v1.4.0
      – Gossip on master loss
      – bigger ping outreach
      – resiliency to stale gossip
      – better two-masters resolution
      – faster failure detection
  44. Improving Zen Discovery (WIP)
    • Prevent setting incorrect minimum_master_nodes (v1.5.0, #8321, #9051)
    • Refuse revived master (v1.5.0)
  45. External pressure
  46. External pressure
    - bounded queues and thread pools (<0.20)
    - time out long running queries (WIP, PR #9156)
    - index throttling (v1.2.0, #6066)
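
The bounded-queue idea can be sketched with the standard library. The `BoundedWorkQueue` class is hypothetical and much simpler than a real thread pool: it only shows that once the queue is full, new work is rejected immediately instead of piling up in memory.

```python
import queue


class BoundedWorkQueue:
    """Toy work queue that rejects tasks once it is full (hypothetical)."""

    def __init__(self, capacity: int):
        self._queue = queue.Queue(maxsize=capacity)
        self.rejected = 0

    def submit(self, task) -> bool:
        try:
            # put_nowait raises queue.Full instead of blocking the caller.
            self._queue.put_nowait(task)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def take(self):
        return self._queue.get_nowait()


work = BoundedWorkQueue(capacity=2)
for name in ("search-1", "search-2", "search-3"):
    print(name, "accepted" if work.submit(name) else "rejected")
```

Rejecting early pushes the pressure back to the client, which can retry later, rather than letting the node run out of memory.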
  47. Known Unknowns
  48. Known Unknowns
    - Simulate disruptions (v1.4.0, #7492)
    - Simulate corruption (v1.3.0, #5924)
    - Reproducible evil
    - User info is critical
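
One way to read "reproducible evil" is that randomly chosen disruptions are driven by a seed, so a failing run can be replayed exactly. The sketch below assumes that reading; the disruption names and the `disruption_schedule` function are hypothetical and unrelated to the actual Elasticsearch test framework.

```python
import random
from typing import List, Tuple

# Hypothetical disruption kinds, loosely based on failures named in the talk.
DISRUPTIONS = ("network_partition", "long_gc", "node_restart", "slow_disk")


def disruption_schedule(seed: int, steps: int, nodes: List[str]) -> List[Tuple[str, str]]:
    """Build a pseudo-random list of (disruption, node) pairs from a seed."""
    rng = random.Random(seed)  # private generator: same seed, same schedule
    return [(rng.choice(DISRUPTIONS), rng.choice(nodes)) for _ in range(steps)]


nodes = ["node1", "node2", "node3"]
first = disruption_schedule(seed=42, steps=5, nodes=nodes)
second = disruption_schedule(seed=42, steps=5, nodes=nodes)
print(first == second)
```

Logging the seed alongside a test failure is enough to regenerate the same sequence of disruptions later.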
  49. It's an ongoing effort
    • Check the progress on our resiliency status page
      – http://www.elasticsearch.org/guide/en/elasticsearch/resiliency/current/
    • Or search for issues labeled "resiliency" on github
      – https://github.com/elasticsearch/elasticsearch/
  50. Thank you!
    boaz@elastic.co, igor@elastic.co
    @bleskes, @imotov
  51. This work is licensed under the Creative Commons Attribution-NoDerivatives 4.0 International License (CC-BY-ND 4.0). To view a copy of this license, visit: http://creativecommons.org/licenses/by-nd/4.0/ or send a letter to: Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.