Every Shard Deserves a Home - Shard Allocation in Elasticsearch

Boaz Leskes
January 13, 2016

This talk, given at the Elastic meetup in Tel Aviv, follows the journey of a shard in Elasticsearch: the mechanics and the reasoning behind how ES decides which nodes to allocate shards to, and how those decisions are executed.

We will start with the assignment of new shards, which must conform to the current Allocation Filtering settings, disk space usage and other factors. Once shards are started, they may need to be moved around. This can be due to a cluster topology change, growing data volumes, or an explicit instruction through the index management APIs (see the example below).
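
A new index is what produces unassigned shards for the deciders to place. A minimal sketch of that trigger, assuming an Elasticsearch 2.x cluster (index_name and the values are placeholders):

    PUT index_name
    {
      "settings": {
        "number_of_shards": 2,
        "number_of_replicas": 1
      }
    }

Each of the two primaries and their replicas then has to pass every allocation decider before it is assigned to a node.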

We will cover these triggers and how ES executes on those decisions, potentially moving tens of gigabytes of data from one node to another without dropping a single search or indexing request. We will finish with the mechanics of full and rolling cluster restarts, and recent improvements such as synced flush, delayed assignment on node departure, and cancelling ongoing relocations when a perfect match is found (see the sketch below).
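
Those improvements are what make the standard rolling-restart recipe cheap. A minimal sketch of that recipe, assuming an Elasticsearch 2.x cluster (the exact timing of each step is up to you):

    # stop shard reallocation before taking a node down
    PUT _cluster/settings
    { "transient": { "cluster.routing.allocation.enable": "none" } }

    # give idle shards a sync id marker so recovery can reuse local files
    POST _flush/synced

    # restart the node, wait for it to rejoin, then re-enable allocation
    PUT _cluster/settings
    { "transient": { "cluster.routing.allocation.enable": "all" } }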

Transcript

  1. Every Shard Deserves a Home - Shard Allocation in Elasticsearch
     @bleskes / Boaz Leskes
  2. A Cluster
     [diagram: a four-node cluster (node1-node4) holding primary and replica copies of shards a0, a1, b0, b1 and b2]
  3. Index Creation

  4. Index Creation
     PUT index_name/type/1
     [diagram: the new index adds shard c0 to the cluster state as unassigned]
  5. Index creation - Allocation Deciders - Filtering
     index.routing.allocation.require.type: hot
     [diagram: c0 may only go to the type:hot nodes (node1, node2); the type:cold nodes (node3, node4) are filtered out]
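     A minimal sketch of such a filter, assuming the nodes carry a custom attribute (e.g. node.type: hot in elasticsearch.yml); index_name is a placeholder:

       PUT index_name
       {
         "settings": {
           "index.routing.allocation.require.type": "hot"
         }
       }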
  6. Index creation - Allocation Deciders - Disk Threshold
     cluster.routing.allocation.disk.watermark.high: 90%
     [diagram: one node is at 91% disk usage, above the watermark, so c0 stays unassigned rather than landing there]
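     Both watermarks are dynamic cluster settings; the values below are the defaults the decider works with:

       PUT _cluster/settings
       {
         "transient": {
           "cluster.routing.allocation.disk.watermark.low": "85%",
           "cluster.routing.allocation.disk.watermark.high": "90%"
         }
       }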
  7. Index creation - Allocation Deciders - Throttling
     [diagram: the chosen node already has recoveries in flight, so c0's assignment is throttled and it waits unassigned]
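     Throttling is tunable as well. A minimal sketch with illustrative values (node_concurrent_recoveries caps simultaneous recoveries per node, max_bytes_per_sec caps their bandwidth):

       PUT _cluster/settings
       {
         "transient": {
           "cluster.routing.allocation.node_concurrent_recoveries": 2,
           "indices.recovery.max_bytes_per_sec": "40mb"
         }
       }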
  8. Index creation - Primary Assigned (initializing)
     [diagram: the c0 primary is assigned to a node and begins initializing]
  9. Index creation - Shard Initialization
     • detect assignment (from the cluster state published by the master)
     • initialize an empty shard
     • notify master when done ("shard ready")
     • mark shard as started
     • activate the shard (via a new cluster state)
  10. Index creation - Primary Assigned (started)
      [diagram: the c0 primary is now started; its replica is still unassigned]
  11. Index creation - Replica Initialization
      • detect assignment
      • start recovery from the primary
      • notify master when done ("shard ready")
      • mark replica as started
      • activate the replica (via a new cluster state)
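      The whole lifecycle is observable from the outside: the state column of the cat shards API moves through INITIALIZING to STARTED as the steps above complete (index_name is a placeholder):

        GET _cat/shards/index_name?v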
  12. Time to move a shard
  13. Time to move a shard - Explicit User Command
      POST index_name/_settings
      index.routing.allocation.require.type: cold
      [diagram: the index must now live on the type:cold nodes, so its shards have to leave the hot ones]
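      A single shard can also be moved by hand with the cluster reroute API. A minimal sketch (index, shard and node names are placeholders):

        POST _cluster/reroute
        {
          "commands": [
            { "move": { "index": "index_name", "shard": 0,
                        "from_node": "node1", "to_node": "node4" } }
          ]
        }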
  14. Time to move a shard - Disk Threshold Allocation Decider
      cluster.routing.allocation.disk.watermark.low: 85%
      [diagram: one node is at 86% disk usage, above the low watermark]
  15. Time to move a shard - Nodes Added
      [diagram: node5 joins the cluster empty; rebalancing relocates a0 onto it]
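      The resulting spread of shards and disk usage per node can be checked with the cat allocation API:

        GET _cat/allocation?v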
  16. Shard Data Storage Intermezzo

  17. Data Storage - Lucene Segments
      [diagram: documents are indexed into an in-memory buffer; a Lucene flush writes the buffer out as an immutable segment]
  18. Data Storage - Lucene Segments
      [diagram: repeated flushes make segments accumulate over time]
  19. Data Storage - Transaction Log
      [diagram: every document operation is also appended to the transaction log; an Elasticsearch flush performs a Lucene commit and then trims the translog]
  20. Data Storage - Lucene Segments + Transaction Log
      • Taking the current set of segments gives a point-in-time snapshot of the data
      • Not flushing the translog keeps a complete operation history
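      Both halves are visible per shard: the segments API lists the current Lucene segments backing each shard copy (index_name is a placeholder):

        GET index_name/_segments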
  21. Back to Relocation

  22. Relocation - Always Copy from Primary
      [diagram: a0 moves to node5 by copying from its primary, never from a replica]
  23. Relocation - Recover from Primary
      On the target node:
      • detect assignment
      • send a start recovery request to the primary
      On the primary:
      • validate the assignment against its cluster state
      • prevent translog deletion
      • snapshot lucene (the current set of segments)
      • send the segments to the target
      • replay the translog
      • finish recovery; the target notifies the master ("shard ready")
      • the master activates the shard and removes the old copy via a new cluster state
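      Recoveries in flight, including the file-copy and translog-replay stages described above, can be watched with the cat recovery API:

        GET _cat/recovery?v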
  24. Relocation - balance some more
      [diagram: b2 relocates to node5 as well, evening out shard counts]
  25. Full Cluster Restart

  26. Full Cluster Restart
      [diagram: the four-node cluster as it was before the restart]
  27. Full Cluster Restart - Master Fetches Store Information
      [diagram: after the restart the shards are unassigned; the master (node1) asks every node which shard copies it has on disk]
  28. Full Cluster Restart - Allocated Existing Copy as Primary
      [diagram: an existing on-disk copy of a0 is picked and assigned as the primary]
  29. Full Cluster Restart - Allocated Existing Copy as Primary
      [diagram: the a0 primary initializes from its local data and is started]
  30. Full Cluster Restart - Replica Allocation - Fetch Store
      [diagram: with the primary started, the master fetches store information again to place the a0 replica]
  31. Full Cluster Restart - Replica Allocation - Fetch Store
      [diagram: the remaining on-disk copy of a0 "needs a home"]
  32. Full Cluster Restart - Replica Allocation
      [diagram: the replica is assigned to the node that already holds a matching copy]
  33. Full Cluster Restart - Recover from Primary
      [diagram: the same recovery flow as relocation (start recovery, send segments, replay translog), with one goal: reuse existing data on disk]
  34. Segments Reuse & Synced Flush
      Reuse existing data? [diagram: two copies of the same shard contain the same documents in different segment layouts (e.g. segment 2+3 merged on one copy, segment 3+4 on the other), so files cannot be compared one by one]
  35. Segments Reuse & Synced Flush
      Automatically use inactivity periods to add a sync id marker, guaranteeing doc-level equality
      [diagram: both copies now carry sync_id: 0XYB321]
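      A minimal sketch of setting the marker by hand and then looking for it (the stats path follows the 2.x synced-flush docs; index_name is a placeholder):

        POST index_name/_flush/synced

        GET index_name/_stats/commit?level=shards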
  36. Full Cluster Restart - Recover with a matching sync id
      [diagram: with matching sync ids the segment copy is skipped entirely and only the translog is replayed - reuse existing data!]
  37. Single Node Loss

  38. Single Node Loss
      [diagram: the four-node cluster before the loss]
  39. Single Node Loss
      [diagram: one node drops out of the cluster, taking its shard copies with it]
  40. Single Node Loss - Promote Primaries and Replicas
      needed but potentially expensive
      [diagram: replicas of the lost primaries are promoted and replacement copies are assigned on the surviving nodes]
  41. Single Node Loss - A Grace Period
      index.unassigned.node_left.delayed_timeout: 1m
      [diagram: re-assignment of the lost copies is delayed, giving the node a chance to come back]
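      A minimal sketch of widening that grace period on every index (the value is illustrative):

        PUT _all/_settings
        {
          "settings": {
            "index.unassigned.node_left.delayed_timeout": "5m"
          }
        }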
  42. Single Node Loss - Node Returns, shard re-assigned
      index.unassigned.node_left.delayed_timeout: 1m
      [diagram: the node rejoins within the timeout and its shard copies are simply re-assigned to it]
  43. Single Node Loss - Node Returns After Period Expires (v2.0)
      cancel recoveries if sync-flushed
      [diagram: if the returning node holds a sync-flushed copy, an ongoing replacement recovery is cancelled in favour of reusing it]
  Thank you! elastic.co/guide