Every Shard Deserves a Home - Shard Allocation in Elasticsearch

Boaz Leskes
January 13, 2016

This talk was given at the Elastic meetup in Tel Aviv. It follows the journey of a shard in Elasticsearch, covering the mechanics and the reasoning behind how ES decides to allocate shards to nodes and how those decisions are executed.

We will start with the assignment of new shards, which has to conform to the current Allocation Filtering, disk space usage and other factors. Once shards are started, they may need to be moved around. This can be due to a cluster topology change, growing data volumes, or someone instructing ES to do so through the index management APIs.

We will cover these triggers and how ES executes those decisions, moving potentially tens of gigabytes of data from one node to another without dropping a single search or indexing request. We will finish with the mechanics of full and rolling cluster restarts and recent improvements such as synced flush, delayed assignment on node leave, and cancelling ongoing relocations if a perfect match is found.

Transcript

  1. 2.

    A Cluster

    [Diagram: node1 to node4 holding two copies each of the shards a0, a1, b0, b1 and b2. Legend: Primary, Replica]
  2. 4.

    Index Creation

    PUT index_name/type/1

    [Diagram: the four-node cluster; both copies of the new shard c0 are unassigned]
  3. 5.

    Index creation - Allocation Deciders - Filtering

    PUT index_name/type/1
    index.routing.allocation.require.type: hot

    [Diagram: node1 (type:hot), node2 (type:hot), node3 (type:cold), node4 (type:cold); both copies of c0 are unassigned]
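
A minimal sketch of the filtering idea on this slide, assuming plain exact-match semantics: an index that requires `type: hot` may only be placed on nodes carrying that attribute. The function name `filter_allows` is hypothetical and this is not the actual decider code.

```python
from typing import Dict


def filter_allows(index_require: Dict[str, str], node_attrs: Dict[str, str]) -> bool:
    """Return True when the node carries every attribute value the index requires."""
    return all(node_attrs.get(key) == value for key, value in index_require.items())


# Node attributes as labelled on the slide ("type" is a user-defined attribute).
nodes = {
    "node1": {"type": "hot"},
    "node2": {"type": "hot"},
    "node3": {"type": "cold"},
    "node4": {"type": "cold"},
}

# Stands in for index.routing.allocation.require.type: hot
require = {"type": "hot"}

eligible = sorted(name for name, attrs in nodes.items() if filter_allows(require, attrs))
print(eligible)  # ['node1', 'node2']
```

With this rule, the copies of c0 can only land on node1 and node2.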
  4. 6.

    Index creation - Allocation Deciders - Disk Threshold

    PUT index_name/type/1
    cluster.routing.allocation.disk.watermark.high: 90%

    [Diagram: one node is at 91% disk usage; both copies of c0 are unassigned]
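
The slide pairs a 90% high watermark with a node at 91% disk usage. A simplified sketch of that check follows; the transcript does not say which node is full, so the node names and usage figures are hypothetical, and the real decider takes more factors into account.

```python
def parse_percent(value: str) -> float:
    """Turn a setting such as '90%' into the number 90.0."""
    return float(value.strip().rstrip("%"))


def disk_threshold_allows(used_percent: float, watermark: str) -> bool:
    """Simplified check: refuse nodes whose disk usage exceeds the watermark."""
    return used_percent <= parse_percent(watermark)


HIGH_WATERMARK = "90%"  # cluster.routing.allocation.disk.watermark.high

# Hypothetical usage figures; only the 91% value comes from the slide.
disk_usage = {"node1": 91.0, "node2": 55.0}

candidates = [
    node
    for node, used in sorted(disk_usage.items())
    if disk_threshold_allows(used, HIGH_WATERMARK)
]
print(candidates)  # ['node2']
```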
  5. 7.

    Index creation - Allocation Deciders - Throttling

    PUT index_name/type/1

    [Diagram: the four-node cluster with a "throttle" label; both copies of c0 are unassigned]
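
One way to picture how several deciders combine, as a sketch: each returns yes, no or throttle, a single "no" vetoes the node, and a "throttle" means "not now, try again later". The recovery limit of 2 used below is an assumption for illustration, and the function names are hypothetical.

```python
from enum import Enum
from typing import Iterable


class Decision(Enum):
    YES = "yes"
    THROTTLE = "throttle"
    NO = "no"


def combine(decisions: Iterable[Decision]) -> Decision:
    """NO wins over everything; THROTTLE wins over YES."""
    result = Decision.YES
    for decision in decisions:
        if decision is Decision.NO:
            return Decision.NO
        if decision is Decision.THROTTLE:
            result = Decision.THROTTLE
    return result


def throttling_decider(ongoing_recoveries: int, max_concurrent: int = 2) -> Decision:
    """Hold back new shards while a node is busy with other recoveries."""
    if ongoing_recoveries >= max_concurrent:
        return Decision.THROTTLE
    return Decision.YES


# A node that passes filtering but is already running two recoveries.
print(combine([Decision.YES, throttling_decider(ongoing_recoveries=2)]))  # Decision.THROTTLE
```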
  6. 8.

    Index creation - Primary Assigned (initializing)

    PUT index_name/type/1

    [Diagram: the primary c0 is assigned to a node and initializing; the replica c0 is still unassigned]
  7. 10.

    Index creation - Shard Initialization

    • detect assignment
    • initialize an empty shard
    • notify master when done
    • mark shard as started
    • activate the shard

    [Diagram: the master and node2 (a1, b0, c0) exchange "cluster state" and "shard ready" messages]
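
The message exchange on these slides can be sketched as a tiny simulation. The `Master` and `Node` classes below are hypothetical stand-ins, not Elasticsearch classes; they only mirror the five bullets: the node detects the assignment in the cluster state, initializes an empty shard, notifies the master, the master marks the shard as started, and the node activates it.

```python
from enum import Enum


class ShardState(Enum):
    UNASSIGNED = "unassigned"
    INITIALIZING = "initializing"
    STARTED = "started"


class Master:
    def __init__(self):
        self.cluster_state = {}  # shard id -> (node name, ShardState)

    def assign(self, shard, node):
        self.cluster_state[shard] = (node.name, ShardState.INITIALIZING)
        node.apply_cluster_state(self, dict(self.cluster_state))  # publish

    def shard_ready(self, shard, node):
        # "mark shard as started" and publish the new cluster state
        self.cluster_state[shard] = (node.name, ShardState.STARTED)
        node.apply_cluster_state(self, dict(self.cluster_state))


class Node:
    def __init__(self, name):
        self.name = name
        self.local_shards = {}  # shard id -> local status

    def apply_cluster_state(self, master, state):
        for shard, (owner, shard_state) in state.items():
            if owner != self.name:
                continue
            if shard_state is ShardState.INITIALIZING and shard not in self.local_shards:
                # "detect assignment" + "initialize an empty shard"
                self.local_shards[shard] = "initialized"
                master.shard_ready(shard, self)  # "notify master when done"
            elif shard_state is ShardState.STARTED:
                self.local_shards[shard] = "active"  # "activate the shard"


master = Master()
node2 = Node("node2")
master.assign("c0", node2)
print(node2.local_shards)  # {'c0': 'active'}
```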
  13. 16.

    Index creation - Primary Assigned (started)

    [Diagram: the primary c0 is started; the replica c0 is unassigned and is then assigned to a node]
  16. 20.

    Index creation - Replica Initialization

    • detect assignment
    • start recovery from primary
    • notify master when done
    • mark replica as started
    • activate the replica

    [Diagram: the master, node2 (a1, b0, c0) and node1 (a1, b0, c0) exchange "cluster state" and "shard ready" messages]
  22. 26.

    Time to move a shard

    [Diagram: node1 to node4 holding two copies each of a0, a1, b0, b1 and b2]
  23. 27.

    Time to move a shard - Explicit User Command

    POST index_name/_settings
    index.routing.allocation.require.type: cold

    [Diagram: node1 (type:hot), node2 (type:hot), node3 (type:cold), node4 (type:cold)]
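
A small sketch of what this command implies, assuming exact-match filtering: build the settings body from the slide and work out which nodes may no longer hold the index once it requires `type: cold`. The helper names are hypothetical and nothing is sent to a cluster here.

```python
import json
from typing import Dict, List


def require_filter_settings(attribute: str, value: str) -> Dict[str, str]:
    """Settings body for an index-level 'require' allocation filter."""
    return {f"index.routing.allocation.require.{attribute}": value}


def nodes_to_vacate(node_attrs: Dict[str, Dict[str, str]], attribute: str, value: str) -> List[str]:
    """Nodes that no longer match the filter, so their copies have to move."""
    return sorted(
        name for name, attrs in node_attrs.items() if attrs.get(attribute) != value
    )


# Node attributes as labelled on the slide.
node_attrs = {
    "node1": {"type": "hot"},
    "node2": {"type": "hot"},
    "node3": {"type": "cold"},
    "node4": {"type": "cold"},
}

body = json.dumps(require_filter_settings("type", "cold"))
print(body)  # {"index.routing.allocation.require.type": "cold"}
print(nodes_to_vacate(node_attrs, "type", "cold"))  # ['node1', 'node2']
```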
  24. 28.

    Time to move a shard - Disk Threshold Allocation Decider

    cluster.routing.allocation.disk.watermark.low: 85%

    [Diagram: one node is at 86% disk usage]
  25. 29.

    Time to move a shard - Nodes Added

    [Diagram: node 5 joins the four-node cluster and receives a copy of a0]
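
A toy illustration of why an added node triggers shard movement. It balances on shard counts alone, which is an assumption made for brevity (the real balancer weighs more than that), and the starting placement is hypothetical because the transcript does not preserve which shard sits on which node.

```python
from typing import Dict, List, Tuple


def rebalance(placement: Dict[str, List[str]]) -> List[Tuple[str, str, str]]:
    """Move shards from the fullest to the emptiest node until counts differ by at most one."""
    moves = []
    while True:
        fullest = max(placement, key=lambda node: len(placement[node]))
        emptiest = min(placement, key=lambda node: len(placement[node]))
        if len(placement[fullest]) - len(placement[emptiest]) <= 1:
            return moves
        # Never put two copies of the same shard on one node.
        shard = next((s for s in placement[fullest] if s not in placement[emptiest]), None)
        if shard is None:
            return moves
        placement[fullest].remove(shard)
        placement[emptiest].append(shard)
        moves.append((shard, fullest, emptiest))


# Hypothetical layout: two copies each of a0, a1, b0, b1, b2, plus an empty node5.
placement = {
    "node1": ["a0", "b1", "b2"],
    "node2": ["b2", "a1", "b0"],
    "node3": ["a0", "b1"],
    "node4": ["a1", "b0"],
    "node5": [],
}

print(rebalance(placement))
# [('a0', 'node1', 'node5'), ('b2', 'node2', 'node5')]
```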
  28. 33.

    Data Storage - Lucene Segments

    [Diagram: a timeline. Each "index a doc" operation goes into a buffer; a lucene flush turns the buffer into a segment, and segments accumulate over time]
  30. 35.

    Data Storage - Transaction Log

    [Diagram: a timeline. Each doc op goes into the buffer and into the trans_log; a lucene flush turns the buffer into a segment; an elasticsearch flush performs a lucene commit]
  31. 36.

    Data Storage - Lucene Segments + Transaction Log

    • Taking the current set of segments gives a point-in-time snapshot of the data
    • Not flushing the translog keeps a complete operation history
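
The two bullets above can be illustrated with a toy in-memory model. The `Shard` class and its method names are hypothetical; real segments and translogs live on disk and are far more involved.

```python
class Shard:
    def __init__(self):
        self.buffer = []    # docs indexed but not yet written to a segment
        self.segments = []  # immutable segments, each a tuple of docs
        self.translog = []  # operations recorded since the last Elasticsearch flush

    def index(self, doc):
        self.buffer.append(doc)
        self.translog.append(("index", doc))

    def lucene_flush(self):
        """Turn the in-memory buffer into a new segment."""
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def es_flush(self):
        """Lucene commit, after which the recorded operations can be dropped."""
        self.lucene_flush()
        self.translog = []

    def snapshot(self):
        """Point-in-time view: the current set of segments."""
        return tuple(self.segments)


shard = Shard()
shard.index("d1")
shard.index("d2")
shard.lucene_flush()
shard.index("d3")

snap = shard.snapshot()
print(snap)                 # (('d1', 'd2'),)
print(len(shard.translog))  # 3
```

Because segments are never modified in place, `snap` stays valid while indexing continues, and the unflushed translog still holds every operation needed to catch up from it.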
  32. 38.

    Relocation - Always Copy from Primary

    [Diagram: node 5 joins the cluster and a copy of a0 is created on it from the primary]
  36. 42.

    Relocation - Recover from Primary

    • detect assignment
    • sends start recovery request
    • validate assignment
    • prevents translog deletion
    • snapshots lucene
    • sends segments to target
    • replay translog
    • finishes recovery
    • notifies master
    • master activates shard (and removes it from node5)

    [Diagram: node1 and node 5, each with a tlog and Lucene (lcn) segments; arrows labelled cluster state, start recovery, send segments, replay translog, done, shard ready, cluster state]
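
A rough sketch of why the order of these steps matters, under simplifying assumptions: documents are id/value pairs, a segment is a small dict, and the translog is a list of operations. The function name `recover` is hypothetical. Operations are retained from before the snapshot is taken, so anything indexed while segments are being copied can be replayed on the target afterwards.

```python
from typing import Dict, List, Tuple

Segment = Dict[str, str]
Op = Tuple[str, str]  # (doc id, doc value)


def recover(source_segments: List[Segment], ops_during_copy: List[Op]) -> Dict[str, str]:
    retained_translog: List[Op] = []  # "prevents translog deletion"

    snapshot = [dict(segment) for segment in source_segments]  # "snapshots lucene"

    target_docs: Dict[str, str] = {}
    for segment in snapshot:  # "sends segments to target"
        target_docs.update(segment)

    # Indexing continues on the source while the copy is in flight.
    retained_translog.extend(ops_during_copy)

    for doc_id, value in retained_translog:  # "replay translog"
        target_docs[doc_id] = value

    return target_docs  # "finishes recovery"


segments = [{"1": "a"}, {"2": "b"}]
ops = [("3", "c"), ("1", "a-updated")]
print(recover(segments, ops))  # {'1': 'a-updated', '2': 'b', '3': 'c'}
```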
  47. 53.

    Relocation - balance some more

    [Diagram: node 5 now holds a0 and b2]
  48. 56.

    Full Cluster Restart - Master Fetches Store Information

    [Diagram: node1 (master), node2, node3 and node4 with on-disk copies of a0, a1, b0 and b1, next to the unassigned list]
  50. 58.

    Full Cluster Restart - Allocated Existing Copy as Primary

    [Diagram: an existing copy of a0 is allocated as primary; the other copy of a0 is still unassigned]
  52. 60.

    Full Cluster Restart - Replica Allocation - Fetch Store

    [Diagram: node1 (master), node2, node3 and node4; the unassigned replica of a0 is marked "need a home"]
  55. 63.

    Full Cluster Restart - Replica Allocation

    [Diagram: node1 (master), node2, node3 and node4; the replica of a0 is marked "need a home"]
  56. 64.

    Full Cluster Restart - Recover from Primary

    Reuse existing data

    [Diagram: node 4 and node 1, each with a tlog and Lucene (lcn) segments; arrows labelled cluster state, start recovery, send segments, replay translog, done, cluster state]
  57. 65.

    Segments Reuse & Synced Flush

    Reuse existing data?

    automatically use inactivity periods to add a sync id marker, guaranteeing doc level equality

    [Diagram: Shard 1 and Shard 2 with different segment layouts (segment 2 + 3, segment 4, segment 5 + 6 versus segment 2, segment 3 + 4, segment 5, segment 6); both carry sync_id: 0XYB321]
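
A sketch of the decision these slides describe, assuming segments are identified by name only (a simplification). Two copies that merged independently share no segments, so without a marker everything would be copied again; a matching sync id says the documents are identical and the copy can be skipped. The function name and the assignment of layouts to `copy_a` and `copy_b` are hypothetical.

```python
from typing import Iterable, Optional, Set


def segments_to_copy(
    primary_segments: Iterable[str],
    replica_segments: Iterable[str],
    primary_sync_id: Optional[str] = None,
    replica_sync_id: Optional[str] = None,
) -> Set[str]:
    """Return the primary segments the replica still needs."""
    if primary_sync_id is not None and primary_sync_id == replica_sync_id:
        return set()  # same documents guaranteed, reuse the existing data
    return set(primary_segments) - set(replica_segments)


# The two segment layouts shown on the slide.
copy_a = ["segment 2 + 3", "segment 4", "segment 5 + 6"]
copy_b = ["segment 2", "segment 3 + 4", "segment 5", "segment 6"]

print(len(segments_to_copy(copy_a, copy_b)))                  # 3
print(segments_to_copy(copy_a, copy_b, "0XYB321", "0XYB321"))  # set()
```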
  59. 67.

    Full Cluster Restart - Recover with a matching sync id

    Reuse existing data!

    [Diagram: node 4 and node 1 with a matching sync id; arrows labelled cluster state, start recovery, replay translog, done, cluster state. No segments are sent]
  62. 71.

    Single Node Loss - Promote Primaries and Replica

    needed but potentially expensive

    [Diagram: after a node is lost, new copies of b1 and a1 are created on the remaining nodes]
  63. 72.

    Single Node Loss - A Grace Period

    index.unassigned.node_left.delayed_timeout: 1m

    [Diagram: node1 to node4 with the shards a0, a1, b0, b1 and b2]
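
A small sketch of the grace period, assuming the simple rule the slides suggest: while the timeout is running the master waits, and a node that comes back in time gets its shards re-assigned. The helper names are hypothetical and `to_seconds` only understands the `s`, `m` and `h` suffixes.

```python
import json

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def to_seconds(duration: str) -> int:
    """Parse values such as '1m' (partial: only s, m and h are handled)."""
    return int(duration[:-1]) * UNIT_SECONDS[duration[-1]]


def replica_decision(seconds_since_node_left: int, node_is_back: bool,
                     delayed_timeout: str = "1m") -> str:
    if seconds_since_node_left < to_seconds(delayed_timeout):
        return "re-assign to the returning node" if node_is_back else "wait"
    return "allocate a new copy elsewhere"


settings = json.dumps({"index.unassigned.node_left.delayed_timeout": "1m"})
print(settings)  # {"index.unassigned.node_left.delayed_timeout": "1m"}

print(replica_decision(30, node_is_back=False))  # wait
print(replica_decision(30, node_is_back=True))   # re-assign to the returning node
print(replica_decision(90, node_is_back=False))  # allocate a new copy elsewhere
```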
  64. 73.

    Single Node Loss - Node Returns, shard re-assigned

    index.unassigned.node_left.delayed_timeout: 1m

    [Diagram: node1 to node4 with the shards a0, a1, b0, b1 and b2]
  65. 74.

    Single Node Loss - Node Returns After Period Expires (v2.0)

    cancel recoveries if sync-flushed

    [Diagram: node1 to node4; copies of b1 and a1 are being recovered on other nodes]