Every Shard Deserves a Home - Shard Allocation in Elasticsearch

Boaz Leskes
January 13, 2016

This talk was given at the Elastic meetup in Tel Aviv. It follows the journey of a shard in Elasticsearch, covering the mechanics of how ES decides which nodes shards are allocated to, the reasons behind those decisions, and how those decisions are executed.

We will start with the assignment of new shards, which must conform to the current allocation filtering rules, disk space usage and other factors. Once shards are started, they may need to be moved around. This can be due to a cluster topology change, growing data volumes, or because someone instructed ES to do so through the index management APIs.

We will cover these triggers and how ES executes on those decisions, moving potentially tens of gigabytes of data from one node to another without dropping a single search or indexing request. We will finish with the mechanics of full and rolling cluster restarts and recent improvements such as synced flush, delayed assignment on node departure, and cancelling ongoing relocations when a perfect match is found.

Transcript

  1. Every Shard Deserves a Home
    Shard Allocation in Elasticsearch
    @bleskes
    Boaz Leskes

  2. A Cluster
    [Diagram: a four-node cluster (node1-node4) holding the shards a0, a1, b0, b1, b2, each with a primary and a replica copy spread across the nodes]

  3. Index Creation

  4. Index Creation
    PUT index_name/type/1
    [Diagram: indexing the first document creates the new index; its shard c0 (primary and replica) starts out unassigned]
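
    A minimal sketch of the same step against a local cluster; the explicit shard
    and replica counts below are illustrative assumptions, not taken from the talk:

    # create the index up front with explicit settings
    PUT index_name
    { "settings": { "number_of_shards": 1, "number_of_replicas": 1 } }

    # or, as on the slide, let indexing the first document auto-create the index
    PUT index_name/type/1
    { "field": "value" }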

  5. Index creation - Allocation Deciders - Filtering
    PUT index_name/type/1
    index.routing.allocation.require.type: hot
    [Diagram: node1 and node2 are tagged type:hot, node3 and node4 type:cold; c0 is still unassigned, and the filter restricts it to the hot nodes]
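
    A sketch of how such a hot/cold setup is wired together; the elasticsearch.yml
    line assumes the ES 2.x-era syntax where custom node attributes live directly
    under node.*:

    # elasticsearch.yml on the hot nodes (cold nodes use node.type: cold)
    node.type: hot

    # pin the index to nodes whose attribute matches, as on the slide
    PUT index_name/_settings
    { "index.routing.allocation.require.type": "hot" }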

  6. Index creation - Allocation Deciders - Disk Threshold
    PUT index_name/type/1
    cluster.routing.allocation.disk.watermark.high: 90%
    [Diagram: one node is already at 91% disk usage, above the high watermark, so the new shard is not allocated to it and c0 stays unassigned]
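
    The watermarks are dynamic cluster settings and can be adjusted at runtime;
    the values below just restate the ones used in the talk:

    PUT _cluster/settings
    {
      "transient": {
        "cluster.routing.allocation.disk.watermark.low": "85%",
        "cluster.routing.allocation.disk.watermark.high": "90%"
      }
    }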

  7. Index creation - Allocation Deciders - Throttling
    PUT index_name/type/1
    [Diagram: the chosen node is already busy with other recoveries, so the allocation of c0 is throttled and it stays unassigned for now]
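
    The throttling decider is driven by dynamic settings that cap concurrent
    recoveries; a sketch with illustrative values:

    PUT _cluster/settings
    {
      "transient": {
        "cluster.routing.allocation.node_concurrent_recoveries": 2,
        "cluster.routing.allocation.cluster_concurrent_rebalance": 2
      }
    }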

  8. Index creation - Primary Assigned (initializing)
    PUT index_name/type/1
    [Diagram: the primary of c0 is assigned to a node and starts initializing; the replica is still unassigned]

  9-15. Index creation - Shard Initialization
    [Diagram: the master and node2 (which holds a1, b0 and the new c0) exchange "cluster state" and "shard ready" messages]
    On the assigned node:
    • detect assignment
    • initialize an empty shard
    • notify master when done ("shard ready")
    On the master:
    • mark shard as started and publish a new cluster state
    Back on the node:
    • activate the shard
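
    The whole lifecycle above can be watched on a live cluster; a minimal sketch
    using the cat API:

    # one line per shard copy, including its state (UNASSIGNED, INITIALIZING,
    # STARTED, RELOCATING) and the node it lives on
    GET _cat/shards?v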

  16-18. Index creation - Primary Assigned (started)
    [Diagram: the primary of c0 is now started; its replica copy is still unassigned and gets assigned next]

  19-25. Index creation - Replica Initialization
    [Diagram: the master, node2 (primary c0) and node1 (the replica target) exchange "cluster state" and "shard ready" messages]
    On the replica node:
    • detect assignment
    • start recovery from the primary
    • notify master when done ("shard ready")
    On the master:
    • mark replica as started and publish a new cluster state
    Back on the replica node:
    • activate the replica

  26. Time to move a shard
    [Diagram: the four-node cluster with shards a0, a1, b0, b1, b2]

  27. Time to move a shard - Explicit User Command
    POST index_name/_settings
    index.routing.allocation.require.type: cold
    [Diagram: node1 and node2 are tagged type:hot, node3 and node4 type:cold; changing the filter to cold forces the index's shards off the hot nodes]
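
    The same setting written out as a full update-settings request (shown here
    with PUT, the documented verb for this API):

    PUT index_name/_settings
    { "index.routing.allocation.require.type": "cold" }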

  28. Time to move a shard - Disk Threshold Allocation Decider
    cluster.routing.allocation.disk.watermark.low: 85%
    [Diagram: one node has reached 86% disk usage, crossing the low watermark, which the disk threshold decider takes into account when placing shards]
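
    Per-node disk usage, as the disk threshold decider sees it, can be checked
    with the cat allocation API:

    # one line per node: shard count plus disk used, available and percent
    GET _cat/allocation?v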

  29-31. Time to move a shard - Nodes Added
    [Diagram: node 5 joins the cluster; to even out the load, a0 is relocated to it]

  32. Shard Data Storage
    Intermezzo

  33-34. Data Storage - Lucene Segments
    [Diagram: documents are indexed into an in-memory buffer; each Lucene flush writes the buffer out as a new segment, so segments accumulate over time]

  35. Data Storage - Transaction Log
    [Diagram: every indexed doc is also appended as an operation to the transaction log; an Elasticsearch flush performs a Lucene commit, after which the translog can be trimmed]

  36. Data Storage - Lucene Segments + Transaction Log
    • Taking the current set of segments gives a point-in-time snapshot of the data
    • Not flushing the translog keeps a complete operation history
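
    Both pieces can be inspected and driven over the REST API; a small sketch:

    # list the Lucene segments backing each shard copy
    GET index_name/_segments

    # force an Elasticsearch flush (a Lucene commit plus translog housekeeping)
    POST index_name/_flush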

  37. Back to Relocation

  38-41. Relocation - Always Copy from Primary
    [Diagram: node 5 joins and a0 is relocated to it; the data is always copied from the primary copy of a0, never from a replica]

  42-52. Relocation - Recover from Primary
    [Diagram: node1 (the primary's node, with its translog and Lucene segments) and node 5 (the relocation target) exchange "cluster state", "start recovery", segment files, translog operations and "shard ready"/"done" messages]
    On the target (node 5):
    • detect assignment
    • send a start recovery request to the primary
    On the primary (node1):
    • validate the assignment against its cluster state
    • prevent translog deletion
    • snapshot Lucene (the current set of segments)
    • send the segments to the target
    • replay the translog on the target
    • finish the recovery
    On the target:
    • notify the master ("shard ready")
    The master then activates the shard on the target and removes it from the source.
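
    Relocations and recoveries like this one can be monitored while they run;
    a sketch using the recovery APIs:

    # one summary line per recovering shard copy: source and target node, stage,
    # bytes and files transferred
    GET _cat/recovery?v

    # detailed per-index view, limited to recoveries still in flight
    GET index_name/_recovery?active_only=true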

  53. Relocation - balance some more
    [Diagram: after further rebalancing, node 5 also receives b2]

  54. Full Cluster Restart

  55. Full Cluster Restart
    [Diagram: the four-node cluster before all nodes are restarted]
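
    Not shown on the slide, but a common way to prepare for such a restart is to
    stop shard shuffling and synced-flush the indices first, so that the following
    steps can reuse as much on-disk data as possible (a sketch, not the talk's
    prescription):

    # keep the master from reallocating shards while nodes are down
    PUT _cluster/settings
    { "persistent": { "cluster.routing.allocation.enable": "none" } }

    # write a sync_id marker into every idle shard copy (see slides 65-67)
    POST _flush/synced

    # restart the nodes, then re-enable allocation
    PUT _cluster/settings
    { "persistent": { "cluster.routing.allocation.enable": "all" } }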

  56-57. Full Cluster Restart - Master Fetches Store Information
    [Diagram: after the restart all shard copies are unassigned; the elected master (node1) asks every node which shard data it has on disk]

  58-59. Full Cluster Restart - Allocated Existing Copy as Primary
    [Diagram: for each shard, the master picks a node that holds an existing copy of the data and allocates it there as the primary, e.g. a0]

  60-62. Full Cluster Restart - Replica Allocation - Fetch Store
    [Diagram: the replica copies still "need a home"; the master again fetches store information to find nodes that already hold data for them]

  63. Full Cluster Restart - Replica Allocation
    [Diagram: the a0 replica is allocated to a node that already holds a copy of its data on disk]

  64. Full Cluster Restart - Recover from Primary
    [Diagram: the same recovery flow as in slides 42-52 (start recovery, send segments, replay translog, done) between node 4 and node 1, even though the target already has the shard's old data on disk]
    Reuse existing data?

  65-66. Segments Reuse & Synced Flush
    Reuse existing data?
    [Diagram: two copies of the same shard hold the same documents in physically different segment files because of independent merging (one copy has segments 2+3 and 5+6 merged, the other 3+4), so comparing files is not enough]
    Elasticsearch automatically uses inactivity periods to add a sync_id marker
    (sync_id: 0XYB321 on both copies in the diagram), guaranteeing doc-level equality.
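
    Synced flush can also be triggered by hand, which is useful right before a
    planned restart; a minimal sketch:

    # writes a sync_id marker into every shard copy that has no pending operations
    POST index_name/_flush/synced

    # or across all indices at once
    POST _flush/synced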

  67. Full Cluster Restart - Recover with a matching sync id
    [Diagram: the recovery handshake runs as before, but because the sync ids match, the segment copy phase is skipped and only the translog is replayed]
    Reuse existing data!

  68. Single Node Loss

  69-70. Single Node Loss
    [Diagram: one of the four nodes drops out of the cluster, taking its shard copies with it]

  71. Single Node Loss - Promote Primaries and Replica
    [Diagram: replicas of the lost primaries are promoted, and replacement copies of b1 and a1 start being rebuilt on the remaining nodes]
    needed but potentially expensive

  72. Single Node Loss - A Grace Period
    index.unassigned.node_left.delayed_timeout: 1m
    [Diagram: re-allocation of the shards that were on the lost node is delayed for the configured timeout]
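
    The delay is a dynamic index setting, so it can be changed on live indices;
    the one-minute value mirrors the slide:

    PUT _all/_settings
    { "settings": { "index.unassigned.node_left.delayed_timeout": "1m" } }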

  73. Single Node Loss - Node Returns, shard re-assigned
    index.unassigned.node_left.delayed_timeout: 1m
    [Diagram: the node comes back within the grace period and its shards are re-assigned to it, reusing the data already on its disk]

  74. Single Node Loss - Node Returns After Period Expires (v2.0)
    [Diagram: replacement copies of b1 and a1 are already being rebuilt on other nodes when the original node returns]
    cancel recoveries if sync-flushed

  75. Thank you!
    elastic.co/guide