Every Shard Deserves a Home - Shard Allocation in Elasticsearch

Every Shard Deserves a Home - Shard Allocation at Elasticsearch @bleskes
Boaz Leskes

2 A Cluster
[Diagram: four nodes (node1 to node4) hold the primary and replica copies of shards a0, a1, b0, b1 and b2.]

3 Index Creation

4 Index Creation
PUT index_name/type/1
[Diagram: the same four-node cluster; both copies of the new shard c0 are unassigned.]

5 Index creation - Allocation Deciders - Filtering
PUT index_name/type/1
index.routing.allocation.require.type: hot
[Diagram: node1 and node2 are tagged type:hot, node3 and node4 are tagged type:cold; both copies of c0 are unassigned.]
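
A minimal sketch of what a filtering decider checks, assuming each node exposes its attributes as a plain dictionary. The function name `filter_decider` and the data layout are illustrative and not Elasticsearch's actual classes; only the `require` form of the setting shown on the slide is modelled, with a single value per attribute.

```python
REQUIRE_PREFIX = "index.routing.allocation.require."

def filter_decider(index_settings, node_attrs):
    """Return True if a shard of the index may be placed on the node.

    Hypothetical helper: every `require.<attr>` setting of the index
    must equal the node's value for <attr>.
    """
    for key, required in index_settings.items():
        if key.startswith(REQUIRE_PREFIX):
            attr = key[len(REQUIRE_PREFIX):]
            if node_attrs.get(attr) != required:
                return False
    return True

# Node tags as shown on the slide.
nodes = {
    "node1": {"type": "hot"},
    "node2": {"type": "hot"},
    "node3": {"type": "cold"},
    "node4": {"type": "cold"},
}
settings = {"index.routing.allocation.require.type": "hot"}

eligible = [name for name, attrs in nodes.items() if filter_decider(settings, attrs)]
print(eligible)  # ['node1', 'node2']
```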

6 Index creation - Allocation Deciders - Disk Threshold
PUT index_name/type/1
cluster.routing.allocation.disk.watermark.high: 90%
[Diagram: one node is at 91% disk usage; both copies of c0 are unassigned.]
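
The disk threshold check can be sketched as a comparison between a node's disk usage and the configured watermark. This is a simplification under stated assumptions: only percentage values are handled, and `disk_threshold_decider` and the per-node usage numbers for node1 and node2 are made up for illustration.

```python
def parse_percent(value):
    """Turn a setting such as '90%' into the float 90.0."""
    return float(value.strip().rstrip("%"))

def disk_threshold_decider(node_disk_used_percent, cluster_settings):
    """Return True if a new shard may be placed on the node (illustrative only)."""
    high = parse_percent(
        cluster_settings["cluster.routing.allocation.disk.watermark.high"]
    )
    return node_disk_used_percent < high

settings = {"cluster.routing.allocation.disk.watermark.high": "90%"}
disk_usage = {"node1": 91.0, "node2": 40.0}  # hypothetical usage per node

allowed = [n for n, used in disk_usage.items() if disk_threshold_decider(used, settings)]
print(allowed)  # ['node2']
```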

7 Index creation - Allocation Deciders - Throttling
PUT index_name/type/1
[Diagram: the cluster with a "throttle" label; both copies of c0 are unassigned.]
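
A sketch of how a throttling decision might combine with the other deciders, assuming three possible verdicts (yes, throttle, no) and a per-node limit on concurrent recoveries. The limit of 2 and the names `throttling_decider` and `combine` are assumptions for illustration, not the slide's content.

```python
from enum import Enum

class Decision(Enum):
    YES = 1
    THROTTLE = 2   # allowed in principle, but not right now
    NO = 3

def throttling_decider(recoveries_in_progress, max_concurrent_recoveries=2):
    """Delay the allocation while the node is busy with other recoveries."""
    if recoveries_in_progress >= max_concurrent_recoveries:
        return Decision.THROTTLE
    return Decision.YES

def combine(decisions):
    """A single NO vetoes the allocation; otherwise any THROTTLE delays it."""
    decisions = list(decisions)
    if Decision.NO in decisions:
        return Decision.NO
    if Decision.THROTTLE in decisions:
        return Decision.THROTTLE
    return Decision.YES

verdict = combine([Decision.YES, throttling_decider(recoveries_in_progress=2)])
print(verdict.name)  # THROTTLE
```

A throttled shard stays unassigned and is simply retried later, which is why the slide still shows c0 in the unassigned list.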

8 Index creation - Primary Assigned (initializing)
PUT index_name/type/1
[Diagram: the primary of c0 is assigned to a node and initializing; the replica of c0 is still unassigned.]

9 Index creation - Shard Initialization
[Diagram: the master and node2 (holding a1 and b0) exchange "cluster state" and "shard ready" messages about c0.]
• detect assignment
• initialize an empty shard
• notify master when done
• mark shard as started
• activate the shard
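
The steps above amount to a small state machine for a shard copy in the cluster state. The sketch below is a toy model: the class and method names are illustrative, and the state names UNASSIGNED, INITIALIZING and STARTED are used as labels for the stages shown on the slides.

```python
class ShardRouting:
    """Toy master-side record of one shard copy in the cluster state."""

    def __init__(self, shard_id):
        self.shard_id = shard_id
        self.state = "UNASSIGNED"
        self.node = None

    def assign(self, node):
        # The master picks a node; the node later detects this assignment.
        if self.state != "UNASSIGNED":
            raise ValueError(f"{self.shard_id} is already {self.state}")
        self.node = node
        self.state = "INITIALIZING"

    def mark_started(self):
        # Called once the node has notified the master that the shard is ready.
        if self.state != "INITIALIZING":
            raise ValueError(f"{self.shard_id} is {self.state}, not INITIALIZING")
        self.state = "STARTED"

c0 = ShardRouting("c0")
history = [c0.state]
c0.assign("node2")
history.append(c0.state)
c0.mark_started()
history.append(c0.state)
print(history)  # ['UNASSIGNED', 'INITIALIZING', 'STARTED']
```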

10 Index creation - Primary Assigned (started)
[Diagram: the primary of c0 is started; the replica of c0 leaves the unassigned list and is assigned to a node.]

11 Index creation - Replica Initialization
[Diagram: the master, node2 and node1 exchange "cluster state" and "shard ready" messages about the replica of c0.]
• detect assignment
• start recovery from primary
• notify master when done
• mark replica as started
• activate the replica

12 Time to move a shard
[Diagram: the four-node cluster holding shards a0, a1, b0, b1 and b2.]

13 Time to move a shard - Explicit User Command
POST index_name/_settings
index.routing.allocation.require.type: cold
[Diagram: node1 and node2 are tagged type:hot, node3 and node4 are tagged type:cold.]

14 Time to move a shard - Disk Threshold Allocation Decider
cluster.routing.allocation.disk.watermark.low: 85%
[Diagram: one node is at 86% disk usage.]

15 Time to move a shard - Nodes Added
[Diagram: an empty node 5 joins the four-node cluster; a copy of a0 is then moved to node 5.]
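
Rebalancing after a node joins can be sketched as repeatedly moving one shard copy from the fullest node to the emptiest one. This is a deliberately naive stand-in for the real balancing logic: it only counts shards per node and refuses to put two copies of the same shard on one node. The starting layout is a guess at the diagram, and which shards end up moving depends on the tie-breaking chosen here, so it need not match the slides.

```python
def rebalance(cluster):
    """Even out shard counts. `cluster` maps node name -> set of shard names.

    Mutates `cluster` and returns the list of (shard, source, target) moves.
    """
    moves = []
    while True:
        fullest = max(cluster, key=lambda n: len(cluster[n]))
        emptiest = min(cluster, key=lambda n: len(cluster[n]))
        if len(cluster[fullest]) - len(cluster[emptiest]) <= 1:
            return moves
        # Never place two copies of the same shard on one node.
        candidates = sorted(cluster[fullest] - cluster[emptiest])
        if not candidates:
            return moves
        shard = candidates[0]
        cluster[fullest].remove(shard)
        cluster[emptiest].add(shard)
        moves.append((shard, fullest, emptiest))

cluster = {
    "node1": {"a0", "b1", "b2"},
    "node2": {"a1", "b0"},
    "node3": {"a0", "b1"},
    "node4": {"a1", "b0", "b2"},
    "node5": set(),            # the newly added node
}
moves = rebalance(cluster)
print(moves)  # [('a0', 'node1', 'node5'), ('a1', 'node4', 'node5')]
```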

16 Shard Data Storage Intermezzo

17 Data Storage - Lucene Segments
[Diagram: a timeline; each "index a doc" adds to an in-memory buffer, and a "lucene flush" writes the buffer out as a segment.]

18 Data Storage - Lucene Segments
[Diagram: the same timeline after more flushes, with five segments.]

19 Data Storage - Transaction Log
[Diagram: a timeline; each "index a doc" writes a doc op to the trans_log as well as to the buffer. A "lucene flush" turns the buffer into a segment; an "elasticsearch flush" performs a lucene commit.]

20 Data Storage - Lucene Segments + Transaction Log
• Taking the current set of segments gives a point-in-time snapshot of the data
• Not flushing the translog keeps a complete operation history
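
Both properties can be seen in a toy model of a shard, assuming documents are plain strings and segments are immutable tuples. `ToyShard` is a hypothetical class for illustration and leaves out everything Lucene actually does (merging, deletes, fsync).

```python
class ToyShard:
    def __init__(self):
        self.buffer = []     # docs indexed but not yet written to a segment
        self.segments = []   # immutable segments, oldest first
        self.translog = []   # every operation since the last elasticsearch flush

    def index(self, doc):
        self.buffer.append(doc)
        self.translog.append(doc)

    def lucene_flush(self):
        """Write the buffer out as a new segment."""
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def elasticsearch_flush(self):
        """Commit the segments, after which the translog can be cleared."""
        self.lucene_flush()
        self.translog = []

shard = ToyShard()
shard.index("doc1")
shard.index("doc2")
shard.lucene_flush()

snapshot = list(shard.segments)   # point-in-time view: the current set of segments
shard.index("doc3")
shard.lucene_flush()

print(snapshot)         # [('doc1', 'doc2')]
print(shard.segments)   # [('doc1', 'doc2'), ('doc3',)]
print(shard.translog)   # ['doc1', 'doc2', 'doc3']
shard.elasticsearch_flush()
print(shard.translog)   # []
```

The snapshot is unaffected by later indexing because segments are never modified in place, and the translog keeps the full history until an elasticsearch flush clears it.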

21 Back to Relocation

22 Relocation - Always Copy from Primary
[Diagram: node 5 has joined the cluster; a copy of a0 is created on node 5, copied from the primary.]

23 Relocation - Recover from Primary
[Diagram: node1 and node 5, each with lucene segments (lcn) and a translog (tlog); messages between them: cluster state, start recovery, send segments, replay translog, done, shard ready.]
• detect assignment
• sends start recovery request
• validate assignment
• prevents translog deletion
• snapshots lucene
• sends segments to target
• replay translog
• finishes recovery
• notifies master
• master activate shard (and removes it from node5)
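
The core of the recovery steps above (retain the translog, snapshot the segments, copy them, replay the translog) can be sketched as follows. It is a toy model under stated assumptions: a segment is a dictionary from document id to source, a translog operation is an `(id, source)` pair, and replaying an operation twice is harmless because it overwrites by id. The names `recover` and `visible_docs` are hypothetical.

```python
def visible_docs(segments, ops=()):
    """Documents visible after loading the segments and replaying the ops."""
    docs = {}
    for segment in segments:
        docs.update(segment)
    for doc_id, source in ops:
        docs[doc_id] = source
    return docs

def recover(primary, writes_during_copy):
    """Build a target copy from the primary while indexing continues."""
    translog = primary["translog"]                 # retained: not trimmed during recovery
    snapshot = list(primary["segments"])           # point-in-time set of segments
    target_segments = [dict(s) for s in snapshot]  # "sends segments to target"
    for op in writes_during_copy:                  # the primary keeps accepting writes
        translog.append(op)
    target_ops = list(translog)                    # "replay translog"
    return {"segments": target_segments, "translog": target_ops}

primary = {
    "segments": [{"1": "a"}, {"2": "b"}],
    # doc 3 is indexed but not yet in any segment
    "translog": [("1", "a"), ("2", "b"), ("3", "c")],
}
target = recover(primary, writes_during_copy=[("4", "d"), ("1", "a2")])

print(visible_docs(target["segments"], target["translog"]))
# {'1': 'a2', '2': 'b', '3': 'c', '4': 'd'}
```

Because the translog is kept for the whole recovery, every write that arrives after the segment snapshot still reaches the target through the replay.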

24 Relocation - balance some more
[Diagram: node 5 now holds copies of a0 and b2.]

25 Full Cluster Restart

26 Full Cluster Restart
[Diagram: the four-node cluster holding shards a0, a1, b0, b1 and b2.]

27 Full Cluster Restart - Master Fetches Store Information
[Diagram: node1 is the master; the nodes hold copies of a0, a1, b0 and b1 on disk, and the shards are unassigned.]

28 Full Cluster Restart - Allocated Existing Copy as Primary
[Diagram: node1 is the master; both copies of a0 are in the unassigned list.]

29 Full Cluster Restart - Allocated Existing Copy as Primary
[Diagram: an existing copy of a0 has been allocated as primary; the other copy of a0 is still unassigned.]

30 Full Cluster Restart - Replica Allocation - Fetch Store
[Diagram: node1 is the master; the replica of a0 is unassigned.]

31 Full Cluster Restart - Replica Allocation - Fetch Store
[Diagram: the replica of a0 is marked "need a home".]

32 Full Cluster Restart - Replica Allocation
[Diagram: the replica of a0 is marked "need a home".]

33 Full Cluster Restart - Recover from Primary
Reuse existing data
[Diagram: node 4 and node 1, each with lucene segments (lcn) and a translog (tlog); messages between them: cluster state, start recovery, send segments, replay translog, done, cluster state.]

34 Segments Reuse & Synced Flush
Reuse existing data?
[Diagram: two copies labelled Shard 1 and Shard 2; one holds segment 2 + 3, segment 4 and segment 5 + 6, the other holds segment 2, segment 3 + 4, segment 5 and segment 6.]

35 Segments Reuse & Synced Flush
automatically use inactivity periods to add a sync id marker, guaranteeing doc level equality
[Diagram: the same two copies, both marked sync_id: 0XYB321.]

36 Full Cluster Restart - Recover with a matching sync id
Reuse existing data!
[Diagram: node 4 and node 1 with a matching sync id; messages between them: cluster state, start recovery, replay translog, done, cluster state. No segments are sent.]
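
A sketch of the decision a recovery might make about which segments to copy, assuming each copy is described by a map from segment name to checksum plus an optional sync id. The function `segments_to_copy`, the "primary"/"replica" labels and the checksum values are made up for illustration; only the segment names and the sync id come from the slides.

```python
def segments_to_copy(primary, replica):
    """Return the names of the primary's segments that must be sent to the replica."""
    same_sync_id = (
        primary["sync_id"] is not None
        and primary["sync_id"] == replica["sync_id"]
    )
    if same_sync_id:
        return []   # both copies hold the same documents: reuse everything
    # Otherwise a segment can only be reused if an identical file already exists.
    return [
        name
        for name, checksum in primary["segments"].items()
        if replica["segments"].get(name) != checksum
    ]

primary = {
    "sync_id": None,
    "segments": {"segment 2 + 3": "c1", "segment 4": "c2", "segment 5 + 6": "c3"},
}
replica = {
    "sync_id": None,
    "segments": {"segment 2": "d1", "segment 3 + 4": "d2",
                 "segment 5": "d3", "segment 6": "d4"},
}

# The copies merged their segments differently, so nothing can be reused.
print(segments_to_copy(primary, replica))
# ['segment 2 + 3', 'segment 4', 'segment 5 + 6']

# After a synced flush both copies carry the same marker.
primary["sync_id"] = replica["sync_id"] = "0XYB321"
print(segments_to_copy(primary, replica))  # []
```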

37 Single Node Loss

38 Single Node Loss
[Diagram: the cluster, with node4, node1 and node3 labelled.]

39 Single Node Loss
[Diagram: the cluster, with node4, node1 and node3 labelled.]

40 Single Node Loss - Promote Primaries and Replica
needed but potentially expensive
[Diagram: the four-node cluster with additional copies of b1 and a1 being allocated.]

41 Single Node Loss - A Grace Period
index.unassigned.node_left.delayed_timeout: 1m

42 Single Node Loss - Node Returns, shard re-assigned
index.unassigned.node_left.delayed_timeout: 1m
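
The grace period can be sketched as a simple time comparison, assuming the master records when the node left. The helper names and the small time-unit parser are illustrative; only the setting name and the `1m` value come from the slides.

```python
SETTING = "index.unassigned.node_left.delayed_timeout"

def parse_time(value):
    """Convert values such as '30s', '1m' or '2h' to seconds."""
    units = {"s": 1, "m": 60, "h": 3600}
    return int(value[:-1]) * units[value[-1]]

def should_allocate_replica(seconds_since_node_left, index_settings):
    """Allocate a replacement replica only once the grace period has expired."""
    delay = parse_time(index_settings.get(SETTING, "1m"))
    return seconds_since_node_left >= delay

settings = {SETTING: "1m"}
print(should_allocate_replica(20, settings))  # False: still waiting for the node
print(should_allocate_replica(75, settings))  # True: start rebuilding replicas elsewhere
```

If the node returns inside the window, its existing shard copies can be re-assigned to it without copying data to another node.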

43 Single Node Loss - Node Returns After Period Expires (v2.0)
cancel recoveries if sync-flushed
[Diagram: new copies of b1 and a1 are being recovered when the node returns.]

Thank you! elastic.co/guide
