
Elasticsearch and Resiliency

Elastic Co
February 18, 2016


Resiliency is a primary development focus in Elasticsearch. Many improvements shipped in ES 2.0, and even more are coming in future releases. From faster recovery times to more durable writes, attend this session to learn everything you ever wanted to know about resiliency in Elasticsearch.


Transcript

  1. re·sil·ience, noun: 1. the power or ability to return to the original form, position, etc., after being bent, compressed, or stretched; elasticity. 2. ability to recover readily from illness, depression, adversity, or the like; buoyancy.
  2. Disk-based data structures • Doc values (Lucene's columnar store), on by default from 2.0 (#10209) • Norms stay on disk in Lucene 5.3.0 (LUCENE-6504), released in ES 2.1.0
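    A minimal sketch of what this looks like in a 2.x mapping (index, type, and field names here are illustrative; from 2.0, not_analyzed fields use doc values by default, so the explicit flag below is redundant):

    curl -XPUT 'localhost:9200/myindex' -d '
    {
      "mappings": {
        "mytype": {
          "properties": {
            "category": {
              "type": "string",
              "index": "not_analyzed",
              "doc_values": true
            }
          }
        }
      }
    }'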
  3. Smarter algorithms • Automatic query caching: #10897 (ES 2.0), LUCENE-6077 • Breadth-first aggregation trees: added in #6128 (ES 1.3); working on automatic application in #9825
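    Breadth-first collection can also be requested explicitly via the terms aggregation's collect_mode parameter (added by #6128). A sketch with illustrative index and field names, useful when an expensive sub-aggregation would otherwise run under every candidate bucket:

    curl -XGET 'localhost:9200/myindex/_search' -d '
    {
      "aggs": {
        "actors": {
          "terms": {
            "field": "actors",
            "size": 10,
            "collect_mode": "breadth_first"
          },
          "aggs": {
            "costars": {
              "terms": { "field": "actors", "size": 5 }
            }
          }
        }
      }
    }'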
  4. Prevent abuse #11511 • Limit the total number of hits #9311 (2.1.0) • Circuit breakers have been around since 1.0 • Recent additions #16011: limit the size of a single request; limit the total size of in-flight concurrent requests • Hard upper bounds on thread pools #15582 • Settings validation #15278
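    The circuit breaker limits are dynamic cluster settings; a sketch with illustrative values:

    curl -XPUT 'localhost:9200/_cluster/settings' -d '
    {
      "persistent": {
        "indices.breaker.fielddata.limit": "40%",
        "indices.breaker.request.limit": "30%",
        "indices.breaker.total.limit": "70%"
      }
    }'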
  5. Temporary Node Leave - Promote Primary & Add Replicas [diagram: after node4 leaves, shards a0, a1, b0, b1, b2 are re-replicated across node1-node3] needed but potentially expensive
  6. Temporary Node Leave - A Grace Period #11712 (1.7) [diagram: the cluster waits before reallocating node4's shards] index.unassigned.node_left.delayed_timeout: 1m
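    The grace period is a dynamic index setting; a sketch of applying the slide's 1m value to every index (_all can be replaced by a specific index name):

    curl -XPUT 'localhost:9200/_all/_settings' -d '
    {
      "settings": {
        "index.unassigned.node_left.delayed_timeout": "1m"
      }
    }'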
  7. Still Requires A Recovery When The Node Re-Joins [diagram: node4's returning shard copies must still be recovered from the primaries]
  8. Resync Files With Primary [diagram: primary holds segments 1+2, 3, 4+5; replica holds segments 2+3, 1, 4, 5] reuse existing segments?
  9. Re-Sync Files With Primary & Synced Flush #10032 (1.6) [diagram: primary and replica both carry sync_id: 0XYB321] automatically use inactivity periods to add a sync id marker, guaranteeing doc level equality & instant recovery
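    The marker can also be requested manually, e.g. before planned maintenance; a sketch against an illustrative index name (a synced flush only succeeds on shards with no ongoing indexing, and any later write invalidates the sync_id):

    curl -XPOST 'localhost:9200/myindex/_flush/synced'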
  10. Cancel Ongoing Recoveries If A Perfect Match Is Found #12421 (2.0) [diagram: a file-based recovery onto another node is cancelled when node4 re-joins with a matching sync_id]
  11. Durability [diagram: over time, each indexed doc goes into an in-memory buffer; a Lucene flush turns the buffer into a segment]
  12. Durability [diagram: each indexed doc is buffered and its op is appended to the translog; an Elasticsearch flush performs a Lucene commit, writing segments]
  13. Durability - fsync translog every 5s (1.x) [diagram: primary and replica each buffer the doc and append the op to their own translog] redundancy doesn't help if all nodes lose power
  14. Durability - Translog fsync on every request #11011 (2.0) • For low volume indexing, fsync matters less • For high volume indexing, we can amortize the costs and fsync once per bulk • Concurrent requests can share an fsync
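    In 2.0 this behavior is controlled per index by index.translog.durability, which defaults to request (fsync before acknowledging). A sketch of reverting an illustrative index to the 1.x-style periodic fsync:

    curl -XPUT 'localhost:9200/myindex/_settings' -d '
    {
      "index.translog.durability": "async",
      "index.translog.sync_interval": "5s"
    }'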
  15. Durability - Recovery after Translog Crash #11143 (2.0) • fsync every 5s means some ops can be partially written • 1.x had to ignore EOF exceptions, making it non-resilient to translog truncation • fsync on every request still suffers from this • Write a checkpoint file on every fsync #12341 (2.0): the .ckp file records { offset: 3034, ops: 1302 }
  16. Dynamic Mappings

    curl -XPUT 'localhost:9200/twitter/tweet/1' -d '
    {
      "id": "467495741683150848",
      "text": "the problem with distributed systems jokes is that you'\''re never sure if everyone gets it",
      "created_on": "Sat May 17 02:44:05 +0000 2014"
    }'

    The resulting dynamically generated mapping:

    {
      "id":         { "type": "string" },
      "text":       { "type": "string" },
      "created_on": { "type": "date" }
    }
  17. Dynamic Mappings Optimize for Speed 1.x Assumption: there is a "true" schema, we just need to learn it [diagram: two data nodes each see {"i": 1} and {"i": 10} and independently map "i" as an int]
  18. Dynamic Mappings Optimize for Speed 1.x If you make assumptions, you're going to have a bad time [diagram: one node sees {"f": 1} and maps "f" as an int, another sees {"f": "text"} and maps it as a string; the mappings conflict]
  19. Dynamic Mappings 2.x PR #10634 Validate dynamic mapping updates on the master node [diagram: both nodes forward their proposed mapping updates for "f" to the master]
  20. Dynamic Mappings 2.x PR #10634 Validate dynamic mapping updates on the master node [diagram: the master validates the proposed updates before they are applied (✔ ✔)]
  21. Allocate primary shards using allocation IDs #14739 Persist allocation IDs as metadata [diagram: the copies of shard a0 on node1 and node2 carry allocation IDs 9154ff and 5aa72e; the master records a0: 9154ff, 5aa72e]
  22. Allocate primary shards using allocation IDs #14739 Index document [diagram: after indexing, node2's copy of a0 carries the new allocation ID 6f91cc, and the master's list for a0 is updated to 6f91cc, 9154ff]
  23. Isolation during indexing issue #7572 Primary fails to notify master and acknowledges the write [diagram: the isolated primary a0 on node1 cannot reach the master but still acknowledges writes]
  24. Isolation during indexing issue #7572 The bad replica is promoted to primary and acknowledged writes are lost [diagram: the master promotes node2's stale copy of a0 to primary]
  25. Waiting while failing a shard issue #14252 The old primary is failed, and the request is no longer acknowledged if the shard is not failed: • Wait for failure publication: the master submits a cluster state update task, waits for the update to be processed, and responds with success or failure • Master left: enter a retry loop waiting for a new cluster state with a master-changed event; the same retry loop is used for any master channel failure • No longer the primary: retry the request on the new primary, which now manages the indexing request
  26. Master Election 2.x PR #12161 Elected master waits for joins [diagram: master2 and master3 send join requests to master1, which waits until enough nodes have joined]
  27. Master Election 2.x PR #12161 Enough nodes join, master is elected [diagram: master1 publishes the cluster state (CS) to master2 and master3]
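    Waiting for joins works together with the quorum requirement: a node only becomes master once a majority of master-eligible nodes have joined. A sketch for a cluster with three master-eligible nodes:

    # elasticsearch.yml
    discovery.zen.minimum_master_nodes: 2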
  28. Resiliency tests / service disruption tests: DiscoveryWithServiceDisruptionsIT • testFailWithMinimumMasterNodesConfigured • testNodesFDAfterMasterReelection • testVerifyApiBlocksDuringPartition • testIsolateMasterAndVerifyClusterStateConsensus • testMasterNodeGCs • testStaleMasterNotHijackingMajority • testRejoinDocumentExistsInAllShardCopies • testUnicastSinglePingResponseContainsMaster • testIsolatedUnicastNodes • testClusterJoinDespiteOfPublishingIssues • testSendingShardFailure • testClusterFormingWithASlowNode • testNodeNotReachableFromMaster • testSearchWithRelocationAndSlowClusterStateProcessing • testIndexImportedFromDataOnlyNodesIfMasterLostDataFolder • testIndicesDeleted
  29. Resiliency updates - Where to go 1. Reporting: https://github.com/elastic/elasticsearch/issues 2. Discourse: https://discuss.elastic.co/c/elasticsearch 3. Tracking: https://github.com/elastic/elasticsearch/issues?q=label%3Aresiliency 4. Status: https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html 5. 2015 talk: https://www.elastic.co/elasticon/2015/sf/resiliency-in-elasticsearch-and-lucene