Slide 1

Consistency and Replication in Elasticsearch
Elasticsearch Engineers, March 8th, 2017
Boaz Leskes @bleskes, Yannick Welsch @ywelsch, Jason Tedor @jasontedor

Slide 2

The Perfect Cluster

Slide 3

The Perfect Cluster
• Indexes lots of data fast
• Returns near real time results
• Ask me anything
• Resilient to a reasonable amount of failures

Slide 4

The Perfect Cluster
[diagram: indexing and search requests hit the cluster: PUT index/type/1 { "f": "text" }, PUT index/type/2 { "f": "other text" }, GET _search, GET _search]

Slide 5

The Perfect Cluster
[diagram: the same indexing and search requests as on the previous slide]

Slide 6

The Perfect Cluster
[diagram: two concurrent writes to the same document, PUT index/type/1 { "c": 1 } and PUT index/type/1 { "c": 2 }, alongside GET _search requests]

Slide 7

The Perfect Cluster
[diagram: writes to two different documents, PUT index/type/1 { "c": 1 } and PUT index/type/2 { "c": 2 }, alongside GET _search requests]

Slide 8

The Perfect Cluster
[diagram: the same requests as on the previous slide]

Slide 9

Quorum-based algorithms
• Paxos
• Viewstamped Replication
• Raft
• Zab

Slide 10

Quorum-based Algorithms
[diagram: POST new_index { "settings": {} } is accepted by the cluster and acknowledged with 200 OK]

Slide 11

Quorum-based Algorithms
[diagram: POST new_index { "settings": {} } is acknowledged with 200 OK while the state of some copies remains in question (? ?)]

Slide 12

A QUORUM OF 2 IS 2

Slide 13

Summary: Quorum-Based Consistency Algorithms
• Good for coordinating many nodes
• Good for light reads
• Requires three copies or more
• Good for Cluster State coordination (ZenDiscovery)
• Problematic for storing data
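To make the quorum arithmetic behind the last few slides concrete, here is a minimal Python sketch (illustrative only, not Elasticsearch code) of majority quorum sizes: with two copies the quorum is two, so a quorum-based data path cannot tolerate losing either copy until you pay for a third.

    # Majority quorum size for n copies: more than half must agree.
    def quorum(n: int) -> int:
        return n // 2 + 1

    for n in (1, 2, 3, 5):
        q = quorum(n)
        tolerated = n - q                 # copies that may fail while a quorum is still reachable
        print(f"{n} copies -> quorum of {q}, tolerates {tolerated} failure(s)")

    # 2 copies -> quorum of 2, tolerates 0 failures: a quorum of 2 is 2, which is
    # why quorum-based data replication effectively requires three copies or more.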

Slide 14

Retrospective: Quorums and Data
• Why do we need 3 copies exactly?
• Can’t we operate without the extra copy?

Slide 15

Primary-Backup Replication

Slide 16

Primary-Backup replication
• well-known model for replicating data
• main copy of the data is called the primary
• additional backup copies
• Elasticsearch:
  • primary and replica shards
  • number of replicas is configurable & dynamically adjustable
  • fully-automated shard management

Slide 17

Basic flow for write requests
• data flows from primary to replicas
• synchronous replication
• write to all / read from any
• can lose all but one copy
[diagram: primary and replica shard copies spread across nodes 1-5]
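The bullets above describe the primary-backup write path. Below is a minimal Python sketch of that flow with invented names (this is not Elasticsearch's internal API): the primary indexes first, replicates synchronously to every copy, and only then acknowledges, so a read against any copy sees the write.

    # Minimal primary-backup write path (illustrative; not Elasticsearch's internal API).
    class Copy:
        """One shard copy: either the primary or a replica."""
        def __init__(self, name):
            self.name = name
            self.docs = {}                   # doc id -> document

        def index(self, doc_id, doc):
            self.docs[doc_id] = doc

    def write(primary, replicas, doc_id, doc):
        primary.index(doc_id, doc)           # 1. index on the primary first
        for replica in replicas:             # 2. replicate synchronously to every copy
            replica.index(doc_id, doc)
        return "ack"                         # 3. acknowledge only once all copies have the write

    primary, replica = Copy("primary"), Copy("replica")
    write(primary, [replica], "1", {"f": "text"})
    assert primary.docs == replica.docs      # read from any copy returns the same data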

Slide 18

What if a replica is unavailable?
1 primary, 1 replica, containing the same data

Slide 19

What if a replica is unavailable?
Take down node 3 for maintenance

Slide 20

What if a replica is unavailable?
Index into primary

Slide 21

What if a replica is unavailable?
Replica misses acknowledged write

Slide 22

What if a replica is unavailable?
Node with primary crashes

Slide 23

What if a replica is unavailable?
Node with stale shard copy comes back up

Slide 24

What if a replica is unavailable?
Stale shard copy should not become primary: DATA LOSS

Slide 25

Shard allocation basics
• Master is responsible for allocating shards
• Decision recorded in cluster state
• Broadcasted to all the nodes
• Smart routing of requests based on cluster state
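The "smart routing" bullet can be made concrete with a small sketch of the documented routing rule, shard = hash(_routing) mod number_of_primary_shards. Python's built-in hash stands in for the murmur3 hash Elasticsearch actually uses, and the function name is made up for illustration.

    # Which shard owns a document? Route by hashing the routing value
    # (the document id by default) modulo the number of primary shards.
    def shard_for(routing_value: str, num_primary_shards: int) -> int:
        # Python's hash() is randomized per process; Elasticsearch uses a stable
        # murmur3 hash so every node computes the same shard number.
        return hash(routing_value) % num_primary_shards

    # Any node holding the current cluster state can compute this and forward the
    # request directly to a node that has a copy of that shard.
    print(shard_for("my-doc-id", 5))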

Slide 26

What if a replica is unavailable?
Choose wisely: who has a copy of shard 0? Should I pick the copy on node 3 as primary?

Slide 27

In-sync allocations
• Allocation IDs uniquely identify shard copies
  • assigned by master
  • stored next to shard data
• Master tracks subset of copies that are in-sync
  • persisted in cluster state
  • changes are backed by consensus layer
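A hypothetical sketch of the bookkeeping this slide describes (class and method names are invented, and in Elasticsearch the changes flow through the master and the consensus layer): the in-sync set lives in the cluster state, copies that miss an acknowledged write are removed from it, and only in-sync copies are eligible to become primary.

    # Master-side bookkeeping for in-sync shard copies (illustrative sketch).
    class Master:
        def __init__(self, allocation_ids):
            self.in_sync = set(allocation_ids)    # persisted in the cluster state

        def mark_stale(self, allocation_id):
            # Requested by the primary before it acknowledges a write that a copy missed;
            # the change is then published to all nodes as part of the cluster state.
            self.in_sync.discard(allocation_id)

        def can_promote(self, allocation_id):
            # Only copies in the in-sync set may be chosen as the new primary.
            return allocation_id in self.in_sync

    master = Master(["9154f", "6f91c"])
    master.mark_stale("6f91c")                    # the replica missed an acknowledged write
    assert not master.can_promote("6f91c")        # a stale copy must never become primary
    assert master.can_promote("9154f")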

Slide 28

Tracking in-sync shard copies
1 primary, 1 replica, this time with allocation IDs (in-sync: [9154f, 6f91c])

Slide 29

Tracking in-sync shard copies
Take down node 3 for maintenance

Slide 30

Tracking in-sync shard copies
Index into primary

Slide 31

Tracking in-sync shard copies
Master, please mark this other shard copy as stale (remove 6f91c)

Slide 32

Tracking in-sync shard copies
Everyone should know about in-sync copies: publish cluster state (in-sync: [9154f])

Slide 33

Tracking in-sync shard copies
Acknowledge write to client

Slide 34

Tracking in-sync shard copies
Node 2 crashes

Slide 35

Tracking in-sync shard copies
Wiser master makes smarter decisions: shard copy with id 6f91c is not in the in-sync list

Slide 36

Summary
• data replication based on primary-backup model
• number of replicas configurable and dynamically adjustable
• also works with just two shard copies
• backed by cluster consensus to ensure safety
• writes are synchronous and go to all copies; reads can be served by any copy

Slide 37

Staying in sync

Slide 38

Primary/Backup (super quick review)
One primary, one replica

Slide 39

Primary/Backup (super quick review)
Indexing request hits primary

Slide 40

Primary/Backup (super quick review)
After the primary indexes, it sends the request to the replicas

Slide 41

Primary/Backup (super quick review)
An offline replica will miss replica requests

Slide 42

Primary/Backup (super quick review)
Help, I’ve fallen out of sync and I can’t get up

Slide 43

Mind of their own

Slide 44

Peer recovery: bring yourself back online
• Peer recovery in Elasticsearch is file-based
• Upon recovery, the primary and replica compare lists of files (and checksums)
• The primary sends the replica the files that the replica is missing
• When recovery completes, the two shards have the same files on disk
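A rough sketch of the file comparison described above, under the assumption that a per-file checksum is enough to decide what to ship (Elasticsearch compares Lucene segment files and their checksums; the names and the md5 choice here are purely illustrative):

    import hashlib

    # File-based recovery sketch: compare (name, checksum) pairs and ship the files
    # the replica is missing or holds in a different version.
    def checksum(data: bytes) -> str:
        return hashlib.md5(data).hexdigest()

    def files_to_send(primary_files: dict, replica_files: dict) -> list:
        """Both arguments map file name -> file contents."""
        missing = []
        for name, data in primary_files.items():
            if name not in replica_files or checksum(replica_files[name]) != checksum(data):
                missing.append(name)
        return missing

    primary = {"segments_1": b"...", "_0.cfs": b"lucene segment data"}
    replica = {"segments_1": b"..."}
    print(files_to_send(primary, replica))    # ['_0.cfs']: only the missing file is copied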

Slide 45

Peer recovery: if you have to copy gigabytes of data, you’re going to have a bad time
• Shards in Elasticsearch are independent Lucene indices
• This means it is very unlikely that the files on disk are in sync, even when a replica is online
• Synced flush offsets this problem for idle shards
• Shards seeing active indexing are unlikely to match at the file level, so recovery is often a full recovery

Slide 46

A Different Approach

Slide 47

Operation-based approach
Stop thinking about files

Slide 48

Operation-based approach
Start thinking about operations

Slide 49

Operation-based approach
We need a reliable way to identify missing operations

Slide 50

Operation-based approach
Introducing sequence numbers
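A minimal sketch of what "introducing sequence numbers" means operationally, with invented names: the primary stamps every operation with the next number in a monotonically increasing sequence, and that number travels with the operation to the replicas.

    import itertools

    # The primary stamps each operation with a monotonically increasing sequence
    # number; the number is replicated alongside the operation to the replicas.
    class Primary:
        def __init__(self):
            self._seq = itertools.count(0)    # 0, 1, 2, ...
            self.ops = {}                     # sequence number -> operation

        def index(self, op):
            seq_no = next(self._seq)
            self.ops[seq_no] = op
            return seq_no

    p = Primary()
    print(p.index({"f": "text"}))             # 0
    print(p.index({"f": "other text"}))       # 1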

Slide 51

Get well soon!

Slide 52

Fast recovery
One primary, one replica (redux)

Slide 53

Fast recovery
Replica goes offline

Slide 54

Fast recovery
Indexing continues on the primary

Slide 55

Fast recovery
The return of the replica
• Replica and primary compare operations

Slide 56

Fast recovery
Special delivery for Mr. Replica
• Replica and primary compare operations
• Primary sends all missing operations to the replica

Slide 57

Fast recovery
Recovery is complete
• Replica and primary compare operations
• Primary sends all missing operations to the replica
• Replica is now in sync with the primary
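The three bullets above can be sketched in a few lines, assuming both sides can expose their history as a map from sequence number to operation (illustrative names, not the real recovery API): the primary replays only the operations the returning replica is missing.

    # Operation-based recovery sketch: replay only the missing operations.
    def recover(replica_ops: dict, primary_ops: dict) -> dict:
        """Both maps are sequence number -> operation; brings the replica in sync."""
        for seq_no, op in sorted(primary_ops.items()):
            if seq_no not in replica_ops:
                replica_ops[seq_no] = op      # the primary sends only what is missing
        return replica_ops

    primary_ops = {n: f"op{n}" for n in range(10)}    # operations 0..9
    replica_ops = {n: f"op{n}" for n in range(5)}     # replica went offline after 4
    recover(replica_ops, primary_ops)
    assert replica_ops == primary_ops                 # in sync, no gigabytes of files copied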

Slide 58

Shared history

Slide 59

History and concurrency
Assign sequence numbers on the primary

Slide 60

History and concurrency
Histories develop independently on each shard

Slide 61

History and concurrency
Concurrent requests can lead to divergent histories

Slide 62

History and concurrency
The plot thickens (primary has ops 0-9; replica1 has 0-4, 7, 9; replica2 has 0-6, 8, 9)

Slide 63

History and concurrency
Those who cannot remember the past are condemned to repeat it

Slide 64

History and concurrency: local checkpoints
A local checkpoint is maintained on each shard
• We need to track what part of history is complete
• The local checkpoint for a shard is the largest sequence number below which history is complete on that shard
• Persisted in each Lucene commit point
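A small sketch of the local checkpoint bookkeeping described above (the class name is invented): the checkpoint only advances over a contiguous prefix of processed sequence numbers, so a gap left by a concurrent, still-in-flight operation holds it back.

    # Local checkpoint sketch: it advances only over a contiguous prefix of
    # processed sequence numbers, so a gap holds it back until the gap is filled.
    class LocalCheckpoint:
        def __init__(self):
            self.processed = set()
            self.checkpoint = -1              # nothing processed yet

        def mark_processed(self, seq_no):
            self.processed.add(seq_no)
            while self.checkpoint + 1 in self.processed:
                self.checkpoint += 1

    lcp = LocalCheckpoint()
    for seq_no in (0, 1, 2, 4):               # 3 is still in flight
        lcp.mark_processed(seq_no)
    print(lcp.checkpoint)                     # 2: history is only complete up to 2
    lcp.mark_processed(3)
    print(lcp.checkpoint)                     # 4: the gap is filled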

Slide 65

History and concurrency
Local checkpoints in action

Slide 66

History and concurrency: global checkpoints
A global checkpoint is maintained on the primary
• We cannot keep history forever
• We need a global safety marker that tells us how far back to keep history
• The primary’s knowledge of the minimum local checkpoint across all in-sync shard copies can serve as this safe point
• Replicated from the primary to the replicas so each shard has local knowledge of the global checkpoint
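The global checkpoint described above is just a minimum over the in-sync copies' local checkpoints; a sketch with invented names:

    # Global checkpoint sketch: the primary takes the minimum local checkpoint over
    # the in-sync copies; everything at or below it is safely on every in-sync copy.
    def global_checkpoint(local_checkpoints: dict, in_sync: set) -> int:
        """local_checkpoints maps allocation id -> that copy's local checkpoint."""
        return min(local_checkpoints[a] for a in in_sync)

    local_checkpoints = {"primary": 9, "replica1": 6, "replica2": 8}
    print(global_checkpoint(local_checkpoints, {"primary", "replica1", "replica2"}))   # 6
    # The primary replicates this value so each copy has local knowledge of it,
    # and history older than the global checkpoint can eventually be discarded.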

Slide 67

History and concurrency
Local and global checkpoints in action (one copy's local checkpoint is lagging)

Slide 68

History and concurrency
I’m rolling, I’m rolling, I’m rolling (rollback)

Slide 69

History and concurrency
Re-sync from the promoted primary

Slide 70

History and concurrency
Fill gaps under the mandate of the new primary
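The last three slides describe rollback, re-sync, and gap filling after a primary failure. Below is a sketch under the assumption that gaps are filled with no-op markers (as the "filled gaps" label suggests) and that each surviving copy keeps its history up to the global checkpoint and takes everything above it from the newly promoted primary; all names are hypothetical.

    NOOP = "no-op"

    def fill_gaps(ops: dict) -> dict:
        """The newly promoted primary completes its own history up to its highest seq_no."""
        for seq_no in range(max(ops) + 1):
            ops.setdefault(seq_no, NOOP)      # fill missing slots, e.g. with no-ops
        return ops

    def resync(replica_ops: dict, primary_ops: dict, global_ckpt: int) -> dict:
        """A replica keeps history up to the global checkpoint, then follows the new primary."""
        kept = {s: op for s, op in replica_ops.items() if s <= global_ckpt}
        kept.update({s: op for s, op in primary_ops.items() if s > global_ckpt})
        return kept

    new_primary = fill_gaps({0: "a", 1: "b", 2: "c", 3: "d", 4: "e", 7: "h", 9: "j"})
    replica2 = {0: "a", 1: "b", 2: "c", 3: "d", 4: "e", 5: "f", 6: "g", 8: "i"}
    replica2 = resync(replica2, new_primary, global_ckpt=4)
    assert replica2 == new_primary            # every surviving copy ends with the same history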

Slide 71

Wait for it

Slide 72

Higher-level features
Sequence numbers enable big league features
• Fast replica recovery (6.0.0)
• Re-sync on primary failure (6.0.0)
• Cross-datacenter replication (6.x)
• Changes API (TBD)

Slide 73

The path ahead

Slide 74

Summary
• Elasticsearch uses a quorum-based algorithm for cluster state consensus, which is good for dealing with lots of nodes and light reads
• Elasticsearch uses the primary/backup model for data replication, which can run with one or two copies, supports concurrency, and saves on storage costs
• Elasticsearch 6.0.0 will introduce sequence numbers, which will enable higher-level features like faster recovery, the changes API, and cross-datacenter replication

Slide 75

Meta Issue

Slide 76

Formalizing the model
https://github.com/elastic/elasticsearch-tla and https://lamport.azurewebsites.net/tla/tla.html

\* Index request arrives on node n with document id docId
ClientRequest(n, docId) ==
    /\ n \notin crashedNodes                             \* only non-crashed nodes can accept requests
    /\ clusterStateOnNode[n].routingTable[n] = Primary   \* node believes itself to be the primary
    /\ LET replicas == Replicas(clusterStateOnNode[n].routingTable)
           primaryTerm == currentTerm[n]
           tlogEntry == [id |-> docId, term |-> primaryTerm, value |-> nextClientValue, pc |-> FALSE]
           seq == MaxSeq(tlog[n]) + 1

Slide 77

References
• Lamport, The Part-Time Parliament, ACM TOCS, 1998 (https://goo.gl/w0IfsB)
• Oki et al., Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems, PODC, 1988 (https://goo.gl/8ZxcNe)
• Ongaro et al., In Search of an Understandable Consensus Algorithm, USENIX ATC, 2014 (https://goo.gl/L5hXSh)
• Junqueira et al., Zab: High-performance broadcast for primary-backup systems, DSN, 2011 (https://goo.gl/CxtesU)
• Lin et al., PacificA: Replication in Log-Based Distributed Storage Systems, Microsoft Research Technical Report, 2008 (https://goo.gl/iW5FL3)
• Alsberg et al., A Principle for Resilient Sharing of Distributed Resources, ICSE, 1976 (https://goo.gl/5GmPFN)

Slide 78

Join us!
• There are only two hard problems in distributed systems:
  2. Exactly-once delivery
  1. Guaranteed order of messages
  2. Exactly-once delivery
• And hiring people!
• We’re looking: https://goo.gl/VjiNjZ

Slide 79

More Questions? Visit us at the AMA