
Consensus and Replication in Elasticsearch

Elastic Co
March 08, 2017


Consensus algorithms are foundational to distributed systems. Choosing among Paxos and its many variants ultimately determines the performance and fault-tolerance of the underlying system. Boaz, Jason, and Yannick will discuss the basic mechanics of quorum-based consensus algorithms as well as their tradeoffs compared to the primary-backup approach – both of which are used by Elasticsearch. They will show how these two layers work together to facilitate cluster state changes and data replication while guaranteeing speed and safety. They will finish with a deep dive into the data replication layer and the recent addition of sequence numbers, which are the foundation of faster operation-based recoveries and cross-data-center replication.

Boaz Leskes l Software Engineer l Elastic
Jason Tedor l Software Engineer l Elastic
Yannick Welsch l Software Engineer l Elastic



Transcript

  1. Elasticsearch Engineers
    March 8th, 2017
    Consistency and Replication in Elasticsearch
    Boaz Leskes
    @bleskes
    Yannick Welsch
    @ywelsch
    Jason Tedor
    @jasontedor


  2. 2
    The Perfect Cluster


  3. • Indexes lots of data fast
    • Returns near real time results
• Answers any question from anyone
    • Resilient to a reasonable
    amount of failures
    The Perfect Cluster
    3


  4. The Perfect Cluster
    4
    GET _search
    PUT index/type/1
    {
    "f": "text"
    }
    PUT index/type/2
    {
    "f": "other text"
    }
    GET _search


  5. The Perfect Cluster
    5
    GET _search
    PUT index/type/1
    {
    "f": "text"
    }
    PUT index/type/2
    {
    "f": "other text"
    }
    GET _search


  6. The Perfect Cluster
    6
    GET _search GET _search
    PUT index/type/1
    {
    "c": 1
    }
    PUT index/type/1
    {
    "c": 2
    }


  7. The Perfect Cluster
    7
    GET _search GET _search
    PUT index/type/1
    {
    "c": 1
    }
    PUT index/type/2
    {
    "c": 2
    }


  8. The Perfect Cluster
    8
    GET _search GET _search
    PUT index/type/1
    {
    "c": 1
    }
    PUT index/type/2
    {
    "c": 2
    }


  9. 9
    Quorum based algorithms
    • Paxos
    • Viewstamped Replication
    • Raft
    • Zab


  10. Quorum-based Algorithms
    10
    POST new_index
    {
    "settings": {}
    }
    200 OK


  11. Quorum-based Algorithms
    11
    200 OK
    POST new_index
    {
    "settings": {}
    }
    ?
    ?


  12. A QUORUM OF 2 IS 2
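A quick way to see why a quorum of 2 is 2: a strict-majority quorum of n copies needs floor(n/2) + 1 acknowledgements, so with only two copies you cannot tolerate losing either of them. A minimal sketch of the arithmetic (illustrative Python, not Elasticsearch code):

    # Strict-majority quorum sizes and the failures they tolerate.
    def quorum(n: int) -> int:
        """Smallest number of copies that forms a strict majority of n."""
        return n // 2 + 1

    for copies in (1, 2, 3, 5):
        q = quorum(copies)
        # With n copies and quorum q, up to n - q copies may fail.
        print(f"{copies} copies -> quorum {q}, tolerates {copies - q} failure(s)")

    # 2 copies -> quorum 2, tolerates 0 failures: a quorum of 2 is 2.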


  13. • Good for coordinating many nodes
• Good for light reads
    • Requires three copies or more
    Summary - Quorum Based Consistency Algorithms
    13
    • Good for Cluster State coordination
    • Problematic for storing data
    ZenDiscovery


  14. • Why do we need 3 copies exactly?
    • Can’t we operate without the extra copy?
    Retrospective - Quorums and Data
    14


  15. 15
    Primary - Backup
    Replication


  16. • well-known model for replicating data
    • main copy of the data called primary
    • additional backup copies
    • Elasticsearch:
    • primary and replica shards
    • number of replicas is configurable & dynamically adjustable
    • fully-automated shard management
    Primary - Backup replication
    16
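The replica count really is a live setting: it can be raised or lowered on an existing index through the index settings API. A minimal sketch using Python's requests library; the host, index name and new value are placeholders, not from the talk:

    # Dynamically adjust the number of replica shards of an existing index.
    import requests

    resp = requests.put(
        "http://localhost:9200/my_index/_settings",    # placeholder host and index
        json={"index": {"number_of_replicas": 2}},     # add or drop backup copies
    )
    resp.raise_for_status()
    print(resp.json())                                 # {'acknowledged': True}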


17. • data flows from primary to replicas
    • synchronous replication
    • write to all / read from any
    • can lose all but one copy
    Basic flow for write requests
    17
[Diagram: primary and replica copies of several shards spread across nodes 1-5]
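A toy model of the write path sketched on this slide: the primary indexes the operation locally, replicates it synchronously to every replica, and only then acknowledges; any copy can serve reads. This is a simplified illustration, not the actual Elasticsearch implementation:

    # Toy primary-backup write path: write to all copies, read from any.
    class Shard:
        def __init__(self):
            self.docs = {}

        def index(self, doc_id, doc):
            self.docs[doc_id] = doc

    class PrimaryShard(Shard):
        def __init__(self, replicas):
            super().__init__()
            self.replicas = replicas

        def handle_write(self, doc_id, doc):
            self.index(doc_id, doc)            # index locally first
            for replica in self.replicas:      # then replicate synchronously
                replica.index(doc_id, doc)
            return "acknowledged"              # ack only once all copies have it

    replica = Shard()
    primary = PrimaryShard([replica])
    primary.handle_write("1", {"f": "text"})
    assert replica.docs["1"] == {"f": "text"}  # read from any copy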


  18. What if a replica is unavailable?
    18
[Diagram: shard 0 with primary on node 2 and replica on node 3; nodes 1 and 4 hold no copy]
1 primary, 1 replica, containing the same data


  19. What if a replica is unavailable?
    19
[Diagram: node 3 and its replica copy of shard 0 go offline]
    Take down node 3 for maintenance


  20. What if a replica is unavailable?
    20
[Diagram: a write is indexed into the primary on node 2 while node 3 is down]
    Index into primary


  21. What if a replica is unavailable?
    21
[Diagram: the offline replica on node 3 does not receive the acknowledged write]
    Replica misses acknowledged write


  22. What if a replica is unavailable?
    22
[Diagram: node 2, which holds the primary, crashes]
    Node with primary crashes


  23. What if a replica is unavailable?
    23
[Diagram: node 3 comes back up with its stale copy of shard 0]
    Node with stale shard copy comes back up


  24. What if a replica is unavailable?
    24
[Diagram: the stale copy on node 3 is the only copy left]
    DATA LOSS
    Stale shard copy should not become primary


  25. • Master is responsible for allocating shards
    • Decision recorded in cluster state
    • Broadcasted to all the nodes
    • Smart routing of requests based on cluster state
    Shard allocation basics
    25
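Smart routing works because every node derives the same shard for a given document and then uses the cluster state to find a node holding a copy of that shard. A sketch of the shard-selection step; Elasticsearch itself hashes the _routing value (the document id by default) with Murmur3, so md5 here is just a stand-in:

    # Sketch: deterministic document-to-shard routing.
    import hashlib

    def shard_for(routing: str, num_primary_shards: int) -> int:
        digest = hashlib.md5(routing.encode()).digest()    # stand-in hash function
        return int.from_bytes(digest[:4], "big") % num_primary_shards

    # Every node computes the same shard for document id "1", then routes the
    # request to a node listed for that shard in the cluster state.
    print(shard_for("1", 5))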


  26. What if a replica is unavailable?
    26
[Diagram: the master asks "Who has a copy of shard 0?" and "Should I pick the copy on node 3 as primary?"]
    Choose wisely


  27. 27
• Allocation IDs uniquely identify shard copies
  • assigned by master
  • stored next to shard data
• Master tracks subset of copies that are in-sync
  • persisted in cluster state
  • changes are backed by consensus layer
In-sync allocations
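A sketch of the bookkeeping described above: the master hands out allocation IDs and keeps the in-sync set in the cluster state; a copy that missed an acknowledged write is removed from the set, so a stale copy can never be promoted. Class and method names are illustrative, not Elasticsearch's:

    # Sketch: master-side tracking of in-sync shard copies by allocation ID.
    import uuid

    class ClusterState:
        def __init__(self):
            self.in_sync = set()           # allocation IDs safe to promote to primary

        def allocate_copy(self) -> str:
            allocation_id = uuid.uuid4().hex[:5]
            self.in_sync.add(allocation_id)
            return allocation_id

        def mark_stale(self, allocation_id: str):
            # Requested by the primary before acknowledging a write this copy missed.
            self.in_sync.discard(allocation_id)

        def can_promote(self, allocation_id: str) -> bool:
            return allocation_id in self.in_sync

    state = ClusterState()
    primary_id, replica_id = state.allocate_copy(), state.allocate_copy()
    state.mark_stale(replica_id)               # replica missed an acknowledged write
    assert not state.can_promote(replica_id)   # stale copy must not become primary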


  28. Tracking in-sync shard copies
    28
[Diagram: shard 0 primary (allocation ID 9154f) and replica (6f91c) on nodes 2 and 3; every node's cluster state lists in-sync: [9154f, 6f91c]]
1 primary, 1 replica, this time with allocation IDs


  29. Tracking in-sync shard copies
    29
[Diagram: node 3 goes down; the cluster state still lists in-sync: [9154f, 6f91c]]
Take down node 3 for maintenance


  30. Tracking in-sync shard copies
    30
[Diagram: a write is indexed into the primary while node 3 is down; in-sync: [9154f, 6f91c]]
Index into primary


  31. Tracking in-sync shard copies
    31
[Diagram: the primary asks the master to remove allocation ID 6f91c from the in-sync set]
Master, please mark this other shard copy as stale


  32. Tracking in-sync shard copies
    32
[Diagram: the master publishes the updated cluster state with in-sync: [9154f] to all nodes]
Everyone should know about in-sync copies


  33. Tracking in-sync shard copies
    33
[Diagram: with the cluster state updated to in-sync: [9154f], the write is acknowledged to the client]
Acknowledge write to client


  34. Tracking in-sync shard copies
    34
[Diagram: node 2, which holds the primary (9154f), crashes; in-sync: [9154f]]
Node 2 crashes


  35. Tracking in-sync shard copies
    35
[Diagram: master: "Shard copy with id 6f91c is not in the in-sync list", so the stale copy is not promoted]
Wiser master makes smarter decisions


  36. • data replication based on primary-backup model
    • number of replicas configurable and dynamically adjustable
    • also works with just two shard copies
    • backed by cluster consensus to ensure safety
    • writes synchronous and go to all copies, read from any copy
    Summary
    36


  37. 37
    Staying in sync


  38. Primary/Backup (super quick review)
    38
    One primary, one replica
    primary replica


  39. Primary/Backup (super quick review)
    39
    Indexing request hits primary
    primary replica


  40. Primary/Backup (super quick review)
    40
    After primary indexes, send replicas request
    primary replica


  41. Primary/Backup (super quick review)
    41
    An offline replica will miss replica requests
    primary replica


  42. Primary/Backup (super quick review)
    42
    Help, I’ve fallen out of sync and I can’t get up
    primary replica


  43. 43
    Mind of their own


44. • Peer recovery in Elasticsearch is file-based
    • Upon recovery, the primary and replica compare lists of files (and checksums)
    • The primary sends the replica the files that the replica is missing
    • When recovery completes, the two shard copies have the same files on disk
    Peer recovery
    44
    Bring yourself back online
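A sketch of the file-based comparison: the replica reports its files and checksums, and the primary sends whatever differs or is missing. File names and the checksum function are illustrative:

    # Sketch: file-based peer recovery copies files whose checksum differs.
    import zlib

    def files_to_send(primary_files: dict, replica_files: dict) -> list:
        replica_sums = {name: zlib.crc32(data) for name, data in replica_files.items()}
        return [
            name
            for name, data in primary_files.items()
            if replica_sums.get(name) != zlib.crc32(data)
        ]

    primary = {"segments_1": b"...", "_0.cfs": b"lucene data", "_1.cfs": b"more data"}
    replica = {"segments_1": b"...", "_0.cfs": b"stale data"}
    print(files_to_send(primary, replica))     # ['_0.cfs', '_1.cfs']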


  45. • Shards in Elasticsearch are independent Lucene indices
    • This means that it is very unlikely that the files on disk are in sync, even when a replica is online
• Synced flush mitigates this problem for idle shards
    • Shards seeing active indexing are unlikely to have identical files, so recovery is often a full copy of the shard
    Peer recovery
    45
    If you have to copy gigabytes of data, you’re going to have a bad time


  46. 46
    A Different
    Approach


  47. Operation-based approach
    47
    Stop thinking about files
    primary replica


  48. Operation-based approach
    48
    Start thinking about operations
    primary replica


  49. Operation-based approach
    49
    We need a reliable way to identify missing operations
    primary replica


  50. Operation-based approach
    50
    Introducing sequence numbers
[Diagram: operations on the primary and replica are tagged with sequence numbers 0, 1, 2, ...]
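The core idea: the primary stamps every operation with a monotonically increasing sequence number, and that number travels with the operation to the replicas. A minimal sketch, not the real implementation:

    # Sketch: the primary assigns each operation a monotonically increasing seq no.
    class Primary:
        def __init__(self):
            self.next_seq_no = 0
            self.ops = {}                      # seq_no -> operation

        def index(self, op):
            seq_no = self.next_seq_no
            self.next_seq_no += 1
            self.ops[seq_no] = op
            return seq_no                      # replicated alongside the operation

    p = Primary()
    print([p.index({"doc": d}) for d in ("a", "b", "c")])   # [0, 1, 2]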


  51. 51
    Get well soon!


  52. Fast recovery
    52
    One primary, one replica (redux)
[Diagram: primary and replica both hold operations 0-4]


  53. Fast recovery
    53
    Replica goes offline
[Diagram: the replica, still holding operations 0-4, goes offline]


  54. Fast recovery
    54
    Indexing continues on the primary
[Diagram: the primary now holds operations 0-9; the offline replica still holds only 0-4]


  55. Fast recovery
    55
    The return of the replica
[Diagram: the replica (operations 0-4) comes back and talks to the primary (operations 0-9)]
• Replica and primary compare operations


  56. Fast recovery
    56
    Special delivery for Mr. Replica
[Diagram: the primary sends operations 5-9 to the returning replica]
• Replica and primary compare operations
• Primary sends all missing operations to the replica


  57. Fast recovery
    57
    Recovery is complete
[Diagram: primary and replica both hold operations 0-9 again]
• Replica and primary compare operations
• Primary sends all missing operations to the replica
• Replica is now in sync with the primary
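Recovery then becomes "send me what I have not seen": the returning replica tells the primary which operations it already has, and the primary replays only the missing ones. A simplified sketch; the real mechanism compares checkpoints and replays from the primary's translog:

    # Sketch: operation-based recovery replays only the operations the replica lacks.
    def recover(primary_ops: dict, replica_ops: dict) -> dict:
        missing = {seq: op for seq, op in primary_ops.items() if seq not in replica_ops}
        replica_ops.update(missing)            # replay missing operations on the replica
        return missing

    primary_ops = {seq: f"op{seq}" for seq in range(10)}    # 0..9
    replica_ops = {seq: f"op{seq}" for seq in range(5)}     # 0..4, was offline
    replayed = recover(primary_ops, replica_ops)
    print(sorted(replayed))                    # [5, 6, 7, 8, 9]
    assert replica_ops == primary_ops          # replica is back in sync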


  58. 58
    Shared history


  59. History and concurrency
    59
    Assign sequence numbers on the primary
[Diagram: a primary and two replicas; operation 0 is assigned sequence number 0 on the primary and replicated]


  60. History and concurrency
    60
    Histories develop independently on each shard
[Diagram: the copies apply operations at different times, so each shard's history develops independently]


  61. History and concurrency
    61
    Concurrent requests can lead to divergent histories
[Diagram: with concurrent requests, the copies hold different subsets of operations 0-2 at any point in time]


  62. History and concurrency
    62
    The plot thickens
[Diagram: the primary holds operations 0-9; replica1 holds 0-4, 7 and 9; replica2 holds 0-6, 8 and 9]


  63. History and concurrency
    63
    Those who cannot remember the past are condemned to repeat it
[Diagram: the old primary (operations 0-9) is gone; replica1, promoted to primary, holds 0-4, 7 and 9; replica2 holds 0-6, 8 and 9]


  64. • We need to track what part of history is complete
    • The local checkpoint for a shard is the largest sequence number below which history is complete on that shard
    • Persisted in each Lucene commit point
    History and concurrency
    64
    Local checkpoints
    A local checkpoint is maintained on each shard
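In other words, the local checkpoint sits at the top of the gap-free prefix of a shard's history: every operation at or below it has been processed. A small sketch of that computation:

    # Sketch: local checkpoint = highest seq no with no gaps below it.
    def local_checkpoint(seq_nos: set) -> int:
        checkpoint = -1                        # -1 means nothing processed yet
        while checkpoint + 1 in seq_nos:       # advance while history has no gaps
            checkpoint += 1
        return checkpoint

    print(local_checkpoint({0, 1, 2, 3, 4, 7, 9}))   # 4 - operations 5 and 6 missing
    print(local_checkpoint(set(range(10))))          # 9 - history complete up to 9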


  65. History and concurrency
    65
    Local checkpoints in action
[Diagram: each shard copy's local checkpoint marked at the top of its gap-free prefix of history]


66. • We cannot keep history forever
    • We need a global safety marker that tells us how far back to keep history
    • The primary's knowledge of the minimum of the local checkpoints across all in-sync shard copies can serve as this safe point
    • Replicated from the primary to the replicas so each shard has local knowledge of the global checkpoint
    History and concurrency
    66
    Global checkpoints
    A global checkpoint is maintained on the primary
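The global checkpoint is then simply the minimum of the local checkpoints the primary knows for its in-sync copies: everything at or below it is on every in-sync copy, and history above it must be retained for possible re-sync. A sketch with illustrative numbers:

    # Sketch: global checkpoint = min local checkpoint across in-sync copies,
    # tracked on the primary and broadcast back to the replicas.
    def global_checkpoint(local_checkpoints: dict) -> int:
        return min(local_checkpoints.values())

    in_sync_local_checkpoints = {"primary": 9, "replica1": 4, "replica2": 6}
    print(global_checkpoint(in_sync_local_checkpoints))   # 4 - keep history above 4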


  67. History and concurrency
    67
    Local and global checkpoints in action
[Diagram: local checkpoints marked on each copy, plus the global checkpoint, which lags behind them]


  68. History and concurrency
    68
    I’m rolling, I’m rolling, I’m rolling
[Diagram: operations above the global checkpoint are rolled back on the shard copies]


  69. History and concurrency
    69
    Re-sync from the promoted primary
[Diagram: the promoted primary (0-4, 7, 9) re-syncs its operations above the global checkpoint to replica2, which now also holds 0-4, 7 and 9]


  70. History and concurrency
    70
    Fill gaps under the mandate of the new primary
[Diagram: the new primary and the replica fill gaps 5, 6 and 8 so that history is complete on both copies]
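Putting the checkpoints to work on primary failure, roughly: a replica is promoted, the other copies take the new primary's history above the global checkpoint (discarding anything that diverged), and the new primary fills its own gaps; the real implementation fills gaps with no-op operations and bumps the primary term. A rough sketch under those assumptions:

    # Rough sketch of failover: re-sync above the global checkpoint, then fill gaps.
    def promote_and_resync(new_primary: dict, replica: dict, global_cp: int):
        # The replica drops its own operations above the global checkpoint and
        # takes the newly promoted primary's operations instead.
        for seq in [s for s in replica if s > global_cp]:
            del replica[seq]
        replica.update({s: op for s, op in new_primary.items() if s > global_cp})

    def fill_gaps(ops: dict):
        # Stand-in for filling gaps with no-op operations.
        for seq in range(max(ops)):
            ops.setdefault(seq, "no-op")

    new_primary = {s: f"op{s}" for s in [0, 1, 2, 3, 4, 7, 9]}
    replica2    = {s: f"op{s}" for s in [0, 1, 2, 3, 4, 5, 6, 8, 9]}
    promote_and_resync(new_primary, replica2, global_cp=4)
    fill_gaps(new_primary)
    fill_gaps(replica2)
    print(sorted(new_primary) == sorted(replica2) == list(range(10)))   # True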


  71. 71
    Wait for it


  72. • Fast replica recovery (6.0.0)
    • Re-sync on primary failure (6.0.0)
    • Cross-datacenter replication (6.x)
    • Changes API (tbd)
    Higher-level features
    72
Sequence numbers enable big-league features


  73. 73
    The path ahead


74. • Elasticsearch uses a quorum-based algorithm for cluster state consensus, which is good for dealing with lots of nodes and light reads
    • Elasticsearch uses the primary/backup model for data replication, which can run with one or two copies, supports concurrency, and saves on storage costs
    • Elasticsearch 6.0.0 will introduce sequence numbers, which will enable high-level features like faster recovery, the changes API, and cross-datacenter replication
    Summary
    74


  75. 75
    Meta Issue


  76. 76
\* Index request arrives on node n with document id docId
ClientRequest(n, docId) ==
  /\ n \notin crashedNodes \* only non-crashed nodes can accept requests
  /\ clusterStateOnNode[n].routingTable[n] = Primary \* node believes itself to be the primary
  /\ LET
       replicas    == Replicas(clusterStateOnNode[n].routingTable)
       primaryTerm == currentTerm[n]
       tlogEntry   == [id    |-> docId,
                       term  |-> primaryTerm,
                       value |-> nextClientValue,
                       pc    |-> FALSE]
       seq         == MaxSeq(tlog[n]) + 1
    https://github.com/elastic/elasticsearch-tla and https://lamport.azurewebsites.net/tla/tla.html
    Formalizing the model


77. • Lamport, The Part-Time Parliament, ACM TCS, 1998 (https://goo.gl/w0IfsB)
    • Oki et al., Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems, SPDC, 1988 (https://goo.gl/8ZxcNe)
    • Ongaro et al., In Search of an Understandable Consensus Algorithm, ATC, 2014 (https://goo.gl/L5hXSh)
    • Junqueira et al., Zab: High-performance broadcast for primary-backup systems, DSN, 2011 (https://goo.gl/CxtesU)
    • Lin et al., PacificA: Replication in Log-Based Distributed Storage Systems, MSR-TR, 2008 (https://goo.gl/iW5FL3)
    • Alsberg et al., A Principle for Resilient Sharing of Distributed Resources, ICSE, 1976 (https://goo.gl/5GmPFN)
    References
    77


78. • There are only two hard problems in distributed systems: 2. Exactly-once delivery 1. Guaranteed order of messages 2. Exactly-once delivery
    • And hiring people!
    • We’re looking: https://goo.gl/VjiNjZ
    Join us!
    78


  79. 79
    More Questions?
    Visit us at the AMA
