
Consensus and Replication in Elasticsearch

Elastic Co
March 08, 2017


Consensus algorithms are foundational to distributed systems. Choosing among Paxos and its many variants ultimately determines the performance and fault-tolerance of the underlying system. Boaz, Jason, and Yannick will discuss the basic mechanics of quorum-based consensus algorithms as well as their tradeoffs compared to the primary-backup approach – both of which are used by Elasticsearch. They will show how these two layers work together to facilitate cluster state changes and data replication while guaranteeing speed and safety. They will finish with a deep dive into the data replication layer and the recent addition of sequence numbers, which are the foundation of faster operation-based recoveries and cross-data-center replication.

Boaz Leskes | Software Engineer | Elastic
Jason Tedor | Software Engineer | Elastic
Yannick Welsch | Software Engineer | Elastic


Transcript

  1. Elasticsearch Engineers, March 8th, 2017: Consistency and Replication in Elasticsearch. Boaz Leskes @bleskes, Yannick Welsch @ywelsch, Jason Tedor @jasontedor
  2. The Perfect Cluster (slide 3)
     • Indexes lots of data fast
     • Returns near-real-time results
     • Ask me anything
     • Resilient to a reasonable amount of failures
  3. The Perfect Cluster (slide 4)
     GET _search
     PUT index/type/1 { "f": "text" }
     PUT index/type/2 { "f": "other text" }
     GET _search
  4. The Perfect Cluster (slide 5)
     GET _search
     PUT index/type/1 { "f": "text" }
     PUT index/type/2 { "f": "other text" }
     GET _search
  5. Summary - Quorum Based Consistency Algorithms (slide 13)
     • Good for coordinating many nodes
     • Good for light reads
     • Requires three copies or more (see the quorum arithmetic below)
     ZenDiscovery:
     • Good for Cluster State coordination
     • Problematic for storing data
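     A minimal sketch of the quorum arithmetic behind the "three copies or more" point (a generic majority-quorum calculation, not code from the talk): with n copies a majority needs floor(n/2) + 1 votes, so tolerating f failures requires n = 2f + 1 copies, and tolerating even one failure already means three.

        def quorum_size(n: int) -> int:
            """Smallest majority of n copies."""
            return n // 2 + 1

        def copies_needed(f: int) -> int:
            """Copies required so a majority survives f failures."""
            return 2 * f + 1

        assert quorum_size(3) == 2     # with 3 copies, any 2 form a majority
        assert copies_needed(1) == 3   # tolerating 1 failure already needs 3 copies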
  6. Retrospective - Quorums and Data (slide 14)
     • Why do we need 3 copies exactly?
     • Can’t we operate without the extra copy?
  7. Primary - Backup replication (slide 16)
     • well-known model for replicating data
     • main copy of the data called primary
     • additional backup copies
     • Elasticsearch:
       • primary and replica shards
       • number of replicas is configurable & dynamically adjustable
       • fully-automated shard management
  8. Basic flow for write requests (slide 17)
     • data flows from primary to replicas
     • synchronous replication
     • write to all / read from any (see the sketch below)
     • can lose all but one copy
     (diagram: documents flowing from the primary shard copy to its replica copies across nodes 1-5)
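     A minimal sketch of the write-to-all / read-from-any flow described above (class and method names are illustrative, not Elasticsearch's):

        import random

        class ShardCopy:
            """One copy of a shard (illustrative)."""
            def __init__(self):
                self.docs = {}

            def index(self, doc_id, doc):
                self.docs[doc_id] = doc

        class PrimaryBackupGroup:
            def __init__(self, primary, replicas):
                self.primary = primary
                self.replicas = replicas

            def write(self, doc_id, doc):
                # synchronous replication: index on the primary, then on every
                # replica, and only then acknowledge the write to the client
                self.primary.index(doc_id, doc)
                for replica in self.replicas:
                    replica.index(doc_id, doc)
                return "acknowledged"

            def read(self, doc_id):
                # read from any copy: every copy holds each acknowledged write,
                # so losing all but one copy still preserves the data
                copy = random.choice([self.primary] + self.replicas)
                return copy.docs.get(doc_id)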
  9. What if a replica is unavailable? (slide 18)
     1 primary, 1 replica, contain same data
     (diagram: shard 0 has copies on node 2 and node 3)
  10. What if a replica is unavailable? (slide 19)
     Take down node 3 for maintenance
  11. What if a replica is unavailable? (slide 20)
     Index into primary
  12. What if a replica is unavailable? (slide 21)
     Replica misses acknowledged write
  13. What if a replica is unavailable? (slide 22)
     Node with primary crashes
  14. What if a replica is unavailable? (slide 23)
     Node with stale shard copy comes back up
  15. What if a replica is unavailable? (slide 24)
     DATA LOSS: stale shard copy should not become primary
  16. Shard allocation basics (slide 25)
     • Master is responsible for allocating shards
     • Decision recorded in cluster state
     • Broadcast to all the nodes
     • Smart routing of requests based on cluster state (see the routing sketch below)
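     Smart routing works because every node can compute, from the cluster state alone, which shard a document belongs to and which nodes hold copies of it. A minimal sketch of that idea (Elasticsearch hashes the routing value, by default the document id, with Murmur3; md5 stands in here only to keep the sketch dependency-free, and the routing table below is hypothetical):

        import hashlib

        def shard_for(doc_id: str, num_primary_shards: int) -> int:
            # hash the routing value and take it modulo the primary shard count
            h = int(hashlib.md5(doc_id.encode()).hexdigest(), 16)
            return h % num_primary_shards

        # any coordinating node looks up the shard's copies in its cluster state
        routing_table = {0: ["node 2", "node 3"], 1: ["node 4", "node 1"]}
        print(routing_table[shard_for("my-doc", 2)])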
  17. What if a replica is unavailable? (slide 26)
     Who has a copy of shard 0? Should I pick the copy on node 3 as primary? Choose wisely
  18. In-sync allocations (slide 27)
     • Allocation IDs uniquely identify shard copies
       • assigned by master
       • stored next to shard data
     • Master tracks subset of copies that are in-sync (see the sketch below)
       • persisted in cluster state
       • changes are backed by consensus layer
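     A minimal sketch of the bookkeeping this enables (illustrative names, not Elasticsearch classes): the primary asks the master to drop the allocation ID of a copy that missed an acknowledged write, and the master only ever promotes a copy whose ID is still in the in-sync set.

        class InSyncTracker:
            def __init__(self, allocation_ids):
                self.in_sync = set(allocation_ids)   # persisted in the cluster state

            def mark_stale(self, allocation_id):
                # requested by the primary before it acknowledges the write;
                # the change goes through the cluster-state consensus layer
                self.in_sync.discard(allocation_id)

            def can_promote(self, allocation_id):
                # only in-sync copies may become primary, so a stale copy
                # can never be promoted and lose acknowledged writes
                return allocation_id in self.in_sync

        tracker = InSyncTracker(["9154f", "6f91c"])
        tracker.mark_stale("6f91c")              # replica on node 3 missed a write
        assert not tracker.can_promote("6f91c")
        assert tracker.can_promote("9154f")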
  19. Tracking in-sync shard copies (slide 28)
     1 primary, 1 replica, this time with allocation IDs
     (diagram: shard 0 copies 9154f and 6f91c; every node's cluster state lists in-sync: [9154f, 6f91c])
  20. Tracking in-sync shard copies (slide 29)
     Take down node 3 for maintenance
  21. Tracking in-sync shard copies (slide 30)
     Index into primary
  22. Tracking in-sync shard copies (slide 31)
     Master, please mark this other shard copy as stale (remove 6f91c)
  23. Tracking in-sync shard copies (slide 32)
     Everyone should know about in-sync copies: publish cluster state with in-sync: [9154f]
  24. Tracking in-sync shard copies (slide 33)
     Acknowledge write to client
  25. Tracking in-sync shard copies (slide 34)
     Node 2 crashes
  26. Tracking in-sync shard copies (slide 35)
     Wiser master makes smarter decisions: shard copy with id 6f91c is not in the in-sync list
  27. Summary (slide 36)
     • data replication based on primary-backup model
     • number of replicas configurable and dynamically adjustable
     • also works with just two shard copies
     • backed by cluster consensus to ensure safety
     • writes synchronous and go to all copies, read from any copy
  28. Primary/Backup (super quick review) (slide 42)
     Help, I’ve fallen out of sync and I can’t get up
     (diagram: primary and replica)
  29. Peer recovery (slide 44) - Bring yourself back online
     • Peer recovery in Elasticsearch is file-based
     • Upon recovery, the primary and replica compare lists of files (and checksums); see the sketch below
     • The primary sends to the replica the files that the replica is missing
     • When recovery completes, the two shards have the same files on disk
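     A minimal sketch of that file-level comparison (illustrative; real recoveries compare Lucene segment files using name, length, and checksum from the store metadata):

        def files_to_send(primary_files: dict, replica_files: dict) -> list:
            """Both arguments map file name -> checksum; return the primary files
            the replica is missing or holds with a different checksum."""
            return [
                name
                for name, checksum in primary_files.items()
                if replica_files.get(name) != checksum
            ]

        primary = {"segments_5": "a1", "_0.cfs": "b2", "_1.cfs": "c3"}
        replica = {"segments_4": "z9", "_0.cfs": "b2"}
        print(files_to_send(primary, replica))   # ['segments_5', '_1.cfs']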
  30. Peer recovery (slide 45) - If you have to copy gigabytes of data, you’re going to have a bad time
     • Shards in Elasticsearch are independent Lucene indices
     • This means that it is very unlikely that the files on disk are in sync, even when a replica is online
     • Synced flush offsets this problem for idle shards
     • Shards seeing active indexing are unlikely to match at the file level, so recovery is often a full copy
  31. Fast recovery (slide 55) - The return of the replica
     • Replica and primary compare operations
     (diagram: the replica holds operations 0-4; the primary holds operations 0-9)
  32. Fast recovery (slide 56) - Special delivery for Mr. Replica
     • Replica and primary compare operations
     • Primary sends all missing operations to the replica
  33. Fast recovery (slide 57) - Recovery is complete
     • Replica and primary compare operations
     • Primary sends all missing operations to the replica
     • Replica is now in sync with the primary (see the sketch below)
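     A minimal sketch of such an operation-based catch-up (illustrative, keyed by sequence number; not Elasticsearch's recovery code):

        def recover(primary_ops: dict, replica_ops: dict) -> None:
            """Both dicts map sequence number -> operation; replay onto the
            replica every operation the primary has and the replica lacks."""
            for seq in sorted(primary_ops):
                if seq not in replica_ops:
                    replica_ops[seq] = primary_ops[seq]   # primary resends the operation

        primary_ops = {seq: ("index", f"doc-{seq}") for seq in range(10)}   # ops 0-9
        replica_ops = {seq: primary_ops[seq] for seq in range(5)}           # replica stopped at op 4
        recover(primary_ops, replica_ops)
        assert replica_ops == primary_ops   # replica is now in sync with the primary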
  34. History and concurrency (slide 61) - Concurrent requests can lead to divergent histories
     (diagram: the primary holds operations 0-2, replica1 holds 0 and 2, replica2 holds 0 and 1)
  35. History and concurrency (slide 62) - The plot thickens
     (diagram: replica1 holds operations 0-4, 7 and 9; the primary holds 0-9; replica2 holds 0-6, 8 and 9)
  36. History and concurrency (slide 63) - Those who cannot remember the past are condemned to repeat it
     (diagram: the newly promoted primary holds 0-4, 7 and 9; the old primary held 0-9; replica2 holds 0-6, 8 and 9)
  37. History and concurrency (slide 64) - Local checkpoints
     A local checkpoint is maintained on each shard
     • We need to track what part of history is complete
     • The local checkpoint for a shard is the largest sequence number below which history is complete on that shard (see the sketch below)
     • Persisted in each Lucene commit point
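     A minimal sketch of how a local checkpoint advances as operations complete on a shard copy (illustrative, not Elasticsearch's implementation):

        def local_checkpoint(completed_seq_nos: set) -> int:
            """Largest sequence number n such that every operation <= n has
            completed on this copy; -1 if nothing contiguous has completed."""
            checkpoint = -1
            while checkpoint + 1 in completed_seq_nos:
                checkpoint += 1
            return checkpoint

        # this copy has processed 0-4 plus 7 and 9; 5 and 6 are still missing,
        # so its history is only complete up to 4
        assert local_checkpoint({0, 1, 2, 3, 4, 7, 9}) == 4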
  38. History and concurrency (slide 65) - Local checkpoints in action
     (diagram: each copy's local checkpoint sits at the end of its contiguous prefix of operations)
  39. History and concurrency (slide 66) - Global checkpoints
     A global checkpoint is maintained on the primary
     • We cannot keep history forever
     • We need a global safety marker that tells us how far back to keep history
     • The primary’s knowledge of the minimum local checkpoint across all in-sync shard copies can serve as this safe point (see the sketch below)
     • Replicated from the primary to the replicas so each shard has local knowledge of the global checkpoint
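     A minimal sketch of that safety marker (illustrative values):

        def global_checkpoint(local_checkpoints: dict) -> int:
            """Minimum local checkpoint across the in-sync copies: everything at
            or below it is present on every in-sync copy, so older history can
            safely be discarded."""
            return min(local_checkpoints.values())

        # local checkpoints of the in-sync copies as reported to the primary
        assert global_checkpoint({"primary": 9, "replica1": 4, "replica2": 6}) == 4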
  40. History and concurrency (slide 67) - Local and global checkpoints in action
     (diagram: the global checkpoint lags behind the copies' local checkpoints)
  41. History and concurrency (slide 68) - I’m rolling, I’m rolling, I’m rolling
     (diagram: operations above the global checkpoint are rolled back)
  42. History and concurrency (slide 69) - Re-sync from the promoted primary
     (diagram: the promoted primary replays its operations above the global checkpoint to replica2)
  43. History and concurrency (slide 70) - Fill gaps under the mandate of the new primary
     (diagram: the gaps at sequence numbers 5, 6 and 8 are filled on the new primary and its replica; see the sketch below)
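     A minimal sketch of the failover sequence on these slides, under the stated assumptions (illustrative code): the newly promoted primary fills the gaps in its own history and then replays everything above the global checkpoint to the replica, which adopts the new primary's version of those operations.

        NO_OP = ("noop",)

        def fill_gaps(ops: dict) -> None:
            """The new primary plugs holes in its history up to its highest seq#."""
            for seq in range(max(ops) + 1):
                ops.setdefault(seq, NO_OP)

        def resync(primary_ops: dict, replica_ops: dict, global_checkpoint: int) -> None:
            """Replay the new primary's history above the global checkpoint;
            the replica overwrites whatever it holds for those sequence numbers."""
            for seq in sorted(primary_ops):
                if seq > global_checkpoint:
                    replica_ops[seq] = primary_ops[seq]

        new_primary = {s: ("index", s) for s in [0, 1, 2, 3, 4, 7, 9]}   # promoted copy
        replica2 = {s: ("index", s) for s in [0, 1, 2, 3, 4, 5, 6, 8, 9]}
        gcp = 4                                                          # global checkpoint

        fill_gaps(new_primary)               # 5, 6 and 8 become no-ops
        resync(new_primary, replica2, gcp)   # replica adopts the new primary's history
        assert replica2 == new_primary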
  44. Higher-level features (slide 72) - Sequence numbers enable big league features
     • Fast replica recovery (6.0.0)
     • Re-sync on primary failure (6.0.0)
     • Cross-datacenter replication (6.x)
     • Changes API (tbd)
  45. Summary (slide 74)
     • Elasticsearch uses a quorum-based algorithm for cluster state consensus, which is good for dealing with lots of nodes and light reads
     • Elasticsearch uses the primary/backup model for data replication, which can run with one or two copies, supports concurrency, and saves on storage costs
     • Elasticsearch 6.0.0 will introduce sequence numbers, which will enable high-level features like faster recovery, the changes API, and cross-datacenter replication
  46. Formalizing the model (slide 76)

     \* Index request arrives on node n with document id docId
     ClientRequest(n, docId) ==
         /\ n \notin crashedNodes                            \* only non-crashed nodes can accept requests
         /\ clusterStateOnNode[n].routingTable[n] = Primary  \* node believes itself to be the primary
         /\ LET replicas    == Replicas(clusterStateOnNode[n].routingTable)
                primaryTerm == currentTerm[n]
                tlogEntry   == [id |-> docId, term |-> primaryTerm, value |-> nextClientValue, pc |-> FALSE]
                seq         == MaxSeq(tlog[n]) + 1

     https://github.com/elastic/elasticsearch-tla and https://lamport.azurewebsites.net/tla/tla.html
  47. References (slide 77)
     • Lamport, The Part-Time Parliament, ACM TOCS, 1998 (https://goo.gl/w0IfsB)
     • Oki et al., Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems, PODC, 1988 (https://goo.gl/8ZxcNe)
     • Ongaro et al., In Search of an Understandable Consensus Algorithm, USENIX ATC, 2014 (https://goo.gl/L5hXSh)
     • Junqueira et al., Zab: High-performance broadcast for primary-backup systems, DSN, 2011 (https://goo.gl/CxtesU)
     • Lin et al., PacificA: Replication in Log-Based Distributed Storage Systems, MSR Technical Report, 2008 (https://goo.gl/iW5FL3)
     • Alsberg et al., A Principle for Resilient Sharing of Distributed Resources, ICSE, 1976 (https://goo.gl/5GmPFN)
  48. Join us! (slide 78)
     • There are only two hard problems in distributed systems:
       2. Exactly-once delivery
       1. Guaranteed order of messages
       2. Exactly-once delivery
     • And hiring people!
     • We’re looking: https://goo.gl/VjiNjZ