
Consensus and Replication in Elasticsearch

Elastic Co
March 08, 2017

Consensus algorithms are foundational to distributed systems. Choosing among Paxos and its many variants ultimately determines the performance and fault-tolerance of the underlying system. Boaz, Jason, and Yannick will discuss the basic mechanics of quorum-based consensus algorithms as well as their tradeoffs compared to the primary-backup approach – both of which are used by Elasticsearch. They will show how these two layers work together to facilitate cluster state changes and data replication while guaranteeing speed and safety. They will finish with a deep dive into the data replication layer and the recent addition of sequence numbers, which are the foundation of faster operation-based recoveries and cross-data-center replication.

Boaz Leskes l Software Engineer l Elastic
Jason Tedor l Software Engineer l Elastic
Yannick Welsch l Software Engineer l Elastic

Transcript

  1. Elasticsearch Engineers March 8th, 2017 Consistency and Replication in Elasticsearch Boaz Leskes @bleskes Yannick Welsch @ywelsch Jason Tedor @jasontedor
  2. 2 The Perfect Cluster

  3. • Indexes lots of data fast • Returns near real time results • Everyone can ask me anything • Resilient to a reasonable amount of failures The Perfect Cluster 3
  4. The Perfect Cluster 4 GET _search PUT index/type/1 { "f": "text" } PUT index/type/2 { "f": "other text" } GET _search
  5. The Perfect Cluster 5 GET _search PUT index/type/1 { "f": "text" } PUT index/type/2 { "f": "other text" } GET _search
  6. The Perfect Cluster 6 GET _search GET _search PUT index/type/1 { "c": 1 } PUT index/type/1 { "c": 2 }
  7. The Perfect Cluster 7 GET _search GET _search PUT index/type/1 { "c": 1 } PUT index/type/2 { "c": 2 }
  8. The Perfect Cluster 8 GET _search GET _search PUT index/type/1 { "c": 1 } PUT index/type/2 { "c": 2 }
  9. 9 Quorum based algorithms • Paxos • Viewstamped Replication • Raft • Zab
  10. Quorum-based Algorithms 10 POST new_index { "settings": {} } 200 OK
  11. Quorum-based Algorithms 11 200 OK POST new_index { "settings": {} } ? ?
  12. A QUORUM OF 2 IS 2

  13. • Good for coordinating many nodes • Good for light reads • Requires three copies or more Summary - Quorum Based Consistency Algorithms 13 • Good for Cluster State coordination • Problematic for storing data ZenDiscovery
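    The arithmetic behind "a quorum of 2 is 2", as a minimal sketch (illustrative only, not Elasticsearch code): with a majority quorum of floor(n/2) + 1, two copies tolerate zero failures, which is why quorum-based data replication effectively requires three copies or more.

    # Illustrative quorum arithmetic; the function names are invented for this sketch.
    def majority_quorum(copies: int) -> int:
        """Smallest group guaranteed to overlap any other majority."""
        return copies // 2 + 1

    def tolerated_failures(copies: int) -> int:
        """How many copies can be lost while a quorum can still be formed."""
        return copies - majority_quorum(copies)

    for n in (2, 3, 5):
        print(f"{n} copies: quorum of {majority_quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
    # 2 copies: quorum of 2, tolerates 0 failure(s)  <- a quorum of 2 is 2
    # 3 copies: quorum of 2, tolerates 1 failure(s)
    # 5 copies: quorum of 3, tolerates 2 failure(s)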
  14. • Why do we need 3 copies exactly? • Can’t we operate without the extra copy? Retrospective - Quorums and Data 14
  15. 15 Primary - Backup Replication

  16. • well-known model for replicating data • main copy of the data called primary • additional backup copies • Elasticsearch: • primary and replica shards • number of replicas is configurable & dynamically adjustable • fully-automated shard management Primary - Backup replication 16
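    To make "number of replicas is configurable & dynamically adjustable" concrete, a minimal sketch using the index settings API, assuming the Python requests library, a cluster at http://localhost:9200, and a made-up index name:

    import requests

    ES = "http://localhost:9200"

    # Create an index with one replica per primary shard.
    requests.put(f"{ES}/my_index", json={"settings": {"number_of_replicas": 1}})

    # Later, raise the replica count on the live index; the master allocates
    # and initializes the extra copies without downtime.
    requests.put(f"{ES}/my_index/_settings", json={"index": {"number_of_replicas": 2}})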
  17. • data flows from primary to replicas • synchronous replication • write to all / read from any • can lose all but one copy Basic flow for write requests 17 [diagram: shard copies spread across nodes 1-5]
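    A toy model of that write flow (not the actual Elasticsearch code path; class and function names are invented): the primary applies the operation first, forwards it synchronously to every replica, and only then acknowledges the client, so any single surviving copy can serve reads.

    class Copy:
        """One shard copy holding documents in memory."""
        def __init__(self, name: str):
            self.name = name
            self.docs: dict[str, dict] = {}

        def apply(self, doc_id: str, doc: dict) -> None:
            self.docs[doc_id] = doc

    def index(primary: Copy, replicas: list[Copy], doc_id: str, doc: dict) -> str:
        primary.apply(doc_id, doc)           # 1. write on the primary
        for replica in replicas:             # 2. replicate synchronously to all copies
            replica.apply(doc_id, doc)
        return "acknowledged"                # 3. only now answer the client

    primary, replica = Copy("primary"), Copy("replica")
    index(primary, [replica], "1", {"f": "text"})
    assert primary.docs == replica.docs      # read from any copy returns the same document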
  18. What if a replica is unavailable? 18 [diagram: shard 0 on nodes 2 and 3] 1 primary, 1 replica, contain same data
  19. What if a replica is unavailable? 19 [diagram: shard 0 on nodes 2 and 3] Take down node 3 for maintenance
  20. What if a replica is unavailable? 20 [diagram: shard 0 on nodes 2 and 3] Index into primary
  21. What if a replica is unavailable? 21 [diagram: shard 0 on nodes 2 and 3] Replica misses acknowledged write
  22. What if a replica is unavailable? 22 [diagram: shard 0 on nodes 2 and 3] Node with primary crashes
  23. What if a replica is unavailable? 23 [diagram: shard 0 on nodes 2 and 3] Node with stale shard copy comes back up
  24. What if a replica is unavailable? 24 [diagram: shard 0 on nodes 2 and 3] DATA LOSS: stale shard copy should not become primary
  25. • Master is responsible for allocating shards • Decision recorded in cluster state • Broadcasted to all the nodes • Smart routing of requests based on cluster state Shard allocation basics 25
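    A sketch of the "smart routing" idea, assuming the shard is chosen by hashing the routing value (the document id by default) modulo the number of shards, and that the cluster state's routing table maps each shard to the nodes holding a copy. Elasticsearch uses a murmur3-based hash in practice; Python's hash() and the node names below are stand-ins.

    def shard_for(routing: str, number_of_shards: int) -> int:
        # stand-in for the real routing hash
        return hash(routing) % number_of_shards

    # Simplified routing table from the cluster state: shard -> nodes with a copy.
    routing_table = {0: ["node2", "node3"], 1: ["node1", "node4"]}

    def candidate_nodes(doc_id: str) -> list[str]:
        """Any node can coordinate: it resolves the target shard locally and forwards the request."""
        return routing_table[shard_for(doc_id, len(routing_table))]

    print(candidate_nodes("my-doc"))   # the request goes straight to a node holding the shard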
  26. What if a replica is unavailable? 26 [diagram: two cluster views of shard 0 on nodes 2 and 3] Who has a copy of shard 0? Should I pick the copy on node 3 as primary? Choose wisely
  27. 27 • Allocation IDs uniquely identify shard copies • assigned by master • stored next to shard data • Master tracks subset of copies that are in-sync • persisted in cluster state • changes are backed by consensus layer In-sync allocations
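    A toy sketch of that bookkeeping (names invented for illustration): the master hands out allocation IDs, keeps the in-sync set in the cluster state, and a copy that has been marked stale can never be chosen as primary.

    import uuid

    in_sync: set[str] = set()   # persisted in the cluster state via the consensus layer

    def allocate_copy() -> str:
        allocation_id = uuid.uuid4().hex[:5]   # e.g. "9154f"
        in_sync.add(allocation_id)
        return allocation_id

    def mark_stale(allocation_id: str) -> None:
        # requested by the primary before it acknowledges a write the copy missed
        in_sync.discard(allocation_id)

    def eligible_as_primary(allocation_id: str) -> bool:
        return allocation_id in in_sync

    primary_id, replica_id = allocate_copy(), allocate_copy()
    mark_stale(replica_id)                       # the replica was down and missed writes
    assert eligible_as_primary(primary_id)
    assert not eligible_as_primary(replica_id)   # the stale copy cannot become primary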
  28. Tracking in-sync shard copies 28 [diagram: shard copies 9154f and 6f91c on nodes 2 and 3; every node's cluster state lists in-sync: [9154f,6f91c]] 1 primary, 1 replica, this time with allocation IDs
  29. Tracking in-sync shard copies 29 [diagram: in-sync: [9154f,6f91c] on every node] Take down node 3 for maintenance
  30. Tracking in-sync shard copies 30 [diagram: in-sync: [9154f,6f91c] on every node] Index into primary
  31. Tracking in-sync shard copies 31 [diagram: primary asks the master to remove 6f91c] Master, please mark this other shard copy as stale
  32. Tracking in-sync shard copies 32 [diagram: master publishes cluster state with in-sync: [9154f]] Everyone should know about in-sync copies
  33. Tracking in-sync shard copies 33 [diagram: in-sync: [9154f]] Acknowledge write to client
  34. Tracking in-sync shard copies 34 [diagram: in-sync: [9154f]] Node 2 crashes
  35. Tracking in-sync shard copies 35 [diagram: in-sync: [9154f]] Shard copy with id 6f91c is not in the in-sync list. Wiser master makes smarter decisions
  36. • data replication based on primary-backup model • number of replicas configurable and dynamically adjustable • also works with just two shard copies • backed by cluster consensus to ensure safety • writes synchronous and go to all copies, read from any copy Summary 36
  37. 37 Staying in sync

  38. Primary/Backup (super quick review) 38 One primary, one replica [diagram: primary, replica]
  39. Primary/Backup (super quick review) 39 Indexing request hits primary [diagram: primary, replica]
  40. Primary/Backup (super quick review) 40 After primary indexes, send replica requests [diagram: primary, replica]
  41. Primary/Backup (super quick review) 41 An offline replica will miss replica requests [diagram: primary, replica]
  42. Primary/Backup (super quick review) 42 Help, I’ve fallen out of sync and I can’t get up [diagram: primary, replica]
  43. 43 Mind of their own

  44. • Peer recovery in Elasticsearch is file-based • Upon recovery, the primary and replica compare lists of files (and checksums) • The primary sends the replica the files the replica is missing • When recovery completes, the two shards have the same files on disk Peer recovery 44 Bring yourself back online
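    A sketch of the file comparison step (segment names and checksums below are invented): only files whose checksum differs, or which the replica lacks entirely, need to be copied.

    def files_to_send(primary_files: dict[str, str], replica_files: dict[str, str]) -> list[str]:
        """Each side maps file name -> checksum; return what the replica is missing or has changed."""
        return sorted(name for name, checksum in primary_files.items()
                      if replica_files.get(name) != checksum)

    primary_files = {"_0.cfs": "a1b2", "_1.cfs": "c3d4", "segments_2": "e5f6"}
    replica_files = {"_0.cfs": "a1b2", "segments_1": "0f0f"}
    print(files_to_send(primary_files, replica_files))   # ['_1.cfs', 'segments_2']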
  45. • Shards in Elasticsearch are independent Lucene indices • This means that the files on disk are very unlikely to be in sync, even when a replica is online • Synced flush mitigates this problem for idle shards • Shards seeing active indexing are unlikely to be equal at the file level, so recovery is often a full recovery Peer recovery 45 If you have to copy gigabytes of data, you’re going to have a bad time
  46. 46 A Different Approach

  47. Operation-based approach 47 Stop thinking about files [diagram: primary, replica]
  48. Operation-based approach 48 Start thinking about operations [diagram: primary, replica]
  49. Operation-based approach 49 We need a reliable way to identify missing operations [diagram: primary, replica]
  50. Operation-based approach 50 Introducing sequence numbers [diagram: primary holds operations 0-4, replica holds 0-3]
  51. 51 Get well soon!

  52. Fast recovery 52 One primary, one replica (redux) [diagram: primary and replica each hold operations 0-4]
  53. Fast recovery 53 Replica goes offline [diagram: primary and replica each hold operations 0-4]
  54. Fast recovery 54 Indexing continues on the primary [diagram: primary holds 0-9, replica holds 0-4]
  55. Fast recovery 55 The return of the replica [diagram: primary holds 0-9, replica holds 0-4] • Replica and primary compare operations
  56. Fast recovery 56 Special delivery for Mr. Replica [diagram: primary holds 0-9, replica holds 0-4] • Replica and primary compare operations • Primary sends all missing operations to the replica
  57. Fast recovery 57 Recovery is complete [diagram: primary and replica both hold 0-9] • Replica and primary compare operations • Primary sends all missing operations to the replica • Replica is now in sync with the primary
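    A sketch of that operation-based recovery (simplified; the real mechanism also involves the translog and history retention): the primary replays exactly the operations whose sequence numbers the returning replica has not seen.

    def missing_ops(primary_history: dict[int, dict], replica_seq_nos: set[int]) -> list[tuple[int, dict]]:
        return sorted((seq, op) for seq, op in primary_history.items() if seq not in replica_seq_nos)

    primary_history = {seq: {"op": f"index op {seq}"} for seq in range(10)}   # seq_no -> operation
    replica_seq_nos = {0, 1, 2, 3, 4}                                         # the replica went offline after 4

    for seq, op in missing_ops(primary_history, replica_seq_nos):
        print("replay", seq, op["op"])                                        # replays only 5..9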
  58. 58 Shared history

  59. History and concurrency 59 Assign sequence numbers on the primary [diagram: primary, replica1, replica2 each hold operation 0]
  60. History and concurrency 60 Histories develop independently on each shard [diagram: primary 0-1, replica1 0, replica2 0-1]
  61. History and concurrency 61 Concurrent requests can lead to divergent histories [diagram: primary 0-2, replica1 0 and 2, replica2 0-1]
  62. History and concurrency 62 The plot thickens [diagram: replica1 holds 0-4, 7, 9; primary holds 0-9; replica2 holds 0-6, 8, 9]
  63. History and concurrency 63 Those who cannot remember the past are condemned to repeat it [diagram: replica1, now primary, holds 0-4, 7, 9; old primary holds 0-9; replica2 holds 0-6, 8, 9]
  64. • We need to track what part of history is complete • The local checkpoint for a shard is the largest sequence number below which history is complete on that shard • Persisted in each Lucene commit point History and concurrency 64 Local checkpoints A local checkpoint is maintained on each shard
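    A minimal sketch of local checkpoint tracking (the class name is invented): the checkpoint only advances over a contiguous prefix of completed sequence numbers, so a gap left by an out-of-order operation holds it back.

    class LocalCheckpointTracker:
        def __init__(self) -> None:
            self.completed: set[int] = set()
            self.checkpoint = -1                      # nothing processed yet

        def mark_completed(self, seq_no: int) -> None:
            self.completed.add(seq_no)
            while self.checkpoint + 1 in self.completed:
                self.checkpoint += 1                  # advance over the contiguous prefix

    tracker = LocalCheckpointTracker()
    for seq_no in (0, 1, 3):                          # 2 has not arrived yet
        tracker.mark_completed(seq_no)
    print(tracker.checkpoint)                         # 1: history is only complete up to 1
    tracker.mark_completed(2)
    print(tracker.checkpoint)                         # 3: the gap is filled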
  65. History and concurrency 65 Local checkpoints in action [diagram: new primary holds 0-4, 7, 9; old primary holds 0-9; replica2 holds 0-6, 8, 9; local checkpoint marked on each copy]
  66. • We cannot keep history forever • We need a global safety marker that tells us how far back to keep history • The primary’s knowledge of the minimum of the local checkpoints across all in-sync shard copies can serve as this safe point • Replicated from the primary to the replicas so each shard has local knowledge of the global checkpoint History and concurrency 66 Global checkpoints A global checkpoint is maintained on the primary
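    A sketch of the global checkpoint computation (illustrative values): the primary takes the minimum of the local checkpoints it knows for the in-sync copies, and that value is what it replicates back to the replicas.

    def global_checkpoint(local_checkpoints: dict[str, int], in_sync: set[str]) -> int:
        return min(local_checkpoints[copy] for copy in in_sync)

    local_checkpoints = {"primary": 9, "replica1": 4, "replica2": 6}
    print(global_checkpoint(local_checkpoints, {"primary", "replica1", "replica2"}))   # 4
    print(global_checkpoint(local_checkpoints, {"primary", "replica2"}))               # 6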
  67. History and concurrency 67 Local and global checkpoints in action [diagram: same copies with local and global checkpoints marked; the global checkpoint is lagging]
  68. History and concurrency 68 I’m rolling, I’m rolling, I’m rolling [diagram: operations above the global checkpoint are rolled back]
  69. History and concurrency 69 Re-sync from the promoted primary [diagram: replica2 re-synced to 0-4, 7, 9, matching the new primary]
  70. History and concurrency 70 Fill gaps under the mandate of the new primary [diagram: gaps at 5, 6, 8 filled on the new primary and the replica]
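    A toy sketch of the failover sequence on slides 68-70, with made-up data: a surviving replica rolls back operations above the global checkpoint and adopts the promoted primary’s history above that point, while the new primary completes its own history (conceptually with no-op entries).

    def resync_replica(new_primary: dict[int, dict], replica: dict[int, dict], global_ckpt: int) -> dict[int, dict]:
        # roll back anything the replica holds above the global checkpoint ...
        kept = {seq: op for seq, op in replica.items() if seq <= global_ckpt}
        # ... then adopt the new primary's history above it
        kept.update({seq: op for seq, op in new_primary.items() if seq > global_ckpt})
        return kept

    def fill_gaps(history: dict[int, dict]) -> dict[int, dict]:
        # the new primary plugs the holes in its own history up to its highest sequence number
        return {seq: history.get(seq, {"op": "noop"}) for seq in range(max(history) + 1)}

    new_primary = {seq: {"op": seq} for seq in (0, 1, 2, 3, 4, 7, 9)}       # the promoted replica's history
    old_replica = {seq: {"op": seq} for seq in (0, 1, 2, 3, 4, 5, 6, 8, 9)}

    print(sorted(resync_replica(new_primary, old_replica, global_ckpt=4)))  # [0, 1, 2, 3, 4, 7, 9]
    print(sorted(fill_gaps(new_primary)))                                   # keys 0..9; 5, 6 and 8 map to no-ops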
  71. 71 Wait for it

  72. • Fast replica recovery (6.0.0) • Re-sync on primary failure (6.0.0) • Cross-datacenter replication (6.x) • Changes API (tbd) Higher-level features 72 Sequence numbers enable big league features
  73. 73 The path ahead

  74. • Elasticsearch uses a quorum-based algorithm for cluster state consensus, which is good for dealing with lots of nodes and light reads • Elasticsearch uses the primary/backup model for data replication, which can run with one or two copies, supports concurrency, and saves on storage costs • Elasticsearch 6.0.0 will introduce sequence numbers, which will enable high-level features like faster recovery, the changes API, and cross-datacenter replication Summary 74
  75. 75 Meta Issue

  76. 76 Formalizing the model
    \* Index request arrives on node n with document id docId
    ClientRequest(n, docId) ==
        /\ n \notin crashedNodes                            \* only non-crashed nodes can accept requests
        /\ clusterStateOnNode[n].routingTable[n] = Primary  \* node believes itself to be the primary
        /\ LET replicas    == Replicas(clusterStateOnNode[n].routingTable)
               primaryTerm == currentTerm[n]
               tlogEntry   == [id |-> docId, term |-> primaryTerm, value |-> nextClientValue, pc |-> FALSE]
               seq         == MaxSeq(tlog[n]) + 1
    https://github.com/elastic/elasticsearch-tla and https://lamport.azurewebsites.net/tla/tla.html
  77. • Lamport, The Part-Time Parliament, ACM TOCS, 1998 (https://goo.gl/w0IfsB) • Oki et al., Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems, PODC, 1988 (https://goo.gl/8ZxcNe) • Ongaro et al., In Search of an Understandable Consensus Algorithm, USENIX ATC, 2014 (https://goo.gl/L5hXSh) • Junqueira et al., Zab: High-performance broadcast for primary-backup systems, DSN, 2011 (https://goo.gl/CxtesU) • Lin et al., PacificA: Replication in Log-Based Distributed Storage Systems, MSR TR, 2008 (https://goo.gl/iW5FL3) • Alsberg et al., A Principle for Resilient Sharing of Distributed Resources, ICSE, 1976 (https://goo.gl/5GmPFN) References 77
  78. • There are only two hard problems in distributed systems: 2. Exactly-once delivery 1. Guaranteed order of messages 2. Exactly-once delivery • And hiring people! • We’re looking: https://goo.gl/VjiNjZ Join us! 78
  79. 79 More Questions? Visit us at the AMA