Designing Concurrent Distributed Sequence Numbers for Elasticsearch

Sequence numbers assign a unique, increasing number to every document change. They lay the foundation for higher-level features such as a changes stream, or bringing a lagging replica up to speed quickly. Implementing them in a distributed system means dealing with challenges far beyond the capabilities of a simple AtomicLong: they have to be robust in the face of faulty servers, networking issues, and sudden power outages. On top of that, they need to work in the highly concurrent indexing environment of systems like Elasticsearch. This talk will take you through the journey of designing such a system.
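
On a single node, the problem the talk opens with is trivial; a minimal illustrative Java sketch (not Elasticsearch code) makes the contrast concrete:

    import java.util.concurrent.atomic.AtomicLong;

    class LocalSequenceNumbers {
        // On one node, a unique, increasing number per change is easy.
        private final AtomicLong counter = new AtomicLong(-1);

        long nextSeqNo() {
            return counter.incrementAndGet();
        }
    }

Everything that follows is about what this sketch ignores: the counter lives in a single process, so crashes, partitions, and primary failovers can produce duplicate or conflicting numbers across copies.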

We will start by explaining the requirements. Then we'll evaluate solutions based on existing consensus algorithms, like ZooKeeper's ZAB and Raft, and discuss why they are (in)sufficient for the task. Next we'll consider some alternative approaches, and finally arrive at our proposed solution.

You don't need to be a consensus expert to enjoy this talk. Hopefully, you will leave with a better appreciation of the complexities of distributed systems and be inspired to learn more.

Talk given at Berlin Buzzwords 2015

Boaz Leskes

June 02, 2015

Transcript

  1. Designing Concurrent Distributed Sequence Numbers for Elasticsearch Boaz Leskes @bleskes

  2. Sequence numbers - WhyTF?

  3. Document level versioning
     PUT tweets/tweet/605260098835988500
     { "created_at": "Mon Jun 01 06:30:27 +0000 2015", "id": 605260098835988500, "text": "Looking forward for awesomeness #bbuzz", "user": { "name": "Boaz Leskes", "screen_name": "bleskes" } }
     Response: { "_index": "tweets", "_type": "tweet", "_id": "605260098835988500", "_version": 3, … }

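The _version in the response above is a per-document counter that increments on every update. A hypothetical in-memory sketch of that behavior (names are ours, not the real implementation):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class VersionedStore {
        record Doc(long version, String source) {}

        private final Map<String, Doc> docs = new ConcurrentHashMap<>();

        // Index a document; the stored version increments on every update.
        long put(String id, String source) {
            return docs.merge(id, new Doc(1, source),
                    (old, fresh) -> new Doc(old.version() + 1, source)).version();
        }
    }
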
  4. Multiple doc updates
     PUT tweets/tweet/605260098835988500
     { … "text": "…", "user": { "name": "Boaz Leskes", "screen_name": "bleskes" } }
     PUT tweets/tweet/426674590560305150
     { … "text": "…", "user": { "name": "Boaz Leskes", "screen_name": "bleskes" } }
     PUT tweets/tweet/605260098835988500
     { … "text": "…", "user": { "name": "Boaz Leskes", "screen_name": "bleskes" }, "retweet_count": 1 }

  5. Multiple doc updates - with seq#
     PUT tweets/tweet/605260098835988500
     { … "text": "…", "user": { "name": "Boaz Leskes", "screen_name": "bleskes" } } (seq# 1)
     PUT tweets/tweet/426674590560305150
     { … "text": "…", "user": { "name": "Boaz Leskes", "screen_name": "bleskes" } } (seq# 2)
     PUT tweets/tweet/605260098835988500
     { … "text": "…", "user": { "name": "Boaz Leskes", "screen_name": "bleskes" }, "retweet_count": 1 } (seq# 3)

  6. Sequence # == ordering of changes • meaning they can be sorted, shipped, replayed

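The crux of slide 6, sketched with an operation record of our own invention: once every change carries a seq#, a batch received in any order can be applied deterministically.

    import java.util.Comparator;
    import java.util.List;
    import java.util.function.Consumer;

    record Operation(long seqNo, String docId, String source) {}

    class Replayer {
        // Sort by sequence number, then apply: same input set, same
        // result, regardless of the order the operations arrived in.
        static void replay(List<Operation> ops, Consumer<Operation> apply) {
            ops.stream()
               .sorted(Comparator.comparingLong(Operation::seqNo))
               .forEach(apply);
        }
    }
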
  7. Primary Replica Sync • Primary: 5 4 3 2 1 • Replica: 4 3 2 1

  8. Primary Replica File Based Sync • Primary: 5 4 3 2 1 • Replica: 4 3 2 1

  9. Primary Replica File Based Sync • Primary: 5 4 3 2 1 • Replica: 4 3 2 1

  10. Primary Replica Seq# Based Sync • Primary: 5 4 3 2 1 • Replica: 4 3 2 1

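The payoff of slides 8-10: file-based sync copies whole files to repair a lagging copy, while seq#-based sync ships only the missing operations. A sketch, assuming the replica reports the highest seq# it holds (the Operation record is the same illustrative shape as above):

    import java.util.List;

    class SeqNoBasedSync {
        record Operation(long seqNo, String docId, String source) {}

        // On the primary: everything above the replica's reported
        // seq# is what the replica is missing.
        static List<Operation> opsToShip(List<Operation> history, long replicaMaxSeqNo) {
            return history.stream()
                    .filter(op -> op.seqNo() > replicaMaxSeqNo)
                    .toList();
        }
    }

The later slides on rollback show why "highest seq# it holds" is not quite enough once histories have gaps.
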
  11. Indexing essentials

  12. Indexing essentials • C • node 1: 0P 1R • node 2: 0R 1P • node 3: 0R 1R

  13. Indexing essentials • C • node 1: 0P 1R • node 2: 0R 1P • node 3: 0R 1R

  14. Indexing essentials • C • node 1: 0P 1R • node 2: 0R 1P • node 3: 0R 1R

  15. Indexing essentials • C • node 1: 0P 1R • node 2: 0R 1P • node 3: 0R 1R

  16. Indexing essentials • C • node 1: 0P 1R • node 2: 0R 1P • node 3: 0R 1R

  17. Indexing essentials • C • node 1: 0P 1R • node 2: 0R 1P • node 3: 0R 1R

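How a document lands on shard 0 or shard 1 in these diagrams: the coordinating node (C) hashes the routing key, by default the document id, modulo the number of primary shards, then forwards the operation to that shard's primary (0P or 1P), which replicates it to the replicas (0R, 1R). A sketch using hashCode as a stand-in for Elasticsearch's real routing hash:

    class ShardRouting {
        // floorMod keeps the result non-negative even for negative hashes.
        static int shardId(String routingKey, int numberOfPrimaryShards) {
            return Math.floorMod(routingKey.hashCode(), numberOfPrimaryShards);
        }
    }
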
  18. Concurrent Indexing • Replica • Replica • Primary

  19. Concurrent Indexing • Replica: 1 • Replica: 1 • Primary: 1

  20. Concurrent Indexing • Replica: 1 • Replica: 1 2 • Primary: 1 2

  21. Concurrent Indexing • Replica: 1 3 • Replica: 1 2 • Primary: 1 2 3

  22. Concurrent Indexing • Replica: 1 3 • Replica: 1 2
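
In slides 18-22 several operations are in flight at once. Handing out the numbers on the primary is the easy part, as in this illustrative sketch; the hard part, which slide 22 shows, is that operations can reach each replica in a different order and a failure can strike mid-flight.

    import java.util.concurrent.atomic.AtomicLong;

    class PrimaryShard {
        private final AtomicLong seqNoGenerator = new AtomicLong(-1);

        // Called concurrently by many indexing threads: each operation
        // gets a unique, increasing seq# before being replicated.
        long assignSeqNo() {
            return seqNoGenerator.incrementAndGet();
        }
    }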

  23. Requirements • Correct :) • Fault tolerant • Support concurrency

  24. For example, the Raft Consensus Algorithm

  25. Raft Consensus Algorithm • Built to be understandable • Leader based • Modular (election + replication) • See https://raftconsensus.github.io/ • Used by Facebook’s HBase port & Algolia for data replication

  26. Raft - appendEntries • Replica: 1 2 • Replica: 1 2 • Primary: 1 2 • t-1:1,t:2 t-1:1,t:2

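Reading the slide notation: "t-1:1,t:2" is a log whose entry 1 was written in term t-1 and entry 2 in term t. An appendEntries request carries a consistency check on the entry preceding the new ones; a sketch of its fields as given in the Raft paper:

    import java.util.List;

    record LogEntry(long term, long index, String payload) {}

    record AppendEntriesRequest(
            long term,            // leader's current term
            String leaderId,
            long prevLogIndex,    // index of the entry just before the new ones
            long prevLogTerm,     // its term; the follower rejects on mismatch
            List<LogEntry> entries,
            long leaderCommit     // highest index the leader knows is committed
    ) {}
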
  27. Raft - commit on quorum • Replica: 1 2 • Replica: 1 2 • Primary: 1 2 • t-1:1,t:2 t-1:1,t:2

  28. Raft - broadcast* commit • Replica: 1 2 • Replica: 1 2 • Primary: 1 2 • c=2 c=2

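A sketch of "commit on quorum": the leader tracks the highest index each copy has acknowledged (matchIndex in the Raft paper). The middle element of the sorted list is the highest index stored on a majority, and becomes the c=2 commit point broadcast in this slide. (Raft additionally requires that the entry be from the leader's current term before committing.)

    import java.util.Arrays;

    class QuorumCommit {
        // matchIndexes includes the leader's own last log index.
        static long commitIndex(long[] matchIndexes) {
            long[] sorted = matchIndexes.clone();
            Arrays.sort(sorted);
            // For n copies, sorted[(n - 1) / 2] is held by a majority:
            // n = 3 gives the median, n = 5 the third-highest value.
            return sorted[(sorted.length - 1) / 2];
        }
    }
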
  29. Raft - primary failure • Replica: 1 2 • Replica: 1 2 3 • Primary: 1 2 3 • t-1:2,t:3 t-1:2,t:3

  30. Raft - ack on quorum • Replica: 1 2 • Replica: 1 2 3 • Primary: 1 2 3 • t-1:2,t:3 t-1:2,t:3 • _get 3

  31. Raft - primary failure • Replica: 1 2 • Replica: 1 2 3 • Primary: 1 2 3 • t-1:2,t:3 t-1:2,t:3

  32. Raft - primary failure • Replica: 1 2 • Replica: 1 2 3 • t-1:2,t:3 t-1:2,t:3

  33. Raft - concurrent indexing? • Replica: 1 3 • Replica: 1 2 • Primary: 1 2 3 • t-1:1,t:2 t-1:2,t:3

  34. Raft • Simple to understand • Quorum means: lagging shards don’t slow down indexing … but • Read visibility issues • Tolerates up to quorum - 1 failures • Needs at least 3 copies for correctness • Challenges with concurrency

  35. Master-Backup replication

  36. Master Backup Replication • Leader based • Writes to all copies before ack-ing • Used by Elasticsearch, Kafka, RAMCloud (and many others)

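The rule in this slide, sketched with illustrative interfaces: the primary forwards each operation to every in-sync copy and acknowledges the client only when all of them have confirmed. This is what later buys "no read visibility issues" and "tolerates up to N-1 failures" (slide 41), at the price of waiting for the slowest copy.

    import java.util.List;
    import java.util.concurrent.CompletableFuture;

    interface Copy {
        CompletableFuture<Void> replicate(long seqNo, String source);
    }

    class MasterBackupPrimary {
        // Ack the client only once *every* copy has the operation.
        CompletableFuture<Void> index(long seqNo, String source, List<Copy> replicas) {
            CompletableFuture<?>[] acks = replicas.stream()
                    .map(copy -> copy.replicate(seqNo, source))
                    .toArray(CompletableFuture[]::new);
            return CompletableFuture.allOf(acks);
        }
    }
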
  37. Master-Backup - indexing • Replica: 1 • Replica: 1 • Primary: 1

  38. Master-Backup - indexing • Replica: 1 • Replica: 1 • Primary: 1

  39. Master-Backup - concurrency/failure • Replica: 1 3 • Replica: 1 2 • Primary: 1 2 3

  40. Master-Backup - concurrency/failure • Replica: 1 3 • Replica: 1 2

  41. Master-Backup replication • Simple to understand • Easier to work with concurrency • Write to all before ack means: no read visibility issues, tolerates up to N-1 failures … but • A lagging shard slows indexing down (until failed) • Rollbacks on failure are more frequent • No clear commit point

  42. Failure, Rollback and Commitment

  43. 3 histories • Primary: 5 4 3 2 1 • Replica: 5 4 3 2 1 • Replica: 5 4 3 2 1

  44. Failure, Rollback and Commitment • Primary: 9 8 7 6 5 4 3 2 1 • Replica: 9 7 5 4 3 2 1 • Replica: 9 8 6 5 4 3 2 1

  45. Failure, Rollback and Commitment • Primary: 9 8 7 6 5 4 3 2 1 • Replica: 9 7 5 4 3 2 1 • Replica: 9 8 6 5 4 3 2 1

  46. Primary knows what’s “safe” • Primary: 9 8 7 6 5 4 3 2 1 • Replica: 9 7 5 4 3 2 1 • Replica: 9 8 6 5 4 3 2 1

  47. Replicas have a lagging “safe” point • Primary: 9 8 7 6 5 4 3 2 1 • Replica: 9 7 5 4 3 2 1 • Replica: 9 8 6 5 4 3 2 1

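What slides 43-47 boil down to, in an illustrative sketch: each copy tracks a local checkpoint, the highest seq# below which its own history has no gaps, and the primary's "safe" point is the minimum of the local checkpoints it has heard about. Everything at or below it exists on every in-sync copy and never needs to be rolled back; replicas learn the value with a lag.

    class SafePoints {
        // Local checkpoint: highest n such that every seq# in 0..n has
        // been processed on this copy (gaps above it are allowed).
        static long localCheckpoint(boolean[] processed) {
            long checkpoint = -1;
            while (checkpoint + 1 < processed.length && processed[(int) (checkpoint + 1)]) {
                checkpoint++;
            }
            return checkpoint;
        }

        // The primary's "safe" point across copies.
        static long globalSafePoint(long[] localCheckpoints) {
            long min = Long.MAX_VALUE;
            for (long checkpoint : localCheckpoints) {
                min = Math.min(min, checkpoint);
            }
            return min;
        }
    }
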
  48. Final words • Design is pretty much nailed down • Working on the nitty-gritty implementation details

  49. thank you! https://elastic.co https://github.com/elastic/elasticsearch