Designing Concurrent Distributed Sequence Numbers for Elasticsearch

Sequence numbers assign a unique, increasing number to every document change. They lay the foundation for higher-level features such as a changes stream, or bringing a lagging replica up to speed quickly. Implementing them in a distributed system means dealing with challenges far beyond the capabilities of a simple AtomicLong: the numbers have to be robust against faulty servers, networking issues, and sudden power outages. On top of that, they need to work in the highly concurrent indexing environment of systems like Elasticsearch. This talk will take you through the journey of designing such a system.
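
On a single node such a counter would be trivial. A minimal sketch in plain Java, shown only as the baseline the rest of the talk improves on (not Elasticsearch code):

    import java.util.concurrent.atomic.AtomicLong;

    // The trivial single-node version: hand out a unique, increasing number
    // for every change. It offers none of the durability or fail-over
    // guarantees a distributed system needs, which is exactly the gap the
    // talk explores.
    class LocalSequenceNumbers {
        private final AtomicLong counter = new AtomicLong(-1);

        long nextSeqNo() {
            return counter.incrementAndGet(); // first change gets 0, then 1, 2, ...
        }
    }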

We will start by explaining the requirements. Then we'll evaluate solutions based on existing consensus algorithms, such as ZooKeeper's ZAB and Raft, and see why they are (in)sufficient for the task. Next we'll consider some alternative approaches, and finally end up with our proposed solution.

You don't need to be a consensus expert to enjoy this talk. Hopefully, you will leave with a better appreciation of the complexities of distributed systems and be inspired to learn more.

Talk given at Berlin Buzzwords 2015

Boaz Leskes

June 02, 2015

Transcript

  1. Designing Concurrent Distributed
    Sequence Numbers for Elasticsearch
    Boaz Leskes
    @bleskes

  2. Sequence numbers - WhyTF?

  3. Document level versioning
    PUT tweets/tweet/605260098835988500
    {
      "created_at": "Mon Jun 01 06:30:27 +0000 2015",
      "id": 605260098835988500,
      "text": "Looking forward for awesomeness #bbuzz",
      "user": {
        "name": "Boaz Leskes",
        "screen_name": "bleskes"
      }
    }

    Response:
    {
      "_index": "tweets",
      "_type": "tweet",
      "_id": "605260098835988500",
      "_version": 3
    }
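
    The "_version" field increments with every change to the document and can be used for optimistic concurrency control. A minimal sketch of the idea (illustrative Java, not Elasticsearch source):

    // Illustrative sketch: per-document versioning as an optimistic
    // compare-and-set on a counter that grows by one with every change.
    class VersionedDoc {
        private long version = 0;     // last assigned version
        private String source = "{}"; // current document body

        // Apply an update only if the caller has seen the latest version.
        synchronized boolean update(long expectedVersion, String newSource) {
            if (expectedVersion != version) {
                return false;         // conflict: the document changed underneath us
            }
            version++;                // this change becomes version N + 1
            source = newSource;
            return true;
        }
    }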

  4. Multiple doc updates
    PUT tweets/tweet/605260098835988500
    {
      "text": "…",
      "user": {
        "name": "Boaz Leskes",
        "screen_name": "bleskes"
      }
    }

    PUT tweets/tweet/426674590560305150
    {
      "text": "…",
      "user": {
        "name": "Boaz Leskes",
        "screen_name": "bleskes"
      }
    }

    PUT tweets/tweet/605260098835988500
    {
      "text": "…",
      "user": {
        "name": "Boaz Leskes",
        "screen_name": "bleskes"
      },
      "retweet_count": 1
    }

  5. Multiple doc updates - with seq#
    PUT tweets/tweet/605260098835988500     (seq# 1)
    { "text": "…", "user": { "name": "Boaz Leskes", "screen_name": "bleskes" } }

    PUT tweets/tweet/426674590560305150     (seq# 2)
    { "text": "…", "user": { "name": "Boaz Leskes", "screen_name": "bleskes" } }

    PUT tweets/tweet/605260098835988500     (seq# 3)
    { "text": "…", "user": { "name": "Boaz Leskes", "screen_name": "bleskes" },
      "retweet_count": 1 }

    Each operation is assigned the next sequence number: 1, 2, 3.

  6. Sequence # == ordering of changes
    • meaning they can be sorted, shipped, replayed
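
    A minimal sketch of that property (illustrative names, not Elasticsearch classes): because every change carries a sequence number, a batch of changes can be sorted, shipped to another copy, and replayed there in the original order.

    import java.util.*;

    class ChangeReplay {
        // One sequenced change to a document.
        record Operation(long seqNo, String docId, String source) {}

        // Replay a batch of shipped operations on another copy.
        static void replay(List<Operation> ops, Map<String, String> index) {
            ops.sort(Comparator.comparingLong(Operation::seqNo)); // restore the global order
            for (Operation op : ops) {
                index.put(op.docId(), op.source());               // re-apply each change
            }
        }
    }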

  7. Primary Replica Sync
    [Diagram: the primary holds operations 1–5; the replica holds only operations 1–4 and needs to catch up.]

  8. Primary Replica File Based Sync
    [Diagram: the same state, with file-based sync, where whole files are copied from primary to replica.]

  9. Primary Replica File Based Sync
    [Diagram: file-based sync copies the files even though the replica is only one operation behind.]

  10. Primary Replica Seq# Based Sync
    [Diagram: with seq#-based sync the primary can ship just operation 5, the one the replica is missing; a sketch of this follows.]
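
    A sketch of the seq#-based sync idea (illustrative, not the actual implementation): the replica reports the highest sequence number it has processed, and the primary ships only the operations above that point instead of copying whole files.

    import java.util.*;

    class SeqNoBasedSync {
        // Primary side: the recent history of operations, keyed by seq#.
        private final NavigableMap<Long, String> history = new TreeMap<>();

        void track(long seqNo, String operation) {
            history.put(seqNo, operation);
        }

        // Everything the replica is missing, i.e. all operations above the
        // highest seq# the replica reports having processed.
        Collection<String> opsToShip(long replicaMaxSeqNo) {
            return history.tailMap(replicaMaxSeqNo, false).values();
        }
    }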

  11. Indexing essentials

  12. Indexing essentials
    [Diagram: a client C and three nodes; node 1 holds 0P and 1R, node 2 holds 0R and 1P, node 3 holds 0R and 1R (P marks the primary copy of a shard, R a replica copy).]

  13.–17. Indexing essentials (continued)
    [The same diagram, animated: an indexing request from the client is routed to the relevant primary, which indexes it and forwards it to that shard's replica copies; a rough sketch of this flow follows.]
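
    For context, Elasticsearch routes each document to a shard by hashing its routing key (the document id by default) modulo the number of primary shards; the operation is indexed on that shard's primary and then sent on to its replica copies. A rough sketch of that model (illustrative code, not the actual implementation, which uses Murmur3 for hashing):

    import java.util.List;

    class IndexingEssentials {
        interface Copy { void apply(String docId, String source); }

        private final int numberOfShards;

        IndexingEssentials(int numberOfShards) {
            this.numberOfShards = numberOfShards;
        }

        // Pick the shard that owns this document.
        int shardFor(String routingKey) {
            return Math.floorMod(routingKey.hashCode(), numberOfShards);
        }

        // The primary applies the operation first, then forwards it to each replica.
        void index(String docId, String source, Copy primary, List<Copy> replicas) {
            primary.apply(docId, source);
            for (Copy replica : replicas) {
                replica.apply(docId, source);
            }
        }
    }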

  18.–22. Concurrent Indexing
    [Animation: a primary and two replicas index concurrently. Operation 1 reaches every copy; operations 2 and 3 are replicated concurrently, so the primary holds 1, 2, 3 while one replica holds 1, 2 and the other holds 1, 3. The last slide removes the primary, leaving two replicas with different histories.]

  23. Requirements
    • Correct :)

    • Fault tolerant

    • Support concurrency

  24. For example, Raft
    Consensus Algorithm

  25. Raft Consensus Algorithm
    • Built to be understandable

    • Leader based

    • Modular (election + replication)

    • See https://raftconsensus.github.io/

    • Used by Facebook’s HBase port &
    Algolia for data replication
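
    For reference, replication in Raft happens through the appendEntries RPC. A sketch of its arguments, with field names taken from the Raft paper (not from any particular implementation):

    import java.util.List;

    record LogEntry(long term, byte[] payload) {}

    record AppendEntriesRequest(
        long term,              // leader's current term
        String leaderId,        // so followers can redirect clients
        long prevLogIndex,      // index of the entry immediately preceding the new ones
        long prevLogTerm,       // term of that entry, used for the consistency check
        List<LogEntry> entries, // new entries to append (empty for heartbeats)
        long leaderCommit       // highest index the leader knows to be committed
    ) {}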

  26. Raft - appendEntries
    [Diagram: the leader replicates entries 1 and 2 to both replicas via appendEntries; each entry is tagged with the term it was written in (entry 1 in term t-1, entry 2 in term t).]

  27. Raft - commit on quorum
    [Diagram: once an entry is on a quorum of copies (the leader counts as one), the leader marks it committed.]

  28. Raft - broadcast* commit
    [Diagram: the leader tells the replicas about the new commit point (c=2), typically piggybacked on later appendEntries calls; see the commit-point sketch below.]
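
    A sketch of the quorum-commit rule from the Raft paper (illustrative code): an entry is committed once a majority of copies store it, so with matchIndex[i] being the highest entry known to be on copy i (the leader counted as one copy), the new commit point is the median of those values.

    import java.util.Arrays;

    class QuorumCommit {
        static long commitIndex(long[] matchIndex) {
            long[] sorted = matchIndex.clone();
            Arrays.sort(sorted);
            // This position is covered by at least a majority of the copies.
            // (Raft additionally requires the entry there to be from the
            // leader's current term before it may be committed.)
            return sorted[(sorted.length - 1) / 2];
        }
    }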

  29. Raft - primary failure
    [Diagram: the leader appends and replicates entry 3; one replica now holds 1, 2, 3 while the other still holds only 1, 2.]

  30. Raft - ack on quorum
    [Diagram: entry 3 is on a quorum (leader plus one replica) and is therefore acknowledged, but a _get of that document served by the lagging replica does not see it yet.]

  31.–32. Raft - primary failure
    [Diagrams: the leader fails, leaving one replica with 1, 2, 3 and the other with 1, 2; the election must pick the replica that holds the acknowledged entry 3.]

  33. Raft - concurrent indexing?
    [Diagram: with concurrent replication a replica could receive entry 3 before entry 2, holding 1 and 3 with a gap, which Raft's in-order, gap-free log and appendEntries consistency check do not allow.]

  34. Raft
    • Simple to understand

    • Quorum means:

    • Lagging shards don’t slow down indexing


    but

    • Read visibility issues

    • Tolerates up to quorum - 1 failures

    • Needs at least 3 copies for correctness

    • Challenges with concurrency

  35. Master-Backup replication

  36. Master Backup Replication
    • Leader based

    • Writes to all copies before ack-ing.

    • Used by Elasticsearch, Kafka,
    RAMCloud (and many others)
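
    A sketch of that write path (illustrative, not the actual implementation): the primary applies the operation, forwards it to every replica copy, and only then acknowledges the client. The sequence number here is assigned by the primary, which is its natural home in this model.

    import java.util.List;

    class PrimaryBackupWrite {
        interface Copy { void apply(long seqNo, String docId, String source); }

        private long nextSeqNo = 0; // assigned by the primary, one per operation

        void index(String docId, String source, Copy primary, List<Copy> replicas) {
            long seqNo = nextSeqNo++;
            primary.apply(seqNo, docId, source);
            for (Copy replica : replicas) {  // write to ALL copies ...
                replica.apply(seqNo, docId, source);
            }
            // ... and only now acknowledge the client.
        }
    }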

  37.–38. Master-Backup - indexing
    [Diagrams: an operation is indexed on the primary and on both replicas before it is acknowledged.]

  39.–40. Master-Backup - concurrency/failure
    [Diagrams: operations 2 and 3 are replicated concurrently; the primary holds 1, 2, 3, one replica holds 1, 3 and the other holds 1, 2. When the primary fails, the two replicas are left with different histories, the same problem seen earlier.]

  41. Master-Backup replication
    • Simple to understand

    • Write to all before ack means:

    • No read visibility issues

    • Tolerates up to N-1 failures


    but

    • A lagging shard slows indexing down (until failed)

    • Easier to work with concurrency

    • Rollbacks on failure are more frequent

    • No clear commit point

  42. Failure, Rollback and Commitment

  43. 3 histories
    [Diagram: three copies (a primary and two replicas), each holding operations 1–5.]

  44.–45. Failure, Rollback and Commitment
    [Diagrams: the primary holds operations 1–9; with concurrent replication one replica holds 1–5, 7 and 9 while the other holds 1–6, 8 and 9, so each copy has a different set of in-flight operations.]

  46. Primary knows what's "safe"
    [Diagram: because the primary only acknowledges once all copies have responded, it knows up to which point the history is complete on every copy (here operation 5), and that prefix is "safe".]

  47. Replicas have a lagging "safe" point
    [Diagram: replicas learn the safe point from the primary, so their view of it lags slightly behind; a sketch of tracking it follows.]
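
    A sketch of how a primary could track that safe point (illustrative code; this is essentially the idea that later shipped in Elasticsearch as the global checkpoint): the primary records, per copy, the highest seq# up to which that copy has processed every operation, and the safe point is the minimum across all copies. Replicas are told this value lazily, which is why theirs lags.

    import java.util.HashMap;
    import java.util.Map;

    class SafePointTracker {
        // Per copy: the highest seq# up to which that copy has processed everything.
        private final Map<String, Long> maxCompletedSeqNo = new HashMap<>();

        void updateCopy(String copyId, long seqNo) {
            maxCompletedSeqNo.merge(copyId, seqNo, Math::max);
        }

        // Operations at or below this point are on every tracked copy and will
        // never need to be rolled back.
        long safePoint() {
            return maxCompletedSeqNo.values().stream()
                    .mapToLong(Long::longValue)
                    .min()
                    .orElse(-1L);
        }
    }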

  48. Final words
    • Design is pretty much nailed down

    • Working on the nitty-gritty
    implementation details

  49. thank you!
    https://elastic.co
    https://github.com/elastic/elasticsearch
