
Consensus and Replication in Elasticsearch

Elastic Co
March 08, 2017


Consensus algorithms are foundational to distributed systems. Choosing among Paxos and its many variants ultimately determines the performance and fault-tolerance of the underlying system. Boaz, Jason, and Yannick will discuss the basic mechanics of quorum-based consensus algorithms as well as their tradeoffs compared to the primary-backup approach – both of which are used by Elasticsearch. They will show how these two layers work together to facilitate cluster state changes and data replication while guaranteeing speed and safety. They will finish with a deep dive into the data replication layer and the recent addition of sequence numbers, which are the foundation of faster operation-based recoveries and cross-data-center replication.

Boaz Leskes l Software Engineer l Elastic
Jason Tedor l Software Engineer l Elastic
Yannick Welsch l Software Engineer l Elastic



Transcript

  1. Elasticsearch Engineers
    March 8th, 2017
    Consistency and Replication in Elasticsearch
    Boaz Leskes
    @bleskes
    Yannick Welsch
    @ywelsch
    Jason Tedor
    @jasontedor


  2. 2
    The Perfect Cluster


  3. • Indexes lots of data fast
    • Returns near real time results
• Answers any question from anyone
    • Resilient to a reasonable
    amount of failures
    The Perfect Cluster
    3


  4. The Perfect Cluster
    4
    GET _search
    PUT index/type/1
    {
    "f": "text"
    }
    PUT index/type/2
    {
    "f": "other text"
    }
    GET _search


  5. The Perfect Cluster
    5
    GET _search
    PUT index/type/1
    {
    "f": "text"
    }
    PUT index/type/2
    {
    "f": "other text"
    }
    GET _search


  6. The Perfect Cluster
    6
    GET _search GET _search
    PUT index/type/1
    {
    "c": 1
    }
    PUT index/type/1
    {
    "c": 2
    }


  7. The Perfect Cluster
    7
    GET _search GET _search
    PUT index/type/1
    {
    "c": 1
    }
    PUT index/type/2
    {
    "c": 2
    }


  8. The Perfect Cluster
    8
    GET _search GET _search
    PUT index/type/1
    {
    "c": 1
    }
    PUT index/type/2
    {
    "c": 2
    }


  9. 9
    Quorum based algorithms
    • Paxos
    • Viewstamped Replication
    • Raft
    • Zab


  10. Quorum-based Algorithms
    10
    POST new_index
    {
    "settings": {}
    }
    200 OK


  11. Quorum-based Algorithms
    11
    200 OK
    POST new_index
    {
    "settings": {}
    }
    ?
    ?


  12. A QUORUM OF 2 IS 2
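A quick way to see why a quorum of 2 is 2: a strict-majority quorum of n copies needs floor(n/2) + 1 acknowledgements, so with only two copies you cannot tolerate losing either of them. A minimal sketch of the arithmetic (illustrative Python, not Elasticsearch code):

    # Strict-majority quorum sizes and the failures they tolerate.
    def quorum(n: int) -> int:
        """Smallest number of copies that forms a strict majority of n."""
        return n // 2 + 1

    for copies in (1, 2, 3, 5):
        q = quorum(copies)
        # With n copies and quorum q, up to n - q copies may fail.
        print(f"{copies} copies -> quorum {q}, tolerates {copies - q} failure(s)")

    # 2 copies -> quorum 2, tolerates 0 failures: a quorum of 2 is 2.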


  13. • Good for coordinating many nodes
• Good for light reads
    • Requires three copies or more
    Summary - Quorum Based Consistency Algorithms
    13
    • Good for Cluster State coordination
    • Problematic for storing data
    ZenDiscovery


  14. • Why do we need 3 copies exactly?
    • Can’t we operate without the extra copy?
    Retrospective - Quorums and Data
    14


  15. 15
    Primary - Backup
    Replication


  16. • well-known model for replicating data
    • main copy of the data called primary
    • additional backup copies
    • Elasticsearch:
    • primary and replica shards
    • number of replicas is configurable & dynamically adjustable
    • fully-automated shard management
    Primary - Backup replication
    16
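The replica count really is a live setting: it can be raised or lowered on an existing index through the index settings API. A minimal sketch using Python's requests library; the host, index name and new value are placeholders, not from the talk:

    # Dynamically adjust the number of replica shards of an existing index.
    import requests

    resp = requests.put(
        "http://localhost:9200/my_index/_settings",    # placeholder host and index
        json={"index": {"number_of_replicas": 2}},     # add or drop backup copies
    )
    resp.raise_for_status()
    print(resp.json())                                 # {'acknowledged': True}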


17. • data flows from primary to replicas
    • synchronous replication
    • write to all / read from any
    • can lose all but one copy
    Basic flow for write requests
    17
[Diagram: primary and replica copies of several shards spread across nodes 1-5]
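A toy model of the write path sketched on this slide: the primary indexes the operation locally, replicates it synchronously to every replica, and only then acknowledges; any copy can serve reads. This is a simplified illustration, not the actual Elasticsearch implementation:

    # Toy primary-backup write path: write to all copies, read from any.
    class Shard:
        def __init__(self):
            self.docs = {}

        def index(self, doc_id, doc):
            self.docs[doc_id] = doc

    class PrimaryShard(Shard):
        def __init__(self, replicas):
            super().__init__()
            self.replicas = replicas

        def handle_write(self, doc_id, doc):
            self.index(doc_id, doc)            # index locally first
            for replica in self.replicas:      # then replicate synchronously
                replica.index(doc_id, doc)
            return "acknowledged"              # ack only once all copies have it

    replica = Shard()
    primary = PrimaryShard([replica])
    primary.handle_write("1", {"f": "text"})
    assert replica.docs["1"] == {"f": "text"}  # read from any copy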


  18. What if a replica is unavailable?
    18
[Diagram: shard 0 with primary on node 2 and replica on node 3; nodes 1 and 4 hold no copy]
1 primary, 1 replica, containing the same data


  19. What if a replica is unavailable?
    19
[Diagram: node 3 and its replica copy of shard 0 go offline]
    Take down node 3 for maintenance


  20. What if a replica is unavailable?
    20
[Diagram: a write is indexed into the primary on node 2 while node 3 is down]
    Index into primary


  21. What if a replica is unavailable?
    21
[Diagram: the offline replica on node 3 does not receive the acknowledged write]
    Replica misses acknowledged write


  22. What if a replica is unavailable?
    22
[Diagram: node 2, which holds the primary, crashes]
    Node with primary crashes


  23. What if a replica is unavailable?
    23
[Diagram: node 3 comes back up with its stale copy of shard 0]
    Node with stale shard copy comes back up


  24. What if a replica is unavailable?
    24
[Diagram: the stale copy on node 3 is the only copy left]
    DATA LOSS
    Stale shard copy should not become primary


  25. • Master is responsible for allocating shards
    • Decision recorded in cluster state
    • Broadcasted to all the nodes
    • Smart routing of requests based on cluster state
    Shard allocation basics
    25
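Smart routing works because every node derives the same shard for a given document and then uses the cluster state to find a node holding a copy of that shard. A sketch of the shard-selection step; Elasticsearch itself hashes the _routing value (the document id by default) with Murmur3, so md5 here is just a stand-in:

    # Sketch: deterministic document-to-shard routing.
    import hashlib

    def shard_for(routing: str, num_primary_shards: int) -> int:
        digest = hashlib.md5(routing.encode()).digest()    # stand-in hash function
        return int.from_bytes(digest[:4], "big") % num_primary_shards

    # Every node computes the same shard for document id "1", then routes the
    # request to a node listed for that shard in the cluster state.
    print(shard_for("1", 5))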


  26. What if a replica is unavailable?
    26
[Diagram: the master asks "Who has a copy of shard 0?" and "Should I pick the copy on node 3 as primary?"]
    Choose wisely


  27. 27
• Allocation IDs uniquely identify shard copies
  • assigned by master
  • stored next to shard data
• Master tracks subset of copies that are in-sync
  • persisted in cluster state
  • changes are backed by consensus layer
In-sync allocations
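A sketch of the bookkeeping described above: the master hands out allocation IDs and keeps the in-sync set in the cluster state; a copy that missed an acknowledged write is removed from the set, so a stale copy can never be promoted. Class and method names are illustrative, not Elasticsearch's:

    # Sketch: master-side tracking of in-sync shard copies by allocation ID.
    import uuid

    class ClusterState:
        def __init__(self):
            self.in_sync = set()           # allocation IDs safe to promote to primary

        def allocate_copy(self) -> str:
            allocation_id = uuid.uuid4().hex[:5]
            self.in_sync.add(allocation_id)
            return allocation_id

        def mark_stale(self, allocation_id: str):
            # Requested by the primary before acknowledging a write this copy missed.
            self.in_sync.discard(allocation_id)

        def can_promote(self, allocation_id: str) -> bool:
            return allocation_id in self.in_sync

    state = ClusterState()
    primary_id, replica_id = state.allocate_copy(), state.allocate_copy()
    state.mark_stale(replica_id)               # replica missed an acknowledged write
    assert not state.can_promote(replica_id)   # stale copy must not become primary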


  28. Tracking in-sync shard copies
    28
[Diagram: shard 0 primary (allocation ID 9154f) and replica (6f91c) on nodes 2 and 3; every node's cluster state lists in-sync: [9154f, 6f91c]]
1 primary, 1 replica, this time with allocation IDs


  29. Tracking in-sync shard copies
    29
[Diagram: node 3 goes down; the cluster state still lists in-sync: [9154f, 6f91c]]
Take down node 3 for maintenance


  30. Tracking in-sync shard copies
    30
[Diagram: a write is indexed into the primary while node 3 is down; in-sync: [9154f, 6f91c]]
Index into primary


  31. Tracking in-sync shard copies
    31
[Diagram: the primary asks the master to remove allocation ID 6f91c from the in-sync set]
Master, please mark this other shard copy as stale


  32. Tracking in-sync shard copies
    32
[Diagram: the master publishes the updated cluster state with in-sync: [9154f] to all nodes]
Everyone should know about in-sync copies


  33. Tracking in-sync shard copies
    33
[Diagram: with the cluster state updated to in-sync: [9154f], the write is acknowledged to the client]
Acknowledge write to client


  34. Tracking in-sync shard copies
    34
[Diagram: node 2, which holds the primary (9154f), crashes; in-sync: [9154f]]
Node 2 crashes


  35. Tracking in-sync shard copies
    35
[Diagram: master: "Shard copy with id 6f91c is not in the in-sync list", so the stale copy is not promoted]
Wiser master makes smarter decisions


  36. • data replication based on primary-backup model
    • number of replicas configurable and dynamically adjustable
    • also works with just two shard copies
    • backed by cluster consensus to ensure safety
    • writes synchronous and go to all copies, read from any copy
    Summary
    36


  37. 37
    Staying in sync


  38. Primary/Backup (super quick review)
    38
    One primary, one replica
    primary replica


  39. Primary/Backup (super quick review)
    39
    Indexing request hits primary
    primary replica


  40. Primary/Backup (super quick review)
    40
    After primary indexes, send replicas request
    primary replica


  41. Primary/Backup (super quick review)
    41
    An offline replica will miss replica requests
    primary replica


  42. Primary/Backup (super quick review)
    42
    Help, I’ve fallen out of sync and I can’t get up
    primary replica


  43. 43
    Mind of their own


44. • Peer recovery in Elasticsearch is file-based
    • Upon recovery, the primary and replica compare lists of files (and checksums)
    • The primary sends the replica the files that the replica is missing
    • When recovery completes, the two shard copies have the same files on disk
    Peer recovery
    44
    Bring yourself back online
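A sketch of the file-based comparison: the replica reports its files and checksums, and the primary sends whatever differs or is missing. File names and the checksum function are illustrative:

    # Sketch: file-based peer recovery copies files whose checksum differs.
    import zlib

    def files_to_send(primary_files: dict, replica_files: dict) -> list:
        replica_sums = {name: zlib.crc32(data) for name, data in replica_files.items()}
        return [
            name
            for name, data in primary_files.items()
            if replica_sums.get(name) != zlib.crc32(data)
        ]

    primary = {"segments_1": b"...", "_0.cfs": b"lucene data", "_1.cfs": b"more data"}
    replica = {"segments_1": b"...", "_0.cfs": b"stale data"}
    print(files_to_send(primary, replica))     # ['_0.cfs', '_1.cfs']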


  45. • Shards in Elasticsearch are independent Lucene indices
    • This means that it is very unlikely that the files on disk are in sync, even when a replica is online
• Synced flush mitigates this problem for idle shards
    • Shards seeing active indexing are unlikely to have identical files, so recovery is often a full copy of the shard
    Peer recovery
    45
    If you have to copy gigabytes of data, you’re going to have a bad time


  46. 46
    A Different
    Approach


  47. Operation-based approach
    47
    Stop thinking about files
    primary replica


  48. Operation-based approach
    48
    Start thinking about operations
    primary replica


  49. Operation-based approach
    49
    We need a reliable way to identify missing operations
    primary replica


  50. Operation-based approach
    50
    Introducing sequence numbers
[Diagram: operations on the primary and replica are tagged with sequence numbers 0, 1, 2, ...]
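The core idea: the primary stamps every operation with a monotonically increasing sequence number, and that number travels with the operation to the replicas. A minimal sketch, not the real implementation:

    # Sketch: the primary assigns each operation a monotonically increasing seq no.
    class Primary:
        def __init__(self):
            self.next_seq_no = 0
            self.ops = {}                      # seq_no -> operation

        def index(self, op):
            seq_no = self.next_seq_no
            self.next_seq_no += 1
            self.ops[seq_no] = op
            return seq_no                      # replicated alongside the operation

    p = Primary()
    print([p.index({"doc": d}) for d in ("a", "b", "c")])   # [0, 1, 2]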


  51. 51
    Get well soon!


  52. Fast recovery
    52
    One primary, one replica (redux)
[Diagram: primary and replica both hold operations 0-4]


  53. Fast recovery
    53
    Replica goes offline
[Diagram: the replica, still holding operations 0-4, goes offline]


  54. Fast recovery
    54
    Indexing continues on the primary
[Diagram: the primary now holds operations 0-9; the offline replica still holds only 0-4]


  55. Fast recovery
    55
    The return of the replica
[Diagram: the replica (operations 0-4) comes back and talks to the primary (operations 0-9)]
• Replica and primary compare operations


  56. Fast recovery
    56
    Special delivery for Mr. Replica
[Diagram: the primary sends operations 5-9 to the returning replica]
• Replica and primary compare operations
• Primary sends all missing operations to the replica


  57. Fast recovery
    57
    Recovery is complete
[Diagram: primary and replica both hold operations 0-9 again]
• Replica and primary compare operations
• Primary sends all missing operations to the replica
• Replica is now in sync with the primary
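Recovery then becomes "send me what I have not seen": the returning replica tells the primary which operations it already has, and the primary replays only the missing ones. A simplified sketch; the real mechanism compares checkpoints and replays from the primary's translog:

    # Sketch: operation-based recovery replays only the operations the replica lacks.
    def recover(primary_ops: dict, replica_ops: dict) -> dict:
        missing = {seq: op for seq, op in primary_ops.items() if seq not in replica_ops}
        replica_ops.update(missing)            # replay missing operations on the replica
        return missing

    primary_ops = {seq: f"op{seq}" for seq in range(10)}    # 0..9
    replica_ops = {seq: f"op{seq}" for seq in range(5)}     # 0..4, was offline
    replayed = recover(primary_ops, replica_ops)
    print(sorted(replayed))                    # [5, 6, 7, 8, 9]
    assert replica_ops == primary_ops          # replica is back in sync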


  58. 58
    Shared history


  59. History and concurrency
    59
    Assign sequence numbers on the primary
[Diagram: a primary and two replicas; operation 0 is assigned sequence number 0 on the primary and replicated]


  60. History and concurrency
    60
    Histories develop independently on each shard
[Diagram: the copies apply operations at different times, so each shard's history develops independently]


  61. History and concurrency
    61
    Concurrent requests can lead to divergent histories
[Diagram: with concurrent requests, the copies hold different subsets of operations 0-2 at any point in time]


  62. History and concurrency
    62
    The plot thickens
[Diagram: the primary holds operations 0-9; replica1 holds 0-4, 7 and 9; replica2 holds 0-6, 8 and 9]


  63. History and concurrency
    63
    Those who cannot remember the past are condemned to repeat it
[Diagram: the old primary (operations 0-9) is gone; replica1, promoted to primary, holds 0-4, 7 and 9; replica2 holds 0-6, 8 and 9]


  64. • We need to track what part of history is complete
    • The local checkpoint for a shard is the largest sequence number below which history is complete on that shard
    • Persisted in each Lucene commit point
    History and concurrency
    64
    Local checkpoints
    A local checkpoint is maintained on each shard
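In other words, the local checkpoint sits at the top of the gap-free prefix of a shard's history: every operation at or below it has been processed. A small sketch of that computation:

    # Sketch: local checkpoint = highest seq no with no gaps below it.
    def local_checkpoint(seq_nos: set) -> int:
        checkpoint = -1                        # -1 means nothing processed yet
        while checkpoint + 1 in seq_nos:       # advance while history has no gaps
            checkpoint += 1
        return checkpoint

    print(local_checkpoint({0, 1, 2, 3, 4, 7, 9}))   # 4 - operations 5 and 6 missing
    print(local_checkpoint(set(range(10))))          # 9 - history complete up to 9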


  65. History and concurrency
    65
    Local checkpoints in action
[Diagram: each shard copy's local checkpoint marked at the top of its gap-free prefix of history]


66. • We cannot keep history forever
    • We need a global safety marker that tells us how far back to keep history
    • The primary's knowledge of the minimum of the local checkpoints across all in-sync shard copies can serve as this safe point
    • Replicated from the primary to the replicas so each shard has local knowledge of the global checkpoint
    History and concurrency
    66
    Global checkpoints
    A global checkpoint is maintained on the primary
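The global checkpoint is then simply the minimum of the local checkpoints the primary knows for its in-sync copies: everything at or below it is on every in-sync copy, and history above it must be retained for possible re-sync. A sketch with illustrative numbers:

    # Sketch: global checkpoint = min local checkpoint across in-sync copies,
    # tracked on the primary and broadcast back to the replicas.
    def global_checkpoint(local_checkpoints: dict) -> int:
        return min(local_checkpoints.values())

    in_sync_local_checkpoints = {"primary": 9, "replica1": 4, "replica2": 6}
    print(global_checkpoint(in_sync_local_checkpoints))   # 4 - keep history above 4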


  67. History and concurrency
    67
    Local and global checkpoints in action
[Diagram: local checkpoints marked on each copy, plus the global checkpoint, which lags behind them]


  68. History and concurrency
    68
    I’m rolling, I’m rolling, I’m rolling
[Diagram: operations above the global checkpoint are rolled back on the shard copies]


  69. History and concurrency
    69
    Re-sync from the promoted primary
[Diagram: the promoted primary (0-4, 7, 9) re-syncs its operations above the global checkpoint to replica2, which now also holds 0-4, 7 and 9]


  70. History and concurrency
    70
    Fill gaps under the mandate of the new primary
[Diagram: the new primary and the replica fill gaps 5, 6 and 8 so that history is complete on both copies]
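Putting the checkpoints to work on primary failure, roughly: a replica is promoted, the other copies take the new primary's history above the global checkpoint (discarding anything that diverged), and the new primary fills its own gaps; the real implementation fills gaps with no-op operations and bumps the primary term. A rough sketch under those assumptions:

    # Rough sketch of failover: re-sync above the global checkpoint, then fill gaps.
    def promote_and_resync(new_primary: dict, replica: dict, global_cp: int):
        # The replica drops its own operations above the global checkpoint and
        # takes the newly promoted primary's operations instead.
        for seq in [s for s in replica if s > global_cp]:
            del replica[seq]
        replica.update({s: op for s, op in new_primary.items() if s > global_cp})

    def fill_gaps(ops: dict):
        # Stand-in for filling gaps with no-op operations.
        for seq in range(max(ops)):
            ops.setdefault(seq, "no-op")

    new_primary = {s: f"op{s}" for s in [0, 1, 2, 3, 4, 7, 9]}
    replica2    = {s: f"op{s}" for s in [0, 1, 2, 3, 4, 5, 6, 8, 9]}
    promote_and_resync(new_primary, replica2, global_cp=4)
    fill_gaps(new_primary)
    fill_gaps(replica2)
    print(sorted(new_primary) == sorted(replica2) == list(range(10)))   # True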


  71. 71
    Wait for it


  72. • Fast replica recovery (6.0.0)
    • Re-sync on primary failure (6.0.0)
    • Cross-datacenter replication (6.x)
    • Changes API (tbd)
    Higher-level features
    72
Sequence numbers enable big-league features


  73. 73
    The path ahead


74. • Elasticsearch uses a quorum-based algorithm for cluster state consensus, which is good for dealing with lots of nodes and light reads
    • Elasticsearch uses the primary/backup model for data replication, which can run with one or two copies, supports concurrency, and saves on storage costs
    • Elasticsearch 6.0.0 will introduce sequence numbers, which will enable high-level features like faster recovery, the changes API, and cross-datacenter replication
    Summary
    74


  75. 75
    Meta Issue


  76. 76
\* Index request arrives on node n with document id docId
ClientRequest(n, docId) ==
  /\ n \notin crashedNodes \* only non-crashed nodes can accept requests
  /\ clusterStateOnNode[n].routingTable[n] = Primary \* node believes itself to be the primary
  /\ LET
       replicas    == Replicas(clusterStateOnNode[n].routingTable)
       primaryTerm == currentTerm[n]
       tlogEntry   == [id    |-> docId,
                       term  |-> primaryTerm,
                       value |-> nextClientValue,
                       pc    |-> FALSE]
       seq         == MaxSeq(tlog[n]) + 1
    https://github.com/elastic/elasticsearch-tla and https://lamport.azurewebsites.net/tla/tla.html
    Formalizing the model


77. • Lamport, The Part-Time Parliament, ACM TCS, 1998 (https://goo.gl/w0IfsB)
    • Oki et al., Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems, SPDC, 1988 (https://goo.gl/8ZxcNe)
    • Ongaro et al., In Search of an Understandable Consensus Algorithm, ATC, 2014 (https://goo.gl/L5hXSh)
    • Junqueira et al., Zab: High-performance broadcast for primary-backup systems, DSN, 2011 (https://goo.gl/CxtesU)
    • Lin et al., PacificA: Replication in Log-Based Distributed Storage Systems, MSR-TR, 2008 (https://goo.gl/iW5FL3)
    • Alsberg et al., A Principle for Resilient Sharing of Distributed Resources, ICSE, 1976 (https://goo.gl/5GmPFN)
    References
    77


78. • There are only two hard problems in distributed systems: 2. Exactly-once delivery 1. Guaranteed order of messages 2. Exactly-once delivery
    • And hiring people!
    • We’re looking: https://goo.gl/VjiNjZ
    Join us!
    78


  79. 79
    More Questions?
    Visit us at the AMA
