Diving into Elasticsearch Discovery

Shikhar Bhushan

June 02, 2015

  1. Diving into Elasticsearch Discovery Shikhar Bhushan at Berlin Buzzwords 2015

  3. 1.4M active sellers, 20.8M active buyers (as of Mar 31, 2015)
  4. Search Infrastructure at Etsy

    Unsharded Solr Master/Slave; hand-sharded Solr Master/Slave; Elasticsearch. Our largest indexes are on Elasticsearch, ~1 TB.
  7. “One winds on the distaff what the other spins” (both spread gossip), by Pieter Bruegel the Elder

    (cluster) http://en.wikipedia.org/wiki/Gossip_protocol
  8. ?

  9. Pluggable; API backwards-compatibility not guaranteed

  10. eskka: Elasticsearch discovery using Akka Cluster

    (Figure: Akka Cluster state diagram for member states)
  11. InternalClusterService

  12. transient state

  13. persistent state

  14. transient state

  15. node discovery, leader election, state publishing, failure detection / handling

  16. zen & eskka: properties

  17. node discovery, master election, state publishing, failure detection / handling

  18. node discovery

    zen: Unicast mode uses a static list of ‘gossip routers’; multicast mode uses a multicast address. Batching of state updates from membership changes (in recent releases).

    eskka: Static list of seed nodes: ‘contact points for new nodes joining the cluster’. Batching of state updates from membership changes.
  19. node discovery, leader election, state publishing, failure detection / handling

  20. leader election

    zen: Master-eligible node with the lowest node ID. Handling of edge cases improved in ES 1.4 (#2488).

    eskka: Akka ‘Cluster Singleton’, i.e. the oldest master-eligible cluster member. Edge cases around fail-over are handled with timeouts.
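
The two election rules above can be sketched in a few lines. This is an illustrative model only, not code from Elasticsearch or eskka: the `Node` type, its `joined_at` field, and both function names are hypothetical, and the quorum check in the zen-style function is an assumption based on the `minimum_master_nodes` setting mentioned elsewhere in the deck.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Node:
    node_id: str           # hypothetical identifier; the zen rule compares node IDs
    master_eligible: bool
    joined_at: float       # hypothetical join time, used for the 'oldest member' rule


def elect_lowest_id(nodes: Iterable[Node], minimum_master_nodes: int) -> Optional[Node]:
    """zen-style rule: lowest node ID among visible master-eligible nodes.

    Assumption: no master is elected unless at least minimum_master_nodes
    master-eligible nodes are visible.
    """
    candidates = [n for n in nodes if n.master_eligible]
    if len(candidates) < minimum_master_nodes:
        return None
    return min(candidates, key=lambda n: n.node_id)


def elect_oldest(nodes: Iterable[Node]) -> Optional[Node]:
    """eskka-style rule: the oldest master-eligible cluster member."""
    candidates = [n for n in nodes if n.master_eligible]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n.joined_at)
```

Both rules are deterministic given the same membership view, so the hard part in practice is agreeing on that view, which is where the edge cases mentioned on the slide come from.
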
  21. ES 1.2 with Zen, minimum_master_nodes configured correctly, meant to use unicast discovery but multicast was not turned off. ?!
  22. node discovery, leader election, state publishing, failure detection / handling

  23. state publishing

    zen: Internal ES transport. Serialized & compressed. Blocks up to ‘discovery.zen.publish_timeout’ (30s default), but there is no consequence to a timeout.

    eskka: Akka Remoting. Serialized, compressed & chunked. Asynchronous state publishing.
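
The zen behaviour described above (wait up to a timeout, then carry on regardless) can be modelled roughly as follows. This is a sketch under stated assumptions, not the Elasticsearch implementation: `publish`, `send`, and the node names are hypothetical, and a thread pool stands in for the transport layer.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def publish(state, nodes, send, publish_timeout):
    """Send `state` to every node and block for at most `publish_timeout` seconds.

    Returns the nodes that had not responded when the timeout expired.
    The caller proceeds either way, mirroring 'no consequence to timeout'.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(nodes)))
    futures = {pool.submit(send, node, state): node for node in nodes}
    _, not_done = wait(futures, timeout=publish_timeout)
    pool.shutdown(wait=False)  # do not block on stragglers
    return sorted(futures[f] for f in not_done)


def demo_send(node, state):
    # Hypothetical transport: one node is slow to apply the new state.
    if node == "slow-node":
        time.sleep(0.3)
    return node
```

For example, `publish({"version": 2}, ["node-a", "slow-node"], demo_send, publish_timeout=0.1)` returns after roughly 0.1 seconds and reports `slow-node` as outstanding, while the state is still delivered to it in the background.
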
  25. node discovery, leader election, state publishing, failure detection / handling

  26. failure detection

    zen: Master monitors all nodes with pings; all other nodes monitor the master with pings. Knobs around retries and timeouts.

    eskka: All nodes partake in monitoring heartbeats. Knobs for failure certainty* and acceptable heartbeat pause time. A quorum of seed nodes decides the availability of an unreachable node.

    * Phi Accrual Failure Detector
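
To make the two eskka knobs concrete, here is a simplified phi accrual detector. It is a sketch, not Akka's implementation: it models heartbeat intervals with a normal distribution, and the class name, the `min_std` floor, and all default values are assumptions chosen for illustration. `threshold` plays the role of the failure-certainty knob and `acceptable_pause` the acceptable heartbeat pause.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Simplified phi accrual failure detector (normal-distribution model)."""

    def __init__(self, threshold=8.0, acceptable_pause=0.0, min_std=0.1, window=100):
        self.threshold = threshold                # 'failure certainty' knob
        self.acceptable_pause = acceptable_pause  # tolerated extra silence, in seconds
        self.min_std = min_std                    # floor so regular heartbeats do not give std = 0
        self.intervals = deque(maxlen=window)     # recent heartbeat inter-arrival times
        self.last_heartbeat = None

    def heartbeat(self, now):
        """Record a heartbeat observed at time `now` (seconds)."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        """Suspicion level: -log10 of the probability that a heartbeat is merely late."""
        if not self.intervals:
            return 0.0
        n = len(self.intervals)
        mean = sum(self.intervals) / n
        variance = sum((x - mean) ** 2 for x in self.intervals) / n
        std = max(math.sqrt(variance), self.min_std)
        elapsed = now - self.last_heartbeat
        z = (elapsed - (mean + self.acceptable_pause)) / std
        p_later = 0.5 * math.erfc(z / math.sqrt(2))  # P(interval > elapsed)
        return math.inf if p_later == 0.0 else -math.log10(p_later)

    def is_available(self, now):
        return self.phi(now) < self.threshold
```

Unlike a fixed ping timeout, the output is a continuous suspicion value, so the same history can be read with a stricter or looser threshold depending on how costly a false positive is.
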
  27. minority partitions

    zen: minimum_master_nodes constraint violated => we are on a minority partition

    eskka: quorum of seed nodes unreachable => we are on a minority partition
  28. failure handling

    Failure detection is a best guess. Once decided:
    • if on a minority partition, either block all operations (no_master_block=all) or write operations only (no_master_block=write)
    • remove the suspect from the cluster
    • fail over the master if required
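
The minority-partition checks and the resulting operation blocks can be sketched as below. The function names are hypothetical, and treating a seed-node quorum as a strict majority is an assumption; only the `no_master_block` values `all` and `write` come from the slides.

```python
def zen_on_minority(visible_master_eligible: int, minimum_master_nodes: int) -> bool:
    """zen-style check: fewer visible master-eligible nodes than minimum_master_nodes."""
    return visible_master_eligible < minimum_master_nodes


def eskka_on_minority(reachable_seeds: int, total_seeds: int) -> bool:
    """eskka-style check, assuming a quorum means a strict majority of seed nodes."""
    quorum = total_seeds // 2 + 1
    return reachable_seeds < quorum


def blocked_operations(on_minority: bool, no_master_block: str) -> set:
    """Which operation classes a node rejects once it decides it is on a minority partition."""
    if not on_minority:
        return set()
    if no_master_block == "all":
        return {"read", "write"}
    if no_master_block == "write":
        return {"write"}
    raise ValueError(f"unsupported no_master_block value: {no_master_block!r}")
```

With three seed nodes, a node that can reach only one of them is on the minority side, and with `no_master_block=write` it keeps serving reads from possibly stale state while rejecting writes.
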
  29. node discovery, leader election, state publishing, failure detection / handling

  30. Solid ES Discovery ≠ Jepsen Win

  31. What Jepsen tests: an acknowledged write won’t be lost, particularly under partition.

    This has more to do with replication semantics, e.g.
    • What guarantees are implied when you receive an acknowledgment
    • How a primary is selected from the replicas of a shard
  32. If you evaluate ES Discovery as a distributed store, ClusterState is the only document.
  33. ClusterState update safety

  34. UpdateTask :: ClusterState -> ClusterState

    Asynchronously applied from a single thread by InternalClusterService.
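
The idea of update tasks as pure functions from one cluster state to the next, applied one at a time by a single thread, can be sketched like this. It is a minimal model, not the Elasticsearch class: `ClusterService`, `submit`, and `drain` are hypothetical names, and a plain dict stands in for ClusterState.

```python
import queue
import threading
from typing import Callable, Dict

ClusterState = Dict[str, object]
UpdateTask = Callable[[ClusterState], ClusterState]


class ClusterService:
    """Applies update tasks to the cluster state from one worker thread, in order."""

    def __init__(self, initial: ClusterState):
        self._state = dict(initial)
        self._tasks: "queue.Queue[UpdateTask]" = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, task: UpdateTask) -> None:
        """Enqueue a task; it is applied asynchronously."""
        self._tasks.put(task)

    def _run(self) -> None:
        while True:
            task = self._tasks.get()
            try:
                # Only this thread ever writes _state, so tasks never interleave.
                self._state = task(self._state)
            except Exception:
                pass  # in this sketch a failing task leaves the state unchanged
            finally:
                self._tasks.task_done()

    def drain(self) -> ClusterState:
        """Wait until all queued tasks have been applied and return the state."""
        self._tasks.join()
        return self._state
```

Because every task sees the result of the previous one, submitting 100 tasks that each increment a `version` field from many threads still ends at exactly 100, with no locking inside the tasks themselves.
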
  36. ClusterStateUpdateTask: local failure callback on errors in applying update or executing listeners

    ProcessedClusterStateUpdateTask: + local success callback

    TimeoutClusterStateUpdateTask: + local failure callback on timeout

    AckedClusterStateUpdateTask: + success/failure callbacks on ack from other nodes within an ack-timeout
  37. NOT (most metadata update requests do use AckedClusterStateUpdateTask)

  38. System overall seems workable. Ability to replace Elasticsearch Discovery is awesome. Doc replication semantics need work!
  39. Thank you. shikhar@etsy.com, @shikhrr