
Elasticsearch: Distributed Search Under the Hood


We search across vast amounts of data every day. None of that data is stored in a single system, both because of its sheer size and to prevent data loss on node outages. But what does a distributed search look like? How is data that is spread across dozens of nodes aggregated and calculated? How does coordination work while a cluster is in a good state? What happens if things go sideways, and how does recovery within a cluster work? Using Elasticsearch as an example, this talk shows a fair number of examples of distributed communication, computation, storage and statefulness in order to give you a glimpse of what is done to keep data safe and available, no matter whether you run on bare metal, on k8s or on virtualized instances.


Alexander Reelsen

September 30, 2021



  1. Distributed Search... ... Under the Hood Alexander Reelsen alex@elastic.co |

  2. Today's goal: Understanding... Complexities Tradeoffs Simplifications Error scenarios ... distributed

  3. Agenda: Distributed systems - But why?; Elasticsearch; Data; Search; Analytics

  4. The need for distributed systems: Exceeding single system limits (CPU, Memory, Storage); Load sharing; Parallelization (shorter response times); Reliability (SPOF); Price
  5. Load sharing

  6. Load sharing: Synchronization; Coordination & Load balancing

  7. Reliability

  8. Parallelization

  9. Parallelization Reduce? Sort?

  10. Boundaries increase complexity Core Computer LAN WAN Internet

  11. Boundaries increase complexity Core Computer LAN WAN Internet

  12. Boundaries increase complexity Core Computer LAN WAN Internet

  13. Boundaries increase complexity 🗣 Communication ✍ Coordination Error handling

  14. Fallacies of Distributed Computing: The network is reliable; Latency is zero; Bandwidth is infinite; The network is secure; Topology doesn't change; There is one administrator; Transport cost is zero; The network is homogeneous
  15. None
  16. Consensus: Achieving a common state among participants; Byzantine Failures; Trust; Crash; Quorum vs. strictness
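The "quorum" idea from the consensus slide can be illustrated with a majority check. This is a minimal sketch that assumes a simple majority quorum; the function names `quorum` and `can_elect` are made up for illustration and are not Elasticsearch APIs.

```python
def quorum(voting_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return voting_nodes // 2 + 1


def can_elect(reachable_nodes: int, voting_nodes: int) -> bool:
    """A partition may only elect a leader if it holds a majority of the votes."""
    return reachable_nodes >= quorum(voting_nodes)
```

With three voting nodes, two reachable nodes are enough to elect a leader, while a single isolated node is not. Two disjoint partitions can never both reach a majority, which is what prevents two leaders from being elected at the same time.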
  17. Consensus goals: Cluster Membership; Data writes; Security (BTC); Finding a leader (Paxos, Raft)
  18. Elasticsearch introduction

  19. Elasticsearch: Speed. Scale. Relevance. HTTP based JSON interface; Scales to many nodes; Fast responses; Ranked results (BM25, recency, popularity); Resiliency; Flexibility (index time vs. query time); Based on Apache Lucene
  20. Use Cases: E-Commerce, E-Procurement, Patents, Dating; Maps: Geo based search; Observability: Logs, Metrics, APM, Uptime; Enterprise Search: Site & App Search, Workplace Search; Security: SIEM, Endpoint Security
  21. Master

  22. Master node: Pings other nodes; Decides data placement; Removes nodes from the cluster; Not needed for reading/writing; Updates the cluster state and distributes to all nodes; Re-election on failure
  23. Node startup

  24. Elects itself

  25. Node join

  26. Node join

  27. Master node is not reachable

  28. Reelection within remaining nodes

  29. Cluster State: Nodes; Data (Shards on nodes); Mapping (DDL); Updated based on events (node join/leave, index creation, mapping update); Sent as diff due to size
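The "sent as diff" point on the cluster state slide can be pictured with plain dictionaries. This is only a sketch: the state layout, key names and the functions `diff_state` and `apply_diff` are hypothetical, and the real cluster state is far richer than this.

```python
from typing import Any, Dict


def diff_state(old: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, Any]:
    """Describe only what changed between two versions of the state."""
    return {
        "upserts": {k: v for k, v in new.items() if k not in old or old[k] != v},
        "deletes": [k for k in old if k not in new],
    }


def apply_diff(state: Dict[str, Any], diff: Dict[str, Any]) -> Dict[str, Any]:
    """Rebuild the new state on a node that already holds the old one."""
    updated = {k: v for k, v in state.items() if k not in diff["deletes"]}
    updated.update(diff["upserts"])
    return updated


old = {"nodes": ["node-1", "node-2"], "index:products": {"shards": 2}}
new = {
    "nodes": ["node-1", "node-2", "node-3"],   # a node joined
    "index:products": {"shards": 2},           # unchanged, not part of the diff
    "index:logs": {"shards": 1},               # an index was created
}
diff = diff_state(old, new)
```

Only the changed `nodes` entry and the new `index:logs` entry travel over the wire; the unchanged `index:products` entry does not.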
  30. Data distribution: Shard: Unit of work, self-contained inverted index; Index: A logical grouping of shards; Primary shard: Partitioning of data in an index (write scalability); Replica shard: Copy of a primary shard (read scalability)
  31. Primary shards

  32. Primary shards per index

  33. Redistribution

  34. Redistribution

  35. Replicas

  36. None
  37. Distributed search: Two phase approach; Query all shards, collect top-k hits; Sort all search results on coordinating node; Create real top-k result from all results; Fetch data for all real results (top-k instead of shard_count * top-k)
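The two-phase approach from slide 37 can be sketched with in-memory data. The shard layout (a dict from document id to score and stored document) and all function names are assumptions made for illustration; this is not how Elasticsearch stores or transfers data.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical shard: doc id -> (score for the current query, stored document)
Shard = Dict[str, Tuple[float, dict]]


def query_phase(shard: Shard, k: int) -> List[Tuple[float, str]]:
    """Each shard returns only (score, doc_id) pairs for its local top-k hits."""
    return heapq.nlargest(k, ((score, doc_id) for doc_id, (score, _) in shard.items()))


def fetch_phase(shards: List[Shard], hits: List[Tuple[float, int, str]]) -> List[dict]:
    """Load the stored documents for the global top-k hits only."""
    return [shards[shard_no][doc_id][1] for _, shard_no, doc_id in hits]


def search(shards: List[Shard], k: int) -> List[dict]:
    # Phase 1: collect up to shard_count * k lightweight candidates.
    candidates = [
        (score, shard_no, doc_id)
        for shard_no, shard in enumerate(shards)
        for score, doc_id in query_phase(shard, k)
    ]
    # The coordinating node sorts all candidates and keeps the real top-k.
    top = heapq.nlargest(k, candidates)
    # Phase 2: fetch k documents instead of shard_count * k.
    return fetch_phase(shards, top)


shards = [
    {"a": (1.0, {"title": "a"}), "b": (3.0, {"title": "b"})},
    {"c": (2.0, {"title": "c"}), "d": (0.5, {"title": "d"})},
]
```

Calling `search(shards, 2)` moves four small `(score, doc_id)` candidates to the coordinating node but loads only two full documents.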
  38. None
  39. None
  40. None
  41. None
  42. None
  43. None
  44. Adaptive Replica Selection: Which shards are the best to select? Each node keeps information about: response time of prior requests, previous search durations, search threadpool size; Less loaded nodes will receive more queries; More info: Blog post, C3 paper
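The three inputs listed on the adaptive replica selection slide could be combined roughly as follows. The ranking formula here is a deliberately simplified stand-in, not the C3 formula and not the one Elasticsearch uses; `CopyStats`, `ewma`, `rank` and `pick_copy` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict


def ewma(previous: float, sample: float, alpha: float = 0.3) -> float:
    """Exponentially weighted moving average, so recent samples count more."""
    return alpha * sample + (1 - alpha) * previous


@dataclass
class CopyStats:
    response_time_ms: float  # seen by the coordinating node for prior requests
    service_time_ms: float   # search duration reported by the data node
    queue_size: float        # pending work in the search thread pool


def rank(stats: CopyStats) -> float:
    """Lower is better: queued work multiplies the expected service time."""
    return stats.response_time_ms + stats.service_time_ms * (1.0 + stats.queue_size)


def pick_copy(copies: Dict[str, CopyStats]) -> str:
    """Send the request to the node holding the best-ranked shard copy."""
    return min(copies, key=lambda node: rank(copies[node]))
```

A node with a slightly slower network path but an empty queue wins over a nearby node with a backlog, which is the "less loaded nodes receive more queries" effect.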
  45. Searching faster by searching less: Optimization applies to all shards in the query phase; Skip non-competitive hits for top-k retrieval; Trading accuracy of hit counts for speed; More info: Blog post
  46. Example: elasticsearch OR kibana OR logstash. At some point there is a minimal score required to be in the top-k documents; If one of the search terms has a lower score than that minimal score it can be skipped; Query could be changed to elasticsearch OR kibana for finding matches, thus skipping all documents containing only logstash; Result: Major speed up
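The example on slide 46 can be sketched as a MaxScore-style partition of the query terms. This is a simplification and not Lucene's implementation; the per-term maximum scores are made-up numbers and `essential_terms` is a hypothetical helper.

```python
from typing import Dict, List


def essential_terms(max_scores: Dict[str, float], min_competitive: float) -> List[str]:
    """Return the terms that must still be used to find candidate matches.

    Terms are visited from lowest to highest maximum score. As long as the
    summed maximum scores stay below the minimal competitive score, a document
    matching only those terms can never enter the top-k, so those terms are
    not needed for finding matches.
    """
    essential = []
    running = 0.0
    for term, max_score in sorted(max_scores.items(), key=lambda kv: kv[1]):
        running += max_score
        if running >= min_competitive:
            essential.append(term)
    return essential


# Assumed upper bounds on the score each term can contribute.
max_scores = {"elasticsearch": 3.0, "kibana": 2.5, "logstash": 1.2}
```

Once the minimal competitive score has risen to 2.0, `logstash` alone can no longer produce a competitive hit, so only `kibana` and `elasticsearch` drive the matching. Skipped terms can still contribute to the score of documents that match an essential term.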
  47. Lucene Nightly Benchmarks

  48. Many more optimizations: Skip lists (search delta encoded postings list); Two phase iterations (approximation & verification); Integer compression; Data structures like BKD tree for numbers, FSTs for completion; Index sorting
  49. Aggregations: Aggregations run on top of a result set of a query; Slice, dice and combine data to get insights; Show me the total sales value by quarter for each sales person; The average response time per URL endpoint per day; The number of products within each category; The biggest order of each month in the last year
  50. Distributed Aggregations: Some calculations require your data to be central for a certain use-case; Unique values in my dataset - how does this work across shards without sending the whole data set to a single node? Solution: Be less accurate, sometimes be probabilistic!
  51. terms Aggregation Count all the categories from returned products

  52. terms Aggregation

  53. Counts

  54. Counts

  55. Counts: Count more than top-n buckets: size * 1.5 + 10; Does not eliminate the problem, only reduces it; Provide doc_count_error_upper_bound; Possible solution: Add more roundtrips?
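The counting problem from the "Counts" slides can be reproduced with two tiny shards. The sketch below assumes that the error bound is the sum of the last returned bucket count of every shard that had to cut its list; the data and function names are invented for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def default_shard_size(size: int) -> int:
    """Ask each shard for more buckets than requested: size * 1.5 + 10."""
    return int(size * 1.5 + 10)


def shard_top_terms(values: List[str], shard_size: int) -> Tuple[Dict[str, int], int]:
    counts = Counter(values).most_common()
    top = counts[:shard_size]
    # A term this shard left out can have at most the count of the last returned bucket.
    error = top[-1][1] if len(counts) > shard_size else 0
    return dict(top), error


def terms_aggregation(shards: List[List[str]], size: int,
                      shard_size: Optional[int] = None):
    if shard_size is None:
        shard_size = default_shard_size(size)
    merged: Counter = Counter()
    error_upper_bound = 0
    for values in shards:
        top, error = shard_top_terms(values, shard_size)
        merged.update(top)
        error_upper_bound += error
    return merged.most_common(size), error_upper_bound


shards = [
    ["a"] * 6 + ["b"] * 5 + ["c"] * 2,
    ["c"] * 5 + ["b"] * 4 + ["a"] * 1,
]
```

With `shard_size=1`, each shard reports only its local winner, so the merged result names `a` with a count of 6 even though `b` occurs 9 times overall; the error bound of 11 signals that the result cannot be trusted. With the larger default shard size every bucket is returned and the bound drops to 0.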
  56. Probabilistic Data Structures: Membership check: Bloom, Cuckoo filters; Frequencies in an event stream: Count-min sketch; Cardinality: LogLog algorithms; Quantiles: HDR, T-Digest
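As one example of the structures listed on slide 56, a Bloom filter answers membership checks with "definitely not" or "probably yes". This is a minimal sketch; the bit count, the number of hashes and the seeded SHA-256 hashing are arbitrary choices for illustration.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, item: str) -> Iterator[int]:
        # Derive several bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))
```

An added item is always reported as possibly present, so there are no false negatives. False positives become more likely as more bits are set.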
  57. How many distinct elements are across my whole dataset? cardinality

  58. cardinality Result: 40-65?!

  59. cardinality Aggregation: Solution: HyperLogLog++; Mergeable data structure; Approximate; Trades memory for accuracy; Fixed memory usage based on configured precision_threshold
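Why a mergeable sketch solves the cross-shard cardinality problem can be shown with a plain HyperLogLog. Note the assumptions: this is the basic algorithm, not the HyperLogLog++ variant named on the slide, and the precision parameter `p` is a stand-in for illustration rather than `precision_threshold`.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p: int = 12) -> None:
        self.p = p                    # number of index bits
        self.m = 1 << p               # number of registers, fixed memory
        self.registers = [0] * self.m

    def add(self, value: str) -> None:
        h = int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")
        index = h >> (64 - self.p)                 # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1   # position of the first 1 bit
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other: "HyperLogLog") -> "HyperLogLog":
        """Merging is a register-wise maximum, so shards only ship registers."""
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self) -> int:
        alpha = 0.7213 / (1 + 1.079 / self.m)      # bias correction for m >= 128
        estimate = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            estimate = self.m * math.log(self.m / zeros)   # small-range correction
        return round(estimate)


def sketch(values) -> HyperLogLog:
    hll = HyperLogLog()
    for value in values:
        hll.add(value)
    return hll


shard_a = [f"user-{i}" for i in range(0, 6000)]
shard_b = [f"user-{i}" for i in range(4000, 10000)]   # overlaps with shard_a
merged = sketch(shard_a).merge(sketch(shard_b))
```

The merged sketch has exactly the same registers as a sketch built over the union of both shards, so overlapping values are not counted twice and no shard has to ship its raw values. The estimate itself remains approximate.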
  60. percentiles Aggregation: Naive implementation (sorted array) is not mergeable across shards and scales with the number of documents in a shard; T-Digest utilizes a clustering approach that reduces memory usage by falling back to approximation at a certain size
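The claim that the naive approach is not mergeable can be checked with a small experiment. The shard contents below are invented, and the example only shows why per-shard percentiles cannot simply be combined; it does not implement T-Digest.

```python
import statistics
from typing import List


def combine_shard_medians(shards: List[List[float]]) -> float:
    """Tempting shortcut: compute the median per shard, then average the results."""
    return statistics.mean([statistics.median(shard) for shard in shards])


def exact_median(shards: List[List[float]]) -> float:
    """Correct, but needs every value from every shard on one node."""
    return statistics.median([value for shard in shards for value in shard])


shard_a = [1, 2, 3]
shard_b = [4, 5, 100, 200, 300]
```

Here the shortcut yields 51 while the exact median over all eight values is 4.5. Getting the exact answer requires shipping all values, which is what an approximate, mergeable structure avoids.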
  61. Summary Tradeoffs Behaviour Algorithms Data Structures Every distributed system is

  62. Summary - ease distributed systems usage: For developers: Elasticsearch Clients check for nodes in the background; For operations: ECK, terraform provider, ecctl
  63. More... SQL: Distributed transactions; SQL: Distributed joins; Collaborative Editing; CRDTs; Failure detection (Phi/Adaptive accrual failure detector); Data partitioning (ring based); Recovery after errors (read repair); Secondary indexes; Leader/Follower approaches
  64. More... Cluster Membership: Gossip (Scuttlebutt, SWIM); Spanner (CockroachDB); Calvin (Fauna); Java: ScaleCube, swim-java; Java: Ratis, JRaft; Java: Atomix; Go: Serf
  65. Consistency models, from jepsen.io

  66. Academia

  67. Books, books, books

  68. Books, books, books

  69. Thanks for listening!