
Elasticsearch: Distributed Search Under the Hood

We're searching across vast amounts of data every day. None of that data is stored in a single system - due to its sheer size and to prevent data loss on node outages. But what does a distributed search look like? How are aggregations computed over data that is spread across dozens of nodes? How does coordination work when the cluster is in a healthy state? What happens if things go sideways, and how does recovery within a cluster work? Using Elasticsearch as an example, this talk shows a fair number of examples of distributed communication, computation, storage and statefulness in order to give you a glimpse of what is done to keep data safe and available, no matter whether it runs on bare metal, on k8s or on virtualized instances.

Alexander Reelsen

September 30, 2021

Transcript

  1. The need for distributed systems: exceeding single-system limits (CPU, memory, storage); load sharing; parallelization (shorter response times); reliability (avoiding a SPOF); price.
  2. Fallacies of Distributed Computing: the network is reliable; latency is zero; bandwidth is infinite; the network is secure; topology doesn't change; there is one administrator; transport cost is zero; the network is homogeneous.
  3. Elasticsearch: Speed. Scale. Relevance. HTTP-based JSON interface; scales to many nodes; fast responses; ranked results (BM25, recency, popularity); resiliency; flexibility (index time vs. query time); based on Apache Lucene.
  4. Use cases: e-commerce, e-procurement, patents, dating; maps: geo-based search; observability: logs, metrics, APM, uptime; enterprise search: site & app search, workplace search; security: SIEM, endpoint security.
  5. Master node: pings other nodes; decides data placement; removes nodes from the cluster; not needed for reading/writing; updates the cluster state and distributes it to all nodes; re-election on failure.
  6. Cluster state: nodes; data (shards on nodes); mapping (DDL); updated based on events (node join/leave, index creation, mapping update); sent as a diff due to its size.
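To illustrate the diff-based updates from slide 6, here is a minimal Python sketch: only the changed and removed entries of a (flat) state dictionary are shipped, not the whole state. The dictionary layout and helper names are invented for illustration; the real cluster state is far richer.

      # Toy illustration only: ship just the changed keys of a flat "cluster state".
      def diff_state(old, new):
          """Return keys that were added/changed and keys that were removed."""
          changed = {k: v for k, v in new.items() if old.get(k) != v}
          removed = [k for k in old if k not in new]
          return {"changed": changed, "removed": removed}

      def apply_diff(state, diff):
          state = dict(state)
          state.update(diff["changed"])
          for key in diff["removed"]:
              state.pop(key, None)
          return state

      old = {"nodes": ["node-1", "node-2"], "indices": {"orders": 3}}
      new = {"nodes": ["node-1", "node-2", "node-3"], "indices": {"orders": 3}}
      diff = diff_state(old, new)          # only the "nodes" entry is shipped
      assert apply_diff(old, diff) == new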
  7. Data distribution: shard: unit of work, a self-contained inverted index; index: a logical grouping of shards; primary shard: partitioning of data in an index (write scalability); replica shard: copy of a primary shard (read scalability).
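A sketch of how a document could be routed to one of the primary shards from slide 7. Elasticsearch hashes the routing value (by default the document id) with murmur3 and takes it modulo the number of primary shards; Python's built-in hash() below is only a stand-in and is not stable across processes.

      NUMBER_OF_PRIMARY_SHARDS = 3  # fixed at index creation time

      def route_to_shard(doc_id, num_shards=NUMBER_OF_PRIMARY_SHARDS):
          # Stand-in for murmur3(routing) % number_of_primary_shards
          return hash(doc_id) % num_shards

      for doc_id in ("order-1", "order-2", "order-3"):
          print(doc_id, "->", "shard", route_to_shard(doc_id))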
  8. Distributed search: a two-phase approach. Query phase: query all shards and collect the top-k hits per shard; sort all shard results on the coordinating node; build the real top-k result from all shard results. Fetch phase: fetch the documents for the real top-k results only (top-k instead of shard_count * top-k).
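A minimal sketch of the query-then-fetch merge from slide 8: each shard returns only (score, doc_id) pairs for its local top-k, the coordinating node merges them into the global top-k, and only those documents are fetched. The shard result layout is invented for illustration.

      import heapq

      # Per-shard query phase results: (score, doc_id) for the local top-k only.
      shard_hits = {
          "shard-0": [(3.2, "a"), (2.9, "b"), (1.1, "c")],
          "shard-1": [(4.5, "d"), (2.0, "e"), (0.7, "f")],
          "shard-2": [(3.0, "g"), (2.8, "h"), (2.5, "i")],
      }

      def merge_top_k(per_shard, k):
          # Coordinating node: sort all shard hits, keep the global top-k.
          all_hits = [hit for hits in per_shard.values() for hit in hits]
          return heapq.nlargest(k, all_hits)

      top = merge_top_k(shard_hits, k=3)
      # Fetch phase: only the documents in the global top-k are retrieved,
      # i.e. k documents instead of shard_count * k.
      ids_to_fetch = [doc_id for _score, doc_id in top]
      print(top)            # [(4.5, 'd'), (3.2, 'a'), (3.0, 'g')]
      print(ids_to_fetch)   # ['d', 'a', 'g']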
  9. Adaptive Replica Selection: which shard copies are the best to select? Each node keeps information on the response time of prior requests, previous search durations, and the search threadpool queue size; less loaded nodes will receive more queries. More info: blog post, C3 paper.
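A simplified ranking sketch for slide 9: score each eligible copy by the three signals mentioned on the slide and prefer the lowest score. The weighting below is made up and much simpler than the C3-based formula Elasticsearch actually uses.

      # Hypothetical, heavily simplified ranking: lower score = preferable copy.
      node_stats = {
          "node-1": {"response_ms": 12.0, "search_ms": 35.0, "queue": 2},
          "node-2": {"response_ms": 30.0, "search_ms": 80.0, "queue": 15},
          "node-3": {"response_ms": 15.0, "search_ms": 40.0, "queue": 4},
      }

      def rank(stats):
          # Naive weighting of response time, search duration and queue size.
          return stats["response_ms"] + stats["search_ms"] + 5.0 * stats["queue"]

      best = min(node_stats, key=lambda node: rank(node_stats[node]))
      print(best)  # node-1: fastest and least loaded, so it receives more queries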
  10. Searching faster by searching less: the optimization applies to all shards in the query phase; skip non-competitive hits during top-k retrieval; trading accuracy of hit counts for speed. More info: blog post.
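The core idea behind "non-competitive hits" (slide 10) as a sketch: keep a bounded min-heap of the current top-k; once it is full, its smallest score is the threshold a new hit must beat, and anything below it can be skipped - which is exactly where accurate total hit counts are given up.

      import heapq

      def top_k_with_threshold(scored_docs, k):
          """Keep only the k best hits; heap[0] holds the minimal competitive score."""
          heap = []  # min-heap of (score, doc_id)
          for score, doc_id in scored_docs:
              if len(heap) < k:
                  heapq.heappush(heap, (score, doc_id))
              elif score > heap[0][0]:
                  heapq.heapreplace(heap, (score, doc_id))
              # else: non-competitive, skipped without being counted
          return sorted(heap, reverse=True)

      hits = [(1.2, "a"), (3.4, "b"), (0.5, "c"), (2.8, "d"), (0.9, "e")]
      print(top_k_with_threshold(hits, k=2))  # [(3.4, 'b'), (2.8, 'd')]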
  11. Example: elasticsearch OR kibana OR logstash. At some point there is a minimal score required to be in the top-k documents; if the maximum possible score of one of the search terms is below that minimal score, the term can be skipped; the query can effectively be rewritten to elasticsearch OR kibana for finding further matches, thus skipping all documents containing only logstash. Result: major speed-up.
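A sketch of the skipping idea from slide 11 on this example query: if a term's maximum possible score contribution is below the current minimal competitive score, documents matching only that term can never enter the top-k. The per-term maxima and scores below are invented; real Lucene relies on block-max metadata and impacts.

      # Hypothetical per-term upper bounds on the score contribution.
      max_score = {"elasticsearch": 4.0, "kibana": 3.5, "logstash": 0.8}

      def competitive_terms(terms, min_competitive_score):
          """Terms whose best possible score can still reach the top-k."""
          return [t for t in terms if max_score[t] >= min_competitive_score]

      query = ["elasticsearch", "kibana", "logstash"]
      # Suppose the current top-k already requires a score of at least 1.0:
      print(competitive_terms(query, min_competitive_score=1.0))
      # ['elasticsearch', 'kibana'] -> documents containing only 'logstash'
      # are skipped; docs that also match another term are still found.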
  12. Many more optimizations: skip lists (to search delta-encoded postings lists); two-phase iterations (approximation & verification); integer compression; data structures like the BKD tree for numbers and FSTs for completion; index sorting.
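A small sketch of two of the optimizations from slide 12: postings lists store deltas between ascending doc ids (small numbers compress well), and a skip structure lets an iterator jump to a target doc id instead of scanning linearly. The encoding below is deliberately naive, and a plain binary search stands in for Lucene's multi-level skip lists.

      import bisect

      def delta_encode(doc_ids):
          """Store gaps between sorted doc ids instead of absolute values."""
          deltas, previous = [], 0
          for doc_id in doc_ids:
              deltas.append(doc_id - previous)
              previous = doc_id
          return deltas

      def delta_decode(deltas):
          doc_ids, current = [], 0
          for delta in deltas:
              current += delta
              doc_ids.append(current)
          return doc_ids

      def advance(sorted_doc_ids, target):
          """Jump to the first doc id >= target (skip-list stand-in)."""
          i = bisect.bisect_left(sorted_doc_ids, target)
          return sorted_doc_ids[i] if i < len(sorted_doc_ids) else None

      postings = [3, 7, 18, 19, 42, 120]
      encoded = delta_encode(postings)        # [3, 4, 11, 1, 23, 78]
      assert delta_decode(encoded) == postings
      assert advance(postings, 20) == 42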
  13. Aggregations: aggregations run on top of the result set of a query; slice, dice and combine data to get insights. Examples: show me the total sales value by quarter for each sales person; the average response time per URL endpoint per day; the number of products within each category; the biggest order of each month in the last year.
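The first question on slide 13 ("total sales value by quarter for each sales person"), expressed as an Elasticsearch search request body built as a Python dict. The index and field names (sales, sales_person, order_date, value) are hypothetical; the aggregation types (terms, date_histogram, sum) are real.

      import json

      request_body = {
          "size": 0,  # we only care about aggregations, not hits
          "aggs": {
              "per_sales_person": {
                  "terms": {"field": "sales_person"},
                  "aggs": {
                      "per_quarter": {
                          "date_histogram": {
                              "field": "order_date",
                              "calendar_interval": "quarter",
                          },
                          "aggs": {
                              "total_sales": {"sum": {"field": "value"}}
                          },
                      },
                  },
              },
          },
      }

      print(json.dumps(request_body, indent=2))
      # POST /sales/_search with this body returns one bucket per sales person,
      # sub-buckets per quarter, each with the summed sales value.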
  14. Distributed aggregations: some calculations require your data to be in one place for a certain use case; unique values in my dataset - how does this work across shards without sending the whole data set to a single node? Solution: be less accurate, sometimes be probabilistic!
  15. Counts: request more than the top-n buckets from each shard (size * 1.5 + 10); this does not eliminate the problem, it only reduces it; provide doc_count_error_upper_bound; possible solution: add more roundtrips?
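A sketch of the per-shard counting problem from slides 14-15: each shard returns only its local top buckets, the coordinating node sums what it received, and the counts it never saw bound the possible error - which is what doc_count_error_upper_bound reports. The bucket numbers below are made up.

      from collections import Counter

      def shard_size(size):
          # Default number of buckets requested from each shard.
          return int(size * 1.5 + 10)

      # Each shard returns only its local top buckets (term -> doc_count).
      shard_tops = [
          {"shoes": 50, "shirts": 40, "hats": 12},
          {"shirts": 60, "socks": 45, "shoes": 8},
      ]

      merged = Counter()
      for top in shard_tops:
          merged.update(top)

      # A term a shard did not report could have had, at most, a count equal to
      # the smallest count that shard did report - that is the error bound.
      error_bound = sum(min(top.values()) for top in shard_tops)

      print(shard_size(10))            # 25
      print(merged.most_common(3))     # [('shirts', 100), ('shoes', 58), ('socks', 45)]
      print(error_bound)               # 20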
  16. Probabilistic data structures: membership check: Bloom and cuckoo filters; frequencies in an event stream: count-min sketch; cardinality: LogLog algorithms; quantiles: HDR histogram, T-Digest.
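To make one entry of slide 16 concrete, a toy Bloom filter for membership checks: a few hash functions set bits in a fixed-size array; lookups can produce false positives but never false negatives. Sizes and the hashing scheme below are arbitrary.

      import hashlib

      class BloomFilter:
          """Toy Bloom filter: arbitrary sizing, no tuning for error rates."""

          def __init__(self, num_bits=1024, num_hashes=3):
              self.num_bits = num_bits
              self.num_hashes = num_hashes
              self.bits = bytearray(num_bits)

          def _positions(self, item):
              for seed in range(self.num_hashes):
                  digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
                  yield int(digest, 16) % self.num_bits

          def add(self, item):
              for pos in self._positions(item):
                  self.bits[pos] = 1

          def might_contain(self, item):
              return all(self.bits[pos] for pos in self._positions(item))

      bf = BloomFilter()
      bf.add("elasticsearch")
      print(bf.might_contain("elasticsearch"))  # True
      print(bf.might_contain("kibana"))         # False (almost certainly)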
  17. cardinality aggregation: solution: HyperLogLog++, a mergeable data structure; approximate; trades memory for accuracy; fixed memory usage based on the configured precision_threshold.
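A toy version of the idea behind the cardinality aggregation (slide 17): hash every value, use the first bits to pick a register, track the longest run of leading zero bits per register, and combine the registers into an estimate. This omits the bias corrections and sparse encoding that HyperLogLog++ (what Elasticsearch actually uses) adds on top.

      import hashlib

      P = 10                # precision bits -> m = 2**P registers
      M = 2 ** P
      registers = [0] * M

      def hash64(value):
          return int.from_bytes(hashlib.sha256(str(value).encode()).digest()[:8], "big")

      def add(value):
          h = hash64(value)
          idx = h >> (64 - P)                   # first P bits pick the register
          rest = h & ((1 << (64 - P)) - 1)      # remaining bits
          # rank = position of the leftmost 1-bit in the remaining bits
          rank = (64 - P) - rest.bit_length() + 1
          registers[idx] = max(registers[idx], rank)

      def estimate():
          alpha = 0.7213 / (1 + 1.079 / M)      # standard HLL constant for large m
          harmonic = sum(2.0 ** -r for r in registers)
          return alpha * M * M / harmonic

      for i in range(50_000):
          add(f"user-{i}")
      print(round(estimate()))  # roughly 50,000, within a few percent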
  18. percentiles aggregation: a naive implementation (sorted array) is not mergeable across shards and scales with the number of documents in a shard; T-Digest uses a clustering approach that reduces memory usage by falling back to approximation beyond a certain size.
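A heavily simplified sketch of the clustering idea from slide 18, not the real T-Digest: keep a bounded number of (mean, count) centroids, merge the closest pair when the limit is exceeded, and answer quantile queries from the centroids. The real T-Digest keeps centroids near the tails small so extreme percentiles stay accurate; this naive merge rule does not.

      import random

      class CentroidSketch:
          """Simplified centroid sketch; the merge rule is NOT the real T-Digest one."""

          def __init__(self, max_centroids=20):
              self.max_centroids = max_centroids
              self.centroids = []   # list of [mean, count], kept sorted by mean

          def add(self, value):
              self.centroids.append([value, 1])
              self.centroids.sort(key=lambda c: c[0])
              if len(self.centroids) > self.max_centroids:
                  # Merge the two closest centroids (weighted mean, summed count).
                  gaps = [self.centroids[i + 1][0] - self.centroids[i][0]
                          for i in range(len(self.centroids) - 1)]
                  i = gaps.index(min(gaps))
                  (m1, c1), (m2, c2) = self.centroids[i], self.centroids[i + 1]
                  merged = [(m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2]
                  self.centroids[i:i + 2] = [merged]

          def quantile(self, q):
              total = sum(c for _, c in self.centroids)
              target, seen = q * total, 0
              for mean, count in self.centroids:
                  seen += count
                  if seen >= target:
                      return mean
              return self.centroids[-1][0]

      sketch = CentroidSketch()
      for _ in range(10_000):
          sketch.add(random.random())
      print(sketch.quantile(0.5))   # roughly 0.5
      print(sketch.quantile(0.99))  # approximate; this merge rule loses tail accuracy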
  19. Summary - easing distributed systems usage: for developers, the Elasticsearch clients check for nodes in the background; for operations, ECK, the Terraform provider, ecctl.
  20. More... SQL: distributed transactions; SQL: distributed joins; collaborative editing; CRDTs; failure detection (phi/adaptive accrual failure detector); data partitioning (ring based); recovery after errors (read repair); secondary indexes; leader/follower approaches.
  21. More... Cluster membership: gossip (Scuttlebutt, SWIM); Spanner (CockroachDB); Calvin (Fauna); Java: ScaleCube, swim-java; Java: Ratis, JRaft; Java: Atomix; Go: Serf.