Elasticsearch: Distributed Search Under the Hood

We're searching across vast amounts of data every day. None of that data is stored in a single system, both because of its sheer size and to prevent data loss on node outages. But what does a distributed search look like? How is data that is spread across dozens of nodes aggregated and calculated? How does coordination work when a cluster is in a good state? What happens when things go sideways, and how does recovery within a cluster work? Using Elasticsearch as an example, this talk walks through a number of examples of distributed communication, computation, storage and statefulness to give you a glimpse of what is done to keep data safe and available, whether running on bare metal, on k8s or on virtualized instances.

Alexander Reelsen

September 30, 2021

Transcript

  1. Distributed Search...
    ... Under the Hood
    Alexander Reelsen

    [email protected] | @spinscale

  2. Today's goal: Understanding...
    Complexities
    Tradeoffs
    Simplifications
    Error scenarios
    ... distributed systems

  3. Agenda
    Distributed systems - But why?
    Elasticsearch
    Data
    Search
    Analytics

  4. The need for distributed systems
    Exceeding single system limits (CPU, Memory, Storage)
    Load sharing
    Parallelization (shorter response times)
    Reliability (SPOF)
    Price

  5. Load sharing

  6. Load sharing
    Synchronization
    Coordination & Load balancing

  7. Parallelization

  8. Parallelization
    Reduce?
    Sort?

  9. Boundaries increase
    complexity
    Core
    Computer
    LAN
    WAN
    Internet

  12. Boundaries increase
    complexity
    🗣 Communication
    ✍ Coordination
    Error handling

  13. Fallacies of Distributed Computing
    The network is reliable
    Latency is zero
    Bandwidth is infinite
    The network is secure
    Topology doesn't change
    There is one administrator
    Transport cost is zero
    The network is homogeneous

  14. Consensus
    Achieving a common state among participants
    Byzantine Failures
    Trust
    Crash
    Quorum vs. strictness

  15. Consensus goals
    Cluster Membership
    Data writes
    Security (BTC)
    Finding a leader (Paxos, Raft)

  16. Elasticsearch introduction

  17. Elasticsearch
    Speed. Scale. Relevance.
    HTTP based JSON interface
    Scales to many nodes
    Fast responses
    Ranked results (BM25, recency, popularity)
    Resiliency
    Flexibility (index time vs. query time)
    Based on Apache Lucene
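
    As a minimal sketch of the JSON-over-HTTP interface, the example below sends a match query with Python's requests library. The index name "products", the field "title" and the local URL are assumptions for illustration only.

    import requests

    # A match query against a hypothetical "products" index on a local node.
    query = {
        "query": {"match": {"title": "distributed search"}},
        "size": 10,
    }
    response = requests.post("http://localhost:9200/products/_search", json=query)
    for hit in response.json()["hits"]["hits"]:
        print(hit["_score"], hit["_source"])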

  18. Use Cases
    E-Commerce, E-Procurement, Patents, Dating
    Maps: Geo based search
    Observability: Logs, Metrics, APM, Uptime
    Enterprise Search: Site & App Search, Workplace Search
    Security: SIEM, Endpoint Security

  19. Master node
    Pings other nodes
    Decides data placement
    Removes nodes from the cluster
    Not needed for reading/writing
    Updates the cluster state and distributes to all nodes
    Re-election on failure

  20. Node startup

  21. Elects itself

  22. Master node is not reachable

    Re-election among the remaining nodes

  24. Cluster State
    Nodes
    Data (Shards on nodes)
    Mapping (DDL)
    Updated based on events (node join/leave, index creation, mapping update)
    Sent as diff due to size
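
    To make the cluster state tangible, here is a small sketch that fetches a filtered cluster state over the REST API and prints the elected master, the known nodes and the indices; the URL is a local placeholder.

    import requests

    # Fetch only selected parts of the cluster state: nodes, metadata
    # (index settings/mappings) and the routing table (shard placement).
    state = requests.get(
        "http://localhost:9200/_cluster/state/nodes,metadata,routing_table"
    ).json()

    print("elected master:", state["master_node"])
    print("nodes:", [node["name"] for node in state["nodes"].values()])
    print("indices:", list(state["metadata"]["indices"]))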

  25. Data distribution
    Shard: Unit of work, self contained inverted index
    Index: A logical grouping of shards
    Primary shard: Partitioning of data in an index (write scalability)
    Replica shard: Copy of a primary shard (read scalability)
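
    A short sketch of how this looks via the REST API: create an index with three primary shards and one replica per primary, then list where the six shard copies ended up. Index name and URL are placeholders.

    import requests

    # Three primaries (write scalability), one replica each (read scalability
    # and redundancy) -> six shard copies spread across the cluster.
    requests.put(
        "http://localhost:9200/orders",
        json={"settings": {"number_of_shards": 3, "number_of_replicas": 1}},
    )

    # _cat/shards shows every primary (p) and replica (r) and its node.
    print(requests.get("http://localhost:9200/_cat/shards/orders?v").text)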

  26. Primary shards

  27. Primary shards per index

  28. Redistribution

  29. Redistribution

  30. Distributed search
    Two phase approach
    Query all shards, collect top-k hits
    Sort all search results on the coordinating node
    Create the real top-k result from all results
    Fetch data only for the real top-k results (top-k instead of shard_count * top-k); see the sketch below
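
    A toy sketch of the two phases with made-up shard data: every shard contributes its local top-k of (score, doc id) pairs, the coordinating node merges them into the global top-k, and only those documents are fetched.

    import heapq

    def query_phase(shard_results, k):
        # shard_results: one [(score, doc_id), ...] top-k list per shard
        all_hits = [hit for shard in shard_results for hit in shard]
        return heapq.nlargest(k, all_hits)   # global top-k, not shard_count * k

    def fetch_phase(top_hits, fetch_doc):
        # fetch documents only for the hits that survived the merge
        return [(score, fetch_doc(doc_id)) for score, doc_id in top_hits]

    shards = [
        [(3.2, "a1"), (2.9, "a7")],   # shard 0, local top-2
        [(4.1, "b3"), (1.0, "b9")],   # shard 1, local top-2
        [(2.5, "c2"), (2.4, "c5")],   # shard 2, local top-2
    ]
    top = query_phase(shards, k=2)
    print(top)                                            # [(4.1, 'b3'), (3.2, 'a1')]
    print(fetch_phase(top, lambda doc_id: {"_id": doc_id}))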

  31. Adaptive Replica Selection
    Which shards are the best to select?
    Each node keeps information about
    Response time of prior requests
    Previous search durations
    Search threadpool size
    Less loaded nodes will receive more queries
    More info: Blog post, C3 paper
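
    The ranking below is only an illustrative stand-in for the idea, not the exact C3 formula Elasticsearch implements: score every shard copy from the coordinating node's view of its queue size, service time and response time, and route the query to the lowest-ranked copy.

    # All numbers are made up; a lower rank means a more attractive copy.
    def rank(copy):
        return (copy["queue_size"] + 1) * copy["service_time_ms"] + copy["response_time_ms"]

    copies = [
        {"node": "node-1", "queue_size": 8, "service_time_ms": 12.0, "response_time_ms": 30.0},
        {"node": "node-2", "queue_size": 1, "service_time_ms": 15.0, "response_time_ms": 22.0},
        {"node": "node-3", "queue_size": 4, "service_time_ms": 9.0,  "response_time_ms": 28.0},
    ]
    print("route query to", min(copies, key=rank)["node"])   # node-2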

  32. Searching faster by searching less
    Optimization applies to all shards in the query phase
    Skip non competitive hits for top-k retrieval
    Trading in accuracy of hit counts for speed
    More info: Blog post

  33. Example: elasticsearch OR kibana OR logstash
    At some point there is a minimal score required to be in the top-k documents
    If one of the search terms has a lower maximum score than that minimal score, it can be skipped
    The query can be changed to elasticsearch OR kibana for finding matches, thus skipping all documents containing only logstash
    Result: Major speed up (see the toy illustration below)

  34. Lucene Nightly Benchmarks

  35. Many more optimizations
    Skip lists (search delta encoded postings list)
    Two phase iterations (approximation & verification)
    Integer compression
    Data structures like BKD tree for numbers, FSTs for completion
    Index sorting
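
    As a taste of one of these optimizations, a tiny sketch of delta encoding a postings list: storing the gaps between ascending document ids keeps the numbers small, which is what makes the subsequent integer compression effective.

    def delta_encode(doc_ids):
        # [3, 57, 58, 190, 4000] -> [3, 54, 1, 132, 3810]
        return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

    def delta_decode(gaps):
        doc_ids = [gaps[0]]
        for gap in gaps[1:]:
            doc_ids.append(doc_ids[-1] + gap)
        return doc_ids

    postings = [3, 57, 58, 190, 4000]
    gaps = delta_encode(postings)
    assert delta_decode(gaps) == postings
    print(gaps)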

  36. Aggregations
    Aggregations run on top of a result set of a query
    Slice, dice and combine data to get insights
    Show me the total sales value by quarter for each sales person
    The average response time per URL endpoint per day
    The number of products within each category
    The biggest order of each month in the last year
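
    As a sketch, the first question above maps onto a terms aggregation with a nested date_histogram and sum; the index and field names are assumptions about the mapping.

    import requests

    # Total sales value by quarter for each sales person.
    body = {
        "size": 0,   # only the aggregation result is needed, no hits
        "aggs": {
            "per_sales_person": {
                "terms": {"field": "sales_person"},
                "aggs": {
                    "per_quarter": {
                        "date_histogram": {"field": "order_date",
                                           "calendar_interval": "quarter"},
                        "aggs": {"total_sales": {"sum": {"field": "sales_value"}}},
                    }
                },
            }
        },
    }
    response = requests.post("http://localhost:9200/orders/_search", json=body).json()
    for person in response["aggregations"]["per_sales_person"]["buckets"]:
        for quarter in person["per_quarter"]["buckets"]:
            print(person["key"], quarter["key_as_string"], quarter["total_sales"]["value"])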

  37. Distributed Aggregations
    Some calculations require all the data to be in one place for a certain use case
    Unique values in my dataset - how does this work across shards without sending the whole data set to a single node?
    Solution: Be less accurate, sometimes be probabilistic!

  38. terms Aggregation
    Count all the categories from returned products

  39. terms Aggregation

  40. Counts
    Each shard returns more than the top-n buckets: size * 1.5 + 10
    This does not eliminate the problem, it only reduces it (see the toy example below)
    Provides doc_count_error_upper_bound in the response
    Possible solution: add more roundtrips?
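
    A toy example with made-up per-shard counts shows where the error comes from: each shard only returns its local top buckets, so a term that never makes a shard's local top list ("books" below) is undercounted after the merge.

    from collections import Counter

    shard_top2 = [
        {"phones": 50, "laptops": 40},   # shard 0 keeps books (35) to itself
        {"phones": 45, "books": 44},     # shard 1 keeps laptops (30) to itself
        {"laptops": 60, "phones": 20},   # shard 2 keeps books (18) to itself
    ]
    merged = Counter()
    for local_top in shard_top2:
        merged.update(local_top)
    # books shows 44 although the true total is 97; laptops 100 instead of 130
    print(merged.most_common())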

  41. Probabilistic Data Structures
    Membership check: Bloom, Cuckoo filters
    Frequencies in an event stream: Count-min sketch
    Cardinality: LogLog algorithms
    Quantiles: HDR, T-Digest
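
    As one concrete example from this list, a minimal Bloom filter sketch for membership checks: it may return false positives but never false negatives, and its size does not grow with the number of items. Bit count and hash construction are arbitrary choices for illustration.

    import hashlib

    class BloomFilter:
        def __init__(self, num_bits=1024, num_hashes=3):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = 0   # bitset stored in a single integer

        def _positions(self, item):
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
                yield int(digest, 16) % self.num_bits

        def add(self, item):
            for pos in self._positions(item):
                self.bits |= 1 << pos

        def might_contain(self, item):
            return all(self.bits & (1 << pos) for pos in self._positions(item))

    bloom = BloomFilter()
    bloom.add("elasticsearch")
    print(bloom.might_contain("elasticsearch"))   # True
    print(bloom.might_contain("kibana"))          # almost certainly False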

  42. How many distinct elements are there across my whole dataset?
    cardinality Aggregation

  43. cardinality
    Result: 40-65?!

  44. cardinality Aggregation
    Solution: HyperLogLog++
    Mergeable data structure
    Approximate
    Trades memory for accuracy
    Fixed memory usage based on configured precision_threshold
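
    In request form this is the cardinality aggregation; the sketch below uses placeholder index and field names, and precision_threshold is the knob that trades memory for accuracy.

    import requests

    body = {
        "size": 0,
        "aggs": {
            "distinct_categories": {
                "cardinality": {"field": "category", "precision_threshold": 3000}
            }
        },
    }
    response = requests.post("http://localhost:9200/products/_search", json=body).json()
    print(response["aggregations"]["distinct_categories"]["value"])   # approximate count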

  45. percentiles Aggregation
    A naive implementation (sorted array) is not mergeable across shards and scales with the number of documents in a shard
    T-Digest uses a clustering approach that reduces memory usage by falling back to approximation at a certain size
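
    A corresponding request sketch with placeholder index and field names; the percentiles aggregation is T-Digest based by default and returns approximate values for the requested percentiles.

    import requests

    body = {
        "size": 0,
        "aggs": {
            "latency_percentiles": {
                "percentiles": {"field": "response_time_ms", "percents": [50, 95, 99]}
            }
        },
    }
    response = requests.post("http://localhost:9200/logs/_search", json=body).json()
    print(response["aggregations"]["latency_percentiles"]["values"])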

  46. Summary
    Tradeoffs
    Behaviour
    Algorithms
    Data Structures
    Every distributed system is
    different

  47. Summary - making distributed systems easier to use
    For developers
    Elasticsearch Clients check for nodes in the background
    For operations
    ECK, terraform provider, ecctl

  48. More...
    SQL: Distributed transactions
    SQL: Distributed joins
    Collaborative Editing
    CRDTs
    Failure detection (Phi/Adaptive accrual failure detector)
    Data partitioning (ring based)
    Recovery after errors (read repair)
    Secondary indexes
    Leader/Follower approaches

  49. More...
    Cluster Membership: Gossip (Scuttlebutt, SWIM)
    Spanner (CockroachDB)
    Calvin (Fauna)
    Java: ScaleCube, swim-java
    Java: Ratis, JRaft
    Java: Atomix
    Go: Serf

  50. Consistency models, from jepsen.io

  51. Books, books, books

  52. Books, books, books

  53. Thanks for listening!
