Elasticsearch: Distributed Search Under the Hood

We're searching across vast amounts of data every day. None of that data is stored in a single system - due to its sheer size and to prevent data loss on node outages. But what does a distributed search look like? How is data that is spread across dozens of nodes aggregated and calculated? How does coordination work when the cluster is in a good state? What happens if things go sideways, and how does recovery within a cluster work? Using Elasticsearch as an example, this talk walks through a number of examples of distributed communication, computation, storage and statefulness, to give you a glimpse of what is done to keep data safe and available, no matter if running on bare metal, on k8s or on virtualized instances.

Alexander Reelsen

September 30, 2021

Transcript

  1. Distributed Search...
    ... Under the Hood
    Alexander Reelsen

    [email protected] | @spinscale

  2. Today's goal: Understanding...
    Complexities
    Tradeoffs
    Simplifications
    Error scenarios
    ... distributed systems

  3. Agenda
    Distributed systems - But why?
    Elasticsearch
    Data
    Search
    Analytics

  4. The need for distributed systems
    Exceeding single system limits (CPU, Memory, Storage)
    Load sharing
    Parallelization (shorter response times)
    Reliability (SPOF)
    Price

  5. Load sharing

  6. Load sharing
    Synchronization
    Coordination & Load balancing

  7. Reliability

  8. Parallelization

  9. Parallelization
    Reduce?
    Sort?

  10. Boundaries increase complexity
    Core
    Computer
    LAN
    WAN
    Internet

  13. Boundaries increase complexity
    🗣 Communication

    ✍ Coordination

    Error handling

  14. Fallacies of Distributed Computing
    The network is reliable
    Latency is zero
    Bandwidth is infinite
    The network is secure
    Topology doesn't change
    There is one administrator
    Transport cost is zero
    The network is homogeneous

  16. Consensus
    Achieving a common state among participants
    Byzantine Failures
    Trust
    Crash
    Quorum vs. strictness

  17. Consensus goals
    Cluster Membership
    Data writes
    Security (BTC)
    Finding a leader (Paxos, Raft)

  18. Elasticsearch introduction

  19. Elasticsearch
    Speed. Scale. Relevance.
    HTTP based JSON interface
    Scales to many nodes
    Fast responses
    Ranked results (BM25, recency, popularity)
    Resiliency
    Flexibility (index time vs. query time)
    Based on Apache Lucene
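
    As a minimal sketch of that HTTP/JSON interface (index, document and field names are made up for illustration):

    # index a document, then search for it
    PUT /products/_doc/1
    { "name": "Wireless Keyboard", "category": "peripherals", "price": 39.99 }

    GET /products/_search
    {
      "query": {
        "match": { "name": "keyboard" }
      }
    }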

  20. Use Cases
    E-Commerce, E-Procurement, Patents, Dating
    Maps: Geo based search
    Observability: Logs, Metrics, APM, Uptime
    Enterprise Search: Site & App Search, Workplace Search
    Security: SIEM, Endpoint Security

  21. Master

  22. Master node
    Pings other nodes
    Decides data placement
    Removes nodes from the cluster
    Not needed for reading/writing
    Updates the cluster state and distributes to all nodes
    Re-election on failure
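
    A quick way to see which node is the elected master and how the cluster is doing, as a sketch using the cat and cluster health APIs (output columns vary by version):

    # which node is currently elected master
    GET _cat/master?v
    # all nodes; an asterisk in the master column marks the elected master
    GET _cat/nodes?v&h=name,master,node.role
    # overall cluster state: green, yellow or red
    GET _cluster/health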

  23. Node startup

  24. Elects itself

  25. Node join

  27. Master node is not reachable

  28. Reelection within remaining nodes

  29. Cluster State
    Nodes
    Data (Shards on nodes)
    Mapping (DDL)
    Updated based on events (node join/leave, index creation, mapping update)
    Sent as diff due to size
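
    The cluster state can be inspected via the API; since the full state is large, it is usually narrowed down (my-index is a placeholder name):

    # only node information and the routing table for one index
    GET _cluster/state/nodes,routing_table/my-index
    # or trim the response with filter_path
    GET _cluster/state?filter_path=cluster_uuid,master_node,nodes.*.name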

  30. Data distribution
    Shard: Unit of work, self-contained inverted index
    Index: A logical grouping of shards
    Primary shard: Partitioning of data in an index (write scalability)
    Replica shard: Copy of a primary shard (read scalability)
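
    A sketch of how this is configured when creating an index (index name and shard counts are arbitrary):

    # 3 primary shards partition the data, 1 replica adds a copy of each primary
    PUT /orders
    {
      "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 1
      }
    }

    # see on which nodes the shards were placed
    GET _cat/shards/orders?v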

  31. Primary shards

  32. Primary shards per index

  33. Redistribution

  35. Replicas

  37. Distributed search
    Two phase approach
    Query all shards, collect top-k hits
    Sort all search results on coordinating node
    Create real top-k result from all results
    Fetch data only for the real top-k results (top-k instead of shard_count * top-k)
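
    A plain search request is enough to trigger this query-then-fetch behaviour; with size 10, every shard returns its local top 10 in the query phase, and only the merged global top 10 is fetched (index and field names are illustrative):

    GET /orders/_search
    {
      "size": 10,
      "query": { "match": { "status": "shipped" } },
      "sort": [ { "order_date": "desc" } ]
    }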

  44. Adaptive Replica Selection
    Which shards are the best to select?
    Each node keeps information about
    Response time of prior requests
    Previous search durations
    Search threadpool size
    Less loaded nodes will receive more queries
    More info: Blog post, C3 paper
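
    Adaptive replica selection is enabled by default in current versions; as a sketch, it can be toggled with a dynamic cluster setting:

    PUT _cluster/settings
    {
      "persistent": {
        "cluster.routing.use_adaptive_replica_selection": false
      }
    }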

  45. Searching faster by searching less
    Optimization applies to all shards in the query phase
    Skip non competitive hits for top-k retrieval
    Trading hit count accuracy for speed
    More info: Blog post
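
    The visible trade-off is the total hit count: once non-competitive hits are skipped, the count becomes a lower bound. The track_total_hits parameter controls how far counting stays exact (the index and field names are illustrative):

    # count exactly up to 1000 hits, beyond that the total is reported as "gte"; use true for exact counts
    GET /orders/_search
    {
      "track_total_hits": 1000,
      "query": { "match": { "status": "shipped" } }
    }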

  46. Example: elasticsearch OR kibana OR logstash
    At some point there is a minimal score required to be in the top-k documents
    If one of the search terms has a lower score than that minimal score, it can be skipped
    Query could be changed to elasticsearch OR kibana for finding matches, thus skipping all documents containing only logstash
    Result: Major speed up
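
    In query DSL terms such a query could look like the sketch below (index and field are made up); a match query defaults to OR between terms, and Lucene can then skip documents whose best possible score cannot reach the current top-k threshold:

    GET /posts/_search
    {
      "query": {
        "match": { "body": "elasticsearch kibana logstash" }
      }
    }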

  47. Lucene Nightly Benchmarks

  48. Many more optimizations
    Skip lists (search delta encoded postings list)
    Two phase iterations (approximation & verification)
    Integer compression
    Data structures like BKD tree for numbers, FSTs for completion
    Index sorting
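
    Index sorting, for example, has to be set up at index creation time; a sketch with an assumed timestamp field:

    PUT /events
    {
      "settings": {
        "index.sort.field": "timestamp",
        "index.sort.order": "desc"
      },
      "mappings": {
        "properties": {
          "timestamp": { "type": "date" }
        }
      }
    }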

  49. Aggregations
    Aggregations run on top of a result set of a query
    Slice, dice and combine data to get insights
    Show me the total sales value by quarter for each sales person
    The average response time per URL endpoint per day
    The number of products within each category
    The biggest order of each month in the last year
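
    The first example could be expressed as nested aggregations: a terms aggregation per sales person, a date histogram per quarter and a sum below that (all field names are assumptions):

    GET /sales/_search
    {
      "size": 0,
      "aggs": {
        "per_sales_person": {
          "terms": { "field": "sales_person" },
          "aggs": {
            "per_quarter": {
              "date_histogram": { "field": "order_date", "calendar_interval": "quarter" },
              "aggs": {
                "total_value": { "sum": { "field": "price" } }
              }
            }
          }
        }
      }
    }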

  50. Distributed Aggregations
    Some calculations require all of your data in one place for a certain use case
    Unique values in my dataset - how does this work across shards without sending the whole data set to a single node?
    Solution: Be less accurate, sometimes be probabilistic!

  51. terms Aggregation
    Count all the categories from returned products
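
    As a sketch, assuming a keyword field called category (names are illustrative); the response contains one bucket per category with its doc_count:

    GET /products/_search
    {
      "size": 0,
      "aggs": {
        "categories": {
          "terms": { "field": "category" }
        }
      }
    }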

  52. terms Aggregation

  53. Counts

  55. Counts
    Each shard counts more than the top-n buckets: shard_size defaults to size * 1.5 + 10
    Does not eliminate the problem, only reduces it
    Provides doc_count_error_upper_bound
    Possible solution: Add more roundtrips?
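
    The same terms aggregation, extended with an explicit shard_size and the per-bucket error bound (the values just mirror the default formula for size 10):

    GET /products/_search
    {
      "size": 0,
      "aggs": {
        "categories": {
          "terms": {
            "field": "category",
            "size": 10,
            "shard_size": 25,
            "show_term_doc_count_error": true
          }
        }
      }
    }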

  56. Probabilistic Data Structures
    Membership check: Bloom, Cuckoo filters
    Frequencies in an event stream: Count-min sketch
    Cardinality: LogLog algorithms
    Quantiles: HDR, T-Digest

  57. How many distinct elements are there across my whole dataset?
    cardinality Aggregation

  58. cardinality
    Result: 40-65?!

  59. cardinality Aggregation
    Solution: HyperLogLog++
    Mergeable data structure
    Approximate
    Trades memory for accuracy
    Fixed memory usage based on configured precision_threshold
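
    A sketch of the aggregation, reusing the illustrative category field; raising precision_threshold trades more memory for more accuracy, up to a fixed limit:

    GET /products/_search
    {
      "size": 0,
      "aggs": {
        "distinct_categories": {
          "cardinality": {
            "field": "category",
            "precision_threshold": 3000
          }
        }
      }
    }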

  60. percentiles Aggregation
    Naive implementation (sorted array) is not mergeable across shards and scales with the number of documents in a shard
    T-Digest uses a clustering approach that reduces memory usage by falling back to approximation at a certain size
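
    A sketch using an assumed response time field; a higher compression keeps more centroids, i.e. spends more memory for more accuracy:

    GET /logs/_search
    {
      "size": 0,
      "aggs": {
        "latency_percentiles": {
          "percentiles": {
            "field": "response_time_ms",
            "percents": [ 50, 95, 99 ],
            "tdigest": { "compression": 200 }
          }
        }
      }
    }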

  61. Summary
    Tradeoffs
    Behaviour
    Algorithms
    Data Structures
    Every distributed system is different

  62. Summary - easing the use of distributed systems
    For developers
    Elasticsearch Clients check for nodes in the background
    For operations
    ECK, terraform provider, ecctl

  63. More...
    SQL: Distributed transactions
    SQL: Distributed joins
    Collaborative Editing
    CRDTs
    Failure detection (Phi/Adaptive accrual failure detector)
    Data partitioning (ring based)
    Recovery after errors (read repair)
    Secondary indexes
    Leader/Follower approaches

  64. More...
    Cluster Membership: Gossip (Scuttlebutt, SWIM)
    Spanner (CockroachDB)
    Calvin (Fauna)
    Java: ScaleCube, swim-java
    Java: Ratis, JRaft
    Java: Atomix
    Go: Serf

  65. Consistency models, from jepsen.io

  66. Academia

  67. Books, books, books

  69. Thanks for listening!
