Elasticsearch: Distributed Search Under the Hood

Distributed Search ... Under the Hood
Alexander Reelsen | @spinscale

Today's goal: Understanding...
- Complexities
- Tradeoffs
- Simplifications
- Error scenarios
... distributed systems

Agenda
- Distributed systems - But why?
- Elasticsearch
- Data
- Search
- Analytics

The need for distributed systems
- Exceeding single system limits (CPU, Memory, Storage)
- Load sharing
- Parallelization (shorter response times)
- Reliability (SPOF)
- Price

Load sharing

Load sharing
- Synchronization
- Coordination & Load balancing

Reliability

Parallelization

Parallelization
- Reduce?
- Sort?

Boundaries increase complexity
- Core
- Computer
- LAN
- WAN
- Internet

Boundaries increase complexity
- 🗣 Communication
- ✍ Coordination
- Error handling

Fallacies of Distributed Computing
- The network is reliable
- Latency is zero
- Bandwidth is infinite
- The network is secure
- Topology doesn't change
- There is one administrator
- Transport cost is zero
- The network is homogeneous

Consensus
- Achieving a common state among participants
- Byzantine Failures
- Trust
- Crash
- Quorum vs. strictness

Consensus goals
- Cluster Membership
- Data writes
- Security (BTC)
- Finding a leader (Paxos, Raft)

Elasticsearch introduction

Elasticsearch
Speed. Scale. Relevance.
- HTTP based JSON interface
- Scales to many nodes
- Fast responses
- Ranked results (BM25, recency, popularity)
- Resiliency
- Flexibility (index time vs. query time)
- Based on Apache Lucene

Use Cases
- E-Commerce, E-Procurement, Patents, Dating
- Maps: Geo based search
- Observability: Logs, Metrics, APM, Uptime
- Enterprise Search: Site & App Search, Workplace Search
- Security: SIEM, Endpoint Security

Master

Master node
- Pings other nodes
- Decides data placement
- Removes nodes from the cluster
- Not needed for reading/writing
- Updates the cluster state and distributes it to all nodes
- Re-election on failure

Node startup

Elects itself

Node join

Master node is not reachable

Reelection within remaining nodes
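
Why a re-election can only succeed among a large enough group of remaining nodes can be sketched with a simple majority check. This is a minimal illustration and not Elasticsearch's actual election code; the function names, node names and the "pick the smallest name" rule are assumptions made for the example.

```python
def has_quorum(reachable: set, master_eligible: set) -> bool:
    """True if the reachable master-eligible nodes form a strict majority."""
    return len(reachable & master_eligible) > len(master_eligible) // 2


def elect(reachable: set, master_eligible: set):
    """Return a new master, or None if this side of a partition lacks a majority."""
    if not has_quorum(reachable, master_eligible):
        return None
    # Assumption for the sketch: choose deterministically by node name.
    return min(reachable & master_eligible)


nodes = {"node-a", "node-b", "node-c"}

# The old master node-a becomes unreachable: two of three nodes remain connected.
majority_side = elect({"node-b", "node-c"}, nodes)
# The isolated node cannot elect itself, which avoids two masters (split brain).
minority_side = elect({"node-a"}, nodes)

print(majority_side, minority_side)
```

With an even number of master-eligible nodes, a clean half/half split leaves neither side with a majority, which is one reason odd node counts are a common choice.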

Cluster State
- Nodes
- Data (Shards on nodes)
- Mapping (DDL)
- Updated based on events (node join/leave, index creation, mapping update)
- Sent as diff due to size
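
The idea of sending the cluster state as a diff can be sketched with plain dictionaries. This is only an illustration under simplifying assumptions: the state layout, `make_diff` and `apply_diff` are hypothetical, and the real cluster state and its diff format are considerably richer.

```python
_MISSING = object()  # sentinel so that a legitimate None value is still detected as a change


def make_diff(old: dict, new: dict) -> dict:
    """Describe only the top-level sections that changed between two states."""
    changed = {k: v for k, v in new.items() if old.get(k, _MISSING) != v}
    removed = [k for k in old if k not in new]
    return {"changed": changed, "removed": removed}


def apply_diff(state: dict, diff: dict) -> dict:
    """Rebuild the new state on a receiving node from its old state plus the diff."""
    result = {k: v for k, v in state.items() if k not in diff["removed"]}
    result.update(diff["changed"])
    return result


old_state = {
    "nodes": ["node-a", "node-b"],
    "shards": {"products[0]": "node-a", "products[1]": "node-b"},
    "mappings": {"products": {"name": "text", "price": "float"}},
}
# Event: a node joins. Only the "nodes" section changes.
new_state = dict(old_state, nodes=["node-a", "node-b", "node-c"])

diff = make_diff(old_state, new_state)
print(diff)
```

The diff carries the changed `nodes` section only, while the much larger shard table and mappings stay on the receiving nodes untouched.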

Data distribution
- Shard: Unit of work, self contained inverted index
- Index: A logical grouping of shards
- Primary shard: Partitioning of data in an index (write scalability)
- Replica shard: Copy of a primary shard (read scalability)
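
How documents end up on primary shards can be sketched as hashing a routing value modulo the number of primary shards. The sketch below is an approximation: Elasticsearch hashes the routing value (by default the document id) with Murmur3, while `zlib.crc32` is used here only because it ships with the standard library. `shard_for` and `distribute` are hypothetical helper names.

```python
import zlib


def shard_for(routing_key: str, num_primary_shards: int) -> int:
    """Map a routing value to a primary shard number (crc32 as a stand-in hash)."""
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


def distribute(doc_ids, num_primary_shards: int) -> dict:
    """Group document ids by the primary shard they would be written to."""
    shards = {n: [] for n in range(num_primary_shards)}
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_primary_shards)].append(doc_id)
    return shards


docs = [f"doc-{i}" for i in range(1000)]
placement = distribute(docs, 5)
print({shard: len(ids) for shard, ids in placement.items()})

# Changing the number of primary shards changes the target shard of many documents.
moved = sum(shard_for(d, 5) != shard_for(d, 6) for d in docs)
print(f"{moved} of {len(docs)} documents would land on a different shard")
```

Because the shard number depends on the primary shard count, that count cannot simply be changed on a live index without moving data, whereas replicas can be added or removed freely.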

Primary shards

Primary shards per index

Redistribution

Replicas

Distributed search
- Two phase approach
- Query all shards, collect top-k hits
- Sort all search results on coordinating node
- Create real top-k result from all results
- Fetch data for all real results (top-k instead of shard_count * top-k)
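
The two phases can be sketched in a few lines. This is a toy model, assuming in-memory dictionaries as shards and a made-up scoring function that counts term occurrences; `query_phase` and `search` are hypothetical names, not Elasticsearch APIs.

```python
import heapq


def query_phase(shard_docs: dict, score, k: int):
    """Runs on each shard: return only (score, doc_id) pairs for the local top-k."""
    scored = ((score(text), doc_id) for doc_id, text in shard_docs.items())
    return heapq.nlargest(k, scored)


def search(shards: dict, score, k: int):
    """Runs on the coordinating node."""
    # Phase 1 (query): collect at most shard_count * k lightweight candidates.
    candidates = []
    for shard_id, docs in shards.items():
        for s, doc_id in query_phase(docs, score, k):
            candidates.append((s, shard_id, doc_id))
    # Merge: sort the candidates and keep the real global top-k.
    top = heapq.nlargest(k, candidates)
    # Phase 2 (fetch): load the full documents for those k hits only.
    return [(doc_id, s, shards[shard_id][doc_id]) for s, shard_id, doc_id in top]


shards = {
    0: {"a": "elasticsearch elasticsearch", "b": "kibana"},
    1: {"c": "elasticsearch", "d": "elasticsearch elasticsearch elasticsearch"},
}
hits = search(shards, score=lambda text: text.split().count("elasticsearch"), k=2)
print(hits)
```

Only ids and scores travel in the first phase, so document bodies are fetched for k hits rather than for every candidate each shard returned.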

Adaptive Replica Selection
- Which shards are the best to select?
- Each node keeps information about:
- Response time of prior requests
- Previous search durations
- Search threadpool size
- Less loaded nodes will receive more queries
- More info: Blog post, C3 paper
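
The general idea of preferring less loaded shard copies can be sketched as follows. This is deliberately NOT the C3 formula and not Elasticsearch's implementation: the ranking (smoothed response time multiplied by outstanding requests), the `NodeStats` class and the smoothing factor are assumptions chosen to keep the example small.

```python
class NodeStats:
    """What a coordinating node might remember about one data node (hypothetical)."""

    def __init__(self, alpha: float = 0.3):
        self.alpha = alpha            # weight of the newest observation
        self.ewma_response_ms = None  # smoothed response time of prior requests
        self.outstanding = 0          # requests sent but not yet answered

    def record(self, response_ms: float) -> None:
        if self.ewma_response_ms is None:
            self.ewma_response_ms = response_ms
        else:
            self.ewma_response_ms = (
                self.alpha * response_ms + (1 - self.alpha) * self.ewma_response_ms
            )

    def rank(self) -> float:
        """Lower is better; nodes without any history are tried first."""
        if self.ewma_response_ms is None:
            return 0.0
        return self.ewma_response_ms * (1 + self.outstanding)


def pick_copy(stats_by_node: dict) -> str:
    """Choose the node holding the shard copy with the best (lowest) rank."""
    return min(stats_by_node, key=lambda node: stats_by_node[node].rank())


stats = {"node-a": NodeStats(), "node-b": NodeStats()}
stats["node-a"].record(10)
stats["node-a"].record(10)
stats["node-b"].record(100)
first_choice = pick_copy(stats)      # node-a answered faster so far

stats["node-a"].outstanding = 20     # node-a is now busy with many open requests
second_choice = pick_copy(stats)     # the slower but idle node-b wins
print(first_choice, second_choice)
```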

Searching faster by searching less
- Optimization applies to all shards in the query phase
- Skip non-competitive hits for top-k retrieval
- Trades accuracy of hit counts for speed
- More info: Blog post

Example: elasticsearch OR kibana OR logstash
- At some point there is a minimal score required to be in the top-k documents
- If the highest score one of the search terms can contribute is lower than that minimal score, the term can be skipped
- Query could be changed to elasticsearch OR kibana for finding matches, thus skipping all documents containing only logstash
- Result: Major speed up
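
The reasoning in this example can be sketched as a small function that splits query terms into those still needed to find candidates and those that can no longer produce a competitive hit on their own. The per-term maximum scores and the threshold are made-up numbers, `split_terms` is a hypothetical name, and Lucene's actual implementation works on blocks of postings and is far more refined.

```python
def split_terms(max_scores: dict, min_competitive_score: float):
    """Return (essential, non_essential) terms for a disjunction.

    Terms are visited from the lowest to the highest maximum score. As long as
    the summed maximum of the visited terms stays below the minimal competitive
    score, a document matching only those terms cannot enter the top-k.
    """
    non_essential, running_max = [], 0.0
    for term, max_score in sorted(max_scores.items(), key=lambda kv: kv[1]):
        if running_max + max_score < min_competitive_score:
            running_max += max_score
            non_essential.append(term)
        else:
            break
    essential = [t for t in max_scores if t not in non_essential]
    return essential, non_essential


# Hypothetical upper bounds for the score each term can contribute.
max_scores = {"elasticsearch": 3.0, "kibana": 2.5, "logstash": 1.0}

# Early on, the top-k is not full yet: every term can still produce a hit.
print(split_terms(max_scores, min_competitive_score=0.0))
# Later, the k-th best hit scores 1.5: documents with only "logstash" are skipped.
essential, non_essential = split_terms(max_scores, min_competitive_score=1.5)
print(essential, non_essential)
```

Skipped terms are still used for scoring documents that match an essential term; they just no longer drive the iteration, which is also why exact total hit counts are lost.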

Lucene Nightly Benchmarks

Many more optimizations
- Skip lists (search delta encoded postings list)
- Two phase iterations (approximation & verification)
- Integer compression
- Data structures like BKD tree for numbers, FSTs for completion
- Index sorting

Aggregations
- Aggregations run on top of a result set of a query
- Slice, dice and combine data to get insights
- Show me the total sales value by quarter for each sales person
- The average response time per URL endpoint per day
- The number of products within each category
- The biggest order of each month in the last year

Distributed Aggregations
- Some calculations require your data to be central for a certain use-case
- Unique values in my dataset - how does this work across shards without sending the whole data set to a single node?
- Solution: Be less accurate, sometimes be probabilistic!

terms Aggregation
- Count all the categories from returned products

terms Aggregation

Counts

Counts
- Count more than top-n buckets: size * 1.5 + 10
- Does not eliminate the problem, only reduces it
- Provides doc_count_error_upper_bound
- Possible solution: Add more roundtrips?
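
Why per-shard top-n counting can go wrong, and how asking each shard for more buckets helps, can be sketched with `collections.Counter`. The data is invented, `terms_agg` is a hypothetical helper, and the error bound is computed in a simplified way (sum of the smallest returned count of every shard that had to cut buckets off).

```python
from collections import Counter


def terms_agg(shards, size: int, shard_size=None):
    """Merge per-shard top buckets into a global top-`size` list."""
    if shard_size is None:
        shard_size = int(size * 1.5 + 10)  # over-fetch rule from the slide
    merged = Counter()
    error_upper_bound = 0
    for counts in shards:
        top = counts.most_common(shard_size)
        merged.update(dict(top))
        if len(counts) > len(top):
            # A bucket this shard did not return can hold at most the count
            # of the last bucket it did return.
            error_upper_bound += top[-1][1]
    return merged.most_common(size), error_upper_bound


shards = [
    Counter({"shoes": 6, "books": 4}),
    Counter({"toys": 5, "books": 4}),
]

# Each shard returns only its single best bucket: "books" (8 in total) is never seen.
too_small = terms_agg(shards, size=1, shard_size=1)
# Over-fetching one more bucket per shard reveals the real winner.
large_enough = terms_agg(shards, size=1, shard_size=2)
print(too_small, large_enough)
```

A non-zero error bound signals that the returned counts may be off; it does not say which bucket is affected.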

Probabilistic Data Structures
- Membership check: Bloom, Cuckoo filters
- Frequencies in an event stream: Count-min sketch
- Cardinality: LogLog algorithms
- Quantiles: HDR, T-Digest
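
As one concrete instance from this list, a count-min sketch can be written in a few lines. It is a minimal sketch under assumptions: the width, depth and the use of SHA-256 to derive the row hashes are arbitrary choices for the example, not taken from any particular library.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency table: may over-count, never under-counts."""

    def __init__(self, width: int = 256, depth: int = 4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item: str):
        # One column per row, derived from a row-specific hash of the item.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item: str) -> int:
        # Collisions only ever add to a cell, so the minimum is the best guess.
        return min(self.table[row][col] for row, col in self._cells(item))

    def merge(self, other: "CountMinSketch") -> "CountMinSketch":
        merged = CountMinSketch(self.width, self.depth)
        merged.table = [
            [a + b for a, b in zip(r1, r2)] for r1, r2 in zip(self.table, other.table)
        ]
        return merged


events_a = ["GET /"] * 5 + ["GET /login"] * 2
events_b = ["GET /"] * 3 + ["POST /cart"]

sketch_a, sketch_b = CountMinSketch(), CountMinSketch()
for e in events_a:
    sketch_a.add(e)
for e in events_b:
    sketch_b.add(e)

merged = sketch_a.merge(sketch_b)
print(merged.estimate("GET /"))  # at least the true count of 8
```

Merging is a cell-wise addition, so two shards can each build a sketch and a coordinating node can combine them without seeing the raw events.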

How many distinct elements are across my whole dataset?
cardinality Aggregation

cardinality Result: 40-65?!

cardinality Aggregation
- Solution: HyperLogLog++
- Mergeable data structure
- Approximate
- Trades memory for accuracy
- Fixed memory usage based on configured precision_threshold
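
The properties listed here (mergeable, approximate, fixed memory) can be seen in a small implementation of plain HyperLogLog. Note the assumptions: this is the basic algorithm, not HyperLogLog++ (which adds a sparse representation and bias correction), the parameter `p` is this sketch's own register exponent and not Elasticsearch's `precision_threshold`, and SHA-1 is used merely as a convenient 64-bit hash source.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p: int = 12):
        self.p = p
        self.m = 1 << p                # number of registers, fixed memory
        self.registers = [0] * self.m

    def add(self, value: str) -> None:
        h = int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:8], "big")
        index = h >> (64 - self.p)                 # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Position of the leftmost 1-bit in the remaining bits (1-based).
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other: "HyperLogLog") -> "HyperLogLog":
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw


# Two shards with overlapping users: 50,000 distinct values in total.
shard_a_values = [f"user-{i}" for i in range(0, 30000)]
shard_b_values = [f"user-{i}" for i in range(20000, 50000)]

sketch_a, sketch_b = HyperLogLog(), HyperLogLog()
for v in shard_a_values:
    sketch_a.add(v)
for v in shard_b_values:
    sketch_b.add(v)

merged = sketch_a.merge(sketch_b)
print(round(merged.estimate()))  # an approximation of 50,000, not an exact count
```

Each shard ships 4,096 small registers instead of its values, and taking the register-wise maximum yields the same sketch as if all values had been added on a single node.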

percentiles Aggregation
- Naive implementation (sorted array) is not mergeable across shards and scales with the number of documents in a shard
- T-Digest utilizes a clustering approach that reduces memory usage by falling back to approximation at a certain size
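
T-Digest itself is too involved for a short example. As a much simpler stand-in that shows the same two properties (memory independent of the number of documents, mergeable across shards), the sketch below uses a fixed-width bucket histogram. This is explicitly not T-Digest: it needs the value range up front, and its error is bounded by the bucket width everywhere instead of being smallest at the tails. All names are hypothetical.

```python
class BucketHistogram:
    """Approximate percentiles from a fixed number of equal-width buckets."""

    def __init__(self, lo: float, hi: float, buckets: int = 100):
        self.lo, self.hi, self.n = lo, hi, buckets
        self.width = (hi - lo) / buckets
        self.counts = [0] * buckets

    def add(self, x: float) -> None:
        i = int((x - self.lo) / self.width)
        self.counts[min(max(i, 0), self.n - 1)] += 1  # clamp out-of-range values

    def merge(self, other: "BucketHistogram") -> "BucketHistogram":
        merged = BucketHistogram(self.lo, self.hi, self.n)
        merged.counts = [a + b for a, b in zip(self.counts, other.counts)]
        return merged

    def percentile(self, p: float) -> float:
        target = p / 100 * sum(self.counts)
        seen = 0
        for i, count in enumerate(self.counts):
            seen += count
            if seen >= target:
                return self.lo + (i + 0.5) * self.width  # bucket midpoint
        return self.hi


# Response times in ms, split across two shards.
shard_a = BucketHistogram(0, 10000)
shard_b = BucketHistogram(0, 10000)
for ms in range(0, 5000):
    shard_a.add(ms)
for ms in range(5000, 10000):
    shard_b.add(ms)

merged = shard_a.merge(shard_b)
print(merged.percentile(50), merged.percentile(99))
```

The naive alternative would ship every shard's sorted values to the coordinating node; here each shard sends 100 counters regardless of how many documents it holds.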

Summary
- Tradeoffs
- Behaviour
- Algorithms
- Data Structures
- Every distributed system is different

Summary - ease distributed systems usage
- For developers: Elasticsearch Clients check for nodes in the background
- For operations: ECK, terraform provider, ecctl

More...
- SQL: Distributed transactions
- SQL: Distributed joins
- Collaborative Editing
- CRDTs
- Failure detection (Phi/Adaptive accrual failure detector)
- Data partitioning (ring based)
- Recovery after errors (read repair)
- Secondary indexes
- Leader/Follower approaches

More...
- Cluster Membership: Gossip (Scuttlebutt, SWIM)
- Spanner (CockroachDB)
- Calvin (Fauna)
- Java: ScaleCube, swim-java
- Java: Ratis, JRaft
- Java: Atomix
- Go: Serf

Consistency models, from jepsen.io

Academia

Books, books, books

Thanks for listening!
