Slide 1

Slide 1 text

Distributed Search... ... Under the Hood Alexander Reelsen [email protected] | @spinscale

Slide 2

Slide 2 text

Today's goal: Understanding... Complexities Tradeoffs Simplifications Error scenarios ... distributed systems

Slide 3

Slide 3 text

Agenda Distributed systems - But why? Elasticsearch Data Search Analytics

Slide 4

Slide 4 text

The need for distributed systems Exceeding single system limits (CPU, Memory, Storage) Load sharing Parallelization (shorter response times) Reliability (SPOF) Price

Slide 5

Slide 5 text

Load sharing

Slide 6

Slide 6 text

Load sharing Synchronization Coordination & Load balancing

Slide 7

Slide 7 text

Reliability

Slide 8

Slide 8 text

Parallelization

Slide 9

Slide 9 text

Parallelization Reduce? Sort?

Slide 10

Slide 10 text

Boundaries increase complexity Core Computer LAN WAN Internet

Slide 11

Slide 11 text

Boundaries increase complexity Core Computer LAN WAN Internet

Slide 12

Slide 12 text

Boundaries increase complexity Core Computer LAN WAN Internet

Slide 13

Slide 13 text

Boundaries increase complexity 🗣 Communication ✍ Coordination Error handling

Slide 14

Slide 14 text

Fallacies of Distributed Computing The network is reliable Latency is zero Bandwidth is infinite The network is secure Topology doesn't change There is one administrator Transport cost is zero The network is homogeneous

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

Consensus Achieving a common state among participants Byzantine Failures Trust Crash Quorum vs. strictness

Slide 17

Slide 17 text

Consensus goals Cluster Membership Data writes Security (BTC) Finding a leader (Paxos, Raft)

Slide 18

Slide 18 text

Elasticsearch introduction

Slide 19

Slide 19 text

Elasticsearch Speed. Scale. Relevance. HTTP based JSON interface Scales to many nodes Fast responses Ranked results (BM25, recency, popularity) Resiliency Flexibility (index time vs. query time) Based on Apache Lucene
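A minimal sketch of the HTTP/JSON interface, assuming a local single-node cluster on localhost:9200; the index name "talks" and field "title" are invented for the example.

```python
import requests

# Full-text search against a local cluster; index and field names are made up.
resp = requests.get(
    "http://localhost:9200/talks/_search",
    json={"query": {"match": {"title": "distributed search"}}},
)
for hit in resp.json()["hits"]["hits"]:
    # Each hit carries its relevance score (BM25-based) and the original document
    print(hit["_score"], hit["_source"])
```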

Slide 20

Slide 20 text

Use Cases E-Commerce, E-Procurement, Patents, Dating Maps: Geo based search Observability: Logs, Metrics, APM, Uptime Enterprise Search: Site & App Search, Workplace Search Security: SIEM, Endpoint Security

Slide 21

Slide 21 text

Master

Slide 22

Slide 22 text

Master node Pings other nodes Decides data placement Removes nodes from the cluster Not needed for reading/writing Updates the cluster state and distributes to all nodes Re-election on failure

Slide 23

Slide 23 text

Node startup

Slide 24

Slide 24 text

Elects itself

Slide 25

Slide 25 text

Node join

Slide 26

Slide 26 text

Node join

Slide 27

Slide 27 text

Master node is not reachable

Slide 28

Slide 28 text

Re-election within the remaining nodes

Slide 29

Slide 29 text

Cluster State Nodes Data (Shards on nodes) Mapping (DDL) Updated based on events (node join/leave, index creation, mapping update) Sent as diff due to size
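A rough way to peek at this from the outside is the _cluster/state API, assuming a local cluster; the fields shown (master_node, nodes, routing_table) are as returned by recent versions.

```python
import requests

# Inspect the cluster state: which node is the elected master, which nodes are
# known, and which indices have shard placements in the routing table.
state = requests.get("http://localhost:9200/_cluster/state").json()
print(state["master_node"])                     # node id of the current master
print(list(state["nodes"]))                     # node ids known to the cluster state
print(list(state["routing_table"]["indices"]))  # indices with shard placements
```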

Slide 30

Slide 30 text

Data distribution Shard: Unit of work, a self-contained inverted index Index: A logical grouping of shards Primary shard: Partitioning of data in an index (write scalability) Replica shard: Copy of a primary shard (read scalability)
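A hedged sketch of how this layout is chosen at index creation time, assuming a local cluster; the index name "orders" and the shard/replica counts are just example values.

```python
import requests

# Create an index with an explicit shard layout: 3 primary shards partition the
# data for write scalability, 1 replica per primary adds a copy for read
# scalability and resiliency.
requests.put(
    "http://localhost:9200/orders",
    json={"settings": {"number_of_shards": 3, "number_of_replicas": 1}},
)
```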

Slide 31

Slide 31 text

Primary shards

Slide 32

Slide 32 text

Primary shards per index

Slide 33

Slide 33 text

Redistribution

Slide 34

Slide 34 text

Redistribution

Slide 35

Slide 35 text

Replicas

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

Distributed search Two-phase approach Query all shards, collect top-k hits per shard Sort all shard results on the coordinating node Create the real top-k result from all results Fetch data only for the real results (top-k instead of shard_count * top-k)
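A toy sketch of the query-then-fetch idea in plain Python (not Elasticsearch code), just to illustrate why only k documents are fetched at the end instead of shard_count * k.

```python
import heapq

# Each shard returns only lightweight (score, doc_id) pairs for its local top-k;
# the coordinating node merges them into the global top-k and only then fetches
# the full documents for those k ids.

def query_phase(shard, k):
    return heapq.nlargest(k, shard, key=lambda hit: hit[0])

def distributed_search(shards, k, fetch_doc):
    candidates = [hit for shard in shards for hit in query_phase(shard, k)]
    top_k = heapq.nlargest(k, candidates, key=lambda hit: hit[0])
    # fetch phase: resolve only the winning doc ids to full documents
    return [(score, fetch_doc(doc_id)) for score, doc_id in top_k]

# Example with two "shards" holding (score, doc_id) pairs and a fake document store
docs = {1: "a", 2: "b", 3: "c", 4: "d"}
shards = [[(1.2, 1), (0.4, 2)], [(2.1, 3), (0.9, 4)]]
print(distributed_search(shards, k=2, fetch_doc=docs.get))  # [(2.1, 'c'), (1.2, 'a')]
```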

Slide 38

Slide 38 text

No content

Slide 39

Slide 39 text

No content

Slide 40

Slide 40 text

No content

Slide 41

Slide 41 text

No content

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

Adaptive Replica Selection Which shards are the best to select? Each node keeps information about Response time of prior requests Previous search durations Search threadpool size Less loaded nodes will receive more queries More info: Blog post, C3 paper
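A very rough sketch of the underlying idea, not the actual C3 formula: keep a moving average of how fast and how busy each copy's node is, and route the query to the cheapest one. The class, field names and cost function below are all made up for illustration.

```python
class NodeStats:
    """Per-node statistics maintained by the coordinating node (illustrative only)."""

    def __init__(self):
        self.response_time_ewma = 0.0
        self.queue_size_ewma = 0.0

    def observe(self, response_time_ms, queue_size, alpha=0.3):
        # exponentially weighted moving averages of recent observations
        self.response_time_ewma = alpha * response_time_ms + (1 - alpha) * self.response_time_ewma
        self.queue_size_ewma = alpha * queue_size + (1 - alpha) * self.queue_size_ewma

    def rank(self):
        # made-up cost function: slower and busier nodes get a worse (higher) rank
        return self.response_time_ewma * (1 + self.queue_size_ewma)

def pick_replica(nodes):
    return min(nodes, key=lambda name: nodes[name].rank())

nodes = {"node-1": NodeStats(), "node-2": NodeStats()}
nodes["node-1"].observe(response_time_ms=120, queue_size=8)
nodes["node-2"].observe(response_time_ms=30, queue_size=2)
print(pick_replica(nodes))  # node-2: faster and less busy
```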

Slide 45

Slide 45 text

Searching faster by searching less Optimization applies to all shards in the query phase Skip non-competitive hits for top-k retrieval Trading accuracy of hit counts for speed More info: Blog post

Slide 46

Slide 46 text

Example: elasticsearch OR kibana OR logstash At some point there is a minimum score required to be in the top-k documents If one of the search terms can only contribute a score below that minimum, it can be skipped The query can effectively be reduced to elasticsearch OR kibana for finding matches, thus skipping all documents containing only logstash Result: Major speed-up
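A toy sketch in the spirit of MAXSCORE/WAND, not Lucene's actual implementation, with a strong simplification: every document matches exactly one term, so its score is just that term's score.

```python
import heapq

# Terms are visited from highest to lowest possible score contribution. Once the
# current top-k threshold exceeds a term's best possible score, that term (and
# every later one) can be skipped entirely.

def top_k_or(term_hits, term_max_score, k):
    heap = []  # min-heap holding the current top-k as (score, doc_id)
    for term in sorted(term_hits, key=term_max_score.get, reverse=True):
        threshold = heap[0][0] if len(heap) == k else float("-inf")
        if term_max_score[term] <= threshold:
            break  # non-competitive: documents matching only this term are never scored
        for doc_id, score in term_hits[term]:
            if len(heap) < k:
                heapq.heappush(heap, (score, doc_id))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)

hits = {
    "elasticsearch": [(1, 3.2), (2, 2.9)],
    "kibana": [(3, 3.0)],
    "logstash": [(4, 0.7), (5, 0.6)],  # best possible score 0.7: skipped for k=2
}
print(top_k_or(hits, {"elasticsearch": 3.2, "kibana": 3.0, "logstash": 0.7}, k=2))
# [(3.2, 1), (3.0, 3)] - the logstash-only documents were never scored
```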

Slide 47

Slide 47 text

Lucene Nightly Benchmarks

Slide 48

Slide 48 text

Many more optimizations Skip lists (to search delta-encoded postings lists) Two-phase iteration (approximation & verification) Integer compression Data structures like the BKD tree for numbers, FSTs for completion Index sorting
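For flavour, a tiny sketch of delta-encoding a postings list; this is illustrative Python, not Lucene's actual encoding, but it shows why the stored numbers become small enough for integer compression and skipping to pay off.

```python
# Store gaps between sorted doc ids instead of the absolute ids.

def delta_encode(doc_ids):
    prev, gaps = 0, []
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def delta_decode(gaps):
    doc_id, doc_ids = 0, []
    for gap in gaps:
        doc_id += gap
        doc_ids.append(doc_id)
    return doc_ids

postings = [3, 7, 21, 22, 90]
print(delta_encode(postings))  # [3, 4, 14, 1, 68] - small gaps compress well
assert delta_decode(delta_encode(postings)) == postings
```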

Slide 49

Slide 49 text

Aggregations Aggregations run on top of the result set of a query Slice, dice and combine data to get insights Show me the total sales value by quarter for each sales person The average response time per URL endpoint per day The number of products within each category The biggest order of each month in the last year
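As a rough sketch, the first of those questions could look like the request below, assuming a local 7.x+ cluster; the index name "sales" and the field names are invented for the example.

```python
import requests

# Total sales value per quarter for each sales person: a terms aggregation with a
# date_histogram and a sum sub-aggregation. size=0 skips the hits themselves.
resp = requests.get(
    "http://localhost:9200/sales/_search",
    json={
        "size": 0,
        "aggs": {
            "per_sales_person": {
                "terms": {"field": "sales_person"},
                "aggs": {
                    "per_quarter": {
                        "date_histogram": {"field": "order_date", "calendar_interval": "quarter"},
                        "aggs": {"total": {"sum": {"field": "price"}}},
                    }
                },
            }
        },
    },
)
print(resp.json()["aggregations"])
```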

Slide 50

Slide 50 text

Distributed Aggregations Some calculations require all the data in one place Unique values in my dataset - how does this work across shards without sending the whole data set to a single node? Solution: Be less accurate, sometimes be probabilistic!

Slide 51

Slide 51 text

terms Aggregation Count all the categories from returned products

Slide 52

Slide 52 text

terms Aggregation

Slide 53

Slide 53 text

Counts

Slide 54

Slide 54 text

Counts

Slide 55

Slide 55 text

Counts Count more than the top-n buckets: size * 1.5 + 10 Does not eliminate the problem, only reduces it Provides doc_count_error_upper_bound Possible solution: Add more roundtrips?
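A toy illustration of where the counting error comes from, in plain Python rather than Elasticsearch internals: each shard only reports its local top buckets, so contributions from buckets that just missed the local cut are lost.

```python
from collections import Counter

def merged_top(shards, size, shard_size):
    # merge each shard's local top-`shard_size` buckets on the "coordinating node"
    merged = Counter()
    for shard in shards:
        merged.update(dict(Counter(shard).most_common(shard_size)))
    return merged.most_common(size)

shards = [
    ["a"] * 5 + ["b"] * 3 + ["c"] * 2,  # local top-2: a=5, b=3 (c's 2 hits are lost)
    ["c"] * 6 + ["b"] * 4 + ["a"] * 1,  # local top-2: c=6, b=4 (a's 1 hit is lost)
]
# true counts: c=8, b=7, a=6
print(merged_top(shards, size=2, shard_size=2))                  # [('b', 7), ('c', 6)] - wrong order and counts
print(merged_top(shards, size=2, shard_size=int(2 * 1.5 + 10)))  # [('c', 8), ('b', 7)] - exact for this tiny example
```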

Slide 56

Slide 56 text

Probabilistic Data Structures Membership check: Bloom, Cuckoo filters Frequencies in an event stream: Count-min sketch Cardinality: LogLog algorithms Quantiles: HDR, T-Digest
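A minimal Bloom filter sketch to give a flavour of the trade-off: membership checks can return false positives but never false negatives, in exchange for a small, fixed amount of memory. Illustrative only, not a production implementation.

```python
class BloomFilter:
    """Tiny Bloom filter using Python's built-in hash with different seeds."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set stored in a single integer

    def _positions(self, item):
        for seed in range(self.num_hashes):
            yield hash((seed, item)) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bf = BloomFilter()
bf.add("elasticsearch")
print(bf.might_contain("elasticsearch"))  # True
print(bf.might_contain("kibana"))         # False (with high probability)
```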

Slide 57

Slide 57 text

How many distinct elements are across my whole dataset? cardinality Aggregation

Slide 58

Slide 58 text

cardinality Result: 40-65?!

Slide 59

Slide 59 text

cardinality Aggregation Solution: HyperLogLog++, a mergeable data structure Approximate Trades memory for accuracy Fixed memory usage based on the configured precision_threshold
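A hedged request example, assuming a local cluster; the index name "orders" and the field "customer_id" are invented. precision_threshold is the knob mentioned above: below it, counts are close to exact; above it, the HyperLogLog++ sketch keeps the error within a few percent.

```python
import requests

# How many distinct customers are in the whole dataset?
resp = requests.get(
    "http://localhost:9200/orders/_search",
    json={
        "size": 0,
        "aggs": {
            "unique_customers": {
                "cardinality": {"field": "customer_id", "precision_threshold": 3000}
            }
        },
    },
)
print(resp.json()["aggregations"]["unique_customers"]["value"])
```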

Slide 60

Slide 60 text

percentiles Aggregation A naive implementation (sorted array) is not mergeable across shards and scales with the number of documents in a shard T-Digest uses a clustering approach that reduces memory usage by falling back to approximation at a certain size
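A hedged request example, assuming a local cluster; the index name "logs" and the field "took_ms" are invented. The per-shard digests are merged on the coordinating node before the percentiles are reported.

```python
import requests

# Latency percentiles across the whole index.
resp = requests.get(
    "http://localhost:9200/logs/_search",
    json={
        "size": 0,
        "aggs": {
            "response_time": {
                "percentiles": {"field": "took_ms", "percents": [50, 95, 99]}
            }
        },
    },
)
print(resp.json()["aggregations"]["response_time"]["values"])
```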

Slide 61

Slide 61 text

Summary Tradeoffs Behaviour Algorithms Data Structures Every distributed system is different

Slide 62

Slide 62 text

Summary - easing the use of distributed systems For developers: Elasticsearch clients check for nodes in the background For operations: ECK, Terraform provider, ecctl

Slide 63

Slide 63 text

More... SQL: Distributed transactions SQL: Distributed joins Collaborative Editing CRDTs Failure detection (Phi/Adaptive accrual failure detector) Data partitioning (ring based) Recovery after errors (read repair) Secondary indexes Leader/Follower approaches

Slide 64

Slide 64 text

More... Cluster Membership: Gossip (Scuttlebutt, SWIM) Spanner (CockroachDB) Calvin (Fauna) Java: ScaleCube, swim-java Java: Ratis, JRaft Java: Atomix Go: Serf

Slide 65

Slide 65 text

Consistency models, from jepsen.io

Slide 66

Slide 66 text

Academia

Slide 67

Slide 67 text

Books, books, books

Slide 68

Slide 68 text

Books, books, books

Slide 69

Slide 69 text

Thanks for listening!