Elasticsearch Performance Best Practices

Slide 1

Slide 1 text

Elasticsearch Performance Best Practices
Patrick Peschlow

Slide 2

Slide 2 text

Fundamentals

Slide 3

Slide 3 text

Documents
{ _id: "1", author: "Patrick Peschlow", title: "Elasticsearch Performance" }
{ _id: "2", author: "Patrick Peschlow", title: "Elasticsearch Scalability" }

Slide 4

Slide 4 text

Queries
{ match: { content: "performance" } }

Slide 5

Slide 5 text

Results
{
  hits: {
    total: 1,
    hits: [
      {
        _id: "1",
        _score: 0.15342641,
        _source: { author: "Patrick Peschlow", title: "Elasticsearch Performance" }
      }
    ]
  }
}

Slide 6

Slide 6 text

How indexing works
[Diagram labels: Persisted segments, Searchable]

Slide 7

Slide 7 text

How indexing works
[Diagram labels: Persisted segments, Searchable]

Slide 8

Slide 8 text

How indexing works
[Diagram labels: Transaction Log, Persisted segments, Searchable]

Slide 9

Slide 9 text

How indexing works
[Diagram labels: Transaction Log, Persisted segments, Searchable; operation shown: translog()]

Slide 10

Slide 10 text

How indexing works
[Diagram labels: Indexing Buffer, Transaction Log, Persisted segments, Searchable; operation shown: translog()]

Slide 11

Slide 11 text

How indexing works
[Diagram labels: Indexing Buffer, Transaction Log, Persisted segments, Searchable; operation shown: index()]

Slide 12

Slide 12 text

How indexing works
[Diagram labels: Indexing Buffer, Transaction Log, Persisted segments, Searchable]

Slide 13

Slide 13 text

How indexing works
[Diagram labels: Indexing Buffer, Transaction Log, Persisted segments, Searchable; operation shown: refresh()]

Slide 14

Slide 14 text

How indexing works
[Diagram labels: Indexing Buffer, Transaction Log, Persisted segments, Searchable; operation shown: flush()]

Slide 15

Slide 15 text

How indexing works
[Diagram labels: Indexing Buffer, Transaction Log, Persisted segments, Searchable]

Slide 16

Slide 16 text

How indexing works
[Diagram labels: Indexing Buffer, Transaction Log, Persisted segments, Searchable; operation shown: merge()]

Slide 17

Slide 17 text

How indexing works
[Diagram labels: Indexing Buffer, Transaction Log, Persisted segments, Searchable; operation shown: merge()]

Slide 18

Slide 18 text

How indexing works
[Diagram labels: Indexing Buffer, Transaction Log, Persisted segments, Searchable; operation shown: merge()]

Slide 19

Slide 19 text

How indexing works
[Diagram labels: Indexing Buffer, Transaction Log, Persisted segments, Searchable; operation shown: merge()]

Slide 20

Slide 20 text

How indexing works
[Diagram labels: Indexing Buffer, Transaction Log, Persisted segments, Searchable; operations shown: index(), translog(), refresh(), flush(), merge()]
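The operations named in the diagram can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not Elasticsearch's implementation: the class `ToyShard` and its attributes are hypothetical, segments are plain Python lists, and `flush()` is assumed to refresh first.

```python
class ToyShard:
    """Toy model of one shard's write path (illustration only)."""

    def __init__(self):
        self.indexing_buffer = []  # in memory, not yet searchable
        self.translog = []         # operations recorded since the last flush
        self.segments = []         # searchable segments, each a list of documents
        self.persisted = 0         # number of leading segments considered persisted

    def index(self, doc):
        self.indexing_buffer.append(doc)  # index(): add to the indexing buffer
        self.translog.append(doc)         # translog(): record for crash recovery

    def refresh(self):
        # refresh(): turn the buffer into a new searchable segment
        if self.indexing_buffer:
            self.segments.append(self.indexing_buffer)
            self.indexing_buffer = []

    def flush(self):
        # flush(): persist all segments, then clear the transaction log
        self.refresh()
        self.persisted = len(self.segments)
        self.translog = []

    def merge(self):
        # merge(): combine the persisted segments into one larger segment
        if self.persisted > 1:
            merged = [d for seg in self.segments[:self.persisted] for d in seg]
            self.segments = [merged] + self.segments[self.persisted:]
            self.persisted = 1

    def search(self, term):
        # Only documents in segments are visible; the buffer is not searched.
        return [d for seg in self.segments for d in seg
                if term in d["title"].lower()]


shard = ToyShard()
shard.index({"_id": "1", "title": "Elasticsearch Performance"})
hits_before_refresh = shard.search("performance")  # buffer only, so no hits yet
shard.refresh()
hits_after_refresh = shard.search("performance")   # the new segment is searchable
```

The point of the sketch is the separation of concerns: a document becomes searchable on refresh, while durability before the next flush relies on the transaction log.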

Slide 21

Slide 21 text

How searching works
[Diagram labels: Searcher, Persisted; operation shown: search()]

Slide 22

Slide 22 text

How searching works
[Diagram labels: Searcher, Persisted; operation shown: query()]

Slide 23

Slide 23 text

How searching works
[Diagram labels: Searcher, Persisted; operation shown: compute_hits_to_return()]

Slide 24

Slide 24 text

How searching works
[Diagram labels: Searcher, Persisted; operation shown: fetch()]

Slide 25

Slide 25 text

How searching works
[Diagram labels: Searcher, Persisted]
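The phases query(), compute_hits_to_return() and fetch() can be sketched as follows. This is a minimal illustration across several shards, assuming a toy score (how often the term occurs in the title) and hypothetical function names. Real scoring and data structures are different.

```python
import heapq


def query_phase(shard_docs, term, size):
    """Per shard: score matching documents, return only the top (score, doc_id) pairs."""
    scored = [(doc["title"].lower().count(term), doc_id)
              for doc_id, doc in shard_docs.items()
              if term in doc["title"].lower()]
    return heapq.nlargest(size, scored)


def compute_hits_to_return(per_shard_results, size):
    """Merge the per-shard top lists into the overall top `size` hits."""
    tagged = [(score, doc_id, shard_no)
              for shard_no, hits in enumerate(per_shard_results)
              for score, doc_id in hits]
    return heapq.nlargest(size, tagged)


def fetch_phase(shards, winners):
    """Load full documents only for the hits that will actually be returned."""
    return [shards[shard_no][doc_id] for _score, doc_id, shard_no in winners]


def search(shards, term, size=10):
    per_shard = [query_phase(docs, term, size) for docs in shards]
    return fetch_phase(shards, compute_hits_to_return(per_shard, size))


shards = [
    {"1": {"title": "Elasticsearch Performance"}},
    {"2": {"title": "Elasticsearch Scalability"}},
]
results = search(shards, "performance")
```

Only IDs and scores travel in the query phase. Full documents are fetched once the final hit list is known.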

Slide 26

Slide 26 text

Scaling out

Slide 27

Slide 27 text

Replication
• Synchronous
  • Returns only when all replicas have acknowledged receipt
[Diagram: Node 1 (P1), Node 2 (R1)]

Slide 28

Slide 28 text

Replication benefits
• High availability
  • Automatic failover if the primary fails
[Diagram: Node 1 (P1), Node 2 (R1)]

Slide 29

Slide 29 text

Replication benefits
• Increase capacity for search requests
  • Default: round-robin
[Diagram: Node 1 (P1), Node 2 (R1)]

Slide 30

Slide 30 text

How many replicas are needed?
• Good: Number of replicas can be changed dynamically
• Desired level of fault tolerance?
  • It’s all about risk
  • If shard recovery is quick, maybe one replica is enough?
• More replicas require more hardware resources
• To increase search throughput, scaling up is also an option

Slide 31

Slide 31 text

Sharding
• Partitioning of documents by some "routing" value
  • Default: document ID hash
[Diagram: Node 1 (P1), Node 2 (P2, P3)]
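Routing by hash can be sketched in a few lines. The function `shard_for` is hypothetical, and `zlib.crc32` is only a stand-in for illustration. Elasticsearch uses its own hash function internally.

```python
import zlib


def shard_for(routing_value: str, number_of_shards: int) -> int:
    """Map a routing value (by default the document ID) to a shard number."""
    # crc32 is a stand-in hash for illustration, not the hash Elasticsearch uses.
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_shards


shard_of_doc_1 = shard_for("1", 3)
```

Because the result depends on the number of shards, changing that number would send existing documents to different shards. This is one way to see why the shard count is fixed at index creation.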

Slide 32

Slide 32 text

Sharding benefits
• Scale out
  • Distribute a large index onto multiple machines
[Diagram: Node 1 (P1), Node 2 (P2, P3)]

Slide 33

Slide 33 text

Sharding benefits
• Increase capacity for write operations
  • Inserts, updates, deletes (and merges!)
[Diagram: Node 1 (P1), Node 2 (P2, P3)]

Slide 34

Slide 34 text

Sharding benefits
• Parallelize searches
  • Unit of work in the search thread pool: the shard
[Diagram: Node 1 (P1), Node 2 (P2, P3)]

Slide 35

Slide 35 text

Sharding drawbacks
• Distributed search requires coordination
  • Need to aggregate results from different shards
  • Similar to aggregating results from the segments of a shard
[Diagram: Node 1 (P1), Node 2 (P2, P3)]

Slide 36

Slide 36 text

How many shards are needed?
• Bad: Number of shards needs to be set on index creation
• Finding the right number requires some care
  • Formulate assumptions/expectations
  • Test and measure
  • Overallocate a little
• Maximum shard size?
  • Often cited: 50 GB
  • Mainly a rule of thumb for quick recovery
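As a starting point for "formulate assumptions", the arithmetic might look like the sketch below. The helper `estimate_shard_count` and the overallocation factor of 1.5 are assumptions made for illustration. Only the 50 GB figure comes from the slide, and the result is a first guess to test and measure, not a recommendation.

```python
import math


def estimate_shard_count(expected_index_gb, max_shard_gb=50, overallocation=1.5):
    """First guess: expected size, padded a little, divided by the maximum shard size."""
    # overallocation=1.5 is a hypothetical value for "overallocate a little".
    return max(1, math.ceil(expected_index_gb * overallocation / max_shard_gb))


shards_for_200_gb = estimate_shard_count(200)  # ceil(200 * 1.5 / 50)
shards_for_10_gb = estimate_shard_count(10)    # small indices still need one shard
```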

Slide 37

Slide 37 text

When to shard?
• Maybe you can just use multiple indices?
  • Searching multiple indices is easy
  • Indices are more flexible (e.g., creation, deletion)
• But: Every index consumes certain resources
  • Cluster state, in-memory data structures
• Recommendation: Shard an index if…
  • …you suspect that one shard might not be enough
  • …and there is no indicator for a "smarter" approach

Slide 38

Slide 38 text

User-based approach
[Diagram: virtual index over Index 1 (shards P1, P2, P3) and Index 2 (shard P1), holding Users 1 to 9, ...; example query: search by user 1]
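One possible way to realize such a per-user virtual index is a filtered alias with a routing value, sent to the `_aliases` endpoint. The sketch below only builds the request body. The index name `index_1`, the alias naming scheme and the field `user_id` are assumptions, and the slide does not say which mechanism the author used.

```python
import json


def user_alias_action(index, user_id):
    """Build an 'add alias' action that routes and filters by user."""
    return {
        "add": {
            "index": index,
            "alias": f"user_{user_id}",                # hypothetical naming scheme
            "routing": str(user_id),                   # searches hit only this user's shard
            "filter": {"term": {"user_id": user_id}},  # hypothetical field name
        }
    }


aliases_body = {"actions": [user_alias_action("index_1", 1)]}
aliases_json = json.dumps(aliases_body)  # body for POST /_aliases
```

With such an alias in place, a search against `user_1` would touch only one shard instead of all shards of the index.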

Slide 39

Slide 39 text

Time-based approach
[Diagram: virtual index over daily indices (2016-03-20, ..., 2016-04-18, 2016-04-19, 2016-04-20), each with shard P1; example query: search within the last 3 days]
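With one index per day, "search within the last 3 days" reduces to picking the right index names. A minimal sketch, assuming the indices are named by ISO date as in the diagram (the helper `daily_indices` is hypothetical):

```python
from datetime import date, timedelta


def daily_indices(today, days):
    """Names of the daily indices covering the last `days` days, newest first."""
    return [(today - timedelta(days=n)).isoformat() for n in range(days)]


names = daily_indices(date(2016, 4, 20), 3)
search_path = "/" + ",".join(names) + "/_search"  # comma-separated multi-index search
```

Older indices are never touched by such a query, and dropping old data becomes a cheap index deletion.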

Slide 40

Slide 40 text

Cluster nodes
• Separate concerns
  • Master nodes
  • Data nodes
  • Client/aggregator nodes
• Client applications
  • HTTP client
  • TransportClient
  • NodeClient

Slide 41

Slide 41 text

Mapping

Slide 42

Slide 42 text

Examples
"filename" : {
  "type" : "string",
  "index" : "not_analyzed"
}
"filename_german" : {
  "type" : "string",
  "index" : "analyzed",
  "analyzer" : "german"
}
"filename_fancy" : {
  "type" : "string",
  "index" : "analyzed",
  "analyzer" : "my_fancy_analyzer"
}

Slide 43

Slide 43 text

Indexing fields
• Which fields to analyze, and how?
• Which data to store for analyzed fields?
  • Term frequencies, positions, offsets?
  • Field norms?
  • Term vectors?
• Which fields not to index at all?

Slide 44

Slide 44 text

Indexing fields multiple times
• Consider indexing fields multiple times
  • Index time vs. query time solutions
  • multi-fields, copy_to
• Disable unneeded multiple indexing done by default
  • Need the _all field?
  • Need raw fields?

Slide 45

Slide 45 text

Indexing unknown fields
• Be careful with dynamic mapping/templates
  • May lead to huge mappings (cluster state)
• For known unknowns, consider the key-value pattern
  • Define just two fields: key and value
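The key-value pattern can be sketched as a small transformation applied before indexing. The helper `to_key_value` and the wrapping field name `properties` are hypothetical. Values are stringified here for simplicity, which is an assumption and not part of the slide.

```python
def to_key_value(fields):
    """Turn arbitrary field names into pairs that use only two mapped fields."""
    return [{"key": name, "value": str(value)}
            for name, value in sorted(fields.items())]


unknown_fields = {"camera": "X100", "iso": 400}
document = {
    "title": "Holiday photo",
    "properties": to_key_value(unknown_fields),  # hypothetical field name
}
```

However many distinct field names arrive, the mapping only ever contains `key` and `value`, so the cluster state stays small.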

Slide 46

Slide 46 text

Storing fields
• Do you need to store the whole _source?
  • Needed for, e.g., Reindex API, Update API
  • Can you exclude some fields from the _source?
• Do you need to store _source at all?
  • Disable _source and only store a few selected fields?

Slide 47

Slide 47 text

Indexing

Slide 48

Slide 48 text

Limit input size
• Limit the size of potentially large document fields
  • And hope that no one notices
• Huge documents can cause OutOfMemory errors in your cluster

Slide 49

Slide 49 text

Update API
• Update = Read, Delete, Create
  • To replace a whole document, just index it again
• Reduces network traffic
  • Specify update as partial document or script
  • Update by ID or by query
• Small updates might take a while
  • A single expensive field is enough

Slide 50

Slide 50 text

Relations
• Parent-child relationships
  • Model 1:N relations between documents
  • Advantage: Individual updates but combined queries
  • Warning: Performance issues with frequent refreshes
  • Observed query slowdowns between 300 ms and 5 seconds

Slide 51

Slide 51 text

Bulk indexing
• Reduces overhead
  • Less network overhead
  • Only one translog fsync per bulk
• Optimum bulk size depends on the document size
  • When in doubt, prefer smaller bulks
• Still hitting a limit with bulk indexing?
  • The bottleneck might not be at the server
  • Try concurrent indexing with multiple clients
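A bulk request body is newline-delimited JSON: one action line followed by one source line per document. The sketch below builds such bodies and splits a document list into bulks. The helper names and the bulk size are assumptions, and sending the body to the `_bulk` endpoint is left out.

```python
import json


def bulk_body(docs):
    """Build a newline-delimited bulk body with one 'index' action per document."""
    lines = []
    for doc in docs:
        source = {k: v for k, v in doc.items() if k != "_id"}
        lines.append(json.dumps({"index": {"_id": doc["_id"]}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the body must end with a newline


def chunks(docs, bulk_size):
    """Split the documents into bulks of at most `bulk_size` documents."""
    for start in range(0, len(docs), bulk_size):
        yield docs[start:start + bulk_size]


docs = [{"_id": str(i), "title": f"Document {i}"} for i in range(1, 6)]
bodies = [bulk_body(chunk) for chunk in chunks(docs, 2)]  # bulk size 2 is arbitrary
```

Since the optimum bulk size depends on document size, `bulk_size` is the parameter to vary when measuring.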

Slide 52

Slide 52 text

Reindexing
• Depends on many factors
  • External data source?
  • Zero downtime?
  • Live index? Update API usage? Versioning? Possible deletes?
• Ways to speed up reindexing
  • Bulk indexing
  • Disable refresh
  • Decrease number of replicas
• The Reindex API only covers some scenarios

Slide 53

Slide 53 text

Slide 54

Slide 54 text

Reduce search overhead
• Limit the amount of data transferred
  • Don’t request more hits than needed
  • Don’t return fields not needed
• Limit the number of indices/shards queried
  • Only query those where hits are possible
• Request aggregations/facets only when needed
  • Might not have changed when requesting the next results page

Slide 55

Slide 55 text

Deep pagination
• Avoid deep pagination
  • Sorting millions of documents is expensive
• To iterate over lots of documents use scroll search
  • Sort by _doc and use the scroll parameter
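A scroll iteration can be sketched as below. The request helper only builds the path and body. The callables `first_page` and `next_page` are hypothetical wrappers around the HTTP calls, so the loop itself stays independent of any client library. The page size and keep-alive values are assumptions.

```python
def first_scroll_request(index, page_size=1000, keep_alive="1m"):
    """Path and body of the initial scroll search, sorted by _doc."""
    path = f"/{index}/_search?scroll={keep_alive}"
    body = {"size": page_size, "sort": ["_doc"], "query": {"match_all": {}}}
    return path, body


def scroll_all(first_page, next_page):
    """Yield all hits; stop when a page comes back empty.

    `first_page()` and `next_page(scroll_id)` are hypothetical callables that
    perform the requests and return the parsed JSON responses.
    """
    response = first_page()
    while response["hits"]["hits"]:
        yield from response["hits"]["hits"]
        response = next_page(response["_scroll_id"])


path, body = first_scroll_request("documents")  # "documents" is a made-up index name
```

Sorting by `_doc` avoids scoring and sorting work, which is what makes scrolling cheaper than paging deep into a sorted result list.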

Slide 56

Slide 56 text

Limit user input
• Search offset
• Number of rows requested
• Number of search terms
• Total length of the search string
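The four limits can be enforced in one place before a query is built. In this sketch all limit values and the function `sanitize_search` are assumptions. It clamps the input silently, although rejecting the request with an error would be an equally valid choice.

```python
# Hypothetical limits; suitable values depend on the application.
MAX_OFFSET = 1000
MAX_ROWS = 100
MAX_TERMS = 10
MAX_QUERY_LENGTH = 200


def sanitize_search(query, offset, rows):
    """Clamp user-controlled search parameters to the configured limits."""
    terms = query[:MAX_QUERY_LENGTH].split()[:MAX_TERMS]
    return {
        "query": " ".join(terms),
        "from": min(max(offset, 0), MAX_OFFSET),
        "size": min(max(rows, 0), MAX_ROWS),
    }


params = sanitize_search("  elasticsearch performance  ", offset=-5, rows=500)
```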

Slide 57

Slide 57 text

Accuracy vs. speed trade-offs with sharding
• Some defaults reduce accuracy
• Need more accurate scoring?
  • Set search_type to dfs_query_then_fetch
  • But: one more round trip
  • What is accurate scoring anyway? (e.g., deleted documents)
• Need more accurate counts in aggregations?
  • Set shard_size higher than size
  • But: more work for each shard
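Both settings are ordinary request options. The sketch below builds a terms aggregation with `shard_size` larger than `size` and the query-string parameter for `dfs_query_then_fetch`. The helper name, the aggregation name `top_values`, the field `author` and the numbers are assumptions.

```python
def terms_agg_request(field, size, shard_size=None):
    """Body of a terms aggregation; a larger shard_size trades speed for accuracy."""
    terms = {"field": field, "size": size}
    if shard_size is not None:
        if shard_size < size:
            raise ValueError("shard_size should not be smaller than size")
        terms["shard_size"] = shard_size
    return {"size": 0, "aggs": {"top_values": {"terms": terms}}}


agg_body = terms_agg_request("author", size=10, shard_size=50)
scoring_params = {"search_type": "dfs_query_then_fetch"}  # sent as a URL parameter
```

Each shard then returns its top 50 candidates instead of 10, so terms that are frequent overall but not in the top 10 of some shard are less likely to be undercounted.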

Slide 58

Slide 58 text

Optimize indices?
• Force Merge API (aka Optimize)
  • Turning 20 segments into 1 can be highly beneficial
  • But: merges will invalidate caches
• Most useful for indices not modified anymore

Slide 59

Slide 59 text

Caches
• Page cache (OS)
• Node query cache
• Shard request cache
  • Disabled by default
• Field data cache
  • Not as relevant as it used to be

Slide 60

Slide 60 text

Slow queries
• Profile API
  • Detailed timing analysis for a query
• Slow log

Slide 61

Slide 61 text

Misc

Slide 62

Slide 62 text

JVM
• Java: Avoid blacklisted versions
• GC: Use CMS
• Heap size
  • Measure how much is needed
  • No more than roughly 30 GB (enables pointer compression)
• Number of processors?
  • Set JVM options (defaults based on OS virtual processors)
  • Set the Elasticsearch processors configuration property
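The heap guidance can be written down as a small calculation. The 30 GB cap is from the slide. Leaving at least half of the RAM to the OS page cache and setting `-Xms` equal to `-Xmx` are common conventions assumed here, and the helper names are hypothetical.

```python
def suggested_heap_gb(measured_need_gb, ram_gb, cap_gb=30):
    """Smallest of: measured need, the roughly 30 GB cap, and half of the RAM."""
    # "Half of the RAM" is an assumed convention that leaves room for the page cache.
    return int(min(measured_need_gb, cap_gb, ram_gb / 2))


def jvm_heap_flags(heap_gb):
    """Standard JVM flags for a fixed-size heap (assumes min and max are set equal)."""
    return f"-Xms{heap_gb}g -Xmx{heap_gb}g"


heap = suggested_heap_gb(measured_need_gb=48, ram_gb=128)
flags = jvm_heap_flags(heap)
```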

Slide 63

Slide 63 text

Hardware resources
• DRAM: The more, the better
  • Page cache is crucial for performance
• Disk: Local SSD is best
• CPU: 8 cores are nice
• Consider separating into hot and cold nodes
  • Set up allocation constraints

Slide 64

Slide 64 text

There is more
• Monitoring
  • Check out the API
• Detailed case studies
  • Lots of examples on the web
  • Configuration may differ between Elasticsearch versions
  • The official channels (GitHub, forum, documentation) are great
• Outdated (mostly pre 2.x) topics
  • Field data without doc_values, filters vs. queries, scan+scroll, split brain, unnecessary recovery, cluster state without diffs

Slide 65

Slide 65 text

Questions?
Dr. Patrick Peschlow
Head of Development - CenterDevice

codecentric AG
Merscheider Straße 1
42699 Solingen, Germany
tel: +49 (0) 212.23362854
fax: +49 (0) 212.23362879
www.codecentric.de
blog.codecentric.de