Two Years of Elasticsearch in Development and Production

Patrick Peschlow, codecentric AG

Mapping
− Disable the _all field (unless you really need it)
− Prefer _source over stored fields
  − _source is useful anyway (for updates, reindexing, highlighting)
− Only analyze/compute what you need
  − not_analyzed, field norms, term frequencies and positions
− Be careful with dynamic mapping and dynamic templates
  − Can lead to undesired fields or types in the index
  − Can considerably grow the cluster state
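The mapping advice above could look roughly like the following sketch. It assumes Elasticsearch 1.x mapping syntax; the type name "article" and all field names are hypothetical, chosen only for illustration.

```python
import json

# Hypothetical mapping for a type named "article" (ES 1.x syntax assumed).
mapping = {
    "article": {
        "dynamic": "strict",             # reject fields that are not declared below
        "_all": {"enabled": False},      # no catch-all field
        "properties": {
            "title": {"type": "string"},  # analyzed full-text field
            "status": {
                "type": "string",
                "index": "not_analyzed",  # exact-match only, no analysis
            },
            "body": {
                "type": "string",
                "norms": {"enabled": False},  # no field norms
                "index_options": "docs",       # no term frequencies or positions
            },
        },
    }
}

print(json.dumps(mapping, indent=2))
```

Whether norms or positions can be dropped depends on the queries run against each field, so treat the per-field choices here as placeholders.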

Queries
− Pagination
  − Don’t load too many results with a single query
  − Avoid deep pagination
  − Consider using the scan+scroll API when you don’t need sorting
− Think about index-time vs. query-time solutions
  − Prefix query vs. edge ngrams?
  − Sorting via script vs. indexing another field?
  − Don’t be afraid to index a source field twice
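To make the "prefix query vs. edge ngrams" trade-off concrete, here is a toy model in plain Python. It is not Elasticsearch code: the functions `edge_ngrams` and `prefix_lookup` and the in-memory `index` are hypothetical, and only illustrate how work moves from query time to index time.

```python
def edge_ngrams(term, min_gram=1, max_gram=10):
    """Return the leading substrings of term with min_gram..max_gram characters."""
    upper = min(max_gram, len(term))
    return [term[:n] for n in range(min_gram, upper + 1)]


# Index time: store every edge n-gram of each title in an inverted index.
index = {}
for doc_id, title in enumerate(["elastic", "elasticsearch", "lucene"]):
    for gram in edge_ngrams(title.lower(), 2, 5):
        index.setdefault(gram, set()).add(doc_id)


def prefix_lookup(prefix):
    """Query time: a prefix search becomes a single exact term lookup."""
    return sorted(index.get(prefix.lower(), set()))


print(prefix_lookup("elas"))
```

The price is a larger index, and prefixes longer than `max_gram` find nothing in this model, which is the kind of limit to decide on up front.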

Filters and Caching
− Use filters for yes/no criteria that don’t need scoring
  − In contrast to queries, filter results can be cached
− Tricky caching behavior
  − Some filters are cached by default, others not (depends on cost)
  − Caching may also depend on how often filters are used
  − Pay special attention to compound filters
− Possible to override caching behavior and cache key
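As a sketch of overriding caching behavior, the helper below builds a filtered query body. It assumes the ES 1.x filter DSL, where `_cache` and `_cache_key` sit inside the filter object; the function name `filtered_search` and the field names are hypothetical.

```python
import json


def filtered_search(text, user_id, cache=None, cache_key=None):
    """Build an ES 1.x style filtered query: scored match plus unscored term filter."""
    term = {"user_id": user_id}
    if cache is not None:
        term["_cache"] = cache          # explicitly switch caching on or off
    if cache_key is not None:
        term["_cache_key"] = cache_key  # custom key instead of the filter itself
    return {
        "query": {
            "filtered": {
                "query": {"match": {"title": text}},
                "filter": {"term": term},
            }
        }
    }


body = filtered_search("elasticsearch", 42, cache=True, cache_key="user_42")
print(json.dumps(body, indent=2))
```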

Filters and Ordering
− Elements of bool filters are executed sequentially
  − Place more selective filters first
− Consider using “accelerator” filters
  − Redundant filters that reduce work for heavyweight filters
− Learn about possible “strategy” settings for filtered queries
  − Controls how filter and query parts are interleaved
  − Measure, don’t guess
− Note: With ES 2.0 queries and filters might get unified
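Why ordering matters can be shown with a toy model that counts how often each filter runs. This is a per-document simplification in plain Python, not how Elasticsearch evaluates filters internally; `run_bool_filter` and the sample data are hypothetical.

```python
def run_bool_filter(docs, filters):
    """Apply filters in order, stopping at the first miss; return matches and call counts."""
    calls = [0] * len(filters)
    matches = []
    for doc in docs:
        for i, accepts in enumerate(filters):
            calls[i] += 1
            if not accepts(doc):
                break
        else:
            matches.append(doc)
    return matches, calls


docs = [{"id": i, "country": "DE" if i % 10 == 0 else "US", "price": i} for i in range(100)]


def selective(doc):
    return doc["country"] == "DE"   # matches 10 of 100 documents


def heavy(doc):
    return doc["price"] % 2 == 0    # stands in for an expensive filter, matches 50


matches_a, calls_a = run_bool_filter(docs, [selective, heavy])
matches_b, calls_b = run_bool_filter(docs, [heavy, selective])
print(calls_a, calls_b)
```

With the selective filter first, the stand-in for the expensive filter runs on 10 documents instead of 100, while the result set is the same. An “accelerator” filter plays the role of `selective` here: cheap, redundant, and placed in front.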

Analysis Tooling
− Use the search/explain feature (score computation)
− Use the validate/explain feature (query rewriting, cache usage)
− Make sure your analyzers work correctly
  − Use the analyze API
  − Check out the “inquisitor” and “extended-analyze” plugins
− When in doubt, take a look at the terms in your index
  − http://rosssimpson.com/blog/2014/05/06/using-luke-with-elasticsearch/
  − “skywalker” plugin

Replication and Search Preference
− With replicas, we can get different results for the same search
  − Searches are routed to replicas in “round robin” fashion
  − Deleted documents still affect scoring
  − Segment merging (physical deletion) can differ among replicas
[Diagram: shard copies labeled doc1 doc2 doc3 doc4 and doc1 doc2]
− Solution: Use the search “preference” parameter
  − For consistent results by user, choose user ID as preference
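A minimal sketch of using the user ID as search preference, assuming the REST search endpoint and its `preference` query parameter. The helper `search_url`, the base URL, and the index name are hypothetical.

```python
from urllib.parse import urlencode


def search_url(base_url, index, user_id):
    """Build a search URL whose preference value is derived from the user ID."""
    # A custom preference string makes searches with the same value
    # go to the same shard copies.
    params = urlencode({"preference": "user-{}".format(user_id)})
    return "{}/{}/_search?{}".format(base_url.rstrip("/"), index, params)


print(search_url("http://localhost:9200", "articles", 42))
```

The "user-" prefix is an arbitrary choice for this sketch; any stable per-user string serves the purpose.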

Aggregations (Facets)
− Load aggregations as lazily as possible
  − Do you really need to offer all of them on the UI right away?
  − Can you hide some less relevant ones by default?
− Only load aggregations once when retrieving paginated results
  − Consider not requesting them again when just switching the page
  − They likely stay the same
− Many aggregations use approximation algorithms
  − Don’t expect results to be 100% accurate
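Requesting aggregations only for the first page might be sketched as follows. The helper `build_search` and the sample aggregation are hypothetical; the `from`, `size`, and `aggs` keys follow the search request body.

```python
def build_search(query, page, page_size=10, aggs=None):
    """Build a search body; include aggregations only on the first page."""
    body = {"query": query, "from": page * page_size, "size": page_size}
    if page == 0 and aggs:
        body["aggs"] = aggs  # later pages reuse what the client already has
    return body


tag_aggs = {"tags": {"terms": {"field": "tag"}}}
first = build_search({"match_all": {}}, page=0, aggs=tag_aggs)
third = build_search({"match_all": {}}, page=2, aggs=tag_aggs)
print(sorted(first), sorted(third))
```

This relies on the client keeping the aggregation results from the first page for as long as the query itself does not change.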

Field Data
− Some operations require document field data
  − Sorting, aggregation, parent-child queries, some scripts
− Field data is usually loaded for all documents
  − Leads to high memory consumption or OutOfMemoryError
− Use “doc values”: Store field data on the file system
  − Let the OS do the caching
  − Can be enabled on a per-field basis
− Note: With ES 2.0 “doc values” might become the default
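Enabling doc values per field could look like this sketch. It assumes the `doc_values` mapping option of later ES 1.x releases; the type name "event" and the field names are hypothetical.

```python
# Hypothetical mapping (ES 1.x syntax assumed): doc values on the fields
# used for sorting and aggregations, none on the full-text field.
mapping = {
    "event": {
        "properties": {
            "timestamp": {"type": "date", "doc_values": True},
            "category": {
                "type": "string",
                "index": "not_analyzed",  # string doc values assume a not_analyzed field
                "doc_values": True,
            },
            "message": {"type": "string"},  # analyzed text, no doc values
        }
    }
}

doc_values_fields = sorted(
    name
    for name, spec in mapping["event"]["properties"].items()
    if spec.get("doc_values")
)
print(doc_values_fields)
```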

Unit/Integration Testing
− Set up a comprehensive test suite
  − Test expectations about matches
  − Prevent regressions when changing analyzers
− The Elasticsearch Java client is embeddable
  − No mocks or test doubles needed
− Try it by solving the “mapping challenge”
  − https://github.com/peschlowp/elasticsearch-mapping-challenge

Indexing and Real-Time Requirements
− Default refresh interval: 1 second
  − Targeted at human users
− What if API clients want read-your-own-writes (RYOW) semantics for search?
  − Refresh after every request?
− Recommendation: Leave RYOW to the primary database, if at all
  − Provide a separate API if needed

Bulk Indexing
− For optimum bulk size, consider document size, not count
− Be careful with merge throttling
  − Elasticsearch might throttle indexing anyway
  − Look out for “now throttling indexing” log messages
  − Is it worth it?
− Decrease refresh rate (or disable refresh completely)
− Reduce number of replicas (or set to zero)
  − Add missing replicas later, much cheaper than “live” replication
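Sizing bulk requests by bytes rather than by document count can be sketched with a small batching helper. The function `bulk_batches` is hypothetical, and the serialized JSON length is used only as an approximation of the bulk payload size.

```python
import json


def bulk_batches(docs, max_bytes):
    """Group documents into batches whose serialized size stays under max_bytes."""
    batch, size = [], 0
    for doc in docs:
        doc_size = len(json.dumps(doc).encode("utf-8"))
        # Start a new batch when adding this document would exceed the limit.
        if batch and size + doc_size > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(doc)
        size += doc_size
    if batch:
        yield batch


docs = [{"t": "a" * 10}, {"t": "b" * 10}, {"t": "c" * 100}]
print([len(b) for b in bulk_batches(docs, 50)])
```

A single document larger than `max_bytes` still gets a batch of its own, so the limit is a target rather than a hard cap.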

Update API
− Update = Delete + Add
  − Only saves network traffic
− Even small updates might take a while
  − Consider splitting (nested documents or parent-child relationships)
− “Partial document” update trickiness
  − Fields are replaced, except for inner objects which are merged
  − To replace inner objects, consider wrapping them in an array
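The partial-document rule described above (fields replaced, inner objects merged, arrays replaced) can be modeled in a few lines of Python. This is a model of the described behavior, not Elasticsearch code; `apply_partial_update` and the sample document are hypothetical.

```python
def apply_partial_update(source, partial):
    """Merge partial into source: inner objects merge recursively, everything else is replaced."""
    result = dict(source)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = apply_partial_update(result[key], value)
        else:
            result[key] = value  # scalars and arrays are replaced wholesale
    return result


doc = {
    "name": "Alice",
    "address": {"city": "Solingen", "zip": "42699"},
    "phones": [{"type": "home", "number": "1"}],
}

merged = apply_partial_update(doc, {"address": {"city": "Berlin"}})
replaced = apply_partial_update(doc, {"phones": [{"type": "work"}]})
print(merged["address"], replaced["phones"])
```

The second call shows why wrapping an inner object in an array helps: the array is replaced as a whole, so no stale keys from the old object survive.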

Cluster settings
− Safety
  − Choose a unique cluster name
  − Consider using unicast discovery
− Recovery
  − gateway.recover_after_nodes
  − gateway.recover_after_time
  − gateway.expected_nodes
− Stability
  − minimum_master_nodes

Split Brain
− Prevent split brains caused by network partitioning
  − Set minimum_master_nodes to quorum
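The quorum is a strict majority of the master-eligible nodes. A one-line helper makes the arithmetic explicit; the function name is hypothetical.

```python
def minimum_master_nodes(master_eligible_nodes):
    """Return the quorum: the smallest strict majority of master-eligible nodes."""
    return master_eligible_nodes // 2 + 1


for n in (1, 2, 3, 5):
    print(n, minimum_master_nodes(n))
```

With two master-eligible nodes the quorum is two, so losing either node leaves the cluster without a master; three such nodes are the smallest setup that tolerates one failure.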

− Prevent split brains when single links fail
  − Upgrade to ES 1.4.x

− Monitor the cluster for split brains
  − Ask each node who is master
  − Use the cat master API
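The monitoring check boils down to comparing the answers of all nodes. The sketch below assumes the answers have already been collected, for example by calling the cat master API on each node; `reported_masters` and `is_split_brain` are hypothetical helpers.

```python
def reported_masters(master_by_node):
    """Return the distinct masters reported, ignoring nodes that see no master."""
    return {master for master in master_by_node.values() if master is not None}


def is_split_brain(master_by_node):
    """More than one distinct master means the nodes disagree."""
    return len(reported_masters(master_by_node)) > 1


healthy = {"node1": "node1", "node2": "node1", "node3": "node1"}
split = {"node1": "node1", "node2": "node1", "node3": "node3"}
print(is_split_brain(healthy), is_split_brain(split))
```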

Dedicated Master Nodes
[Diagram: three dedicated master nodes (Node 1, Node 2, Node 3) alongside the other nodes]

Distributed Search
[Diagram: a client request with the steps “Compute global statistics”, “Get local top hits”, and “Get global top hits fields”]

Aggregator Nodes
[Diagram: data nodes Node 1 and Node 2; search goes through client node Node 3]

Aggregator Nodes
[Diagram: data nodes Node 1 and Node 2, client node Node 3; labels “Indexing” and “preferable”]

Java Clients
− NodeClient
  − Joins the cluster as a client node
  − Potentially saves a network hop
  − Will participate in distributed searches
− TransportClient
  − More lightweight than NodeClient
− Some HTTP Client
  − Smaller memory footprint
  − Pay attention to settings: Chunking, long-lived HTTP connections

Some Stories from Production
− The close/open gamble
− Last resort single node
− The devastating query
− About upgrades

Designing for Scalability
− Think about scaling right from the start
  − Fixed number of shards per index
  − Shard key cannot be changed later
  − Distributed searches are expensive
− Patterns in the data can be used for optimization
  − Time-based data
  − User-based data

User-based Data: Separate Indexes
[Diagram: User 1, User 2, ..., User N, each with its own index (Index 1, Index 2, ..., Index N)]
− Disadvantage: Resource consumption, larger cluster state

User-based Data: Shared Index
[Diagram: a search by user 1 goes to Shard 1, Shard 2, ..., Shard M with a filter by user 1]
− Disadvantage: Distributed search

User-based Data: Shared Index with Routing
[Diagram: User 1 to User N are spread over Shard 1, Shard 2, ..., Shard M; a search by user 1 with a filter by user 1 goes to a single shard]
− Disadvantage: At most one shard per user (capacity)
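Routing can be pictured as hashing the routing value modulo the number of shards. The sketch below uses CRC32 purely as a stand-in; it is not the hash function Elasticsearch uses, and `shard_for` is a hypothetical helper.

```python
import zlib


def shard_for(routing_value, num_shards):
    """Map a routing value to a shard number with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(str(routing_value).encode("utf-8")) % num_shards


num_shards = 5
placement = {user: shard_for(user, num_shards) for user in ("user-1", "user-2", "user-3")}
print(placement)
```

Because the shard number depends on `num_shards`, changing the shard count would move documents to different shards, which is one way to see why the number of shards per index is fixed.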

User-based Data: Aliases
− With aliases the approach chosen can be hidden from clients
  − Aliases can even carry filter and routing information
  − Present separate “user” indexes (aliases) to the client
− Advantage
  − Flexibility: Adapt mapping to physical indexes/shards on demand
− Limitation
  − Huge number of users means lots of aliases (cluster state)
  − Still much better than huge number of indexes
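An alias that carries both a filter and a routing value might be defined as in this sketch of an aliases API request body. The index name "users_shared", the alias naming scheme, and the `user_id` field are hypothetical.

```python
import json


def user_alias_action(user_id, index="users_shared"):
    """Build an 'add' action for a per-user alias with a filter and a routing value."""
    return {
        "add": {
            "index": index,
            "alias": "user_{}".format(user_id),
            "filter": {"term": {"user_id": user_id}},  # only this user's documents
            "routing": str(user_id),                   # only this user's shard
        }
    }


body = {"actions": [user_alias_action(1), user_alias_action(2)]}
print(json.dumps(body, indent=2))
```

Clients then address "user_1" as if it were an index of its own, and a user can later be moved to a dedicated index by repointing the alias.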

Zero Downtime Migration
− Possible reasons
  − Backwards-incompatible mapping changes
  − Index/shard reaches its capacity
− Needs a lot of careful thinking
  − Especially challenging if the update API is used

Questions?
Dr. rer. nat. Patrick Peschlow
codecentric AG, Merscheider Straße 1, 42699 Solingen
tel +49 (0) 212.23 36 28 54, fax +49 (0) 212.23 36 28 79
www.codecentric.de
