Elasticsearch - from novice to expert

codecentric AG Patrick Peschlow Elasticsearch - from novice to expert

Crash course (demo with Sense)
− Introduction
− Quickstart
− Analysis
− Mapping
− Search features
− Sharding + Replication

The road to expertise
− Get the basics right
− Map carefully
− Tune analysis incrementally
− Understand filters
− Know about Lucene
− Don’t let your cluster fool you
− Use index aliases
− Learn about plugins

The road to expertise: Get the basics right

The road to expertise: Map carefully

Map carefully
− Disable the _all field, you definitely don’t want it
− Keep the _source field enabled and don’t mark any fields as stored
− Disable dynamic mapping (except where really needed)
− Choose analyzers carefully (maybe even not_analyzed is enough?)
− Consider mapping fields more than once, depending on the requirements
− Existing field mappings cannot be changed without deleting the type
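
The mapping advice above can be sketched as a request body for creating an index. This is a minimal illustration, assuming the Elasticsearch 1.x mapping syntax that the talk is based on; the type name `article` and the fields `title` and `status` are hypothetical.

```python
import json

# Hypothetical type and field names; the syntax follows Elasticsearch 1.x mappings.
index_body = {
    "mappings": {
        "article": {
            "dynamic": "strict",           # reject documents that contain unmapped fields
            "_all": {"enabled": False},    # no catch-all field
            "_source": {"enabled": True},  # keep the original JSON (this is the default)
            "properties": {
                "title": {
                    "type": "string",
                    "analyzer": "english",
                    "fields": {
                        # the same value mapped a second time, unanalyzed,
                        # for sorting and exact matches
                        "raw": {"type": "string", "index": "not_analyzed"}
                    },
                },
                "status": {"type": "string", "index": "not_analyzed"},
            },
        }
    }
}

print(json.dumps(index_body, indent=2))
```

The printed JSON would be sent as the body of the index creation request, for example from Sense.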

The road to expertise: Tune analysis incrementally

Tune analysis incrementally
− Don’t guess
− Use the explain feature for search result scores and query rewriting
− Prevent regressions by having a comprehensive unit test suite
  − In Java land: embed Elasticsearch
  − Test expectations about matches
− Make sure your analyzers work correctly
  − Use the analyze API (and maybe the extended-analyze plugin)
− Understand why queries match or don’t match
  − Use Luke to see what’s in the index
  − http://rosssimpson.com/blog/2014/05/06/using-luke-with-elasticsearch/
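
What "test expectations about matches" can look like is sketched below. The analyzer here is a toy stand-in written in plain Python (split, lowercase, drop stopwords), not Elasticsearch itself; in a real suite the same assertions would run against an embedded or test cluster. The stopword list and both function names are assumptions made for this illustration.

```python
import re

STOPWORDS = {"the", "a", "an", "of"}  # assumed toy stopword list


def analyze(text):
    """Toy analysis chain: split on non-alphanumeric characters, lowercase, drop stopwords."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]


def matches(query, document):
    """True if every analyzed query term occurs among the analyzed document terms."""
    doc_terms = set(analyze(document))
    return all(term in doc_terms for term in analyze(query))


# Expectations about matches, stated the way a regression test would state them.
assert analyze("The Quick-Brown fox") == ["quick", "brown", "fox"]
assert matches("quick FOX", "The quick-brown fox")
assert not matches("foxes", "The quick-brown fox")  # this chain has no stemming
```

If the analysis chain is later changed (for example by adding a stemmer), the last assertion fails and makes the behavior change visible instead of silent.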

The road to expertise: Understand filters

Understand filters
− Use filters instead of queries whenever you don’t need scoring
  − Many filters can get cached
  − You can even do filters-only (constant_score or match_all)
− Compound filters (bool/and/or/not) are not cached
  − But you can still explicitly request caching by setting _cache
− Prefer bool filters over and/or/not when combining cached filters
  − and/or/not don’t use the cache
− Consider the scope of filters
  − May be applied before or after the query
  − Affects the scope of facets/aggregations
  − Often, "filtered query" is what you need
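
As an illustration of the filter advice, here are two search bodies in the Elasticsearch 1.x query DSL: a filtered query that combines a scored query with a bool filter, and a filters-only search via constant_score. The field names and values are hypothetical.

```python
import json

# Hypothetical fields; structure follows the 1.x "filtered" query and "bool" filter.
filtered_search = {
    "query": {
        "filtered": {
            # scored part
            "query": {"match": {"title": "elasticsearch"}},
            # unscored part; the individual term and range filters can be cached
            "filter": {
                "bool": {
                    "must": [
                        {"term": {"status": "published"}},
                        {"range": {"year": {"gte": 2013}}},
                    ]
                }
            },
        }
    }
}

# Filters only: every hit receives the same score.
filters_only = {
    "query": {"constant_score": {"filter": {"term": {"status": "published"}}}}
}

print(json.dumps(filtered_search, indent=2))
print(json.dumps(filters_only, indent=2))
```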

The road to expertise: Know about Lucene

Lucene internals
[Diagram, built up over several slides: documents are written to segments by flush(); commit() syncs the segments to disk.]
− flush(): executed heuristically (or explicitly via NRT); flushed segments are, if desired, visible via NRT
− commit(): explicit call (transaction); committed segments are synced to disk and visible to newly opened readers

Transaction log
[Diagram, built up over several slides: operations are persisted in the transaction log.]
− refresh(): executed regularly; triggers a Lucene segment flush() and reopens the reader for NRT
− flush(): executed heuristically; triggers a Lucene commit(), which syncs the segments to disk
− All documents persisted and searchable: Transaction log can be cleared

Update API
− Lucene doesn’t know updates
− Elasticsearch offers two approaches
  − Partial document
  − Script
− Attention
  − Update = Delete + Add
  − Updates require _source
  − Partial document merges inner objects instead of replacing them
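
The merge behavior of partial document updates can be simulated in a few lines. This is a plain Python sketch of the semantics (inner objects are merged recursively, all other values including arrays are replaced), not Elasticsearch code; `merge_partial` and the sample documents are made up for the illustration.

```python
import copy


def merge_partial(source, partial):
    """Recursively merge a partial document into a copy of the stored _source."""
    merged = copy.deepcopy(source)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_partial(merged[key], value)  # inner objects are merged
        else:
            merged[key] = copy.deepcopy(value)  # scalars and arrays are replaced
    return merged


stored = {"title": "Intro", "author": {"name": "Jane", "city": "Solingen"}, "tags": ["a"]}
partial = {"author": {"city": "Berlin"}, "tags": ["b"]}

updated = merge_partial(stored, partial)
print(updated)
# {'title': 'Intro', 'author': {'name': 'Jane', 'city': 'Berlin'}, 'tags': ['b']}
```

Note that `author.name` survives the update although the partial document does not mention it. To replace an inner object completely, the full document has to be indexed again or a script has to be used.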

Relations
− Lucene documents are flat
− Elasticsearch offers two alternatives
  − Nested objects
  − Parent/child mapping
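
Why flat documents are a problem for relations can be shown with a small simulation. The `flatten` helper below is a hypothetical stand-in for what happens to inner objects without a nested mapping: their fields end up as multi-valued, dotted fields, and the association between values of the same inner object is lost.

```python
def flatten(doc, prefix=""):
    """Flatten inner objects into multi-valued dotted fields."""
    flat = {}
    for key, value in doc.items():
        path = prefix + key
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, dict):
                for sub_path, sub_values in flatten(item, path + ".").items():
                    flat.setdefault(sub_path, []).extend(sub_values)
            else:
                flat.setdefault(path, []).append(item)
    return flat


post = {
    "title": "Hello",
    "comments": [
        {"author": "alice", "stars": 5},
        {"author": "bob", "stars": 1},
    ],
}

flat = flatten(post)
# {'title': ['Hello'], 'comments.author': ['alice', 'bob'], 'comments.stars': [5, 1]}

# "A comment by bob with 5 stars" matches the flat document although no such comment exists.
flat_match = "bob" in flat["comments.author"] and 5 in flat["comments.stars"]

# Evaluating each comment as its own unit, which is what nested objects provide,
# gives the correct answer.
nested_match = any(c["author"] == "bob" and c["stars"] == 5 for c in post["comments"])

print(flat_match, nested_match)  # True False
```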

The road to expertise: Don’t let your cluster fool you

Cluster state
− Shard state
  − red = Primary shard not allocated
  − yellow = Primary shard allocated but not all replicas
  − green = All shards allocated
− Index state = Worst state of all shards of the index
− Cluster state = Worst state of all indexes of the cluster
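
The "worst state wins" rule can be written down directly. The following is a minimal sketch with a hypothetical two-index cluster; the tuple layout and function names are assumptions for the illustration.

```python
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def shard_state(primary_allocated, replicas_allocated, replicas_configured):
    """State of one shard (a primary together with its replicas)."""
    if not primary_allocated:
        return "red"
    if replicas_allocated < replicas_configured:
        return "yellow"
    return "green"


def worst(states):
    """Return the most severe state among the given states."""
    return max(states, key=SEVERITY.__getitem__)


# Hypothetical cluster: index name -> list of
# (primary allocated, replicas allocated, replicas configured)
cluster = {
    "articles": [(True, 1, 1), (True, 0, 1)],
    "logs": [(True, 1, 1), (True, 1, 1)],
}

index_states = {
    name: worst(shard_state(*shard) for shard in shards)
    for name, shards in cluster.items()
}
cluster_state = worst(index_states.values())

print(index_states)   # {'articles': 'yellow', 'logs': 'green'}
print(cluster_state)  # yellow
```

A single missing replica in one index is enough to turn the whole cluster yellow, which is why the cluster color alone says little about where the problem is.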

Things to consider
− Access
  − Choose a unique cluster name
  − Consider unicast vs. multicast discovery
− Allocation awareness
  − Supports arbitrary rules to place shards or indexes on nodes
− Nodes can have different roles: master, data, client
− Pay attention to these settings:
  − minimum_master_nodes
  − gateway.recover_after_nodes
  − gateway.expected_nodes
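
For minimum_master_nodes, the commonly recommended value is a majority of the master-eligible nodes, which avoids a split brain. The slides do not state the formula, so the sketch below is an addition; the cluster name and the two gateway values are hypothetical examples for a three-node cluster, and the setting keys use the full 1.x names.

```python
def minimum_master_nodes(master_eligible_nodes):
    """Majority of the master-eligible nodes: floor(n / 2) + 1."""
    return master_eligible_nodes // 2 + 1


# Hypothetical settings for a cluster with three master-eligible nodes.
settings = {
    "cluster.name": "my-unique-cluster",
    "discovery.zen.minimum_master_nodes": minimum_master_nodes(3),
    "gateway.recover_after_nodes": 2,
    "gateway.expected_nodes": 3,
}

for nodes in (1, 2, 3, 5):
    print(nodes, "master-eligible nodes ->", minimum_master_nodes(nodes))
```

With two master-eligible nodes the majority is 2, so losing either node leaves the cluster without a master; three nodes is the smallest setup that tolerates one failure.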

Write and read consistency
− "consistency"
  − all, quorum (default), one
  − How many shard copies need to be available to permit an operation
− "replication=async"
  − Return after the primary shard has safely stored the document
  − By default, the request returns only after full replication is completed
− "preference"
  − On which shards to execute a search (default: round robin)
  − Possible values: local, primary, only some shards or nodes, arbitrary string
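
How the "consistency" levels translate into a number of required shard copies can be sketched as follows. The quorum formula and the exception for a single replica reflect the 1.x behavior as far as I know it and are not stated on the slides, so treat them as assumptions; the function name is made up.

```python
def required_active_copies(consistency, number_of_replicas):
    """Shard copies (primary + replicas) that must be active before a write proceeds."""
    total_copies = 1 + number_of_replicas
    if consistency == "one":
        return 1
    if consistency == "all":
        return total_copies
    if consistency == "quorum":
        # Assumption: a quorum is only enforced with more than one replica,
        # so that a single node with default settings can still accept writes.
        if number_of_replicas <= 1:
            return 1
        return total_copies // 2 + 1
    raise ValueError("unknown consistency level: " + consistency)


for replicas in (1, 2, 3):
    print(replicas, "replicas ->", required_active_copies("quorum", replicas), "active copies")
```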

The road to expertise: Use index aliases

Index alias
− A logical name for one or more Elasticsearch index(es)
  − Decouples client view from physical storage
− Use cases:
  − Zero downtime re-indexing
  − (Read-only) views on multiple indices
− May be associated with a query
  − Interesting for implementing access control
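
Both use cases can be expressed as request bodies for the _aliases endpoint. The sketch below uses hypothetical index, alias and field names: the first body switches an alias from an old to a new index in one request, the second defines an alias with a filter that acts as a restricted view.

```python
import json

# Zero downtime re-indexing: both actions are part of one request, so clients
# never see the alias pointing at no index.
swap_alias = {
    "actions": [
        {"remove": {"index": "articles_v1", "alias": "articles"}},
        {"add": {"index": "articles_v2", "alias": "articles"}},
    ]
}

# An alias with a filter only exposes the matching documents.
filtered_alias = {
    "actions": [
        {
            "add": {
                "index": "articles_v2",
                "alias": "articles_public",
                "filter": {"term": {"visibility": "public"}},
            }
        }
    ]
}

print(json.dumps(swap_alias, indent=2))
print(json.dumps(filtered_alias, indent=2))
```

Clients keep searching `articles` before and after the switch and do not need to know which physical index is behind it.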

Thoughts on scalability
− Choose number of shards depending on estimation and measurements
  − A little overallocation is OK
  − But not too much, as shards don’t come for free
− If the amount of data exceeds the capacity of the available shards, index aliases may help
  − Create another, identically configured index
  − Add new documents to the new index
  − Define an alias so that search considers both indexes
  − Advice: Work with aliases right from the start
− Remember:
  − Search in an index with 50 shards = Search in 50 indexes with one shard each
  − In both cases, 50 Lucene indexes are searched

The road to expertise: Learn about plugins

Resources
− The official blog: http://www.elasticsearch.org/blog/
− The official book: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/index.html
− The official reference: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index.html
− Great blog: https://www.found.no/foundation/
− The Sense examples shown in this talk: https://gist.github.com/peschlowp/3aa550665ce3a417b617

Questions?
Dr. rer. nat. Patrick Peschlow
codecentric AG
Merscheider Straße 1, 42699 Solingen
tel +49 (0) 212.23 36 28 54
fax +49 (0) 212.23 36 28 79
www.codecentric.de
