Elasticsearch - from novice to expert

Presentation held at the Coding Serbia Meetup in Novi Sad, Serbia, on September 25, 2014. Apart from the slides, the presentation featured various examples demonstrated using the Elasticsearch Marvel/Sense plugin. The slides contain a link to a gist showing those examples.

Patrick Peschlow

September 25, 2014

Transcript

  1. codecentric AG Patrick Peschlow Elasticsearch - from novice to expert

  2. Crash course (demo with Sense)

    − Introduction
    − Quickstart
    − Analysis
    − Mapping
    − Search features
    − Sharding + Replication
  3. The road to expertise

    − Get the basics right
    − Map carefully
    − Tune analysis incrementally
    − Understand filters
    − Know about Lucene
    − Don’t let your cluster fool you
    − Use index aliases
    − Learn about plugins
  4. The road to expertise: Get the basics right

  5. The road to expertise: Map carefully

  6. Map carefully

    − Disable the _all field, you definitely don’t want it
    − Keep the _source field enabled and don’t set any fields to stored
    − Disable dynamic mapping (except where really needed)
    − Choose analyzers carefully (maybe even not_analyzed is enough?)
    − Consider mapping fields more than once, depending on the requirements
    − Existing field mappings cannot be changed without deleting the type
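
The mapping advice on slide 6 could translate into an index-creation body roughly like the following. This is a minimal sketch that assumes the Elasticsearch 1.x mapping syntax current at the time of the talk; the index name `library`, the type name `article`, and all field names are hypothetical.

```python
import json


def build_article_mapping():
    """Return an index-creation body that follows the advice on slide 6.

    Assumes Elasticsearch 1.x mapping syntax. The type name "article"
    and all field names are hypothetical.
    """
    return {
        "mappings": {
            "article": {
                "_all": {"enabled": False},    # disable the catch-all field
                "_source": {"enabled": True},  # the default, stated explicitly
                "dynamic": "strict",           # reject fields that are not mapped
                "properties": {
                    # Mapped twice: analyzed for full-text search, plus a
                    # not_analyzed "raw" sub-field for sorting and exact matches.
                    "title": {
                        "type": "string",
                        "analyzer": "english",
                        "fields": {
                            "raw": {"type": "string", "index": "not_analyzed"}
                        },
                    },
                    # An identifier needs no analysis at all.
                    "isbn": {"type": "string", "index": "not_analyzed"},
                },
            }
        }
    }


if __name__ == "__main__":
    # Body for: PUT /library (hypothetical index name)
    print(json.dumps(build_article_mapping(), indent=2))
```

Because `dynamic` is set to `strict`, a document with an unmapped field is rejected instead of silently extending the mapping, which fits the advice to map deliberately.
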
  7. The road to expertise: Tune analysis incrementally

  8. Tune analysis incrementally

    − Don’t guess
    − Use the explain feature for search result scores and query rewriting
    − Prevent regressions by having a comprehensive unit test suite
        − In Java land: embed Elasticsearch
        − Test expectations about matches
    − Make sure your analyzers work correctly
        − Use the analyze API (and maybe the extended-analyze plugin)
    − Understand why queries match or don’t match
        − Use Luke to see what’s in the index
        − http://rosssimpson.com/blog/2014/05/06/using-luke-with-elasticsearch/
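
"Test expectations about matches" from slide 8 might be sketched as below. The helpers are hypothetical and work on plain dictionaries: one enables `explain` in a search body, the other compares the ids in a search response (`hits.hits[]._id`) with the ids a test expects. The sample response is hand-written so the sketch stays self-contained; a real test would obtain it from an embedded or test cluster.

```python
def explained_search(query):
    """Wrap a query so that every hit carries an _explanation of its score."""
    return {"explain": True, "query": query}


def matched_ids(response):
    """Extract the document ids from a search response (hits.hits[]._id)."""
    return [hit["_id"] for hit in response["hits"]["hits"]]


def check_matches(response, expected_ids):
    """Raise AssertionError if the matching documents are not the expected ones."""
    actual = set(matched_ids(response))
    expected = set(expected_ids)
    if actual != expected:
        raise AssertionError(
            "missing: %s, unexpected: %s"
            % (sorted(expected - actual), sorted(actual - expected))
        )


# Hand-written stand-in for a real search response (assumption).
sample_response = {
    "hits": {
        "total": 2,
        "hits": [
            {"_id": "1", "_score": 1.3},
            {"_id": "7", "_score": 0.4},
        ],
    }
}

body = explained_search({"match": {"title": "elasticsearch"}})
check_matches(sample_response, ["1", "7"])  # passes silently
```

Comparing sets of ids keeps such tests independent of score values, which tend to change whenever analysis is tuned.
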
  9. The road to expertise: Understand filters

  10. Understand filters

    − Use filters instead of queries whenever you don’t need scoring
        − Many filters can get cached
        − You can even do filters-only (constant_score or match_all)
    − Compound filters (bool/and/or/not) are not cached
        − But you can still explicitly request caching by setting _cache
    − Prefer bool filters over and/or/not when combining cached filters
        − and/or/not don’t use the cache
    − Consider the scope of filters
        − May be applied before or after the query
        − Affects the scope of facets/aggregations
        − Often, "filtered query" is what you need
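
The three filter placements mentioned on slide 10 could be built as request bodies like this. It is a sketch assuming the Elasticsearch 1.x query DSL (`filtered`, `constant_score`, `post_filter`); the helper names and the field names in the usage lines are hypothetical.

```python
def bool_filter(filters):
    """Combine filters with a bool filter (preferred over and/or/not)."""
    return {"bool": {"must": list(filters)}}


def filtered_query(query, filters):
    """Only documents passing the filters are scored by `query`.

    Aggregations see the filtered result set.
    """
    return {"query": {"filtered": {"query": query, "filter": bool_filter(filters)}}}


def filters_only(filters):
    """No scoring needed: every matching document gets the same score."""
    return {"query": {"constant_score": {"filter": bool_filter(filters)}}}


def post_filtered(query, filters, aggs):
    """Filter applied after the query.

    Hits are narrowed down, but aggregations still cover every document
    that matches `query`.
    """
    return {"query": query, "post_filter": bool_filter(filters), "aggs": aggs}


# Hypothetical fields: "status", "year", "title".
published = {"term": {"status": "published"}}
recent = {"range": {"year": {"gte": 2010}}}

search_body = filtered_query({"match": {"title": "lucene"}}, [published, recent])
count_body = filters_only([published])
facet_body = post_filtered(
    {"match": {"title": "lucene"}},
    [recent],
    {"by_year": {"terms": {"field": "year"}}},
)
```

The difference between `filtered_query` and `post_filtered` is exactly the "scope of filters" point: the first narrows what aggregations see, the second does not.
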
  11. The road to expertise: Know about Lucene

  12. Lucene internals

  13. Lucene internals [diagram: Segment flush()]

  14. Lucene internals [diagram: Segment flush()]

  15. Lucene internals [diagram: Segment flush(), Segment flush()]

  16. Lucene internals [diagram: Segment flush(), Segment flush(), commit(); label: "Synced to Disk"]

  17. Lucene internals [diagram: Segment flush(), Segment flush(), commit(); labels: "Synced to Disk", "Visible to newly opened readers", "If desired, visible via NRT"]

  18. Lucene internals [diagram: Segment flush(), Segment flush(), commit(); labels: "Synced to Disk", "Executed heuristically (or explicitly via NRT)", "Explicit call (transaction)"]
  19. Transaction log

  20. Transaction log [diagram label: "Persisted"]

  21. Transaction log [diagram: refresh(); label: "Persisted"]

  22. Transaction log [diagram: refresh(), Segment flush(); labels: "Persisted", "+ Reopen reader for NRT"]

  23. Transaction log [diagram: refresh(), Segment flush(); labels: "Persisted", "+ Reopen reader for NRT"]

  24. Transaction log [diagram: refresh(), Segment flush(), flush(); labels: "Persisted", "+ Reopen reader for NRT"]

  25. Transaction log [diagram: refresh(), Segment flush() (twice), commit(); labels: "Persisted", "+ Reopen reader", "Synced to Disk"]

  26. Transaction log [diagram: refresh(), Segment flush() (twice), commit(); labels: "Persisted", "+ Reopen reader", "Synced to Disk", "Executed heuristically", "Executed regularly"]

  27. Transaction log

    All documents persisted and searchable: Transaction log can be cleared
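
The interplay of transaction log, refresh() and flush() on slides 19 to 27 can be mimicked with a toy model. The class below is purely illustrative and not how Elasticsearch or Lucene are implemented: all names are hypothetical, "disk" is just a Python list, and it assumes for simplicity that a flush also makes pending documents searchable.

```python
class ToyShard:
    """Toy model of the write path on slides 19-27 (not Elasticsearch code)."""

    def __init__(self):
        self.translog = []    # persisted operations that are not yet committed
        self.buffer = []      # indexed documents that are not yet searchable
        self.searchable = []  # documents visible to the reopened (NRT) reader
        self.committed = []   # documents in segments synced to disk

    def index(self, doc):
        """A write is persisted in the transaction log right away."""
        self.translog.append(doc)
        self.buffer.append(doc)

    def refresh(self):
        """Segment flush plus reader reopen: documents become searchable."""
        self.searchable.extend(self.buffer)
        self.buffer = []

    def flush(self):
        """Commit: everything is synced to disk, so the log can be cleared.

        Simplifying assumption: pending documents also become searchable.
        """
        self.refresh()
        self.committed = list(self.searchable)
        self.translog = []

    def search(self):
        return list(self.searchable)


shard = ToyShard()
shard.index({"id": 1})
before_refresh = shard.search()      # document is persisted but not searchable
shard.refresh()
after_refresh = shard.search()       # searchable, yet only the log protects it
log_before_flush = list(shard.translog)
shard.flush()                        # committed, transaction log cleared
```

The point of the model is the ordering: durability comes from the transaction log first, visibility from refresh, and only the commit allows the log to be dropped.
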
  28. Update API

    − Lucene doesn’t know updates
    − Elasticsearch offers two approaches
        − Partial document
        − Script
    − Attention
        − Update = Delete + Add
        − Updates require _source
        − Partial document merges inner objects instead of replacing them
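
The last point on slide 28 is easy to overlook, so here is a small model of it. The function is a hypothetical stand-in that imitates how a partial-document update (a request body of the form `{"doc": {...}}`) is merged into the existing `_source`: inner objects are merged key by key, while all other values, including arrays, are replaced. It runs locally and does not call Elasticsearch.

```python
def merge_partial(source, partial):
    """Merge a partial document into an existing _source (illustrative model).

    Inner objects (dicts) are merged recursively; every other value,
    including lists, replaces the old value. The inputs are not modified.
    """
    merged = dict(source)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_partial(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical document and partial update.
existing = {
    "title": "Elasticsearch",
    "author": {"name": "Alice", "city": "Solingen"},
    "tags": ["search", "lucene"],
}
update_body = {"doc": {"author": {"city": "Novi Sad"}, "tags": ["talk"]}}

result = merge_partial(existing, update_body["doc"])
# result["author"] keeps "name" and gets the new "city";
# result["tags"] is replaced as a whole.
```

To replace an inner object completely, a script or a full re-index of the document is the safer route than a partial document.
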
  29. Relations

    − Lucene documents are flat
    − Elasticsearch offers two alternatives
        − Nested objects
        − Parent/child mapping
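
Why flat documents are a problem for relations (slide 29) can be shown without a cluster. The sketch below flattens an array of inner objects into multi-valued fields, roughly as a plain object mapping does, and compares a match on the flattened form with a per-object match, which is what nested objects provide. All function and field names are hypothetical; the two mapping fragments assume Elasticsearch 1.x syntax.

```python
def flatten(doc, prefix=""):
    """Flatten inner objects into dotted, multi-valued fields."""
    flat = {}
    for key, value in doc.items():
        path = prefix + key
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, dict):
                for sub_path, sub_values in flatten(item, path + ".").items():
                    flat.setdefault(sub_path, []).extend(sub_values)
            else:
                flat.setdefault(path, []).append(item)
    return flat


def flat_match(flat, conditions):
    """Each condition may be satisfied by a different inner object."""
    return all(value in flat.get(field, []) for field, value in conditions.items())


def nested_match(objects, conditions):
    """All conditions must hold within one and the same inner object."""
    return any(
        all(obj.get(field) == value for field, value in conditions.items())
        for obj in objects
    )


post = {
    "title": "Post",
    "comments": [
        {"author": "alice", "stars": 5},
        {"author": "bob", "stars": 1},
    ],
}
flat_post = flatten(post)

# Mapping fragments for the two alternatives (assumed 1.x syntax).
nested_mapping = {"comments": {"type": "nested"}}
child_mapping = {"comment": {"_parent": {"type": "post"}}}
```

Asking for a comment by bob with 5 stars matches the flattened document, because "bob" and 5 both occur somewhere in it, although no single comment satisfies both conditions.
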
  30. The road to expertise: Don’t let your cluster fool you
  31. Cluster state

    − Shard state
        − red = Primary shard not allocated
        − yellow = Primary shard allocated but not all replicas
        − green = All shards allocated
    − Index state = Worst state of all shards of the index
    − Cluster state = Worst state of all indexes of the cluster
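
The "worst state wins" rule on slide 31 fits into a few lines. This is an illustrative model with hypothetical function names, not the cluster health API itself.

```python
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def shard_state(primary_allocated, replicas_allocated, replicas_configured):
    """State of one shard, following the definitions on slide 31."""
    if not primary_allocated:
        return "red"
    if replicas_allocated < replicas_configured:
        return "yellow"
    return "green"


def worst(states):
    """Return the most severe state; an empty collection counts as green."""
    return max(states, key=SEVERITY.get, default="green")


def index_state(shard_states):
    """Index state = worst state of all shards of the index."""
    return worst(shard_states)


def cluster_state(index_states):
    """Cluster state = worst state of all indexes of the cluster."""
    return worst(index_states)


# Hypothetical cluster with two indexes.
logs = index_state([shard_state(True, 1, 1), shard_state(True, 0, 1)])
users = index_state([shard_state(True, 1, 1), shard_state(True, 1, 1)])
overall = cluster_state([logs, users])
```

A single shard with a missing replica is enough to turn the whole cluster yellow, so the cluster colour alone says little about how many indexes are affected.
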
  32. Things to consider

    − Access
        − Choose a unique cluster name
        − Consider unicast vs. multicast discovery
    − Allocation awareness
        − Supports arbitrary rules to place shards or indexes on nodes
    − Nodes can have different roles: master, data, client
    − Pay attention to these settings:
        − minimum_master_nodes
        − gateway.recover_after_nodes
        − gateway.expected_nodes
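
The settings listed on slide 32 might be assembled as follows. The slide only names the settings; the quorum rule used here (more than half of the master-eligible nodes) is the commonly documented recommendation against split brain and is an addition to the slide. Setting names follow Elasticsearch 1.x, while the cluster name, host names and node counts are hypothetical.

```python
def minimum_master_nodes(master_eligible_nodes):
    """Quorum of master-eligible nodes: more than half of them."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


def node_settings(cluster_name, unicast_hosts, master_eligible_nodes,
                  expected_nodes, recover_after_nodes):
    """Settings for elasticsearch.yml, as a flat dictionary (a sketch)."""
    return {
        "cluster.name": cluster_name,
        # Unicast discovery with an explicit host list instead of multicast.
        "discovery.zen.ping.multicast.enabled": False,
        "discovery.zen.ping.unicast.hosts": list(unicast_hosts),
        "discovery.zen.minimum_master_nodes": minimum_master_nodes(master_eligible_nodes),
        # Recovery settings: the values are deployment-specific choices.
        "gateway.expected_nodes": expected_nodes,
        "gateway.recover_after_nodes": recover_after_nodes,
    }


# Hypothetical three-node cluster.
settings = node_settings(
    cluster_name="search-prod",
    unicast_hosts=["es1.example.org", "es2.example.org", "es3.example.org"],
    master_eligible_nodes=3,
    expected_nodes=3,
    recover_after_nodes=2,
)
```

With two master-eligible nodes the quorum is also two, which means such a cluster cannot elect a master after losing either node.
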
  33. Write and read consistency

    − "consistency"
        − all, quorum (default), one
        − How many shard copies need to be available to permit an operation
    − "replication=async"
        − Return after the primary shard has safely stored the document
        − By default returns only after full replication is completed
    − "preference"
        − On which shards to execute a search (default: round robin)
        − Possible values: local, primary, only some shards or nodes, arbitrary string
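
What the three "consistency" levels on slide 33 mean in numbers can be sketched as below. The function is a hypothetical model based on the behaviour documented for Elasticsearch 1.x, including the special case that a quorum is only enforced when an index has more than one replica; treat that rule as an assumption and verify it against the version in use.

```python
def required_active_copies(consistency, number_of_replicas):
    """How many shard copies must be active before a write is attempted.

    Model of the Elasticsearch 1.x rules (an assumption, see lead-in).
    """
    copies = 1 + number_of_replicas  # the primary plus its replicas
    if consistency == "one":
        return 1
    if consistency == "all":
        return copies
    if consistency == "quorum":
        # With a single replica a quorum would require both copies,
        # so it is only enforced for two or more replicas.
        if number_of_replicas <= 1:
            return 1
        return copies // 2 + 1
    raise ValueError("unknown consistency level: %r" % consistency)


# Hypothetical index with 2 replicas per primary (3 copies per shard).
quorum_needed = required_active_copies("quorum", 2)
all_needed = required_active_copies("all", 2)
```

With two replicas, a write under the default level therefore proceeds as long as the primary and one replica are available.
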
  34. The road to expertise: Use index aliases

  35. Index alias

    − A logical name for one or more Elasticsearch index(es)
        − Decouples client view from physical storage
    − Use cases:
        − Zero downtime re-indexing
        − (Read-only) views on multiple indices
    − May be associated with a query
        − Interesting for implementing access control
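
Two of the alias use cases on slide 35 could be expressed as bodies for `POST /_aliases`, as sketched here. The `actions` format with `add` and `remove` is the aliases API; the helper names, index names and the `tenant` field are hypothetical. Note that the API key for the associated query is `filter`.

```python
def swap_alias(alias, old_index, new_index):
    """Zero downtime re-indexing: move the alias in one atomic request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


def filtered_alias(alias, index, filter_body):
    """A view on an index that only shows documents passing the filter."""
    return {
        "actions": [
            {"add": {"index": index, "alias": alias, "filter": filter_body}}
        ]
    }


# Clients always talk to "articles"; the physical index changes underneath.
swap_body = swap_alias("articles", "articles_v1", "articles_v2")

# Per-tenant view, e.g. as a building block for access control.
tenant_body = filtered_alias(
    "articles_acme", "articles_v2", {"term": {"tenant": "acme"}}
)
```

Since both actions of `swap_alias` are sent in one request, clients never observe a moment in which the alias points to no index.
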
  36. Thoughts on scalability

    − Choose number of shards depending on estimation and measurements
        − A little overallocation is OK
        − But not too much, as shards don’t come for free
    − If the amount of data exceeds the capacity of the available shards, index aliases may help
        − Create another, identically configured index
        − Add new documents to the new index
        − Define an alias so that search considers both indexes
        − Advice: Work with aliases right from the start
    − Remember:
        − Search in an index with 50 shards = Search in 50 indexes with one shard each
        − In both cases, 50 Lucene indexes are searched
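
The "Remember" point on slide 36 comes down to simple counting, shown here with a hypothetical helper that maps index names to their number of primary shards. The second helper sketches the overflow step, extending an existing search alias to an additional index; the index and alias names are made up.

```python
def lucene_indexes_searched(primary_shards_per_index):
    """Each shard is one Lucene index, so a search touches all of them."""
    return sum(primary_shards_per_index.values())


def extend_alias(alias, additional_index):
    """Body for POST /_aliases: searches on `alias` also cover the new index."""
    return {"actions": [{"add": {"index": additional_index, "alias": alias}}]}


one_big_index = {"events": 50}
many_small_indexes = {"events_%02d" % i: 1 for i in range(50)}

big_cost = lucene_indexes_searched(one_big_index)
small_cost = lucene_indexes_searched(many_small_indexes)

# The first index is full: add a second, identically configured one.
overflow_body = extend_alias("events_search", "events_2014_10")
```

Both layouts fan a search out to the same number of Lucene indexes, which is why adding indexes behind an alias scales much like adding shards.
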
  37. The road to expertise: Learn about plugins

  38. Resources

    − The official blog: http://www.elasticsearch.org/blog/
    − The official book: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/index.html
    − The official reference: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index.html
    − Great blog: https://www.found.no/foundation/
    − The Sense examples shown in this talk: https://gist.github.com/peschlowp/3aa550665ce3a417b617
  39. Questions?

    Dr. rer. nat. Patrick Peschlow
    codecentric AG
    Merscheider Straße 1
    42699 Solingen
    tel +49 (0) 212.23 36 28 54
    fax +49 (0) 212.23 36 28 79
    patrick.peschlow@codecentric.de
    www.codecentric.de