ES random facts @ Perfectial

Slam
March 28, 2017

Transcript

  1. Open source, written in Java.

    - originally a simple wrapper around Lucene
    - great community
    - clients available for all sane languages
    - the Elastic team has been the biggest contributor to Lucene for the past couple of years
    - worth learning, at least in terms of DB architecture
  2. CAP

    In simple terms: “given consistency, availability, partition-tolerance, choose two”

    - is not limited to NoSQL by any means
    - has never been proven
    - most RDBMS are C+A
    - most [good] NoSQLs are A+P
  3. ES & CAP

    - [highly] available
    - partition-tolerant
    - eventually consistent
    - the C <> A balance can be slightly tuned by:
      - wait for active shards (wait_for_active_shards)
      - ?refresh
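The two knobs on the ES & CAP slide are plain query-string parameters on write requests. A minimal sketch, assuming Elasticsearch 5.x-style document URLs: the index name `events`, the type `doc`, and the helper `index_url` are made up for illustration, and nothing here talks to a cluster.

```python
from urllib.parse import urlencode


def index_url(index, doc_type, doc_id, wait_for_active_shards=None, refresh=None):
    """Build the path and query string for a single-document index request."""
    params = {}
    if wait_for_active_shards is not None:
        # Number of shard copies that must be active before the write proceeds.
        params["wait_for_active_shards"] = wait_for_active_shards
    if refresh is not None:
        # "wait_for" makes the request wait until the change is visible to search.
        params["refresh"] = refresh
    path = f"/{index}/{doc_type}/{doc_id}"
    return f"{path}?{urlencode(params)}" if params else path


print(index_url("events", "doc", 1, wait_for_active_shards=2, refresh="wait_for"))
```

Leaving both parameters out keeps the defaults, which favor availability and write throughput over immediate visibility.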
  4. ACID

    - is not limited to RDBMS by any means
    - vague interpretations
    - often [con]fused with modern RDBMS transactions
  5. ES & ACID

    ACID in modern terms isn't supported. But:

    - atomicity per document is guaranteed
    - ES uses a WAL for all writes
    - optimistic concurrency control by versions [1] [2]
    - on-side scripting support
    - global, document-level, and tree locking
    - durability is manageable (OOM, split-brain)
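Optimistic concurrency control by versions can be illustrated without a cluster. The sketch below is a toy in-memory store, not the Elasticsearch API: `TinyStore`, `VersionConflict`, and the `expected_version` argument are hypothetical names that only mimic the idea of rejecting a write whose version is stale.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""


class TinyStore:
    """Toy in-memory document store with a version counter per document."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def get(self, doc_id):
        return self._docs[doc_id]

    def index(self, doc_id, source, expected_version=None):
        current_version = self._docs.get(doc_id, (0, None))[0]
        # Reject the write if someone else changed the document in the meantime.
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(f"expected {expected_version}, found {current_version}")
        self._docs[doc_id] = (current_version + 1, source)
        return current_version + 1


store = TinyStore()
store.index("1", {"views": 0})
version, doc = store.get("1")
# Read-modify-write guarded by the version that was read.
store.index("1", {"views": doc["views"] + 1}, expected_version=version)
try:
    # A second writer still holding the old version is rejected.
    store.index("1", {"views": 99}, expected_version=version)
except VersionConflict as exc:
    print("conflict:", exc)
```

The losing writer is expected to re-read the document and retry, which is why no locks are needed for the common case.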
  6. Sharding

    - indices may be split into shards
    - indices may require replicas (synced for all shards)
    - a replica is never allocated to the same node as its primary shard
    - a replica is basically the slave in a master-slave replication scheme
    - searches/aggregations hit replicas
    - each shard is a Lucene index
    - should be planned ahead
  7. Cluster & failover stuff

    ES automatically syncs new cluster nodes:

    - shards are rebalanced to new nodes
    - shards are synced via WAL to recovered nodes

    All rebalancing and recovery is automatic (still controllable, though). Transport inside the cluster is non-blocking.
  8. Node roles

    - master-eligible node (can be elected)
    - data node (storage, CRUD, search, map stage of aggregation)
    - ingest node (pre-save actions)
    - tribe node (inter-cluster communication)
    - coordinator [only] node (buffer for bulk indexing, reduce stage worker)
  9. Node discovery

    - depends on platform
    - kind of pluggable
    - default is zen-discovery

    Pay attention to master election settings (if you have access to them). The worst thing you can get is split-brain.
  10. Routing

    - every document is routed to a primary shard by hashing its routing value modulo the number of primary shards
    - the default routing value is _id
    - you can route a document manually by providing an external _routing
    - parent-child relations restrict routing: child docs must be routed to the same shard as the parent
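The routing rule can be sketched in a few lines, assuming the plain hash-modulo scheme described in the Elasticsearch reference (hash of the routing value modulo the number of primary shards). The hash here is CRC32 as a stand-in, not the hash Elasticsearch actually uses, and `shard_for` and `route` are hypothetical helper names.

```python
import zlib


def shard_for(routing, number_of_primary_shards):
    # Stand-in hash: Elasticsearch uses its own hash function, not CRC32.
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


def route(doc_id, number_of_primary_shards, routing=None):
    # Without an explicit routing value, the document _id is used.
    key = routing if routing is not None else doc_id
    return shard_for(key, number_of_primary_shards)


parent_shard = route("parent-1", 5)
# A child document is routed by its parent's id, so it lands on the same shard.
child_shard = route("child-42", 5, routing="parent-1")
print(parent_shard, child_shard)
```

Because the shard number depends on the primary shard count, changing that count would move existing documents, which is one reason sharding should be planned ahead.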
  11. Search index lifespan

    - indices are divided into immutable segments
    - newly indexed documents are added to an in-memory segment
    - the in-memory segment is flushed to disk after reaching a buffer or time limit
    - on-disk segments are merged after reaching a threshold
    - updates are held in a separate index and merged with the normal ones at the merge stage
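The buffer, flush, and merge cycle can be modeled with a toy class. This is a simplified sketch and not how Lucene is implemented: `ToyIndex`, `buffer_limit`, and `merge_threshold` are made-up names, the time-based flush is left out, and a merge simply concatenates every segment into one.

```python
class ToyIndex:
    """Toy model of an in-memory buffer that flushes into immutable segments."""

    def __init__(self, buffer_limit=3, merge_threshold=3):
        self.buffer = []      # in-memory segment, still mutable
        self.segments = []    # flushed segments, modeled as immutable tuples
        self.buffer_limit = buffer_limit
        self.merge_threshold = merge_threshold

    def add(self, doc):
        self.buffer.append(doc)
        if len(self.buffer) >= self.buffer_limit:
            self.flush()

    def flush(self):
        # Freeze the buffer into a new segment, then check whether to merge.
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []
        if len(self.segments) >= self.merge_threshold:
            self.merge()

    def merge(self):
        # Rewrite all small segments into a single bigger one.
        merged = tuple(doc for segment in self.segments for doc in segment)
        self.segments = [merged]


index = ToyIndex()
for n in range(10):
    index.add(n)
print(len(index.segments), len(index.buffer))
```

Even in this toy, a merge rewrites every document it touches, which hints at why merging is heavy in a real index.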
  12. Segment merge is kinda heavy in itself.

    - a more uniform indexing rate leads to less stress on CPU and GC
    - updates and deletes are heavy, both in terms of search and segment merge
    - segment merge can be forced
    - being a big part of performance, this should be monitored and tuned appropriately; generic JVM heap + GC tuning is the way to go
  13. Mapping

    A common misconception is that ES is "schema-less". ES stores and requires a mapping to index docs.

    - a mapping can be created and altered on the fly
    - a mapping can be changed in case of backward-compatible changes
    - adding new fields is cheap
  14. If you plan to use ES in prod, treat it as schema-intensive.

    - don't step on the rakes, use explicit mapping
    - dynamic mapping creation can be controlled
    - dynamic field mapping can be roughly controlled
    - automatic index creation can be controlled
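An explicit mapping is just a JSON body sent when the index is created. A minimal sketch, assuming the Elasticsearch 5.x layout where mappings are nested under a mapping type: the type name `event` and the three fields are invented for illustration, and the body is only built and printed, not sent anywhere.

```python
import json

# Hypothetical index-creation body with an explicit, strict mapping.
create_index_body = {
    "mappings": {
        "event": {                    # hypothetical mapping type (5.x layout)
            "dynamic": "strict",      # reject documents that contain unmapped fields
            "properties": {
                "user_id": {"type": "keyword"},
                "message": {"type": "text"},
                "created_at": {"type": "date"},
            },
        }
    }
}

print(json.dumps(create_index_body, indent=2))
```

With `"dynamic": "strict"`, a document carrying a field outside `properties` is rejected instead of silently extending the mapping. Automatic index creation is governed separately by the `action.auto_create_index` setting.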
  15. Relations ... aren't supported as you want them

    - Nested documents
      - "hidden" separate docs, accessible via parent
      - tricky to work with if you need nested-only
      - heavy on updates, flyweight on retrieve
    - Parent-child
      - sometimes tricky to retrieve
      - easy on update, heavier on CPU
  16. Indices tricks

    - simultaneous operations in multiple indices
    - index creation supports templates
    - index [filtered] aliases. This. Is. Awesome.
      - zero downtime reindex
      - user/analytic realms
      - seamless index switch
      - event log "rotation"
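The zero-downtime reindex trick boils down to one request against the `_aliases` endpoint that removes the alias from the old index and adds it to the new one. A minimal sketch of that request body: the alias `events`, the index names `events_v1` and `events_v2`, and the helper `alias_swap_actions` are assumptions for illustration, and the body is only printed, not sent.

```python
import json


def alias_swap_actions(alias, old_index, new_index):
    """Build an _aliases body that repoints an alias from one index to another."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


body = alias_swap_actions("events", "events_v1", "events_v2")
print(json.dumps(body, indent=2))
```

Clients that only ever query the alias never see the switch, since both actions are applied in a single request.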
  17. Geo data

    - geo points with hashing
    - geo shapes
    - query-time shape bounding
    - shape intersection
    - distance [range] filtering and sorting
    - geohash grid
  18. tail scroll query

    - push queries (percolation)
    - pivot faceting (nested aggregations)
    - multi-fields: different processing for the same data
    - intensive caching with awesome invalidation
    - custom aggregation pipelines
    - hadoop connector
    - stored procedures / scripting
  19. Doc values

    If you consider using ES as an analytics engine, get your hands on doc values [1] [2]. Basically, this is column storage for non-analyzed data, which makes aggregations and range queries really fast.

    - the default storage approach starting with version 5
    - can be disabled at any time; set per field
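Why column storage helps aggregations can be shown with plain lists. This is a toy illustration of the layout idea only, not the on-disk doc values format: the sample documents and field names are invented, and every document is assumed to carry every field.

```python
from collections import defaultdict

docs = [
    {"status": "ok", "latency_ms": 12},
    {"status": "error", "latency_ms": 250},
    {"status": "ok", "latency_ms": 18},
]

# Column layout: one array per field, indexed by document position.
columns = defaultdict(list)
for doc in docs:
    for field, value in doc.items():
        columns[field].append(value)

# An aggregation or range filter on one field scans a single array
# and never touches the other fields of the documents.
latencies = columns["latency_ms"]
average_latency = sum(latencies) / len(latencies)
slow_docs = [position for position, value in enumerate(latencies) if value >= 100]
print(average_latency, slow_docs)
```

The row-oriented alternative would have to load each whole document just to read one number from it.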
  20. ETL server for event logs

    - a lot of datasources
    - custom pipelining
    - a hell of a lot of data backends
    - pluggable
  21. visualizations

    - charts
    - heatmaps
    - free tile server for geodata
    - polygonal heatmaps
    - data-aware shell (dev tool, formerly Sense)
    - reports
    - graph builder integration