ES random facts @ Perfectial

elastic ecosystem random tales

- open source
- written in Java
- originally a simple wrapper around Lucene
- great community
- clients available for all sane languages
- the Elastic team has been the biggest contributor to Lucene for the past couple of years
- worth learning, at least in terms of DB architecture

CAP in simple terms: "given consistency, availability, partition-tolerance, choose two"
- is not limited to NoSQL by any means
- started as a conjecture; the formal proof (Gilbert and Lynch, 2002) covers only a narrow model
- most RDBMS are C+A
- most [good] NoSQLs are A+P

ES & CAP
- [highly] available
- partition-tolerant
- eventually consistent
- the C <> A balance can be slightly tuned with wait_for_active_shards and ?refresh
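A rough sketch of that tuning knob, assuming ES 5 or later, where `wait_for_active_shards` and `refresh` are query parameters of the index API. The helper name `index_url`, the host, and the index name are made up for illustration, and the `/_doc/` path layout is an assumption that depends on the ES version. No HTTP call is made; the helper only builds the URL.

```python
from urllib.parse import urlencode


def index_url(base, index, doc_id, wait_for_active_shards=None, refresh=None):
    """Build an index-request URL (hypothetical helper, sends nothing)."""
    params = {}
    if wait_for_active_shards is not None:
        # "all" or a number: how many shard copies must be active before the write
        params["wait_for_active_shards"] = wait_for_active_shards
    if refresh is not None:
        # "true", "false" or "wait_for": when the write becomes visible to search
        params["refresh"] = refresh
    query = "?" + urlencode(params) if params else ""
    return f"{base}/{index}/_doc/{doc_id}{query}"


# Leaning towards consistency: wait for all copies and for the next refresh.
print(index_url("http://localhost:9200", "events", "1", "all", "wait_for"))
```

Leaving both parameters out keeps the defaults, which lean towards availability and indexing speed.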

ACID
- is not limited to RDBMS by any means
- vague interpretations
- often [con]fused with modern RDBMS transactions

ES & ACID
ACID in the modern sense isn't supported. But:
- atomicity per document is guaranteed
- ES uses a WAL (the translog) for all writes
- optimistic concurrency control by versions [1] [2]
- server-side scripting support
- global, document-level, tree locking
- durability is manageable (OOM, split-brain)
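Optimistic concurrency control by versions can be illustrated with a toy in-memory store. This is only a sketch of the idea, not the ES implementation: `VersionedStore` and `VersionConflict` are hypothetical names, and the real engine tracks versions per document inside each shard.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""


class VersionedStore:
    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def get(self, doc_id):
        return self._docs[doc_id]

    def index(self, doc_id, source, expected_version=None):
        current_version = self._docs.get(doc_id, (0, None))[0]
        # The write is rejected if somebody else updated the doc in between.
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected version {expected_version}, found {current_version}"
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, source)
        return new_version


store = VersionedStore()
version = store.index("1", {"stock": 10})
store.index("1", {"stock": 9}, expected_version=version)  # succeeds
try:
    store.index("1", {"stock": 8}, expected_version=version)  # stale version
except VersionConflict as err:
    print("retry needed:", err)
```

The usual client reaction to a conflict is to re-read the document and retry the update.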

Sharding
- indices may be split into shards
- indices may require replicas (synced for all shards)
- a replica is never allocated to the same node as its primary shard
- a replica is basically a slave in a master-slave replication scheme
- searches/aggregations can hit replicas
- each shard is a Lucene index
- should be planned ahead

Cluster & failover stuff
ES automatically syncs new cluster nodes:
- shards are rebalanced to new nodes
- shards are synced via WAL to recovered nodes
All rebalancing and recovery is automatic (still controllable, though).
Transport inside the cluster is non-blocking.

Node roles
- master-eligible node (can be elected)
- data node (storage, CRUD, search, map stage of aggregation)
- ingest node (pre-save actions)
- tribe node (inter-cluster communication)
- coordinating [only] node (buffer for bulk indexing, reduce stage worker)
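The map/reduce split between data nodes and the coordinating node can be sketched with a toy terms aggregation. The function names `shard_terms` and `coordinator_reduce` are hypothetical, and the sketch ignores details of the real engine such as per-shard result truncation.

```python
from collections import Counter


def shard_terms(docs, field):
    """Map stage: runs on a data node and counts terms within one shard."""
    return Counter(doc[field] for doc in docs if field in doc)


def coordinator_reduce(partials, size):
    """Reduce stage: merges per-shard counts and keeps the top buckets."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total.most_common(size)


shards = [
    [{"tag": "java"}, {"tag": "lucene"}],
    [{"tag": "java"}, {"tag": "java"}],
]
partials = [shard_terms(docs, "tag") for docs in shards]
print(coordinator_reduce(partials, size=1))
```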

The Java "native" client is basically an emulation of a coordinating-only node
that speaks the internal transport protocol

Node discovery
- depends on the platform
- kind of pluggable
- default is zen-discovery
Pay attention to master election settings (if you have access to them).
The worst thing you can get is split-brain.
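The master election setting that matters most for zen-discovery is the quorum of master-eligible nodes (`discovery.zen.minimum_master_nodes` in versions that use zen-discovery). A minimal sketch of the usual majority rule, with hypothetical helper names:

```python
def minimum_master_nodes(master_eligible):
    """Majority quorum: (N / 2) + 1, using integer division."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def can_elect_master(visible, total_master_eligible):
    """A partition may elect a master only if it sees a majority."""
    return visible >= minimum_master_nodes(total_master_eligible)


# With 3 master-eligible nodes, a network split into 2 + 1 leaves
# only the bigger side able to elect a master, so no split-brain.
print(can_elect_master(2, 3), can_elect_master(1, 3))
```

With only two master-eligible nodes the quorum is 2, so losing either node blocks election; that is why an odd number of master-eligible nodes is the common recommendation.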

Routing
- every document is routed to a primary shard based on a hash of its routing value, modulo the number of primary shards
- the default routing value is _id
- you can route a document manually by providing an external _routing
- parent-child relations restrict routing: child docs must be routed to the same shard as their parent
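A sketch of hash-based routing as it is commonly described (shard = hash(routing) % number of primary shards). ES uses Murmur3 for the hash; the MD5 call here is only a stand-in from the standard library, so the shard numbers will not match a real cluster. `pick_shard` is a hypothetical name.

```python
import hashlib


def pick_shard(routing, num_primary_shards):
    """Map a routing value (by default the document _id) to a shard number."""
    digest = hashlib.md5(routing.encode("utf-8")).digest()  # stand-in hash
    return int.from_bytes(digest[:4], "big") % num_primary_shards


# The same routing value always lands on the same shard ...
assert pick_shard("user-42", 5) == pick_shard("user-42", 5)
# ... which is how a child doc routed by its parent's id ends up next to it.
parent_shard = pick_shard("parent-1", 5)
child_shard = pick_shard("parent-1", 5)  # child indexed with routing=parent id
assert parent_shard == child_shard
```

Because the result depends on the number of primary shards, changing that number would move most documents, which is one reason sharding has to be planned ahead.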

Search index lifespan
- indices are divided into immutable segments
- newly indexed documents are added to an in-memory segment
- the in-memory segment is flushed to disk after reaching a buffer or time limit
- on-disk segments are merged after reaching a threshold
- updates are held separately and merged with the normal segments at the merge stage
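The lifecycle above can be sketched as a toy model. This is not Lucene's merge policy: `ToyIndex`, `buffer_limit` and `merge_threshold` are hypothetical names, the time-based flush is left out, and a merge here simply concatenates all segments.

```python
class ToyIndex:
    def __init__(self, buffer_limit=2, merge_threshold=3):
        self.buffer = []    # in-memory segment, still being written
        self.segments = []  # flushed segments, immutable tuples
        self.buffer_limit = buffer_limit
        self.merge_threshold = merge_threshold

    def add(self, doc):
        self.buffer.append(doc)
        if len(self.buffer) >= self.buffer_limit:
            self.flush()

    def flush(self):
        """Freeze the in-memory buffer into a new immutable segment."""
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []
        if len(self.segments) >= self.merge_threshold:
            self.merge()

    def merge(self):
        """Rewrite all segments into one bigger segment."""
        merged = tuple(doc for segment in self.segments for doc in segment)
        self.segments = [merged]


index = ToyIndex()
for doc_id in range(6):
    index.add(doc_id)
print(index.segments)
```

Even in the toy model a merge rewrites every document it touches, which hints at why merging is the expensive part.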

- segment merge is kinda heavy by itself
- a more uniform indexing rate leads to less stress on CPU and GC
- updates and deletes are heavy, both in terms of search and segment merge
- segment merge can be forced
Being a big part of performance, this should be monitored and tuned appropriately. Generic JVM heap + GC tuning is the way to go.

Mapping
A common misconception is that ES is "schema-less".
- ES stores and requires a mapping to index docs
- the mapping can be created and altered on the fly
- the mapping can be changed as long as the changes are backward-compatible
- adding new fields is cheap

If you plan to use ES in prod, treat it as schema-intensive.
- don't keep stepping on the same rake, use explicit mapping
- dynamic mapping creation can be controlled
- dynamic field mapping can be ~controlled
- automatic index creation can be controlled
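A sketch of what "explicit mapping" can look like as an index-creation request body. The field names are invented for illustration, and the layout assumes a version with typeless mappings (ES 7 style); older versions nest the mapping under a type name. `"dynamic": "strict"` makes ES reject documents that contain unmapped fields.

```python
import json


def strict_index_body(properties):
    """Build a create-index body that refuses unknown fields."""
    return {"mappings": {"dynamic": "strict", "properties": properties}}


body = strict_index_body(
    {
        "title": {"type": "text"},       # analyzed, for full-text search
        "status": {"type": "keyword"},   # not analyzed, for filters and aggregations
        "created_at": {"type": "date"},
    }
)
print(json.dumps(body, indent=2))
```

The body would be sent with a PUT to the index name; the sketch stops at building it so that it stays runnable without a cluster.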

Relations
... aren't supported the way you want them
Nested documents
- "hidden" separate docs, accessible via the parent
- tricky to work with if you need nested-only results
- heavy on updates, flyweight on retrieval
Parent-child
- sometimes tricky to retrieve
- easy on updates, heavier on CPU

Indices tricks
- simultaneous operations on multiple indices
- index creation supports templates
- index [filtered] aliases. This. Is. Awesome.
  - zero downtime reindex
  - user/analytic realms
  - seamless index switch
  - event log "rotation"
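The seamless index switch relies on the aliases API applying a list of actions atomically. The sketch below only builds such a request body; `switch_alias_actions` and the index names are hypothetical, while the `actions`, `remove`, `add` and `filter` keys follow the aliases API.

```python
def switch_alias_actions(alias, old_index, new_index, filter_query=None):
    """Build a body that moves an alias from old_index to new_index in one step."""
    add = {"index": new_index, "alias": alias}
    if filter_query is not None:
        add["filter"] = filter_query  # filtered alias: a "view" over the index
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": add},
        ]
    }


# Zero-downtime reindex: clients keep querying "events" the whole time.
body = switch_alias_actions("events", "events_v1", "events_v2")

# A per-user realm as a filtered alias over a shared index.
realm = switch_alias_actions(
    "events_user_7", "events_v1", "events_v2",
    filter_query={"term": {"user_id": 7}},
)
```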

Random stuff

All Lucene text-search features are available
- fuzzy search
- Levenshtein distance
- positive and negative rescoring on the fly
- synonyms
- etc.
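For reference, a plain dynamic-programming Levenshtein distance. This is the textbook algorithm, shown only to make the metric concrete; it is not how Lucene evaluates fuzzy queries internally.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete from a
                    current[j - 1] + 1,      # insert into a
                    previous[j - 1] + cost,  # substitute (or match)
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("elastic", "elastik"))
```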

Elasticsearch is an awesome [realtime] analytics engine

Indexing [optimizations]
- inverted index
- radix trees
- b-k-d trees
- offset + GCD compression
- ordinals
- deflate
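Offset + GCD compression for numeric columns can be sketched in a few lines: subtract the minimum, divide by the greatest common divisor, and store the small remainders. The function names are hypothetical and the real encoding also bit-packs the result, which is skipped here.

```python
from functools import reduce
from math import gcd


def compress(values):
    """Return (offset, divisor, packed) so that value = offset + packed * divisor."""
    if not values:
        raise ValueError("nothing to compress")
    offset = min(values)
    deltas = [value - offset for value in values]
    divisor = reduce(gcd, deltas) or 1  # all-equal input gives gcd 0
    return offset, divisor, [delta // divisor for delta in deltas]


def decompress(offset, divisor, packed):
    return [offset + item * divisor for item in packed]


# Millisecond timestamps that only ever fall on whole minutes.
timestamps = [1500000000000, 1500000060000, 1500000180000]
offset, divisor, packed = compress(timestamps)
print(offset, divisor, packed)
```

Timestamps with coarse resolution are the typical win: thirteen-digit values shrink to a handful of small integers.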

Geo data
- geo points with hashing
- geo shapes
Query time:
- shape bounding
- shape intersection
- distance [range] filtering and sorting
- geohash grid
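Geo point hashing and the geohash grid both build on the geohash encoding: interleave longitude and latitude bisection bits and write them in base 32. A minimal sketch of the standard public algorithm (the function name is ours, and this is not ES code):

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=5):
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            bits = bits << 1
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


print(geohash_encode(57.64911, 10.40744))  # prints "u4pru"
```

A shorter hash is a prefix of a longer one for the same point, so truncating the hash gives a coarser grid cell; that prefix property is what a geohash grid aggregation buckets on.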

- tail scroll query
- push queries (percolation)
- pivot faceting (nested aggregations)
- multi-fields: different processing for the same data
- intensive caching with awesome invalidation
- custom aggregation pipelines
- Hadoop connector
- stored procedures / scripting

Doc values
If you consider using ES as an analytics engine, get your hands on doc values [1] [2].
Basically, this is column storage for non-analyzed data, which makes aggregations and range queries really fast.
It has been the default storage approach for non-analyzed fields since version 2.0.
It is set per field in the mapping and can be disabled there.
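Why column storage helps aggregations can be shown with a toy model: one list per field, indexed by document number, so an aggregation reads only the columns it needs. The function names and sample fields are hypothetical, and real doc values are compressed on-disk structures rather than Python lists.

```python
def build_columns(docs, fields):
    """Turn row-oriented docs into one list per field (list index = doc number)."""
    return {field: [doc.get(field) for doc in docs] for field in fields}


def sum_where(columns, value_field, range_field, low, high):
    """Sum value_field over docs whose range_field falls in [low, high)."""
    total = 0
    for value, key in zip(columns[value_field], columns[range_field]):
        if value is not None and key is not None and low <= key < high:
            total += value
    return total


docs = [
    {"price": 10, "year": 2015, "title": "never read by the aggregation"},
    {"price": 20, "year": 2016},
    {"price": 5, "year": 2017},
]
columns = build_columns(docs, ["price", "year"])
print(sum_where(columns, "price", "year", 2016, 2018))
```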

ecosystem

Logstash

ETL server for event logs
- a lot of datasources
- custom pipelining hell
- a lot of data backends
- pluggable

- visualizations
- charts
- heatmaps
- free tile server for geodata
- polygonal heatmaps
- data-aware shell (dev tool, formerly Sense)
- reports
- graph builder integration

Awesome tools, available mostly via subscription
- monitoring
- security
- alerting
- reporting
- graph builder
