
Elasticsearch in high-traffic website

Camilo Sierra
September 22, 2015


1. How to get a stable Elasticsearch cluster on a high-traffic website?


2. - Web agency founded in 2007, 20 developers (Android/iOS/backend), working on mobile apps: Air France, SeLoger, L'Oréal, Louis Vuitton, ING Direct, DirectAssurance, ...
_type = < FactoricsProject >
- Collect and store traffic data (anonymous) for all apps in an Elasticsearch cluster
- Serve and display data on a web backoffice
- Use collected data to create (re)targeted in-app or push notification campaigns
_index = < MyStudioFactory >


3. _id = < camilo_sierra >
Lead Dev Search - blueKiwi Software - 7 nodes
Lead Dev Elasticsearch - MyStudioFactory - 1 node (365 documents * 7)
- Integration and management of Exalead
- Migration from Exalead to Elasticsearch
- Migration from MySQL (as analytics server) to Elasticsearch
- Proposed and built a stable Elasticsearch cluster design for real-time analytics (Logstash, Kafka...)
- Push campaigns using ES to filter and get mobile device tokens
[email protected]


4. What do we want when we use Elasticsearch?


5. Speed
Cluster stability
- Queries < cache / fielddata >
- Infrastructure < Master-Data-Client >
- How many shards?


6. Let's explain using an example...


7. Imagine that we have a forum to index in Elasticsearch:
- Personal information for each user
- Discussions and comments
- Information about the groups/circles that link users


8. Server 1 / Server 2 / Server 3
C-D-(M)   C-D-(M)   C-D-M*
C = Client
D = Data
M* = elected Master
M = eligible as Master
For servers 1, 2 and 3:
CPU: 10 physical cores @ 2.80GHz
RAM: 256GB or more...
Disks: SSD, 300GB or more...
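As a sketch, the three identical nodes above could run with the default role settings made explicit in elasticsearch.yml (cluster and node names here are assumptions, not from the deck):

```yaml
# elasticsearch.yml on each of the three servers (ES 1.x style)
cluster.name: forum-cluster
node.name: server-1          # server-2 / server-3 on the other machines

# Defaults shown explicitly: every node holds data and is
# master-eligible, so each one acts as Client-Data-(Master) at once.
node.master: true
node.data: true
```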


9. Peak hour at 5pm, with 75% of users connected:
- publish & comment on discussions
- search in discussions & files
- profile updates
- create & join new groups/circles
- add discussions to favorites...


10. What can happen at 5pm
- The heap skyrockets!
- Garbage collection activity increases, causing increased CPU usage
- To avoid this we have to change the infrastructure & requests to keep our forum on Earth


11. Domino effect
C-D-M*
If the JVM is unresponsive for several seconds and our node was the master, a new election needs to happen; the same issue can strike the newly elected master immediately afterwards, and this can lead to a lot of instability in the cluster.
*Even if it is not the master node that goes down, the rebalancing can take time and make your cluster sweat.


12. Virtualization!
Large heaps have the disadvantage of taking longer to collect, and this can cause cluster instability.
Divide and conquer:
- don't cross the 32GB limit for heap memory!
- set cluster.routing.allocation.same_shard.host
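A sketch of what those two settings could look like, assuming several virtualized nodes share one physical host:

```yaml
# Keep the heap below the ~32GB compressed-oops threshold; in ES 1.x
# this is set in the environment before starting the node, e.g.:
#   ES_HEAP_SIZE=30g

# elasticsearch.yml — make sure a primary shard and its replica
# never land on two virtual nodes of the same physical host:
cluster.routing.allocation.same_shard.host: true
```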


13. How to organize all these nodes?
Client nodes:
- they know where data is stored, can query the right shards directly, and merge the results back
- they keep data nodes protected behind a firewall, with only client nodes being allowed to talk to them
Master node:
- it is the source of truth of the cluster and manages the cluster state
Data nodes:
- the only node type that stores data; they are used for both indexing and search
*Don't use the master as a client: that can result in an unstable state after big aggregations, heavy sorting/scripts...
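A minimal sketch of the elasticsearch.yml role settings (ES 1.x/2.x style) behind each dedicated node type:

```yaml
# Dedicated client node: routes requests and merges results, holds no data
node.master: false
node.data: false

# Dedicated master-eligible node: manages cluster state only
# node.master: true
# node.data: false

# Dedicated data node: indexes and searches, never elected master
# node.master: false
# node.data: true
```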


14. C  C  C  C  C
M*  M  M
D  D  D  D  D  D  D
...


15. Helpful tips
- By setting the minimum number of master-eligible nodes to 2, the cluster will still be able to work after the loss of one of the master nodes
- Leave half of the system's memory to the filesystem cache
- Set a small heap size (e.g. 1GB might be enough) for dedicated master nodes so that they don't suffer from garbage collection pauses
- If the HTTP module is not disabled on the master-eligible nodes, they can also serve as result servers, collecting shard responses from other nodes and sending the merged result back to clients, without having to search or index themselves
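Those tips could translate into something like the following for a dedicated master node (a sketch assuming three master-eligible nodes and ES 1.x settings):

```yaml
# elasticsearch.yml for a dedicated master-eligible node
node.master: true
node.data: false

# Quorum = (3 master-eligible nodes / 2) + 1 = 2: the cluster keeps
# working after losing one master node, without risking split-brain
discovery.zen.minimum_master_nodes: 2

# Small heap for the master, set in the environment: ES_HEAP_SIZE=1g
```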


16. Keep balance in shards
A shard must be small enough that the hardware handling it can cope. There is no technical limit on the size of a shard, but there is a limit to how big a shard can be with respect to your hardware.
- In our example we are going to keep each shard's size between 1 & 4GB; this allows us to have fast queries and to recover quickly after a restart or when a node goes down
- If the shards grow too big, you have the option of rebuilding the entire Elasticsearch index with more shards to scale out horizontally, or of splitting your index (per time period, per user...)
Attention: once you're dealing with too many shards, the benefit of distributing the data gets pitted against the cost of coordinating between all of them, and that cost becomes significant.
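As a sketch, splitting the index per time period could look like this (the index name and shard count are assumptions, not from the deck):

```json
PUT /forum-2015.09
{
  "settings": {
    "number_of_shards": 6,
    "number_of_replicas": 1
  }
}
```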


17. Field data & cache
- Fielddata is used in sorting, aggregations & scripts and can take a lot of RAM, so it makes sense to disable fielddata loading on fields that don't need it, for example those that are used for full-text search only
- The result of each request is cached and reused for future requests, improving query performance, but building up and evicting filters over and over again for a continuous period can lead to some very long garbage collections
- In Elasticsearch 2.0 filter caching changed: it keeps track of the 256 most recently used filters and only caches those that appear 5 times or more; ES 2.0 prefers to be sure filters are reused before caching them
- If there are filters that you do not reuse, turn off caching explicitly
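A sketch of disabling fielddata on a search-only field, using the ES 1.x mapping syntax (the index, type and field names are assumptions):

```json
PUT /forum/_mapping/discussion
{
  "properties": {
    "body": {
      "type": "string",
      "fielddata": { "format": "disabled" }
    }
  }
}
```

Likewise, in ES 1.x a non-reused filter could opt out of caching by adding `"_cache": false` to the filter body.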


18. We're hiring Elasticsearch devs!!
Thanks to Adrien Grand, the core training, and the ES support team for their help
[email protected] - 21 Sept 2015
