Slide 1

How to get a stable Elasticsearch cluster for a high-traffic website?

Slide 2

_index = <MyStudioFactory>
- Web agency founded in 2007; 20 developers (Android/iOS/backend) working on mobile apps: Air France, SeLoger, L'Oréal, Louis Vuitton, ING Direct, DirectAssurance, ...

_type = <FactoricsProject>
- Collect and store (anonymous) traffic data for all apps in an Elasticsearch cluster
- Serve and display the data on a web backoffice
- Use the collected data to create (re)targeted in-app or push notification campaigns

Slide 3

_id = <camilo_sierra>
- Lead Dev Search - blueKiwi Software - 7 nodes
- Lead Dev Elasticsearch - MyStudioFactory - 1 node (365 documents * 7)
- Integration and management of Exalead
- Migration from Exalead to Elasticsearch
- Migration from MySQL (as an analytics server) to Elasticsearch
- Proposed and built a stable Elasticsearch cluster design for real-time analytics (Logstash, Kafka...)
- Push campaigns using ES to filter and get mobile device tokens
[email protected]

Slide 4

What do we want when we use Elasticsearch?

Slide 5

Speed
- Queries <cache / fielddata>

Cluster stability
- Infrastructure <Master-Data-Client>
- How many shards?

Slide 6

Let's explain using an example...

Slide 7

Imagine that we have a forum to index in Elasticsearch:
- Personal information for each user
- Discussions and comments
- Information about groups/circles to link users
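To make the example concrete, here is a hypothetical discussion document such a forum might index (the forum index, discussion type and field names are assumptions for illustration, not taken from the talk):

PUT /forum/discussion/1
{
  "author": "jdoe",
  "title": "Welcome to the forum",
  "body": "First discussion of the day...",
  "groups": ["coffee-lovers"],
  "comments": [
    { "author": "asmith", "body": "Glad to be here!" }
  ],
  "created_at": "2015-09-21T17:00:00Z"
}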

Slide 8

Server 1: C-D-(M)    Server 2: C-D-(M)    Server 3: C-D-M*
C = Client, D = Data, M* = elected Master, M = eligible as Master

For servers 1, 2 and 3:
- CPU: 10 physical cores @ 2.80GHz
- RAM: 256GB or more...
- Disks: SSD, 300GB or more...
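In elasticsearch.yml terms, each of the three servers starts out running a single node that wears every hat at once; a minimal sketch, assuming a cluster name and host names that are not in the talk:

# elasticsearch.yml on server 3 (servers 1 and 2 are identical apart from node.name)
cluster.name: forum-cluster
node.name: server-3
node.master: true     # M: eligible to be elected master
node.data: true       # D: stores, indexes and searches data
# HTTP stays enabled, so the node also answers client requests directly (C)
discovery.zen.ping.unicast.hosts: ["server-1", "server-2", "server-3"]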

Slide 9

Peak hour at 5pm with 75% of users connected:
- Publish & comment on discussions
- Search in discussions & files
- Profile updates
- Create & join new groups/circles
- Add discussions to favorites...

Slide 10

What can happen at 5pm?
- The heap skyrockets!
- Garbage collection activity increases, causing increased CPU usage
- To avoid this we have to change the infrastructure & the requests, to keep our forum on Earth

Slide 11

Domino effect (C-D-M*)

If the JVM stops responding for several seconds and that node was the master, a new election has to happen, and the same issue can hit the newly elected master right away; this can lead to a lot of instability in the cluster.

*Even if the node that goes down is not the master, rebalancing its shards can take time and make your cluster sweat.

Slide 12

Virtualization! Divide and conquer
- Large heaps have the disadvantage of taking longer to collect, and this can cause cluster instability
- Don't cross the 32GB limit for heap memory!
- Set cluster.routing.allocation.same_shard.host so that two copies of the same shard never land on nodes sharing the same physical host
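A sketch of the corresponding settings, assuming several Elasticsearch VMs per physical host (the exact heap value is illustrative):

# Node environment: stay below the ~32GB threshold so the JVM keeps compressed object pointers
ES_HEAP_SIZE=30g

# elasticsearch.yml: refuse to put two copies of the same shard on VMs that share a physical host
cluster.routing.allocation.same_shard.host: true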

Slide 13

How to organize all these nodes?

Client nodes:
- They know where the data is stored, so they can query the right shards directly and merge the results back
- They keep the data nodes protected behind a firewall, with only client nodes being allowed to talk to them

Master node:
- It is the source of truth of the cluster and manages the cluster state

Data nodes:
- The only node type that stores data; they are used for both indexing and search

*Don't use the master as a client: that can result in an unstable state after big aggregations, heavy sorting/scripts...
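In elasticsearch.yml these roles come down to two boolean settings per node; a minimal sketch of the three variants described above:

# Client node (C): routes requests and merges results, holds no data, never elected master
node.master: false
node.data: false

# Dedicated master-eligible node (M): manages the cluster state only
node.master: true
node.data: false

# Data node (D): indexes and searches, stays out of master elections
node.master: false
node.data: true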

Slide 14

[Diagram: target topology with dedicated client nodes (C), three master-eligible nodes (one elected master, M*), and a growing pool of data nodes (D)...]

Slide 15

Helpful tips
- By setting the minimum number of master-eligible nodes to 2, the cluster will still be able to work if one of the master nodes is lost
- Leave half of the system's memory to the filesystem cache
- Set a small heap size (e.g. 1GB might be enough) for dedicated master nodes so that they don't suffer from garbage collection pauses
- If the HTTP module is not disabled on the master-eligible nodes, they can also serve as result servers: they collect shard responses from other nodes and send the merged result back to clients, without having to search or index themselves
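On the dedicated master-eligible nodes, these tips translate roughly into the following sketch (the 1GB heap comes from the slide; everything else is an assumption):

# Node environment: a small heap is enough for a dedicated master
ES_HEAP_SIZE=1g

# elasticsearch.yml: with 3 master-eligible nodes, a quorum of 2 keeps the cluster working after losing one
discovery.zen.minimum_master_nodes: 2
# http.enabled defaults to true; leave it on if these nodes should also merge shard responses for clients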

Slide 16

Keep the shards balanced
- A shard must be small enough that the hardware handling it can cope. There is no technical limit on the size of a shard; the limit is how big a shard can get with respect to your hardware.
- In our example we keep each shard between 1 and 4GB; this allows us to have fast queries and to recover shards quickly after a restart or when a node goes down
- If the shards grow too big, you have the option of rebuilding the entire Elasticsearch index with more shards to scale out horizontally, or of splitting your index (per time period, per user...)
- Attention: once you're dealing with too many shards, the benefit of distributing the data gets pitted against the cost of coordinating between all of them, and that coordination incurs a significant cost
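For instance, splitting the forum index per month and fixing the shard count at creation time could look like this (the index name, shard and replica counts are purely illustrative):

PUT /forum-2015.09
{
  "settings": {
    "number_of_shards": 6,
    "number_of_replicas": 1
  }
}

Here number_of_shards is chosen so that, at the expected index size, each shard stays roughly within the 1-4GB range mentioned above; it cannot be changed on an existing index without reindexing, so it is worth estimating before the index goes live.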

Slide 17

Fielddata & cache
- Fielddata is used for sorting, aggregations & scripts and can take a lot of RAM, so it makes sense to disable fielddata loading on fields that don't need it, for example those that are used for full-text search only
- The result of each filter is cached and reused for future requests, improving query performance, but building up and evicting filters over and over for a continuous period of time can lead to some very long garbage collections
- In Elasticsearch 2.0 filter caching changed: it keeps track of the 256 most recently used filters and only caches those that appear 5 times or more; ES 2.0 prefers to be sure filters are reused before caching them
- If there are filters that you do not reuse, turn off caching explicitly
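A sketch of a mapping that disables fielddata on a field used only for full-text search, using the 1.x/2.x-era string type (index, type and field names are assumptions):

PUT /forum-2015.09/_mapping/discussion
{
  "properties": {
    "body": {
      "type": "string",
      "fielddata": { "format": "disabled" }
    }
  }
}

With this mapping, sorting or aggregating on body fails fast instead of silently loading gigabytes of fielddata into the heap.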

Slide 18

We're hiring Elasticsearch devs!!
- Thanks to Adrien Grand, the core training, and the ES support team for their help
- [email protected]
- 21 Sept 2015