
Elasticsearch in high-traffic website

Camilo Sierra
September 22, 2015


1. How to get a stable Elasticsearch cluster on a high-traffic website?


2. - Web agency founded in 2007, 20 developers (Android/iOS/backend), working on mobile apps: Air France, SeLoger, L'Oréal, Louis Vuitton, ING Direct, DirectAssurance, ...
_type = < FactoricsProject >
- Collect and store traffic data (anonymous) for all apps in an Elasticsearch cluster
- Serve and display data on a web backoffice
- Use collected data to create (re)targeted in-app or push notification campaigns
_index = < MyStudioFactory >


3. _id = < camilo_sierra >
Lead Dev Search - blueKiwi Software - 7 nodes
Lead Dev Elasticsearch - MyStudioFactory - 1 node (365 documents * 7)
- Integration and management of Exalead
- Migration from Exalead to Elasticsearch
- Migration from MySQL (as analytics server) to Elasticsearch
- Proposed and built a stable Elasticsearch cluster design for real-time analytics (Logstash, Kafka...)
- Push campaigns using ES to filter and get mobile device tokens
[email protected]


4. What do we want when we use Elasticsearch?


5. Speed
Cluster stability
- Queries < cache / fielddata >
- Infrastructure < Master-Data-Client >
- How many shards?


6. Let's explain using an example...


7. Imagine that we have a forum to index in Elasticsearch:
- Personal information for each user
- Discussions and comments
- Information about the groups/circles that link users


8. Server 1 / Server 2 / Server 3
C-D-(M)   C-D-(M)   C-D-M*
C = Client
D = Data
M* = elected Master
M = eligible as Master
For servers 1, 2 and 3:
CPU: 10 physical cores @ 2.80GHz
RAM: 256GB or more...
Disks: SSD, 300GB or more...
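As a sketch, the three identical nodes above could run with the default role settings made explicit in elasticsearch.yml (cluster and node names here are assumptions, not from the deck):

```yaml
# elasticsearch.yml on each of the three servers (ES 1.x style)
cluster.name: forum-cluster
node.name: server-1          # server-2 / server-3 on the other machines

# Defaults shown explicitly: every node holds data and is
# master-eligible, so each one acts as Client-Data-(Master) at once.
node.master: true
node.data: true
```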


9. Peak hour at 5pm, with 75% of users connected:
- publish & comment on discussions
- search in discussions & files
- profile updates
- create & join new groups/circles
- add discussions to favorites...


10. What can happen at 5pm
- The heap skyrockets!
- Garbage collection activity increases, causing increased CPU usage
- To avoid this we have to change the infrastructure & requests to keep our forum on Earth


11. Domino effect
C-D-M*
If the JVM is unresponsive for several seconds and our node was the master, a new election needs to happen; the same issue can strike the newly elected master immediately afterwards, and this can lead to a lot of instability in the cluster.
*Even if it is not the master node that goes down, the rebalancing can take time and make your cluster sweat.


12. Virtualization!
Large heaps have the disadvantage of taking longer to collect, and this can cause cluster instability.
Divide and conquer:
- don't cross the 32GB limit for heap memory!
- set cluster.routing.allocation.same_shard.host
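A sketch of what those two settings could look like, assuming several virtualized nodes share one physical host:

```yaml
# Keep the heap below the ~32GB compressed-oops threshold; in ES 1.x
# this is set in the environment before starting the node, e.g.:
#   ES_HEAP_SIZE=30g

# elasticsearch.yml — make sure a primary shard and its replica
# never land on two virtual nodes of the same physical host:
cluster.routing.allocation.same_shard.host: true
```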


13. How to organize all these nodes?
Client nodes:
- they know where data is stored, can query the right shards directly, and merge the results back
- they keep data nodes protected behind a firewall, with only client nodes being allowed to talk to them
Master node:
- it is the source of truth of the cluster and manages the cluster state
Data nodes:
- the only node type that stores data; they are used for both indexing and search
*Don't use the master as a client: that can result in an unstable state after big aggregations, heavy sorting/scripts...
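A minimal sketch of the elasticsearch.yml role settings (ES 1.x/2.x style) behind each dedicated node type:

```yaml
# Dedicated client node: routes requests and merges results, holds no data
node.master: false
node.data: false

# Dedicated master-eligible node: manages cluster state only
# node.master: true
# node.data: false

# Dedicated data node: indexes and searches, never elected master
# node.master: false
# node.data: true
```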


14. C  C  C  C  C
M*  M  M
D  D  D  D  D  D  D
...


15. Helpful tips
- By setting the minimum number of master-eligible nodes to 2, the cluster will still be able to work after the loss of one of the master nodes
- Leave half of the system's memory to the filesystem cache
- Set a small heap size (e.g. 1GB might be enough) for dedicated master nodes so that they don't suffer from garbage collection pauses
- If the HTTP module is not disabled on the master-eligible nodes, they can also serve as result servers, collecting shard responses from other nodes and sending the merged result back to clients, without having to search or index themselves
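Those tips could translate into something like the following for a dedicated master node (a sketch assuming three master-eligible nodes and ES 1.x settings):

```yaml
# elasticsearch.yml for a dedicated master-eligible node
node.master: true
node.data: false

# Quorum = (3 master-eligible nodes / 2) + 1 = 2: the cluster keeps
# working after losing one master node, without risking split-brain
discovery.zen.minimum_master_nodes: 2

# Small heap for the master, set in the environment: ES_HEAP_SIZE=1g
```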


16. Keep balance in shards
A shard must be small enough that the hardware handling it can cope. There is no technical limit on the size of a shard, but there is a limit to how big a shard can be with respect to your hardware.
- In our example we are going to keep each shard's size between 1 & 4GB; this allows us to have fast queries and to recover quickly after a restart or when a node goes down
- If the shards grow too big, you have the option of rebuilding the entire Elasticsearch index with more shards to scale out horizontally, or of splitting your index (per time period, per user...)
Attention: once you're dealing with too many shards, the benefit of distributing the data gets pitted against the cost of coordinating between all of them, and that cost becomes significant.
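As a sketch, splitting the index per time period could look like this (the index name and shard count are assumptions, not from the deck):

```json
PUT /forum-2015.09
{
  "settings": {
    "number_of_shards": 6,
    "number_of_replicas": 1
  }
}
```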


17. Field data & cache
- Fielddata is used in sorting, aggregations & scripts and can take a lot of RAM, so it makes sense to disable fielddata loading on fields that don't need it, for example those that are used for full-text search only
- The result of each request is cached and reused for future requests, improving query performance, but building up and evicting filters over and over again for a continuous period can lead to some very long garbage collections
- In Elasticsearch 2.0 filter caching changed: it keeps track of the 256 most recently used filters and only caches those that appear 5 times or more; ES 2.0 prefers to be sure filters are reused before caching them
- If there are filters that you do not reuse, turn off caching explicitly
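A sketch of disabling fielddata on a search-only field, using the ES 1.x mapping syntax (the index, type and field names are assumptions):

```json
PUT /forum/_mapping/discussion
{
  "properties": {
    "body": {
      "type": "string",
      "fielddata": { "format": "disabled" }
    }
  }
}
```

Likewise, in ES 1.x a non-reused filter could opt out of caching by adding `"_cache": false` to the filter body.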


18. We're hiring Elasticsearch devs!!
Thanks to Adrien Grand, the core training, and the ES support team for their help
[email protected] - 21 Sept 2015
