
Managing and Tuning Elasticsearch & Logstash

An approach to managing and tuning Elasticsearch and Logstash. Explains why the answers aren't cut and dried, and how to assess the specific needs of your cluster.

Aaron Mildenstein

July 14, 2016

Transcript

  1. Managing and Tuning Elasticsearch & Logstash
     by Aaron Mildenstein


  2. It depends...
    The default answer to every question.


  3. It depends...
    • Ingest rate
    • Search rate
    • Aggregations
    • Hardware (spinning disks or SSDs)
    • Document size
    • Number of fields
    • Heap size
    • Shard and replica count
    • Shard size

  4. Why does it depend?
    • Size of shards
    • Number of shards per node
    • Document size
    • Index mapping
    • Which fields are searchable? How big are they?
    • Are multi-fields being automatically created?
    • Are you storing the "message" field or the _all field?
    • Hardware considerations
    • SSD vs. spinning disk, CPU, RAM, node count, etc.

  5. Elasticsearch Basics
     These are some of the universal recommendations


  6. Memory
    • Use up to half of system memory via ES_HEAP_SIZE
    • Keep the heap below ~30.5GB to avoid
    • 64-bit, uncompressed object pointers (compressed oops are disabled above that threshold)
    • Potentially long GC times
    • Avoid G1GC
    • Ignore this at your own peril
    • Disallow swap if possible.
    • Use bootstrap.mlockall if need be.
    • Add bootstrap.mlockall: true to elasticsearch.yml
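     A minimal sketch of the heap and mlockall settings above, assuming a Debian-style package install on a 64G machine (paths and values are assumptions):
     # Give Elasticsearch half of the machine's memory, staying under the compressed-oops ceiling
     echo 'ES_HEAP_SIZE=30g' | sudo tee -a /etc/default/elasticsearch
     # Keep the heap from being swapped out
     echo 'bootstrap.mlockall: true' | sudo tee -a /etc/elasticsearch/elasticsearch.yml
     # After a restart, verify that mlockall actually took effect
     curl -s 'localhost:9200/_nodes/process?pretty' | grep mlockall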

  7. Disk
    • Always try to use SSDs
    • It's all about the IOPS
    • Better able to handle traffic spikes
    • Use the noop scheduler when using SSD:
    • echo noop > /sys/block/{DEVICE}/queue/scheduler
    • If you can't have all SSD, aim for a hot/cold cluster architecture
    • Use SSDs for "hot" nodes
    • Use spinning disks for "cold" nodes
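     A rough sketch of the hot/cold tiering above for Elasticsearch 2.x (the box_type attribute and index names are assumptions; any custom node attribute works):
     # elasticsearch.yml on SSD-backed nodes:     node.box_type: hot
     # elasticsearch.yml on spinning-disk nodes:  node.box_type: cold
     # Keep today's index on the hot tier while it is being written to:
     curl -XPUT 'localhost:9200/logstash-2016.07.14/_settings' -d '{"index.routing.allocation.require.box_type": "hot"}'
     # Once it goes quiet, let it migrate to the cold tier:
     curl -XPUT 'localhost:9200/logstash-2016.07.14/_settings' -d '{"index.routing.allocation.require.box_type": "cold"}'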

  8. Disk (continued)
    • Avoid network storage for index data
    • NFS, AWS EFS, Azure Filesystem
    • Storage reliability is less of a concern
    • Replicas provide HA, so don't worry about using RAID 1/5/10
    • Local disk > SAN
    • RAID0 vs. path.data
    • RAID0 is more performant
    • path.data allows a node to continue to function if a single disk fails (only the shards on that disk are lost)
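     A minimal sketch of the two layouts, assuming four local data disks (mount points and paths are placeholders):
     # Option A: one RAID0 volume presented to Elasticsearch (fastest; a single disk failure loses all local data)
     #   path.data: /raid0/elasticsearch
     # Option B: list each disk; each shard lives entirely on one path, so a single disk failure only loses the shards on it
     echo 'path.data: ["/data1", "/data2", "/data3", "/data4"]' | sudo tee -a /etc/elasticsearch/elasticsearch.yml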

  9. Bare metal vs. VM
    • VM Pros
    • Scale and deployment are easy
    • VM Cons
    • "Noisy neighbors"
    • Networked storage
    • Potential cluster instability if a VM host dies

  10. Hardware selection
    • Large > Extra large
    • This: 4 core, 64G + 4 1TB drives
    • Not this: 12 core, 256G + 12 1TB drives
    • Multiple nodes per physical machine is possible, but discouraged
    • A single instance can saturate a machine
    • If compelled, larger machines may be useful for "cold" storage nodes.

  11. File descriptors
    • The default is usually low
    • Increase to 32k, 64k, possibly even unlimited
    • See /etc/security/limits.conf
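     A sketch of raising and verifying the limit (the 65536 value and the elasticsearch user are examples):
     # /etc/security/limits.conf
     #   elasticsearch  soft  nofile  65536
     #   elasticsearch  hard  nofile  65536
     # Confirm what the running node actually received:
     curl -s 'localhost:9200/_nodes/process?pretty' | grep max_file_descriptors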

  12. Network
    • Faster is better, of course
    • Avoid linking nodes via WAN
    • Try to have zero or very few hops between nodes
    • Ideally, separate transport and http traffic
    • Bind to different interfaces
    • Separate firewall rules for each kind
    • Use long lived HTTP connections
    • The official client libraries support this
    • If not possible, consider using a proxy or load-balancer with a server-side keep-alive
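     A minimal sketch of binding transport and HTTP traffic to different interfaces (addresses are placeholders; setting names per Elasticsearch 2.x):
     # elasticsearch.yml
     #   transport.host: 10.0.0.5     # node-to-node traffic on the private interface
     #   http.host: 192.168.1.5       # REST/client traffic on the client-facing interface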

  13. Tuning Elasticsearch
      We've only just begun...


  14. Capacity Planning
    • Determine storage needs
    • Calculate sharding
    • Index rate
    • Document count
    • Cluster/Node limitations
    • https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing
    • More in depth than today

  15. Storage estimation
    • Create an index with 1 shard, 0 replicas
    • Throw data at it
    • Calculate storage before and after a _forcemerge operation
    • Lather, rinse, repeat with different mapping configurations
    • With _all enabled/disabled (iterate with both!)
    • Try as many different settings per field as makes sense
    • When omitting _all, ensure searching works the way you expect
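     A rough sketch of one iteration of the storage test (index name and document source are placeholders):
     # 1-shard, 0-replica test index
     curl -XPUT 'localhost:9200/sizing-test' -d '{"settings": {"number_of_shards": 1, "number_of_replicas": 0}}'
     # ... bulk index a known number of representative documents ...
     # Compare on-disk size before and after merging down to one segment
     curl 'localhost:9200/_cat/indices/sizing-test?v&h=index,docs.count,store.size'
     curl -XPOST 'localhost:9200/sizing-test/_forcemerge?max_num_segments=1'
     curl 'localhost:9200/_cat/indices/sizing-test?v&h=index,docs.count,store.size'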

  16. Shard performance estimation
    • Create an index with 1 shard, 0 replicas
    • Index real or at least realistic data
    • Query with real or at least realistic queries
    • Measure and plot index and search time
    • Determine the sweet spot
    • Diminishing returns
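     A sketch of the measurement step above (the query and field name are placeholders; "took" is the server-side query time in milliseconds):
     curl -s 'localhost:9200/sizing-test/_search?filter_path=took' -d '{"query": {"match": {"message": "error"}}}'
     # Re-run at intervals as the shard grows and plot "took" (and indexing throughput) against shard size to find the knee of the curve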

  17. Shard count per node performance estimation
    • Create an index with 2 shards, 0 replicas (single node)
    • Repeat previous experiment
    • Did performance vary? How much?
    • Lather, rinse, repeat:
    • Increase shard count by 1
    • Where is the sweet spot?

  18. Simulate in a small cluster
    • Take what you learned from the previous tests and:
    • Configure a small cluster
    • Add real or at least realistic data
    • Benchmark
    • Indexing
    • Querying
    • Both, at varying levels
    • Resource usage (disk, memory, document count)

  19. Shards, Indices, & Production Management
      Keep your cluster happy


  20. Shard management
    • Shard count per node matters
    • Resources are required to keep a shard "active"
    • Memory, CPU, I/O
    • Shards are not just data storage
    • Nodes cannot sustain an unlimited count of shards
    • Even if there's still disk space
    • That's like saying...

  21. I can't be out of money! I still have checks in my checkbook!


  22. Indexing buffer
    • indices.memory.index_buffer_size
    • Default is 10% of the heap
    • Shared by all "active" indices on the node
    • Each "active" shard wants 250M of the buffer
    • Will be compressed/reduced if there is memory pressure
    • indices.memory.min_shard_index_buffer_size
    • Default is 4MB per shard
    • Will be compressed/reduced if there is memory pressure
    • "inactive" shards still consume this memory
    • Indexing stops if the buffer is exhausted
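     A minimal sketch of adjusting the shared indexing buffer in elasticsearch.yml (the 20% figure is purely illustrative; 10% is the default):
     # Give heavy-ingest nodes a larger shared indexing buffer
     echo 'indices.memory.index_buffer_size: 20%' | sudo tee -a /etc/elasticsearch/elasticsearch.yml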

  23. Index management
    • Open/Close
    • Hot/Cold
    • Snapshot/Restore
    • Coming in 5.0: Shrink API
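     A sketch of these operations with curl (index and repository names are placeholders; the snapshot repository must already be registered):
     # Close a quiet index to free resources, reopen it on demand
     curl -XPOST 'localhost:9200/logstash-2016.06.01/_close'
     curl -XPOST 'localhost:9200/logstash-2016.06.01/_open'
     # Snapshot it before deleting the live copy
     curl -XPUT 'localhost:9200/_snapshot/my_backup/logstash-2016.06.01?wait_for_completion=true' -d '{"indices": "logstash-2016.06.01"}'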

  24. Index management: Shrink API
    • New in Elasticsearch 5.0
    • First, it creates a new target index with the same definition as the source index, but with a smaller number of primary shards.
    • Then it hard-links segments from the source index into the target index. (If the file system doesn't support hard-linking, then all segments are copied into the new index, which is a much more time consuming process.)
    • Finally, it recovers the target index as though it were a closed index which had just been re-opened.
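     A rough sketch of the 5.0 workflow (index and node names are placeholders; the API was still pre-release at the time of this talk):
     # 1. Make the source read-only and co-locate all of its shards on one node
     curl -XPUT 'localhost:9200/logstash-2016.07.14/_settings' -d '{"index.blocks.write": true, "index.routing.allocation.require._name": "shrink-node-1"}'
     # 2. Shrink the primaries down to 1
     curl -XPOST 'localhost:9200/logstash-2016.07.14/_shrink/logstash-2016.07.14-shrunk' -d '{"settings": {"index.number_of_shards": 1}}'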

  25. Index management: Elasticsearch Curator
    • Alias
    • Allocation
    • Open/Close indices
    • Delete (indices and snapshots)
    • Create Index
    • Forcemerge
    • Change replica count
    • Snapshot/Restore

  26. Tuning Logstash


  27. Logstash is...
    • A text processing tool
    • Regular expressions
    • Plugins
    • Inputs
    • Filters
    • Outputs
    • Codecs
    • Turns unstructured data into structured data
    • Can be very CPU intensive

  28. What not to do
    • Blindly increase workers (-w flag)
    • Blindly increase pipeline batch size (-b flag)
    • Add large amounts of heap space
    There is a way to methodically improve performance!

  29. Tuning guide
    • Check your inputs and outputs (or possibly filters and codecs).
    • Logstash is only as fast as the slowest plugin in the pipeline
    • Check system statistics
    • CPU
    • Use top -H to see busy threads
    • If CPU usage is high but throughput is still slow, then look to the JVM section
    • Memory
    • Logstash uses a JVM. If there isn't enough heap space, swapping could be going on, which will slow things down.

  30. Tuning guide
    • Check system statistics (continued)
    • I/O
    • Ensure the disk isn't saturated
    • Ensure the network isn't saturated
    • JVM Heap
    • CPU utilization will likely be quite high if the heap is too small, due to constant garbage collection.
    • Can test by doubling the heap size and testing performance (leave some for the OS and other processes)
    • Use jmap and/or other tools to measure heap usage and GC behavior (see the sketch below)
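     A sketch of those checks, assuming the JDK tools are installed and Logstash runs as an ordinary process (the PID lookup is an assumption):
     LS_PID=$(pgrep -f logstash | head -1)
     top -H -p "$LS_PID"           # busy threads: are workers pegged on CPU, or idle waiting on I/O?
     jmap -heap "$LS_PID"          # heap sizing and current usage
     jstat -gcutil "$LS_PID" 1000  # GC activity once per second; constant full GCs suggest the heap is too small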

  31. Tuning guide
    • Tune worker settings
    • Increase the number of pipeline workers using the -w flag.
    • It is safe to scale this up to a multiple of CPU cores if need be, as the threads can become idle on I/O.
    • Increase the number of "output" workers in the configuration
    • workers => 2
    • Do not make this value larger than the number of pipeline workers
    • Tune output batch size
    • Only available in some outputs.
    • flush_size in the Elasticsearch output
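     A rough sketch of what a tuned Logstash 2.x invocation might look like (all values are illustrative, not recommendations):
     bin/logstash -w 4 -b 250 -f pipeline.conf
     # pipeline.conf, output section:
     #   output {
     #     elasticsearch {
     #       hosts      => ["es-node-1:9200"]
     #       workers    => 2      # output workers; keep at or below the -w pipeline worker count
     #       flush_size => 500    # documents per bulk request
     #     }
     #   }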