The Usual Suspects: Automatic Alerts to Monitor your Cluster

Elastic Co
March 09, 2017

When monitoring met alerting, the average time spent to troubleshoot went down and the average sleep time went up. True story.

X-Pack, which debuted with the 5.0 release of the Elastic Stack, brings monitoring and alerting features together to enable built-in cluster alerts. Chris and Bohyun will cover the latest in monitoring and management in the first portion of the talk; Antonio will then show how to solve real-world problems using monitoring data, based on customer scenarios he has helped with as part of the Elastic support team.

Antonio Bonuccelli | Support Engineer | Elastic
Chris Earle | Monitoring Lead | Elastic
Bohyun Kim | Senior Product Manager | Elastic

Transcript

  1. The Usual Suspects: Automatic Cluster Alerts to Monitor Your Cluster

    March 9th, 2017
    Bohyun Kim, Senior Product Manager, @ensheneer
    Chris Earle, Monitoring Lead, @pickypg
    Antonio Bonuccelli, Support Engineer, @nellicus
  2. We will not bore you...

    1. Marvel to X-Pack monitoring
    2. Latest in X-Pack monitoring
    3. Demo
    4. Future of X-Pack monitoring
    5. Tackling Real World Problems using X-Pack monitoring
  3. Our Team

    Chris Earle, Bohyun Kim, Tim Sullivan, Steve Kearns, Uri Boness, Jurgen Altziebler, CJ Cenizal
  4. X-Pack Extensions for the Elastic Stack

    Security, Alerting, Monitoring, Reporting, Graph Analytics
    Single install, included in Elastic Subscription
  5. X-Pack monitoring: Not just a new name

    A window into the Elastic Stack
    5.0: Kibana Monitoring / Improved HTTP Exporter
    5.1: Elasticsearch Advanced Node and Index Views
    5.2: Logstash Monitoring / Cgroup (Container-aware) Metrics
    5.4: Cluster Alerts for issues
  6. Cluster Alerts: Life Cycle

    1. Monitoring agent collects data about node(s) or instance(s)
    2. Watch executes on a schedule; looks at monitoring data and any existing alert
    3. The associated alert needs to be: ignored, created, updated, or resolved
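
The decision in the last step of the life cycle can be sketched in Python. This is only an illustration of the four outcomes named on the slide, not X-Pack's actual watch logic: the `Alert` shape, the function name `decide_alert_action`, and the rule that "updated" means "the condition still holds but its message changed" are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    """Hypothetical shape of a stored alert; not the real X-Pack document."""
    message: str
    resolved: bool = False


def decide_alert_action(condition_met: bool, message: str,
                        existing: Optional[Alert]) -> str:
    """Return one of: 'ignored', 'created', 'updated', 'resolved'."""
    active = existing is not None and not existing.resolved
    if condition_met and not active:
        return "created"   # problem seen, no open alert yet
    if condition_met and existing.message != message:
        return "updated"   # problem persists but its details changed
    if not condition_met and active:
        return "resolved"  # problem went away, close the open alert
    return "ignored"       # nothing to change
```

Each scheduled run would call this once per monitored condition, passing the alert it found (if any) from the previous run.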
  7. Monitoring the Future

    Cluster Alerts, Deeper Integration, Log Monitoring
    • More Control / Advanced
    • UI Configuration
    • In-app Documentation
    • Monitor ML/Beats
    • Machine Learning
    • Logstash Pipeline Viewer
    • Management
    • Experimentation
    • Elasticsearch
    • Alerting!
  8. Agenda and Scope

    1. For each scenario
    2. Observe symptoms
    3. Understand impact
    4. Remediation
    5. Alerting tips
  9. Real World Scenarios

    1. Running out of disk space in Elasticsearch nodes
    2. Event throughput drop following Logstash configuration change
    3. Cluster instability after misuse of _forcemerge API
  10. Background: Disk-based shard allocation using watermarks

    1. A low watermark threshold (defaults to 85% of total disk space used)
    2. A high watermark threshold (defaults to 90% of total disk space used)
    3. When the low watermark is hit: only new primaries can be allocated
    4. When the high watermark is hit: shards are relocated away
    5. Can be configured with absolute values, e.g. “50gb” (space left)
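
As a rough illustration of the two thresholds, the following Python sketch classifies a node by its disk usage. It only mirrors the percentage defaults quoted on the slide; the function name `watermark_state` and its return labels are assumptions, and absolute "space left" thresholds are not handled.

```python
def watermark_state(total_bytes: int, available_bytes: int,
                    low_pct: float = 85.0, high_pct: float = 90.0) -> str:
    """Classify a node as 'ok', 'low' or 'high' based on disk used.

    'low'  : at or above the low watermark, below the high watermark
    'high' : at or above the high watermark
    """
    used_pct = 100.0 * (total_bytes - available_bytes) / total_bytes
    if used_pct >= high_pct:
        return "high"
    if used_pct >= low_pct:
        return "low"
    return "ok"


# Hypothetical nodes, each with 100 units of disk.
for available in (16, 15, 10):
    print(available, watermark_state(100, available))
```

An early-warning alert would typically use a threshold somewhat below `low_pct`, so that there is time to act before allocation is affected.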
  11. Scenario I

    1. Indexing into hourly indices with Logstash
    2. 3-node cluster
    3. All nodes hit the low disk watermark at 17:53
    4. New indices are expected to be created at 18:00
  12. Remediation - free or add space

    1. Add more space/nodes
    2. Remove unneeded indices
    3. Remove replicas (don’t do that)
  13. Prevention

    1. Use Curator to trim your index retention
    2. Use alerting to get early warnings!
    3. Use alerting to automate remediation?
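
To make the retention idea concrete, here is a small Python sketch of the selection step only: picking hourly indices older than a retention window. It does not use Curator or its API. The index naming pattern `logstash-YYYY.MM.dd.HH`, the function name `indices_to_delete`, and the 24-hour window in the example are assumptions.

```python
from datetime import datetime, timedelta


def indices_to_delete(index_names, now, keep_hours,
                      prefix="logstash-", fmt="%Y.%m.%d.%H"):
    """Return names of hourly indices whose timestamp is older than the window."""
    cutoff = now - timedelta(hours=keep_hours)
    expired = []
    for name in index_names:
        if not name.startswith(prefix):
            continue  # not one of our time-based indices
        try:
            stamp = datetime.strptime(name[len(prefix):], fmt)
        except ValueError:
            continue  # suffix is not an hourly timestamp
        if stamp < cutoff:
            expired.append(name)
    return sorted(expired)


names = ["logstash-2017.03.09.17", "logstash-2017.03.08.17", ".kibana"]
print(indices_to_delete(names, datetime(2017, 3, 9, 18, 0), keep_hours=24))
```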
  14. Data you can alert on

    GET _nodes/stats?filter_path=nodes.*.fs.total.*
    {
      "nodes": {
        "vWXCBrlASwy19YeHturzXg": {
          "fs": {
            "total": {
              "total_in_bytes": 484644716544,
              "free_in_bytes": 220382507008,
              "available_in_bytes": 195740344320
            }
          }
        }
      }
    }
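
The response above already contains what a disk-space alert needs. As a minimal sketch, the Python below turns it into a per-node "percent used" figure and lists nodes over a threshold. The function names and the 80% early-warning threshold are assumptions; the sample is the response shown on the slide.

```python
import json

# Response body copied from the slide.
SAMPLE = """
{ "nodes": { "vWXCBrlASwy19YeHturzXg": { "fs": { "total": {
    "total_in_bytes": 484644716544,
    "free_in_bytes": 220382507008,
    "available_in_bytes": 195740344320 } } } } }
"""


def disk_used_percent(stats: dict) -> dict:
    """Map node id -> percentage of disk that is not available."""
    result = {}
    for node_id, node in stats["nodes"].items():
        total = node["fs"]["total"]
        used = total["total_in_bytes"] - total["available_in_bytes"]
        result[node_id] = 100.0 * used / total["total_in_bytes"]
    return result


def nodes_over(stats: dict, threshold_pct: float) -> list:
    """Node ids whose disk usage is at or above the threshold."""
    usage = disk_used_percent(stats)
    return sorted(n for n, pct in usage.items() if pct >= threshold_pct)


stats = json.loads(SAMPLE)
print(disk_used_percent(stats))
print(nodes_over(stats, 80.0))  # 80% is an assumed early-warning level
```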
  15. Logstash performance metrics

    filter {
      if [type] == "my_monitored_type" {
        metrics {
          meter => "events"
          add_tag => "metric"
        }
      }
    }
  16. Logstash performance metrics

    output {
      if "metric" in [tags] {
        stdout {
          codec => line { format => "rate: %{[events][rate_1m]}" }
        }
      }
    }
  17. Logstash performance metrics

    It used to be hard work - Metric filter plugin - 4/4

    % bin/logstash -f example.conf
    rate: 23721.983566819246
    rate: 24811.395722536377
    rate: 25875.892745934525
    rate: 26836.42375967113
  19. X-Pack monitoring for Logstash

    1. Performance metrics on dashboards out of the box (Events IN/OUT/LATENCY)
    2. Adds to the existing Elasticsearch and Kibana monitoring
    3. GA in 5.2 - more to come!
  20. Scenario II

    1. A configuration change goes in (using automatic config reload)
    2. From “all is good!” to a dramatic performance drop
  21. Considerations

    1. Event latency is a crucial metric
    2. Before, you couldn’t clearly tell “Is it the chicken (Logstash) or the egg (Elasticsearch)?”
    3. We can now clearly see events taking longer to go through the pipeline (regardless of destination status)
    4. Upcoming releases will bring visual plugin-level metrics
  23. X-Pack alerting

    1. Alert on event latency (e.g. greater than a fixed threshold)
    2. Alert on significant event throughput variations
  24. X-Pack alerting - plotting raw Logstash metrics from API

    Alert on event latency (e.g. greater than a threshold)
    Alert on significant event throughput variations
    Logstash periodically reports the cumulative number of events sent out since startup (first derivative is always greater than or equal to zero)
  25. X-Pack alerting - plotting raw Logstash metrics from API

    Alert on event latency (e.g. greater than a threshold)
    Alert on significant event throughput variations
    Logstash periodically reports the cumulative number of events sent out since startup (first derivative is always greater than or equal to zero)
    Rate of change = ?
  27. X-Pack alerting - Pipeline aggregation: first derivative

    Alert on event latency (e.g. greater than a threshold)
    Alert on significant event throughput variations
    The Logstash stats API reports the cumulative number of events sent out since startup (first derivative is always greater than or equal to zero)
    [Chart annotation: Throughput Increasing]
  28. X-Pack alerting - Pipeline aggregation: first derivative

    Alert on event latency (e.g. greater than a threshold)
    Alert on significant event throughput variations
    The Logstash stats API reports the cumulative number of events sent out since startup (first derivative is always greater than or equal to zero)
    [Chart annotations: Throughput Increasing, Throughput Steady]
  29. X-Pack alerting - Pipeline aggregation: first derivative

    Alert on event latency (e.g. greater than a threshold)
    Alert on significant event throughput variations
    The Logstash stats API reports the cumulative number of events sent out since startup (first derivative is always greater than or equal to zero)
    [Chart annotations: Throughput Increasing, Throughput Steady]
    Logstash periodically reports events sent out since startup (first derivative is always greater than or equal to zero)
  31. X-Pack alerting - Data you can alert on

    "aggs": {
      "date_histogram": {
        "avgEventOut": { "avg": { "field": "logstash_stats.events.out" } },
        "first_deriv": { "derivative": { "buckets_path": "avgEventOut" } },
        "second_deriv": { "derivative": { "buckets_path": "first_deriv" } …
  32. X-Pack alerting - Data you can alert on

    {
      "key_as_string": "2017-03-03T01:00:00.000Z",
      "key": 1488502800000,
      "doc_count": 720,
      "avgEventOut": { "value": 46898292.92361111 },
      "hour_rate_deriv": { "value": 1421203.4340422675 },
      "2nd_hour_rate_deriv": { "value": -6672806.213859908 }
    }
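
The derivative idea can be reproduced outside Elasticsearch with a few lines of Python. This is a sketch of the arithmetic only, not of the derivative pipeline aggregation itself: given the per-bucket value of the cumulative "events out" counter, the first difference is the throughput per bucket and the second difference shows whether throughput is rising or falling. The function names, the sample numbers and the `min_drop` threshold are assumptions.

```python
def derivative(values):
    """Difference between consecutive buckets; None where it is undefined."""
    if not values:
        return []
    out = [None]  # the first bucket has no predecessor
    for prev, cur in zip(values, values[1:]):
        out.append(None if prev is None or cur is None else cur - prev)
    return out


def throughput_drops(cumulative_out, min_drop):
    """Indices of buckets where throughput fell by at least min_drop."""
    first = derivative(cumulative_out)   # events per bucket
    second = derivative(first)           # change in events per bucket
    return [i for i, d in enumerate(second)
            if d is not None and d <= -min_drop]


# Hypothetical cumulative counter sampled once per bucket.
counter = [0, 100, 250, 400, 450, 460]
print(derivative(counter))
print(derivative(derivative(counter)))
print(throughput_drops(counter, min_drop=50))
```

A strongly negative second derivative, like the one in the bucket shown on the slide, is the kind of value a watch condition could compare against a threshold.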
  33. Forcemerge API

    1. From many little segments to a few big ones
    2. Reduces search overhead, thus improving search performance
    3. It is very CPU and I/O intensive, and unbounded
    4. Do not run on nodes doing active indexing
    5. “Hot-Warm” architecture / off business hours
  34. Forcemerge API

    1. Can severely impact a cluster, absorbing most of its CPU and I/O bandwidth
    2. Bulk index rejections/query timeouts can be expected
    3. Can last hours, proportional to the number of segments to be processed
  35. Hot/Warm architecture

    [Diagram: Elasticsearch cluster with a HOT tier receiving bulk indexing and a WARM tier of warm data nodes, with X-Pack on each tier. Shards qualifying for warm (e.g. older than 3 days) are relocated.]
  36. Hot/Warm architecture

    [Diagram: Elasticsearch cluster with a HOT tier receiving bulk indexing and a WARM tier of warm data nodes, with X-Pack on each tier. Shards qualifying for warm (e.g. older than 3 days) are relocated.]
    _forcemerge on hot indices?
  38. Hot Threads API

    1. Use the _nodes/hot_threads API to see where the CPU is going
    2. An API that returns the current hot threads on each node in the cluster
    3. The output is plain text with a breakdown of each node’s top hot threads
  39. Hot Threads API

    :::{node1}{l4X6KlaxTEWdakFLVbiwoQ}{IbRuacjMTWyffEMjgliYvA}{w530}{192.168.1.8:9301}{zone=zone_B}
       Hot threads at 2017-02-16T16:27:57.583Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

       101.0% (505.2ms out of 500ms) cpu usage by thread 'elasticsearch[node1][[.security_audit_log-2017.02.13][0]: Lucene Merge Thread #0]'
         3/10 snapshots sharing following 13 elements
           org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlocks(BlockTreeTermsWriter.java:635)
           71)
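
Because the hot threads output is plain text, spotting merge activity in it takes a little parsing. The Python sketch below extracts the CPU percentage and thread name from lines shaped like the one on the slide and keeps only busy Lucene merge threads. The regular expression is written against this one sample, so treat it as an assumption rather than a complete parser; the function names and the 50% cut-off are also assumptions.

```python
import re

# Excerpt of the slide's output (stack frames omitted).
SAMPLE = """\
:::{node1}{l4X6KlaxTEWdakFLVbiwoQ}{IbRuacjMTWyffEMjgliYvA}{w530}{192.168.1.8:9301}{zone=zone_B}
   Hot threads at 2017-02-16T16:27:57.583Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

   101.0% (505.2ms out of 500ms) cpu usage by thread 'elasticsearch[node1][[.security_audit_log-2017.02.13][0]: Lucene Merge Thread #0]'
     3/10 snapshots sharing following 13 elements
"""

# Matches: "<pct>% (<busy time>) cpu usage by thread '<name>'"
THREAD_LINE = re.compile(
    r"^\s*(?P<pct>\d+(?:\.\d+)?)% \((?P<busy>[^)]*)\) "
    r"cpu usage by thread '(?P<name>.*)'\s*$"
)


def parse_hot_threads(text):
    """Return (cpu_percent, thread_name) pairs, busiest first."""
    hits = []
    for line in text.splitlines():
        match = THREAD_LINE.match(line)
        if match:
            hits.append((float(match.group("pct")), match.group("name")))
    return sorted(hits, reverse=True)


def busy_merge_threads(text, min_pct=50.0):
    """Keep only Lucene merge threads at or above min_pct CPU."""
    return [(pct, name) for pct, name in parse_hot_threads(text)
            if "Lucene Merge Thread" in name and pct >= min_pct]


print(busy_merge_threads(SAMPLE))
```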
  41. Hot/Warm architecture

    [Diagram: Elasticsearch cluster with a HOT tier receiving bulk indexing and a WARM tier of warm data nodes, with X-Pack on each tier. Shards qualifying for warm (e.g. older than 3 days) are relocated.]
    _forcemerge on warm!
  42. Hot/Warm architecture

    [Diagram: Elasticsearch cluster with a HOT tier receiving bulk indexing and a WARM tier of warm data nodes, with X-Pack on each tier. Shards qualifying for warm (e.g. older than 3 days) are relocated.]
    - No active indexing on warm nodes
    - Schedule to run outside business/peak hours
  43. No tiers

    1. Outside business hours
    2. Only on indices you’re no longer writing to
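
The two rules on this slide translate into a simple guard that a scheduled job could check before calling _forcemerge. The Python sketch below is one possible reading of them: the business-hours window (weekdays 08:00 to 18:00), the 24-hour idle period used to decide that an index is "no longer written to", and the function name `may_force_merge` are all assumptions.

```python
from datetime import datetime, time, timedelta


def may_force_merge(now: datetime, last_write: datetime,
                    business_start: time = time(8, 0),
                    business_end: time = time(18, 0),
                    min_idle: timedelta = timedelta(hours=24)) -> bool:
    """True only outside business hours and for an index idle long enough."""
    in_business_hours = (now.weekday() < 5
                         and business_start <= now.time() < business_end)
    idle_long_enough = (now - last_write) >= min_idle
    return (not in_business_hours) and idle_long_enough


# Thursday night, index last written to the previous day at noon.
print(may_force_merge(datetime(2017, 3, 9, 23, 0), datetime(2017, 3, 8, 12, 0)))
```

How to obtain `last_write` is left open here; for time-based indices, the age encoded in the index name is often a sufficient proxy.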
  45. Take-aways

    1. X-Pack monitoring delivers great out-of-the-box value and added visibility for free!
    2. If you wanna take it further, X-Pack alerting gives you unlimited extra pairs of eyes and early detection