The Usual Suspects: Automatic Alerts to Monitor your Cluster

March 9th, 2017
The Usual Suspects: Automatic Cluster Alerts to Monitor Your Cluster
Bohyun Kim, Senior Product Manager, @ensheneer
Chris Earle, Monitoring Lead, @pickypg
Antonio Bonuccelli, Support Engineer, @nellicus

The Usual Suspects

We will not bore you...
1. Marvel to X-Pack monitoring
2. Latest in X-Pack monitoring
3. Demo
4. Future of X-Pack monitoring
5. Tackling Real World Problems using X-Pack monitoring

Our Team
Chris Earle, Bohyun Kim, Tim Sullivan, Steve Kearns, Uri Boness, Jurgen Altziebler, CJ Cenizal

Marvel: Monitoring Elasticsearch 1.x

Marvel: Monitoring Elasticsearch 2.x

X-Pack: Extensions for the Elastic Stack
Security, Alerting, Monitoring, Reporting, Graph Analytics
Single Install, included in Elastic Subscription

X-Pack Monitoring

X-Pack monitoring: Not just a new name
A window into the Elastic Stack
5.0: Kibana Monitoring / Improved HTTP Exporter
5.1: Elasticsearch Advanced Node and Index Views
5.2: Logstash Monitoring / Cgroup (Container-aware) Metrics
5.4: Cluster Alerts for issues

Cluster Alerts: Proactive Monitoring

Cluster Alerts: Life Cycle
1. Monitoring agent collects data about node(s) or instance(s)
2. Watch executes on a schedule; looks at monitoring data and any existing alert
3. The associated alert needs to be: ignored, created, updated, or resolved
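
The life cycle above can be read as a small decision table. The sketch below is only an illustration of that table, not the logic of the actual cluster alert watches: the function name, the `message` field, and the rule for telling "update" from "ignore" are all assumptions.

```python
from typing import Optional


def next_alert_action(condition_met: bool,
                      existing_alert: Optional[dict],
                      new_message: str) -> str:
    """Decide what to do with an alert: 'create', 'update', 'resolve' or 'ignore'.

    Hypothetical model: an alert is a dict with a 'message' key, and
    existing_alert is None when no alert is currently open.
    """
    if condition_met and existing_alert is None:
        return "create"   # problem seen for the first time
    if condition_met:
        # An alert is already open: touch it only if its details changed.
        if existing_alert.get("message") != new_message:
            return "update"
        return "ignore"
    if existing_alert is not None:
        return "resolve"  # problem went away, close the open alert
    return "ignore"       # nothing wrong, nothing open
```

A watch running on a schedule would call something like this once per execution, passing in what it found in the monitoring data and whichever alert is currently open.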

Cluster Alerts: Reality
Not just for Elasticsearch. Links to relevant data.

DEMO 1

Monitoring the Future
Themes: Cluster Alerts, Deeper Integration, Log Monitoring
• More Control / Advanced
• UI Configuration
• In-app Documentation
• Monitor ML/Beats
• Machine Learning
• Logstash Pipeline Viewer
• Management
• Experimentation
• Elasticsearch
• Alerting!

Tackling Real World Problems Using X-Pack monitoring

Agenda and Scope
For each scenario:
1. Observe symptoms
2. Understand impact
3. Remediation
4. Alerting tips

Real World Scenarios
1. Running out of disk space in Elasticsearch nodes
2. Event throughput drop following Logstash configuration change
3. Cluster instability after misuse of _forcemerge API

Scenario I: Running out of disk space in Elasticsearch nodes

Background: Disk-based shard allocation using watermarks
1. A low watermark threshold (defaults to 85% of total disk space used)
2. A high watermark threshold (defaults to 90% of total disk space used)
3. When the low watermark is hit: only allow allocation of new primaries
4. When the high watermark is hit: relocate shards away
5. Can be configured with absolute values, e.g. "50gb" (space left)
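
As a rough sketch of the watermark rules above, the function below classifies a node by its disk usage percentage. It only mirrors the slide's description with the default 85% and 90% thresholds; the function name, the return labels, and the handling of values exactly on a threshold are assumptions, not Elasticsearch's own code.

```python
def watermark_state(used_percent: float,
                    low: float = 85.0,
                    high: float = 90.0) -> str:
    """Classify disk usage against the low and high watermarks.

    Returns 'ok', 'low' (only new primaries get allocated) or
    'high' (shards get relocated away), following the slide.
    """
    if not 0.0 <= used_percent <= 100.0:
        raise ValueError("used_percent must be between 0 and 100")
    if low > high:
        raise ValueError("low watermark must not exceed high watermark")
    if used_percent >= high:   # boundary handling is an assumption
        return "high"
    if used_percent >= low:
        return "low"
    return "ok"
```

For example, a node at 87% disk usage would be reported as `low`, and one at 93% as `high`.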

Background: Disk-based shard allocation
[Diagram: disk usage bar of an ES data node, from 0% to 100%. At 85%: only allow NEW PRIMARY shard allocations. At 90%: relocate shards away.]

Scenario I
1. Indexing into hourly indices with Logstash
2. 3-node cluster
3. All nodes have hit the low disk watermark at 17:53
4. Expecting new indices to be created at 18:00

Scenario I
[Diagram: cluster nodes running X-Pack; low watermark hit]

Yellow Cluster!

Remediation - free or add space
1. Add more space/nodes
2. Remove unneeded indices
3. Remove replicas (don't do that)

Prevention
1. Use Curator to trim your index retention
2. Use alerting to get early warnings!
3. Use alerting to automate remediation?
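
To make the retention idea concrete, here is a minimal sketch of picking hourly indices that have aged out. It is not how Curator is implemented; the `logstash-YYYY.MM.dd.HH` naming pattern, the 48-hour retention, and the function name are assumptions for illustration only.

```python
from datetime import datetime, timedelta


def expired_indices(names, now, keep_hours=48,
                    prefix="logstash-", date_format="%Y.%m.%d.%H"):
    """Return the index names whose embedded timestamp is older than keep_hours.

    Names that do not match the assumed prefix and date pattern are skipped.
    """
    cutoff = now - timedelta(hours=keep_hours)
    expired = []
    for name in names:
        if not name.startswith(prefix):
            continue
        try:
            created = datetime.strptime(name[len(prefix):], date_format)
        except ValueError:
            continue  # not an hourly index in the assumed format
        if created < cutoff:
            expired.append(name)
    return expired
```

The returned names would then be candidates for deletion by whatever tool manages the cluster.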

Data you can alert on

    GET _nodes/stats?filter_path=nodes.*.fs.total.*

    {
      "nodes": {
        "vWXCBrlASwy19YeHturzXg": {
          "fs": {
            "total": {
              "total_in_bytes": 484644716544,
              "free_in_bytes": 220382507008,
              "available_in_bytes": 195740344320
            }
          }
        }
      }
    }
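
A response like this can be turned into a per-node disk usage percentage that an alert condition can compare against the watermarks. The sketch below assumes that `available_in_bytes` is the right figure to compare with `total_in_bytes`; the function name is made up for the example.

```python
import json

# Sample response copied from the slide.
SAMPLE_RESPONSE = """
{
  "nodes": {
    "vWXCBrlASwy19YeHturzXg": {
      "fs": {
        "total": {
          "total_in_bytes": 484644716544,
          "free_in_bytes": 220382507008,
          "available_in_bytes": 195740344320
        }
      }
    }
  }
}
"""


def disk_used_percent(stats: dict) -> dict:
    """Map each node id to its disk usage, as a percentage of total space."""
    usage = {}
    for node_id, node in stats["nodes"].items():
        total = node["fs"]["total"]
        available_ratio = total["available_in_bytes"] / total["total_in_bytes"]
        usage[node_id] = 100.0 * (1.0 - available_ratio)
    return usage


if __name__ == "__main__":
    for node_id, pct in disk_used_percent(json.loads(SAMPLE_RESPONSE)).items():
        print(f"{node_id}: {pct:.1f}% used")
```

With the sample above the node sits at roughly 60% usage, well below the default low watermark.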

Scenario II: Event throughput drop following Logstash configuration change

Logstash performance metrics

Logstash performance metrics

    filter {
      if [type] == "my_monitored_type" {
        metrics {
          meter => "events"
          add_tag => "metric"
        }
      }
    }

Logstash performance metrics

    output {
      if "metric" in [tags] {
        stdout {
          codec => line {
            format => "rate: %{[events][rate_1m]}"
          }
        }
      }
    }

Logstash performance metrics

    % bin/logstash -f example.conf
    rate: 23721.983566819246
    rate: 24811.395722536377
    rate: 25875.892745934525
    rate: 26836.42375967113

Logstash performance metrics: it used to be hard work (Metric filter plugin, 4/4)

X-Pack monitoring for Logstash
1. Performance metrics on dashboards out of the box (Events IN/OUT/LATENCY)
2. Adds to the existing Elasticsearch and Kibana monitoring
3. GA in 5.2 - more to come!

Scenario II
1. A configuration change goes in (using automatic config reload)
2. From "all is good!" to a dramatic performance drop

Scenario II - new config deployed

    [2017-02-28T23:58:13,479][WARN ][logstash.agent] fetched new config for pipeline. upgrading..

Considerations
1. Event latency is a crucial metric
2. Before, you couldn't clearly tell "Is it the chicken (Logstash) or the egg (Elasticsearch)?"
3. We can now clearly see events taking longer to go through the pipeline (regardless of destination status)
4. Upcoming releases will bring visual plugin-level metrics

Prevention
1. Test your configs :-)

X-Pack alerting
1. Alert on Event Latency (e.g. greater than a fixed threshold)
2. Alert on significant Event throughput variations
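
A fixed-threshold latency check could look roughly like the sketch below. It assumes two snapshots of Logstash's cumulative event counters carrying an `out` count and a `duration_in_millis` total; those field names, the function names, and the 10 ms threshold are assumptions for illustration, not the way X-Pack computes its latency chart.

```python
from typing import Optional


def event_latency_ms(prev: dict, cur: dict) -> Optional[float]:
    """Average milliseconds spent per event between two counter snapshots.

    Returns None when no events went out or a counter went backwards
    (for example after a Logstash restart).
    """
    events = cur["out"] - prev["out"]
    duration = cur["duration_in_millis"] - prev["duration_in_millis"]
    if events <= 0 or duration < 0:
        return None
    return duration / events


def latency_alert(latency_ms: Optional[float], threshold_ms: float = 10.0) -> bool:
    """True when a latency value is known and above the fixed threshold."""
    return latency_ms is not None and latency_ms > threshold_ms
```

For instance, 1000 events taking 20000 ms in total between two snapshots gives 20 ms per event, which would trip a 10 ms threshold.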

X-Pack alerting - plotting raw Logstash metrics from API
Alert on Event Latency (e.g. greater than a threshold)
Alert on significant Event throughput variations
Logstash periodically reports the events sent out since startup, so the counter only grows (the first derivative is always greater than or equal to zero)
Rate of change = ?

X-Pack alerting - Pipeline agg 1st Derivative
Alert on Event Latency (e.g. greater than a threshold)
Alert on significant Event throughput variations
The Logstash stats API reports the events sent out since startup (the first derivative is always greater than or equal to zero)
[Chart annotations: Throughput Increasing, Throughput Steady]

X-Pack alerting - Data you can alert on

    "aggs": {
      "date_histogram": {
        "avgEventOut": { "avg": { "field": "logstash_stats.events.out" } },
        "first_deriv": { "derivative": { "buckets_path": "avgEventOut" } },
        "second_deriv": { "derivative": { "buckets_path": "first_deriv" } ….

X-Pack alerting - Data you can alert on

    {
      "key_as_string": "2017-03-03T01:00:00.000Z",
      "key": 1488502800000,
      "doc_count": 720,
      "avgEventOut": { "value": 46898292.92361111 },
      "hour_rate_deriv": { "value": 1421203.4340422675 },
      "2nd_hour_rate_deriv": { "value": -6672806.213859908 }
    }
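
The arithmetic behind the two derivative values can be sketched in plain Python. This is an offline illustration of what a derivative over consecutive buckets computes, using made-up bucket values rather than data from the talk; the function names and the drop threshold are assumptions.

```python
def derivative(values):
    """Difference between each bucket and the previous one.

    The first bucket has nothing to compare against, so it yields None,
    and None inputs propagate as None.
    """
    out = [None]
    for prev, cur in zip(values, values[1:]):
        out.append(None if prev is None or cur is None else cur - prev)
    return out


def throughput_dropping(second_deriv, drop_threshold):
    """True when the latest second derivative is below -drop_threshold."""
    latest = second_deriv[-1] if second_deriv else None
    return latest is not None and latest < -drop_threshold


# Hypothetical hourly averages of a cumulative "events out" counter.
avg_event_out = [1000.0, 3000.0, 5000.0, 5500.0]
first_deriv = derivative(avg_event_out)   # events per hour
second_deriv = derivative(first_deriv)    # change in events per hour
```

Because the counter is cumulative, the first derivative is the hourly throughput and stays non-negative; a strongly negative second derivative, as in the bucket shown on the slide, means the throughput itself is falling.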

Scenario III: Cluster instability after misuse of _forcemerge API

Lucene Segments Merging
https://www.youtube.com/watch?v=YW0bOvLp72E

Forcemerge API
1. From many little segments to a few big ones
2. Reduces search overhead and thus improves search performance
3. It is very CPU and I/O intensive, and unbounded
4. Do not run on nodes doing active indexing
5. "Hot-Warm" architecture / off business hours

Forcemerge API
1. Can severely impact a cluster, absorbing most of its CPU and I/O bandwidth
2. Bulk index rejections and query timeouts can be expected
3. Can last hours, proportional to the number of segments to be processed

Hot/Warm architecture
[Diagram: Elasticsearch cluster running X-Pack, with a HOT tier receiving bulk indexing and a WARM tier of data nodes (Data Nodes - Warm); shards qualifying for warm (e.g. older than 3 days) are relocated to the warm tier]
_forcemerge on hot indices?

Impact of calling Forcemerge

Hot Threads API
1. Use the _nodes/hot_threads API to see where the CPU is going
2. An API that returns the current hot threads on each node in the cluster
3. The output is plain text with a breakdown of each node's top hot threads

Hot Threads API

    :::{node1}{l4X6KlaxTEWdakFLVbiwoQ}{IbRuacjMTWyffEMjgliYvA}{w530}{192.168.1.8:9301}{zone=zone_B}
       Hot threads at 2017-02-16T16:27:57.583Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

       101.0% (505.2ms out of 500ms) cpu usage by thread 'elasticsearch[node1][[.security_audit_log-2017.02.13][0]: Lucene Merge Thread #0]'
         3/10 snapshots sharing following 13 elements
           org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlocks(BlockTreeTermsWriter.java:635)
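
Since the hot threads output is plain text, spotting a merge thread that is eating the CPU takes a bit of parsing. The sketch below pulls the CPU percentage and thread name out of lines shaped like the one on the slide; the regular expression is derived from that single sample and the function name is invented, so other output variants may need adjustments.

```python
import re

# Matches lines such as:
#   101.0% (505.2ms out of 500ms) cpu usage by thread '<name>'
HOT_THREAD_RE = re.compile(
    r"(?P<pct>\d+(?:\.\d+)?)% "
    r"\((?P<busy>[\d.]+)ms out of (?P<interval>[\d.]+)ms\) "
    r"cpu usage by thread '(?P<name>[^']+)'"
)

SAMPLE_OUTPUT = """\
   Hot threads at 2017-02-16T16:27:57.583Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

   101.0% (505.2ms out of 500ms) cpu usage by thread 'elasticsearch[node1][[.security_audit_log-2017.02.13][0]: Lucene Merge Thread #0]'
     3/10 snapshots sharing following 13 elements
"""


def busy_threads(text: str, min_percent: float = 50.0):
    """Return (cpu_percent, thread_name) pairs at or above min_percent."""
    found = []
    for match in HOT_THREAD_RE.finditer(text):
        pct = float(match.group("pct"))
        if pct >= min_percent:
            found.append((pct, match.group("name")))
    return found


def merge_threads(text: str, min_percent: float = 50.0):
    """Keep only the busy threads that look like Lucene merge threads."""
    return [t for t in busy_threads(text, min_percent)
            if "Lucene Merge Thread" in t[1]]
```

Run against the sample, this reports a single merge thread at 101.0% CPU, which is the symptom the scenario is about.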

Hot/Warm architecture
[Diagram: same cluster; shards qualifying for warm (e.g. older than 3 days) are relocated to the WARM tier]
_forcemerge on warm!
- No active indexing on warm nodes
- Schedule the run outside business/peak hours

No tiers: _forcemerge?
1. Outside business hours
2. Only on indices you are no longer writing to
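
Those two rules can be encoded as a simple guard before triggering a force merge. The sketch below is purely illustrative: the business-hours window, the 24-hour quiet period, and the way the last write time is obtained are all assumptions that would need to match the real environment.

```python
from datetime import datetime, time, timedelta


def safe_to_forcemerge(now: datetime,
                       last_write: datetime,
                       business_start: time = time(8, 0),
                       business_end: time = time(18, 0),
                       quiet_hours: int = 24) -> bool:
    """True only outside business hours and for an index that has gone quiet.

    last_write is the time of the most recent write to the index; how to
    obtain it is left open here.
    """
    in_business_hours = business_start <= now.time() < business_end
    gone_quiet = (now - last_write) >= timedelta(hours=quiet_hours)
    return (not in_business_hours) and gone_quiet
```

A scheduler could call this per index and skip any index for which it returns False.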

Take-aways
1. X-Pack monitoring delivers great out-of-the-box value and added visibility for free!
2. If you want to take it further, X-Pack alerting gives you an unlimited number of extra pairs of eyes and early detection

Thank you! Grazie!

Questions? Visit us at the AMA

www.elastic.co
