
Monitoring and Alerting: Knowing the Unknown

Amanda Sopkin

November 10, 2018

Transcript

  1. Principles of a Solid Monitoring/Alerting System
     • No single point of failure in the monitoring system
     • Alerts are clear and actionable
     • If everything is quiet, something is wrong
  2. Set priorities
     • “99.99% uptime for the zoombinator responses to emails”
     • Minimize latency to X service from Z.
     • Prioritize traffic on Fridays in November.
  3. Alert on symptoms, not causes
     • Is this indicative of a direct impact on user work?
     • Are there various possible causes?
     • Could this NOT have an impact?
  4. Alert on symptoms, not causes
     • Server load is high
     • This server is down for maintenance
     • Slow site responses
  5. Keep it simple
     • “The zoombinator received userid 10202 and returned availability X, Y, Z”
     • “After zoomification response 123 the zoombinator received 39393, proceed to create string akjkjd, and { data: [29, 393, x99] } and returned 6, 7, 8, { data: 3939 }.”
  6. Consoles
     • Have no more than 5 graphs on a console.
     • Have no more than 5 plots (lines) on each graph.
     • You can get away with more if it is a stacked/area graph.
  7. Make it easy to figure out which component is at fault
     • “The zoombinator received userid 10202 and threw exception X on line 29”
     • “Received userid 10202 and failed to return availability.”
  8. Create a process for addressing and resolving alerts
     1. Alerts/pages will go to the team alias.
     2. The current “team guardian” will address all alerts and provide an update to the team.
     3. Alerts will be diagrammed to show their trends.
  9. Discovering points of failure: Escalation procedures
     • What will happen if no one responds?
     • What is the appropriate chain?
     • After hours vs. on the job?
  10. Discovering points of failure: Latency
     The 99th percentile latency is the worst latency that was observed by 99% of all requests: it is the maximum value if you ignore the top 1%. A common notation for the 99th percentile is “p99”.
  11. Discovering points of failure: Latency
     But p99 still ignores the worst 1% of requests. If you want to see the worst latencies, you can look at the p100 latency, often called the maximum or the max.
  12. Discovering points of failure: Latency
     You can define a cut-off latency of (for example) 500ms and then measure how many requests exceeded that value. You can think of this as a “latency error budget”: 1% of your (projected) traffic is the amount of “slow” requests that you can spend.
  13. Discovering points of failure: Latency
     “The p99 latency, aggregated over 10-second intervals, should not go above 500ms for longer than 20 seconds.”
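     As a rough illustration of these ideas (not from the talk), here is a minimal Java sketch that computes p99, p100 (the max), and how much of a hypothetical 500ms “latency error budget” a batch of requests has spent; the variable names are made up:

     import java.util.ArrayList;
     import java.util.Collections;
     import java.util.List;

     public class LatencyStats {

         // Nearest-rank percentile over a sorted copy of the observed latencies.
         static long percentile(List<Long> latenciesMs, double pct) {
             List<Long> sorted = new ArrayList<>(latenciesMs);
             Collections.sort(sorted);
             int rank = (int) Math.ceil(pct / 100.0 * sorted.size());
             return sorted.get(Math.max(rank - 1, 0));
         }

         public static void main(String[] args) {
             List<Long> latenciesMs = List.of(12L, 45L, 30L, 700L, 25L, 18L, 22L, 510L, 40L, 33L);

             long p99 = percentile(latenciesMs, 99.0);  // worst latency seen by 99% of requests
             long p100 = Collections.max(latenciesMs);  // the true maximum ("max")

             // "Latency error budget": count requests slower than a 500ms cut-off.
             long budgetMs = 500;
             long slow = latenciesMs.stream().filter(l -> l > budgetMs).count();
             double slowFraction = (double) slow / latenciesMs.size();

             System.out.printf("p99=%dms p100=%dms slow=%d (%.1f%% of traffic)%n",
                     p99, p100, slow, slowFraction * 100);
         }
     }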
  14. Discovering points of failure: More generally
     1. Evaluate trends in alerts.
     2. For significant problems, consider a larger evaluation/retrospective.
     3. Consider GA and consumer care input in conjunction with data.
  15. Discovering points of failure: Conclusion
     • The 4 signals: latency, error rates, saturation, traffic
     • Meta-monitoring is important
     • Don’t forget batch jobs!
  16. Framework for log levels
     • FATAL: a “shutdown” level problem
     • ERROR: an error that is fatal to what is being attempted
     • WARN: anything that can cause oddities
     • INFO: generally useful
     • DEBUG: diagnostic information
     • TRACE: one specific part of a function
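     A minimal sketch (not from the talk) of how these levels might map to code, assuming an SLF4J-style logger; the OrderService class and its messages are hypothetical:

     import org.slf4j.Logger;
     import org.slf4j.LoggerFactory;

     public class OrderService {
         private static final Logger log = LoggerFactory.getLogger(OrderService.class);

         void placeOrder(long orderId) {
             log.trace("Entering placeOrder for orderId={}", orderId);       // TRACE: one specific part of a function
             log.debug("Loaded pricing rules for orderId={}", orderId);      // DEBUG: diagnostic information
             log.info("Order {} placed successfully", orderId);              // INFO: generally useful
             log.warn("Inventory for order {} is below threshold", orderId); // WARN: can cause oddities
             log.error("Payment failed for order {}", orderId);              // ERROR: fatal to what is being attempted
             // SLF4J has no FATAL level; a "shutdown" level problem is usually logged
             // as ERROR (often with a marker) or via the backend's fatal() if it has one.
         }
     }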
  17. Logging: dos and don’ts: Avoid side effects
     try {
         log.trace("Id=" + request.getUser().getId() + " accesses "
                 + manager.getPage().getUrl().toString());
     } catch (NullPointerException e) {
         // do something
     }
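     One way to avoid this kind of side effect (a hedged sketch, not the talk's own fix) is to extract anything that can throw before the log call and use parameterized logging; the request and manager objects are assumed to have the same getters as on the slide:

     // Pull the values out defensively first, so the log statement itself cannot throw.
     String userId = (request != null && request.getUser() != null)
             ? String.valueOf(request.getUser().getId()) : "unknown";
     String pageUrl = (manager != null && manager.getPage() != null && manager.getPage().getUrl() != null)
             ? manager.getPage().getUrl().toString() : "unknown";

     // Parameterized logging: the message is only formatted if TRACE is enabled,
     // and no exception handling needs to be wrapped around the logger.
     log.trace("Id={} accesses {}", userId, pageUrl);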
  18. Logging: dos and don’ts
     DO Be concise and descriptive
     • “Availability for userid 3939 at 10-19-2018T05:07 returned successfully in AvailabilityByDay.”
     • “Availability returned successfully”
  19. Logging: dos and don’ts
     DO Log method arguments and return values
     • Parameters: zpid: 39393, listingId: 3939; Return value: “true”
     • “true”
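     A small illustrative sketch of this pattern (getTourAvailability, checkCalendar, and the parameters are hypothetical names, not from the talk):

     boolean getTourAvailability(long zpid, long listingId) {
         log.debug("getTourAvailability called with zpid={}, listingId={}", zpid, listingId);
         boolean available = checkCalendar(zpid, listingId); // hypothetical helper
         log.debug("getTourAvailability(zpid={}, listingId={}) returned {}", zpid, listingId, available);
         return available;
     }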
  20. Logging: dos and don’ts
     Consider logging ALL external input
     // Log the full id list only at DEBUG; otherwise just log the size.
     if (log.isDebugEnabled()) {
         log.debug("Processing ids: {}", requestIds);
     } else {
         log.info("Processing ids size: {}", requestIds.size());
     }
  21. Logging: dos and don’ts
     Log exceptions the right way (Java)
     try {
         Integer x = null;
         ++x;
     } catch (Exception e) {
         log.error(e); // anti-pattern: depending on the logging API this may not compile, and it drops the stack trace
     }
  22. Logging: dos and don’ts
     Log exceptions the right way
     try {
         Integer x = null;
         ++x;
     } catch (Exception e) {
         log.error(e.toString());   // anti-pattern: message only, stack trace is lost
         log.error(e.getMessage()); // anti-pattern: may even be null, stack trace is lost
     }
  23. Logging: dos and don’ts
     Log exceptions the right way
     try {
         Integer x = null;
         ++x;
     } catch (Exception e) {
         // Anti-patterns: all of these lose the stack trace.
         log.error(e);
         log.error(e.toString());
         log.error(e.getMessage());
         log.error("Error reading configuration file: " + e);
     }
  24. Logging: dos and don’ts
     Log exceptions the right way
     try {
         Integer x = null;
         ++x;
     } catch (Exception e) {
         // Passing the exception as the last argument keeps the full stack trace.
         log.error("", e);
         log.error("Error reading configuration file", e);
     }
  25. Logging: dos and don’ts
     Provide context
     • “listingId 399393 returned for call to get tour availability from BDP”
     • “listingId 39939”
  26. A note on testing
     [Chart: relative share of Code Review, Integration Testing, Unit Testing, and Manual Testing, labeled 30%, 45%, 55%, and 5-10%.]
  28. Things you might miss
     • Highly segmented failures → alerts by segment
     • Small buckets of users → consumer care
     • “Blips” on the radar → threshold alerts
  30. Why do we care?
     • Early signs of bigger issues
     • Potentially really bad for a small but vocal number of users (bad for ratings)
     • Pie in the sky
  31. Kinds of anomalies:
     • Point anomalies: a single blip
     • Contextual anomalies: bad within context
     • Collective anomalies: these shouldn’t happen together
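     As a rough illustration (not from the talk), a point anomaly like a single latency blip can be caught with a simple check against a rolling mean; the class name, window size, and threshold below are made up:

     import java.util.ArrayDeque;
     import java.util.Deque;

     // Minimal sketch: flag a point anomaly when a value is far above the recent rolling mean.
     public class PointAnomalyDetector {
         private final Deque<Double> window = new ArrayDeque<>();
         private final int windowSize;
         private final double thresholdRatio; // e.g. 3.0 = flag values 3x above the rolling mean

         PointAnomalyDetector(int windowSize, double thresholdRatio) {
             this.windowSize = windowSize;
             this.thresholdRatio = thresholdRatio;
         }

         boolean isAnomaly(double value) {
             boolean anomaly = false;
             if (window.size() == windowSize) {
                 double mean = window.stream().mapToDouble(Double::doubleValue).average().orElse(0);
                 anomaly = mean > 0 && value > mean * thresholdRatio;
                 window.removeFirst();
             }
             window.addLast(value);
             return anomaly;
         }

         public static void main(String[] args) {
             PointAnomalyDetector detector = new PointAnomalyDetector(5, 3.0);
             double[] latenciesMs = {40, 42, 38, 41, 39, 43, 400, 40}; // 400 is the "blip"
             for (double v : latenciesMs) {
                 if (detector.isAnomaly(v)) {
                     System.out.println("Point anomaly detected: " + v + "ms");
                 }
             }
         }
     }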
  32. Conclusion...
     • Cover your bases with the 4 golden signals
     • Be proactive about your system
     • Consider anomaly detection for your use case