Predicting and Preventing Outages

Fabian Lange, Co-Founder

About Instana
Instana is an easy-to-install solution that just works
• Start agents
• See your systems and applications
• Get alerted on real issues
• Understand problem details and causality
Demo from our webinar: www.youtube.com/watch?v=z14wXHzw5lU

How do we understand what's going on?

What is so hard about understanding?
Existing monitoring systems have data, but
• No semantics
• Bad accuracy
• Bad resolution
• Not the right data
• Not enough data
• Slow

Applying Semantics to Data

Modeling
We modelled IT systems for our users
• Components
  • Attributes
  • Metrics
• Events
  • Changes
  • Issues
  • Traces
  • Incidents
• Services
  • Metrics

What is a Component?
Components are (standard) building blocks like
• Hosts
• Databases
• Caches
• Application Runtimes
• Containers
They have resource-oriented metrics
• Memory
• CPU
• Pool Usage
And request-oriented metrics
• Calls
• Response Time
• Hit/Miss
• Result Size

What is an Event?
Events happen at some point in time
• Changes
  • Attributes of components change
  • Components get added or removed
• Issues
  • Components showing abnormal vital signs
• Incidents
  • Services showing abnormal vital signs
• Traces
  • Chain of code execution delivering a service

What is a Service?
Services represent the objectives of an IT system
Services can be very dynamic
• Created from traces
Exhibit key (business) KPIs
• Load
• Errors
• Latency
• Saturation
• Instances

KPIs for Service Health
• Load: measures how much demand/traffic is on the service. Normally measured in requests/sec.
• Latency: response time of the service requests that have no error. Normally measured in milliseconds.
• Errors: can be measured as errors per second or as the percentage of all requests that ended with an error.
• Saturation: measures how full the most constrained resources of a service are. Can be the utilization of a thread pool.
• Instances: the number of instances of a service. Can be the number of containers that are delivering the same service or the number of Tomcat application servers that have a service deployed.
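The KPI definitions can be made concrete with a small sketch. It is an illustration only, not Instana's implementation: the `Request` class, its fields, and the method names are hypothetical, and saturation is modelled as thread-pool utilization because that is the example the slide gives.

```java
import java.util.List;

/** Hypothetical request sample; the fields are illustrative assumptions. */
final class Request {
    final long durationMillis;
    final boolean error;

    Request(long durationMillis, boolean error) {
        this.durationMillis = durationMillis;
        this.error = error;
    }
}

final class ServiceKpis {
    /** Load: requests per second over the observation window. */
    static double load(List<Request> requests, double windowSeconds) {
        return requests.size() / windowSeconds;
    }

    /** Latency: mean response time in ms of the requests that have no error. */
    static double latencyMillis(List<Request> requests) {
        return requests.stream()
                .filter(r -> !r.error)
                .mapToLong(r -> r.durationMillis)
                .average()
                .orElse(0.0);
    }

    /** Errors: percentage of all requests that ended with an error. */
    static double errorPercent(List<Request> requests) {
        if (requests.isEmpty()) {
            return 0.0;
        }
        long failed = requests.stream().filter(r -> r.error).count();
        return 100.0 * failed / requests.size();
    }

    /** Saturation: utilization (0..1) of a thread pool, assumed to be the most constrained resource. */
    static double saturation(int busyThreads, int poolSize) {
        return poolSize == 0 ? 0.0 : (double) busyThreads / poolSize;
    }
}
```

Excluding failed requests from the latency calculation follows the slide's definition: fast failures would otherwise make a broken service look quicker than it is.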

Connecting the dots
Components, Events and Services are added to the Dynamic Graph
Contains information about
• Component hierarchies
  • JVM runs in Docker on Linux
  • MSSQL runs on Windows
• Technical services
  • Auth-Service deployed on that JVM
  • User-DB in that MSSQL
• Interactions
  • Auth-Service calls User-DB
• Business Services
  • Various technical services represent the business service “Login”

Collecting the Right Data

Data collection
The model already exists during data collection, in the form of Sensors
• Sensors collect only what's needed
  • Reduce overhead
  • No point in collecting what is not understood
• Sensors automatically activate
  • Always collecting the data
  • Reacting to change
• Sensors collect the finest granularity that makes sense
  • Usually second or per-second data

Why 1-Second
[Figure panels: 1 Second, 1 Minute]
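One way to see why resolution matters is to average per-second samples into one-minute buckets and compare the peaks. The sketch below is an assumed illustration of that effect, not a reconstruction of the slide's charts; the class and method names are hypothetical.

```java
final class Resolution {
    /** Averages consecutive groups of `factor` samples, e.g. 60 one-second samples into one minute. */
    static double[] downsample(double[] perSecond, int factor) {
        int buckets = perSecond.length / factor;
        double[] out = new double[buckets];
        for (int b = 0; b < buckets; b++) {
            double sum = 0;
            for (int i = 0; i < factor; i++) {
                sum += perSecond[b * factor + i];
            }
            out[b] = sum / factor;
        }
        return out;
    }

    /** Largest value in a non-empty series. */
    static double max(double[] values) {
        double best = values[0];
        for (double v : values) {
            best = Math.max(best, v);
        }
        return best;
    }

    public static void main(String[] args) {
        // Assumed data: 55 seconds at 20% CPU followed by a 5-second spike at 100%.
        double[] cpu = new double[60];
        for (int i = 0; i < 60; i++) {
            cpu[i] = i < 55 ? 20.0 : 100.0;
        }
        System.out.println("1-second peak: " + max(cpu));
        System.out.println("1-minute average: " + downsample(cpu, 60)[0]);
    }
}
```

In this made-up series the five-second saturation is plainly visible at one-second resolution, while the one-minute average stays below 30% and would not trip a typical threshold.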

Fast Continuous Analytics

Data processing
The backend combines data streams from the Sensors
• Builds complete model
  • Combines individual data streams from all systems
• Performs accurate statistics of metrics
  • Held in histograms
• Processes raw data for health analysis
  • In-memory windows
  • Health rules
  • Detectors
• Stores appropriate roll-ups
  • Quality data also in the past

Health Analysis
KPIs of services and component health are constantly analyzed for
• Sudden changes
  • Load of server dropping from 100 calls/s to 5/s
• Trends
  • Free disk space decreasing over time
• Outliers
  • Problematic traces
• Algorithms heavily use Median Absolute Deviation (MAD)
  • en.wikipedia.org/wiki/Median_absolute_deviation
  • A more robust alternative to average-based statistics
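A minimal sketch of MAD-based outlier detection follows. It shows the general technique only; the class name, the threshold parameter, and the use of the 1.4826 scaling constant (which makes MAD comparable to a standard deviation for normally distributed data) are assumptions, not details taken from Instana's detectors.

```java
import java.util.Arrays;

final class MadOutliers {
    /** Median of a non-empty array; the input is not modified. */
    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    /** MAD = median(|x_i - median(x)|). */
    static double mad(double[] values) {
        double med = median(values);
        double[] deviations = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            deviations[i] = Math.abs(values[i] - med);
        }
        return median(deviations);
    }

    /** True when x lies more than `threshold` scaled MADs away from the window's median. */
    static boolean isOutlier(double x, double[] window, double threshold) {
        double med = median(window);
        double scaledMad = 1.4826 * mad(window);
        if (scaledMad == 0) {
            // Degenerate window: at least half of the values equal the median.
            return x != med;
        }
        return Math.abs(x - med) / scaledMad > threshold;
    }
}
```

With a window of recent load samples around 100 calls/s, a new sample of 5 calls/s is flagged while 102 calls/s is not. Because both the centre and the spread are medians, a few extreme samples inside the window barely move the baseline, which is the robustness the slide refers to.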

Sudden Drop / Increase
“Twitter Paper”: Leveraging Cloud Data to Mitigate User Experience from “Breaking Bad”
• E-Divisive with Medians
• arxiv.org/pdf/1411.7955.pdf

Trend Detection

Outlier Detection

Health Analysis
Components are also analyzed with specific health signatures
• Rules specific to the component
  • Disk Space
  • Garbage Collection
  • Cluster State
• Generic rules
  • Hit Rate
  • Throughput
• Large, constantly growing knowledge base

Why not “machine learning” these signatures?
• Based on raw data, many metrics correlate
• Waste of CPU to learn what we already know
• Machine-learned correlation is not actionable
• Correlation is not causation

Knowledge Implementation Example

    provider.with("load.1min", Double.class)
        .slidingWindow(LoadHealthHypothesis.WINDOW_LENGTH, MINUTES)
        .mapWithContext((context, values) -> {
            long cpuCount = cpuCount(context);
            if (cpuCount > 0) {
                // Count the samples in the window whose 1-minute load
                // exceeds twice the number of CPUs.
                return values.entrySet()
                    .stream()
                    .filter((entry) -> entry.getValue() > cpuCount * 2)
                    .count();
            }
            // Unknown CPU count: report no violations.
            return 0L;
        })
        .filter(hypothesis.predicate())
        .mapWithContext(hypothesis.health())
        .toHealth("load");

What makes an Incident
• Incidents are reported based on service health problems
• Incident engine collects related issues and changes by traversing the graph
• To appear in an incident report, an issue or change needs to be in chronological proximity or proximity on the graph
• Example: Slow login service
  • High CPU on the database host connected via trace
  • High garbage collection on the JVM running the service
  • Disk full on the host serving the database
  • Volume capacity reduced by configuration
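The proximity rule can be sketched as a breadth-first traversal plus a time filter. This is a simplified illustration under stated assumptions: `GraphEvent`, `IncidentEngine`, the hop limit, and the time window are hypothetical names and parameters, and the "near in the graph or near in time" test is a literal reading of the slide rather than Instana's actual logic.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Hypothetical issue or change attached to a component. */
final class GraphEvent {
    final String component;
    final String description;
    final long timestampSeconds;

    GraphEvent(String component, String description, long timestampSeconds) {
        this.component = component;
        this.description = description;
        this.timestampSeconds = timestampSeconds;
    }
}

final class IncidentEngine {
    private final Map<String, List<String>> edges = new HashMap<>();
    private final List<GraphEvent> events = new ArrayList<>();

    /** Adds an undirected edge, e.g. "service runs on JVM" or "service calls database". */
    void connect(String a, String b) {
        edges.computeIfAbsent(a, k -> new ArrayList<>()).add(b);
        edges.computeIfAbsent(b, k -> new ArrayList<>()).add(a);
    }

    void record(GraphEvent event) {
        events.add(event);
    }

    /** Breadth-first search: hop distance of every node within maxHops of start. */
    Map<String, Integer> reachable(String start, int maxHops) {
        Map<String, Integer> distance = new HashMap<>();
        Queue<String> queue = new ArrayDeque<>();
        distance.put(start, 0);
        queue.add(start);
        while (!queue.isEmpty()) {
            String node = queue.remove();
            int d = distance.get(node);
            if (d == maxHops) {
                continue; // do not expand beyond the hop limit
            }
            for (String next : edges.getOrDefault(node, Collections.<String>emptyList())) {
                if (!distance.containsKey(next)) {
                    distance.put(next, d + 1);
                    queue.add(next);
                }
            }
        }
        return distance;
    }

    /** Events that are close to the service in the graph or close to the incident in time. */
    List<GraphEvent> related(String service, long incidentTime, int maxHops, long maxAgeSeconds) {
        Map<String, Integer> near = reachable(service, maxHops);
        List<GraphEvent> result = new ArrayList<>();
        for (GraphEvent e : events) {
            boolean closeInGraph = near.containsKey(e.component);
            boolean closeInTime = Math.abs(incidentTime - e.timestampSeconds) <= maxAgeSeconds;
            if (closeInGraph || closeInTime) {
                result.add(e);
            }
        }
        return result;
    }
}
```

Under this reading, an old configuration change on the database host still appears in the report for a slow login service because the host is two hops away in the graph, while an old event on an unconnected component is left out.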

Where is the Prediction?

How much does Instana Predict?
• Prediction is really hard across systems which constantly change
• Specific health rules can operate in a more confined problem space
  • A JVM runs more and more garbage collection
  • A disk continues to have less free disk space
  • Increasing select times on a database
  • A connection pool becomes full
• Capacity-related metrics allow prediction
  • Throughput starts degrading
  • Latency starts increasing with certain load
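For a confined problem such as shrinking free disk space, a simple prediction can be sketched with a least-squares line fit that is extrapolated to zero. This is one possible approach chosen for illustration; the slides do not say which trend model Instana uses, and the names below are hypothetical.

```java
final class DiskTrend {
    /**
     * Ordinary least-squares fit of y over x.
     * Returns {slope, intercept}; needs at least two distinct x values.
     */
    static double[] fitLine(double[] x, double[] y) {
        int n = x.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXY += x[i] * y[i];
            sumXX += x[i] * x[i];
        }
        double denominator = n * sumXX - sumX * sumX;
        double slope = (n * sumXY - sumX * sumY) / denominator;
        double intercept = (sumY - slope * sumX) / n;
        return new double[] {slope, intercept};
    }

    /**
     * Time (same unit as t) at which the fitted line reaches zero free space,
     * or NaN when free space is not shrinking.
     */
    static double timeOfExhaustion(double[] t, double[] freeSpace) {
        double[] line = fitLine(t, freeSpace);
        double slope = line[0];
        double intercept = line[1];
        if (slope >= 0) {
            return Double.NaN;
        }
        return -intercept / slope;
    }
}
```

Given hourly samples of 100, 90, 80, 70 and 60 GB free, the fitted slope is -10 GB per hour and the line reaches zero at hour 10, which leaves time to alert before the disk is actually full.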

There is no Root Cause

DATA COLLECTION Agents deployed on each Host

Agent Design
[Diagram: Agent with Elasticsearch sensor, Tomcat sensor, JVM sensor, Linux sensor; Auto Discovery; Communication; Local Sensor Memory & Context; Compression]
• Specific sensors collect specific metrics → reduces noise
• 1-second resolution where applicable → ensures high accuracy of data
• Agent compresses to reduce bandwidth usage

Agent Design - Technology
• Java 8
• Apache Karaf 4 OSGi Container
• Sensors implemented as OSGi bundles
• Fully pluggable and auto-updatable
• Agent bundles provide infrastructure
• Sensors use regular drivers to collect metrics
• Tracing done by instrumentation
• OpenTracing implementor
• “Dumb” Agent: data is processed in the backend

STREAM PROCESSING PIPELINE Scalable in Our Cloud

Backend Design
[Diagram: Data Ingestion & Health Calculation, Realtime Stream Processing, Incident Detection, Alerting, System Management, Dependency Health Metrics, Dynamic Knowledge Graph, API & CLI, Configuration]
• Decompress data into overall system state → we know the state of the system under monitoring at any time
• Apply semantics to the raw data → understand that high CPU load is bad and high throughput is good
• Correlate semantics across individual components → when throughput goes down, it could be correlated to CPU usage going up

Backend Design - Technology
• Java 8
• Dropwizard
• RxJava & Reactor
• HdrHistogram
• Home-grown in-memory graph DB
• Elasticsearch
• Cassandra
• Redis
• Kafka
• UI with React and Three.js

Backend Design - Sizing (End of 2016)
• 100+ machines in our cloud
• 500 CPUs
• 2TB RAM
• 2 Regions and 3 Availability Zones
• 500 Nomad services
• Cassandra: 56TB, 100k writes/s
• Elasticsearch: 32TB, 20k writes/s
• 5.5 billion documents
• 500GB trace data every day

Further reading
• Complex systems always have failures
  www.instana.com/blog/no-root-cause-microservice-applications
• Component vs Service Health
  www.instana.com/blog/monitoring-microservices-part-ii-understand-analyze-derive-application-health
• Focus on objectives of services
  www.instana.com/blog/managing-quality-service-part-management-service-quality-model

Take-Aways
• Understand semantics
• Build meta-models
• Do not calculate the obvious or the nonsensical
• Do not read from a database
• Processing live data is much faster
• RAM is cheap
• Optimize your code regularly
• Stream processing big data has high multipliers

Thank You!
