Predicting and Preventing Outages

Slide 1

Slide 1 text

Predicting and Preventing Outages
Fabian Lange, Co-Founder

Slide 2

Slide 2 text

About Instana

Instana is an easy-to-install solution that just works
• Start agents
• See your systems and applications
• Get alerted on real issues
• Understand problem details and causality

Demo from our webinar: www.youtube.com/watch?v=z14wXHzw5lU

Slide 3

Slide 3 text

How do we understand what's going on?

Slide 4

Slide 4 text

What is so hard about understanding?

Existing monitoring systems have data, but
• No semantics
• Bad accuracy
• Bad resolution
• Not the right data
• Not enough data
• Slow

Slide 5

Slide 5 text

Applying Semantics to Data

Slide 6

Slide 6 text

Modeling

We modelled IT systems for our users
• Components
  • Attributes
  • Metrics
• Events
  • Changes
  • Issues
  • Traces
  • Incidents
• Services
  • Metrics

Slide 7

Slide 7 text

What is a Component?

Components are (standard) building blocks like
• Hosts
• Databases
• Caches
• Application Runtimes
• Containers

They have resource-oriented metrics
• Memory
• CPU
• Pool Usage

And request-oriented metrics
• Calls
• Response Time
• Hit/Miss
• Result Size

Slide 8

Slide 8 text

What is an Event?

Events happen at some point in time

Changes
• Attributes of components change
• Components get added or removed

Issues
• Components showing abnormal vital signs

Incidents
• Services showing abnormal vital signs

Traces
• Chain of code execution delivering a service

Slide 9

Slide 9 text

What is a Service?

Services represent the objectives of an IT system

Services can be very dynamic
• Created from traces

Exhibit key (business) KPIs
• Load
• Errors
• Latency
• Saturation
• Instances

Slide 10

Slide 10 text

KPIs for Service Health

• Load: Measures how much demand/traffic is on the service. Normally measured in requests/sec.
• Latency: Response time of the service requests that have no error. Normally measured in milliseconds.
• Errors: Can be measured as errors per second or as the percentage of all requests that ended in an error.
• Saturation: Measures how full the most constrained resources of a service are. Can be the utilization of a thread pool.
• Instances: The number of instances of a service. Can be the number of containers that deliver the same service or the number of Tomcat application servers that have the service deployed.
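The KPI definitions above can be written down as a few small functions. This is only a sketch: the class name `ServiceKpis`, the method names and the input shapes are assumptions made for illustration and are not taken from Instana.

```java
/** Sketch of the service KPIs described on the slide; all names are illustrative. */
final class ServiceKpis {
    private ServiceKpis() {}

    /** Load: requests per second over the observation window. */
    static double load(long requestCount, double windowSeconds) {
        if (windowSeconds <= 0) {
            throw new IllegalArgumentException("window must be positive");
        }
        return requestCount / windowSeconds;
    }

    /** Errors: share of all requests that ended in an error (0.0 to 1.0). */
    static double errorRate(long requestCount, long errorCount) {
        return requestCount == 0 ? 0.0 : (double) errorCount / requestCount;
    }

    /** Latency: mean response time in milliseconds of the requests without error. */
    static double latencyMillis(double[] responseTimesMillis, boolean[] failed) {
        double sum = 0;
        int successful = 0;
        for (int i = 0; i < responseTimesMillis.length; i++) {
            if (!failed[i]) {
                sum += responseTimesMillis[i];
                successful++;
            }
        }
        return successful == 0 ? 0.0 : sum / successful;
    }

    /** Saturation: utilization of the most constrained resource, e.g. a thread pool. */
    static double saturation(int inUse, int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        return (double) inUse / capacity;
    }

    public static void main(String[] args) {
        System.out.println("load/s     = " + load(120, 60));
        System.out.println("error rate = " + errorRate(200, 10));
        System.out.println("latency ms = "
                + latencyMillis(new double[] {10, 20, 500}, new boolean[] {false, false, true}));
        System.out.println("saturation = " + saturation(45, 50));
    }
}
```

Note that the failed request is excluded from the latency figure, as the slide specifies; otherwise fast failures or slow timeouts would distort the response time of healthy requests.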

Slide 11

Slide 11 text

Connecting the dots

Components, Events and Services are added to the Dynamic Graph

Contains information about
• Component hierarchies
  • JVM runs in Docker on Linux
  • MSSQL runs on Windows
• Technical services
  • Auth-Service deployed on that JVM
  • User-DB in that MSSQL
• Interactions
  • Auth-Service calls User-DB
• Business Services
  • Various technical services represent the business service “Login”

Slide 12

Slide 12 text

Collecting the Right Data

Slide 13

Slide 13 text

Data collection

The model already exists during data collection, in the form of Sensors
• Sensors collect only what's needed
  • Reduce overhead
  • No point in collecting what is not understood
• Sensors automatically activate
  • Always collecting the data
  • Reacting to change
• Sensors collect the finest granularity that makes sense
  • Usually second or per-second data

Slide 14

Slide 14 text

Why 1-Second

[Chart comparing 1 Second and 1 Minute resolution]
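A small numeric illustration of why resolution matters: a three-second spike is obvious in per-second samples but nearly vanishes in a one-minute average. The data and the class name `ResolutionDemo` are made up for this sketch and are not taken from the slide's chart.

```java
import java.util.Arrays;

/** Illustrative only: shows how averaging per-second data into minutes hides a short spike. */
final class ResolutionDemo {
    private ResolutionDemo() {}

    /** Averages consecutive buckets of bucketSize samples; a trailing partial bucket is dropped. */
    static double[] downsample(double[] samples, int bucketSize) {
        int buckets = samples.length / bucketSize;
        double[] out = new double[buckets];
        for (int b = 0; b < buckets; b++) {
            double sum = 0;
            for (int i = 0; i < bucketSize; i++) {
                sum += samples[b * bucketSize + i];
            }
            out[b] = sum / bucketSize;
        }
        return out;
    }

    /** Largest value in the series, or NaN for an empty series. */
    static double max(double[] values) {
        return Arrays.stream(values).max().orElse(Double.NaN);
    }

    /** Sixty per-second CPU samples at 20%, with a three-second spike to 100%. */
    static double[] sampleMinute() {
        double[] cpu = new double[60];
        Arrays.fill(cpu, 20.0);
        cpu[30] = 100.0;
        cpu[31] = 100.0;
        cpu[32] = 100.0;
        return cpu;
    }

    public static void main(String[] args) {
        double[] perSecond = sampleMinute();
        double[] perMinute = downsample(perSecond, 60);
        System.out.println("peak at 1-second resolution: " + max(perSecond));
        System.out.println("value at 1-minute resolution: " + perMinute[0]);
    }
}
```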

Slide 15

Slide 15 text

Fast Continuous Analytics

Slide 16

Slide 16 text

Data processing

The backend combines data streams from the Sensors
• Builds complete model
  • Combines individual data streams from all systems
• Performs accurate statistics of metrics
  • Held in histograms
• Processes raw data for health analysis
  • In-memory windows
  • Health rules
  • Detectors
• Stores appropriate roll-ups
  • Quality data also in the past

Slide 17

Slide 17 text

Health Analysis

KPIs of Services and Component Health are constantly analyzed for
• Sudden Changes
  • Load of server dropping from 100 calls/s to 5/s
• Trends
  • Free disk space decreasing over time
• Outliers
  • Problematic traces
• Algorithms heavily use Median Absolute Deviation (MAD)
  • en.wikipedia.org/wiki/Median_absolute_deviation
  • is a more robust “average”
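To make the MAD bullet concrete, here is a minimal sketch of MAD-based outlier detection. The class name `MadOutliers` and the threshold of 3 MADs are assumptions for illustration; the slides do not say which thresholds or scaling Instana uses.

```java
import java.util.Arrays;

/** Minimal sketch of outlier detection with the median absolute deviation (MAD). */
final class MadOutliers {
    private MadOutliers() {}

    /** Median of the values; the input array is not modified. */
    static double median(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no values");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    /** MAD: the median of the absolute deviations from the median. */
    static double mad(double[] values) {
        double m = median(values);
        double[] deviations = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            deviations[i] = Math.abs(values[i] - m);
        }
        return median(deviations);
    }

    /** Flags a value that lies more than `threshold` MADs away from the baseline median. */
    static boolean isOutlier(double value, double[] baseline, double threshold) {
        double m = median(baseline);
        double spread = mad(baseline);
        if (spread == 0) {
            return value != m; // baseline is constant: any deviation stands out
        }
        return Math.abs(value - m) / spread > threshold;
    }

    public static void main(String[] args) {
        // Response times in ms; one pathological request at 5000 ms.
        double[] latencies = {100, 102, 98, 101, 99, 100, 5000};
        System.out.println("mean   = " + Arrays.stream(latencies).average().getAsDouble());
        System.out.println("median = " + median(latencies));
        System.out.println("MAD    = " + mad(latencies));
        System.out.println("5000 is outlier: " + isOutlier(5000, latencies, 3.0));
    }
}
```

The single 5000 ms request pulls the mean far away from typical behaviour, while the median and the MAD barely move. That robustness is what the slide refers to.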

Slide 18

Slide 18 text

Sudden Drop / Increase

“Twitter Paper”: Leveraging Cloud Data to Mitigate User Experience from “Breaking Bad”
• E-DIVISIVE WITH MEDIANS
• arxiv.org/pdf/1411.7955.pdf
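The following sketch is not E-Divisive with Medians. It is a much simpler stand-in that only shows the underlying idea of comparing medians of adjacent windows so that single spikes do not trigger a change. The class name `SuddenChangeSketch` and the 50% threshold used in the example are assumptions.

```java
import java.util.Arrays;

/** Simplified stand-in for median-based change detection; NOT the E-Divisive with Medians algorithm. */
final class SuddenChangeSketch {
    private SuddenChangeSketch() {}

    /** Median of the values; the input array is not modified. */
    static double median(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no values");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    /** Relative difference between the medians of two adjacent windows. */
    static double relativeChange(double[] previousWindow, double[] currentWindow) {
        double before = median(previousWindow);
        double after = median(currentWindow);
        if (before == 0) {
            return after == 0 ? 0.0 : Double.POSITIVE_INFINITY;
        }
        return Math.abs(after - before) / Math.abs(before);
    }

    /** True if the median moved by more than minRelativeChange (0.5 means 50%). */
    static boolean suddenChange(double[] previousWindow, double[] currentWindow,
                                double minRelativeChange) {
        return relativeChange(previousWindow, currentWindow) > minRelativeChange;
    }

    public static void main(String[] args) {
        double[] steady = {100, 101, 99, 100, 102};   // calls/s before
        double[] dropped = {5, 6, 4, 5, 5};           // calls/s after a real drop
        double[] oneSpike = {100, 5, 100, 101, 99};   // a single bad sample
        System.out.println("drop detected:  " + suddenChange(steady, dropped, 0.5));
        System.out.println("spike detected: " + suddenChange(steady, oneSpike, 0.5));
    }
}
```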

Slide 19

Slide 19 text

Trend Detection

Slide 20

Slide 20 text

Outlier Detection

Slide 21

Slide 21 text

Health Analysis

Components are also analyzed with specific health signatures
• Rules specific to the component
  • Disk Space
  • Garbage Collection
  • Cluster State
• Generic rules
  • Hit Rate
  • Throughput
• Large, constantly growing knowledge base

Slide 22

Slide 22 text

Why not “machine learning” these signatures?

• Based on raw data, many metrics correlate
• Waste of CPU to learn what we already know
• Machine-learned correlation is not actionable
• Correlation is not causation

Slide 23

Slide 23 text

Knowledge Implementation Example

provider.with("load.1min", Double.class)
    .slidingWindow(LoadHealthHypothesis.WINDOW_LENGTH, MINUTES)
    .mapWithContext((context, values) -> {
        long cpuCount = cpuCount(context);
        if (cpuCount > 0) {
            // Count the samples in the window whose 1-minute load exceeds twice the CPU count
            return values.entrySet()
                .stream()
                .filter((entry) -> entry.getValue() > cpuCount * 2)
                .count();
        }
        return 0L;
    })
    .filter(hypothesis.predicate())
    .mapWithContext(hypothesis.health())
    .toHealth("load");
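The core of the rule above can be reproduced in plain Java without the pipeline API. In this sketch the window is assumed to be a map from timestamp to load sample, which is consistent with the `entrySet()` and `getValue()` calls on the slide. The class name `LoadRuleSketch` is hypothetical, and the `hypothesis.predicate()` and `hypothesis.health()` steps are left out because the slide does not show them.

```java
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of the counting step in the load rule; names are hypothetical. */
final class LoadRuleSketch {
    private LoadRuleSketch() {}

    /**
     * Counts the samples in the window whose 1-minute load average exceeds
     * twice the number of CPUs. Returns 0 when the CPU count is unknown.
     */
    static long countOverloadedSamples(Map<Long, Double> window, long cpuCount) {
        if (cpuCount <= 0) {
            return 0L;
        }
        return window.values().stream()
                .filter(load -> load > cpuCount * 2)
                .count();
    }

    public static void main(String[] args) {
        // Timestamp (epoch seconds) -> load.1min sample; values are made up.
        Map<Long, Double> window = new TreeMap<>();
        window.put(1000L, 1.0);
        window.put(1060L, 9.5);
        window.put(1120L, 10.2);
        window.put(1180L, 3.0);
        System.out.println("overloaded samples on a 4-CPU host: "
                + countOverloadedSamples(window, 4));
    }
}
```

Scaling the threshold by the CPU count is the piece of domain knowledge the rule encodes: a load of 9 is alarming on a 4-CPU host but unremarkable on a 32-CPU one.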

Slide 24

Slide 24 text

What makes an Incident

• Incidents are reported based on service health problems
• Incident engine collects related issues and changes by traversing the graph
• To appear in an incident report, an issue or change needs to be in chronological proximity or proximity on the graph
• Example: Slow login service
  • High CPU on the database host connected via trace
  • High garbage collection on the JVM running the service
  • Disk full on the host serving the database
  • Volume capacity reduced by configuration
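The graph traversal described above can be sketched with a breadth-first search. Everything here is hypothetical: the class `IncidentSketch`, the component names and the limits of 3 hops and 120 seconds are made up, and this toy version requires an event to be close both on the graph and in time, which is a stricter reading than the either/or wording of the slide.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy incident collector; not Instana's engine, all names are hypothetical. */
final class IncidentSketch {
    /** An issue or change observed on one component. */
    static final class Event {
        final String component;
        final String description;
        final long timeSeconds;

        Event(String component, String description, long timeSeconds) {
            this.component = component;
            this.description = description;
            this.timeSeconds = timeSeconds;
        }
    }

    private final Map<String, Set<String>> neighbours = new HashMap<>();
    private final List<Event> events = new ArrayList<>();

    /** Adds an undirected edge such as "runs on" or "calls". */
    void connect(String a, String b) {
        neighbours.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        neighbours.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    void record(String component, String description, long timeSeconds) {
        events.add(new Event(component, description, timeSeconds));
    }

    /** Breadth-first search: hop distance to every component within maxHops of the service. */
    private Map<String, Integer> distancesFrom(String service, int maxHops) {
        Map<String, Integer> distance = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        distance.put(service, 0);
        queue.add(service);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            int hops = distance.get(current);
            if (hops == maxHops) {
                continue; // do not expand beyond the hop limit
            }
            for (String next : neighbours.getOrDefault(current, Collections.emptySet())) {
                if (!distance.containsKey(next)) {
                    distance.put(next, hops + 1);
                    queue.add(next);
                }
            }
        }
        return distance;
    }

    /** Events that are near the service on the graph and near the incident in time. */
    List<String> relatedEvents(String service, long incidentTime, int maxHops, long maxAgeSeconds) {
        Map<String, Integer> distance = distancesFrom(service, maxHops);
        List<String> related = new ArrayList<>();
        for (Event e : events) {
            boolean nearOnGraph = distance.containsKey(e.component);
            boolean nearInTime = Math.abs(incidentTime - e.timeSeconds) <= maxAgeSeconds;
            if (nearOnGraph && nearInTime) {
                related.add(e.description + " on " + e.component);
            }
        }
        return related;
    }

    /** Builds the "slow login service" example from the slide with made-up timestamps. */
    static IncidentSketch example() {
        IncidentSketch graph = new IncidentSketch();
        graph.connect("login-service", "jvm");
        graph.connect("login-service", "user-db");
        graph.connect("user-db", "db-host");
        graph.connect("db-host", "volume");
        graph.record("db-host", "High CPU", 990);
        graph.record("jvm", "High garbage collection", 995);
        graph.record("db-host", "Disk full", 980);
        graph.record("volume", "Capacity reduced by configuration", 900);
        graph.record("mail-host", "Restart", 995);   // not connected to the service
        graph.record("jvm", "Deployment", 100);      // too long ago
        return graph;
    }

    public static void main(String[] args) {
        example().relatedEvents("login-service", 1000, 3, 120).forEach(System.out::println);
    }
}
```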

Slide 25

Slide 25 text

Where is the Prediction?

Slide 26

Slide 26 text

How much does Instana Predict?

• Prediction is really hard across systems which constantly change
• Specific health rules can operate in a more confined problem space
  • A JVM runs more and more garbage collection
  • A disk continues to have less free disk space
  • Increasing select times on a database
  • A connection pool becomes full
• Capacity-related metrics allow prediction
  • Throughput starts degrading
  • Latency starts increasing with certain load
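The shrinking-disk case is a good example of a confined problem where prediction is tractable. The sketch below fits a least-squares line to evenly spaced free-space samples and extrapolates to zero. It assumes a roughly linear trend, and the class name `DiskTrendSketch` and its methods are hypothetical, not Instana's implementation.

```java
/** Sketch of trend-based prediction for free disk space; assumes a roughly linear trend. */
final class DiskTrendSketch {
    private DiskTrendSketch() {}

    /** Least-squares slope of evenly spaced samples, in units per sample interval. */
    static double slope(double[] samples) {
        int n = samples.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least two samples");
        }
        double meanX = (n - 1) / 2.0;
        double meanY = 0;
        for (double y : samples) {
            meanY += y;
        }
        meanY /= n;
        double numerator = 0;
        double denominator = 0;
        for (int i = 0; i < n; i++) {
            numerator += (i - meanX) * (samples[i] - meanY);
            denominator += (i - meanX) * (i - meanX);
        }
        return numerator / denominator;
    }

    /**
     * Estimated number of sample intervals until free space reaches zero,
     * extrapolating from the latest sample. Infinite if space is not shrinking.
     */
    static double intervalsUntilFull(double[] freeSpace) {
        double trend = slope(freeSpace);
        if (trend >= 0) {
            return Double.POSITIVE_INFINITY;
        }
        return freeSpace[freeSpace.length - 1] / -trend;
    }

    public static void main(String[] args) {
        // Free space in GB, sampled once per hour (made-up data).
        double[] freeGb = {100, 90, 80, 70, 60};
        System.out.println("trend (GB/hour): " + slope(freeGb));
        System.out.println("hours until full: " + intervalsUntilFull(freeGb));
    }
}
```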

Slide 27

Slide 27 text

There is no Root Cause

Slide 28

Slide 28 text

DATA COLLECTION
Agents deployed on each Host

Slide 29

Slide 29 text

Agent Design

[Diagram: Agent containing Elasticsearch sensor, Tomcat sensor, JVM sensor, Linux sensor, Auto Discovery and Communication; Local → Sensor Memory & Context → Compression]

• Specific sensors collect specific metrics
  • Reduces noise
• 1 second resolution where applicable
  • Ensures high accuracy of data
• Agent compresses to reduce bandwidth usage

Slide 30

Slide 30 text

Agent Design - Technology

• Java 8
• Apache Karaf 4 OSGi Container
• Sensors implemented as OSGi bundles
• Fully pluggable and auto-updatable
• Agent bundles provide infrastructure
• Sensors use regular drivers to collect metrics
• Tracing done by instrumentation
• OpenTracing implementor
• “Dumb” Agent - Process data in backend

Slide 31

Slide 31 text

STREAM PROCESSING PIPELINE
Scalable in Our Cloud

Slide 32

Slide 32 text

Backend Design

[Diagram labels: Data Ingestion & Health Calculation, Realtime Stream Processing, Incident Detection, Alerting, System Management, Dependency, Health, Metrics, Dynamic Knowledge Graph, API & CLI, Configuration]

• Decompress data into overall system state
  • We know the state of the system under monitoring at any time
• Apply semantics to the raw data
  • Understand that high CPU load is bad and high throughput is good
• Correlate semantics across individual components
  • When throughput goes down, it could be correlated to CPU usage going up

Slide 33

Slide 33 text

Backend Design - Technology

• Java 8
• Dropwizard
• RxJava & Reactor
• HdrHistogram
• Home-grown in-memory graph DB
• Elasticsearch
• Cassandra
• Redis
• Kafka
• UI with React and Three.js

Slide 34

Slide 34 text

Backend Design - Sizing End of 2016

• 100+ machines in our cloud
• 500 CPUs
• 2TB RAM
• 2 Regions and 3 Availability Zones
• 500 Nomad services
• Cassandra 56TB 100k writes/s
• Elasticsearch 32TB 20k writes/s
• 5.5 billion documents
• 500GB Trace data every day

Slide 35

Slide 35 text

Further reading

• Complex Systems always have failures
  www.instana.com/blog/no-root-cause-microservice-applications
• Component vs Service Health
  www.instana.com/blog/monitoring-microservices-part-ii-understand-analyze-derive-application-health
• Focus on objectives of services
  www.instana.com/blog/managing-quality-service-part-management-service-quality-model

Slide 36

Slide 36 text

Take-Aways

• Understand semantics
• Build meta-models
• Do not calculate the obvious or the nonsensical
• Do not read from a database
• Processing live data is much faster
• RAM is cheap
• Optimize your code regularly
• Stream processing big data has high multipliers

Slide 37

Slide 37 text

Thank You!