Predicting and Preventing Outages

How Instana works and how we build it - Talk from FFM Analytics Meetup

Fabian Lange

January 20, 2017

Transcript

  1. About Instana

    Instana is an easy to install solution that just works
    • Start agents
    • See your systems and applications
    • Get alerted on real issues
    • Understand problem details and causality
    Demo from our webinar: www.youtube.com/watch?v=z14wXHzw5lU
  2. What is so hard about understanding?

    Existing monitoring systems have data, but:
    • No semantics
    • Bad accuracy
    • Bad resolution
    • Not the right data
    • Not enough data
    • Slow
  3. Modeling

    We modelled IT systems for our users
    • Components
        • Attributes
        • Metrics
    • Events
        • Changes
        • Issues
        • Traces
        • Incidents
    • Services
        • Metrics
  4. What is a Component?

    Components are (standard) building blocks like
    • Hosts
    • Databases
    • Caches
    • Application Runtimes
    • Containers
    They have resource-oriented metrics
    • Memory
    • CPU
    • Pool Usage
    And request-oriented metrics
    • Calls
    • Response Time
    • Hit/Miss
    • Result Size
  5. What is an Event?

    Events happen at some point in time
    Changes
    • Attributes of components change
    • Components get added or removed
    Issues
    • Components showing abnormal vital signs
    Incidents
    • Services showing abnormal vital signs
    Traces
    • Chain of code execution delivering a service
  6. What is a Service?

    Services represent the objectives of an IT system
    Services can be very dynamic
    • Created from traces
    Exhibit key (business) KPIs
    • Load
    • Errors
    • Latency
    • Saturation
    • Instances
  7. KPIs for Service Health

    • Load
      Measures how much demand/traffic is on the service. Normally measured in requests/sec.
    • Latency
      Response time of the service requests that have no error. Normally measured in milliseconds.
    • Errors
      Can be measured as errors per second or as the percentage of all requests that ended with an error.
    • Saturation
      Measures how full the most constrained resources of a service are. Can be the utilization of a thread pool.
    • Instances
      The number of instances of a service. Can be the number of containers that are delivering the same service or the number of Tomcat application servers that have a service deployed.
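The KPI definitions above can be illustrated with a minimal sketch. The class `ServiceKpis`, its `Request` type and all method names are hypothetical and not part of Instana; the sketch assumes that all requests of one observation window are available in memory.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical KPI calculator for one observation window; not Instana's implementation. */
final class ServiceKpis {

    /** One observed request: its response time and whether it ended with an error. */
    static final class Request {
        final double latencyMillis;
        final boolean error;

        Request(double latencyMillis, boolean error) {
            this.latencyMillis = latencyMillis;
            this.error = error;
        }
    }

    /** Load: requests per second over the window. */
    static double load(List<Request> requests, double windowSeconds) {
        return requests.size() / windowSeconds;
    }

    /** Errors: percentage of all requests that ended with an error. */
    static double errorPercentage(List<Request> requests) {
        if (requests.isEmpty()) {
            return 0.0;
        }
        long errors = requests.stream().filter(r -> r.error).count();
        return 100.0 * errors / requests.size();
    }

    /** Latency: mean response time of the requests that had no error. */
    static double latencyMillis(List<Request> requests) {
        return requests.stream()
                .filter(r -> !r.error)
                .mapToDouble(r -> r.latencyMillis)
                .average()
                .orElse(0.0);
    }

    /** Saturation: how full the most constrained resource is, here a thread pool. */
    static double saturation(int busyThreads, int poolSize) {
        return (double) busyThreads / poolSize;
    }

    public static void main(String[] args) {
        List<Request> window = Arrays.asList(
                new Request(100, false), new Request(200, false),
                new Request(50, true), new Request(300, false));
        System.out.println("load/s:     " + load(window, 2.0));
        System.out.println("errors %:   " + errorPercentage(window));
        System.out.println("latency ms: " + latencyMillis(window));
        System.out.println("saturation: " + saturation(8, 10));
    }
}
```

The mean is used for latency only to keep the sketch short; the deck itself mentions histograms for metric statistics.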
  8. Connecting the dots

    Components, Events and Services are added to the Dynamic Graph
    Contains information about
    • Component hierarchies
        • JVM runs in Docker on Linux
        • MSSQL runs on Windows
    • Technical services
        • Auth-Service deployed on that JVM
        • User-DB in that MSSQL
    • Interactions
        • Auth-Service calls User-DB
    • Business services
        • Various technical services represent the business service “Login”
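A graph like the one on this slide can be sketched as a map of labelled, directed edges. This is an illustration only: the class `DynamicGraphSketch` and the relation names "runs on", "calls" and "represents" are assumptions, and Instana's own in-memory graph is not shown in the deck.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical labelled graph of components and services; not Instana's graph DB. */
final class DynamicGraphSketch {

    /** A directed edge such as ("runs on", "Docker"). */
    static final class Edge {
        final String relation;
        final String target;

        Edge(String relation, String target) {
            this.relation = relation;
            this.target = target;
        }
    }

    private final Map<String, List<Edge>> outgoing = new HashMap<>();

    void connect(String source, String relation, String target) {
        outgoing.computeIfAbsent(source, key -> new ArrayList<>())
                .add(new Edge(relation, target));
    }

    /** All nodes reachable from source over one edge with the given relation. */
    List<String> targets(String source, String relation) {
        List<String> result = new ArrayList<>();
        for (Edge edge : outgoing.getOrDefault(source, Collections.<Edge>emptyList())) {
            if (edge.relation.equals(relation)) {
                result.add(edge.target);
            }
        }
        return result;
    }

    /** Follows "runs on" edges downwards; the seen-set stops on cycles. */
    List<String> stack(String component) {
        Set<String> seen = new LinkedHashSet<>();
        String current = component;
        while (current != null && seen.add(current)) {
            List<String> next = targets(current, "runs on");
            current = next.isEmpty() ? null : next.get(0);
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        DynamicGraphSketch graph = new DynamicGraphSketch();
        graph.connect("Auth-Service", "runs on", "JVM");
        graph.connect("JVM", "runs on", "Docker");
        graph.connect("Docker", "runs on", "Linux");
        graph.connect("Auth-Service", "calls", "User-DB");
        graph.connect("Login", "represents", "Auth-Service");
        System.out.println(graph.stack("Auth-Service"));
        System.out.println(graph.targets("Auth-Service", "calls"));
    }
}
```

Keeping hierarchy, interaction and business edges in one structure is what lets a later step walk from a service to everything it depends on.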
  9. Data collection

    The model already exists during data collection, in the form of Sensors
    • Sensors collect only what's needed
        • Reduce overhead
        • No point in collecting what is not understood
    • Sensors automatically activate
        • Always collecting the data
        • Reacting to change
    • Sensors collect the finest granularity that makes sense
        • Usually second or per-second data
  10. Data processing

    The backend combines data streams from the Sensors
    • Builds complete model
        • Combines individual data streams from all systems
    • Performs accurate statistics of metrics
        • Held in histograms
    • Processes raw data for health analysis
        • In memory windows
        • Health rules
        • Detectors
    • Stores appropriate roll-ups
        • Quality data also in the past
  11. Health Analysis

    KPIs of services and Component Health are constantly analyzed for
    • Sudden changes
        • Load of server dropping from 100 calls/s to 5/s
    • Trends
        • Free disk space decreasing over time
    • Outliers
        • Problematic traces
    • Algorithms heavily use Median Absolute Deviation (MAD)
        • en.wikipedia.org/wiki/Median_absolute_deviation
        • A more robust “average”
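To make the MAD bullet concrete: MAD is the median of the absolute deviations from the median, so a single extreme sample barely moves it. The sketch below is a generic textbook version, not Instana's detector; the class name `MadOutliers` and the threshold parameter are assumptions.

```java
import java.util.Arrays;

/** Generic median / MAD helpers; a sketch, not Instana's health analysis code. */
final class MadOutliers {

    /** Median of the values; the input array is not modified. */
    static double median(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("values must not be empty");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** Median absolute deviation: median of |x - median(x)|. */
    static double mad(double[] values) {
        double center = median(values);
        double[] deviations = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            deviations[i] = Math.abs(values[i] - center);
        }
        return median(deviations);
    }

    /** Flags a value that lies more than `threshold` MADs away from the window median. */
    static boolean isOutlier(double value, double[] window, double threshold) {
        double center = median(window);
        double spread = mad(window);
        if (spread == 0.0) {
            // All typical samples are identical: any different value counts as an outlier.
            return value != center;
        }
        return Math.abs(value - center) / spread > threshold;
    }

    public static void main(String[] args) {
        // Calls per second, with one sudden drop at the end.
        double[] callsPerSecond = {100, 102, 98, 101, 99, 100, 5};
        System.out.println("median: " + median(callsPerSecond));
        System.out.println("MAD:    " + mad(callsPerSecond));
        System.out.println("5 is outlier:   " + isOutlier(5, callsPerSecond, 3.0));
        System.out.println("101 is outlier: " + isOutlier(101, callsPerSecond, 3.0));
    }
}
```

In this sample the drop to 5 calls/s leaves the median at 100, while the arithmetic mean would fall to roughly 86, which is why a median-based statistic is the more robust baseline.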
  12. Sudden Drop / Increase

    “Twitter Paper”: Leveraging Cloud Data to Mitigate User Experience from Breaking Bad
    E-DIVISIVE WITH MEDIANS
    arxiv.org/pdf/1411.7955.pdf
  13. Health Analysis

    Components are also analyzed with specific health signatures
    • Rules specific to the component
        • Disk Space
        • Garbage Collection
        • Cluster State
    • Generic rules
        • Hit Rate
        • Throughput
    • Large constantly growing knowledge base
  14. Why not “machine learning” these signatures?

    • Based on raw data, many metrics correlate
    • Waste of CPU to learn what we already know
    • Machine learned correlation is not actionable
    • Correlation is not causation
  15. Knowledge Implementation Example

    provider.with("load.1min", Double.class)
        .slidingWindow(LoadHealthHypothesis.WINDOW_LENGTH, MINUTES)
        .mapWithContext((context, values) -> {
            long cpuCount = cpuCount(context);
            if (cpuCount > 0) {
                // Count the samples in the window whose 1-minute load
                // exceeds twice the number of CPUs.
                return values.entrySet()
                    .stream()
                    .filter((entry) -> entry.getValue() > cpuCount * 2)
                    .count();
            }
            // Without a known CPU count no sample is counted.
            return 0L;
        })
        .filter(hypothesis.predicate())
        .mapWithContext(hypothesis.health())
        .toHealth("load");
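The slide's pipeline relies on Instana-internal types (`provider`, `hypothesis`) that the deck does not show. The counting step at its core can be reproduced in plain Java as a minimal sketch. The class `LoadRuleSketch` and the `unhealthy` predicate with its `minSamples` parameter are assumptions; the slide does not reveal what `hypothesis.predicate()` actually checks.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Stand-alone sketch of the load rule's counting step; not Instana's API. */
final class LoadRuleSketch {

    /**
     * Counts the samples in the window whose 1-minute load exceeds twice the
     * CPU count, mirroring the lambda on the slide. Returns 0 when the CPU
     * count is unknown (zero or negative).
     */
    static long overloadedSamples(Map<Long, Double> loadByTimestamp, long cpuCount) {
        if (cpuCount <= 0) {
            return 0L;
        }
        return loadByTimestamp.values().stream()
                .filter(load -> load > cpuCount * 2)
                .count();
    }

    /** Hypothetical predicate: unhealthy once at least minSamples are overloaded. */
    static boolean unhealthy(long overloadedSamples, long minSamples) {
        return overloadedSamples >= minSamples;
    }

    public static void main(String[] args) {
        Map<Long, Double> window = new LinkedHashMap<>();
        window.put(1L, 1.0);
        window.put(2L, 9.5);
        window.put(3L, 8.1);
        window.put(4L, 3.0);
        long overloaded = overloadedSamples(window, 4);
        System.out.println("overloaded samples: " + overloaded);
        System.out.println("unhealthy: " + unhealthy(overloaded, 2));
    }
}
```

The rule encodes operational knowledge directly (load above twice the CPU count is suspicious) rather than learning a threshold from data, which matches the argument on the previous slide.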
  16. What makes an Incident

    • Incidents are reported based on service health problems
    • Incident engine collects related issues and changes by traversing the graph
    • To appear in an incident report, an issue or change needs to be in chronological proximity or proximity on the graph
    • Example: Slow login service
        • High CPU on the database host connected via trace
        • High garbage collection on the JVM running the service
        • Disk full on the host serving the database
        • Volume capacity reduced by configuration
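The collection step described above can be sketched as a bounded breadth-first traversal plus a time filter. Everything here is illustrative: the class `IncidentSketch`, the `Event` type, and the `maxHops` and `maxAgeMillis` parameters are assumptions. The sketch also requires both kinds of proximity at once, which is a stricter reading than the slide's "or".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical incident collection by graph and time proximity; not Instana's engine. */
final class IncidentSketch {

    /** An issue or change observed on one component. */
    static final class Event {
        final String component;
        final String description;
        final long timestampMillis;

        Event(String component, String description, long timestampMillis) {
            this.component = component;
            this.description = description;
            this.timestampMillis = timestampMillis;
        }
    }

    /** Breadth-first search: all components within maxHops of the start node. */
    static Set<String> nearby(Map<String, List<String>> neighbours, String start, int maxHops) {
        Set<String> seen = new LinkedHashSet<>();
        seen.add(start);
        List<String> frontier = Collections.singletonList(start);
        for (int hop = 0; hop < maxHops; hop++) {
            List<String> next = new ArrayList<>();
            for (String node : frontier) {
                for (String neighbour
                        : neighbours.getOrDefault(node, Collections.<String>emptyList())) {
                    if (seen.add(neighbour)) {
                        next.add(neighbour);
                    }
                }
            }
            frontier = next;
        }
        return seen;
    }

    /** Keeps events on nearby components that happened close to the incident time. */
    static List<Event> related(List<Event> events, Set<String> components,
                               long incidentTimeMillis, long maxAgeMillis) {
        return events.stream()
                .filter(e -> components.contains(e.component))
                .filter(e -> Math.abs(incidentTimeMillis - e.timestampMillis) <= maxAgeMillis)
                .collect(Collectors.toList());
    }
}
```

For the slow login example, starting the search at the login service with two hops would reach the JVM and the database plus their hosts, so recent issues such as high garbage collection or a full disk would land in the same report.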
  17. How much does Instana Predict?

    • Prediction is really hard across systems which constantly change
    • Specific health rules can operate in a more confined problem space
        • A JVM runs more and more garbage collection
        • A disk continues to have less free disk space
        • Increasing select times on a database
        • A connection pool becomes full
    • Capacity related metrics allow prediction
        • Throughput starts degrading
        • Latency starts increasing with certain load
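The shrinking disk is the simplest of these confined cases to illustrate. One possible approach, which the deck does not confirm Instana uses, is to fit a least-squares line through recent free-space samples and extrapolate to zero. The class `DiskTrendSketch` and its method names are hypothetical.

```java
/** Linear-trend extrapolation for a shrinking resource; an assumed technique, not Instana's. */
final class DiskTrendSketch {

    private static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Least-squares slope of y over t; 0 if all timestamps are identical. */
    static double slope(double[] t, double[] y) {
        if (t.length != y.length || t.length < 2) {
            throw new IllegalArgumentException("need at least two paired samples");
        }
        double meanT = mean(t);
        double meanY = mean(y);
        double numerator = 0.0;
        double denominator = 0.0;
        for (int i = 0; i < t.length; i++) {
            numerator += (t[i] - meanT) * (y[i] - meanY);
            denominator += (t[i] - meanT) * (t[i] - meanT);
        }
        return denominator == 0.0 ? 0.0 : numerator / denominator;
    }

    /**
     * Time (in the unit of t) from the last sample until the fitted line
     * reaches zero; positive infinity if the value is not decreasing.
     */
    static double timeUntilEmpty(double[] t, double[] freeSpace) {
        double s = slope(t, freeSpace);
        if (s >= 0.0) {
            return Double.POSITIVE_INFINITY;
        }
        double intercept = mean(freeSpace) - s * mean(t);
        double zeroAt = -intercept / s;
        return zeroAt - t[t.length - 1];
    }

    public static void main(String[] args) {
        double[] hours = {0, 1, 2, 3};
        double[] freeGb = {100, 90, 80, 70};
        System.out.println("GB per hour: " + slope(hours, freeGb));
        System.out.println("hours until full: " + timeUntilEmpty(hours, freeGb));
    }
}
```

A straight line is only adequate because the problem space is so narrow; across constantly changing systems, as the slide notes, such extrapolation would not hold.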
  18. Agent Design

    [Diagram: Agent with Elasticsearch sensor, Tomcat sensor, JVM sensor, Linux sensor, Auto Discovery, Communication, Local Sensor Memory & Context, Compression]
    • Specific sensors collect specific metrics
      Reduces noise
    • 1 second resolution where applicable
      Ensures high accuracy of data
    • Agent compresses to reduce bandwidth usage
  19. Agent Design - Technology

    • Java 8
    • Apache Karaf 4 OSGi Container
    • Sensors implemented as OSGi bundles
    • Fully pluggable and auto updatable
    • Agent bundles provide infrastructure
    • Sensors use regular drivers to collect metrics
    • Tracing done by instrumentation
    • Open Tracing Implementor
    • “Dumb” Agent - Process data in backend
  20. Backend Design

    [Diagram labels: Data Ingestion & Health Calculation, Realtime Stream Processing, Incident Detection, Alerting, System Management, Dependency, Health, Metrics, Dynamic Knowledge Graph, API & CLI, Configuration]
    • Decompress data into overall system state
      We know the state of the system under monitoring at any time
    • Apply semantics to the raw data
      Understand that high CPU load is bad, high throughput is good
    • Correlate semantics across individual components
      When throughput goes down, it could be correlated to CPU usage going up
  21. Backend Design - Technology

    • Java 8
    • Dropwizard
    • RxJava & Reactor
    • HdrHistogram
    • Home grown in memory graph DB
    • Elasticsearch
    • Cassandra
    • Redis
    • Kafka
    • UI with React and Three.js
  22. Backend Design - Sizing End of 2016

    • 100+ machines in our cloud
    • 500 CPUs
    • 2 TB RAM
    • 2 Regions and 3 Availability Zones
    • 500 Nomad services
    • Cassandra: 56 TB, 100k writes/s
    • Elasticsearch: 32 TB, 20k writes/s
    • 5.5 billion documents
    • 500 GB trace data every day
  23. Further reading

    • Complex systems always have failures
      www.instana.com/blog/no-root-cause-microservice-applications
    • Component vs Service Health
      www.instana.com/blog/monitoring-microservices-part-ii-understand-analyze-derive-application-health
    • Focus on objectives of services
      www.instana.com/blog/managing-quality-service-part-management-service-quality-model
  24. Take-Aways

    • Understand semantics
    • Build meta-models
    • Do not calculate the obvious or the nonsensical
    • Do not read from a database
        • Processing live data is much faster
        • RAM is cheap
    • Optimize your code regularly
        • Stream processing big data has high multipliers