Monitoring unknown unknowns with machine intelligence - DevOpsDays TLV 2017

Guy Fighel
November 11, 2017

I gave this talk at DevOpsDays Tel Aviv 2017. It walks through a few real-world examples where not knowing what you don’t know led to massive outages and service disruptions, and it includes some more technical examples in Python.

Transcript

  1. @guyfig On-Call Engineer by Nature. "If a tree falls in a forest and no one is around to hear it, does it make a sound?"
  2. Observability is a superset of monitoring and instrumentation: making systems debuggable and understandable (@mipsytipsy). Do you really know what to observe? Instrumentation is mostly developer-driven. What is the output? A dashboard? An exploration tool?
  3. Observability in Control Theory: one can determine the behavior of the entire system from the system's outputs.
  4. The Observability Quadrant (based on the Johari window): static thresholds, defined alerts, static runbooks, anomaly detection, predictions, external knowledge, knowledge, recommendations, auto collaboration, inference, auto correlations, semantic analysis, decision making.
  5. Find the Problem. Thresholds? Baseline? Anomaly? Scale matters. Stationary noise matters. Use autocorrelation.
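
The "use autocorrelation" advice on slide 5 can be sketched in plain Python. This is a minimal illustration and not code from the talk: the helper names `autocorrelation` and `dominant_period`, the `min_lag`/`max_lag` parameters, and the synthetic 24-sample cycle are all assumptions made for the example.

```python
import math


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at `lag` (0 < lag < len(series))."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    if variance == 0:
        return 0.0  # a constant series carries no periodic signal
    covariance = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    )
    return covariance / variance


def dominant_period(series, min_lag, max_lag):
    """Lag in [min_lag, max_lag] with the highest autocorrelation.

    Very short lags are skipped (via min_lag) because neighbouring samples
    of a smooth signal are trivially correlated.
    """
    return max(
        range(min_lag, max_lag + 1),
        key=lambda lag: autocorrelation(series, lag),
    )


# Hypothetical metric: a clean cycle that repeats every 24 samples.
series = [50 + 20 * math.sin(2 * math.pi * t / 24) for t in range(240)]
period = dominant_period(series, min_lag=6, max_lag=48)
print(period, round(autocorrelation(series, period), 3))
```

Once the dominant period is known, a baseline can compare each sample with the value one period earlier instead of with a single static threshold.
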
  6. Independent component analysis (ICA) separates a multivariate signal into additive subcomponents that are maximally independent. from sklearn.decomposition import FastICA, PCA
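
The import on slide 6 suggests scikit-learn's `FastICA`. The sketch below, which is not taken from the talk, mixes two made-up source signals and asks `FastICA` to separate them again. The signals, the mixing matrix, and the noise level are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)

# Two hypothetical independent sources: a smooth periodic load and a
# square wave (for example, a batch job switching on and off).
sources = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]
sources += 0.05 * rng.standard_normal(sources.shape)

# Each observed metric is a different linear mix of the two sources.
mixing = np.array([[1.0, 0.5], [0.5, 2.0]])
observed = sources @ mixing.T

ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(observed)  # shape (2000, 2)

# ICA recovers sources only up to order, sign, and scale, so compare by
# absolute correlation: rows are recovered components, columns are sources.
match = np.abs(np.corrcoef(recovered.T, sources.T))[:2, 2:]
print(match.round(2))
```

Each source should correlate strongly with exactly one recovered component. On real metrics the number of underlying sources is unknown, so `n_components` becomes a judgment call.
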
  7. Find the Problem. CPU 90%. Time in minutes. EC2 instance changed from t2.small to m3.xl. Events and context matter. Anomaly?
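
One way to read slide 7 is that a known change event, such as an instance resize, should reset the baseline instead of raising alerts. The sketch below is an assumption-laden illustration and not the speaker's method: `flag_anomalies`, its parameters, and the CPU numbers are invented for the example.

```python
from statistics import mean, stdev


def flag_anomalies(samples, change_events, window=5, z_limit=3.0, min_sigma=1.0):
    """Return indices whose value is far from the rolling baseline.

    samples:       metric values, e.g. CPU % per minute
    change_events: indices at which a known change happened (deploy, resize)
    The baseline is discarded at every change event, so the new level is
    learned afresh instead of being reported as an anomaly.
    """
    baseline, flagged = [], []
    for i, value in enumerate(samples):
        if i in change_events:
            baseline = []  # context changed: the old baseline no longer applies
        if len(baseline) >= window:
            recent = baseline[-window:]
            mu = mean(recent)
            sigma = max(stdev(recent), min_sigma)  # floor avoids a zero band
            if abs(value - mu) > z_limit * sigma:
                flagged.append(i)
                continue  # keep outliers out of the baseline
        baseline.append(value)
    return flagged


# Made-up data: CPU near 90% on a small instance, a resize at index 10,
# CPU near 20% afterwards, and one genuine spike at index 21.
cpu = [88, 90, 87, 89, 91, 88, 90, 89, 87, 90,
       20, 22, 19, 21, 20, 23, 21, 20, 22, 21, 20, 95, 21, 22]

with_context = flag_anomalies(cpu, change_events={10})
without_context = flag_anomalies(cpu, change_events=set())
print(with_context)
print(without_context)
```

With the resize event supplied, only the spike is reported. Without it, every sample after the resize is flagged, which is the kind of noise the slide warns about.
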
  8. What Can Machines Do? Process different types of data, transform it fast, and handle huge amounts in real time. Automate and adapt. Anomaly detection. Apply semantic text similarities to find patterns (information retrieval). Apply auto correlation models. Evolve and adapt over time based on human interaction.
  9. The Goal: Centralization. Observability for systems with imperfect outputs. Event enrichment, symptom detection, and inference. Automatic outlier detection. Automatic correlation. Get closer to the mathematical definition from control theory.
  10. Get a Common Schema. - Define the model. Use a single schema (Apache Avro). - Events are agnostic: they can represent logs, stack traces, metrics, user actions, HTTP events, etc. - Every event should have a set of common fields as well as optional key/value attributes.
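
A common event model along the lines of slide 10 might look like the sketch below. The field names (`timestamp`, `source`, `kind`, `message`, `attributes`), the namespace, and the helper functions are hypothetical. The dictionary follows Apache Avro's JSON schema format, but no Avro library is used here, and `conforms` is only a shallow field-name check.

```python
import json
from dataclasses import asdict, dataclass, field

# Hypothetical Avro record schema: common fields plus a free-form string map.
EVENT_SCHEMA = {
    "type": "record",
    "name": "Event",
    "namespace": "example.observability",
    "fields": [
        {"name": "timestamp", "type": "long"},
        {"name": "source", "type": "string"},
        {"name": "kind", "type": "string"},
        {"name": "message", "type": "string"},
        {"name": "attributes",
         "type": {"type": "map", "values": "string"},
         "default": {}},
    ],
}


@dataclass
class Event:
    timestamp: int   # epoch milliseconds
    source: str      # host or service that emitted the event
    kind: str        # "log", "metric", "http", ...
    message: str
    attributes: dict = field(default_factory=dict)  # optional key/value pairs


def from_log(timestamp, host, line):
    return Event(timestamp, host, "log", line)


def from_metric(timestamp, host, name, value):
    return Event(timestamp, host, "metric", f"{name}={value}",
                 {"name": name, "value": str(value)})


def conforms(event, schema=EVENT_SCHEMA):
    """Shallow check: the event has exactly the fields the schema names."""
    return set(asdict(event)) == {f["name"] for f in schema["fields"]}


log_event = from_log(1510387200000, "web-01", "NullPointerException in handler")
metric_event = from_metric(1510387200000, "web-01", "cpu", 90)
print(json.dumps(asdict(metric_event)))
```

Because a log line and a metric share the same envelope, downstream correlation code can treat them uniformly and look at `attributes` only when it needs type-specific detail.
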
  11. Best Practices. Deterministic models are better to start with (fuzzy logic, rules). Choose your logic and start running it across your data (schema). Apply similarity checks to strings first (TF-IDF, BM25, fuzzy matching, other classifiers). Look into correlations, starting with the simple, obvious ones, before building classifiers (unsupervised/semi-supervised learning is much more relevant overall). Build your prediction models on time series data first (statistics has solid models). Time and context are dimensions you will be able to start addressing.
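
The string-similarity step on slide 11 can be tried without any library. The sketch below implements one common smoothed TF-IDF variant with cosine similarity. The whitespace tokenizer, the function names, and the alert texts are assumptions for illustration only.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight dict) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency, so no weight is ever zero.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in doc_freq.items()}
    return [
        {term: freq * idf[term] for term, freq in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up alert texts.
alerts = [
    "disk usage above 90% on db-01",
    "disk usage above 95% on db-02",
    "payment service returned HTTP 500",
]
vectors = tfidf_vectors(alerts)
sim_disk = cosine(vectors[0], vectors[1])   # two disk alerts
sim_other = cosine(vectors[0], vectors[2])  # disk alert vs. payment alert
print(round(sim_disk, 3), sim_other)
```

The two disk alerts score well above the unrelated pair, which is enough to group them before any classifier is trained. A deterministic rule such as "group alerts whose similarity exceeds a fixed cutoff" fits the slide's advice to start simple.
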
  12. Use It in Production. - Your team == your users - Ask for feedback - Re-calculate relevancy - Apply recommendations based on your own team's knowledge