Monitoring unknown unknowns with machine intelligence

Guy Fighel
October 24, 2017

I gave this talk at AllDaysDevOps 2017. It walks through a few real-world examples where not knowing what you don’t know led to massive outages and service disruptions, and asks why downtime still happens even though modern DevOps teams run multiple monitoring tools, instrument hundreds of metrics and capture billions of data points. Instead of implementing more monitoring, it proposes a future where DevOps teams augment their existing tooling with AI and machine learning to draw richer correlations across events, metrics and logs, surfacing insights about threats to uptime that aren’t even being monitored. Put another way: how DevOps teams can get closer to a state of “known knowns”.

Transcript

  1. @guyfig, On-Call Engineer by Nature. "If a tree falls in a forest and no one is around to hear it, does it make a sound?"
  2. "Monitoring is the action of observing and checking the behaviour and outputs of a system and its components over time." (@grepory, Monitorama 2016) Do you really know what to observe? Over time? How long can you remember? Systems are complex.
  3. Observability is a superset of monitoring and instrumentation: making systems debuggable and understandable (@mipsytipsy). Do you really know what to observe? Instrumentation is mostly developer driven. What is the output? A dashboard? An exploration tool?
  4. Observability in Control Theory. A system is said to be observable if, for any possible sequence of state and control vectors, the current state can be determined in finite time using only the outputs. If a system is not observable, the current values of some of its states cannot be determined through output sensors, which implies that their value is unknown to the controller.
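The definition on slide 4 has a well-known concrete form for one special case, linear time-invariant systems. The slide states only the general definition, so the linear model and the symbols $A$, $B$, $C$, $D$, $n$ below are assumptions added for illustration. In that setting, a system is observable exactly when its observability matrix has full column rank (the Kalman rank condition):

```latex
\dot{x}(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t) + D\,u(t), \qquad x(t) \in \mathbb{R}^{n}

\mathcal{O} =
\begin{bmatrix} C \\ CA \\ CA^{2} \\ \vdots \\ CA^{n-1} \end{bmatrix},
\qquad
\text{observable} \iff \operatorname{rank}(\mathcal{O}) = n
```

Read loosely for monitoring: if the outputs a system emits do not "span" its internal state, no amount of watching those outputs will reveal that state.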
  5. The Observability Quadrant (based on the Johari window)
    - Static thresholds
    - Defined alerts
    - Static runbooks
    - Anomaly detection
    - Predictions
    - External knowledge
    - Knowledge
    - Recommendations
    - Auto collaboration
    - Inference
    - Auto correlations
    - Semantic analysis
    - Decision making
  6. The I-Space
    - Public knowledge (industry best practices)
    - Proprietary knowledge (our specific applications)
    - Personal knowledge (experience, troubleshooting)
    - Common sense (what everyone knows)
    When we set up human-driven detection and monitoring, we operate mostly on non-diffused knowledge.
  7. Find The Problem. Thresholds? Baseline? Anomaly?
    - Scale matters
    - Stationary noise matters
    - Use autocorrelation
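Slide 7 recommends autocorrelation without showing how it helps. A minimal sketch follows, assuming a regularly sampled metric held in a plain Python list. The function names and the synthetic 24-sample cycle are hypothetical and not taken from the talk. The idea is that the lag at which a series correlates best with itself reveals its seasonality, and a baseline built per season is less likely to flag normal daily swings as anomalies.

```python
import math


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at `lag` (0 <= lag < len(series))."""
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must be in [0, len(series))")
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0:
        return 0.0  # constant series: no structure to measure
    cov = sum((series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag))
    return cov / var


def dominant_period(series, max_lag):
    """Return the lag of the strongest autocorrelation peak, or None.

    Only local maxima are considered, because very short lags are almost
    always highly correlated and would otherwise win trivially.
    """
    acf = [autocorrelation(series, lag) for lag in range(max_lag + 2)]
    peaks = [
        lag for lag in range(1, max_lag + 1)
        if acf[lag - 1] < acf[lag] >= acf[lag + 1]
    ]
    return max(peaks, key=lambda lag: acf[lag]) if peaks else None


# Hypothetical metric: a clean cycle repeating every 24 samples, 10 cycles long.
samples = [50 + 20 * math.sin(2 * math.pi * t / 24) for t in range(240)]
period = dominant_period(samples, max_lag=36)
```

With the detected period in hand, one option is to compare each new sample against earlier samples at the same phase of the cycle rather than against a single static threshold.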
  8. Find The Problem. [Chart: CPU 90%, time in minutes] EC2 instance changed from t2.small to m3.xl. Events and context matter. Anomaly?
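Slide 8's point is that a metric jump means something different once you know a change event preceded it. The sketch below is one hedged way to express that: it attaches recent change events to each flagged anomaly. The `ChangeEvent` type, the `explain_anomalies` function, the 10-minute window and the timestamps are all hypothetical; only the event description comes from the slide.

```python
from dataclasses import dataclass


@dataclass
class ChangeEvent:
    timestamp: int    # minutes since the start of the observation window
    description: str


def explain_anomalies(anomaly_times, events, window=10):
    """Map each anomaly time to the change events in the preceding `window` minutes."""
    return {
        t: [e.description for e in events if 0 <= t - e.timestamp <= window]
        for t in anomaly_times
    }


# Hypothetical data: one change event, two anomalies flagged by some detector.
events = [ChangeEvent(42, "EC2 instance changed from t2.small to m3.xl")]
context = explain_anomalies([45, 120], events)
```

An anomaly that arrives with a matching change event is a candidate for "expected behaviour after a change", while one with an empty list still needs a human or a further model to explain it.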
  9. What Humans Can (Should) Do?
    - Cover more detectors, write more checks and enforce instrumentation (this is the only way to diffuse knowledge and codify it)
    - Start sending events. Every event. They tell a story and give context (logs should be structured and emitted as events as well)
    - Use metrics with as high a cardinality as possible. Embed context inside the metric (labels)
    - Don’t use dashboards just because they are pretty. There is no point
    - Alerts are facts. Make sure you treat them as such
  10. Machine Intelligence. “Machine intelligence” is a unifying term for what others call machine learning and artificial intelligence. Specifically, in our context: (1) Natural Language Understanding/Processing; (2) Statistical Analysis/Models; (3) Unsupervised/Supervised Learning; (4) Reasoning, Inference; (5) Recommendation Modeling; (6) Knowledge Representation.
  11. What Can Machines Do?
    - Process different types of data, transform it fast and handle huge amounts in real time
    - Automate and adapt anomaly detection
    - Apply semantic text similarities to find patterns
    - Apply auto-correlation models
    - Evolve and adapt over time, based on human interaction
  12. The Goal: observability for systems with imperfect outputs
    - Event enrichment, symptom detection and inference
    - Automatic outlier detection
    - Automatic correlation
    - Get closer to the control theory mathematical definition
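Slide 12 lists automatic outlier detection as a goal but does not name a method. As one illustrative option (an assumption, not the talk's technique), the sketch below uses the modified z-score based on the median absolute deviation (MAD), which stays stable even when the outliers themselves distort the mean. The function name, the sample latencies and the conventional 3.5 cutoff are choices made for this example.

```python
import statistics


def mad_outliers(values, threshold=3.5):
    """Return indices whose modified z-score exceeds `threshold`.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation from the median.
    """
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return []  # more than half the points are identical: score is undefined
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical request latencies in milliseconds, with one obvious spike.
latencies = [10, 11, 9, 10, 12, 10, 11, 95, 10, 9]
flagged = mad_outliers(latencies)
```

Because the median and MAD need no tuning per metric, a check like this can run across many series without a human choosing thresholds for each one.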
  13. Get a Common Schema, Use a Common Schema
    - Define the model. Use a single schema (Apache Avro)
    - Events are agnostic. They can represent logs, stack traces, metrics, user actions, HTTP events, etc.
    - Every event should have a set of common fields as well as optional key/value attributes
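To make slide 13 concrete, here is a minimal sketch of what such a schema could look like, written as an Avro record definition held in a Python dict. The slide names Apache Avro but gives no fields, so every field name here (`id`, `timestamp`, `source`, `kind`, `attributes`) is a hypothetical choice. The helper functions use plain Python only and do not call any Avro library.

```python
import json

# Hypothetical Avro record: a few common fields plus a free-form string map.
EVENT_SCHEMA = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "timestamp", "type": {"type": "long", "logicalType": "timestamp-millis"}},
        {"name": "source", "type": "string"},
        {"name": "kind", "type": "string"},
        {"name": "attributes", "type": {"type": "map", "values": "string"}, "default": {}},
    ],
}


def make_event(event_id, timestamp_ms, source, kind, **attributes):
    """Build an event dict; optional attributes are stored as strings."""
    return {
        "id": event_id,
        "timestamp": timestamp_ms,
        "source": source,
        "kind": kind,
        "attributes": {key: str(value) for key, value in attributes.items()},
    }


def missing_fields(event):
    """List schema fields that have no default and are absent from `event`."""
    required = [f["name"] for f in EVENT_SCHEMA["fields"] if "default" not in f]
    return [name for name in required if name not in event]


# The same shape carries a log line and a metric sample.
log_event = make_event("e1", 1508803200000, "api-gateway", "log",
                       level="error", message="upstream timeout")
metric_event = make_event("e2", 1508803260000, "host-17", "metric",
                          name="cpu.utilization", value=0.9)
schema_json = json.dumps(EVENT_SCHEMA, indent=2)  # text form for an .avsc file
```

Keeping the common fields small and pushing everything else into the attribute map is one design choice among several; a stricter schema with typed optional fields would trade flexibility for validation.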
  14. Best Practices
    - Deterministic models are better to start with (fuzzy logic, rules)
    - Choose your logic and start running it across your data (schema)
    - Apply similarity checks to strings first (TF-IDF, BM25, fuzzy matching, other classifiers)
    - Look into correlations, starting with the simple, obvious ones, before building classifiers (unsupervised learning is much more relevant overall)
    - Build your prediction models on time series data first (statistics has solid models)
    - Time and context are dimensions you will be able to start addressing
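Slide 14 suggests starting with string similarity such as TF-IDF. The sketch below shows the mechanics on a handful of alert messages using only the standard library. The function names, the whitespace tokenizer, the smoothed IDF variant and the sample alerts are assumptions made for illustration, and a production system would more likely use an established library.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)  # smoothed IDF
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical alert texts: two near-duplicates and one unrelated alert.
alerts = [
    "disk usage high on db-01",
    "disk usage high on db-02",
    "payment api latency spike",
]
vectors = tfidf_vectors(alerts)
similar = cosine(vectors[0], vectors[1])
unrelated = cosine(vectors[0], vectors[2])
```

Alerts whose similarity exceeds a chosen cutoff can then be grouped into one incident candidate, which is a deterministic first step before any learned classifier.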
  15. Use It In Production
    - Test your logic in production
    - Improve and get real feedback from your on-call engineers
    - Build an automated feedback loop to adapt the models