Finding Meaning in Operational Data

Overview of what operational data is, where to start, and how to leverage it.

Brad Lhotsky

June 20, 2017

Transcript

  1. Finding Meaning in your Operational Data Presented by Brad Lhotsky

  2. Who Am I? • Systems and Security at Craigslist • Infrastructure Monitoring & Security at Booking.com • Recovering • Perl Programmer • Linux/BSD Systems Admin • Network Security Specialist • PostgreSQL Administrator • ElasticSearch Janitor • DNS Voyeur • OSSEC Core Team Member
  3. Expectations ‣ Operational Data Types ‣ Uses for Operational Data ‣ Meaningful Features ‣ Data Stores and Brad, A Love Story?
  4. Operational Data Types

  5. Administrative & Meta-Data ‣ Inventory ‣ Hardware ‣ Software ‣ Builds or Roles ‣ Services ‣ Users and Groups ‣ Employee/Contractor ‣ Managers ‣ ACLs
  6. Monitoring ‣ State ‣ OK / NOT OK ‣ UP / DOWN ‣ Package Version ‣ Time Series ‣ Counter ‣ Rate ‣ Statistical Summaries
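
The Counter and Rate entries above are related: a rate is usually derived from successive counter samples. A minimal sketch, assuming samples arrive as (timestamp in seconds, counter value) pairs and that a drop in value means the counter reset to zero; the function name and sample data are hypothetical.

```python
from typing import List, Tuple


def counter_to_rates(samples: List[Tuple[float, int]]) -> List[Tuple[float, float]]:
    """Turn (timestamp_seconds, counter_value) samples into per-second rates."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue  # skip duplicate or out-of-order samples
        delta = v1 - v0
        if delta < 0:
            # Assumption: a decrease means the counter restarted from zero.
            delta = v1
        rates.append((t1, delta / elapsed))
    return rates


# Hypothetical samples taken once a minute; the last one follows a reset.
samples = [(0, 100), (60, 160), (120, 280), (180, 40)]
rates = counter_to_rates(samples)
```

Storing the raw counter and deriving the rate at read time keeps the original data available if the reset assumption turns out to be wrong.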
  7. Events ‣ State Changes ‣ Package Updated ‣ Service Stopped ‣ System Events ‣ Syslog Message ‣ SNMP Traps ‣ Application Events ‣ Access ‣ Errors ‣ Traces
  8. Uses for Operational Data Deving and Oping Your Data

  9. Monitoring and Metrics

  10. “Metrics are your culture.” - Nicole Forsgren, "How Metrics Shape Your Culture", Monitorama PDX 2016
  11. "How Metrics Shape Your Culture" • You can't improve what you don't measure • Always measure things that matter • Things measured are things managed • Metrics can be gamed • Metrics inform incentives • Not everything that can be counted counts • Hard to measure doesn't mean it isn't worth measuring
  12. All of your monitoring is probably wrong. ... and that's probably O.K.
  13. Automation ‣ Can a Machine read my data? ‣ Autoscale ‣ Trend detection ‣ Service Level Roll Ups ‣ Reporting
  14. Capacity Planning ‣ Predicting System Stress Levels ‣ Make Intelligent Projections ‣ Test those predictions
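
One simple way to make a projection like the ones above is a straight-line fit over recent usage. This is a sketch only, assuming daily disk-usage percentages and linear growth; the function names, the 95% limit, and the data are hypothetical, and real capacity data may need a different model.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until(limit, days, used_pct):
    """Days after the last sample until the fitted line reaches `limit`."""
    slope, intercept = fit_line(days, used_pct)
    if slope <= 0:
        return None  # usage is flat or shrinking; no projected crossing
    return (limit - intercept) / slope - days[-1]


# Hypothetical disk usage (percent) sampled once a day.
days = [0, 1, 2, 3, 4]
used = [50.0, 52.0, 54.0, 56.0, 58.0]
remaining = days_until(95.0, days, used)
```

Testing the prediction then means recording `remaining` and checking it against what the disk actually does over the following days.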
  15. Alerting ‣ Disrupting People's Lives at 95% Disk Full ‣ Thresholds -> Change Detection ‣ State Changes
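
To illustrate the move from thresholds to change detection, here is a rough sketch that compares a static threshold with a check on how much the recent average has shifted. The window size, the 50% relative-change cutoff, and the sample series are assumptions chosen for the example, not values from the talk.

```python
def threshold_alert(value, limit=95.0):
    """Classic static threshold: fires whenever the value is at or over the limit."""
    return value >= limit


def change_detected(series, window=3, min_change=0.5):
    """Fire when the mean of the last `window` points differs from the mean
    of the preceding `window` points by more than `min_change` (relative)."""
    if len(series) < 2 * window:
        return False
    previous = series[-2 * window:-window]
    recent = series[-window:]
    prev_mean = sum(previous) / window
    recent_mean = sum(recent) / window
    if prev_mean == 0:
        return recent_mean != 0
    return abs(recent_mean - prev_mean) / abs(prev_mean) > min_change


# Hypothetical disk-usage percentages.
steady_but_full = [95.0, 95.1, 95.0, 95.1, 95.0, 95.1]
sudden_growth = [40.0, 40.0, 41.0, 60.0, 70.0, 80.0]

static_fires_on_steady = threshold_alert(steady_but_full[-1])
change_fires_on_steady = change_detected(steady_but_full)
static_fires_on_growth = threshold_alert(sudden_growth[-1])
change_fires_on_growth = change_detected(sudden_growth)
```

In this example the static threshold pages someone for a disk that has been stable at 95%, while the change check stays quiet there and fires on the disk that doubled its usage well below the threshold.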
  16. Anomaly Detection • Anomalies != Alerts • 1 million metrics per minute • 0.3% at a distance of > 3σ • 4.3 Million Anomalies / day
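
The arithmetic behind the figures above can be checked directly. The sketch uses the slide's rounded 0.3% and, for comparison, the two-sided tail of a normal distribution beyond 3σ (about 0.27%); the variable names are illustrative.

```python
import math

metrics_per_minute = 1_000_000
minutes_per_day = 24 * 60

# The slide's rounded fraction of points beyond 3 sigma.
slide_fraction = 0.003
anomalies_per_day = metrics_per_minute * slide_fraction * minutes_per_day

# Exact two-sided normal tail: P(|Z| > 3) = erfc(3 / sqrt(2)).
normal_tail = math.erfc(3 / math.sqrt(2))
anomalies_per_day_exact = metrics_per_minute * normal_tail * minutes_per_day

print(f"{anomalies_per_day / 1e6:.2f} million/day at 0.3%")
print(f"{anomalies_per_day_exact / 1e6:.2f} million/day at the exact normal tail")
```

Either way the count is in the millions per day, which is why anomalies cannot be treated as alerts.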
  17. Statistics Sidebar • Your data probably isn't normal • You need to use modeling to perform anomaly detection • The residuals should fit a normal distribution • Modeling is only possible if you can explore and interact with the data • There are algorithms and their parameters matter
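
As a toy illustration of modeling first and then testing the residuals, the sketch below fits a naive seasonal-mean model and flags residuals more than 3σ from their mean. The model choice, the period, and the data are assumptions for the example; this is not the method used in the talk, and a real series would call for the kind of exploration the slide describes.

```python
import statistics


def seasonal_baseline(series, period):
    """Model: the mean value observed at each position within the period."""
    buckets = [[] for _ in range(period)]
    for i, value in enumerate(series):
        buckets[i % period].append(value)
    return [statistics.fmean(b) if b else 0.0 for b in buckets]


def residual_anomalies(series, period, sigmas=3.0):
    """Indexes whose residual (observed - model) lies more than `sigmas`
    standard deviations from the mean residual."""
    baseline = seasonal_baseline(series, period)
    residuals = [v - baseline[i % period] for i, v in enumerate(series)]
    spread = statistics.pstdev(residuals)
    if spread == 0:
        return []
    center = statistics.fmean(residuals)
    return [i for i, r in enumerate(residuals) if abs(r - center) > sigmas * spread]


# Hypothetical metric with a repeating 4-step pattern and one injected spike.
series = [10.0, 20.0, 30.0, 20.0] * 20
series[41] += 50.0
anomalies = residual_anomalies(series, period=4)
```

The raw series is not normally distributed because of its repeating pattern, so a 3σ test on the raw values would be misleading; the test is applied to what the model fails to explain.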
  18. Exploration

  19. RStudio ‣ Open source IDE for the R programming language

  20. R's Tidyverse ‣ Common data cleanup, normalization, manipulation, and display

  21. Apache Kafka & Spark ‣ Robust data pipeline with interactivity and deployment capabilities
  22. Meaningful Features

  23. Open and Extensible ‣ Integrations with other projects ‣ Open API ‣ Good Documentation ‣ Modular / Plugin Structure ‣ Community
  24. Reliability vs. Performance

  25. Retention ‣ How easy is it to age data off? ‣ What regulations or laws apply to the data? ‣ Expectations from: ‣ Customers ‣ Employees ‣ Managers ‣ Peers ‣ Legal
  26. Privacy and Security ‣ What do you keep on your users in your ops data? ‣ Who might come calling for it? ‣ How comfortable are you handing it over to Trump? ‣ Anyone hear about MongoDB? ‣ Can the store provide RBAC?
  27. Places People Stick DevOps Datas

  28. MySQL ‣ Large Community ‣ Forks and Oracle ‣ Performance First ‣ SQL Interface ‣ Limited Data Types ‣ Web > BI ‣ Suitable for Administrative Data
  29. PostgreSQL ‣ Large Community ‣ Reliability First ‣ Open and Extensible ‣ PGXN ‣ CitusData ‣ GreenPlum ‣ EnterpriseDB ‣ Native Support for IP Addresses ‣ Extensible Data Types ‣ Suitable for Administrative Data
  30. Graphite ‣ Large Community ‣ Interchangeable Components, à la Microservices ‣ Simple API ‣ Rampant Open Source Adoption ‣ Scalable ‣ Compatibility ‣ Grafana, Statsd, Riemann, Bosun, Cabot, Seyren ‣ etc., etc. ‣ Suitable for Time Series Data ‣ Smallest Resolution: seconds
  31. Metrics: Wildcards ‣ security.logging.indexer.*.total

  32. Combining Metrics ‣ sumSeries(security.logging.indexer.*.total)

  33. Comparing Metrics ‣ alias(sumSeries(security.logging.indexer.*.total),"Today") ‣ alias(timeShift(sumSeries(security.logging.indexer.*.total),"7d"),"Last Week")
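
Target expressions like the ones above are passed to Graphite's render endpoint as `target` parameters. A minimal sketch of building such a request URL, assuming a hypothetical host (`graphite.example.com`) and a one-day window; `render_url` is an illustrative helper, not part of Graphite.

```python
from urllib.parse import urlencode


def render_url(base_url, targets, start="-1d", fmt="json"):
    """Build a Graphite /render URL with one `target` parameter per expression."""
    params = [("target", t) for t in targets]
    params += [("from", start), ("format", fmt)]
    return f"{base_url.rstrip('/')}/render?{urlencode(params)}"


series = "sumSeries(security.logging.indexer.*.total)"
targets = [
    f'alias({series},"Today")',
    f'alias(timeShift({series},"7d"),"Last Week")',
]

# Hypothetical Graphite host; substitute your own.
url = render_url("https://graphite.example.com", targets)
print(url)
```

Fetching the URL with any HTTP client returns both series in one response, so today's traffic and last week's can be compared by a script as easily as on a dashboard.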

  34. Advanced Tricks ‣ alias(alpha(color(areaBetween(holtWintersConfidenceBands(maxSeries(general.es.*.jvm.mem.heap_used_bytes))),"gray"),0.1),"Hot Winter Confidence Bands") ‣ color(alias(maxSeries(general.es.*.jvm.mem.heap_used_bytes),"Max Heap Size"),"red")
  35. OpenTSDB ‣ Metrics 2.0 (metrics support tagging) ‣ Hadoop / HBase backed ‣ SQL-like Language ‣ Zero Data Loss ‣ Compatibility ‣ Carbon, Grafana, Statsd, Riemann, Bosun ‣ Suitable for Time Series Data ‣ Smallest Resolution: milliseconds
  36. InfluxDB ‣ Metrics 2.0 (metrics support tagging) ‣ SQL-like Language ‣ Zero Data Loss ‣ Compatibility ‣ Carbon, Grafana, Statsd, Riemann, Bosun ‣ Suitable for Time Series Data ‣ Smallest Resolution: nanoseconds
  37. ElasticSearch ‣ Well Documented API ‣ Many Open Source Integrations ‣ Lucene backed text search ‣ Scalable ‣ "Jepsen ElasticSearch" re: CAP
  38. Recap

  39. Data Types

      Store          Meta-Data  State  Time Series  Events
      Graphite       No         No     Yes          Kinda
      InfluxDB       Kinda      Yes    Yes          Kinda
      OpenTSDB       Kinda      Yes    Yes          Kinda
      MySQL          Yes        Yes    No           Kinda
      PostgreSQL     Yes        Yes    No           Kinda
      ElasticSearch  No         Kinda  Kinda        Yes

  40. Features of Your Data

      Store          Interval           Cardinality              Data Type    Aging
      Graphite       Fixed, Regular     Low                      Numeric      Roll up
      InfluxDB       Fixed (best), Any  High                     Any          Configurable
      OpenTSDB       Any                High                     Numeric      n/a
      MySQL          Any                Keys: Low, Values: High  Structured*  None
      PostgreSQL     Any                Keys: Low, Values: High  Structured*  None
      ElasticSearch  Any                Keys: Low, Values: High  Any          None

  41. Features of the Store

      Store          Security  Scalability  Performance  Reliability
      Graphite       Low       High         High*        Medium
      InfluxDB       Low       High         Medium       High
      OpenTSDB       Low       High         Low          High
      MySQL          Medium    Medium       Medium*      High*
      PostgreSQL     High      Medium       Medium*      High
      ElasticSearch  Low       High         High         Low

  42. Thank you! brad.lhotsky@gmail.com https://twitter.com/reyjrar https://github.com/reyjrar https://speakerdeck.com/reyjrar https://www.craigslist.org/about/craigslist_is_hiring