Slide 1

Slide 1 text

Finding Meaning in your Operational Data Presented by Brad Lhotsky

Slide 2

Slide 2 text

Who Am I? • Systems and Security at Craigslist • Infrastructure Monitoring & Security at Booking.com • Recovering • Perl Programmer • Linux/BSD Systems Admin • Network Security Specialist • PostgreSQL Administrator • ElasticSearch Janitor • DNS Voyeur • OSSEC Core Team Member

Slide 3

Slide 3 text

Expectations ‣Operational Data Types ‣Uses for Operational Data ‣Meaningful Features ‣Data Stores and Brad, A Love Story?

Slide 4

Slide 4 text

Operational Data Types

Slide 5

Slide 5 text

Administrative & Meta-Data ‣ Inventory ‣ Hardware ‣ Software ‣ Builds or Roles ‣ Services ‣ Users and Groups ‣ Employee/Contractor ‣ Managers ‣ ACLs

Slide 6

Slide 6 text

Monitoring ‣ State ‣ OK / NOT OK ‣ UP / DOWN ‣ Package Version ‣ Time Series ‣ Counter ‣ Rate ‣ Statistical Summaries
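The counter and rate types listed on this slide are related: a rate is usually derived from successive counter samples. Here is a minimal sketch of that conversion, assuming samples arrive as (timestamp, value) pairs and that a drop in value means the counter was reset. The function name `counter_to_rate` and the sample data are hypothetical.

```python
def counter_to_rate(samples):
    """Convert (timestamp_seconds, counter_value) samples into per-second rates.

    Assumption: a decreasing value means the counter reset to zero,
    so the new value itself is treated as the increase.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue  # skip duplicate or out-of-order samples
        delta = v1 - v0 if v1 >= v0 else v1
        rates.append((t1, delta / elapsed))
    return rates


# Hypothetical samples taken once a minute; the last one follows a reset.
samples = [(0, 100), (60, 700), (120, 1300), (180, 60)]
for timestamp, rate in counter_to_rate(samples):
    print(timestamp, rate)
```

Storing the raw counter and deriving the rate at read time keeps the original data intact if the reset handling later turns out to be wrong.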

Slide 7

Slide 7 text

Events ‣ State Changes ‣ Package Updated ‣ Service Stopped ‣ System Events ‣ Syslog Message ‣ SNMP Traps ‣ Application Events ‣ Access ‣ Errors ‣ Traces

Slide 8

Slide 8 text

Uses for Operational Data Deving and Oping Your Data

Slide 9

Slide 9 text

Monitoring and Metrics

Slide 10

Slide 10 text

“Metrics are your culture.” - Nicole Forsgren, "How Metrics Shape Your Culture", Monitorama PDX 2016

Slide 11

Slide 11 text

"How Metrics Shape Your Culture" • You can't improve what you don't measure • Always measure things that matter • Things measured are things managed • Metrics can be gamed • Metrics inform incentives • Not everything that can be counted counts • Hard to measure doesn't mean it isn't worth measuring

Slide 12

Slide 12 text

All of your monitoring is probably wrong... and that's probably O.K.

Slide 13

Slide 13 text

Automation ‣ Can a Machine read my data? ‣ Autoscale ‣ Trend detection ‣ Service Level Roll Ups ‣ Reporting

Slide 14

Slide 14 text

Capacity Planning ‣ Predicting System Stress Levels ‣ Make Intelligent Projections ‣ Test those predictions
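One way to read "make intelligent projections, test those predictions" is to fit a trend to past usage, project when a limit is reached, and then compare the projection with what is actually observed later. The sketch below uses an ordinary least-squares line as a stand-in for whatever model suits the data. The helper names and the disk-usage figures are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def x_at_limit(limit, slope, intercept):
    """Return the x value at which the fitted line reaches `limit`,
    or None if usage is flat or shrinking."""
    if slope <= 0:
        return None
    return (limit - intercept) / slope


# Hypothetical daily disk usage in percent.
days = [0, 1, 2, 3, 4]
used_pct = [50.0, 52.0, 54.0, 56.0, 58.0]

slope, intercept = fit_line(days, used_pct)
full_on_day = x_at_limit(100.0, slope, intercept)
print(f"growth {slope:.1f}%/day, projected full on day {full_on_day:.0f}")

# Testing the prediction: compare the projection for day 7 with a
# (hypothetical) value observed on that day.
predicted_day7 = slope * 7 + intercept
observed_day7 = 65.0
print(f"day 7 prediction error: {observed_day7 - predicted_day7:+.1f} points")
```

A prediction error that keeps growing is a sign that a straight line is the wrong model for this resource.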

Slide 15

Slide 15 text

Alerting ‣ Disrupting People's Lives at 95% Disk Full ‣ Thresholds -> Change Detection ‣ State Changes
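The "state changes" idea can be sketched as turning a stream of repeated check results into events that fire only when the state differs from the previous check. This is one possible reading of the slide, not a description of any particular alerting tool. The function name and the check data are hypothetical.

```python
def state_changes(checks):
    """Yield (timestamp, old_state, new_state) whenever the state
    differs from the one reported by the previous check."""
    previous = None
    for timestamp, state in checks:
        if previous is not None and state != previous:
            yield (timestamp, previous, state)
        previous = state


# Hypothetical check results taken once a minute.
checks = [
    (0, "OK"),
    (60, "OK"),
    (120, "NOT OK"),
    (180, "NOT OK"),
    (240, "OK"),
]

events = list(state_changes(checks))
for timestamp, old, new in events:
    print(f"t={timestamp}: {old} -> {new}")
```

Five check results collapse into two events, so nobody is notified repeatedly for a condition that has not changed.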

Slide 16

Slide 16 text

Anomaly Detection • Anomalies != Alerts • 1 million metrics per minute • 0.3% at a distance of > 3σ • 4.3 Million Anomalies / day
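The arithmetic behind the slide's figures can be checked directly. The inputs are the slide's own numbers; 0.3% is its rounded share of points farther than 3σ from the mean (for a true normal distribution the share is about 0.27%).

```python
metrics_per_minute = 1_000_000
fraction_beyond_3_sigma = 0.003  # the slide's 0.3%
minutes_per_day = 24 * 60

anomalies_per_minute = metrics_per_minute * fraction_beyond_3_sigma
anomalies_per_day = anomalies_per_minute * minutes_per_day

print(f"{anomalies_per_minute:,.0f} anomalies per minute")
print(f"{anomalies_per_day:,.0f} anomalies per day")
```

That works out to roughly 3,000 flagged points a minute and about 4.3 million a day, which is why an anomaly cannot simply be treated as an alert.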

Slide 17

Slide 17 text

Statistics Sidebar • Your data probably isn't normal • You need to use modeling to perform anomaly detection • The residuals should fit a normal distribution • Modeling is only possible if you can explore and interact with the data • There are algorithms and their parameters matter
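The points about modeling and residuals can be illustrated with a small sketch: fit a model, subtract it from the observations, and apply the 3σ rule to the residuals instead of the raw values. A least-squares line stands in for the model here purely as an assumption; real data may need seasonality or another model entirely. The function names, the threshold default, and the data are all hypothetical.

```python
import statistics


def fit_line(values):
    """Least-squares line through (index, value) points; returns (slope, intercept)."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residual_anomalies(values, threshold=3.0):
    """Return the indexes whose residual from the fitted line is more than
    `threshold` standard deviations of the residuals away from zero."""
    slope, intercept = fit_line(values)
    residuals = [y - (slope * x + intercept) for x, y in enumerate(values)]
    sigma = statistics.pstdev(residuals)
    if sigma == 0:
        return []  # the model fits perfectly; nothing stands out
    return [i for i, r in enumerate(residuals) if abs(r) > threshold * sigma]


# Hypothetical metric that grows steadily, with one spike at index 20.
values = [10 + 2 * i for i in range(30)]
values[20] += 50

print(residual_anomalies(values))
```

The threshold is exactly the kind of parameter the last bullet warns about: it decides how many points are flagged, so it is worth exploring against real data before relying on it.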

Slide 18

Slide 18 text

Exploration

Slide 19

Slide 19 text

RStudio: IDE for the open source R programming language

Slide 20

Slide 20 text

R's Tidyverse: Common data cleanup, normalization, manipulation, and display

Slide 21

Slide 21 text

Apache Kafka & Spark: Robust data pipeline with interactivity and deployment capabilities

Slide 22

Slide 22 text

Meaningful Features

Slide 23

Slide 23 text

Open and Extensible ‣ Integrations with other projects ‣ Open API ‣ Good Documentation ‣ Modular / Plugin Structure ‣ Community

Slide 24

Slide 24 text

Reliability vs. Performance

Slide 25

Slide 25 text

Retention ‣ How easy is it to age data off? ‣ What regulations or laws apply to the data? ‣ Expectations from: ‣ Customers ‣ Employees ‣ Managers ‣ Peers ‣ Legal

Slide 26

Slide 26 text

Privacy and Security ‣ What do you keep on your users in your ops data? ‣ Who might come calling for it? ‣ How comfortable are you handing it over to Trump? ‣ Anyone hear about MongoDB? ‣ Can the store provide RBAC?

Slide 27

Slide 27 text

Places People Stick DevOps Datas

Slide 28

Slide 28 text

‣ Large Community ‣ Forks and Oracle ‣ Performance First ‣ SQL Interface ‣ Limited Data Types ‣ Web > BI ‣ Suitable for Administrative Data

Slide 29

Slide 29 text

‣ Large Community ‣ Reliability First ‣ Open and Extensible ‣ PGXN ‣ CitusData ‣ GreenPlum ‣ EnterpriseDB ‣ Native Support for IP Addresses ‣ Extensible Data Types ‣ Suitable for Administrative Data

Slide 30

Slide 30 text

‣ Large Community ‣ Interchangeable Components, à la MicroServices ‣ Simple API ‣ Rampant Open Source Adoption ‣ Scalable ‣ Compatibility ‣ Grafana, Statsd, Riemann, Bosun, Cabot, Seyren ‣ etc., etc. ‣ Suitable for Time Series Data ‣ Smallest Resolution: seconds

Slide 31

Slide 31 text

Metrics: Wildcards
security.logging.indexer.*.total

Slide 32

Slide 32 text

Combining Metrics
sumSeries(security.logging.indexer.*.total)
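To make the wildcard and `sumSeries` behavior concrete, here is a rough Python model of what the expression does: `*` matches within a single dot-separated node, and the selected series are added point by point. This is a simplified illustration, not Graphite's implementation; it ignores missing data points and Graphite's `{a,b}` alternation. The metric names under `indexer` and their values are hypothetical.

```python
from fnmatch import fnmatchcase


def matches(pattern, name):
    """Match a metric name node by node, so '*' never crosses a dot."""
    pattern_nodes = pattern.split(".")
    name_nodes = name.split(".")
    return len(pattern_nodes) == len(name_nodes) and all(
        fnmatchcase(node, pat) for pat, node in zip(pattern_nodes, name_nodes)
    )


def sum_series(pattern, store):
    """Add up, point by point, every series whose name matches the pattern."""
    selected = [points for name, points in store.items() if matches(pattern, name)]
    return [sum(column) for column in zip(*selected)]


# Hypothetical metrics from two indexers.
store = {
    "security.logging.indexer.idx01.total": [10, 12, 9],
    "security.logging.indexer.idx02.total": [7, 8, 11],
    "security.logging.indexer.idx01.errors": [1, 0, 0],
}

total = sum_series("security.logging.indexer.*.total", store)
print(total)
```

The `errors` series is left out because the last node must match `total` exactly, and only the `indexer` position is wildcarded.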

Slide 33

Slide 33 text

Comparing Metrics
alias(sumSeries(security.logging.indexer.*.total),"Today")
alias(timeShift(sumSeries(security.logging.indexer.*.total),"7d"),"Last Week")
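Expressions like these are usually passed as `target` parameters to Graphite's render endpoint. The sketch below only builds such a URL and makes no request. The host name is a placeholder and the 24-hour window is an assumption, while `target`, `from`, and `format` are standard render parameters.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

GRAPHITE_URL = "https://graphite.example.com"  # hypothetical host

series = "sumSeries(security.logging.indexer.*.total)"
targets = [
    f'alias({series},"Today")',
    f'alias(timeShift({series},"7d"),"Last Week")',
]


def render_url(base, targets, since="-24h"):
    """Build a render URL with one 'target' parameter per expression."""
    query = [("target", target) for target in targets]
    query += [("from", since), ("format", "json")]
    return f"{base}/render?{urlencode(query)}"


url = render_url(GRAPHITE_URL, targets)
print(url)

# Decoding the query string returns the original expressions unchanged.
decoded = parse_qs(urlsplit(url).query)
print(decoded["target"])
```

Letting `urlencode` handle the quoting avoids hand-escaping the parentheses, quotes, and wildcard in each expression.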

Slide 34

Slide 34 text

Advanced Tricks
alias(alpha(color(areaBetween(holtWintersConfidenceBands(maxSeries(general.es.*.jvm.mem.heap_used_bytes))),"gray"),0.1),"Holt-Winters Confidence Bands")
color(alias(maxSeries(general.es.*.jvm.mem.heap_used_bytes),"Max Heap Size"),"red")

Slide 35

Slide 35 text

‣ Metrics 2.0 (metrics support tagging) ‣ Hadoop / Hbase backed ‣ SQL-like Language ‣ Zero Data Loss ‣ Compatibility ‣ Carbon, Grafana, Statsd, Riemann, Bosun ‣ Suitable for Time Series Data ‣ Smallest Resolution: milliseconds

Slide 36

Slide 36 text

‣ Metrics 2.0 (metrics support tagging) ‣ SQL-like Language ‣ Zero Data Loss ‣ Compatibility ‣ Carbon, Grafana, Statsd, Riemann, Bosun ‣ Suitable for Time Series Data ‣ Smallest Resolution: nanoseconds

Slide 37

Slide 37 text

‣ Well Documented API ‣ Many Open Source Integrations ‣ Lucene backed text search ‣ Scalable ‣ "Jepsen ElasticSearch" re:CAP

Slide 38

Slide 38 text

Recap

Slide 39

Slide 39 text

Data Types
               Meta-Data  State  Time Series  Events
Graphite       No         No     Yes          Kinda
InfluxDB       Kinda      Yes    Yes          Kinda
OpenTSDB       Kinda      Yes    Yes          Kinda
MySQL          Yes        Yes    No           Kinda
PostgreSQL     Yes        Yes    No           Kinda
ElasticSearch  No         Kinda  Kinda        Yes

Slide 40

Slide 40 text

Features of Your Data
               Interval          Cardinality              Data Type    Aging
Graphite       Fixed, Regular    Low                      Numeric      Roll up
InfluxDB       Fixed (best) Any  High                     Any          Configurable
OpenTSDB       Any               High                     Numeric      n/a
MySQL          Any               Keys: Low, Values: High  Structured*  None
PostgreSQL     Any               Keys: Low, Values: High  Structured*  None
ElasticSearch  Any               Keys: Low, Values: High  Any          None

Slide 41

Slide 41 text

Features of the Store
               Security  Scalability  Performance  Reliability
Graphite       Low       High         High*        Medium
InfluxDB       Low       High         Medium       High
OpenTSDB       Low       High         Low          High
MySQL          Medium    Medium       Medium*      High*
PostgreSQL     High      Medium       Medium*      High
ElasticSearch  Low       High         High         Low

Slide 42

Slide 42 text

Thank you! [email protected] https://twitter.com/reyjrar https://github.com/reyjrar https://speakerdeck.com/reyjrar https://www.craigslist.org/about/craigslist_is_hiring