Berlin 2013 - Session - Theo Schlossnagle

Monitoring... where the hell are we going?
Monday, September 23, 13

The disguises of @postwait

We used to ping things... and email to pager on
failure.

SNMP from switches; we stored it in 5-minute rollups... for
a while.

the pings left ICMP land and we would “ping” services
on other protocols: HTTP(S), via JDBC, NTP, DHCP, ssh, etc.

SNMP off everything still with poor resolution and retention... with
traps for alerting.

glue glue glue write a script -> good, bad, ugly.

glue glue glue write a script -> snmpagent -> cacti

still no arbitrary data just plain crap for brains when
it comes to generic data exposure for metrics.

then the tables turned and we started to push data...
not just traps.

and we all got stupid and decided pulling data wasn’t
worthwhile... or deceived ourselves about its scalability.

push vs. pull is reliably the dumbest conversation I have
with smart people

When is push sensible?
1. events happen often and each is unique
2. events happen infrequently (traps)
3. you want to poll, but you can’t

When is poll sensible?
1. you are measuring a fixed thing
2. you need to control measurement frequency
3. you care about temporally proximal measurements
4. the device can’t push
5. you want to push, but you can’t

push and pull
in the future, products will not care how data arrives... and the future is here...
(many products do this; some via decoupling)
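One way to picture "decoupling" is a store that only ever sees samples, with push and pull as two thin front ends. This is a minimal sketch, not how any particular product works: `MetricStore`, `handle_push`, and `poll_once` are hypothetical names invented for the illustration.

```python
import time
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple

Sample = Tuple[float, float]  # (unix timestamp, value)


class MetricStore:
    """Hypothetical in-memory store; it never learns how a sample arrived."""

    def __init__(self) -> None:
        self.series: Dict[str, List[Sample]] = defaultdict(list)

    def record(self, name: str, value: float, ts: Optional[float] = None) -> None:
        # Use the caller's timestamp if given, otherwise stamp on arrival.
        self.series[name].append((time.time() if ts is None else ts, value))


def handle_push(store: MetricStore, name: str, value: float,
                ts: Optional[float] = None) -> None:
    """Push path: the source decides when to send."""
    store.record(name, value, ts)


def poll_once(store: MetricStore, checks: Dict[str, Callable[[], float]],
              ts: Optional[float] = None) -> None:
    """Pull path: the collector decides when to measure."""
    for name, probe in checks.items():
        store.record(name, probe(), ts)
```

Everything downstream (graphs, alerts, analysis) reads `store.series` and has no reason to care which path filled it.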

we used to measure... network stuff... then server stuff... then
application stuff...

measure ALL THE THINGS
yet people don’t truly understand this...
many don’t have the organizational purview... do you?

Networks

Systems

Applications

Finance

HR

Engineering

so, for polled data... visualization... trending... projections (curve fitting, regressions)...
predictions...
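As an illustration of the "projections" step, here is a minimal sketch that fits a straight line to polled samples by ordinary least squares and projects when the line crosses a limit. The function names and the disk-usage example are assumptions made for the illustration; real data may well need a different curve.

```python
from typing import Optional, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two or more points and equal-length inputs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def time_to_reach(xs: Sequence[float], ys: Sequence[float],
                  limit: float) -> Optional[float]:
    """x value at which the fitted line reaches limit, or None if it never rises."""
    slope, intercept = fit_line(xs, ys)
    if slope <= 0:
        return None
    return (limit - intercept) / slope


# Hypothetical daily disk-usage samples, in percent.
days = [0, 1, 2, 3]
used_pct = [50, 52, 54, 56]
full_on_day = time_to_reach(days, used_pct, 100)
```

With these made-up samples the fitted line gains two points per day, so the projection lands on day 25.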

It’s not too complicated. Everything we’ve done still works.

We all know averages lie. But they are useful anyway.

Some other stats... can indicate how useful they actually are.

Index of dispersion? Can be useful for some data, and
misleading for others
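For reference, the index of dispersion is the variance divided by the mean. A minimal sketch follows; the function name is made up for the illustration, and it uses the population variance.

```python
from statistics import mean, pvariance
from typing import Sequence


def index_of_dispersion(values: Sequence[float]) -> float:
    """Variance-to-mean ratio (population variance over the mean)."""
    m = mean(values)
    if m == 0:
        raise ValueError("undefined when the mean is zero")
    return pvariance(values, m) / m


# Hypothetical per-minute event counts.
counts = [2, 4, 4, 4, 5, 5, 7, 9]
ratio = index_of_dispersion(counts)
```

For Poisson-like counts the ratio sits near 1, so values well above or below that hint at bursty or unusually regular behavior. It is easy to see where it misleads: data with a mean near zero, or data that can go negative, makes the ratio blow up or lose its meaning.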

Ad-hoc fault detection
Too high, too low...
Really too high, really too low...
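The four states above amount to two nested threshold bands. A minimal sketch, with a hypothetical `Band` type and made-up limits:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    """Hypothetical limits; expects crit_low <= warn_low <= warn_high <= crit_high."""
    crit_low: float
    warn_low: float
    warn_high: float
    crit_high: float


def classify(value: float, band: Band) -> str:
    # Check the outer (critical) limits first so they win over the inner ones.
    if value <= band.crit_low:
        return "really too low"
    if value >= band.crit_high:
        return "really too high"
    if value <= band.warn_low:
        return "too low"
    if value >= band.warn_high:
        return "too high"
    return "ok"


# Made-up limits for a percentage-style metric.
disk_band = Band(crit_low=5, warn_low=10, warn_high=80, crit_high=95)
state = classify(85, disk_band)
```

The limits here are static and hand-picked, which is exactly what makes this approach ad hoc.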

Advanced fault detection
This is hard (as in, unsolved)
Step 1: limit your problem space. Don’t try to “detect anomalies”; instead “detect anomalies in disk usage” (or something equally specific).
Step 2: models that apply to little data don’t apply to big data (and vice versa).
Never forget to consider characteristics of data.

Lots of choices
Dynamic time warping
Holt-Winters (and other friends)
k-means, clustering and fitting goodness
Markov models
Bayesian or Maximum Entropy classifiers
Static time shifting
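To make one of these concrete, here is a minimal sketch of Holt's linear (double exponential) smoothing, one of the "friends" of Holt-Winters. It is not Holt-Winters itself because it has no seasonal term. The function names, the smoothing constants, and the fixed tolerance are all assumptions chosen for the illustration.

```python
from typing import Sequence, Tuple


def holt_fit(values: Sequence[float], alpha: float = 0.5,
             beta: float = 0.3) -> Tuple[float, float]:
    """Run Holt's linear smoothing over values; return the final (level, trend)."""
    if len(values) < 2:
        raise ValueError("need at least two points")
    level = float(values[0])
    trend = float(values[1] - values[0])  # simple initial trend estimate
    for x in values[1:]:
        previous_level = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return level, trend


def holt_forecast(values: Sequence[float], steps_ahead: int = 1,
                  alpha: float = 0.5, beta: float = 0.3) -> float:
    """Extrapolate the smoothed level and trend steps_ahead points forward."""
    level, trend = holt_fit(values, alpha, beta)
    return level + steps_ahead * trend


def is_surprising(history: Sequence[float], observed: float, tolerance: float,
                  alpha: float = 0.5, beta: float = 0.3) -> bool:
    """Flag an observation that is far from the one-step-ahead forecast."""
    return abs(observed - holt_forecast(history, 1, alpha, beta)) > tolerance
```

Comparing each new point with the forecast is a small-data technique, and the tolerance still has to be chosen per metric, which fits the advice to limit the problem space.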

so, for pushed data... and by that I mean high
volume.

I have good news... and bad news.

More data = better decisions
but only if you understand control systems, engineering measurements, and a whole lot of math.

The problem we face today is too few engineers modeling
our systems

What’s this? hint: it’s a distribution

What’s this? hint: it’s another distribution

we know what they are, but do we know when
to apply each?

Histograms or at least understanding population distributions
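A minimal sketch of why the distribution matters: two made-up latency populations share the same average but look nothing alike once bucketed. The `histogram` helper and the sample values are hypothetical.

```python
from collections import Counter
from statistics import mean
from typing import Dict, Sequence


def histogram(values: Sequence[float], bucket_width: float) -> Dict[float, int]:
    """Count values per fixed-width bucket, keyed by each bucket's lower bound."""
    if bucket_width <= 0:
        raise ValueError("bucket_width must be positive")
    counts = Counter(int(v // bucket_width) * bucket_width for v in values)
    return dict(sorted(counts.items()))


# Made-up request latencies in milliseconds.
steady = [100] * 10          # every request takes 100 ms
bimodal = [10] * 9 + [910]   # nine fast requests and one very slow one

same_average = mean(steady) == mean(bimodal)
steady_hist = histogram(steady, 50)
bimodal_hist = histogram(bimodal, 50)
```

Both populations average 100 ms, yet the second histogram has two separate buckets far apart, which is the information an average throws away.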

Synthetic measurements
Show rates of things, but not the things
themselves.

Instrumented measurements
Collect all the data, leaving us to struggle
making sense of it all.

fault detection
we are still applying little-data techniques to our
big data fault detection problems

future fault detection
will analyze and classify population data to
understand systems better... classify their behavior... detect changes in inertia.

The Challenge is putting all this magic into one system

system (n.) a set of connected things or parts forming
a complex whole

my impressions
alerting isn’t an issue... here’s why...

my impressions
I think some of the lines we’ve drawn
between components might not be in the right places

my impressions
I think that collecting little-data and big-data “differently”
may prevent some useful strategies.
We use the Reconnoiter product to solve this. We’re happy.

my impressions
I think that storing different types of data
separately will prevent (or severely hinder) scientific exploration of that data.

my impressions
I think that online decisions on data and
offline decisions on data are less than the sum of their parts.

my impressions
I think there is room for some wicked
disruption.

my impressions
I think the true innovations will not happen
by us.

Thanks! Circonus is hiring exceptional engineers, mathematicians, quants, software engineers
and sales people.
