Berlin 2013 - Session - Theo Schlossnagle

Monitorama
September 19, 2013

Transcript

  1. We used to ping things... and email to pager on failure. Monday, September 23, 13
  2. the pings left ICMP land and we would “ping” services on other protocols: HTTP(S), via JDBC, NTP, DHCP, ssh, etc.
  3. SNMP off everything: still with poor resolution and retention... with traps for alerting.
  4. still no arbitrary data: just plain crap for brains when it comes to generic data exposure for metrics.
  5. then the tables turned and we started to push data... not just traps.
  6. and we all got stupid and decided pulling data wasn’t worthwhile... or self-deceived about its scalability.
  7. push vs. pull is reliably the dumbest conversation I have with smart people.
  8. When is push sensible?
     1. events happen often and each is unique
     2. events happen infrequently (traps)
     3. you want to poll, but you can’t
  9. When is poll sensible?
     1. you are measuring a fixed thing
     2. you need to control measurement frequency
     3. you care about temporally proximal measurements
     4. the device can’t push
     5. you want to push, but you can’t
  10. push and pull: in the future products will not care how data arrives... and the future is here... (many products do this; some via decoupling)
  11. we used to measure... network stuff... then server stuff... then application stuff...
  12. measure ALL THE THINGS: yet people don’t truly understand this... many don’t have the organizational purview... do you?
  13. Index of dispersion (D = σ²/μ)? Can be useful for some data, and misleading for others.
  14. Ad-hoc fault detection: too high, too low... really too high, really too low...
  15. Advanced fault detection: this is hard (as in, unsolved). Step 1: limit your problem space; don’t try to “detect anomalies”, instead “detect anomalies in disk usage” (or something equally specific). Step 2: models that apply to little data don’t apply to big data (and vice versa). Never forget to consider characteristics of data.
  16. Lots of choices: dynamic time warping; Holt-Winters (and other friends); k-means, clustering and fitting goodness; Markov models; Bayesian or maximum entropy classifiers; static time shifting.
  17. so, for pushed data... and by that I mean high volume.
  18. More data = better decisions, but only if you understand control systems, engineering measurements, and a whole lot of math.
  19. The problem we face today is too few engineers modeling our systems.
  20. we know what they are, but do we know when to apply each?
  21. Instrumented measurements: collect all the data, leaving us to struggle making sense of it all.
  22. fault detection: we are still applying little-data techniques to our big-data fault detection problems.
  23. future fault detection will analyze and classify population data to understand systems better... classify their behavior... detect changes in inertia.
  24. system (n.): a set of connected things or parts forming a complex whole.
  25. my impressions: I think some of the lines we’ve drawn between components might not be in the right places.
  26. my impressions: I think that collecting little-data and big-data “differently” may prevent some useful strategies. We use the Reconnoiter product to solve this. We’re happy.
  27. my impressions: I think that storing different types of data separately will prevent (or severely hinder) scientific exploration of that data.
  28. my impressions: I think that online decisions on data and offline decisions of data are less than the sum of their parts.
  29. my impressions: I think there is room for some wicked disruption.
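
Slide 10 says products need not care how data arrives, sometimes "via decoupling". One way to read that, as a minimal Python sketch and not anything shown in the talk: both the pull path and the push path end in the same ingest call. `MetricStore`, `poll_once`, `on_push`, and the metric names are all hypothetical.

```python
import time
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple

Sample = Tuple[float, float]  # (unix timestamp, value)


class MetricStore:
    """Hypothetical store: records samples without knowing how they arrived."""

    def __init__(self) -> None:
        self.series: Dict[str, List[Sample]] = defaultdict(list)

    def ingest(self, name: str, value: float, ts: Optional[float] = None) -> None:
        # Stamp the sample on arrival unless the sender supplied a timestamp.
        self.series[name].append((time.time() if ts is None else ts, value))


def poll_once(store: MetricStore, checks: Dict[str, Callable[[], float]]) -> None:
    """Pull path: the collector decides when each fixed thing is measured."""
    for name, check in checks.items():
        store.ingest(name, check())


def on_push(store: MetricStore, name: str, value: float,
            ts: Optional[float] = None) -> None:
    """Push path: the source decides when to send (events, traps)."""
    store.ingest(name, value, ts)


store = MetricStore()
poll_once(store, {"disk.used_pct": lambda: 42.0})   # polled gauge
on_push(store, "http.request_ms", 12.5, ts=1.0)     # pushed event
```

Everything downstream of `ingest` sees one kind of sample, so choosing push or pull becomes a per-metric collection detail.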
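
For the index of dispersion on slide 13, the standard definition is variance divided by mean. The sketch below assumes the population variance and non-negative, count-like data; the function name and the sample values are illustrative only.

```python
from statistics import mean, pvariance


def index_of_dispersion(samples):
    """Variance-to-mean ratio D = sigma^2 / mu of the samples.

    Assumes non-negative data such as counts; the ratio is undefined
    when the mean is zero.
    """
    mu = mean(samples)
    if mu == 0:
        raise ValueError("index of dispersion is undefined for a zero mean")
    return pvariance(samples, mu) / mu


# Hypothetical per-minute error counts: mean 5, population variance 4.
d = index_of_dispersion([2, 4, 4, 4, 5, 5, 7, 9])  # 0.8
```

A value near 1 is what a Poisson process would produce, values well above 1 suggest bursty data, and values below 1 suggest unusually regular data. For metrics that can be negative or that hover near zero, the ratio swings wildly, which is one way it can mislead.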
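
The ad-hoc fault detection on slide 14 amounts to four static thresholds. A minimal sketch, assuming inclusive bounds and made-up limits for a disk-usage percentage (`Thresholds` and `classify` are hypothetical names):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    really_low: float
    low: float
    high: float
    really_high: float


def classify(value: float, t: Thresholds) -> str:
    """Map a measurement onto the four ad-hoc alert levels, or "ok"."""
    # Check the more severe bound first on each side.
    if value <= t.really_low:
        return "really too low"
    if value <= t.low:
        return "too low"
    if value >= t.really_high:
        return "really too high"
    if value >= t.high:
        return "too high"
    return "ok"


# Assumed limits for a disk-usage percentage.
disk = Thresholds(really_low=5, low=10, high=85, really_high=95)
level = classify(90, disk)  # "too high"
```

Static limits like these know nothing about trend or seasonality, which is the gap the more advanced techniques on slides 15 and 16 try to close.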