Berlin 2013 - Session - Theo Schlossnagle

Monitorama
September 19, 2013

Transcript

  1. Monitoring... where the hell are we going? Monday, September 23, 13
  2. The disguises of @postwait
  3. The disguises of @postwait
  4. We used to ping things... and email to pager on failure.
  5. SNMP from switches: stored it in 5-minute rollups... for a while.
  6. the pings left ICMP land and we would “ping” services on other protocols: HTTP(S), via JDBC, NTP, DHCP, ssh, etc.
  7. SNMP off everything: still with poor resolution and retention... with traps for alerting.
  8. glue glue glue: write a script -> good, bad, ugly.
  9. glue glue glue: write a script -> snmpagent -> cacti
  10. still no arbitrary data: just plain crap for brains when it comes to generic data exposure for metrics.
  11. then the tables turned and we started to push data... not just traps.
  12. and we all got stupid and decided pulling data wasn’t worthwhile... or self-deceived about its scalability.
  13. push vs. pull is reliably the dumbest conversation I have with smart people
  14. When is push sensible? 1. events happen often and each is unique. 2. events happen infrequently (traps). 3. you want to poll, but you can’t.
  15. When is poll sensible? 1. you are measuring a fixed thing. 2. you need to control measurement frequency. 3. you care about temporally proximal measurements. 4. the device can’t push. 5. you want to push, but you can’t.
  16. push and pull in the future: products will not care how data arrives... and the future is here... (many products do this; some via decoupling)
  17. we used to measure... network stuff... then server stuff... then application stuff...
  18. measure ALL THE THINGS: yet people don’t truly understand this... many don’t have the organizational purview... do you?
  19. Networks
  20. Systems
  21. Applications
  22. Finance
  23. Finance
  24. Finance
  25. HR
  26. Engineering
  27. so, for polled data... visualization... trending... projections (curve fitting, regressions)... predictions...
  28. It’s not too complicated. Everything we’ve done, still works.
  29. It’s not too complicated. Everything we’ve done, still works.
  30. We all know averages lie. But they are useful anyway.
  31. Some other stats... can indicate how useful they actually are.
  32. Index of dispersion (σ²/μ)? Can be useful for some data, and misleading for others.
  33. Ad-hoc fault detection: Too high, too low... Really too high, really too low...
  34. Advanced fault detection: This is hard (as in, unsolved). Step 1: limit your problem space; don’t try to “detect anomalies”, instead “detect anomalies in disk usage” (or something equally specific). Step 2: models that apply to little data don’t apply to big data (and vice versa). Never forget to consider characteristics of data.
  35. Lots of choices: Dynamic time warping; Holt-Winters (and other friends); k-means, clustering and fitting goodness; Markov models; Bayesian or Maximum Entropy classifiers; Static time shifting.
  36. so, for pushed data... and by that I mean high volume.
  37. I have good news... and bad news.
  38. More data = better decisions, but only if you understand control systems, engineering measurements, and a whole lot of math.
  39. The problem we face today is too few engineers modeling out systems
  40. What’s this? hint: it’s a distribution
  41. What’s this? hint: it’s another distribution
  42. What’s this? hint: it’s another distribution
  43. What’s this? hint: it’s another distribution
  44. we know what they are, but do we know when to apply each?
  45. Histograms: or at least understanding population distributions
  46. Synthetic measurements: Show rates of things, but not the things themselves.
  47. Instrumented measurements: Collect all the data, leaving us to struggle making sense of it all.
  48. fault detection: we are still applying little-data techniques to our big data fault detection problems
  49. future fault detection: will analyze and classify population data to understand systems better... classify their behavior... detect changes in inertia.
  50. The Challenge is putting all this magic into one system
  51. system (n.) a set of connected things or parts forming a complex whole
  52. my impressions: alerting isn’t an issue... here’s why...
  53. my impressions: I think some of the lines we’ve drawn between components might not be in the right places
  54. my impressions: I think that collecting little-data and big-data “differently” may prevent some useful strategies. We use the reconnoiter product to solve this. We’re happy.
  55. my impressions: I think that storing different types of data separately will prevent (or severely hinder) scientific exploration of that data.
  56. my impressions: I think that online decisions on data and offline decisions of data are less than the sum of their parts.
  57. my impressions: I think there is room for some wicked disruption.
  58. my impressions: I think the true innovations will not happen by us.
  59. Thanks! Circonus is hiring exceptional engineers, mathematicians, quants, software engineers and sales people.
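Slide 27 lists trending and projections (curve fitting, regressions) as things that work well for polled data. A minimal sketch of that idea, assuming an ordinary least-squares line fit over evenly polled disk-usage samples; the function names, the sample values, and the 100% limit are illustrative and not from the talk.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def time_until(limit, xs, ys):
    """Time after the last sample until the fitted line reaches `limit`.

    Returns None when the trend is flat or falling, since the line
    then never reaches the limit.
    """
    slope, intercept = fit_line(xs, ys)
    if slope <= 0:
        return None
    return (limit - intercept) / slope - xs[-1]


# Hypothetical data: one poll per day, disk usage in percent.
days = [0, 1, 2, 3, 4]
used_pct = [50.0, 52.0, 54.0, 56.0, 58.0]
days_left = time_until(100.0, days, used_pct)
print(f"projected days until full: {days_left}")
```

A straight line is only the simplest of the curve fits the slide mentions; data with seasonality or saturation would need a different model.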
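Slides 30 to 32 say that averages lie, and that other statistics such as the index of dispersion can indicate how much to trust them. A small sketch, assuming the standard definition (population variance divided by mean); the two sample sets are made up so that they share the same average.

```python
from statistics import mean, pvariance


def index_of_dispersion(samples):
    """Variance-to-mean ratio; only defined here for a positive mean."""
    mu = mean(samples)
    if mu <= 0:
        raise ValueError("index of dispersion needs a positive mean")
    return pvariance(samples, mu) / mu


# Hypothetical samples with identical averages but very different shapes.
steady = [10, 10, 10, 10, 10, 10, 10, 10]
spiky = [1, 1, 1, 1, 1, 1, 1, 73]

for name, data in (("steady", steady), ("spiky", spiky)):
    print(name, "mean:", mean(data), "dispersion:", index_of_dispersion(data))
```

Both sets average 10, but only the dispersion figure shows that one of them is dominated by a single outlier. As slide 32 warns, the ratio can mislead on other kinds of data, for example values whose mean sits near zero.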
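Slide 45 argues for histograms, or at least for understanding population distributions, when dealing with high-volume data. A rough sketch of one way to summarize a population, assuming power-of-two bucket boundaries; that bucketing scheme, the function names, and the latency values are illustrative choices, not something the talk specifies.

```python
import math
from collections import Counter


def bucket(value):
    """Lower bound of the power-of-two bucket that holds a positive value."""
    if value <= 0:
        raise ValueError("this sketch only handles positive values")
    return 2 ** math.floor(math.log2(value))


def histogram(samples):
    """Count samples per bucket, keyed by each bucket's lower bound."""
    return Counter(bucket(v) for v in samples)


# Hypothetical request latencies in milliseconds: mostly fast, two slow.
latencies_ms = [3, 4, 5, 6, 7, 9, 120, 130, 4, 5]
hist = histogram(latencies_ms)
for low in sorted(hist):
    print(f"[{low}, {low * 2}): {hist[low]}")
```

The bucket counts keep the two slow requests visible as their own population, which a single average over the same samples would hide.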