
Monitorama 2013


Slides from my session at Monitorama 2013

John Vincent

March 28, 2013


Transcript

  1. So.... Monitorama What's THAT all about?

  2. “We'll be hearing talks from leading open source developers and web operations luminaries, and then taking what we've learned to apply it towards advancing the state of open source monitoring and trending software.”
  3. None
  4. ALLSPAW AUTHOR AUTHOR LOGSTASH CEPMON An actual fucking scientist RIEMANN GRAPHITE RAILSMACHINE GITHUB SENSU 37signals Heroku Github BOUNDARY PAPERLESS POST ETSY LIBRATO AUTHOR Wrote all the software in the world
  5. • Complains a lot on the internet
     • Hates Maven
     • Won't stop complaining on the internet
     • Hates MongoDB
     • Does this guy ever stop complaining?
  6. None
  7. #monitoringsucks

  8. #monitoringlove

  9. “We'll be hearing talks from leading open source developers and web operations luminaries, and then taking what we've learned to apply it towards advancing the state of open source monitoring and trending software.”
  10. Security

  11. Automation
      • Automation is cool and awesome and necessary
      • Automate everything!
      • Automate our scaling based on our metrics!
  12. None
  13. Oh hey! A random UDP packet with metric data. Lemme just automatically launch a few new EC2 instances.....
  14. We can no longer ignore the lack of security in our data collection systems. Seriously.
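One way to read the concern on slides 13 and 14 is that a collector should not act on a metric packet it cannot authenticate. The following is a minimal sketch of that idea, not something taken from the talk: it assumes a statsd-style payload, a hypothetical shared secret, and hypothetical helper names `sign` and `verify`. It does not address replayed packets or confidentiality.

```python
import hashlib
import hmac

# Hypothetical shared secret; in practice it would come from a secrets store.
SECRET = b"example-shared-secret"


def sign(payload: bytes, key: bytes = SECRET) -> bytes:
    """Append a hex HMAC-SHA256 tag to a metric payload."""
    tag = hmac.new(key, payload, hashlib.sha256).hexdigest().encode()
    return payload + b"|" + tag


def verify(packet: bytes, key: bytes = SECRET):
    """Return the payload if the trailing tag matches, otherwise None."""
    payload, sep, tag = packet.rpartition(b"|")
    if not sep:
        return None  # no tag present at all
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest().encode()
    # compare_digest avoids leaking how many leading characters matched
    return payload if hmac.compare_digest(tag, expected) else None


# A sender signs the metric before putting it on the wire.
packet = sign(b"web.requests:42|c")
```

A receiver would call `verify` on each datagram and drop anything that returns `None` before the value reaches any autoscaling logic.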
  15. Retention

  16. Telemetry Data
      • Collect ALL the metrics.
      • Even the ones we don't understand
      • Stuff said on Twitter and Facebook
      • Nordic Arachnid Flatulence.
  17. None
  18. Oh hey. How's that network saturation looking?

  19. Real talk
      • Realize you can't store every bit of data you collect forever
      • You would likely need more hardware to retain the data about your infrastructure than the infrastructure itself
      • Logs are likely still your richest source of data
      • Don't collect something just because it's tradition
      • We need science not folklore
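A common way to live with the first point on slide 19 is to keep raw points only briefly and retain coarser rollups for longer. The sketch below is an illustration under assumptions, not the speaker's method: the function name `rollup`, the integer-second timestamps, and the choice of mean and max as the retained aggregates are all hypothetical.

```python
from collections import defaultdict


def rollup(points, bucket_seconds):
    """Downsample (timestamp, value) pairs into fixed-width buckets.

    Returns a sorted list of (bucket_start, mean, max) tuples.
    Timestamps are assumed to be integer seconds.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each point to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return [
        (start, sum(vals) / len(vals), max(vals))
        for start, vals in sorted(buckets.items())
    ]


# Five raw samples collapse into two one-minute rollups.
raw = [(0, 1.0), (10, 3.0), (59, 2.0), (60, 10.0), (119, 20.0)]
minute = rollup(raw, 60)
```

Keeping the max alongside the mean is one design choice among several; it preserves spikes that an average alone would smooth away.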
  20. What you need to know
      • Is the system doing what it's supposed to?
      • Is the business able to do what it's supposed to?
  21. Interpretation

  22. None
  23. None
  24. None
  25. None
  26. • I shouldn't need to go to a Tufte seminar to understand why shit is broken.
      • We need a GoF for common visualizations of common data points
      • Watch out for mud radios
  27. Alerting

  28. Pager Duty Alert. You have one triggered incident on US PROD N-A-G-I-O-S. The failure is …..
  29. The failure is you won't leave me the hell alone

  30. Alert fatigue is the single biggest problem we have right now.
  31. The Big Mistake: One Event – One Alert

  32. Things I don't alert on
      • Memory usage
      • CPU Usage
      • Load Average
  33. Things I do alert on
      • JVM OOMs
      • Latency
      • Connection pool failures
      • Is shit working?
  34. Real Talk Part 2
      • Alert on actionable things
      • Thresholds are ever-evolving
  35. We need to be more intelligent about our alerts or we'll all go insane.
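The "one event, one alert" mistake on slide 31 can be illustrated by grouping related events before paging anyone. This is a rough sketch under assumptions rather than a description of any tool named in the talk: the event fields (`ts`, `service`, `check`, `host`), the five-minute window, and the function name `group_alerts` are all hypothetical.

```python
from collections import defaultdict


def group_alerts(events, window_seconds=300):
    """Collapse events sharing (service, check) into one alert per time window.

    Each event is a dict with hypothetical keys: ts, service, check, host.
    """
    groups = defaultdict(list)
    for event in events:
        window_start = event["ts"] - event["ts"] % window_seconds
        groups[(event["service"], event["check"], window_start)].append(event)

    alerts = []
    for (service, check, window_start), members in sorted(groups.items()):
        alerts.append({
            "service": service,
            "check": check,
            "window_start": window_start,
            "hosts": sorted({e["host"] for e in members}),
            "count": len(members),
        })
    return alerts


# Four raw events become two notifications.
events = [
    {"ts": 10, "service": "api", "check": "latency", "host": "web1"},
    {"ts": 20, "service": "api", "check": "latency", "host": "web2"},
    {"ts": 30, "service": "api", "check": "latency", "host": "web1"},
    {"ts": 40, "service": "db", "check": "pool", "host": "db1"},
]
alerts = group_alerts(events)
```

Fixed windows are the simplest grouping rule; events that straddle a window boundary still produce two alerts, which a sliding window or explicit incident state would avoid.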
  36. None
  37. A few final thoughts

  38. Brain dump
      • Rollups
      • Event Correlation
      • Riemann
      • Storm + Esper
      • ElasticSearch
  39. Things we need to accept
      • JSON costs you precision
      • The JVM is not the end of the world
      • Nagios isn't going anywhere
      • Each additional component creates management overhead
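The slide does not say which kind of precision loss it means. One common case, assumed here, is a large integer counter being read by a consumer that parses every JSON number as an IEEE-754 double. The sketch simulates such a consumer with the `parse_int` hook of Python's `json.loads`; the counter value and field name are made up for illustration.

```python
import json

# 2**53 + 1 is the first positive integer a double cannot represent exactly.
counter = 2**53 + 1

encoded = json.dumps({"bytes": counter})

# Python's own parser keeps integers exact.
exact = json.loads(encoded)["bytes"]

# Simulate a consumer that turns every JSON number into a double.
as_double = json.loads(encoded, parse_int=float)["bytes"]

# The double-based reading silently differs from the value that was sent.
lost = counter - int(as_double)
```

Sending such counters as strings, or keeping them below 2**53, are two ways producers commonly sidestep the problem.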
  40. Questions?

  41. Thanks! • Twitter - @lusis • Github – lusis