Boston 2013 - Session - John E. Vincent

Monitorama
March 28, 2013

Transcript

  1. “We'll be hearing talks from leading open source developers and web operations luminaries, and then taking what we've learned to apply it towards advancing the state of open source monitoring and trending software.”
  2. ALLSPAW AUTHOR AUTHOR LOGSTASH CEPMON An actual fucking scientist RIEMANN GRAPHITE RAILSMACHINE GITHUB SENSU 37signals Heroku Github BOUNDARY PAPERLESS POST ETSY LIBRATO AUTHOR Wrote all the software in the world
  3. • Complains a lot on the internet • Hates Maven • Won't stop complaining on the internet • Hates MongoDB • Does this guy ever stop complaining?
  4. “We'll be hearing talks from leading open source developers and web operations luminaries, and then taking what we've learned to apply it towards advancing the state of open source monitoring and trending software.”
  5. Automation • Automation is cool and awesome and necessary • Automate everything! • Automate our scaling based on our metrics!
  6. Oh hey! A random UDP packet with metric data. Lemme just automatically launch a few new EC2 instances.....
  7. We can no longer ignore the lack of security in our data collection systems. Seriously.
  8. Telemetry Data • Collect ALL the metrics. • Even the ones we don't understand • Stuff said on Twitter and Facebook • Nordic Arachnid Flatulence.
  9. Real talk • Realize you can't store every bit of data you collect forever • You would likely need more hardware to retain the data about your infrastructure than the infrastructure itself • Logs are likely still your richest source of data • Don't collect something just because it's tradition • We need science not folklore
  10. What you need to know • Is the system doing what it's supposed to? • Is the business able to do what it's supposed to?
  11. • I shouldn't need to go to a Tufte seminar to understand why shit is broken. • We need a GoF for common visualizations of common data points • Watch out for mud radios
  12. Pager Duty Alert. You have one triggered incident on US PROD N-A-G-I-O-S. The failure is …..
  13. Things I do alert on • JVM OOMs • Latency • Connection pool failures • Is shit working?
  14. Things we need to accept • JSON costs you precision • The JVM is not the end of the world • Nagios isn't going anywhere • Each additional component creates management overhead
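
Slides 6 and 7 point at a concrete weakness: a collector that accepts any UDP datagram will act on forged metrics. As a minimal sketch of one possible mitigation (not something the talk prescribes), a sender could append an HMAC-SHA256 tag to each StatsD-style line, and the collector could drop anything that fails verification. The shared secret, the `|`-separated tag layout, and the function names `sign_metric` and `verify_metric` are all assumptions made for illustration.

```python
import hashlib
import hmac
from typing import Optional

# Hypothetical shared secret; a real deployment would load it from a secrets store.
SECRET = b"example-shared-secret"


def sign_metric(payload: bytes, secret: bytes = SECRET) -> bytes:
    """Return the payload with a hex HMAC-SHA256 tag appended after a '|'."""
    tag = hmac.new(secret, payload, hashlib.sha256).hexdigest().encode("ascii")
    return payload + b"|" + tag


def verify_metric(datagram: bytes, secret: bytes = SECRET) -> Optional[bytes]:
    """Return the original payload if the tag is valid, otherwise None."""
    # The tag is whatever follows the last '|' in the datagram.
    payload, sep, tag = datagram.rpartition(b"|")
    if not sep:
        return None  # no tag at all: treat as untrusted
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest().encode("ascii")
    # compare_digest avoids leaking how many characters matched through timing.
    return payload if hmac.compare_digest(tag, expected) else None


signed = sign_metric(b"web.requests:1|c")
trusted = verify_metric(signed)                    # the original payload
forged = verify_metric(b"web.requests:9999|c")     # None: no valid tag
```

A tag like this only authenticates the sender. It does not stop an attacker from replaying a captured datagram, which would need a timestamp or nonce inside the signed payload.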
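
One plausible reading of "JSON costs you precision" on slide 14 is that many JSON consumers parse every number as an IEEE-754 double, which cannot represent integers above 2^53 exactly. The sketch below simulates such a consumer in Python by passing `parse_int=float` to `json.loads`; Python's own parser would otherwise keep integers exact. The helper names `lossy_roundtrip` and `exact_roundtrip` and the `"value"` key are hypothetical.

```python
import json


def lossy_roundtrip(value: int) -> int:
    """Encode an integer as a JSON number and decode it the way a
    double-only consumer would (simulated with parse_int=float)."""
    text = json.dumps({"value": value})
    parsed = json.loads(text, parse_int=float)
    return int(parsed["value"])


def exact_roundtrip(value: int) -> int:
    """Encode the integer as a JSON string so no consumer rounds it."""
    text = json.dumps({"value": str(value)})
    return int(json.loads(text)["value"])


counter = 2**53 + 1  # e.g. a large byte counter or a nanosecond timestamp

rounded = lossy_roundtrip(counter)    # comes back as 2**53, off by one
preserved = exact_roundtrip(counter)  # comes back unchanged
```

Sending large counters as strings is one common workaround; it trades the precision problem for extra parsing on the consumer side.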