Monitorama 2013

Slides from my session at Monitorama 2013

John Vincent

March 28, 2013

Transcript

  1. So.... Monitorama
    What's THAT all about?

  2. “We'll be hearing talks from leading open source
    developers and web operations luminaries, and then
    taking what we've learned to apply it towards advancing
    the state of open source monitoring and trending
    software.”

  4. ALLSPAW
    AUTHOR
    AUTHOR
    LOGSTASH
    CEPMON
    An actual fucking scientist
    RIEMANN
    GRAPHITE
    RAILSMACHINE
    GITHUB
    SENSU
    37signals
    Heroku
    Github
    BOUNDARY
    PAPERLESS POST
    ETSY
    LIBRATO
    AUTHOR
    Wrote all the software in the world

  5. Complains a lot on the internet

    Hates Maven

    Won't stop complaining on the internet

    Hates MongoDB

    Does this guy ever stop complaining?

  7. #monitoringsucks

  8. #monitoringlove

  9. “We'll be hearing talks from leading open source developers
    and web operations luminaries, and then taking what we've
    learned to apply it towards advancing the state of open source
    monitoring and trending software.”

  10. Security

  11. Automation

    Automation is cool and awesome and necessary

    Automate everything!

    Automate our scaling based on our metrics!

  13. Oh hey! A random UDP packet with metric data.
    Lemme just automatically launch a few new EC2 instances.....

  14. We can no longer ignore the lack of security in our data
    collection systems. Seriously.
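
One way to stop trusting "a random UDP packet" is to authenticate each metric datagram before anything acts on it. The sketch below is only an illustration, not something from the talk: it assumes a shared secret (`SECRET_KEY`, hypothetical) and a made-up wire format of `payload|hex-hmac`. It does not address replayed packets or encryption.

```python
import hashlib
import hmac

# Hypothetical shared secret; in practice it would come from a secrets store.
SECRET_KEY = b"example-shared-secret"


def sign_metric(payload: bytes, key: bytes = SECRET_KEY) -> bytes:
    """Append a hex HMAC-SHA256 tag to a metric payload."""
    tag = hmac.new(key, payload, hashlib.sha256).hexdigest().encode("ascii")
    return payload + b"|" + tag


def verify_metric(datagram: bytes, key: bytes = SECRET_KEY):
    """Return the payload if its tag is valid, otherwise None."""
    # The tag is everything after the LAST "|", so payloads may contain "|".
    payload, sep, tag = datagram.rpartition(b"|")
    if not sep:
        return None
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest().encode("ascii")
    # compare_digest avoids leaking how many leading characters matched.
    return payload if hmac.compare_digest(tag, expected) else None
```

A receiver would call `verify_metric` on every datagram and drop anything that returns `None` before the value reaches a scaling decision.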

  15. Retention

  16. Telemetry Data

    Collect ALL the metrics.

    Even the ones we don't understand

    Stuff said on Twitter and Facebook

    Nordic Arachnid Flatulence.

  18. Oh hey. How's that network saturation looking?

  19. Real talk

    Realize you can't store every bit of data you collect
    forever

    You would likely need more hardware to retain the data
    about your infrastructure than the infrastructure itself

    Logs are likely still your richest source of data

    Don't collect something just because it's tradition

    We need science not folklore
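
If you can't keep every raw point forever, a common compromise is to roll older data up into coarser buckets. The following is a minimal sketch of that idea, assuming points are `(timestamp_seconds, value)` pairs; the function name and the choice of averaging are illustrative, and averaging alone will hide spikes.

```python
from collections import defaultdict


def rollup(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        # Align each point to the start of the bucket it falls in.
        bucket_start = timestamp - (timestamp % bucket_seconds)
        buckets[bucket_start].append(value)
    return [
        (start, sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]
```

For example, four ten-second samples rolled up with `bucket_seconds=60` collapse to at most one point per minute, which is the trade of resolution for retention the slide is pointing at.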

  20. What you need to know

    Is the system doing what it's supposed to?

    Is the business able to do what it's supposed to?

  21. Interpretation

  26. I shouldn't need to go to a Tufte seminar to understand
    why shit is broken.

    We need a GoF for common visualizations of common
    data points

    Watch out for mud radios

  27. Alerting

  28. Pager Duty Alert. You have one triggered incident on US
    PROD N-A-G-I-O-S. The failure is …..

  29. The failure is you won't leave me the hell alone

  30. Alert fatigue is the single biggest problem we have right
    now.

  31. The Big Mistake
    One Event – One Alert
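
One possible way out of "one event, one alert" is to notify once per problem and stay quiet while it keeps firing. This is a rough sketch under that assumption; the class name, the per-key quiet window, and passing `now` in explicitly are all illustrative choices, not anything prescribed in the talk.

```python
class AlertSuppressor:
    """Allow at most one notification per key within a quiet window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._last_sent = {}  # key -> time of the last notification sent

    def should_notify(self, key, now):
        """Return True only if no alert for this key went out recently."""
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window_seconds:
            return False  # Still inside the quiet window: swallow the event.
        self._last_sent[key] = now
        return True
```

With a 300 second window, a connection pool that fails a hundred times in a minute pages once rather than a hundred times, while an unrelated key still gets through.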

  32. Things I don't alert on

    Memory usage

    CPU usage

    Load Average

  33. Things I do alert on

    JVM OOMs

    Latency

    Connection pool failures

    Is shit working?

  34. Real Talk Part 2

    Alert on actionable things

    Thresholds are ever-evolving

  35. We need to be more intelligent about our alerts or we'll all go
    insane.

  37. A few final thoughts

  38. Brain dump

    Rollups

    Event Correlation

    Riemann

    Storm + Esper

    ElasticSearch

  39. Things we need to accept

    JSON costs you precision

    The JVM is not the end of the world

    Nagios isn't going anywhere

    Each additional component creates management
    overhead
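
The JSON precision point is easy to reproduce. The JSON grammar itself does not cap number size, but many consumers store every number as an IEEE 754 double, which cannot represent every integer above 2**53. The sketch below uses Python's `json` module with `parse_int=float` to imitate such a consumer; the `value` field name is just an example.

```python
import json

# 9007199254740993: think of a large counter or a nanosecond timestamp.
big = 2**53 + 1
encoded = json.dumps({"value": big})

# Python's json module keeps integers exact by default.
exact = json.loads(encoded)["value"]

# parse_int=float imitates a consumer that stores every number as a double.
as_double = json.loads(encoded, parse_int=float)["value"]

print(exact == big)           # the exact integer survives the round trip
print(int(as_double) == big)  # the double-based reading does not match
```

Sending such values as strings, or keeping them below 2**53, are two common workarounds when the consumer is outside your control.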

  40. Questions?

  41. Thanks!

    Twitter - @lusis

    GitHub - lusis
