What's THAT all about?
“We'll be hearing talks from leading open source
developers and web operations luminaries, and then
taking what we've learned to apply it towards advancing
the state of open source monitoring and trending software.”
Complains a lot on the internet
Won't stop complaining on the internet
Does this guy ever stop complaining?
“We'll be hearing talks from leading open source developers
and web operations luminaries, and then taking what we've
learned to apply it towards advancing the state of open source
monitoring and trending software.”
Automation is cool and awesome and necessary
Automate our scaling based on our metrics!
Oh hey! A random UDP packet with metric data.
Lemme just automatically launch a few new EC2 instances...
We can no longer ignore the lack of security in our data
collection systems. Seriously.
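One way to make a "random UDP packet" less trustworthy by default is to require a keyed signature on every metric. The sketch below is an illustration only: the packet layout, the function names `sign_metric` and `verify_metric`, and the shared secret are all assumptions, not part of any particular collector's protocol.

```python
import hashlib
import hmac
import json

# Hypothetical shared secret; a real deployment would load this from config.
SECRET = b"example-shared-secret"
TAG_LEN = 32  # size of an HMAC-SHA256 digest in bytes


def sign_metric(name, value, secret=SECRET):
    """Serialize a metric and prepend an HMAC-SHA256 tag to the payload."""
    payload = json.dumps({"name": name, "value": value}, sort_keys=True).encode("utf-8")
    tag = hmac.new(secret, payload, hashlib.sha256).digest()
    return tag + payload


def verify_metric(packet, secret=SECRET):
    """Return the decoded metric, or None if the packet is not correctly signed."""
    if len(packet) <= TAG_LEN:
        return None
    tag, payload = packet[:TAG_LEN], packet[TAG_LEN:]
    expected = hmac.new(secret, payload, hashlib.sha256).digest()
    # compare_digest avoids leaking how many leading bytes matched.
    if not hmac.compare_digest(tag, expected):
        return None
    return json.loads(payload.decode("utf-8"))
```

A collector would call `verify_metric` on each datagram and drop anything that returns None before it reaches the scaling logic. This sketch does nothing about replayed packets; a timestamp or counter inside the signed payload would be needed for that.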
Collect ALL the metrics.
Even the ones we don't understand
Stuff said on Twitter and Facebook
Nordic Arachnid Flatulence.
Oh hey. How's that network saturation looking?
Realize you can't store every bit of data you collect
You would likely need more hardware to retain the data
about your infrastructure than the infrastructure itself
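One common way to live with that constraint is to keep raw points only briefly and retain coarser summaries for longer. The function below is a minimal sketch of that idea; the name `rollup` and the choice of min, max, average and count as the summary are assumptions made for illustration.

```python
from collections import defaultdict


def rollup(points, bucket_seconds):
    """Downsample (timestamp, value) points to one summary per time bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return {
        start: {
            "min": min(vs),
            "max": max(vs),
            "avg": sum(vs) / len(vs),
            "count": len(vs),
        }
        for start, vs in sorted(buckets.items())
    }
```

Rolling per-second points into one-minute buckets this way stores one summary where there were sixty points, at the cost of losing the exact shape of anything shorter than a minute.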
Logs are likely still your richest source of data
Don't collect something just because it's tradition
We need science, not folklore
What you need to know
Is the system doing what it's supposed to?
Is the business able to do what it's supposed to?
I shouldn't need to go to a Tufte seminar to understand
why shit is broken.
We need a GoF for common visualizations of common
Watch out for mud radios
PagerDuty Alert. You have one triggered incident on US
PROD N-A-G-I-O-S. The failure is …
The failure is you won't leave me the hell alone
Alert fatigue is the single biggest problem we have right now
The Big Mistake
One Event – One Alert
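Reading the mistake as "page a human once for every event", one possible remedy is to notify once per problem and count the repeats. The class below is a hypothetical sketch of that; `AlertSuppressor`, the alert keys, and the quiet period are all assumed names and values.

```python
class AlertSuppressor:
    """Send at most one notification per alert key per quiet period."""

    def __init__(self, quiet_seconds):
        self.quiet_seconds = quiet_seconds
        self._last_sent = {}  # key -> timestamp of the last notification
        self.suppressed = {}  # key -> events swallowed since that notification

    def should_notify(self, key, now):
        """Return True if this event should page someone, False if it is a repeat."""
        last = self._last_sent.get(key)
        if last is not None and now - last < self.quiet_seconds:
            self.suppressed[key] = self.suppressed.get(key, 0) + 1
            return False
        self._last_sent[key] = now
        self.suppressed[key] = 0
        return True
```

With a 300 second quiet period, a flapping check that fires every 10 seconds produces one page and a suppressed count, not thirty pages.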
Things I don't alert on
Things I do alert on
Connection pool failures
Is shit working?
Real Talk Part 2
Alert on actionable things
Thresholds are ever-evolving
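A threshold that evolves can be as simple as one derived from recent history. The sketch below flags a value that exceeds the rolling mean by k standard deviations; the class name `RollingThreshold` and the defaults for window, k and minimum sample count are assumptions, and this is only one of many possible baselining schemes.

```python
from collections import deque
from statistics import mean, pstdev


class RollingThreshold:
    """Flag values above mean + k * stddev of a sliding window of recent values."""

    def __init__(self, window=60, k=3.0, min_samples=10):
        self.values = deque(maxlen=window)  # oldest values fall off automatically
        self.k = k
        self.min_samples = min_samples

    def is_anomalous(self, value):
        """Check value against recent history, then record it in the window."""
        anomalous = False
        # Stay quiet until there is enough history to form a baseline.
        if len(self.values) >= self.min_samples:
            limit = mean(self.values) + self.k * pstdev(self.values)
            anomalous = value > limit
        self.values.append(value)
        return anomalous
```

Because the anomalous value is also recorded, a sustained shift becomes the new normal after a while, which may or may not be what a given alert wants.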
We need to be more intelligent about our alerts or we'll all go
A few final thoughts
Storm + Esper
Things we need to accept
JSON costs you precision
The JVM is not the end of the world
Nagios isn't going anywhere
Each additional component creates management
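On "JSON costs you precision": JSON has a single number type, and many consumers read it as a 64-bit float, so integers above 2^53 can change silently in transit. The snippet below demonstrates the effect; `roundtrip_as_double` is a hypothetical helper that uses `parse_int=float` to imitate a float-only parser.

```python
import json


def roundtrip_as_double(n):
    """Encode n as JSON, then decode it the way a float-only parser would."""
    return int(json.loads(json.dumps(n), parse_int=float))


# A large counter or nanosecond timestamp, just past what a double holds exactly.
big = 2**53 + 1

# Python's own json module keeps integers exact by default.
exact = json.loads(json.dumps(big))

# A parser that treats every number as a double returns a different value.
lossy = roundtrip_as_double(big)

print(exact == big, lossy == big)
```

Sending such values as strings is a common workaround when the consumer cannot be trusted to keep integers exact.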
Twitter - @lusis
GitHub – lusis