“We'll be hearing talks from leading open source
developers and web operations luminaries, and then
taking what we've learned to apply it towards advancing
the state of open source monitoring and trending
software.”
Slide 4
Slide 4 text
ALLSPAW
AUTHOR
AUTHOR
LOGSTASH
CEPMON
An actual fucking scientist
RIEMANN
GRAPHITE
RAILSMACHINE
GITHUB
SENSU
37signals
Heroku
Github
BOUNDARY
PAPERLESS
POST
ETSY
LIBRATO
AUTHOR
Wrote all the software in the world
Slide 5
Slide 5 text
● Complains a lot on the internet
● Hates Maven
● Won't stop complaining on the internet
● Hates MongoDB
● Does this guy ever stop complaining?
Slide 7
Slide 7 text
#monitoringsucks
Slide 8
Slide 8 text
#monitoringlove
Slide 9
Slide 9 text
“We'll be hearing talks from leading open source developers
and web operations luminaries, and then taking what we've
learned to apply it towards advancing the state of open source
monitoring and trending software.”
Slide 10
Slide 10 text
Security
Slide 11
Slide 11 text
Automation
● Automation is cool and awesome and necessary
● Automate everything!
● Automate our scaling based on our metrics!
Slide 13
Slide 13 text
Oh hey! A random UDP packet with metric data.
Lemme just automatically launch a few new EC2 instances.....
Slide 14
Slide 14 text
We can no longer ignore the lack of security in our data
collection systems. Seriously.
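One way to stop trusting random UDP packets is to have senders sign each metric with a shared secret and have the collector drop anything that fails verification. The sketch below is only an illustration of that idea: the packet layout (hex tag, space, JSON body), the `SHARED_KEY` value, and the function names are all assumptions made up for this example, not the format of any existing collector. It also does nothing about replayed packets, which a real design would have to handle.

```python
import hashlib
import hmac
import json

# Hypothetical shared secret; in practice this would come from configuration.
SHARED_KEY = b"hypothetical-shared-secret"


def sign_metric(payload, key=SHARED_KEY):
    """Serialize a metric dict and prefix it with an HMAC-SHA256 tag."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    tag = hmac.new(key, body, hashlib.sha256).hexdigest().encode("ascii")
    return tag + b" " + body


def verify_metric(packet, key=SHARED_KEY):
    """Return the metric dict if the tag checks out, otherwise None."""
    tag, sep, body = packet.partition(b" ")
    if not sep:
        return None  # not in the expected "tag body" layout
    expected = hmac.new(key, body, hashlib.sha256).hexdigest().encode("ascii")
    if not hmac.compare_digest(tag, expected):
        return None  # unsigned or tampered packet: do not act on it
    return json.loads(body)
```

A collector would call `verify_metric` on every datagram and only feed the result into anything that launches instances when it is not `None`.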
Slide 15
Slide 15 text
Retention
Slide 16
Slide 16 text
Telemetry Data
● Collect ALL the metrics.
● Even the ones we don't understand
● Stuff said on Twitter and Facebook
● Nordic Arachnid Flatulence.
Slide 18
Slide 18 text
Oh hey. How's that network saturation looking?
Slide 19
Slide 19 text
Real talk
● Realize you can't store every bit of data you collect forever
● You would likely need more hardware to retain the data about your infrastructure than the infrastructure itself
● Logs are likely still your richest source of data
● Don't collect something just because it's tradition
● We need science not folklore
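A common answer to "you can't store everything forever" is to keep raw points only briefly and roll older data up into coarser averages. The slides don't prescribe a scheme, so treat this as a minimal sketch: the `downsample` name and the choice of averaging (rather than max, min, or percentiles) are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def downsample(points, bucket_seconds):
    """Average (unix_timestamp, value) pairs into fixed-width time buckets.

    Returns a list of (bucket_start, average) sorted by bucket_start.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Four raw points collapse into two one-minute averages.
raw = [(0, 1.0), (10, 3.0), (60, 5.0), (110, 7.0)]
print(downsample(raw, 60))  # [(0, 2.0), (60, 6.0)]
```

Averaging hides spikes, so which aggregate to keep is itself a decision about what you actually need to know.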
Slide 20
Slide 20 text
What you need to know
● Is the system doing what it's supposed to?
● Is the business able to do what it's supposed to?
Slide 21
Slide 21 text
Interpretation
Slide 26
Slide 26 text
● I shouldn't need to go to a Tufte seminar to understand why shit is broken.
● We need a GoF for common visualizations of common data points
● Watch out for mud radios
Slide 27
Slide 27 text
Alerting
Slide 28
Slide 28 text
Pager Duty Alert. You have one triggered incident on US
PROD N-A-G-I-O-S. The failure is …..
Slide 29
Slide 29 text
The failure is you won't leave me the hell alone
Slide 30
Slide 30 text
Alert fatigue is the single biggest problem we have right
now.
Slide 31
Slide 31 text
The Big Mistake
One Event – One Alert
Slide 32
Slide 32 text
Things I don't alert on
● Memory usage
● CPU Usage
● Load Average
Slide 33
Slide 33 text
Things I do alert on
● JVM OOMs
● Latency
● Connection pool failures
● Is shit working?
Slide 34
Slide 34 text
Real Talk Part 2
● Alert on actionable things
● Thresholds are ever-evolving
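One way to get away from "One Event – One Alert" is to collapse repeated events for the same problem into a single notification per time window. This is a sketch of that idea only, not something the slides specify: the `AlertDeduplicator` class, the per-key window, and the example keys are all hypothetical.

```python
class AlertDeduplicator:
    """Suppress repeat notifications for the same key within a time window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._last_sent = {}  # key -> timestamp of the last notification sent

    def should_notify(self, key, now):
        """Return True if an alert for `key` at time `now` should page someone."""
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window_seconds:
            return False  # already paged for this recently; stay quiet
        self._last_sent[key] = now
        return True


dedup = AlertDeduplicator(window_seconds=300)
print(dedup.should_notify("api-latency", now=0))    # True
print(dedup.should_notify("api-latency", now=60))   # False
print(dedup.should_notify("api-latency", now=300))  # True
```

The window is a threshold like any other, so expect to tune it as the system changes.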
Slide 35
Slide 35 text
We need to be more intelligent about our alerts or we'll all go
insane.
Slide 36
Slide 36 text
Things we need to accept
● JSON costs you precision
● The JVM is not the end of the world
● Nagios isn't going anywhere
● Each additional component creates management overhead
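The slides don't say which precision loss they mean by "JSON costs you precision". One common case is large counters: many JSON consumers read every number as a 64-bit float, which cannot represent all integers above 2^53. Python's `json` module keeps integers exact by default, so the sketch below uses the `parse_int=float` option to imitate a float-only consumer. The `bytes_sent` field is a made-up example.

```python
import json

# 2**53 + 1: the smallest positive integer a 64-bit float cannot represent.
raw = '{"bytes_sent": 9007199254740993}'

# Python's default parser keeps the integer exact.
exact = json.loads(raw)["bytes_sent"]

# Imitate a consumer that stores every JSON number as a double.
as_double = json.loads(raw, parse_int=float)["bytes_sent"]

print(exact)           # 9007199254740993
print(int(as_double))  # 9007199254740992
```

The counter comes back off by one with no error raised, which is exactly the kind of thing that is hard to spot on a graph.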