Slide 1

Slide 1 text

So... Monitorama. What's THAT all about?

Slide 2

Slide 2 text

“We'll be hearing talks from leading open source developers and web operations luminaries, and then taking what we've learned to apply it towards advancing the state of open source monitoring and trending software.”

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

ALLSPAW AUTHOR AUTHOR LOGSTASH CEPMON An actual fucking scientist RIEMANN GRAPHITE RAILSMACHINE GITHUB SENSU 37signals Heroku Github BOUNDARY PAPERLESS POST ETSY LIBRATO AUTHOR Wrote all the software in the world

Slide 5

Slide 5 text

● Complains a lot on the internet
● Hates Maven
● Won't stop complaining on the internet
● Hates MongoDB
● Does this guy ever stop complaining?

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

#monitoringsucks

Slide 8

Slide 8 text

#monitoringlove

Slide 9

Slide 9 text

“We'll be hearing talks from leading open source developers and web operations luminaries, and then taking what we've learned to apply it towards advancing the state of open source monitoring and trending software.”

Slide 10

Slide 10 text

Security

Slide 11

Slide 11 text

Automation
● Automation is cool and awesome and necessary
● Automate everything!
● Automate our scaling based on our metrics!

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

Oh hey! A random UDP packet with metric data. Lemme just automatically launch a few new EC2 instances.....

Slide 14

Slide 14 text

We can no longer ignore the lack of security in our data collection systems. Seriously.
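
One way to stop trusting random UDP packets is to have senders sign each metric with a shared secret and have the collector drop anything that fails verification. The sketch below is only an illustration, not something from the talk: the key, the `|`-delimited wire format, and the function names `sign_metric` and `verify_metric` are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical shared secret; in practice this would come from a secret store.
SHARED_KEY = b"example-shared-secret"


def sign_metric(payload: bytes, key: bytes = SHARED_KEY) -> bytes:
    """Append a hex HMAC-SHA256 tag to the payload, separated by '|'."""
    tag = hmac.new(key, payload, hashlib.sha256).hexdigest().encode("ascii")
    return payload + b"|" + tag


def verify_metric(packet: bytes, key: bytes = SHARED_KEY):
    """Return the payload if the trailing tag is valid, otherwise None."""
    # Split on the LAST '|' so payloads that contain '|' still work.
    payload, sep, tag = packet.rpartition(b"|")
    if not sep:
        return None
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest().encode("ascii")
    # compare_digest avoids leaking information through comparison timing.
    if hmac.compare_digest(tag, expected):
        return payload
    return None
```

A collector would call `verify_metric` on each datagram before acting on it. This only covers authenticity: it does not encrypt the data and does not stop a captured packet from being replayed, so a timestamp or counter inside the signed payload would still be needed.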

Slide 15

Slide 15 text

Retention

Slide 16

Slide 16 text

Telemetry Data
● Collect ALL the metrics.
● Even the ones we don't understand
● Stuff said on Twitter and Facebook
● Nordic Arachnid Flatulence.

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

Oh hey. How's that network saturation looking?

Slide 19

Slide 19 text

Real talk
● Realize you can't store every bit of data you collect forever
● You would likely need more hardware to retain the data about your infrastructure than the infrastructure itself
● Logs are likely still your richest source of data
● Don't collect something just because it's tradition
● We need science not folklore
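
A common way to live with "can't store everything forever" is to roll older data up into coarser buckets. The sketch below is a minimal illustration under assumptions of my own: points are `(unix_timestamp, value)` tuples, the rollup is a plain average, and `rollup` is a hypothetical helper rather than any particular tool's API.

```python
from collections import defaultdict


def rollup(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets.

    Returns a list of (bucket_start, mean_value) sorted by bucket start.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp down to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return [
        (start, sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]


# Four 10-second samples collapsed into two 60-second averages.
raw = [(0, 1.0), (10, 3.0), (60, 5.0), (70, 7.0)]
print(rollup(raw, 60))
```

Averaging hides spikes, so a real retention policy might keep min and max per bucket as well. Which aggregate to keep depends on the question the data is supposed to answer.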

Slide 20

Slide 20 text

What you need to know
● Is the system doing what it's supposed to?
● Is the business able to do what it's supposed to?

Slide 21

Slide 21 text

Interpretation

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

● I shouldn't need to go to a Tufte seminar to understand why shit is broken.
● We need a GoF for common visualizations of common data points
● Watch out for mud radios

Slide 27

Slide 27 text

Alerting

Slide 28

Slide 28 text

Pager Duty Alert. You have one triggered incident on US PROD N-A-G-I-O-S. The failure is …..

Slide 29

Slide 29 text

The failure is you won't leave me the hell alone

Slide 30

Slide 30 text

Alert fatigue is the single biggest problem we have right now.

Slide 31

Slide 31 text

The Big Mistake
One Event – One Alert
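
One simple alternative to paging on every event is to send the first alert for a given problem and suppress repeats for a while. The sketch below is only one possible approach and an assumption on my part: `AlertThrottle`, its window length, and the string keys are hypothetical and not tied to Nagios or PagerDuty.

```python
class AlertThrottle:
    """Allow one alert per key per time window and count suppressed repeats."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._last_sent = {}   # key -> time the last alert was sent
        self._suppressed = {}  # key -> repeats swallowed since that alert

    def should_alert(self, key, now):
        """Return True if an alert for `key` should be sent at time `now`."""
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window_seconds:
            self._suppressed[key] = self._suppressed.get(key, 0) + 1
            return False
        self._last_sent[key] = now
        self._suppressed[key] = 0
        return True

    def suppressed_count(self, key):
        """Number of repeats suppressed since the last alert for `key`."""
        return self._suppressed.get(key, 0)


throttle = AlertThrottle(window_seconds=300)
for t in (0, 30, 60, 400):
    if throttle.should_alert("connection-pool-failure", now=t):
        print(f"t={t}: page someone")
```

Here the events at t=30 and t=60 are swallowed and only t=0 and t=400 would page. Time-based throttling is the bluntest form of this idea; correlating related events into a single incident is the more interesting version.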

Slide 32

Slide 32 text

Things I don't alert on
● Memory usage
● CPU Usage
● Load Average

Slide 33

Slide 33 text

Things I do alert on
● JVM OOMs
● Latency
● Connection pool failures
● Is shit working?

Slide 34

Slide 34 text

Real Talk Part 2
● Alert on actionable things
● Thresholds are ever-evolving

Slide 35

Slide 35 text

We need to be more intelligent about our alerts or we'll all go insane.

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

A few final thoughts

Slide 38

Slide 38 text

Brain dump
● Rollups
● Event Correlation
● Riemann
● Storm + Esper
● ElasticSearch

Slide 39

Slide 39 text

Things we need to accept
● JSON costs you precision
● The JVM is not the end of the world
● Nagios isn't going anywhere
● Each additional component creates management overhead
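
The slide does not say which precision loss it means. One likely reading, and an assumption here, is that many JSON consumers decode every number as an IEEE 754 double, which cannot represent integers above 2**53 exactly. Python's own `json` module keeps integers exact, so the sketch below uses `parse_int=float` to imitate a double-only consumer.

```python
import json

big_counter = 2**53 + 1  # 9007199254740993, e.g. a large byte counter

encoded = json.dumps(big_counter)

# Python's default decoder keeps arbitrary-precision integers intact.
exact = json.loads(encoded)

# Imitate a consumer that stores every JSON number as a double.
as_double = json.loads(encoded, parse_int=float)

print(exact == big_counter)           # True
print(int(as_double) == big_counter)  # False
print(int(as_double))                 # 9007199254740992
```

The value comes back off by one without any error. If exact large integers matter, sending them as strings is one common workaround.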

Slide 40

Slide 40 text

Questions?

Slide 41

Slide 41 text

Thanks!
● Twitter - @lusis
● GitHub - lusis