metrics, monitoring, logging

mathias meyer, @roidrage
http://paperplanes.de

problem?

no one noticed
no one got alerted
no automatic recovery

it happened to me
it happened to you

devops shmevops

your code, your responsibility

what is your application doing right now?

do you know when it fails?

failure means customers lose trust

failure means customers go elsewhere

failure means you lose money

application = providing value

monitoring
metrics
logging

monitoring

is the application available?

pingdom
pagerduty
nagios
icinga
sensu
sheriff

pingdom

http://pingdom.com

tcp/ip
http(s)
ping
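
The simplest of these checks, a plain TCP connect, can be sketched in a few lines of Ruby. This only illustrates what such a probe does, not how pingdom implements it; `port_open?` and its default timeout are names and values made up for this example.

```ruby
require "socket"

# Hypothetical helper: returns true if a TCP connection to host:port can be
# opened within `timeout` seconds, false otherwise.
def port_open?(host, port, timeout: 2)
  # Socket.tcp closes the socket after the block and returns the block's value.
  Socket.tcp(host, port, connect_timeout: timeout) { |_sock| true }
rescue SystemCallError, SocketError
  # Refused, timed out, unreachable or unresolvable: treat all as "down".
  false
end
```

A check like this only proves that something accepts connections on the port. Whether the service behind it is providing value needs an application-level check on top.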

nagios

nagios can check everything

it's still terrible

http://www.nagios.org/

#monitoringsucks

sensu http://www.sonian.com/cloud-monitoring-sensu/

sheriff https://github.com/dawanda/sheriff

monit
runit
bluepill
god
upstart

is this service currently providing value?

is this service consuming too many resources?

check process unicorn with pidfile /var/run/unicorn/unicorn.pid
  start program = "/etc/init.d/unicorn start"
  stop program = "/etc/init.d/unicorn stop"
  if mem is greater than 300.0 MB for 1 cycles then restart
  if cpu is greater than 50% for 2 cycles then alert
  if cpu is greater than 80% for 3 cycles then restart
  group unicorn

http://mmonit.com/monit/
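
A rule such as "for 3 cycles" only fires once the threshold has been exceeded on that many consecutive checks, which keeps a single spike from restarting the service. A rough Ruby sketch of that idea follows. It illustrates the concept rather than monit's actual implementation, and `CycleCheck` and the sample values are hypothetical.

```ruby
# Fires only after the limit was exceeded on `cycles` consecutive checks.
class CycleCheck
  def initialize(limit:, cycles:)
    @limit = limit
    @cycles = cycles
    @streak = 0
  end

  # Record one sample; returns true when the configured action should trigger.
  def exceeded?(value)
    # Any sample at or below the limit resets the streak.
    @streak = value > @limit ? @streak + 1 : 0
    @streak >= @cycles
  end
end

cpu = CycleCheck.new(limit: 80, cycles: 3)
# Made-up cpu percentages, one per monitoring cycle.
results = [85, 90, 70, 81, 95, 99].map { |sample| cpu.exceeded?(sample) }
# Only the last sample completes a run of three above 80.
```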

bluepill

Bluepill.application("unicorn") do |app|
  app.working_dir = "/var/www/app/current"
  app.process("unicorn") do |process|
    process.start_command = "/etc/init.d/unicorn start"
    process.stop_command = "kill -QUIT {{PID}}"
    process.restart_command = "kill -USR2 {{PID}}"
    process.stdout = process.stderr = "/var/www/app/current/log/unicorn.log"
    process.pid_file = "/var/run/unicorn/unicorn.pid"
    # check memory every 10 seconds, act when 3 out of 5 checks are over the limit
    process.checks :mem_usage, :every => 10.seconds, :below => 300.megabytes, :times => [3, 5]
    process.start_grace_time = 10.seconds
    process.restart_grace_time = 10.seconds
    process.checks :flapping, :times => 2, :within => 30.seconds, :retry_in => 7.seconds
    # unicorn forks workers, so watch the children as well
    process.monitor_children do |cp|
      cp.checks :mem_usage, :every => 10, :below => 400.megabytes, :times => [3, 5]
      process.checks :cpu_usage, :every => 10.seconds, :below => 50, :times => 5
      cp.stop_command = "kill -QUIT {{PID}}"
    end
  end
end

https://github.com/arya/bluepill

#!/bin/sh
cd /var/www/app/current
./bin/unicorn_rails -c config/unicorn.rb -e production

http://smarden.org/runit/

metrics

measurements
historical data
graphs

how many customers are on my site?

how many customers were on my site yesterday?

how slow is paypal's api?

how slow was paypal's api yesterday?

how much memory is available on my servers?

how much has memory usage grown over four weeks?

number of open database connections
number of redis commands
number of 500 errors
rate of HTTP requests
number of HTTP connections
median response time

number of failed resque jobs
number of twitter followers
99th percentile github api response time
95th percentile mysql query time
deployments

cpu usage
incoming network traffic
load average
disk usage
iops

munin
ganglia
graphite
scout
server density
librato metrics

http://munin-monitoring.org/

ganglia

http://ganglia.info/

#monitoringsucks

#rrdtoolsucks

access to single data points matters

graphite

modern graphing
not using rrdtool
extensible
http://graphite.wikidot.com/

graphite dashboards

https://github.com/ripienaar/gdash

https://github.com/paperlesspost/graphiti

https://github.com/obfuscurity/tasseo

cube & cubism

http://square.github.com/cube/

commercial tools

newrelic http://newrelic.com

scout http://scoutapp.com

server density http://serverdensity.com

boundary

http://boundary.com

librato metrics

metrics as a service
resolution down to the second
real-time updates

http://metrics.librato.com

collectd (honorable mention) http://collectd.org

riemann (honorable mention)

http://aphyr.github.com/riemann/

track everything that moves

adding metrics should be easy

statsd https://github.com/etsy/statsd
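
Part of what makes statsd easy to add is its wire format: plain-text UDP datagrams of the form name:value|type, with c for counters and ms for timers. A minimal sketch of a client using only Ruby's standard library follows. `TinyStatsd` is a hypothetical class for illustration, the metric names in the usage note are made up, and 8125 is assumed as the port because it is statsd's usual default.

```ruby
require "socket"

# Tiny fire-and-forget statsd client (hypothetical, for illustration only).
class TinyStatsd
  def initialize(host = "127.0.0.1", port = 8125)
    @host = host
    @port = port
    @socket = UDPSocket.new
  end

  # Counter: "name:1|c"
  def increment(name, by = 1)
    send_metric("#{name}:#{by}|c")
  end

  # Timer: "name:320|ms"
  def timing(name, milliseconds)
    send_metric("#{name}:#{milliseconds}|ms")
  end

  private

  # UDP is connectionless, so no handshake with the statsd server is needed.
  def send_metric(payload)
    @socket.send(payload, 0, @host, @port)
  rescue SystemCallError
    nil # reporting a metric must never take the application down
  end
end
```

With this in place, tracking something is one line, for example `TinyStatsd.new.increment("app.signups")`.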

metriks https://github.com/eric/metriks

counters
meters
timers

Metriks.meter("travis.github.requests").mark

Metriks.counter("travis.repositories").increment
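
The meter and counter calls above cover two of the three metric types. To show what a timer does conceptually, here is a small plain-Ruby sketch. It is not how metriks implements timers, and `SimpleTimer` is a hypothetical class written only for this example.

```ruby
# Minimal stand-in for a timer metric: records how long blocks take.
class SimpleTimer
  attr_reader :durations

  def initialize
    @durations = []
  end

  # Runs the block, records its duration in seconds (even if it raises),
  # and returns the block's result.
  def time
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    @durations << Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end

  def count
    @durations.size
  end

  def mean
    return 0.0 if @durations.empty?
    @durations.sum / @durations.size
  end
end
```

Wrapping a call in `timer.time { ... }` leaves the calling code unchanged apart from the block, which is what keeps adding metrics cheap.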

librato metrics
log stream
graphite
proc title

percentiles > averages
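
A small worked example shows why. The response times below are made up: 90 fast requests and 10 very slow ones. The percentile uses the nearest-rank method, which is one of several common definitions.

```ruby
# Nearest-rank percentile: the smallest value such that at least pct percent
# of the samples are less than or equal to it.
def percentile(values, pct)
  sorted = values.sort
  rank = (pct * sorted.size / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

def mean(values)
  values.sum.to_f / values.size
end

# Hypothetical response times in milliseconds.
response_times = [20] * 90 + [2000] * 10

puts mean(response_times)            # => 218.0
puts percentile(response_times, 50)  # => 20
puts percentile(response_times, 95)  # => 2000
```

The average of 218 ms describes no request that actually happened. The median says most customers are fine, and the 95th percentile shows that one in ten is waiting two seconds.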

dashboards

combine graphs

put them up in your office

visibility is important

logging

the papertrail

#syslogsucks

collect logs from everywhere

index, aggregate, analyze

grep, awk, sort

centralized logging

syslog://

logstash http://logstash.net/

log inputs process outputs

graylog

http://graylog2.org/

loggly

http://loggly.com

papertrail

https://papertrailapp.com/

integrates with librato metrics

bits and pieces

travis metrics

https://github.com/eric/metriks_log_webhook

lograge

sane rails logging

https://github.com/mattmatt/lograge
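
One request per line in key=value form is what makes logs easy to index, aggregate and analyze. The sketch below assumes lines similar in shape to what lograge emits. The sample lines are invented, and `parse_line` is a hypothetical helper, not part of lograge.

```ruby
# Turn "key=value key=value ..." into a Hash of strings.
def parse_line(line)
  line.scan(/(\w+)=(\S+)/).to_h
end

# Made-up sample lines in a lograge-like format.
lines = [
  "method=GET path=/jobs/1.json status=200 duration=58.33",
  "method=GET path=/jobs/2.json status=500 duration=812.40",
  "method=POST path=/builds status=200 duration=120.07",
]

requests = lines.map { |line| parse_line(line) }

# Aggregate: number of requests per status code.
by_status = requests.group_by { |r| r["status"] }.transform_values(&:size)

# Analyze: the slowest request in the sample.
slowest = requests.max_by { |r| r["duration"].to_f }
```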

#monitoringsucksless

own your monitoring

own your metrics

own your logging

none of them is optional

go forth and correlate

http://www.paperplanes.de/2011/1/5/the_virtues_of_monitoring.html
http://about.travis-ci.org/blog/2012-04-02-metrics-monitoring-infrastructure-oh-my/
http://pivotallabs.com/talks/139-metrics-metrics-everywhere
http://bitmonkey.net/post/18854033582/introducing-metriks
http://code.flickr.com/blog/2008/10/27/counting-timing/

we're not hiring ❤
