metrics, monitoring, logging

Site: http://paperplanes.de
Travis CI: http://travis-ci.org
Riak Handbook: http://riakhandbook.com
Pingdom: http://pingdom.com
Nagios: http://nagios.org
Sensu: http://www.sonian.com/cloud-monitoring-sensu/
Sheriff: https://github.com/dawanda/sheriff
Monit: http://mmonit.com/monit/
Bluepill: https://github.com/arya/bluepill
Runit: http://smarden.org/runit/
Munin: http://munin-monitoring.org/
Ganglia: http://ganglia.info/
Graphite: http://graphite.wikidot.com
GDash: https://github.com/ripienaar/gdash
Graphiti: https://github.com/paperlesspost/graphiti
Tasseo: https://github.com/obfuscurity/tasseo
Cube: http://square.github.com/cube/
Cubism: http://square.github.com/cubism/
NewRelic: http://newrelic.com
Scout: http://scoutapp.com
Server Density: http://serverdensity.com
Boundary: http://boundary.com
Librato Metrics: http://metrics.librato.com
Riemann: http://aphyr.github.com/riemann/
StatsD: https://github.com/etsy/statsd
Metriks: https://github.com/eric/metriks
Logstash: http://logstash.net/
Graylog: http://graylog2.org/
Loggly: http://loggly.com
Papertrail: https://papertrailapp.com/
Metriks Log Webhook: https://github.com/eric/metriks_log_webhook
Lograge: https://github.com/mattmatt/lograge

Further reading:
http://www.paperplanes.de/2011/1/5/the_virtues_of_monitoring.html
http://about.travis-ci.org/blog/2012-04-02-metrics-monitoring-infrastructure-oh-my/
http://pivotallabs.com/talks/139-metrics-metrics-everywhere
http://bitmonkey.net/post/18854033582/introducing-metriks
http://code.flickr.com/blog/2008/10/27/counting-timing/

Mathias Meyer

May 04, 2012

Transcript

  1. metrics, monitoring, logging mathias meyer, @roidrage http://paperplanes.de

  6. problem?

  7. no one noticed, no one got alerted, no automatic recovery

  8. it happened to me, it happened to you

  9. devops shmevops

  10. your code, your responsibility

  11. what is your application doing right now?

  12. do you know when it fails?

  13. failure means customers lose trust

  14. failure means customers go elsewhere

  15. failure means you lose money

  16. application = providing value

  17. monitoring, metrics, logging

  18. monitoring

  19. is the application available?

  20. pingdom, pagerduty, nagios, icinga, sensu, sheriff

  21. pingdom

  22. http://pingdom.com

  23. tcp/ip, http(s), ping
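
The most basic of these external checks, "does the port accept a TCP connection?", can be sketched in a few lines of Ruby. This is only an illustration, not how Pingdom works internally; the method name `port_open?` and the two-second default timeout are assumptions made for the example.

```ruby
require "socket"

# Returns true if a TCP connection to host:port can be opened within
# `timeout` seconds, false otherwise. A successful connect only proves
# that something is listening, not that the application is healthy.
def port_open?(host, port, timeout: 2)
  Socket.tcp(host, port, connect_timeout: timeout) { |_socket| true }
rescue SystemCallError, SocketError, IOError
  # Connection refused, timed out, or host name could not be resolved.
  false
end

# Example call (hypothetical host):
#   port_open?("example.com", 443)
```

A check like this answers "is it reachable?" and nothing more; an HTTP check against a real page gets closer to "is the application available?".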

  24. nagios

  25. nagios can check everything

  26. it's still terrible

  27. http://www.nagios.org/

  28. #monitoringsucks

  29. sensu http://www.sonian.com/cloud-monitoring-sensu/

  30. sheriff https://github.com/dawanda/sheriff

  31. monit, runit, bluepill, god, upstart

  32. is this service currently providing value?

  33. is this service consuming too many resources?

  34. monit

  35. check process unicorn with pidfile /var/run/unicorn/unicorn.pid
        start program = "/etc/init.d/unicorn start"
        stop program = "/etc/init.d/unicorn stop"
        if mem is greater than 300.0 MB for 1 cycles then restart
        if cpu is greater than 50% for 2 cycles then alert
        if cpu is greater than 80% for 3 cycles then restart
        group unicorn

      http://mmonit.com/monit/
  36. bluepill

  37. Bluepill.application("unicorn") do |app|
        app.working_dir = "/var/www/app/current"
        app.process("unicorn") do |process|
          process.start_command = "/etc/init.d/unicorn start"
          process.stop_command = "kill -QUIT {{PID}}"
          process.restart_command = "kill -USR2 {{PID}}"
          process.stdout = process.stderr = "/var/www/app/current/log/unicorn.log"
          process.pid_file = "/var/run/unicorn/unicorn.pid"
          process.checks :mem_usage, :every => 10.seconds, :below => 300.megabytes, :times => [3, 5]
          process.start_grace_time = 10.seconds
          process.restart_grace_time = 10.seconds
          process.checks :flapping, :times => 2, :within => 30.seconds, :retry_in => 7.seconds
          process.monitor_children do |cp|
            cp.checks :mem_usage, :every => 10, :below => 400.megabytes, :times => [3, 5]
            process.checks :cpu_usage, :every => 10.seconds, :below => 50, :times => 5
            cp.stop_command = "kill -QUIT {{PID}}"
          end
        end
      end

      https://github.com/arya/bluepill
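
The `:times => [3, 5]` option in the Bluepill config reads as "act once the check has failed 3 of the last 5 runs", which avoids restarting on a single spike. A minimal sketch of that idea follows; it is not Bluepill's implementation, and the class name `ThresholdWatch` and its parameters are made up for illustration.

```ruby
# Keeps the most recent `window` samples and reports a violation once at
# least `failures` of them are not below `limit`.
class ThresholdWatch
  def initialize(limit:, failures: 3, window: 5)
    @limit = limit
    @failures = failures
    @window = window
    @samples = []
  end

  # Records one sample. Returns true when the supervisor should act
  # (for example, restart the process), false otherwise.
  def record(value)
    @samples << value
    @samples.shift while @samples.size > @window
    @samples.count { |sample| sample >= @limit } >= @failures
  end
end

# Memory readings in MB against a 300 MB limit (values are invented).
watch = ThresholdWatch.new(limit: 300)
decisions = [310, 250, 320, 290, 305].map { |mb| watch.record(mb) }
# decisions => [false, false, false, false, true]
```
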
  38. runit

  39. #!/bin/sh
      cd /var/www/app/current
      exec ./bin/unicorn_rails -c config/unicorn.rb -E production

      http://smarden.org/runit/

  40. metrics

  42. measurements, historical data, graphs

  43. how many customers are on my site?

  44. how many customers were on my site yesterday?

  45. how slow is paypal's api?

  46. how slow was paypal's api yesterday?

  47. how much memory is available on my servers?

  48. how much has memory usage grown over four weeks?

  49. number of open database connections, number of redis commands, number of 500 errors, rate of HTTP requests, number of HTTP connections, median response time

  50. number of failed resque jobs, number of twitter followers, 99th percentile github api response time, 95th percentile mysql query time, deployments

  51. cpu usage, incoming network traffic, load average, disk usage, iops

  52. munin, ganglia, graphite, scout, server density, librato metrics

  53. munin

  54. http://munin-monitoring.org/

  55. ganglia

  56. http://ganglia.info/

  57. #monitoringsucks

  58. #rrdtoolsucks

  60. access to single data points matters

  61. graphite

  62. modern graphing, not using rrdtool, extensible http://graphite.wikidot.com/

  63. graphite dashboards

  64. https://github.com/ripienaar/gdash

  65. https://github.com/paperlesspost/graphiti

  66. https://github.com/obfuscurity/tasseo

  67. cube & cubism

  68. http://square.github.com/cube/

  69. commercial tools

  70. newrelic http://newrelic.com

  71. scout http://scoutapp.com

  72. server density http://serverdensity.com

  73. boundary

  74. http://boundary.com

  75. librato metrics

  76. metrics as a service, resolutions to the second, real-time updates

  77. http://metrics.librato.com

  81. collectd (honorable mention) http://collectd.org

  82. riemann (honorable mention)

  83. http://aphyr.github.com/riemann/

  84. track everything that moves

  86. adding metrics should be easy

  87. statsd https://github.com/etsy/statsd
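
Part of why adding metrics with StatsD is easy is its wire format: one short text datagram per measurement, sent over UDP in the form `name:value|type` (`c` for counters, `ms` for timings, `g` for gauges). The sketch below builds and sends such a datagram with Ruby's standard library only. The helper names are invented for this example, and host and port are assumptions (8125 is StatsD's usual default).

```ruby
require "socket"

# Builds a StatsD datagram of the form "<name>:<value>|<type>".
def statsd_packet(name, value, type)
  "#{name}:#{value}|#{type}"
end

# Sends one datagram via UDP. Fire and forget: if no StatsD daemon is
# listening, the measurement is silently lost and the caller is not blocked.
def statsd_send(packet, host: "127.0.0.1", port: 8125)
  socket = UDPSocket.new
  socket.send(packet, 0, host, port)
ensure
  socket&.close
end

# Example calls (the second metric name is hypothetical):
#   statsd_send(statsd_packet("travis.github.requests", 1, "c"))
#   statsd_send(statsd_packet("paypal.api.response_time", 320, "ms"))
```

Because UDP does not wait for an acknowledgement, instrumenting a hot code path this way adds very little latency, at the cost of occasionally dropped samples.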

  88. metriks https://github.com/eric/metriks

  89. counters, meters, timers

  90. Metriks.meter("travis.github.requests").mark

  91. Metriks.counter("travis.repositories").increment
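
To make the three metric types concrete, here is a plain-Ruby sketch of what a counter, a meter, and a timer each keep track of. It is not Metriks's implementation (a real library would also handle thread safety and moving averages); the `Sketch*` class names and the injectable `clock` are assumptions made for this example.

```ruby
MONOTONIC_CLOCK = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }

# Counter: a number that goes up and down (e.g. known repositories).
class SketchCounter
  attr_reader :count

  def initialize
    @count = 0
  end

  def increment(by = 1)
    @count += by
  end

  def decrement(by = 1)
    @count -= by
  end
end

# Meter: counts events and reports how often they happen per second.
class SketchMeter
  attr_reader :count

  def initialize(clock: MONOTONIC_CLOCK)
    @clock = clock
    @started = clock.call
    @count = 0
  end

  def mark(events = 1)
    @count += events
  end

  # Events per second since the meter was created.
  def mean_rate
    elapsed = @clock.call - @started
    elapsed.zero? ? 0.0 : @count / elapsed
  end
end

# Timer: records how long each call of a block took, in seconds.
class SketchTimer
  attr_reader :durations

  def initialize(clock: MONOTONIC_CLOCK)
    @clock = clock
    @durations = []
  end

  # Runs the block, records its duration, and returns the block's value.
  def time
    started = @clock.call
    yield
  ensure
    @durations << (@clock.call - started)
  end
end
```

A timer keeps every duration rather than a running average so that percentiles can be computed from the samples later.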

  92. librato metrics, log stream, graphite, proc title

  93. percentiles > averages
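
A small worked example shows why percentiles say more than averages. The response times below are invented: nine fast requests and one very slow one. The percentile uses the simple nearest-rank method, which is one of several common definitions.

```ruby
# Arithmetic mean of a list of numbers.
def mean(values)
  values.sum.to_f / values.size
end

# Nearest-rank percentile: the smallest sample such that at least
# `pct` percent of all samples are less than or equal to it.
def percentile(values, pct)
  sorted = values.sort
  rank = (pct * sorted.size / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

response_times_ms = [40, 42, 45, 47, 50, 52, 55, 60, 65, 2000]

average = mean(response_times_ms)           # 245.6
median  = percentile(response_times_ms, 50) # 50
p95     = percentile(response_times_ms, 95) # 2000
```

The average of 245.6 ms describes no request that actually happened. The median shows that a typical request takes about 50 ms, and the 95th percentile exposes the 2000 ms outlier that some customer really experienced.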

  94. dashboards

  95. combine graphs

  96. put them up in your office

  97. visibility is important

  98. logging

  99. the papertrail

  100. #syslogsucks

  101. collect logs from everywhere

  102. index, aggregate, analyze

  103. grep, awk, sort

  105. centralized logging

  106. syslog://

  107. logstash http://logstash.net/

  108. log inputs process outputs

  109. graylog

  110. http://graylog2.org/

  111. loggly

  112. http://loggly.com

  113. papertrail

  114. https://papertrailapp.com/

  115. integrates with librato metrics

  117. bits and pieces

  118. travis metrics

  120. https://github.com/eric/metriks_log_webhook

  121. lograge

  122. sane rails logging

  124. https://github.com/mattmatt/lograge
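
"Sane" logging here means one line per request, made of `key=value` pairs that tools like grep, awk, or a log indexer can pick apart. The sketch below shows only the formatting idea; it is not Lograge's API, and the method name and field names are chosen for illustration.

```ruby
# Formats one request as a single "key=value key=value ..." line.
# Floats are rounded to two decimals to keep the line compact.
def request_log_line(fields)
  fields.map { |key, value|
    value = format("%.2f", value) if value.is_a?(Float)
    "#{key}=#{value}"
  }.join(" ")
end

line = request_log_line(
  method: "GET",
  path: "/jobs/1.json",
  status: 200,
  duration: 58.333
)
# line => "method=GET path=/jobs/1.json status=200 duration=58.33"
```

With every request on one line, a question like "how many 500 errors in the last hour?" becomes a simple filter instead of a multi-line parsing problem.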

  125. #monitoringsucksless

  126. own your monitoring

  127. own your metrics

  128. own your logging

  129. none of them is optional

  130. go forth and correlate

  131. http://www.paperplanes.de/2011/1/5/the_virtues_of_monitoring.html
       http://about.travis-ci.org/blog/2012-04-02-metrics-monitoring-infrastructure-oh-my/
       http://pivotallabs.com/talks/139-metrics-metrics-everywhere
       http://bitmonkey.net/post/18854033582/introducing-metriks
       http://code.flickr.com/blog/2008/10/27/counting-timing/

  132. we're not hiring ❤