metrics, monitoring, logging

Site: http://paperplanes.de
Travis CI: http://travis-ci.org
Riak Handbook: http://riakhandbook.com
Pingdom: http://pingdom.com
Nagios: http://nagios.org
Sensu: http://www.sonian.com/cloud-monitoring-sensu/
Sheriff: https://github.com/dawanda/sheriff
Monit: http://mmonit.com/monit/
Bluepill: https://github.com/arya/bluepill
Runit: http://smarden.org/runit/
Munin: http://munin-monitoring.org/
Ganglia: http://ganglia.info/
Graphite: http://graphite.wikidot.com
GDash: https://github.com/ripienaar/gdash
Graphiti: https://github.com/paperlesspost/graphiti
Tasseo: https://github.com/obfuscurity/tasseo
Cube: http://square.github.com/cube/
Cubism: http://square.github.com/cubism/
NewRelic: http://newrelic.com
Scout: http://scoutapp.com
Server Density: http://serverdensity.com
Boundary: http://boundary.com
Librato Metrics: http://metrics.librato.com
Riemann: http://aphyr.github.com/riemann/
StatsD: https://github.com/etsy/statsd
Metriks: https://github.com/eric/metriks
Logstash: http://logstash.net/
Graylog: http://graylog2.org/
Loggly: http://loggly.com
Papertrail: https://papertrailapp.com/
Metriks Log Webhook: https://github.com/eric/metriks_log_webhook
Lograge: https://github.com/mattmatt/lograge

Further reading:
http://www.paperplanes.de/2011/1/5/the_virtues_of_monitoring.html
http://about.travis-ci.org/blog/2012-04-02-metrics-monitoring-infrastructure-oh-my/
http://pivotallabs.com/talks/139-metrics-metrics-everywhere
http://bitmonkey.net/post/18854033582/introducing-metriks
http://code.flickr.com/blog/2008/10/27/counting-timing/

Mathias Meyer

May 04, 2012

Transcript

  1. metrics,
    monitoring,
    logging
    mathias meyer, @roidrage
    http://paperplanes.de

  6. problem?

  7. no one noticed
    no one got alerted
    no automatic recovery

  8. it happened to me
    it happened to you

  9. devops shmevops

  10. your code, your responsibility

  11. what is your application doing right now?

  12. do you know when it fails?

  13. failure means customers lose trust

  14. failure means customers go elsewhere

  15. failure means you lose money

  16. application = providing value

  17. monitoring
    metrics
    logging

  18. monitoring

  19. is the application available?

  20. pingdom
    pagerduty
    nagios
    icinga
    sensu
    sheriff

  21. pingdom

  22. http://pingdom.com

  23. tcp/ip
    http(s)
    ping
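
The check types above all answer one yes/no question: can the service be reached? A minimal sketch of the TCP and HTTP variants in Ruby, using only the standard library. `port_open?` and `http_ok?` are hypothetical helper names, the two-second timeout is an arbitrary assumption, and none of this reflects how Pingdom is implemented.

```ruby
require "socket"
require "net/http"
require "uri"

# Hypothetical helper: true if a TCP connection to host:port can be opened
# within the timeout, false if it is refused, unreachable or too slow.
def port_open?(host, port, timeout: 2)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError
  false
end

# Hypothetical helper: true if a HEAD request to the URL returns a 2xx status.
# Any error (DNS, refused connection, timeout, TLS) counts as "not available".
def http_ok?(url, timeout: 2)
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port,
                  use_ssl: uri.scheme == "https",
                  open_timeout: timeout,
                  read_timeout: timeout) do |http|
    http.head(uri.request_uri).is_a?(Net::HTTPSuccess)
  end
rescue StandardError
  false
end

# Usage (placeholder host names):
#   port_open?("db.example.com", 5432)
#   http_ok?("https://example.com/")
```

A check like this only helps when it runs from outside the machine it checks, which is the point of an external service.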

  24. nagios

  25. nagios can check everything

  26. it's still terrible

  27. http://www.nagios.org/

  28. #monitoringsucks

  29. sensu
    http://www.sonian.com/cloud-monitoring-sensu/

  30. sheriff
    https://github.com/dawanda/sheriff

  31. monit
    runit
    bluepill
    god
    upstart

  32. is this service currently providing value?

  33. is this service consuming too many resources?

  34. monit

  35. check process unicorn
        with pidfile /var/run/unicorn/unicorn.pid
        start program = "/etc/init.d/unicorn start"
        stop program = "/etc/init.d/unicorn stop"
        if mem is greater than 300.0 MB for 1 cycles then restart
        if cpu is greater than 50% for 2 cycles then alert
        if cpu is greater than 80% for 3 cycles then restart
        group unicorn
    http://mmonit.com/monit/
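
The `for N cycles` clauses in the config above mean a limit has to be exceeded on several consecutive polls before monit acts, so a single spike is ignored. A rough Ruby sketch of that idea, not monit's actual implementation; `CycleRule` is a made-up name.

```ruby
# Hypothetical illustration of "if cpu is greater than 50% for 2 cycles".
# Triggers once a value has exceeded the limit on `cycles` consecutive polls.
class CycleRule
  def initialize(limit:, cycles:)
    @limit = limit
    @cycles = cycles
    @streak = 0
  end

  # Feed one sample per poll cycle; returns true when the rule triggers.
  def observe(value)
    @streak = value > @limit ? @streak + 1 : 0
    @streak >= @cycles
  end
end

cpu_alert = CycleRule.new(limit: 50, cycles: 2)
results = [60, 40, 55, 70].map { |cpu| cpu_alert.observe(cpu) }
# results is [false, false, false, true]: the lone spike to 60 resets,
# only 55 followed by 70 makes two cycles in a row.
```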

  36. bluepill

  37. Bluepill.application("unicorn") do |app|
        app.working_dir = "/var/www/app/current"
        app.process("unicorn") do |process|
          process.start_command = "/etc/init.d/unicorn start"
          # QUIT shuts unicorn down gracefully, USR2 re-executes the master
          process.stop_command = "kill -QUIT {{PID}}"
          process.restart_command = "kill -USR2 {{PID}}"
          process.stdout = process.stderr = "/var/www/app/current/log/unicorn.log"
          process.pid_file = "/var/run/unicorn/unicorn.pid"
          process.checks :mem_usage, :every => 10.seconds, :below => 300.megabytes, :times => [3, 5]
          process.start_grace_time = 10.seconds
          process.restart_grace_time = 10.seconds
          process.checks :flapping, :times => 2, :within => 30.seconds, :retry_in => 7.seconds
          # the unicorn workers are children of the master process
          process.monitor_children do |cp|
            cp.checks :mem_usage, :every => 10.seconds, :below => 400.megabytes, :times => [3, 5]
            process.checks :cpu_usage, :every => 10.seconds, :below => 50, :times => 5
            cp.stop_command = "kill -QUIT {{PID}}"
          end
        end
      end
    https://github.com/arya/bluepill
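
As far as the bluepill README describes it, `:times => [3, 5]` means a check has to fail in 3 of the last 5 samples before bluepill reacts. A sketch of such a sliding window in plain Ruby, under that assumption; `WindowRule` is a made-up name and this is not bluepill's code.

```ruby
# Hypothetical illustration of `:below => 300, :times => [3, 5]`.
# Triggers when at least `needed` of the last `window` samples
# were not below the limit.
class WindowRule
  def initialize(below:, needed:, window:)
    @below = below
    @needed = needed
    @window = window
    @history = []
  end

  # Feed one sample per check interval; returns true when the rule triggers.
  def observe(value)
    @history << (value >= @below)            # true means "limit violated"
    @history.shift if @history.size > @window
    @history.count(true) >= @needed
  end
end

mem = WindowRule.new(below: 300, needed: 3, window: 5)
fired = [310, 250, 320, 280, 305].map { |megabytes| mem.observe(megabytes) }
# fired is [false, false, false, false, true]: the third violation
# within five samples triggers the rule.
```

Compared with counting consecutive failures, a window also catches a process that keeps bouncing around its limit.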

  38. runit

  39. #!/bin/sh
    # exec replaces the shell, so runit supervises unicorn itself; -E sets RAILS_ENV
    cd /var/www/app/current || exit 1
    exec ./bin/unicorn_rails -c config/unicorn.rb -E production
    http://smarden.org/runit/

  40. metrics

  42. measurements
    historical data
    graphs

  43. how many customers are on my site?

  44. how many customers were on my site yesterday?

  45. how slow is paypal's api?

  46. how slow was paypal's api yesterday?

  47. how much memory is available on my servers?

  48. how much has memory usage grown over four weeks?

  49. number of open database connections
    number of redis commands
    number of 500 errors
    rate of HTTP requests
    number of HTTP connections
    median response time

  50. number of failed resque jobs
    number of twitter followers
    99th percentile github api response time
    95th percentile mysql query time
    deployments

  51. cpu usage
    incoming network traffic
    load average
    disk usage
    iops

  52. munin
    ganglia
    graphite
    scout
    server density
    librato metrics

  53. munin

  54. http://munin-monitoring.org/

  55. ganglia

  56. http://ganglia.info/

  57. #monitoringsucks

  58. #rrdtoolsucks

  60. access to single data points matters

  61. graphite

  62. modern graphing
    not using rrdtool
    extensible
    http://graphite.wikidot.com/

  63. graphite dashboards

  64. https://github.com/ripienaar/gdash

  65. https://github.com/paperlesspost/graphiti

  66. https://github.com/obfuscurity/tasseo

  67. cube & cubism

  68. http://square.github.com/cube/

  69. commercial tools

  70. newrelic
    http://newrelic.com

  71. scout
    http://scoutapp.com

  72. server density
    http://serverdensity.com

  73. boundary

  74. http://boundary.com

  75. librato metrics

  76. metrics as a service
    per-second resolution
    real-time updates

  77. http://metrics.librato.com

  81. collectd (honorable mention)
    http://collectd.org

  82. riemann (honorable mention)

  83. http://aphyr.github.com/riemann/

  84. track everything that moves

  86. adding metrics should be easy

  87. statsd
    https://github.com/etsy/statsd
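
StatsD keeps adding metrics easy by accepting plain-text datagrams over UDP in the form `name:value|type`. A minimal sketch without a client library; `statsd_packet` and `statsd_increment` are hypothetical helpers, and the host and port defaults are assumptions (8125 is StatsD's default port).

```ruby
require "socket"

# Hypothetical helper: formats one StatsD datagram.
# "c" is a counter, "ms" a timing in milliseconds.
def statsd_packet(name, value, type)
  "#{name}:#{value}|#{type}"
end

# Hypothetical helper: sends a counter increment of 1.
# UDP is fire-and-forget, so a missing daemon does not block the application.
def statsd_increment(name, host: "127.0.0.1", port: 8125)
  socket = UDPSocket.new
  socket.send(statsd_packet(name, 1, "c"), 0, host, port)
ensure
  socket&.close
end

# Usage:
#   statsd_increment("travis.github.requests")
#   statsd_packet("travis.github.response_time", 320, "ms")
```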

  88. metriks
    https://github.com/eric/metriks

  89. counters
    meters
    timers
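
What the three types measure can be sketched in plain Ruby. This is a simplified illustration, not Metriks' implementation; the class names and the injectable `clock` parameter are assumptions made for the example.

```ruby
# Counter: a number that goes up and down, e.g. repositories or open connections.
class Counter
  attr_reader :count

  def initialize
    @count = 0
  end

  def increment(by = 1)
    @count += by
  end

  def decrement(by = 1)
    @count -= by
  end
end

# Meter: counts events and reports how many happen per second.
class Meter
  attr_reader :count

  def initialize(clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @clock = clock
    @started = clock.call
    @count = 0
  end

  def mark(events = 1)
    @count += events
  end

  # Events per second since the meter was created.
  def mean_rate
    elapsed = (@clock.call - @started).to_f
    elapsed > 0 ? @count / elapsed : 0.0
  end
end

# Timer: records how long each block takes, in seconds.
class Timer
  attr_reader :durations

  def initialize
    @durations = []
  end

  # Runs the block, records its duration and returns the block's result.
  def time
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    @durations << Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  end
end
```

A counter answers "how many are there", a meter "how often does it happen", a timer "how long does it take".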

  90. Metriks.meter("travis.github.requests").mark

  91. Metriks.counter("travis.repositories").increment

  92. librato metrics
    log stream
    graphite
    proc title
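
Of the reporting targets above, Graphite is the easiest to talk to directly: its Carbon daemon accepts lines of `path value timestamp` over TCP, on port 2003 by default. A sketch of such a reporter, with `graphite_line` and `report_to_graphite` as hypothetical names that are not part of Metriks.

```ruby
require "socket"

# Hypothetical helper: one line of Graphite's plaintext protocol,
# "<metric path> <value> <unix timestamp>\n".
def graphite_line(path, value, time = Time.now)
  "#{path} #{value} #{time.to_i}\n"
end

# Hypothetical helper: writes a hash of metric values to a Carbon listener.
def report_to_graphite(metrics, host:, port: 2003, time: Time.now)
  TCPSocket.open(host, port) do |socket|
    metrics.each do |path, value|
      socket.write(graphite_line(path, value, time))
    end
  end
end

# Usage (placeholder host name):
#   report_to_graphite({ "travis.repositories" => 42 }, host: "graphite.example.com")
```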

  93. percentiles > averages
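
A small worked example shows why. The numbers are made up: 18 fast requests and 2 very slow ones. `percentile` uses the nearest-rank method, which is one of several common definitions.

```ruby
# Nearest-rank percentile: the smallest sample that has at least
# p percent of all samples at or below it.
def percentile(samples, p)
  sorted = samples.sort
  rank = (p * sorted.size / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

def mean(samples)
  samples.sum / samples.size.to_f
end

# Response times in milliseconds (invented data).
times = [20] * 18 + [2000] * 2

mean(times)            # => 218.0
percentile(times, 50)  # => 20
percentile(times, 95)  # => 2000
```

The average of 218 ms describes no request that actually happened. The median shows what most customers see, and the 95th percentile shows how bad it gets for the unlucky ones.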

  94. dashboards

  95. combine graphs

  96. put them up in your office

  97. visibility is important

  98. logging

  99. the papertrail

  100. #syslogsucks

  101. collect logs from everywhere

  102. index, aggregate, analyze

  103. grep, awk, sort
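
The same grep, awk and sort steps can be sketched in Ruby. The log lines and their key=value format are invented for the example, and `field` is a hypothetical helper; real logs will look different.

```ruby
# Hypothetical log lines in a key=value format.
LOG = <<~LINES
  method=GET path=/repositories status=200 duration=41.2
  method=GET path=/builds status=500 duration=903.5
  method=POST path=/builds status=200 duration=88.0
  method=GET path=/builds status=500 duration=1210.7
LINES

# "awk": pull a single field out of a line.
def field(line, key)
  line[/\b#{Regexp.escape(key)}=(\S+)/, 1]
end

# "grep": keep only the failing requests.
errors = LOG.lines.select { |line| field(line, "status") == "500" }

# "sort | uniq -c": count the failures per path.
errors_by_path = errors.map { |line| field(line, "path") }.tally

# "sort -n | tail -1": find the slowest request.
slowest = LOG.lines.max_by { |line| field(line, "duration").to_f }
```

This works for one file on one machine, which is where the approach stops scaling.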

  105. centralized logging

  106. syslog://
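
Syslog is a common transport for centralized logging because the wire format is tiny. Each message starts with a priority computed as `facility * 8 + severity`. A sketch of a classic BSD-style (RFC 3164) line, with `syslog_line` as a hypothetical helper and `app1` as a placeholder host name; only a subset of the facility and severity codes is listed.

```ruby
# Facility and severity codes as defined by the syslog RFCs (subset).
FACILITIES = { user: 1, daemon: 3, local0: 16 }.freeze
SEVERITIES = { error: 3, warning: 4, info: 6, debug: 7 }.freeze

# Hypothetical helper: builds "<PRI>timestamp host tag: message".
def syslog_line(facility, severity, tag, message, time: Time.now, host: "app1")
  priority = FACILITIES.fetch(facility) * 8 + SEVERITIES.fetch(severity)
  "<#{priority}>#{time.strftime('%b %e %H:%M:%S')} #{host} #{tag}: #{message}"
end

# Usage:
#   syslog_line(:local0, :info, "unicorn", "worker ready")
#   The priority for local0/info is 16 * 8 + 6 = 134.
```

A line like this is typically sent as a single UDP datagram to port 514 of the log host.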

  107. logstash
    http://logstash.net/

  108. log inputs
    process
    outputs
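
The three stages can be pictured as a small pipeline: read lines from inputs, turn each line into a structured event, hand the event to every output. The sketch below is a toy in plain Ruby to illustrate the flow. It is not Logstash's configuration language or API, and `Pipeline`, `add_status` and the sample log lines are all made up.

```ruby
require "json"
require "stringio"

# Toy pipeline: inputs -> filters -> outputs.
class Pipeline
  def initialize(inputs:, filters:, outputs:)
    @inputs = inputs     # hash of source name => IO-like object
    @filters = filters   # callables that take an event and return an event
    @outputs = outputs   # callables that receive the finished event
  end

  def run
    @inputs.each do |source, io|
      io.each_line do |line|
        event = { "source" => source, "message" => line.chomp }
        event = @filters.reduce(event) { |current, filter| filter.call(current) }
        @outputs.each { |output| output.call(event) }
      end
    end
  end
end

# Filter: extract the HTTP status from the message, if there is one.
add_status = lambda do |event|
  status = event["message"][/status=(\d{3})/, 1]
  status ? event.merge("status" => status.to_i) : event
end

collected = []
pipeline = Pipeline.new(
  inputs:  { "unicorn.log" => StringIO.new("GET /builds status=500\nworker ready\n") },
  filters: [add_status],
  outputs: [->(event) { collected << event.to_json }]
)
pipeline.run
```

After the run, `collected` holds one JSON document per log line, and the first one carries a numeric `status` field that can be indexed and queried.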

  109. graylog

  110. http://graylog2.org/

  111. loggly

  112. http://loggly.com

  113. papertrail

  114. https://papertrailapp.com/

  115. integrates with librato metrics

  117. bits and pieces

  118. travis metrics

  120. https://github.com/eric/metriks_log_webhook

  121. lograge

  122. sane rails logging

  124. https://github.com/mattmatt/lograge
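
Lograge condenses the many lines Rails logs per request into a single key=value line. The idea, sketched in plain Ruby: `request_line` is a hypothetical helper and not Lograge's API, and the field names are only illustrative.

```ruby
# Hypothetical helper: turns a hash of request data into one log line.
# Floats are rounded to two decimals to keep the line compact.
def request_line(data)
  data.map { |key, value|
    value = format("%.2f", value) if value.is_a?(Float)
    "#{key}=#{value}"
  }.join(" ")
end

request_line(method: "GET", path: "/builds", status: 200, duration: 58.334)
# => "method=GET path=/builds status=200 duration=58.33"
```

One line per request is easy to grep and easy for a log collector to parse into fields.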

  125. #monitoringsucksless

  126. own your monitoring

  127. own your metrics

  128. own your logging

  129. none of them is optional

  130. go forth and correlate

  131. http://www.paperplanes.de/2011/1/5/the_virtues_of_monitoring.html
    http://about.travis-ci.org/blog/2012-04-02-metrics-monitoring-infrastructure-oh-my/
    http://pivotallabs.com/talks/139-metrics-metrics-everywhere
    http://bitmonkey.net/post/18854033582/introducing-metriks
    http://code.flickr.com/blog/2008/10/27/counting-timing/

  132. we're not hiring ❤
