Slide 1

Slide 1 text

metrics, monitoring, logging
mathias meyer, @roidrage
http://paperplanes.de

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

problem?

Slide 7

Slide 7 text

no one noticed
no one got alerted
no automatic recovery

Slide 8

Slide 8 text

it happened to me
it happened to you

Slide 9

Slide 9 text

devops shmevops

Slide 10

Slide 10 text

your code, your responsibility

Slide 11

Slide 11 text

what is your application doing right now?

Slide 12

Slide 12 text

do you know when it fails?

Slide 13

Slide 13 text

failure means customers lose trust

Slide 14

Slide 14 text

failure means customers go elsewhere

Slide 15

Slide 15 text

failure means you lose money

Slide 16

Slide 16 text

application = providing value

Slide 17

Slide 17 text

monitoring metrics logging

Slide 18

Slide 18 text

monitoring

Slide 19

Slide 19 text

is the application available?

Slide 20

Slide 20 text

pingdom pagerduty nagios icinga sensu sheriff

Slide 21

Slide 21 text

pingdom

Slide 22

Slide 22 text

http://pingdom.com

Slide 23

Slide 23 text

tcp/ip http(s) ping
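The simplest of these checks, a plain TCP connect, can be sketched in a few lines of Ruby. This is only an illustration of the idea, not how pingdom works internally; the method name `port_open?` and the two-second timeout are assumptions made for the example.

```ruby
require "socket"

# Returns true if a TCP connection to host:port can be opened within
# `timeout` seconds. This only shows that something accepts connections
# on the port, not that the application behind it is healthy.
def port_open?(host, port, timeout: 2)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError
  # Connection refused, timed out, host unreachable, or name lookup failed.
  false
end
```

Called as `port_open?("127.0.0.1", 80)`, it answers the "is it reachable?" question and nothing more; an HTTP(S) check that inspects the response status would sit one level above it.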

Slide 24

Slide 24 text

nagios

Slide 25

Slide 25 text

nagios can check everything

Slide 26

Slide 26 text

it's still terrible

Slide 27

Slide 27 text

http://www.nagios.org/

Slide 28

Slide 28 text

#monitoringsucks

Slide 29

Slide 29 text

sensu http://www.sonian.com/cloud-monitoring-sensu/

Slide 30

Slide 30 text

sheriff https://github.com/dawanda/sheriff

Slide 31

Slide 31 text

monit runit bluepill god upstart

Slide 32

Slide 32 text

is this service currently providing value?

Slide 33

Slide 33 text

is this service consuming too many resources?

Slide 34

Slide 34 text

monit

Slide 35

Slide 35 text

check process unicorn with pidfile /var/run/unicorn/unicorn.pid
  start program = "/etc/init.d/unicorn start"
  stop program = "/etc/init.d/unicorn stop"
  if mem is greater than 300.0 MB for 1 cycles then restart
  if cpu is greater than 50% for 2 cycles then alert
  if cpu is greater than 80% for 3 cycles then restart
  group unicorn

http://mmonit.com/monit/
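The "for N cycles" part of these rules means the threshold has to be exceeded on N consecutive checks before anything happens, which filters out short spikes. A minimal Ruby sketch of that idea follows; the class name `CycleRule` and the sample readings are hypothetical, and this is not monit's implementation.

```ruby
# Tracks how many consecutive check cycles a reading has exceeded a limit,
# mirroring rules like "if cpu is greater than 50% for 2 cycles then alert".
class CycleRule
  def initialize(limit:, cycles:)
    @limit = limit
    @cycles = cycles
    @streak = 0
  end

  # Feed one reading per cycle. Returns true once the limit has been
  # exceeded for the configured number of consecutive cycles; any reading
  # at or below the limit resets the streak.
  def triggered?(value)
    @streak = value > @limit ? @streak + 1 : 0
    @streak >= @cycles
  end
end

rule = CycleRule.new(limit: 50, cycles: 2)
results = [60, 40, 55, 70].map { |cpu| rule.triggered?(cpu) }
# results => [false, false, false, true]
```

The single spike to 60 never fires; only the two high readings in a row at the end do.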

Slide 36

Slide 36 text

bluepill

Slide 37

Slide 37 text

Bluepill.application("unicorn") do |app|
  app.working_dir = "/var/www/app/current"
  app.process("unicorn") do |process|
    process.start_command = "/etc/init.d/unicorn start"
    process.stop_command = "kill -QUIT {{PID}}"
    process.restart_command = "kill -USR2 {{PID}}"
    process.stdout = process.stderr = "/var/www/app/current/log/unicorn.log"
    process.pid_file = "/var/run/unicorn/unicorn.pid"
    process.checks :mem_usage, :every => 10.seconds, :below => 300.megabytes, :times => [3, 5]
    process.start_grace_time = 10.seconds
    process.restart_grace_time = 10.seconds
    process.checks :flapping, :times => 2, :within => 30.seconds, :retry_in => 7.seconds
    process.monitor_children do |cp|
      cp.checks :mem_usage, :every => 10, :below => 400.megabytes, :times => [3, 5]
      process.checks :cpu_usage, :every => 10.seconds, :below => 50, :times => 5
      cp.stop_command = "kill -QUIT {{PID}}"
    end
  end
end

https://github.com/arya/bluepill

Slide 38

Slide 38 text

runit

Slide 39

Slide 39 text

#!/bin/sh
cd /var/www/app/current
# exec replaces the shell so runit supervises unicorn itself
exec ./bin/unicorn_rails -c config/unicorn.rb -E production

http://smarden.org/runit/

Slide 40

Slide 40 text

metrics

Slide 41

Slide 41 text

No content

Slide 42

Slide 42 text

measurements
historical data
graphs

Slide 43

Slide 43 text

how many customers are on my site?

Slide 44

Slide 44 text

how many customers were on my site yesterday?

Slide 45

Slide 45 text

how slow is paypal's api?

Slide 46

Slide 46 text

how slow was paypal's api yesterday?

Slide 47

Slide 47 text

how much memory is available on my servers?

Slide 48

Slide 48 text

how much has memory usage grown over four weeks?

Slide 49

Slide 49 text

number of open database connections
number of redis commands
number of 500 errors
rate of HTTP requests
number of HTTP connections
median response time

Slide 50

Slide 50 text

number of failed resque jobs
number of twitter followers
99th percentile github api response time
95th percentile mysql query time
deployments

Slide 51

Slide 51 text

cpu usage
incoming network traffic
load average
disk usage
iops

Slide 52

Slide 52 text

munin
ganglia
graphite
scout
server density
librato metrics

Slide 53

Slide 53 text

munin

Slide 54

Slide 54 text

http://munin-monitoring.org/

Slide 55

Slide 55 text

ganglia

Slide 56

Slide 56 text

http://ganglia.info/

Slide 57

Slide 57 text

#monitoringsucks

Slide 58

Slide 58 text

#rrdtoolsucks

Slide 59

Slide 59 text

No content

Slide 60

Slide 60 text

access to single data points matters

Slide 61

Slide 61 text

graphite

Slide 62

Slide 62 text

modern graphing
not using rrdtool
extensible
http://graphite.wikidot.com/

Slide 63

Slide 63 text

graphite dashboards

Slide 64

Slide 64 text

https://github.com/ripienaar/gdash

Slide 65

Slide 65 text

https://github.com/paperlesspost/graphiti

Slide 66

Slide 66 text

https://github.com/obfuscurity/tasseo

Slide 67

Slide 67 text

cube & cubism

Slide 68

Slide 68 text

http://square.github.com/cube/

Slide 69

Slide 69 text

commercial tools

Slide 70

Slide 70 text

newrelic http://newrelic.com

Slide 71

Slide 71 text

scout http://scoutapp.com

Slide 72

Slide 72 text

server density http://serverdensity.com

Slide 73

Slide 73 text

boundary

Slide 74

Slide 74 text

http://boundary.com

Slide 75

Slide 75 text

librato metrics

Slide 76

Slide 76 text

metrics as a service
resolutions to the second
real-time updates

Slide 77

Slide 77 text

http://metrics.librato.com

Slide 78

Slide 78 text

No content

Slide 79

Slide 79 text

No content

Slide 80

Slide 80 text

No content

Slide 81

Slide 81 text

collectd (honorary mention) http://collectd.org

Slide 82

Slide 82 text

riemann (honorary mention)

Slide 83

Slide 83 text

http://aphyr.github.com/riemann/

Slide 84

Slide 84 text

track everything that moves

Slide 85

Slide 85 text

No content

Slide 86

Slide 86 text

adding metrics should be easy

Slide 87

Slide 87 text

statsd https://github.com/etsy/statsd
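Part of what makes statsd easy to adopt is its wire format: one short text line per measurement, sent over UDP, so a missing collector never blocks the application. The sketch below shows that format in plain Ruby. The helper names `statsd_line` and `statsd_send` are made up for this example, and the host and port default to a collector assumed to be listening locally on statsd's usual port, 8125.

```ruby
require "socket"

# Formats one measurement in the statsd line protocol:
# "<name>:<value>|<type>", where type is "c" (counter), "ms" (timer)
# or "g" (gauge).
def statsd_line(name, value, type)
  "#{name}:#{value}|#{type}"
end

# Fire-and-forget delivery over UDP. Host and port are assumptions for
# this sketch; point them at wherever the collector actually runs.
def statsd_send(line, host: "127.0.0.1", port: 8125)
  socket = UDPSocket.new
  socket.send(line, 0, host, port)
ensure
  socket&.close
end

statsd_send(statsd_line("travis.github.requests", 1, "c"))
```

In practice a client library wraps this up, but knowing the format makes it easy to emit metrics from shell scripts or other languages as well.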

Slide 88

Slide 88 text

metriks https://github.com/eric/metriks

Slide 89

Slide 89 text

counters meters timers

Slide 90

Slide 90 text

Metriks.meter("travis.github.requests").mark

Slide 91

Slide 91 text

Metriks.counter("travis.repositories").increment
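To make the metric types less abstract, here is a rough sketch of what a counter and a timer keep track of. `SimpleCounter` and `SimpleTimer` are hypothetical stand-ins written for this example, not the metriks classes, and a meter (a rate of events over time) is left out for brevity.

```ruby
# A counter is a number that goes up (or down) as events happen.
class SimpleCounter
  attr_reader :count

  def initialize
    @count = 0
    @lock = Mutex.new
  end

  def increment(by = 1)
    @lock.synchronize { @count += by }
  end
end

# A timer records how long each run of a block of code took.
class SimpleTimer
  attr_reader :durations

  def initialize
    @durations = []
    @lock = Mutex.new
  end

  # Runs the block, records the elapsed time in seconds (even if the
  # block raises), and returns the block's own result.
  def time
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    @lock.synchronize { @durations << elapsed }
  end
end

repositories = SimpleCounter.new
repositories.increment

github_api = SimpleTimer.new
github_api.time { sleep 0.01 } # stand-in for a real API call
```

A real library adds the parts that matter in production: rates, histograms, and reporters that ship the numbers somewhere.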

Slide 92

Slide 92 text

librato metrics
log stream
graphite
proc title

Slide 93

Slide 93 text

percentiles > averages
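A small worked example shows why. One slow request drags the average far away from what most users experience, while percentiles keep the typical case and the outlier apart. The response times below are invented for illustration, and the nearest-rank method used here is just one of several common percentile definitions.

```ruby
# Nearest-rank percentile: the smallest sample such that at least `pct`
# percent of all samples are less than or equal to it.
def percentile(samples, pct)
  sorted = samples.sort
  rank = (pct * sorted.size / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

def mean(samples)
  samples.sum.to_f / samples.size
end

# Hypothetical response times in milliseconds: nine fast, one very slow.
times = [12, 14, 15, 15, 16, 17, 18, 20, 22, 2500]

mean(times)            # => 264.9
percentile(times, 50)  # => 16
percentile(times, 99)  # => 2500
```

The mean of 264.9 ms describes none of the ten requests; the median says most were fast, and the 99th percentile exposes the one that was not.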

Slide 94

Slide 94 text

dashboards

Slide 95

Slide 95 text

combine graphs

Slide 96

Slide 96 text

put them up in your office

Slide 97

Slide 97 text

visibility is important

Slide 98

Slide 98 text

logging

Slide 99

Slide 99 text

the papertrail

Slide 100

Slide 100 text

#syslogsucks

Slide 101

Slide 101 text

collect logs from everywhere

Slide 102

Slide 102 text

index, aggregate, analyze

Slide 103

Slide 103 text

grep, awk, sort
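The same kind of ad-hoc aggregation can be sketched in Ruby. The example counts responses per HTTP status, roughly what a grep/awk/sort pipeline would produce. The log lines and the `status=NNN` field are assumptions made for this illustration.

```ruby
# Counts log lines per HTTP status and returns [status, count] pairs,
# most frequent first (ties broken by status).
def count_by_status(lines)
  counts = Hash.new(0)
  lines.each do |line|
    status = line[/\bstatus=(\d{3})\b/, 1]
    counts[status] += 1 if status
  end
  counts.sort_by { |status, count| [-count, status] }
end

# Hypothetical log lines, for illustration only.
lines = [
  "method=GET path=/ status=200 duration=12.10",
  "method=GET path=/jobs status=200 duration=48.02",
  "method=POST path=/builds status=500 duration=310.77",
  "method=GET path=/missing status=404 duration=3.41",
  "method=GET path=/ status=200 duration=11.95"
]

count_by_status(lines)
# => [["200", 3], ["404", 1], ["500", 1]]
```

This works on one file on one machine, which is where it stops scaling and centralized tooling starts to pay off.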

Slide 104

Slide 104 text

No content

Slide 105

Slide 105 text

centralized logging

Slide 106

Slide 106 text

syslog://

Slide 107

Slide 107 text

logstash http://logstash.net/

Slide 108

Slide 108 text

log inputs process outputs

Slide 109

Slide 109 text

graylog

Slide 110

Slide 110 text

http://graylog2.org/

Slide 111

Slide 111 text

loggly

Slide 112

Slide 112 text

http://loggly.com

Slide 113

Slide 113 text

papertrail

Slide 114

Slide 114 text

https://papertrailapp.com/

Slide 115

Slide 115 text

integrates with librato metrics

Slide 116

Slide 116 text

No content

Slide 117

Slide 117 text

bits and pieces

Slide 118

Slide 118 text

travis metrics

Slide 119

Slide 119 text

No content

Slide 120

Slide 120 text

https://github.com/eric/metriks_log_webhook

Slide 121

Slide 121 text

lograge

Slide 122

Slide 122 text

sane rails logging
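The core idea is one line per request, made of key=value pairs that are easy to grep and easy for log tools to parse. A minimal sketch of that formatting in plain Ruby follows; the method name `key_value_line` and the request data are hypothetical, and this is not lograge's own code.

```ruby
# Renders a hash of request attributes as a single "key=value" line.
# Floats are rounded to two decimals to keep lines compact.
def key_value_line(event)
  event.map do |key, value|
    value = format("%.2f", value) if value.is_a?(Float)
    "#{key}=#{value}"
  end.join(" ")
end

# Hypothetical request data, for illustration only.
event = { method: "GET", path: "/jobs/1.json", status: 200, duration: 58.3333 }

key_value_line(event)
# => "method=GET path=/jobs/1.json status=200 duration=58.33"
```

Compared with the default multi-line Rails request log, a line like this can be indexed and aggregated without any custom parsing.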

Slide 123

Slide 123 text

No content

Slide 124

Slide 124 text

https://github.com/mattmatt/lograge

Slide 125

Slide 125 text

#monitoringsucksless

Slide 126

Slide 126 text

own your monitoring

Slide 127

Slide 127 text

own your metrics

Slide 128

Slide 128 text

own your logging

Slide 129

Slide 129 text

none of them is optional

Slide 130

Slide 130 text

go forth and correlate

Slide 131

Slide 131 text

http://www.paperplanes.de/2011/1/5/the_virtues_of_monitoring.html
http://about.travis-ci.org/blog/2012-04-02-metrics-monitoring-infrastructure-oh-my/
http://pivotallabs.com/talks/139-metrics-metrics-everywhere
http://bitmonkey.net/post/18854033582/introducing-metriks
http://code.flickr.com/blog/2008/10/27/counting-timing/

Slide 132

Slide 132 text

we're not hiring ❤