Fewer Better Systems
Monitorama EU 2013
Mark McGranaghan
@mmcgrana
Fewer Better Systems
Unix
everything is a file
/var/db /usr/lib /dev/tcp /usr/bin /etc
/dev/tcp
problem problem problem
everything is a ...
failover
primary secondary
primary secondary
primary secondary?
https://twitter.com/b6n/status/161899319459463168
the best systems are used constantly
logs / events
alert criteria / metrics
integration testing / QoS monitoring
errors / results
logs / events
logs: stream of unstructured information
events: stream of structured information
logs
64.242.88.10 - [19/Sep/2013:10:27:39] "GET /users/7 HTTP/1.1" 200
[notice] SQL (0.5ms) SELECT users
Completed in 64ms (View: 52, DB: 10) | 200 OK [/users/7]
invent ways to encode data in text...
data "data" | data - data [data] (data)
meanwhile...
Apache log parsers / analyzers
Postgres log parsers / analyzers
Redis log parsers / analyzers
Heroku log parsers / analyzers
...
events
{:time "2013-09-19 10:27:39"
 :action "users.get"
 :user_id 7
 :method "GET"
 :path "/users/7"
 :ip "64.242.88.10"
 ...}
64.242.88.10 - [19/Sep/2013:10:27:39] "GET /users/7 HTTP/1.1" 200
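The access-log line above carries the same fields as the event, but getting them back out takes a parser written for this one format. A minimal sketch of such a parser; the pattern and the `fields` hash are illustrative assumptions, not taken from any real log analyzer:

```ruby
# The access-log line from the slide.
line = '64.242.88.10 - [19/Sep/2013:10:27:39] "GET /users/7 HTTP/1.1" 200'

# Hypothetical pattern that only fits this exact layout; every other
# log format (Postgres, Redis, ...) would need its own.
pattern = /\A(?<ip>\S+) - \[(?<time>[^\]]+)\] "(?<method>\S+) (?<path>\S+) (?<protocol>[^"]+)" (?<status>\d{3})\z/

m = pattern.match(line)
fields = {
  ip: m[:ip],
  time: m[:time],
  method: m[:method],
  path: m[:path],
  status: m[:status].to_i
}
```

Any change to the text layout (an extra field, a different quoting rule) silently breaks the pattern, which is the cost the structured event avoids.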
encode data as data, uniformly
analyze with general tools
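As a rough illustration of "data as data", the sketch below encodes events as one JSON object per line and then analyzes them with ordinary collection operations. The sample events, and the `status` and `elapsed_ms` fields, are assumptions made up for the example; JSON is just one possible uniform encoding.

```ruby
require "json"

# Hypothetical events, shaped like the one on the earlier slide.
events = [
  { time: "2013-09-19 10:27:39", action: "users.get",    path: "/users/7", status: 200, elapsed_ms: 64 },
  { time: "2013-09-19 10:27:40", action: "users.get",    path: "/users/9", status: 200, elapsed_ms: 210 },
  { time: "2013-09-19 10:27:41", action: "users.update", path: "/users/7", status: 200, elapsed_ms: 35 }
]

# Encode uniformly: one JSON object per line, regardless of the source.
lines = events.map { |e| JSON.generate(e) }

# Analyze with general tools: decoding is identical for every source,
# and the analysis is plain grouping and filtering.
decoded = lines.map { |l| JSON.parse(l, symbolize_names: true) }
counts  = decoded.group_by { |e| e[:action] }.transform_values(&:size)
slow    = decoded.select { |e| e[:elapsed_ms] > 100 }.map { |e| e[:path] }
```

Here `counts` and `slow` need no knowledge of which application produced the events, only of the field names.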
open source
http://fluentd.org
{:time "2013-09-19 10:27:39",
 :tag "web.request",
 :record {:ip "64.242.88.10",
          :path "/users/7",
          ...}}
Web apps ---+                 +--> file
            |                 |
            +-->           ---+
/var/log ------>  Fluentd  ------> mail
            +-->           ---+
            |                 |
Apache  ----+                 +--> Fluentd
http://fluentd.org
something happened at some time: event
events as data, not text
general-purpose event processing
applicable to all information
alert criteria / metrics
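alert criteria: measure, alert if out of bounds
metrics: measure, store for analysis
[diagram: two separate pipelines: measure, alert | measure, store]
[diagram: same two pipelines, labeled "steady-state"]
[diagram: same two pipelines, labeled "alert!"]
[diagram: one pipeline: measure, alert, store]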
production
every alert has time series
alert time series come from metrics stack
alert source data stored all the time
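One way to read the slide above: measurements are always stored, and an alert is just a query over the stored series. The sketch below assumes a toy in-memory store; `MetricStore`, `out_of_bounds?`, and the metric name are hypothetical and stand in for whatever metrics stack is actually in use.

```ruby
# Hypothetical stand-in for a metrics store: every measurement is kept,
# whether or not it ever triggers an alert.
class MetricStore
  def initialize
    @series = Hash.new { |h, k| h[k] = [] }
  end

  def record(name, time, value)
    @series[name] << [time, value]
  end

  # Points recorded at or after `since`.
  def series(name, since)
    @series[name].select { |t, _| t >= since }
  end
end

# The alert criterion reads from the stored series instead of
# running its own separate measurement.
def out_of_bounds?(store, name, since:, max:)
  points = store.series(name, since)
  return false if points.empty?
  mean = points.sum { |_, v| v } / points.size.to_f
  mean > max
end

store = MetricStore.new
[[0, 50], [10, 60], [20, 400], [30, 450]].each do |t, v|
  store.record("web.latency_ms", t, v)
end

whole_window  = out_of_bounds?(store, "web.latency_ms", since: 0,  max: 300)
recent_window = out_of_bounds?(store, "web.latency_ms", since: 20, max: 300)
```

Because the alert's source data is stored all the time, the series behind any alert is available for inspection after the fact.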
integration testing / QoS monitoring
https://plus.google.com/112678702228711889851/posts/eVeouesvaVX
integration testing: is it good for production?
QoS monitoring: is it good in production?
integration testing
run through common user flows,
assert no errors,
ensure performance adequate
quality of service (QoS) monitoring
users running through flows
asserting no/minimal errors,
ensuring performance adequate
[diagram labels: integration, staging, prod, user load, QoS monitoring]
[diagram labels: staging, prod, user load, load gen, QoS monitoring]
invest in load generation/replay
invest in granular QoS monitoring
applicable to all environments, all the time
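A sketch of how one check could serve both roles: the same flow runner is pointed at any environment and reports errors and latency, and the same health criterion is applied to the report. `run_flow`, `healthy?`, the step names, and the stand-in clients are all hypothetical; real clients would issue requests to staging or prod.

```ruby
# Hypothetical flow runner: executes each step against the given client
# and records whether it succeeded and how long it took.
def run_flow(client, steps)
  results = steps.map do |step|
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    ok = begin
      client.call(step)
    rescue StandardError
      false
    end
    elapsed = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000
    { step: step, ok: ok, ms: elapsed }
  end
  { errors: results.count { |r| !r[:ok] }, max_ms: results.map { |r| r[:ms] }.max }
end

# One criterion for both integration testing and QoS monitoring.
def healthy?(report, max_errors:, max_ms:)
  report[:errors] <= max_errors && report[:max_ms] <= max_ms
end

# Stand-in clients for two environments.
staging = ->(_step) { true }
prod    = ->(step)  { step != "checkout" }

steps = %w[signup login checkout]
staging_report = run_flow(staging, steps)
prod_report    = run_flow(prod, steps)
```

Run against staging before a deploy, this acts as an integration test; run continuously against prod, the same code is QoS monitoring.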
errors / results
raise(“it’s tricky”)
errors: something happened, it was bad
results: something happened, it was OK
begin
  res = call_fn(arg)
  # handle result
rescue => err
  # handle error
end
exceptions are only exceptional at small scale
“1 in a billion” @ 100k op/s ≃ 10 times a day
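The arithmetic behind that estimate, worked through as a quick check (variable names are only for illustration):

```ruby
ops_per_second  = 100_000
seconds_per_day = 24 * 60 * 60        # 86_400
failure_rate    = 1.0 / 1_000_000_000 # "1 in a billion"

ops_per_day      = ops_per_second * seconds_per_day # 8_640_000_000
expected_per_day = ops_per_day * failure_rate       # about 8.64
```

So the slide's "10 times a day" is a rounding of roughly 8.6 expected occurrences per day.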
http://golang.org
http://golang.org
res, err := RunOp(arg)
if err != nil {
    // handle error
}
// handle result
begin
  res = run_op(arg)
  # handle result
rescue => err
  # handle error
end
locality?
in general:
not local in space - service-level errors etc.
not local in time - defined post hoc!
what even is an error?
you don’t know at dev-time
when it’s just a result...
emit event for later analysis
treat “exceptions” / results symmetrically
to the greatest extent possible
expect to define errors at analysis-time,
not just dev-time or run-time,
based on results
logs / events / metrics
alert criteria / metrics
integration testing / QoS monitoring
errors / results
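A minimal sketch of the symmetric treatment, assuming a hypothetical `run_as_event` helper: every operation yields an event whether it returned or raised, and what counts as an "error" is a predicate applied afterwards. The status codes and thresholds are made up for the example.

```ruby
# Hypothetical helper: run a block and always return an event hash,
# whether the block returned normally or raised.
def run_as_event(action)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  outcome =
    begin
      { raised: false, value: yield }
    rescue StandardError => err
      { raised: true, exception: err.class.name, message: err.message }
    end
  elapsed_ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000
  { action: action, elapsed_ms: elapsed_ms }.merge(outcome)
end

events = [
  run_as_event("users.get") { { status: 200 } },
  run_as_event("users.get") { { status: 404 } },
  run_as_event("users.get") { raise IOError, "connection reset" }
]

# "Error" is defined at analysis time and can be redefined later
# without touching the code that emitted the events.
lenient = ->(e) { e[:raised] || e[:value][:status] >= 500 }
strict  = ->(e) { e[:raised] || e[:value][:status] >= 400 }

lenient_errors = events.count(&lenient)
strict_errors  = events.count(&strict)
```

The same three events give one error under the lenient definition and two under the strict one; the emitting code did not have to decide.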
a challenge