Berlin 2013 - Session - Mark McGranaghan

Monitorama
September 19, 2013

Transcript

  1. logs
     64.242.88.10 - [19/Sep/2013:10:27:39] "GET /users/7 HTTP/1.1" 200
     [notice] SQL (0.5ms) SELECT users
     Completed in 64ms (View: 52, DB: 10) | 200 OK [/users/7]
  2. Apache log parsers / analyzers
     Postgres log parsers / analyzers
     Redis log parsers / analyzers
     Heroku log parsers / analyzers
     ...
  3. events
     { :time "2013-09-19 10:27:39" :action "users.get" :user_id 7
       :method "GET" :path "/users/7" :ip "64.242.88.10" ... }
  4. events
     { :time "2013-09-19 10:27:39" :action "users.get" :user_id 7
       :method "GET" :path "/users/7" :ip "64.242.88.10" ... }
  5. Web apps ---+              +--> file
                 |              |
     /var/log ---+--> Fluentd --+--> mail
                 |              |
     Apache -----+              +--> ...

     http://fluentd.org
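As a concrete companion to slide 5's diagram, a minimal classic-style Fluentd configuration that tails one Apache log and routes it to a file sink; the paths, tag, and format here are illustrative, not taken from the deck:

    <source>
      type tail
      format apache
      path /var/log/apache2/access_log
      tag apache.access
    </source>

    <match apache.access>
      type file
      path /var/log/fluent/apache
    </match>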
  6. something happened at some time: event
     events as data, not text
     general-purpose event processing applicable to all information
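A minimal Clojure sketch of both halves of this idea: emitting the slide 3/4 event map as plain data (emit-event! is a hypothetical sink that just prints one line per event), then processing events with an ordinary general-purpose function:

    (defn emit-event!
      "Hypothetical sink: print each event map as one line of data."
      [event]
      (prn (assoc event :time (str (java.util.Date.)))))

    (emit-event! {:action "users.get" :user_id 7 :method "GET"
                  :path "/users/7" :ip "64.242.88.10"})

    ;; events are data, so generic functions apply: tally events by :action
    (defn count-by-action [events]
      (frequencies (map :action events)))

    (count-by-action [{:action "users.get"}
                      {:action "users.get"}
                      {:action "users.post"}])
    ;; => {"users.get" 2, "users.post" 1}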
  7. every alert has a time series
     alert time series come from the metrics stack
     alert source data stored all the time
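Read as a sketch, slide 7 suggests deriving each alert's time series from the stored event data; here a hypothetical errors-per-minute series is checked against an illustrative threshold (the :level and :minute keys are assumptions, not from the deck):

    (defn errors-per-minute
      "Derive a time series (minute -> error count) from event maps."
      [events]
      (->> events
           (filter #(= "error" (:level %)))
           (map :minute)
           frequencies))

    (defn breaches
      "Return the [minute count] points that exceed the alert threshold."
      [series threshold]
      (filter (fn [[_ n]] (> n threshold)) series))

    (breaches (errors-per-minute [{:level "error" :minute 0}
                                  {:level "error" :minute 0}
                                  {:level "info"  :minute 0}])
              1)
    ;; => ([0 2])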
  8. quality of service (QoS) monitoring
     users running through flows
     asserting no/minimal errors, ensuring adequate performance
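One way to picture QoS monitoring as code, assuming a hypothetical run-flow! that drives a single user flow end to end (the 500 ms bound is illustrative):

    (defn run-flow!
      "Hypothetical: exercise a signup/checkout-style flow, report the result."
      []
      (let [start  (System/currentTimeMillis)
            status 200]   ; stand-in for the flow's real outcome
        {:error? (not= 200 status)
         :ms     (- (System/currentTimeMillis) start)}))

    (defn qos-ok?
      "Assert no errors and adequate performance."
      [{:keys [error? ms]}]
      (and (not error?) (< ms 500)))

    (qos-ok? (run-flow!))
    ;; => true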
  9. exceptions are only exceptional at small scale
     “1 in a billion” @ 100k op/s ≃ 10 times a day
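The arithmetic on slide 9, worked out: 100,000 op/s × 86,400 s/day ≈ 8.64 billion operations per day, so a one-in-a-billion case occurs about nine times a day, i.e. on the order of 10:

    (/ (* 100000 86400) 1000000000.0)
    ;; => 8.64   ; "1 in a billion" occurrences per day at 100k op/s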
  10. locality? in general:
      not local in space - service-level errors etc.
      not local in time - defined post hoc!
  11. what even is an error? you don’t know at dev-time
      when it’s just a result... emit event for later analysis
      (see the sketch after slide 12)
  12. treat “exceptions” / results symmetrically to the greatest extent possible
      expect to define errors at analysis-time, not just dev-time or run-time,
      based on results
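A sketch of slides 11 and 12 taken together: record every result symmetrically at run-time, then let each analysis define "error" with whatever predicate the question at hand needs (the predicates and data below are illustrative):

    ;; run-time: results recorded symmetrically, success and failure alike
    (def results
      [{:action "users.get" :status 200 :ms 64}
       {:action "users.get" :status 503 :ms 3012}
       {:action "users.get" :status 200 :ms 1450}])

    ;; analysis-time: "error" is whatever predicate today's question needs
    (defn error-by-status?  [r] (>= (:status r) 500))
    (defn error-by-latency? [r] (> (:ms r) 1000))

    (count (filter error-by-status?  results))  ;; => 1
    (count (filter error-by-latency? results))  ;; => 2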
  13. logs / events / metrics
      alert criteria / metrics
      integration testing / QoS monitoring
      errors / results