Berlin 2013 - Session - Mark McGranaghan

Fewer Better Systems
Monitorama EU 2013
Mark McGranaghan

@mmcgrana

Fewer Better Systems

everything is a file

/var/db /usr/lib /dev/tcp /usr/bin /etc

/dev/tcp

problem problem problem

everything is a ...

failover

primary secondary

primary secondary?

https://twitter.com/b6n/status/161899319459463168

the best systems are used constantly

everything is a ...

logs / events
alert criteria / metrics
integration testing / QoS monitoring
errors / results

logs / events

logs: stream of unstructured information
events: stream of structured information

logs
64.242.88.10 - [19/Sep/2013:10:27:39] "GET /users/7 HTTP/1.1" 200
[notice] SQL (0.5ms) SELECT users
Completed in 64ms (View: 52, DB: 10) | 200 OK [/users/7]

invent ways to encode data in text...

data "data" | data <data> - data [data] (data)

meanwhile...

Apache log parsers / analyzers
Postgres log parsers / analyzers
Redis log parsers / analyzers
Heroku log parsers / analyzers
...

everything is a ...

events
{ :time "2013-09-19 10:27:39"
  :action "users.get"
  :user_id 7
  :method "GET"
  :path "/users/7"
  :ip "64.242.88.10"
  ... }

64.242.88.10 - [19/Sep/2013:10:27:39] "GET /users/7 HTTP/1.1" 200

events
{ :time "2013-09-19 10:27:39"
  :action "users.get"
  :user_id 7
  :method "GET"
  :path "/users/7"
  :ip "64.242.88.10"
  ... }

encode data as data, uniformly

analyze with general tools
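
A minimal sketch of "encode data as data, then analyze with general tools", assuming JSON lines as the encoding. The slides use a map literal and do not prescribe a format; the `emit` helper and the second sample event are hypothetical.

```ruby
require "json"
require "stringio"

# Hypothetical helper: write one event as a single JSON line.
def emit(io, event)
  io.puts(JSON.generate(event))
end

# First event is modeled on the slide; the second is made up for illustration.
events = [
  { time: "2013-09-19 10:27:39", action: "users.get", user_id: 7,
    method: "GET", path: "/users/7", ip: "64.242.88.10" },
  { time: "2013-09-19 10:27:40", action: "users.get", user_id: 8,
    method: "GET", path: "/users/8", ip: "64.242.88.10" }
]

# An in-memory stream stands in for a real event pipeline.
stream = StringIO.new
events.each { |event| emit(stream, event) }

# General-purpose analysis: no format-specific parser, just JSON plus group/count.
parsed = stream.string.each_line.map { |line| JSON.parse(line) }
counts_by_action = parsed.group_by { |e| e["action"] }.transform_values(&:size)
```

The same two analysis lines would work for events from any producer that emits this shape, which is the contrast with one parser per log format.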

open source

http://fluentd.org

{ :time "2013-09-19 10:27:39",
  :tag "web.request",
  :record { :ip "64.242.88.10",
            :path "/users/7",
            ... } }

Web apps ---+                 +--> file
            |                 |
/var/log ------> Fluentd ------> mail
            |                 |
Apache -----+                 +--> Fluentd

http://fluentd.org

problem problem problem

something happened at some time: event
events as data, not text
general-purpose event processing
applicable to all information

everything is a ...

alert criteria / metrics

alert criteria: measure, alert if out of bounds
metrics: measure, store for analysis

measure measure alert store

measure measure alert store steady-state

measure measure alert store alert!

measure measure alert store steady-state

measure alert store

production

every alert has time series
alert time series come from metrics stack
alert source data stored all the time
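
One way to read these points, sketched in Ruby: measure once, always store, and express the alert as a query over the stored series. `MetricStore`, its methods, and the sample values are hypothetical, and an in-memory hash stands in for a real metrics stack.

```ruby
# Hypothetical in-memory metrics store: every measurement is kept,
# and alert checks read from the same stored series.
class MetricStore
  def initialize
    @series = Hash.new { |hash, name| hash[name] = [] }
  end

  # Store path: always record the measurement.
  def record(name, time, value)
    @series[name] << [time, value]
  end

  def series(name)
    @series[name]
  end

  # Alert path: a query over stored data, not a separate measurement.
  # Returns the out-of-bounds points so the alert carries its time series.
  def out_of_bounds(name, max:)
    @series[name].select { |_time, value| value > max }
  end
end

store = MetricStore.new
store.record("latency_ms", 0, 52)
store.record("latency_ms", 1, 64)
store.record("latency_ms", 2, 310)

violations = store.out_of_bounds("latency_ms", max: 200)
```

Because the alert is derived from stored data, the series that triggered it is available for analysis afterwards.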

integration testing / QoS monitoring

https://plus.google.com/112678702228711889851/posts/eVeouesvaVX

integration testing: is it good for production?
QoS monitoring: is it good in production?

integration testing
run through common user flows
assert no errors
ensure performance adequate

quality of service (QoS) monitoring
users running through flows
asserting no/minimal errors
ensuring performance adequate

integration prod staging user load QoS monitoring

Integration prod staging user load QoS monitoring

staging prod user load load gen QoS monitoring QoS monitoring
load gen

invest in load generation/replay
invest in granular QoS monitoring
applicable to all environments, all the time
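
A rough sketch of a flow check that runs unchanged against any environment, assuming a flow is a list of named steps. `check_flow`, the step names, and the `.invalid` URLs are hypothetical; the stand-in steps do no network I/O.

```ruby
# Hypothetical flow check: the same code runs against any environment.
# Each step is a callable that takes the environment's base URL.
def check_flow(env, base_url, steps, budget_seconds:)
  steps.map do |name, step|
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    error = begin
      step.call(base_url)
      nil
    rescue StandardError => e
      e.class.name
    end
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    # One granular result per step: error (if any) and latency against a budget.
    { env: env, step: name, error: error, seconds: elapsed,
      ok: error.nil? && elapsed <= budget_seconds }
  end
end

# Stand-in steps; real ones would issue the requests of a user flow.
steps = {
  "sign_in"   => ->(url) { url.length },
  "view_user" => ->(_url) { raise IOError, "simulated failure" }
}

staging = check_flow("staging", "https://staging.example.invalid", steps, budget_seconds: 1.0)
prod    = check_flow("prod",    "https://prod.example.invalid",    steps, budget_seconds: 1.0)
```

Only the environment name and base URL differ between the two calls, so the integration test and the QoS monitor are the same code.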

errors / results

raise(“it’s tricky”)

errors: something happened, it was bad
results: something happened, it was OK

begin
  res = call_fn(arg)
  # handle result
rescue => err
  # handle error
end

exceptions are only exceptional at small scale
“1 in a billion” @ 100k op/s ≃ 10 times a day
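
The estimate can be checked with a few lines of arithmetic. The variable names are illustrative; the numbers are the ones on the slide. The exact figure is about 8.64 per day, which the slide rounds to 10.

```ruby
ops_per_second  = 100_000
seconds_per_day = 24 * 60 * 60          # 86_400
failure_rate    = 1.0 / 1_000_000_000   # "1 in a billion"

ops_per_day      = ops_per_second * seconds_per_day  # 8_640_000_000
failures_per_day = ops_per_day * failure_rate        # about 8.64
```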

begin
  res = call_fn(arg)
  # handle result
rescue => err
  # handle error
end

open source

http://golang.org

http://golang.org
res, err := RunOp(arg)
if err != nil {
    // handle error
}
// handle result

begin
  res = run_op(arg)
  # handle result
rescue => err
  # handle error
end

locality?
in general:
not local in space - service-level errors etc
not local in time - defined post hoc!

what even is an error?
you don’t know at dev-time
when it’s just a result... emit event for later analysis

treat “exceptions” / results symmetrically to the greatest extent possible
expect to define errors at analysis-time, not just dev-time or run-time, based on results
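
A minimal Ruby sketch of the symmetric treatment, assuming events are collected in a plain array. `run_and_emit`, the event fields, and the analysis-time rule are all hypothetical and chosen only to illustrate the idea.

```ruby
# Run an operation and emit one event either way. The event records what
# happened; it does not decide whether that was an "error".
def run_and_emit(events, action, arg)
  value = yield(arg)
  events << { action: action, arg: arg, outcome: :returned, value: value }
  value
rescue StandardError => err
  events << { action: action, arg: arg, outcome: :raised, exception: err.class.name }
  nil
end

# Hypothetical sink; in practice this would be an event stream.
events = []
run_and_emit(events, "parse_int", "42")  { |s| Integer(s) }
run_and_emit(events, "parse_int", "abc") { |s| Integer(s) }
run_and_emit(events, "parse_int", "-1")  { |s| Integer(s) }

# Analysis-time definition of "error", written after the fact:
# here, anything that raised OR returned a negative number.
errors = events.select do |e|
  e[:outcome] == :raised || (e[:value].is_a?(Integer) && e[:value] < 0)
end
```

The `"-1"` call returns normally at run-time and only counts as an error because the later analysis says so, which is the post hoc definition the slides describe.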

everything is a ...

logs / events / metrics
alert criteria / metrics
integration testing / QoS monitoring
errors / results

a challenge

everything is a ...
