Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Berlin 2013 - Session - Mark McGranaghan
Search
Monitorama
September 19, 2013
1
330
Berlin 2013 - Session - Mark McGranaghan
Monitorama
September 19, 2013
Tweet
Share
More Decks by Monitorama
See All by Monitorama
Monitorama PDX 2017 - Ian Bennett
monitorama
1
590
PDX 2017 - Pedro Andrade
monitorama
0
760
PDX 2017 - Roy Rapoport
monitorama
4
960
PDX 2017 - Julia Evans
monitorama
0
490
Berlin 2013 - Session - Brad Lhotsky
monitorama
5
730
Berlin 2013 - Session - Alex Petrov
monitorama
6
700
Berlin 2013 - Session - Jeff Weinstein
monitorama
2
640
Berlin 2013 - Session - Oliver Hankeln
monitorama
1
550
Berlin 2013 - Session - David Goodlad
monitorama
0
470
Featured
See All Featured
A better future with KSS
kneath
239
18k
Design and Strategy: How to Deal with People Who Don’t "Get" Design
morganepeng
132
19k
How To Stay Up To Date on Web Technology
chriscoyier
791
250k
Context Engineering - Making Every Token Count
addyosmani
8
310
Speed Design
sergeychernyshev
32
1.2k
We Have a Design System, Now What?
morganepeng
53
7.8k
The Art of Delivering Value - GDevCon NA Keynote
reverentgeek
16
1.7k
Chrome DevTools: State of the Union 2024 - Debugging React & Beyond
addyosmani
9
930
Visualizing Your Data: Incorporating Mongo into Loggly Infrastructure
mongodb
48
9.7k
Statistics for Hackers
jakevdp
799
220k
How Fast Is Fast Enough? [PerfNow 2025]
tammyeverts
2
98
How GitHub (no longer) Works
holman
315
140k
Transcript
Fewer Better Systems Monitorama EU 2013 Mark McGranaghan
@mmcgrana
Fewer Better Systems
Unix
everything is a file
/var/db /usr/lib /dev/tcp /usr/bin /etc
/dev/tcp
problem problem problem
everything is a ...
failover
primary secondary
primary secondary
primary secondary
primary secondary?
https://twitter.com/b6n/status/161899319459463168
the best systems are used constantly
Fewer Better Systems
everything is a ...
the best systems are used constantly
logs / events alert criteria / metrics integration testing /
QoS monitoring errors / results
logs / events
logs: stream of unstructured information events: stream of structured information
logs 64.242.88.10 - [19/Sep/2013:10:27:39] "GET /users/7 HTTP/1.1" 200 [notice] SQL
(0.5ms) SELECT users Completed in 64ms (View: 52, DB: 10) | 200 OK [/users/7]
invent ways to encode data in text...
data "data" | data <data> - data [data] (data)
meanwhile...
Apache log parsers / analyzers Postgres log parses / analyzers
Redis log parsers / analyzers Heroku log parsers / analyzers ...
everything is a ...
events { :time "2013-09-19 10:27:39" :action "users.get" :user_id 7 :method
"GET" :path "/users/7" :ip "64.242.88.10" ... }
64.242.88.10 - [19/Sep/2013:10:27:39] "GET /users/7 HTTP/1.1" 200
events { :time "2013-09-19 10:27:39" :action "users.get" :user_id 7 :method
"GET" :path "/users/7" :ip "64.242.88.10" ... }
encode data as data, uniformly
analyze with general tools
open source
http://fluentd.org
{ :time "2013-09-19 10:27:39", :tag "web.request", :record { :ip "64.242.88.10",
:path "/users/7", ... } }
Web apps ---+ +--> file | | +--> ---+ /var/log
------> Fluentd ------> mail +--> ---+ | | Apache ---- +--> Fluentd http://fluentd.org
problem problem problem
something happened at some time: event events as data, not
text general-purpose event processing applicable to all information
everything is a ...
alert criteria / metrics
alert criteria: measure, alert if out of bounds metrics: measure,
store for analysis
measure measure alert store
measure measure alert store steady-state
measure measure alert store alert!
measure measure alert store steady-state
measure alert store
production
None
every alert has time series alter time series come from
metrics stack alert source data stored all the time
the best systems are used constantly
integration testing / QoS monitoring
https://plus.google.com/112678702228711889851/posts/eVeouesvaVX
integration testing: is good for production? QoS monitoring: is it
good in production?
integration testing run through common user flows, assert no errors,
ensure performance adequate
quality of service (QoS) monitoring users running through flows asserting
no/minimal errors, ensuring performance adequate
integration prod staging user load QoS monitoring
Integration prod staging user load QoS monitoring
staging prod user load load gen QoS monitoring QoS monitoring
load gen
invest in load generation/replay invest in granular QoS monitoring applicable
to all environments, all the time
the best systems are used constantly
errors / results
raise(“it’s tricky”)
errors: something happened, it was bad results: something happened, it
was OK
begin res = call_fn(arg) # handle result rescue => err
# handle error end
None
None
exceptions are only exceptional at small scale “1 in a
billion” @ 100k op/s ≃ 10 times a day
begin res = call_fn(arg) # handle result rescue => err
# handle error end
open source
http://golang.org
http://golang.org res, err := RunOp(arg) if err != nil {
// handle error } // handle result
begin res = run_op(arg) # handle result rescue => err
# handle error end
locality? in general: not local in space - service-level errors
etc not local in time - defined post hoc!
what even is an error? you don’t know at dev-time
when it’s just a result... emit event for later analysis
treat “exceptions” / results symmetrically to the greatest extent possible
expect to define errors at analysis-time, not just dev-time or run-time, based on results
everything is a ...
logs / events / metrics alert criteria / metrics integration
testing / QoS monitoring errors / results
a challenge
everything is a ...
the best systems are used constantly
Fewer Better Systems