Berlin 2013 - Session - Mark McGranaghan
Monitorama
September 19, 2013
Transcript
Fewer Better Systems Monitorama EU 2013 Mark McGranaghan
@mmcgrana
Fewer Better Systems
Unix
everything is a file
/var/db /usr/lib /dev/tcp /usr/bin /etc
/dev/tcp
problem problem problem
everything is a ...
failover
primary secondary
primary secondary
primary secondary
primary secondary?
https://twitter.com/b6n/status/161899319459463168
the best systems are used constantly
Fewer Better Systems
everything is a ...
the best systems are used constantly
logs / events
alert criteria / metrics
integration testing / QoS monitoring
errors / results
logs / events
logs: stream of unstructured information
events: stream of structured information
logs
64.242.88.10 - [19/Sep/2013:10:27:39] "GET /users/7 HTTP/1.1" 200
[notice] SQL (0.5ms) SELECT users
Completed in 64ms (View: 52, DB: 10) | 200 OK [/users/7]
invent ways to encode data in text...
data "data" | data <data> - data [data] (data)
meanwhile...
Apache log parsers / analyzers
Postgres log parsers / analyzers
Redis log parsers / analyzers
Heroku log parsers / analyzers
...
everything is a ...
events
{ :time "2013-09-19 10:27:39"
  :action "users.get"
  :user_id 7
  :method "GET"
  :path "/users/7"
  :ip "64.242.88.10"
  ... }
64.242.88.10 - [19/Sep/2013:10:27:39] "GET /users/7 HTTP/1.1" 200
events
{ :time "2013-09-19 10:27:39"
  :action "users.get"
  :user_id 7
  :method "GET"
  :path "/users/7"
  :ip "64.242.88.10"
  ... }
encode data as data, uniformly
analyze with general tools
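As a rough illustration, assuming events arrive as plain hashes shaped like the example above, two generic helpers are enough to query any field without a format-specific parser. `count_by`, `where`, the `:status` field, and the sample events are hypothetical and not from the talk.

```ruby
# Hypothetical events, shaped like the example event in the slides.
# The :status field is an added assumption for illustration.
EVENTS = [
  { time: "2013-09-19 10:27:39", action: "users.get",  user_id: 7,
    method: "GET",  path: "/users/7", status: 200 },
  { time: "2013-09-19 10:27:40", action: "users.get",  user_id: 9,
    method: "GET",  path: "/users/9", status: 404 },
  { time: "2013-09-19 10:27:41", action: "users.post", user_id: 7,
    method: "POST", path: "/users",   status: 200 }
].freeze

# General-purpose query: count events grouped by any field.
def count_by(events, field)
  events.group_by { |e| e[field] }.transform_values(&:count)
end

# General-purpose filter: keep events whose fields equal every given value.
def where(events, **conditions)
  events.select { |e| conditions.all? { |k, v| e[k] == v } }
end

puts count_by(EVENTS, :action).inspect
puts where(EVENTS, user_id: 7).length
```

The same two helpers work unchanged for events from any source, which is the contrast with maintaining one log parser per system.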
open source
http://fluentd.org
{ :time "2013-09-19 10:27:39",
  :tag "web.request",
  :record { :ip "64.242.88.10",
            :path "/users/7",
            ... } }
Web apps ---+                 +--> file
            |                 |
            +-->           ---+
/var/log ------>  Fluentd  ------> mail
            +-->           ---+
            |                 |
Apache   ----                 +--> Fluentd

http://fluentd.org
problem problem problem
something happened at some time: event
events as data, not text
general-purpose event processing
applicable to all information
everything is a ...
alert criteria / metrics
alert criteria: measure, alert if out of bounds
metrics: measure, store for analysis
measure measure alert store
measure measure alert store steady-state
measure measure alert store alert!
measure measure alert store steady-state
measure alert store
production
every alert has a time series
alert time series come from the metrics stack
alert source data is stored all the time
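A minimal sketch of that idea, assuming a toy in-memory store: the alert is just a query over the same series that is always being recorded, rather than a separate measurement path. `MetricsStore`, `out_of_bounds?`, the metric name, and the sample values are all hypothetical and not from the talk.

```ruby
# Hypothetical in-memory metrics store: every measurement is always stored.
class MetricsStore
  def initialize
    @series = Hash.new { |h, k| h[k] = [] }
  end

  # Record one measurement as a [time, value] point.
  def record(name, time, value)
    @series[name] << [time, value]
  end

  # Return the values recorded for `name` at or after `since`.
  def values_since(name, since)
    @series[name].select { |t, _| t >= since }.map { |_, v| v }
  end
end

# The alert reads from the stored series; it does no measuring of its own.
# It fires when the mean of the recent values exceeds `max`.
def out_of_bounds?(store, name, since:, max:)
  values = store.values_since(name, since)
  return false if values.empty?
  values.sum / values.length.to_f > max
end

store = MetricsStore.new
# Made-up latency samples as [time, milliseconds].
[[0, 40], [10, 55], [20, 250], [30, 310]].each do |t, ms|
  store.record("latency_ms", t, ms)
end

puts out_of_bounds?(store, "latency_ms", since: 20, max: 200)
```

Because the alert reads the stored series, the data behind any alert is already available for inspection when it fires.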
the best systems are used constantly
integration testing / QoS monitoring
https://plus.google.com/112678702228711889851/posts/eVeouesvaVX
integration testing: is it good for production?
QoS monitoring: is it good in production?
integration testing: run through common user flows, assert no errors, ensure performance is adequate
quality of service (QoS) monitoring: users running through flows, asserting no/minimal errors, ensuring performance is adequate
integration prod staging user load QoS monitoring
Integration prod staging user load QoS monitoring
staging prod user load load gen QoS monitoring QoS monitoring
load gen
invest in load generation/replay
invest in granular QoS monitoring
applicable to all environments, all the time
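One way to read this, sketched under assumptions: a single flow check that is indifferent to whether its traffic comes from a load generator in staging or from real users in prod. `check_flow`, the `client` callables, the paths, and the thresholds below are hypothetical stand-ins; a real client would issue requests and time them.

```ruby
# One flow check, usable as an integration test (staging) or as QoS
# monitoring (prod). `client` is a hypothetical callable that takes a path
# and returns [status, elapsed_ms].
def check_flow(client, paths, max_ms:)
  results = paths.map do |path|
    status, elapsed_ms = client.call(path)
    { path: path,
      status: status,
      elapsed_ms: elapsed_ms,
      ok: status < 500 && elapsed_ms <= max_ms }
  end
  { ok: results.all? { |r| r[:ok] }, results: results }
end

FLOW = ["/login", "/users/7"].freeze

# Stub clients standing in for two environments (made-up numbers).
staging = ->(_path) { [200, 30] }
prod    = ->(path)  { path == "/users/7" ? [200, 900] : [200, 25] }

puts check_flow(staging, FLOW, max_ms: 500)[:ok]
puts check_flow(prod, FLOW, max_ms: 500)[:ok]
```

The same assertions run everywhere; only the source of load differs.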
the best systems are used constantly
errors / results
raise("it's tricky")
errors: something happened, it was bad
results: something happened, it was OK
begin
  res = call_fn(arg)
  # handle result
rescue => err
  # handle error
end
exceptions are only exceptional at small scale
“1 in a billion” @ 100k op/s ≃ 10 times a day
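The arithmetic behind that figure can be checked directly. This is a small worked calculation using only the two numbers from the slide; the variable names are illustrative.

```ruby
failure_rate    = 1.0 / 1_000_000_000 # "1 in a billion"
ops_per_second  = 100_000             # 100k op/s
seconds_per_day = 24 * 60 * 60        # 86,400 seconds

# Expected number of "1 in a billion" outcomes per day at this throughput.
failures_per_day = failure_rate * ops_per_second * seconds_per_day
puts failures_per_day.round(2)
```

The result is about 8.64 occurrences per day, which the slide rounds to roughly 10.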
begin
  res = call_fn(arg)
  # handle result
rescue => err
  # handle error
end
open source
http://golang.org
http://golang.org
res, err := RunOp(arg)
if err != nil {
    // handle error
}
// handle result
begin
  res = run_op(arg)
  # handle result
rescue => err
  # handle error
end
locality?
in general:
not local in space - service-level errors etc
not local in time - defined post hoc!
what even is an error?
you don’t know at dev-time
when it’s just a result...
emit event for later analysis
treat “exceptions” / results symmetrically to the greatest extent possible
expect to define errors at analysis-time, not just dev-time or run-time, based on results
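A minimal sketch of this approach, under stated assumptions: the operation returns a result/error pair (in the style of the Go snippet above) instead of raising, every call emits an event, and the decision about what counts as an error is made later over the events. `run_op`'s behavior, `call_and_emit`, the error names, and the `benign` list are hypothetical.

```ruby
# Hypothetical operation that returns [result, error] instead of raising,
# mirroring the Go-style `res, err := RunOp(arg)` from the slides.
def run_op(arg)
  return [nil, "timeout"] if arg == :slow
  return [nil, "not_found"] if arg == :missing
  ["ok:#{arg}", nil]
end

# Emit one event per call, whether or not the call "failed".
def call_and_emit(arg, events)
  res, err = run_op(arg)
  events << { action: "run_op", arg: arg, result: res, error: err }
  [res, err]
end

events = []
[:a, :missing, :slow, :b].each { |arg| call_and_emit(arg, events) }

# Analysis time: decide which outcomes count as errors.
# Treating "not_found" as an ordinary result is an assumption for illustration.
benign = ["not_found"]
errors = events.select { |e| e[:error] && !benign.include?(e[:error]) }
puts errors.length
```

Changing the `benign` list reclassifies past outcomes without touching the code that produced them.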
everything is a ...
logs / events / metrics
alert criteria / metrics
integration testing / QoS monitoring
errors / results
a challenge
everything is a ...
the best systems are used constantly
Fewer Better Systems