Berlin 2013 - Session - Mark McGranaghan
Monitorama
September 19, 2013
Transcript
Fewer Better Systems
Monitorama EU 2013
Mark McGranaghan
@mmcgrana
Fewer Better Systems
Unix
everything is a file
/var/db /usr/lib /dev/tcp /usr/bin /etc
/dev/tcp
problem problem problem
everything is a ...
failover
primary secondary
primary secondary
primary secondary
primary secondary?
https://twitter.com/b6n/status/161899319459463168
the best systems are used constantly
Fewer Better Systems
everything is a ...
the best systems are used constantly
logs / events
alert criteria / metrics
integration testing / QoS monitoring
errors / results
logs / events
logs: stream of unstructured information
events: stream of structured information
logs
64.242.88.10 - [19/Sep/2013:10:27:39] "GET /users/7 HTTP/1.1" 200
[notice] SQL (0.5ms) SELECT users
Completed in 64ms (View: 52, DB: 10) | 200 OK [/users/7]
invent ways to encode data in text...
data "data" | data <data> - data [data] (data)
meanwhile...
Apache log parsers / analyzers
Postgres log parsers / analyzers
Redis log parsers / analyzers
Heroku log parsers / analyzers
...
everything is a ...
events
{ :time "2013-09-19 10:27:39"
  :action "users.get"
  :user_id 7
  :method "GET"
  :path "/users/7"
  :ip "64.242.88.10"
  ... }
vs. the same request as a log line:
64.242.88.10 - [19/Sep/2013:10:27:39] "GET /users/7 HTTP/1.1" 200
encode data as data, uniformly
analyze with general tools
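For instance, a minimal Ruby sketch of emitting the request above as a structured event (emit_event and the JSON-over-stdout transport are illustrative assumptions, not from the talk):

require 'json'
require 'time'

# Hypothetical helper: one structured event per request,
# instead of a free-form text log line.
def emit_event(event)
  $stdout.puts(JSON.generate(event))  # any transport works; stdout is the simplest
end

emit_event(
  time:    Time.now.utc.iso8601,
  action:  'users.get',
  user_id: 7,
  method:  'GET',
  path:    '/users/7',
  ip:      '64.242.88.10'
)

Because the event is already data, any general tool that reads JSON can filter, join, or aggregate it; no per-producer parser is needed.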
open source
http://fluentd.org
{ :time "2013-09-19 10:27:39", :tag "web.request", :record { :ip "64.242.88.10",
:path "/users/7", ... } }
web apps ---+                     +--> file
/var/log ---+-----> Fluentd ------+--> mail
Apache   ---+                     +--> ...
http://fluentd.org
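To ship events like the record above from Ruby, the fluent-logger gem posts tagged records to a local Fluentd daemon; a minimal sketch, assuming Fluentd's default forward input on port 24224:

require 'fluent-logger'

# Open one connection to the local Fluentd forward input.
Fluent::Logger::FluentLogger.open(nil, host: 'localhost', port: 24224)

# Send a tag plus the structured record; Fluentd attaches the timestamp.
Fluent::Logger.post('web.request', ip: '64.242.88.10', path: '/users/7')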
problem problem problem
something happened at some time: event
events as data, not text
general-purpose event processing
applicable to all information
everything is a ...
alert criteria / metrics
alert criteria: measure, alert if out of bounds
metrics: measure, store for analysis
two pipelines: measure → alert, measure → store (in steady-state and on alert! alike)
one pipeline: measure → alert + store
production
every alert has a time series
alert time series come from the metrics stack
alert source data stored all the time
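A sketch of that takeaway in Ruby (the in-memory STORE and helper names are assumptions; real stacks use Graphite or similar): the alert is just a query over the same stored series the dashboards read, so the alerting path is exercised constantly:

# Hypothetical in-memory time series store.
STORE = Hash.new { |h, k| h[k] = [] }

def measure(metric, value, time = Time.now)
  STORE[metric] << [time, value]            # store everything, all the time
end

def alert?(metric, threshold, window = 60)
  cutoff = Time.now - window
  recent = STORE[metric].select { |t, _| t >= cutoff }
  recent.any? { |_, v| v > threshold }      # alert criteria = query over stored data
end

measure('web.latency_ms', 640)
alert?('web.latency_ms', 500)               # => true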
the best systems are used constantly
integration testing / QoS monitoring
https://plus.google.com/112678702228711889851/posts/eVeouesvaVX
integration testing: is it good for production?
QoS monitoring: is it good in production?
integration testing: run through common user flows, assert no errors, ensure performance is adequate
quality of service (QoS) monitoring: users running through flows; assert no/minimal errors, ensure performance is adequate
integration testing in staging; QoS monitoring in prod under user load
with load gen in staging, QoS monitoring runs in staging and prod alike
invest in load generation/replay
invest in granular QoS monitoring
applicable to all environments, all the time
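A hedged sketch of one granular QoS probe in Ruby (the URL, flow, and thresholds are illustrative): the identical probe runs against staging under generated load and against prod under user load:

require 'net/http'
require 'uri'

# Run one user flow and report a pass/fail result plus latency.
def qos_probe(base_url)
  started = Time.now
  res     = Net::HTTP.get_response(URI("#{base_url}/users/7"))
  elapsed = Time.now - started
  ok      = res.code == '200' && elapsed < 0.5  # no errors, adequate performance
  p(check: 'users.get', ok: ok, elapsed_s: elapsed.round(3))
  ok
end

qos_probe('https://staging.example.com')  # under load gen
qos_probe('https://www.example.com')      # under user load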
the best systems are used constantly
errors / results
raise("it's tricky")
errors: something happened, it was bad
results: something happened, it was OK
begin
  res = call_fn(arg)
  # handle result
rescue => err
  # handle error
end
exceptions are only exceptional at small scale
"1 in a billion" @ 100k op/s ≃ 10 times a day
(10⁻⁹ × 10⁵ op/s × 86,400 s/day ≈ 8.6/day)
begin
  res = call_fn(arg)
  # handle result
rescue => err
  # handle error
end
open source
http://golang.org
res, err := RunOp(arg)
if err != nil {
    // handle error
}
// handle result
begin
  res = run_op(arg)
  # handle result
rescue => err
  # handle error
end
locality?
in general: not local in space - service-level errors, etc.
not local in time - defined post hoc!
what even is an error? you don’t know at dev-time when it’s just a result...
emit event for later analysis
treat “exceptions” / results symmetrically to the greatest extent possible
expect to define errors at analysis-time, not just dev-time or run-time, based on results
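A Ruby sketch of that symmetry (run_op, emit_event, and the events collection are assumptions): every call emits the same event shape whether it raised or returned, and "error" becomes a predicate chosen at analysis time:

# Wrap any operation so results and exceptions produce the same event shape.
def run_op_with_events(arg)
  res = run_op(arg)
  emit_event(action: 'op.finished', arg: arg, status: res.status)
  res
rescue => err
  emit_event(action: 'op.finished', arg: arg, exception: err.class.name)
  raise
end

# analysis-time: an "error" is whatever predicate we choose, post hoc,
# over the stored events (here, exceptions or 5xx results).
errors = events.select { |e| e[:exception] || (e[:status] && e[:status] >= 500) }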
everything is a ...
logs / events
alert criteria / metrics
integration testing / QoS monitoring
errors / results
a challenge
everything is a ...
the best systems are used constantly
Fewer Better Systems