
Many Moving Parts: Monitoring Complex Systems

Monitoring a simple app with, say, a single web server and a database host is, well, simple. How do you approach monitoring complex systems, where the technology stack has more pieces than you can count on both hands? Where the system diagram needs memes on it to stop you from crying? Where the business depends on your system being available?

Laura Thomson

October 17, 2014

Transcript

  1. Many Moving Parts
    Monitoring Complex Systems

    [email protected]

    @lxt


  2. Many Moving Parts
    Monitoring Complex Systems
    (at roflscale)


  3. Confession: I’m not a sysadmin.


  4. (Why does devops == ops, anyway?)


  5. (image-only slide)

    (diagram: webapp → db)
    simple system, simple monitoring


    (diagram: webapp, db, cache, slave)
    slightly more complex


  8. Most of us don’t work on simple stuff



  9. Most of us don’t work on simple stuff

    ...and if you do I hate you.


  10. Most of us don’t work on simple stuff

    ...and if you do I hate you.

    (just kidding)


  11. Some of our stuff looks like this


  12. Some of our stuff looks like this

    (avert your eyes now if you are easily scared)


  13. (image-only slide)

  14. and this


  15. (image-only slide)

  16. It’s actually more complicated than that


  17. Socorro
    Very Large Array at Socorro, New Mexico, USA. Photo taken by Hajor, 08.Aug.2004. Released under cc.by.sa
    and/or GFDL. Source: http://en.wikipedia.org/wiki/File:USA.NM.VeryLargeArray.02.jpg


  18. (image-only slide)

  19. (image-only slide)

  20. (image-only slide)

  21. (image-only slide)

  22. (image-only slide)

  23. > 120 physical boxes (not cloud)

    ~8 developers + ops + QA + Hadoop ops


  24. 3000 crashes per minute

    3 million per day

    Crash size 150k - 20MB

    ~800GB stored in PostgreSQL

    ~110TB stored in HDFS


  25. like many complex systems

    it’s a data pipeline

    or firehose, if you prefer


  26. (image-only slide)

  27. typically 4k build and 10k test jobs per day

    topped 100k/day in August

    new record for pushes/commits: 690 in a single day (also in August)

    hg.mozilla.org has: 3,445 unique repos; 32,123,211 total commits;
    865,594 unique commits; 1,223 unique root commits

    ~6000 machines

    80% of build and 50% of test in cloud

    strategic use of AWS spot instances saved 75% on our bill


  28. (image-only slide)

  29. (image-only slide)

  30. First Principles

    (First: Principles!)


  31. Pull for normal operating conditions

    Push soft alerts for warnings

    Pages for critical issues
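    A minimal sketch of that split, with made-up function names (not from the talk): normal readings get recorded and pulled into dashboards, warnings go out as soft alerts, and only critical results page the on-call.

        # Hypothetical severity routing; record_metric / send_soft_alert /
        # send_page stand in for whatever tooling you actually use.
        OK, WARNING, CRITICAL = 0, 1, 2

        def route_check_result(name, severity, message,
                               record_metric, send_soft_alert, send_page):
            if severity == OK:
                record_metric(name, message)     # pulled into graphs/dashboards
            elif severity == WARNING:
                send_soft_alert(name, message)   # email/IRC; nobody gets woken up
            else:
                send_page(name, message)         # on-call gets paged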


  32. analytics tell you:

    “is this load normal?”

    visualize healthy system performance

    makes it easier to find anomalies


  33. (image-only slide)

  34. monitoring coverage

    is logically equivalent to

    test coverage

    (and suffers the same issues)


  35. the things that are easy to monitor/test

    vs

    the things we really care about


  36. just “more” is not better:

    noise is enemy #1


  37. Thomson’s Hierarchy of Monitoring


  38. Diagnostic

    Indirect

    Threshold

    Trend

    Performance

    Business


  39. Diagnostic


  40. Host DOWN

    500 ISE

    Replication lag
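    As one concrete diagnostic check, a replication-lag probe on a PostgreSQL replica might look roughly like this. This is a sketch, assuming psycopg2 and Nagios-style exit codes; the host, database, and thresholds are made up.

        import sys
        import psycopg2

        WARN_SECONDS, CRIT_SECONDS = 60, 300

        conn = psycopg2.connect("host=replica.example.com dbname=app")
        cur = conn.cursor()
        # Seconds since the last transaction was replayed on this replica.
        cur.execute("SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())")
        lag = int(cur.fetchone()[0] or 0)

        if lag >= CRIT_SECONDS:
            print("REPLICATION CRITICAL: %d seconds behind master" % lag)
            sys.exit(2)
        elif lag >= WARN_SECONDS:
            print("REPLICATION WARNING: %d seconds behind master" % lag)
            sys.exit(1)
        print("REPLICATION OK: %d seconds behind master" % lag)
        sys.exit(0)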



  41. You know where to look

    You have a good idea about what to fix

    Not always simple, but often well-defined


  42. Indirect


  43. FILE_AGE CRITICAL: blah.log is M seconds old

    Last record written N seconds ago
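    A sketch of that kind of indirect check, in the spirit of Nagios' check_file_age: if the log hasn't been written to recently, something upstream is probably stuck. The path and thresholds are made up.

        import os
        import sys
        import time

        LOG = "/var/log/processor/blah.log"   # illustrative path
        WARN, CRIT = 300, 900                 # seconds since last write

        age = int(time.time() - os.stat(LOG).st_mtime)
        if age >= CRIT:
            print("FILE_AGE CRITICAL: %s is %d seconds old" % (LOG, age))
            sys.exit(2)
        elif age >= WARN:
            print("FILE_AGE WARNING: %s is %d seconds old" % (LOG, age))
            sys.exit(1)
        print("FILE_AGE OK: %s is %d seconds old" % (LOG, age))
        sys.exit(0)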


  44. Something is wrong

    Maybe with the monitored component

    Maybe somewhere upstream


  45. Why is this useful?


  46. High level exception handlers

    The thing you don’t know to monitor yet

    The thing you don’t know how to monitor


  47. You know where to start looking

    You might have to look deeper too


  48. Threshold


  49. DISK WARNING - free space: (% used)

    More files on disk than there ought to be

    Running out of inodes
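    A sketch of a combined space and inode threshold check using os.statvfs (Unix only; the mount point and the 90%/95% thresholds are made up):

        import os
        import sys

        MOUNT = "/data"
        WARN_PCT, CRIT_PCT = 90, 95

        st = os.statvfs(MOUNT)
        space_pct = 100.0 * (1 - float(st.f_bavail) / st.f_blocks)
        inode_pct = 100.0 * (1 - float(st.f_favail) / st.f_files)
        worst = max(space_pct, inode_pct)

        msg = "%s %.0f%% space used, %.0f%% inodes used" % (MOUNT, space_pct, inode_pct)
        if worst >= CRIT_PCT:
            print("DISK CRITICAL - " + msg)
            sys.exit(2)
        elif worst >= WARN_PCT:
            print("DISK WARNING - " + msg)
            sys.exit(1)
        print("DISK OK - " + msg)
        sys.exit(0)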


  50. Sometimes simple (disk space)

    Sometimes complex root cause (files)

    Sometimes hard to measure


  51. 1% errors = normal, expected

    5% errors = something bad is happening


  52. Error rates

    Count errors (statsd, etc) per time window

    Monitor on counts (rate)
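    A sketch of the counting side, assuming the Python statsd client and an aggregator such as Graphite; the metric names are illustrative. The alerting side then computes errors/requests per window and compares it to thresholds like the 1% and 5% on the previous slide.

        import statsd

        metrics = statsd.StatsClient("localhost", 8125, prefix="webapp")

        def handle_request(request, process):
            metrics.incr("requests")
            try:
                return process(request)
            except Exception:
                metrics.incr("errors")   # alert when errors/requests exceeds ~5% per window
                raise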


  53. Trend


  54. Disk is 85% full

    Did it get that way over months?

    Did it get that way in one night?


  55. Trends are important

    Rates of change are important
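    A sketch of why the rate matters as much as the level: given two (timestamp, percent-used) samples, estimate how long until the disk fills. The numbers in the comments are illustrative.

        def hours_until_full(sample_a, sample_b, full_pct=100.0):
            """Estimate hours until 100% from two (unix_time, percent_used) samples."""
            (t0, pct0), (t1, pct1) = sample_a, sample_b
            rate_per_hour = (pct1 - pct0) / ((t1 - t0) / 3600.0)
            if rate_per_hour <= 0:
                return None                      # flat or shrinking: no ETA
            return (full_pct - pct1) / rate_per_hour

        # 80% -> 85% over a month: roughly 90 days of headroom; a warning at most.
        # 40% -> 85% overnight: a few hours left; page someone now.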


  56. Top crashes (count)

    Explosive crashes (trend)


  57. Performance


  58. Page load times

    Other component response times

    X items processed/minute
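    A sketch of instrumenting response times with the Python statsd client's timer (names are illustrative); the threshold/trend alerting then runs on the aggregated timings rather than inside the app.

        import statsd

        metrics = statsd.StatsClient("localhost", 8125, prefix="webapp")

        def render_report(report_id, build_report):
            # Records milliseconds per call; percentiles are computed upstream.
            with metrics.timer("views.report"):
                return build_report(report_id)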


  59. Tooling is improving

    Traditionally more for dev than ops

    Needs threshold/trend alerting


  60. Business


  61. Transactions/hour

    Conversion rate

    Volumes


  62. Performance monitors and

    multiple-levels-of-indirection monitors:

    Thresholds

    Trends

    Alerts
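    A sketch of automating one of those business-level monitors: compare this hour's transaction count with the same hour in recent weeks and alert on a large drop. The data source and the 25%/50% thresholds are illustrative.

        def check_transactions(current, same_hour_previous_weeks,
                               warn_drop=0.25, crit_drop=0.50):
            """Return (exit_code, message) comparing current volume to a baseline."""
            baseline = sum(same_hour_previous_weeks) / float(len(same_hour_previous_weeks))
            drop = (baseline - current) / baseline if baseline else 0.0
            if drop >= crit_drop:
                return 2, "CRITICAL: transactions down %.0f%% vs baseline" % (drop * 100)
            if drop >= warn_drop:
                return 1, "WARNING: transactions down %.0f%% vs baseline" % (drop * 100)
            return 0, "OK: %d transactions this hour (baseline %.0f)" % (current, baseline)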


  63. Often these exist in human form

    AUTOMATE

    Better a page than an angry boss/customer


  64. Testing


  65. (image-only slide)

  66. You’ve probably heard:

    Monitoring and testing converge


  67. Running tests on prod can be awesome

    except when it isn’t (Knight Capital)

    (be careful)


  68. two kinds:

    safe for prod

    not safe for prod (write, load, etc)


  69. Monitor as unit test:

    When you have an outage, add a monitor
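    A sketch of what that can look like: a prod-safe, read-only check written after an outage, runnable both as a test and as a monitor. The URL and the requests dependency are assumptions, not something from the talk.

        import sys
        import requests   # assumed third-party HTTP client

        def check_status(url="https://crash-stats.example.com/api/status"):
            resp = requests.get(url, timeout=10)
            if resp.status_code != 200:
                return 2, "CRITICAL: %s returned %d" % (url, resp.status_code)
            return 0, "OK: %s reachable" % url

        if __name__ == "__main__":
            code, message = check_status()
            print(message)
            sys.exit(code)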


  70. Resilience


  71. Gracefully degrade

    Decouple unreliable components

    Load/feature shed

    Automate recovery
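    A sketch of the first two items, graceful degradation around a decoupled, unreliable dependency; the names are illustrative.

        import logging

        log = logging.getLogger("resilience")

        def degrade_gracefully(fetch, fallback):
            """Return fetch()'s result, or a fallback if the dependency misbehaves."""
            try:
                return fetch()
            except Exception:
                log.warning("dependency failed; serving degraded response", exc_info=True)
                return fallback

        # e.g. sidebar = degrade_gracefully(fetch_recommendations, fallback=[])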


  72. What changes?

    Thresholds more interesting than single alerts

    Many alerts should become warnings

    (or risk alert fatigue)


  73. Questions?

    [email protected]

    @lxt
