Boston 2013 - Session - Laura Thomson

Many Moving Parts Monitoring Complex Systems [email protected] 1 Wednesday, May
29, 13

Many Moving Parts Monitoring Complex Systems (at roflscale)

Confession: I’m not a sysadmin.

(Why does devops == ops, anyway?)

webapp, db: simple system, simple monitoring

webapp, db, cache, slave: slightly more complex

Most of us don’t work on simple stuff
...and if you do I hate you.
(just kidding)

Some of our stuff looks like this
(avert your eyes now if you are easily scared)

It’s actually more complicated than that

Socorro
Very Large Array at Socorro, New Mexico, USA. Photo taken by Hajor, 08.Aug.2004. Released under cc.by.sa and/or GFDL. Source: http://en.wikipedia.org/wiki/File:USA.NM.VeryLargeArray.02.jpg

Collection: collector, crashmover, filesystem, HBase

Processing: HBase, PostgreSQL, ElasticSearch, monitor, processor, symbol store, minidumpstackwalk

Reporting: HBase, PostgreSQL, ElasticSearch, middleware, webapp, memcache, crons, other data sources

> 120 physical boxes (not cloud)
~10 developers + DBAs + sysadmin team + QA + Hadoop ops

3000 crashes per minute
3 million per day
Crash size 150k - 20MB
~800GB stored in PostgreSQL
~110TB stored in HDFS

like many complex systems it’s a data pipeline
or firehose, if you prefer

Diagnostic
Indirect
Threshold
Trend
Performance
Business

Diagnostic

Host DOWN
500 ISE
Replication lag

You know where to look
You have a good idea about what to fix
Not always simple, but often well-defined

Indirect

FILE_AGE CRITICAL: blah.log is M seconds old
Last record in database N seconds ago
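
An indirect check like the FILE_AGE alert above can be sketched in a few lines. This is only an illustration: the function name `check_file_age`, the two thresholds, and the warning tier are assumptions, and the exit codes follow the common Nagios plugin convention rather than anything stated in the talk.

```python
import os
import time

# Nagios-style plugin status codes (a common convention, assumed here).
OK, WARNING, CRITICAL = 0, 1, 2


def check_file_age(path, warn_seconds, crit_seconds, now=None):
    """Return (status, message) based on how long ago `path` was modified."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        # A missing file is itself a sign that something upstream stopped.
        return CRITICAL, f"FILE_AGE CRITICAL: {path} is missing"
    if age >= crit_seconds:
        return CRITICAL, f"FILE_AGE CRITICAL: {path} is {age:.0f} seconds old"
    if age >= warn_seconds:
        return WARNING, f"FILE_AGE WARNING: {path} is {age:.0f} seconds old"
    return OK, f"FILE_AGE OK: {path} is {age:.0f} seconds old"
```

A stale file does not say which component failed, only that the writer or something upstream of it has stopped producing output.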

Something is wrong
Maybe with the monitored component
Maybe somewhere upstream

Why is this useful?

High level exception handlers
The thing you don’t know to monitor yet
The thing you don’t know how to monitor

You know where to start looking
You might have to look deeper too

Threshold

DISK WARNING - free space: (% used)
More files on disk than there ought to be

Sometimes simple (disk space)
Sometimes complex root cause (files)
Sometimes hard to measure

1% errors = normal, expected
5% errors = something bad is happening

Error rates
Count errors (statsd, etc) per window
Monitor on counts (rate)
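
Counting errors per window and alerting on the rate might look roughly like the sketch below. It does not use statsd itself; the function names, the 60-second window, and the intermediate "warning" tier between the 1% and 5% figures are all assumptions made for illustration.

```python
from collections import defaultdict


def count_per_window(events, window_seconds=60):
    """Bucket (timestamp, is_error) pairs into fixed windows.

    Returns {window_start: [errors, total]}.
    """
    windows = defaultdict(lambda: [0, 0])
    for timestamp, is_error in events:
        start = int(timestamp // window_seconds) * window_seconds
        windows[start][1] += 1
        if is_error:
            windows[start][0] += 1
    return dict(windows)


def classify(errors, total, normal=0.01, bad=0.05):
    """Map an error count to a status using the 1% / 5% figures as thresholds."""
    if total == 0:
        return "unknown"  # no traffic at all may deserve its own alert
    rate = errors / total
    if rate >= bad:
        return "critical"
    if rate > normal:
        return "warning"
    return "ok"
```

Alerting on the rate rather than on each individual error keeps the expected background noise from paging anyone.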

Trend

Disk is 85% full
Did it get that way over months?
Did it get that way in one night?

Trends are important
Rates of change are important
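
The difference between the two "85% full" cases shows up once the rate of change is computed. The sketch below assumes samples are `(hours, percent_used)` pairs, oldest first, and uses a simple first-to-last slope; the function names and sample values are hypothetical.

```python
def fill_rate(samples):
    """Return the average change in percent used per hour across the samples."""
    (t0, p0), (t1, p1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time range")
    return (p1 - p0) / (t1 - t0)


def hours_until_full(samples):
    """Estimate hours until 100% at the current rate, or None if not growing."""
    rate = fill_rate(samples)
    if rate <= 0:
        return None
    return (100.0 - samples[-1][1]) / rate


# Same reading today (85% full), very different urgency.
over_a_month = [(0, 80.0), (720, 85.0)]  # 5 points in 30 days
overnight = [(0, 40.0), (8, 85.0)]       # 45 points in 8 hours
```

With these made-up numbers the first disk has roughly three months left at its current rate, while the second has under three hours.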

Top crashes (count)
Explosive crashes (trend)
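
The count versus trend distinction can be illustrated with two small functions. This is not how Socorro computes these lists; the signature names, the `min_ratio` and `min_count` cutoffs, and the day-over-day comparison are assumptions chosen to keep the example short.

```python
def top_crashes(today, n=3):
    """Return the n signatures with the highest crash count today."""
    return sorted(today, key=today.get, reverse=True)[:n]


def explosive_crashes(yesterday, today, min_ratio=3.0, min_count=50):
    """Return (signature, growth_ratio) pairs that grew sharply since yesterday."""
    found = []
    for signature, count in today.items():
        previous = yesterday.get(signature, 0)
        # Treat unseen signatures as one prior crash to avoid dividing by zero.
        ratio = count / max(previous, 1)
        if count >= min_count and ratio >= min_ratio:
            found.append((signature, ratio))
    return sorted(found, key=lambda pair: pair[1], reverse=True)


# Hypothetical per-signature crash counts.
yesterday = {"sig_a": 1000, "sig_b": 20, "sig_c": 500}
today = {"sig_a": 1100, "sig_b": 400, "sig_c": 450, "sig_d": 60}
```

Here `sig_a` leads by count but is stable, while `sig_b` and the new `sig_d` would only surface in the trend-based list.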

Performance

Page load times
Other component response times
X items processed/minute

Tooling is improving
Traditionally more for dev than ops
Needs threshold/trend alerting for ops

Business

Transactions/hour
Conversion rate
Volumes

Just another performance monitor
Thresholds
Trends
Alerts

Often these exist in human form
AUTOMATE
Better a page than an angry boss/customer

You’ve probably heard:
Monitoring and testing converge

Running tests on prod can be awesome
except when it isn’t (Knight)
(be careful)

two kinds:
safe for prod
not safe for prod (write, load, etc)

Monitor as unit test:
When you have a failure, add a monitor
(coverage is hard to measure)

Questions? [email protected] @lxt
