Slide 1

Many Moving Parts: Monitoring Complex Systems
[email protected] @lxt

Slide 2

Many Moving Parts: Monitoring Complex Systems (at roflscale)

Slide 3

Confession: I’m not a sysadmin.

Slide 4

(Why does devops == ops, anyway?)

Slide 5

No content

Slide 6

webapp, db
simple system, simple monitoring

Slide 7

webapp, db, cache, slave
slightly more complex

Slide 8

Most of us don’t work on simple stuff

Slide 9

Most of us don’t work on simple stuff ...and if you do I hate you.

Slide 10

Most of us don’t work on simple stuff ...and if you do I hate you. (just kidding)

Slide 11

Some of our stuff looks like this

Slide 12

Some of our stuff looks like this (avert your eyes now if you are easily scared)

Slide 13

No content

Slide 14

and this

Slide 15

No content

Slide 16

It’s actually more complicated than that

Slide 17

Socorro
Very Large Array at Socorro, New Mexico, USA. Photo taken by Hajor, 08.Aug.2004. Released under cc-by-sa and/or GFDL. Source: http://en.wikipedia.org/wiki/File:USA.NM.VeryLargeArray.02.jpg

Slide 18

No content

Slide 19

No content

Slide 20

No content

Slide 21

No content

Slide 22

No content

Slide 23

> 120 physical boxes (not cloud)
~8 developers + ops + QA + Hadoop ops

Slide 24

3000 crashes per minute
3 million per day
Crash size 150k - 20MB
~800GB stored in PostgreSQL
~110TB stored in HDFS

Slide 25

like many complex systems, it’s a data pipeline
(or firehose, if you prefer)

Slide 26

No content

Slide 27

typically 4k build and 10k test jobs per day
topped 100k/day in August
new record for pushes/commits: 690 in a single day (also in August)
hg.mozilla.org has: 3,445 unique repos; 32,123,211 total commits; 865,594 unique commits; 1,223 unique root commits
~6000 machines
80% of build and 50% of test in cloud
strategic use of AWS spot instances saved 75% on our bill

Slide 28

No content

Slide 29

No content

Slide 30

First Principles (First: Principles!)

Slide 31

Pull for normal operating conditions
Push soft alerts for warnings
Pages for critical issues

Slide 32

analytics tell you: “is this load normal?”
visualize healthy system performance
makes it easier to find anomalies

Slide 33

No content

Slide 34

monitoring coverage is logically equivalent to test coverage (and suffers the same issues)

Slide 35

the things that are easy to monitor/test vs the things we really care about

Slide 36

just “more” is not better: noise is enemy #1

Slide 37

Thomson’s Hierarchy of Monitoring

Slide 38

Diagnostic
Indirect
Threshold
Trend
Performance
Business

Slide 39

Diagnostic

Slide 40

Host DOWN
500 ISE
Replication lag


Slide 41

You know where to look
You have a good idea about what to fix
Not always simple, but often well-defined
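
Diagnostic checks like the ones above are the easiest to automate. A minimal sketch of a Nagios-style HTTP check in Python (the function name and return shape are illustrative, not from the talk):

```python
import urllib.error
import urllib.request

def http_status(url, timeout=5):
    """Diagnostic check: a 500 points straight at the app, a refused
    connection at the host. Returns (ok, detail).

    Minimal sketch of a Nagios-style check_http; real plugins also
    report response time and match page content."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return True, "HTTP %d" % resp.status
    except urllib.error.HTTPError as exc:
        return False, "HTTP %d" % exc.code   # e.g. 500 ISE
    except OSError as exc:
        return False, "DOWN: %s" % exc       # host/network failure
```

A check script would call this and exit 0 (OK) or 2 (CRITICAL) accordingly.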

Slide 42

Indirect

Slide 43

FILE_AGE CRITICAL: blah.log is M seconds old
Last record written N seconds ago

Slide 44

Something is wrong
Maybe with the monitored component
Maybe somewhere upstream
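
The FILE_AGE style of indirect check is easy to script yourself. A sketch, assuming Nagios-style OK/WARNING/CRITICAL levels and made-up thresholds:

```python
import os
import time

def check_file_age(path, warn_after_s, crit_after_s):
    """Return (status, age_seconds) based on how recently a file was
    modified. A stale log usually means the writer -- or something
    upstream of it -- has stalled. Thresholds are example values."""
    age = time.time() - os.path.getmtime(path)
    if age >= crit_after_s:
        return ("CRITICAL", age)
    if age >= warn_after_s:
        return ("WARNING", age)
    return ("OK", age)
```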

Slide 45

Why is this useful?

Slide 46

High level exception handlers
The thing you don’t know to monitor yet
The thing you don’t know how to monitor

Slide 47

You know where to start looking
You might have to look deeper too

Slide 48

Threshold

Slide 49

DISK WARNING - free space: (% used)
More files on disk than there ought to be
Running out of inodes

Slide 50

Sometimes simple (disk space)
Sometimes complex root cause (files)
Sometimes hard to measure
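
Both flavors show up in disk checks: percent-of-blocks is the simple case, running out of inodes is the sneaky one. A sketch using POSIX statvfs (Unix-only; the thresholds are examples, not recommendations):

```python
import os

def disk_status(path, warn_pct=80, crit_pct=90):
    """Threshold check on both block and inode usage for a filesystem.

    A disk can be 'full' with free bytes left if it runs out of inodes,
    so both numbers deserve a threshold."""
    st = os.statvfs(path)
    block_used = 100.0 * (1 - st.f_bavail / st.f_blocks)
    inode_used = 100.0 * (1 - st.f_favail / st.f_files) if st.f_files else 0.0
    worst = max(block_used, inode_used)
    if worst >= crit_pct:
        level = "CRITICAL"
    elif worst >= warn_pct:
        level = "WARNING"
    else:
        level = "OK"
    return level, block_used, inode_used
```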

Slide 51

1% errors = normal, expected
5% errors = something bad is happening

Slide 52

Error rates
Count errors (statsd, etc) per time window
Monitor on counts (rate)
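
A toy version of that counting-per-window approach, standing in for what statsd and friends do (the class name and thresholds are invented for illustration):

```python
import time
from collections import deque

class ErrorRateMonitor:
    """Count requests in a sliding time window and alert when the
    error fraction crosses a threshold: 1% may be normal, 5% is not."""

    def __init__(self, window_s=60, warn_rate=0.01, crit_rate=0.05):
        self.window_s = window_s
        self.warn_rate = warn_rate
        self.crit_rate = crit_rate
        self.events = deque()  # (timestamp, is_error)

    def record(self, is_error, now=None):
        self.events.append((now if now is not None else time.time(), is_error))

    def status(self, now=None):
        now = now if now is not None else time.time()
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] < now - self.window_s:
            self.events.popleft()
        total = len(self.events)
        if not total:
            return "OK", 0.0
        rate = sum(1 for _, e in self.events if e) / total
        if rate >= self.crit_rate:
            return "CRITICAL", rate
        if rate >= self.warn_rate:
            return "WARNING", rate
        return "OK", rate
```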

Slide 53

Trend

Slide 54

Disk is 85% full
Did it get that way over months?
Did it get that way in one night?

Slide 55

Trends are important
Rates of change are important
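
One way to act on rate of change is to project when the trend line hits the ceiling, e.g. time-until-disk-full. A toy linear projection (real trending tools fit over many samples; this just uses the endpoints):

```python
def hours_until_full(samples):
    """Given (hour, percent_used) samples, linearly project when usage
    hits 100%. An 85%-full disk that took months to fill is routine;
    one that filled overnight is an incident. Returns None if usage
    is flat or shrinking."""
    (t0, p0), (t1, p1) = samples[0], samples[-1]
    rate = (p1 - p0) / (t1 - t0)  # percent per hour
    if rate <= 0:
        return None
    return (100.0 - p1) / rate
```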

Slide 56

Top crashes (count)
Explosive crashes (trend)

Slide 57

Performance

Slide 58

Page load times
Other component response times
X items processed/minute

Slide 59

Tooling is improving
Traditionally more for dev than ops
Needs threshold/trend alerting
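
Threshold alerting on response times works best against a high percentile rather than the mean, so a slow tail isn't averaged away. A sketch (helper names and thresholds are hypothetical):

```python
import statistics
import time

def timed(fn, *args, **kwargs):
    """Wrap any call and return (result, elapsed_seconds) so component
    response times can feed the same alerting as everything else."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def latency_status(samples_ms, warn_ms=200, crit_ms=500):
    """Alert on the ~95th percentile of response times, not the mean,
    so a few slow requests are not hidden by many fast ones."""
    p95 = statistics.quantiles(samples_ms, n=20)[-1]  # last cut ~= p95
    if p95 >= crit_ms:
        return "CRITICAL", p95
    if p95 >= warn_ms:
        return "WARNING", p95
    return "OK", p95
```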

Slide 60

Business

Slide 61

Transactions/hour
Conversion rate
Volumes

Slide 62

Performance monitors and multiple-levels-of-indirection monitors:
Thresholds
Trends
Alerts

Slide 63

Often these exist in human form
AUTOMATE
Better a page than an angry boss/customer

Slide 64

Testing

Slide 65

No content

Slide 66

You’ve probably heard: Monitoring and testing converge

Slide 67

Running tests on prod can be awesome
except when it isn’t (Knight Capital)
(be careful)

Slide 68

two kinds:
safe for prod
not safe for prod (write, load, etc)

Slide 69

Monitor as unit test:
When you have an outage, add a monitor

Slide 70

Resilience

Slide 71

Gracefully degrade
Decouple unreliable components
Load/feature shed
Automate recovery
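
As a tiny illustration of "gracefully degrade" and "decouple unreliable components": a hypothetical wrapper that serves a degraded answer when the primary component throws, rather than failing the whole request (real systems add timeouts and circuit breakers on top):

```python
def with_fallback(primary, fallback):
    """Return a callable that tries the unreliable primary component
    and falls back to a degraded-but-working answer on any error.
    Hypothetical helper for illustration only."""
    def wrapped(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            # e.g. cache is down: serve stale data instead of a 500
            return fallback(*args, **kwargs)
    return wrapped
```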

Slide 72

What changes?
Thresholds more interesting than single alerts
Many alerts should become warnings (or risk alert fatigue)

Slide 73

Questions?
[email protected] @lxt