Boston 2013 - Session - Laura Thomson
Monitorama
March 28, 2013
Transcript
Many Moving Parts: Monitoring Complex Systems
[email protected]
1 Wednesday, May 29, 13
Many Moving Parts: Monitoring Complex Systems (at roflscale)
Confession: I’m not a sysadmin.
(Why does devops == ops, anyway?)
webapp, db: simple system, simple monitoring
webapp, db, cache, slave: slightly more complex
Most of us don’t work on simple stuff ...and if you do I hate you. (just kidding)
Some of our stuff looks like this (avert your eyes now if you are easily scared)
It’s actually more complicated than that
Socorro
Very Large Array at Socorro, New Mexico, USA. Photo taken by Hajor, 08.Aug.2004. Released under cc.by.sa and/or GFDL. Source: http://en.wikipedia.org/wiki/File:USA.NM.VeryLargeArray.02.jpg
Collection: collector, crashmover, filesystem, HBase
Processing: HBase, PostgreSQL, ElasticSearch, monitor, processor, symbol store, minidumpstackwalk
Reporting: HBase, PostgreSQL, ElasticSearch, middleware, webapp, memcache, crons, other data sources
> 120 physical boxes (not cloud) ~10 developers + DBAs
+ sysadmin team + QA + Hadoop ops
3000 crashes per minute; 3 million per day; crash size 150k - 20MB; ~800GB stored in PostgreSQL; ~110TB stored in HDFS
Like many complex systems, it’s a data pipeline (or firehose, if you prefer)
Diagnostic, Indirect, Threshold, Trend, Performance, Business
Diagnostic
Host DOWN; 500 ISE; Replication lag
You know where to look. You have a good idea about what to fix. Not always simple, but often well-defined.
Indirect
FILE_AGE CRITICAL: blah.log is M seconds old; Last record in database N seconds ago
Something is wrong. Maybe with the monitored component. Maybe somewhere upstream.
Why is this useful?
High level exception handlers. The thing you don’t know to monitor yet. The thing you don’t know how to monitor.
You know where to start looking. You might have to look deeper too.
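An indirect check of the FILE_AGE kind can be sketched in a few lines. This is a minimal illustration, not the check used on Socorro: the function name `check_file_age`, the status constants, and the warning and critical thresholds are all assumptions made for the example.

```python
import os
import time

# Nagios-style status codes (assumed convention for this sketch).
OK, WARNING, CRITICAL = 0, 1, 2


def check_file_age(path, warn_seconds, crit_seconds, now=None):
    """Return (status, message) based on how long ago `path` was last modified.

    A stale file does not say what broke, only that something did:
    the writer itself, or something upstream of it.
    """
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        # A missing or unreadable file is treated as the worst case.
        return CRITICAL, f"FILE_AGE CRITICAL: {path} is missing or unreadable"
    if age >= crit_seconds:
        return CRITICAL, f"FILE_AGE CRITICAL: {path} is {age:.0f} seconds old"
    if age >= warn_seconds:
        return WARNING, f"FILE_AGE WARNING: {path} is {age:.0f} seconds old"
    return OK, f"FILE_AGE OK: {path} is {age:.0f} seconds old"
```

The same shape works for "last record in database N seconds ago": swap the file modification time for the timestamp of the newest row.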
Threshold
DISK WARNING - free space: (% used); More files on disk than there ought to be
Sometimes simple (disk space). Sometimes complex root cause (files). Sometimes hard to measure.
1% errors = normal, expected; 5% errors = something bad is happening
Error rates: Count errors (statsd, etc) per window. Monitor on counts (rate).
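Counting errors per window and alerting on the rate might look like the sketch below. It keeps the counters in memory rather than in statsd, and the class name `WindowedErrorRate`, the 60-second window, and the `classify` helper are assumptions for illustration. The 1% and 5% cut-offs come from the slide above.

```python
from collections import Counter


class WindowedErrorRate:
    """Count requests and errors in fixed-size time windows."""

    def __init__(self, window_seconds=60):
        self.window_seconds = window_seconds
        self.requests = Counter()
        self.errors = Counter()

    def _bucket(self, timestamp):
        # All timestamps in the same window map to the same integer bucket.
        return int(timestamp // self.window_seconds)

    def record(self, timestamp, is_error):
        bucket = self._bucket(timestamp)
        self.requests[bucket] += 1
        if is_error:
            self.errors[bucket] += 1

    def rate(self, timestamp):
        """Error rate (0.0 to 1.0) for the window containing `timestamp`."""
        bucket = self._bucket(timestamp)
        total = self.requests[bucket]
        return self.errors[bucket] / total if total else 0.0


def classify(rate, normal_rate=0.01, bad_rate=0.05):
    """Map an error rate onto a status using the 1% / 5% cut-offs."""
    if rate >= bad_rate:
        return "CRITICAL"
    if rate > normal_rate:
        return "WARNING"
    return "OK"
```

Alerting on the rate rather than the raw count keeps the check meaningful when traffic volume changes.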
Trend
Disk is 85% full. Did it get that way over months? Did it get that way in one night?
Trends are important. Rates of change are important.
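The difference between "85% full after months" and "85% full after one night" is the rate of change, which can be estimated from two samples. A rough sketch follows; the function names and the sample values used afterwards are hypothetical, and a real check would probably smooth over more than two points.

```python
def fill_rate_per_hour(samples):
    """Percentage points of disk filled per hour.

    `samples` is a list of (unix_seconds, percent_used) pairs, oldest first.
    Only the first and last samples are used in this simple version.
    """
    (t0, p0), (t1, p1) = samples[0], samples[-1]
    hours = (t1 - t0) / 3600.0
    if hours <= 0:
        raise ValueError("samples must span a positive amount of time")
    return (p1 - p0) / hours


def hours_until_full(samples):
    """Linear estimate of hours until 100% used, or None if not growing."""
    rate = fill_rate_per_hour(samples)
    if rate <= 0:
        return None
    return (100.0 - samples[-1][1]) / rate
```

With made-up numbers, a disk that went from 80% to 85% over 90 days has weeks of headroom, while one that went from 40% to 85% in 8 hours will be full in under 3 hours at the same rate. The threshold alert is identical in both cases; only the trend tells them apart.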
Top crashes (count); Explosive crashes (trend)
Performance
Page load times; Other component response times; X items processed/minute
Tooling is improving. Traditionally more for dev than ops. Needs threshold/trend alerting for ops.
Business
Transactions/hour; Conversion rate; Volumes
Just another performance monitor: Thresholds, Trends, Alerts
Often these exist in human form. AUTOMATE. Better a page than an angry boss/customer.
You’ve probably heard: Monitoring and testing converge
Running tests on prod can be awesome, except when it isn’t (Knight) (be careful)
Two kinds: safe for prod; not safe for prod (write, load, etc)
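One way to keep the two kinds apart is to tag each check when it is defined and let the runner skip unsafe ones in production. The sketch below is only an illustration: the `check` decorator, the `run_checks` function, and both example checks are hypothetical names, and the check bodies are placeholders.

```python
# Registry of (function, safe_for_prod) pairs, filled in by the decorator.
CHECKS = []


def check(safe_for_prod):
    """Register a check and record whether it may run against production."""
    def register(func):
        CHECKS.append((func, safe_for_prod))
        return func
    return register


@check(safe_for_prod=True)
def homepage_responds():
    # Placeholder for a read-only probe.
    return True


@check(safe_for_prod=False)
def bulk_write_load_test():
    # Placeholder for a test that writes data or generates load.
    return True


def run_checks(environment):
    """Run every registered check that is allowed in `environment`."""
    results = {}
    for func, safe in CHECKS:
        if environment == "prod" and not safe:
            continue  # never run write or load tests against production
        results[func.__name__] = func()
    return results
```

Making the safety flag a required argument means nobody can add a check without deciding which kind it is.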
Monitor as unit test: When you have a failure, add a monitor (coverage is hard to measure)
Questions?
[email protected]
@lxt