Boston 2013 - Session - Laura Thomson
Monitorama
March 28, 2013
Transcript
Many Moving Parts: Monitoring Complex Systems
[email protected]
1 Wednesday, May
29, 13
Many Moving Parts: Monitoring Complex Systems (at roflscale)
Confession: I’m not a sysadmin.
(Why does devops == ops, anyway?)
webapp, db: simple system, simple monitoring
webapp, db, cache, slave: slightly more complex
Most of us don’t work on simple stuff... and if you do I hate you. (just kidding)
Some of our stuff looks like this (avert your eyes now if you are easily scared)
It’s actually more complicated than that
Socorro
Photo: Very Large Array at Socorro, New Mexico, USA. Taken by Hajor, 08.Aug.2004. Released under cc.by.sa and/or GFDL. Source: http://en.wikipedia.org/wiki/File:USA.NM.VeryLargeArray.02.jpg
Collection: collector, crashmover, filesystem, HBase
Processing: HBase, PostgreSQL, ElasticSearch, monitor, processor, symbol store, minidumpstackwalk
Reporting: HBase, PostgreSQL, ElasticSearch, middleware, webapp, memcache, crons, other data sources
> 120 physical boxes (not cloud); ~10 developers + DBAs + sysadmin team + QA + Hadoop ops
3000 crashes per minute; 3 million per day; crash size 150k - 20MB; ~800GB stored in PostgreSQL; ~110TB stored in HDFS
Like many complex systems, it’s a data pipeline (or firehose, if you prefer)
Kinds of monitors: Diagnostic, Indirect, Threshold, Trend, Performance, Business
Diagnostic
Host DOWN; 500 ISE; replication lag
You know where to look. You have a good idea about what to fix. Not always simple, but often well-defined.
Indirect
FILE_AGE CRITICAL: blah.log is M seconds old; last record in database N seconds ago
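An indirect check of the kind named on this slide can be sketched in a few lines. This is only an illustration, not the check used in the talk: the function name `check_file_age`, the warning level, and both thresholds are assumptions, and the numeric statuses follow the usual Nagios plugin exit-code convention (0 OK, 1 WARNING, 2 CRITICAL).

```python
import os
import time

# Nagios-style plugin statuses (conventional exit codes).
OK, WARNING, CRITICAL = 0, 1, 2


def check_file_age(path, warn_seconds, crit_seconds, now=None):
    """Return (status, message) from the seconds since `path` was last modified.

    `now` can be passed explicitly so the check is easy to test;
    by default the current wall-clock time is used.
    """
    now = time.time() if now is None else now
    age = int(now - os.path.getmtime(path))
    name = os.path.basename(path)
    if age >= crit_seconds:
        return CRITICAL, f"FILE_AGE CRITICAL: {name} is {age} seconds old"
    if age >= warn_seconds:
        return WARNING, f"FILE_AGE WARNING: {name} is {age} seconds old"
    return OK, f"FILE_AGE OK: {name} is {age} seconds old"
```

A stale file says that something stopped writing, not what stopped, which is why the result is a starting point for investigation rather than a diagnosis.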
Something is wrong. Maybe with the monitored component. Maybe somewhere upstream.
Why is this useful?
High level exception handlers. The thing you don’t know to monitor yet. The thing you don’t know how to monitor.
You know where to start looking. You might have to look deeper too.
Threshold
DISK WARNING - free space: (% used); more files on disk than there ought to be
Sometimes simple (disk space). Sometimes complex root cause (files). Sometimes hard to measure.
1% errors = normal, expected; 5% errors = something bad is happening
Error rates: count errors (statsd, etc.) per window; monitor on counts (rate)
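The idea of counting errors per window and alerting on the rate could look roughly like the sketch below. It keeps events in memory instead of using statsd; the class name `ErrorRateMonitor`, the 60-second window, and the intermediate WARNING state are assumptions. Only the 1% and 5% figures come from the slides.

```python
from collections import deque


class ErrorRateMonitor:
    """Track the error rate over a sliding time window (in-memory sketch)."""

    def __init__(self, window_seconds=60, normal_rate=0.01, bad_rate=0.05):
        self.window_seconds = window_seconds
        self.normal_rate = normal_rate  # up to this rate is expected
        self.bad_rate = bad_rate        # at or above this, something is wrong
        self.events = deque()           # (timestamp, is_error), oldest first

    def record(self, timestamp, is_error):
        # Events are assumed to arrive in timestamp order.
        self.events.append((timestamp, is_error))

    def error_rate(self, now):
        # Drop events that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            self.events.popleft()
        if not self.events:
            return 0.0
        errors = sum(1 for _, is_error in self.events if is_error)
        return errors / len(self.events)

    def status(self, now):
        rate = self.error_rate(now)
        if rate >= self.bad_rate:
            return "CRITICAL"
        if rate > self.normal_rate:
            return "WARNING"
        return "OK"
```

Alerting on the rate rather than on individual errors keeps the expected background level quiet while still catching a jump.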
Trend
Disk is 85% full. Did it get that way over months? Did it get that way in one night?
Trends are important. Rates of change are important.
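To make the "months or one night" question concrete, here is a minimal sketch that turns two disk-usage samples into a rate of change and a projected time until the disk is full. The function name `hours_until_full` and the sample values are hypothetical, and a straight-line projection from the first and last sample is an assumption; a real trend monitor would likely smooth over more points.

```python
def hours_until_full(samples, full_percent=100.0):
    """Project hours until the disk is full from (hours, percent_used) samples.

    Uses only the first and last sample (a straight-line estimate).
    Returns None when usage is flat or shrinking.
    """
    (t0, p0), (t1, p1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive amount of time")
    rate = (p1 - p0) / (t1 - t0)  # percentage points per hour
    if rate <= 0:
        return None
    return (full_percent - p1) / rate


# Hypothetical data: both disks read 85% full right now.
slow_growth = [(0, 84.0), (720, 85.0)]  # one point gained over 30 days
fast_growth = [(0, 40.0), (8, 85.0)]    # 45 points gained in 8 hours

print(hours_until_full(slow_growth))
print(hours_until_full(fast_growth))
```

The same 85% reading is months away from trouble in the first case and under three hours away in the second, which a plain threshold check cannot tell apart.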
Top crashes (count); explosive crashes (trend)
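The difference between the two rankings can be shown with a small sketch. The crash signatures, counts, and the `min_ratio` and `min_count` cutoffs below are invented for illustration; this is not how Socorro computes either list.

```python
def top_crashes(today, n=3):
    """Rank signatures by today's raw count."""
    return sorted(today, key=today.get, reverse=True)[:n]


def explosive_crashes(yesterday, today, min_ratio=3.0, min_count=50):
    """Flag signatures whose count grew sharply since yesterday.

    `min_count` filters out tiny signatures where a big ratio is just noise.
    Returns (signature, growth_ratio) pairs, fastest-growing first.
    """
    flagged = []
    for signature, count in today.items():
        previous = yesterday.get(signature, 0)
        ratio = count / max(previous, 1)  # avoid dividing by zero for new signatures
        if count >= min_count and ratio >= min_ratio:
            flagged.append((signature, ratio))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical per-signature crash counts.
yesterday = {"sig_a": 9000, "sig_b": 4000, "sig_c": 20}
today = {"sig_a": 9100, "sig_b": 3900, "sig_c": 600, "sig_d": 10}

print(top_crashes(today, n=2))
print(explosive_crashes(yesterday, today))
```

In this made-up data `sig_c` never reaches the top of the count ranking, yet it is the only signature whose volume changed meaningfully overnight.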
Performance
Page load times; other component response times; X items processed/minute
Tooling is improving. Traditionally more for dev than ops. Needs threshold/trend alerting for ops.
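One way to put an alert on an "items processed per minute" metric is to compare recent throughput with a longer baseline, as in the sketch below. The function name `throughput_status`, the five-minute recent window, and the 50% cutoff are all assumptions chosen for illustration.

```python
from statistics import mean


def throughput_status(per_minute_counts, recent=5, min_fraction=0.5):
    """Compare the mean of the last `recent` minutes with the earlier baseline.

    Returns (status, current, baseline). Status is "ALERT" when current
    throughput falls below `min_fraction` of the baseline.
    """
    if len(per_minute_counts) <= recent:
        raise ValueError("need more samples than the recent window")
    baseline = mean(per_minute_counts[:-recent])
    current = mean(per_minute_counts[-recent:])
    if baseline > 0 and current < baseline * min_fraction:
        return "ALERT", current, baseline
    return "OK", current, baseline
```

The same comparison works for a business metric such as transactions per hour; only the input series changes.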
Business
Transactions/hour; conversion rate; volumes
Just another performance monitor: thresholds, trends, alerts
Often these exist in human form. AUTOMATE. Better a page than an angry boss/customer.
You’ve probably heard: monitoring and testing converge
Running tests on prod can be awesome, except when it isn’t (Knight) (be careful)
Two kinds: safe for prod; not safe for prod (write, load, etc.)
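The split between the two kinds of test can be enforced in code by tagging each check and filtering on the environment. The sketch below is one possible shape, not something shown in the talk; the `check` decorator, the two example checks, and the environment names are hypothetical, and the check bodies are placeholders.

```python
CHECKS = []  # (name, safe_for_prod, function), in registration order


def check(safe_for_prod):
    """Decorator that registers a check and records whether prod may run it."""
    def register(func):
        CHECKS.append((func.__name__, safe_for_prod, func))
        return func
    return register


@check(safe_for_prod=True)
def read_recent_report():
    # Read-only: placeholder for fetching a page or querying a replica.
    return True


@check(safe_for_prod=False)
def submit_synthetic_crashes():
    # Writes data and generates load: placeholder for a write/load test.
    return True


def run_checks(environment):
    """Run every check, skipping unsafe ones when the environment is prod."""
    results = {}
    for name, safe_for_prod, func in CHECKS:
        if environment == "prod" and not safe_for_prod:
            continue
        results[name] = func()
    return results
```

Making the tag a required argument means nobody can add a check without deciding which kind it is.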
Monitor as unit test: when you have a failure, add a monitor (coverage is hard to measure)
Questions?
[email protected]
@lxt