Monitoring 101 — How To Monitor at Scale
by
Benjamin Fernandes
Slide 1
Slide 1 text
MONITORING 101 HOW TO MONITOR AT SCALE Benjamin Fernandes @LotharSee, dotScale 2016
Slide 2
Slide 2 text
LET’S USE AN EXAMPLE ▸ We are building a new dating app (for dogs) DATE-A-DOGE ▸ Let’s monitor it and watch its growth
Slide 3
Slide 3 text
MILESTONE #1 ▸ Initial architecture ▸ Working BETA version
Slide 4
Slide 4 text
FIRST RULE ▸ Measure everything! ▸ All layers are useful ▸ Application ▸ Web server, proxies ▸ DBs, caches ▸ 3rd party providers, …
Slide 5
Slide 5 text
“IF YOU CAN’T MEASURE IT, YOU CAN’T IMPROVE IT”
Slide 6
Slide 6 text
WATCH YOUR WORK METRICS FIRST ▸ Top-level health of your system by measuring its useful output ▸ Throughput ▸ Success ▸ Error ▸ Performance
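A minimal sketch of the four work metrics named on this slide, computed over one time window of web requests. The `Request` record, the `work_metrics` helper, and the rule that a status of 500 or above counts as an error are illustrative assumptions, not part of the talk.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int          # HTTP status code (assumed field)
    duration_ms: float   # time spent serving the request (assumed field)


def work_metrics(requests, window_seconds):
    """Summarize throughput, success, error and performance for one window."""
    total = len(requests)
    # Assumption: any 5xx response counts as an error.
    errors = sum(1 for r in requests if r.status >= 500)
    successes = total - errors
    avg_latency = sum(r.duration_ms for r in requests) / total if total else 0.0
    return {
        "throughput_rps": total / window_seconds,
        "success": successes,
        "error": errors,
        "avg_latency_ms": avg_latency,
    }


# Hypothetical sample: four requests observed during a 2-second window.
sample = [
    Request(200, 12.0),
    Request(200, 18.0),
    Request(500, 90.0),
    Request(200, 20.0),
]
metrics = work_metrics(sample, window_seconds=2)
print(metrics)
```

These four numbers describe the useful output of the service, which is why they are a natural first thing to graph.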
Slide 7
Slide 7 text
DIFFERENT KINDS OF MONITORED DATA ▸ Resource metrics and events to investigate
Slide 8
Slide 8 text
A FEW EXAMPLES ▸ WORK: web.latency, web.errors, api.calls, downloads, dog.swipes ▸ RESOURCE: system.load, disk.used, db.queries, cache.latency, net.sent ▸ EVENTS: new release, commits, 3rd party down, tweets
Slide 9
Slide 9 text
DASHBOARDS FOR EVERYTHING
Slide 10
Slide 10 text
MILESTONE #2 ▸ Our app is now live ▸ First users… ▸ First outage!
Slide 11
Slide 11 text
ALERTS BEFORE IT IS TOO LATE ▸ Create alerts on key metrics ▸ Do it on work metrics first ▸ Record, notify or page depending on the severity
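One way to picture "record, notify or page depending on the severity" is a function that maps a metric value to one of those three actions. The `evaluate` helper and the two thresholds below are hypothetical; real values depend on the service being monitored.

```python
def evaluate(value, warn_threshold, critical_threshold):
    """Pick an alert action for one metric value.

    Returns "page" at or above the critical threshold, "notify" at or above
    the warning threshold, and "record" otherwise.
    """
    if value >= critical_threshold:
        return "page"
    if value >= warn_threshold:
        return "notify"
    return "record"


# Hypothetical thresholds for web.errors, expressed as a percentage of requests.
WARN, CRITICAL = 1.0, 5.0

for error_rate in (0.2, 2.5, 12.0):
    print(error_rate, evaluate(error_rate, WARN, CRITICAL))
```

Keeping the lowest severity as "record" means the value is still stored for later investigation without interrupting anyone.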
Slide 12
Slide 12 text
SYSTEMATIC INVESTIGATION ▸ From work metrics, dig into resource metrics ▸ Did something change? Check events
Slide 13
Slide 13 text
MILESTONE #3 ▸ Getting big ▸ 100+ servers ▸ 10k+ users
Slide 14
Slide 14 text
MONITORING HAS TO SCALE TOO ▸ Host-centric doesn’t work ▸ Need to focus on services
Slide 15
Slide 15 text
MULTIPLE DIMENSIONS WITH TAGS
Slide 16
Slide 16 text
EXTRA SPATIAL DIMENSIONS
Slide 17
Slide 17 text
WHAT WE CALL A METRIC
Slide 18
Slide 18 text
USE TAGS, EVERYWHERE ▸ Graph and alert on tags ▸ To scope and to aggregate ▸ Web latency per region < 50 ms on prod
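The "web latency per region < 50 ms on prod" example can be sketched as a scope (env is prod) followed by an aggregation (average per region). The point layout, the tag names `env` and `region`, and the `average_by` helper are assumptions made for illustration; they do not describe any particular monitoring product.

```python
from collections import defaultdict

# Hypothetical data points: one metric name, a value, and free-form tags.
points = [
    {"metric": "web.latency", "value": 30.0, "tags": {"env": "prod", "region": "eu"}},
    {"metric": "web.latency", "value": 50.0, "tags": {"env": "prod", "region": "eu"}},
    {"metric": "web.latency", "value": 70.0, "tags": {"env": "prod", "region": "us"}},
    {"metric": "web.latency", "value": 500.0, "tags": {"env": "staging", "region": "us"}},
]


def average_by(points, metric, scope, group_by):
    """Average a metric over points matching `scope`, grouped by one tag."""
    groups = defaultdict(list)
    for p in points:
        if p["metric"] != metric:
            continue
        # Scope: keep only points whose tags match every key/value in `scope`.
        if any(p["tags"].get(k) != v for k, v in scope.items()):
            continue
        # Aggregate: bucket the value under the tag we group by.
        groups[p["tags"].get(group_by)].append(p["value"])
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


per_region = average_by(points, "web.latency", scope={"env": "prod"}, group_by="region")
# Regions that break the "< 50 ms" condition from the slide.
breaching = [region for region, avg in per_region.items() if avg >= 50]
print(per_region, breaching)
```

The staging point is excluded by the scope, so its 500 ms value does not distort the production averages.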
Slide 19
Slide 19 text
TAG WITH ANYTHING USEFUL ▸ Service ▸ Version ▸ Environment ▸ User ID ▸ API endpoint ▸ Dog breed
Slide 20
Slide 20 text
SLICE AND DICE ▸ Same metric, different angles
Slide 21
Slide 21 text
MILESTONE #4 ▸ More hosts! ▸ More services! ▸ Containers!
Slide 22
Slide 22 text
METRIC CARDINALITY ▸ Don’t get lost ▸ Keep using tags, everywhere ▸ Stay focused on work metrics ▸ Monitoring gets sharper with experience
Slide 23
Slide 23 text
METRIC AGGREGATION ▸ Watch min/avg/max ▸ Watch percentiles and outliers ▸ How fast is my app for the 95th percentile?
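A short sketch of why min/avg/max and percentiles are worth watching together. It uses the nearest-rank definition of a percentile, which is one of several common definitions and is chosen here only for simplicity; the latency values are made up.

```python
def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the
    data at or below it. `p` is an integer from 1 to 100."""
    ordered = sorted(values)
    # Integer ceiling of p * n / 100, to avoid floating-point rounding.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical latencies in ms: mostly fast, one slow request, one outlier.
latencies = [10.0] * 18 + [40.0, 900.0]

summary = {
    "min": min(latencies),
    "avg": sum(latencies) / len(latencies),
    "max": max(latencies),
    "p50": percentile(latencies, 50),
    "p95": percentile(latencies, 95),
}
print(summary)
```

With this data the average (56 ms) is pulled up by a single outlier, while the median (10 ms) and the 95th percentile (40 ms) describe what most requests actually experience.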
Slide 24
Slide 24 text
NOW YOU HAVE ALL THE TOOLS ▸ Happy monitoring!