Monitoring 101 — How To Monitor at Scale
by Benjamin Fernandes
Slide 1
Slide 1 text
MONITORING 101 HOW TO MONITOR AT SCALE Benjamin Fernandes @LotharSee, dotScale 2016
Slide 2
Slide 2 text
LET’S USE AN EXAMPLE ▸ We are building a new dating app (for dogs) DATE-A-DOGE ▸ Let’s monitor it and watch its growth
Slide 3
Slide 3 text
MILESTONE #1 ▸ Initial architecture ▸ Working BETA version
Slide 4
Slide 4 text
FIRST RULE ▸ Measure everything! ▸ All layers are useful ▸ Application ▸ Web server, proxies ▸ DBs, caches ▸ 3rd party providers, …
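As an illustration of measuring the application layer, here is a minimal sketch of a timing decorator. The `RECORDED` store, the `timed` helper, the `swipe` function and the metric name `dog.swipes.duration` are all hypothetical names chosen for this example; a real setup would send the measurement to a metrics backend instead of keeping it in memory.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory store: metric name -> list of durations in seconds.
RECORDED = defaultdict(list)

def timed(metric):
    """Record how long each call to the wrapped function takes."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Recorded even when the call raises, so failures are measured too.
                RECORDED[metric].append(time.perf_counter() - start)
        return wrapper
    return decorator

@timed("dog.swipes.duration")  # assumed metric name, for illustration only
def swipe(dog_id, direction):
    return {"dog_id": dog_id, "direction": direction}

result = swipe(42, "right")
```

The same pattern can wrap database calls, cache lookups or third-party requests, so every layer produces at least one measurement.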
Slide 5
Slide 5 text
“IF YOU CAN’T MEASURE IT, YOU CAN’T IMPROVE IT”
Slide 6
Slide 6 text
WATCH YOUR WORK METRICS FIRST ▸ Top-level health of your system by measuring its useful output ▸ Throughput ▸ Success ▸ Error ▸ Performance
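The four work metrics listed on this slide can be derived from a window of request records. The sketch below assumes a hypothetical `Request` record with an HTTP status and a latency, and treats any 5xx status as an error; both choices are assumptions made for the example, not something the talk specifies.

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int        # HTTP status code (assumed field)
    latency_ms: float  # time taken to serve the request (assumed field)

def work_metrics(requests, window_s):
    """Summarize throughput, success, error and performance for one time window."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "throughput_rps": total / window_s,
        "success_rate": (total - errors) / total if total else 1.0,
        "error_rate": errors / total if total else 0.0,
        "avg_latency_ms": sum(r.latency_ms for r in requests) / total if total else 0.0,
    }

sample = [
    Request(200, 20.0),
    Request(200, 40.0),
    Request(500, 120.0),
    Request(200, 20.0),
]
metrics = work_metrics(sample, window_s=2.0)
```

For the four sample requests over a 2-second window, this gives a throughput of 2 requests per second, a 75% success rate, a 25% error rate and an average latency of 50 ms.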
Slide 7
Slide 7 text
DIFFERENT KINDS OF MONITORED DATA ▸ Resource metrics and events to investigate
Slide 8
Slide 8 text
A FEW EXAMPLES ▸ WORK: web.latency, web.errors, api.calls, downloads, dog.swipes ▸ RESOURCE: system.load, disk.used, db.queries, cache.latency, net.sent ▸ EVENTS: new release, commits, 3rd party down, tweets
Slide 9
Slide 9 text
DASHBOARDS FOR EVERYTHING
Slide 10
Slide 10 text
MILESTONE #2 ▸ Our app is now live ▸ First users… ▸ First outage!
Slide 11
Slide 11 text
ALERTS BEFORE IT IS TOO LATE ▸ Create alerts on key metrics ▸ Do it on work metrics first ▸ Record, notify or page depending on the severity
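One way to read "record, notify or page depending on the severity" is as a pair of thresholds on a work metric. The following is a minimal sketch; the function name `evaluate_alert` and the threshold values are hypothetical and would need tuning for a real service.

```python
def evaluate_alert(value, warn, critical):
    """Map a metric value to an action based on two severity thresholds."""
    if value >= critical:
        return "page"    # high severity: wake someone up
    if value >= warn:
        return "notify"  # moderate severity: send a message or email
    return "record"      # low severity: keep it for later investigation

# Assumed thresholds on the web error rate: warn at 1%, page at 5%.
actions = [evaluate_alert(rate, warn=0.01, critical=0.05)
           for rate in (0.002, 0.02, 0.08)]
```

Here the three sample error rates map to `record`, `notify` and `page` respectively.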
Slide 12
Slide 12 text
SYSTEMATIC INVESTIGATION ▸ From work metrics, dig into resource metrics ▸ Did something change? Check events
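Checking whether something changed usually means looking at the events recorded shortly before a work metric degraded. A minimal sketch, assuming events are stored as hypothetical `(timestamp, description)` pairs with timestamps in seconds:

```python
def events_before(events, anomaly_ts, lookback_s):
    """Return events that happened within lookback_s seconds before the anomaly."""
    return [
        (ts, description)
        for ts, description in events
        if anomaly_ts - lookback_s <= ts <= anomaly_ts
    ]

# Illustrative event log; the timestamps and descriptions are made up.
events = [
    (1000, "commit: tweak swipe ranking"),
    (4000, "new release deployed"),
    (4200, "3rd party provider down"),
    (9000, "new release deployed"),
]

# Latency started to climb at t=4300; look at the preceding 10 minutes.
suspects = events_before(events, anomaly_ts=4300, lookback_s=600)
```

With this data, `suspects` contains the release at t=4000 and the provider outage at t=4200, which are the first things to check.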
Slide 13
Slide 13 text
MILESTONE #3 ▸ Getting big ▸ 100+ servers ▸ 10k+ users
Slide 14
Slide 14 text
MONITORING HAS TO SCALE TOO ▸ Host-centric doesn’t work ▸ Need to focus on services
Slide 15
Slide 15 text
MULTIPLE DIMENSIONS WITH TAGS
Slide 16
Slide 16 text
EXTRA SPATIAL DIMENSIONS
Slide 17
Slide 17 text
WHAT WE CALL A METRIC
Slide 18
Slide 18 text
USE TAGS, EVERYWHERE ▸ Graph and alert on tags ▸ To scope and to aggregate ▸ Web latency per region < 50 ms on prod
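The example on this slide, web latency per region under 50 ms on prod, combines scoping (only `env:prod`) with aggregation (one value per `region`). The sketch below assumes metric points are stored as hypothetical `(name, value, tags)` tuples; the helper `avg_by_tag` and the sample data are illustrative only.

```python
from collections import defaultdict

# Illustrative data points: (metric name, value in ms, tags).
points = [
    ("web.latency", 30.0, {"region": "eu", "env": "prod"}),
    ("web.latency", 40.0, {"region": "eu", "env": "prod"}),
    ("web.latency", 60.0, {"region": "us", "env": "prod"}),
    ("web.latency", 50.0, {"region": "us", "env": "prod"}),
    ("web.latency", 200.0, {"region": "eu", "env": "staging"}),
]

def avg_by_tag(points, metric, group_by, scope):
    """Average a metric per value of one tag, keeping only points matching scope."""
    sums = defaultdict(lambda: [0.0, 0])  # tag value -> [sum, count]
    for name, value, tags in points:
        if name != metric:
            continue
        if any(tags.get(key) != wanted for key, wanted in scope.items()):
            continue  # outside the requested scope
        bucket = sums[tags.get(group_by)]
        bucket[0] += value
        bucket[1] += 1
    return {tag_value: total / count for tag_value, (total, count) in sums.items()}

per_region = avg_by_tag(points, "web.latency", group_by="region", scope={"env": "prod"})

# Alert condition from the slide: latency per region must stay below 50 ms.
breaches = {region: avg for region, avg in per_region.items() if avg >= 50}
```

The staging point is excluded by the scope, `eu` averages 35 ms, and `us` averages 55 ms and breaches the 50 ms limit.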
Slide 19
Slide 19 text
TAG WITH ANYTHING USEFUL ▸ Service ▸ Version ▸ Environment ▸ User ID ▸ API endpoint ▸ Dog breed
Slide 20
Slide 20 text
SLICE AND DICE ▸ Same metric, different angles
Slide 21
Slide 21 text
MILESTONE #4 ▸ More hosts! ▸ More services! ▸ Containers!
Slide 22
Slide 22 text
METRIC CARDINALITY ▸ Don’t get lost ▸ Keep using tags, everywhere ▸ Stay focused on work metrics ▸ Monitoring gets sharper with experience
Slide 23
Slide 23 text
METRIC AGGREGATION ▸ Watch min/avg/max ▸ Watch percentiles and outliers ▸ How fast is my app for the 95th percentile?
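To see why min/avg/max alone can mislead and percentiles help, here is a small sketch that summarizes a batch of latencies. It uses the nearest-rank percentile definition, which is one of several common definitions and is an assumption made for this example; the sample values are invented.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Illustrative latencies in ms: nine normal requests and one outlier.
latencies = [12, 15, 11, 14, 13, 16, 12, 15, 14, 900]

summary = {
    "min": min(latencies),
    "avg": sum(latencies) / len(latencies),
    "max": max(latencies),
    "p50": percentile(latencies, 50),
    "p95": percentile(latencies, 95),
}
```

The average (102.2 ms) is dragged up by a single outlier, while the median (14 ms) shows what a typical request experiences and the 95th percentile (900 ms) exposes the slow tail.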
Slide 24
Slide 24 text
NOW YOU HAVE ALL THE TOOLS ▸ Happy monitoring!