Monitoring 101 — How To Monitor at Scale

MONITORING 101 HOW TO MONITOR AT SCALE Benjamin Fernandes @LotharSee,
dotScale 2016

LET’S USE AN EXAMPLE ▸ We are building a new dating app (for dogs): DATE-A-DOGE ▸ Let’s monitor it and watch its growth

MILESTONE #1 ▸ Initial architecture ▸ Working BETA version

FIRST RULE ▸ Measure everything! ▸ All layers are useful
▸ Application ▸ Web server, proxies ▸ DBs, caches ▸ 3rd party providers, …

“IF YOU CAN’T MEASURE IT, YOU CAN’T IMPROVE IT”

WATCH YOUR WORK METRICS FIRST ▸ Top-level health of your
system by measuring its useful output ▸ Throughput ▸ Success ▸ Error ▸ Performance
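
A minimal sketch of how the four work metrics on this slide could be collected around a request handler. The `WorkMetrics` class and its `observe` method are hypothetical names used for illustration, not part of the talk or of any monitoring library.

```python
import time
from collections import Counter


class WorkMetrics:
    """In-memory store for the four work metrics (illustrative only)."""

    def __init__(self):
        self.counts = Counter()  # throughput, success, error
        self.latencies = []      # performance, in seconds

    def observe(self, func, *args, **kwargs):
        """Call func and record throughput, outcome and duration."""
        start = time.perf_counter()
        self.counts["throughput"] += 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.counts["error"] += 1
            raise
        else:
            self.counts["success"] += 1
            return result
        finally:
            # Runs for both successful and failed calls.
            self.latencies.append(time.perf_counter() - start)
```

Wrapping every entry point this way gives one throughput, success, error and latency figure per call, which is the top-level view the slide recommends watching first.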

DIFFERENT KINDS OF MONITORED DATA ▸ Resource metrics and events to investigate

A FEW EXAMPLES
WORK ▸ web.latency ▸ web.errors ▸ api.calls ▸ downloads ▸ dog.swipes
RESOURCE ▸ system.load ▸ disk.used ▸ db.queries ▸ cache.latency ▸ net.sent
EVENTS ▸ new release ▸ commits ▸ 3rd party down ▸ tweets

DASHBOARDS FOR EVERYTHING

MILESTONE #2 ▸ Our app is now live ▸ First
users… ▸ First outage!

ALERTS BEFORE IT IS TOO LATE ▸ Create alerts on key metrics ▸ Do it on work metrics first ▸ Record, notify or page depending on the severity
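
One way to read "record, notify or page" is as two thresholds on a single work metric. The sketch below assumes that reading; the function name `route_alert` and the threshold parameters are hypothetical.

```python
def route_alert(value, notify_at, page_at):
    """Pick an action for a metric value based on two severity thresholds.

    Assumes higher values are worse (e.g. an error rate or a latency)
    and that notify_at <= page_at.
    """
    if value >= page_at:
        return "page"      # severe: wake someone up
    if value >= notify_at:
        return "notify"    # moderate: send a message
    return "record"        # low: keep it for later analysis
```

For example, with `notify_at=1.0` and `page_at=5.0` on an error-rate percentage, a value of 2.5 would only notify, while 7.0 would page.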

SYSTEMATIC INVESTIGATION ▸ From work metrics, dig into resource metrics ▸ Did something change? Check events
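
The "check events" step can be sketched as a lookup of everything that happened shortly before an alert fired. The event representation (a `(timestamp, description)` tuple), the function name `events_before` and the 30-minute default window are assumptions made for this example.

```python
from datetime import datetime, timedelta


def events_before(events, alert_time, window=timedelta(minutes=30)):
    """Return events within `window` before alert_time, newest first.

    `events` is an iterable of (timestamp, description) tuples.
    """
    recent = [e for e in events if alert_time - window <= e[0] <= alert_time]
    return sorted(recent, key=lambda e: e[0], reverse=True)


# Made-up timestamps, for illustration only.
events = [
    (datetime(2016, 4, 25, 9, 0), "commits"),
    (datetime(2016, 4, 25, 12, 0), "new release"),
    (datetime(2016, 4, 25, 12, 20), "3rd party down"),
]
suspects = events_before(events, datetime(2016, 4, 25, 12, 25))
```

Here `suspects` holds the third-party outage and the release, in that order; the morning commits fall outside the window.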

MILESTONE #3 ▸ Getting big ▸ 100+ servers ▸ 10k+
users

MONITORING HAS TO SCALE TOO ▸ Host-centric doesn’t work ▸
Need to focus on services

MULTIPLE DIMENSIONS WITH TAGS

EXTRA SPATIAL DIMENSIONS

WHAT WE CALL A METRIC

USE TAGS, EVERYWHERE ▸ Graph and alert on tags ▸
To scope and to aggregate ▸ Web latency per region < 50 ms on prod
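
The "scope and aggregate" idea can be sketched with plain data structures. This example assumes each data point is a `(name, value, tags)` tuple with `tags` as a dict; the `query` function is hypothetical and only illustrates the slide's example of web latency per region on prod.

```python
from collections import defaultdict
from statistics import mean


def query(points, name, scope, group_by):
    """Average a metric per value of one tag, restricted to a scope.

    scope:    dict of tag -> required value (e.g. {"env": "prod"})
    group_by: tag key to aggregate on (e.g. "region")
    """
    groups = defaultdict(list)
    for p_name, value, tags in points:
        if p_name != name:
            continue
        if any(tags.get(k) != v for k, v in scope.items()):
            continue  # outside the requested scope
        groups[tags.get(group_by)].append(value)
    return {group: mean(values) for group, values in groups.items()}


points = [
    ("web.latency", 40, {"env": "prod", "region": "eu"}),
    ("web.latency", 60, {"env": "prod", "region": "eu"}),
    ("web.latency", 30, {"env": "prod", "region": "us"}),
    ("web.latency", 500, {"env": "staging", "region": "eu"}),
]
per_region = query(points, "web.latency", {"env": "prod"}, "region")
# Regions that break the "< 50 ms on prod" rule from the slide.
breaching = [region for region, avg in per_region.items() if avg >= 50]
```

The staging point is excluded by the scope, so only the prod averages are compared against the 50 ms threshold.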

TAG WITH ANYTHING USEFUL ▸ Service ▸ Version ▸ Environment
▸ User ID ▸ API endpoint ▸ Dog breed

SLICE AND DICE ▸ Same metric, different angles

MILESTONE #4 ▸ More hosts! ▸ More services! ▸ Containers!

METRIC CARDINALITY ▸ Don’t get lost ▸ Keep using tags, everywhere ▸ Stay focused on work metrics ▸ Monitoring gets sharper with experience
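
A rough way to see what cardinality means in practice is to count distinct combinations of metric name and tag set. The sketch below assumes data points shaped as `(name, value, tags)` tuples; `series_count` is a hypothetical helper, not something from the talk.

```python
def series_count(points):
    """Count distinct (metric name, tag set) combinations."""
    return len({(name, frozenset(tags.items())) for name, _value, tags in points})


points = [
    ("dog.swipes", 1, {"breed": "shiba", "env": "prod"}),
    ("dog.swipes", 1, {"breed": "shiba", "env": "prod"}),
    ("dog.swipes", 1, {"breed": "corgi", "env": "prod"}),
    ("web.errors", 1, {"env": "prod"}),
]
```

For the sample above, `series_count(points)` is 3: every new tag value adds a series, which is why the number grows quickly as hosts, services and containers multiply.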

METRIC AGGREGATION ▸ Watch min/avg/max ▸ Watch percentiles and outliers
▸ How fast is my app for the 95th percentile?
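
To make the percentile question concrete, here is a small sketch using the nearest-rank definition, which is one of several common percentile definitions and is an assumption of this example. The `percentile` function and the sample latencies are illustrative.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# 19 fast requests and one very slow outlier (made-up values, in ms).
latencies = [10] * 19 + [900]
summary = {
    "min": min(latencies),
    "avg": mean(latencies),
    "max": max(latencies),
    "p50": percentile(latencies, 50),
    "p99": percentile(latencies, 99),
}
```

The average alone hides the shape of the data here: most requests take 10 ms, while the single outlier only shows up in the max and the high percentile.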

NOW YOU HAVE ALL THE TOOLS ▸ Happy monitoring!
