Slide 1

Slide 1 text

MONITORING 101: HOW TO MONITOR AT SCALE ▸ Benjamin Fernandes (@LotharSee), dotScale 2016

Slide 2

Slide 2 text

LET’S USE AN EXAMPLE ▸ We are building a new dating app (for dogs): DATE-A-DOGE ▸ Let’s monitor it and watch its growth

Slide 3

Slide 3 text

MILESTONE #1 ▸ Initial architecture ▸ Working BETA version

Slide 4

Slide 4 text

FIRST RULE ▸ Measure everything! ▸ All layers are useful ▸ Application ▸ Web server, proxies ▸ DBs, caches ▸ 3rd party providers, …

Slide 5

Slide 5 text

“IF YOU CAN’T MEASURE IT, YOU CAN’T IMPROVE IT”

Slide 6

Slide 6 text

WATCH YOUR WORK METRICS FIRST ▸ Top-level health of your system by measuring its useful output ▸ Throughput ▸ Success ▸ Error ▸ Performance
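
As a rough illustration of the four work metrics above, the sketch below summarises one time window of requests. The `Request` record, the `work_metrics` helper, the field names, and the rule that a status of 500 or above counts as an error are all assumptions made for this example, not something taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int         # HTTP status code returned to the client
    duration_ms: float  # time taken to serve the request


def work_metrics(requests, window_seconds):
    """Summarise one time window of requests into the four work metrics."""
    total = len(requests)
    # Assumption for this sketch: any 5xx status counts as an error.
    errors = sum(1 for r in requests if r.status >= 500)
    successes = total - errors
    avg_latency = sum(r.duration_ms for r in requests) / total if total else 0.0
    return {
        "throughput_rps": total / window_seconds,              # throughput
        "success_rate": successes / total if total else 1.0,   # success
        "error_rate": errors / total if total else 0.0,        # error
        "avg_latency_ms": avg_latency,                         # performance
    }


# Illustrative data: four requests observed over a two-second window.
sample = [Request(200, 20.0), Request(200, 40.0), Request(500, 90.0), Request(200, 10.0)]
summary = work_metrics(sample, window_seconds=2)
```

Each value maps to one bullet on the slide, so a single dashboard row built from `summary` would give the top-level health view.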

Slide 7

Slide 7 text

DIFFERENT KINDS OF MONITORED DATA ▸ Resource metrics and events to investigate

Slide 8

Slide 8 text

A FEW EXAMPLES ▸ WORK: web.latency, web.errors, api.calls, downloads, dog.swipes ▸ RESOURCE: system.load, disk.used, db.queries, cache.latency, net.sent ▸ EVENTS: new release, commits, 3rd party down, tweets

Slide 9

Slide 9 text

DASHBOARDS FOR EVERYTHING

Slide 10

Slide 10 text

MILESTONE #2 ▸ Our app is now live ▸ First users… ▸ First outage!

Slide 11

Slide 11 text

ALERTS BEFORE IT IS TOO LATE ▸ Create alerts on key metrics ▸ Do it on work metrics first ▸ Record, notify or page depending on the severity
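
One possible way to express "record, notify or page depending on the severity" is a small function that maps a work metric to an action. The `alert_action` name and both thresholds (1% and 5% error rate) are hypothetical values chosen for this sketch; real thresholds depend on the service.

```python
def alert_action(error_rate, notify_at=0.01, page_at=0.05):
    """Pick an action for a work metric (here: the web error rate).

    The thresholds are illustrative assumptions, not recommendations.
    """
    if error_rate >= page_at:
        return "page"    # urgent: wake someone up
    if error_rate >= notify_at:
        return "notify"  # needs attention, but not immediately
    return "record"      # keep it for later investigation only


actions = [alert_action(rate) for rate in (0.001, 0.02, 0.10)]
```

Keeping the severity rule in one place makes it easy to review which conditions are allowed to page a human.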

Slide 12

Slide 12 text

SYSTEMATIC INVESTIGATION ▸ From work metrics, dig into resource metrics ▸ Did something change? Check events
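
The "check events" step can be sketched as a lookup of what happened shortly before an alert fired. The `events_before` helper, the 30-minute lookback, and the timestamps below are assumptions for illustration only.

```python
from datetime import datetime, timedelta


def events_before(events, alert_time, lookback=timedelta(minutes=30)):
    """Return names of events inside the lookback window, oldest first.

    `events` is a list of (timestamp, name) tuples.
    """
    start = alert_time - lookback
    return [name for ts, name in sorted(events) if start <= ts <= alert_time]


# Illustrative event stream; the dates and times are made up.
events = [
    (datetime(2016, 4, 25, 14, 5), "new release"),
    (datetime(2016, 4, 25, 9, 0), "commits"),
    (datetime(2016, 4, 25, 14, 20), "3rd party down"),
]
alert_time = datetime(2016, 4, 25, 14, 30)
suspects = events_before(events, alert_time)
```

Here the morning commits fall outside the window, so only the release and the third-party outage remain as candidates to investigate.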

Slide 13

Slide 13 text

MILESTONE #3 ▸ Getting big ▸ 100+ servers ▸ 10k+ users

Slide 14

Slide 14 text

MONITORING HAS TO SCALE TOO ▸ Host-centric doesn’t work ▸ Need to focus on services

Slide 15

Slide 15 text

MULTIPLE DIMENSIONS WITH TAGS

Slide 16

Slide 16 text

EXTRA SPATIAL DIMENSIONS

Slide 17

Slide 17 text

WHAT WE CALL A METRIC

Slide 18

Slide 18 text

USE TAGS, EVERYWHERE ▸ Graph and alert on tags ▸ To scope and to aggregate ▸ Web latency per region < 50 ms on prod
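
A minimal sketch of scoping and aggregating with tags, using the slide's example of web latency per region under 50 ms on prod. The point layout (a dict with `metric`, `value`, and `tags`), the `avg_by` helper, and the sample values are assumptions for this example rather than any particular monitoring product's API.

```python
from collections import defaultdict

# Illustrative data points; every value carries its tags.
points = [
    {"metric": "web.latency", "value": 30.0, "tags": {"env": "prod", "region": "eu"}},
    {"metric": "web.latency", "value": 40.0, "tags": {"env": "prod", "region": "eu"}},
    {"metric": "web.latency", "value": 60.0, "tags": {"env": "prod", "region": "us"}},
    {"metric": "web.latency", "value": 80.0, "tags": {"env": "prod", "region": "us"}},
    {"metric": "web.latency", "value": 200.0, "tags": {"env": "staging", "region": "us"}},
]


def avg_by(points, metric, scope, group_by):
    """Average `metric` over points matching `scope`, grouped by one tag key."""
    buckets = defaultdict(list)
    for p in points:
        if p["metric"] != metric:
            continue
        # Scope: keep only points whose tags match every key/value in `scope`.
        if any(p["tags"].get(k) != v for k, v in scope.items()):
            continue
        # Aggregate: one bucket per value of the `group_by` tag.
        buckets[p["tags"].get(group_by)].append(p["value"])
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


per_region = avg_by(points, "web.latency", scope={"env": "prod"}, group_by="region")
# Regions that break the "< 50 ms on prod" target.
breaching = sorted(region for region, avg in per_region.items() if avg >= 50)
```

Changing `group_by` to another tag key gives a different view of the same metric without collecting anything new.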

Slide 19

Slide 19 text

TAG WITH ANYTHING USEFUL ▸ Service ▸ Version ▸ Environment ▸ User ID ▸ API endpoint ▸ Dog breed

Slide 20

Slide 20 text

SLICE AND DICE ▸ Same metric, different angles

Slide 21

Slide 21 text

MILESTONE #4 ▸ More hosts! ▸ More services! ▸ Containers!

Slide 22

Slide 22 text

METRIC CARDINALITY ▸ Don’t get lost ▸ Keep using tags, everywhere ▸ Stay focused on work metrics ▸ Monitoring gets sharper with experience

Slide 23

Slide 23 text

METRIC AGGREGATION ▸ Watch min/avg/max ▸ Watch percentiles and outliers ▸ How fast is my app for the 95th percentile?
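
To show why percentiles matter next to min/avg/max, the sketch below computes a 95th percentile with the nearest-rank method. Nearest-rank is one of several percentile definitions and is chosen here only for simplicity; the `percentile` helper and the latency values are assumptions for this example.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value covering pct% of the samples."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Illustrative latencies in ms: most requests are fast, two are outliers.
latencies = [20] * 18 + [300, 900]

lowest = min(latencies)
highest = max(latencies)
average = sum(latencies) / len(latencies)
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
```

With this data the average sits well above what a typical request sees and well below the slowest ones, while the median and the 95th percentile describe those two groups directly.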

Slide 24

Slide 24 text

NOW YOU HAVE ALL THE TOOLS ▸ Happy monitoring!