Monitoring 101 — How To Monitor at Scale

Short introduction to the key concepts of monitoring at scale.
Lightning talk for dotScale 2016. http://dotscale.io

System monitoring is a wide topic, with hundreds of ways of doing it and tons of metrics to look at. This talk follows a growing web application to illustrate the important monitoring concepts: what to look at first, how to exploit the data, and which practices hold up at large scale.

Benjamin Fernandes

April 25, 2016

Transcript

  1. MONITORING 101: HOW TO MONITOR AT SCALE ▸ Benjamin Fernandes @LotharSee, dotScale 2016
  2. LET’S USE AN EXAMPLE ▸ We are building a new dating app (for dogs): DATE-A-DOGE ▸ Let’s monitor it and watch its growth
  3. MILESTONE #1 ▸ Initial architecture ▸ Working BETA version

  4. FIRST RULE ▸ Measure everything! ▸ All layers are useful ▸ Application ▸ Web server, proxies ▸ DBs, caches ▸ 3rd party providers, …
  5. “IF YOU CAN’T MEASURE IT, YOU CAN’T IMPROVE IT”

  6. WATCH YOUR WORK METRICS FIRST ▸ Top-level health of your system by measuring its useful output ▸ Throughput ▸ Success ▸ Error ▸ Performance
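The four work metrics on this slide can be derived from plain request records. The following is a minimal sketch, not the speaker's implementation: the `Request` record, the `work_metrics` helper, and the choice to count HTTP 5xx responses as errors are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One handled request (hypothetical record shape)."""
    status: int          # HTTP status code
    duration_ms: float   # time spent serving the request


def work_metrics(requests: list[Request], window_s: float) -> dict[str, float]:
    """Summarize throughput, success, error and performance over one window."""
    total = len(requests)
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    avg_ms = sum(r.duration_ms for r in requests) / total if total else 0.0
    return {
        "throughput_rps": total / window_s,
        "success_rate": (total - errors) / total if total else 1.0,
        "error_rate": errors / total if total else 0.0,
        "avg_latency_ms": avg_ms,
    }


sample = [
    Request(200, 12.0),
    Request(200, 18.0),
    Request(500, 40.0),
    Request(200, 10.0),
]
metrics = work_metrics(sample, window_s=2.0)
```

Each value describes the useful output of the service as a whole, which is why these numbers are a natural first thing to graph and alert on.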
  7. DIFFERENT KINDS OF MONITORED DATA ▸ Resource metrics and events to investigate
  8. A FEW EXAMPLES ▸ WORK: web.latency, web.errors, api.calls, downloads, dog.swipes ▸ RESOURCE: system.load, disk.used, db.queries, cache.latency, net.sent ▸ EVENTS: new release, commits, 3rd party down, tweets
  9. DASHBOARDS FOR EVERYTHING

  10. MILESTONE #2 ▸ Our app is now live ▸ First users… ▸ First outage!
  11. ALERTS BEFORE IT IS TOO LATE ▸ Create alerts on key metrics ▸ Do it on work metrics first ▸ Record, notify or page depending on the severity
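The "record, notify or page" escalation can be sketched as a threshold check on a work metric. This is an illustrative sketch only: the `Action` enum, the `alert_action` function, and the 1% and 5% error-rate thresholds are assumptions, not values from the talk.

```python
from enum import Enum


class Action(Enum):
    RECORD = "record"   # low severity: keep it for later analysis
    NOTIFY = "notify"   # medium severity: email or chat message
    PAGE = "page"       # high severity: wake someone up


def alert_action(error_rate: float, warn: float = 0.01, critical: float = 0.05) -> Action:
    """Map an error rate (a work metric) to a response by severity.

    The default thresholds are hypothetical and should be tuned per service.
    """
    if error_rate >= critical:
        return Action.PAGE
    if error_rate >= warn:
        return Action.NOTIFY
    return Action.RECORD


actions = [alert_action(rate) for rate in (0.001, 0.02, 0.10)]
```

Keying the check on a work metric rather than a resource metric means someone is paged when users are affected, not merely when a machine is busy.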
  12. SYSTEMATIC INVESTIGATION ▸ From work metrics, dig into resource metrics ▸ Did something change? Check events
  13. MILESTONE #3 ▸ Getting big ▸ 100+ servers ▸ 10k+ users
  14. MONITORING HAS TO SCALE TOO ▸ Host-centric doesn’t work ▸ Need to focus on services
  15. MULTIPLE DIMENSIONS WITH TAGS

  16. EXTRA SPATIAL DIMENSIONS

  17. WHAT WE CALL A METRIC

  18. USE TAGS, EVERYWHERE ▸ Graph and alert on tags ▸ To scope and to aggregate ▸ Web latency per region < 50 ms on prod
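A check such as "web latency per region < 50 ms on prod" combines both uses of tags: scoping (only `env:prod`) and aggregating (one value per `region`). The sketch below assumes a hypothetical in-memory list of tagged points and a made-up `avg_by_tag` helper; a real monitoring backend would expose this through its own query language.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical data points: (metric name, value in ms, tags).
points = [
    ("web.latency", 32.0, {"env": "prod", "region": "eu"}),
    ("web.latency", 40.0, {"env": "prod", "region": "eu"}),
    ("web.latency", 55.0, {"env": "prod", "region": "us"}),
    ("web.latency", 65.0, {"env": "prod", "region": "us"}),
    ("web.latency", 120.0, {"env": "staging", "region": "us"}),
]


def avg_by_tag(points, metric, scope, group_by):
    """Average a metric per value of `group_by`, keeping only points matching `scope`."""
    groups = defaultdict(list)
    for name, value, tags in points:
        if name != metric:
            continue
        # Scope: every tag in `scope` must match exactly.
        if any(tags.get(key) != wanted for key, wanted in scope.items()):
            continue
        # Aggregate: bucket the value under its group_by tag.
        groups[tags.get(group_by, "untagged")].append(value)
    return {key: mean(values) for key, values in groups.items()}


per_region = avg_by_tag(points, "web.latency", scope={"env": "prod"}, group_by="region")
# Regions that break the "< 50 ms" target.
breaches = {region: ms for region, ms in per_region.items() if ms >= 50.0}
```

The staging point is excluded by the scope, and the alert fires per region rather than per host, so it keeps working as servers come and go.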
  19. TAG WITH ANYTHING USEFUL ▸ Service ▸ Version ▸ Environment ▸ User ID ▸ API endpoint ▸ Dog breed
  20. SLICE AND DICE ▸ Same metric, different angles

  21. MILESTONE #4 ▸ More hosts! ▸ More services! ▸ Containers!

  22. METRIC CARDINALITY ▸ Don’t get lost ▸ Keep using tags, everywhere ▸ Stay focused on work metrics ▸ Monitoring gets sharper with experience
  23. METRIC AGGREGATION ▸ Watch min/avg/max ▸ Watch percentiles and outliers ▸ How fast is my app for the 95th percentile?
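To see why percentiles matter alongside min/avg/max, here is a small sketch using the nearest-rank percentile definition (one of several common definitions, chosen here for simplicity). The `percentile` helper and the latency samples are made up for illustration.

```python
def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    # Integer ceiling of p * n / 100, avoiding floating-point rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical request latencies in ms: mostly fast, with a slow tail.
latencies = [10.0] * 17 + [200.0, 250.0, 900.0]

summary = {
    "min": min(latencies),
    "avg": sum(latencies) / len(latencies),
    "max": max(latencies),
    "p50": percentile(latencies, 50),
    "p95": percentile(latencies, 95),
}
```

On this sample the average (76 ms) matches no real request: half of the requests take 10 ms, while the 95th percentile sits at 250 ms and the worst case at 900 ms. Watching several aggregations together exposes the slow tail that a single average hides.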
  24. NOW YOU HAVE ALL THE TOOLS ▸ Happy monitoring!