Uber-scale monitoring for everyone with m3_Matt Schallert_Codemotion Berlin 2019

Uber-Scale Monitoring for Everyone with M3
Matt Schallert - Chronosphere
12-13 November, 2019
@mattschallert

2 Background
• Then: Senior SRE @ Uber
  ◦ 3 years on Observability team
  ◦ Scaled metrics + monitoring stack
  ◦ Helped release M3, OSS metrics platform
• Now: Engineer at Chronosphere

3 Agenda
• Monitoring over the years
• New challenges
• How M3 addresses them
• How to use M3
• Summary

4 Definitions
• Metric
  ◦ Measurement emitted by your application / infrastructure
  ◦ “Request latency”, “# of errors”, “CPU usage”
  ◦ Often stored in a database for analysis
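To make the definition concrete, here is a minimal sketch of what a single metric sample might look like in code. The `MetricSample` type and `make_sample` helper are hypothetical names chosen for illustration; they are not part of M3 or Prometheus.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple
import time


@dataclass(frozen=True)
class MetricSample:
    """One measurement emitted by an application or host (hypothetical type)."""
    name: str                                # e.g. "request_latency_seconds"
    value: float                             # the measured value
    timestamp: float                         # Unix time in seconds
    tags: Tuple[Tuple[str, str], ...] = ()   # sorted (key, value) pairs


def make_sample(name: str, value: float,
                tags: Optional[Dict[str, str]] = None,
                timestamp: Optional[float] = None) -> MetricSample:
    """Build a sample; tags are sorted so equal tag sets compare equal."""
    ts = time.time() if timestamp is None else timestamp
    sorted_tags = tuple(sorted((tags or {}).items()))
    return MetricSample(name=name, value=float(value),
                        timestamp=ts, tags=sorted_tags)


sample = make_sample("request_latency_seconds", 0.231,
                     {"service": "checkout", "region": "eu"},
                     timestamp=1573516800.0)
```

Sorting the tags gives each series a stable identity, which is what a database would typically key on when storing samples for later analysis.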

5 Definitions
• Monitoring
  ◦ Continuously checking data (logs, metrics)
  ◦ Take action when something is “wrong”
  ◦ Alerting often done on aggregates, metrics most efficient
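As a rough illustration of alerting on an aggregate rather than on individual events, the sketch below computes an error ratio over a time window and compares it to a threshold. The function names and the 5% default threshold are assumptions made for this example only.

```python
from typing import Iterable, Tuple


def error_ratio(events: Iterable[Tuple[float, bool]],
                window_start: float, window_end: float) -> float:
    """Fraction of events in [window_start, window_end) that were errors.

    Each event is a (timestamp, is_error) pair. Returns 0.0 for an
    empty window so that "no traffic" does not trigger an alert here.
    """
    in_window = [is_error for ts, is_error in events
                 if window_start <= ts < window_end]
    if not in_window:
        return 0.0
    return sum(in_window) / len(in_window)


def should_alert(ratio: float, threshold: float = 0.05) -> bool:
    """Fire when the aggregated error ratio exceeds the threshold."""
    return ratio > threshold


events = [(0, False), (1, True), (2, False), (3, False), (100, True)]
ratio = error_ratio(events, window_start=0, window_end=10)
alerting = should_alert(ratio)
```

Checking one pre-aggregated number per window is much cheaper than scanning raw logs on every evaluation, which is one reason metrics tend to be the most efficient input for alerting.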

6 Why Monitoring?

7 Why Monitoring?

8 Monitoring has changed

9 Monitoring has changed Infrastructure has changed

10 Infrastructure Has Changed
Before:
• Knowing if your application was healthy used to be simpler
Now:
• Applications have become more complex

11 Infrastructure Has Changed
Before:
• Applications had fewer moving parts
• Health often determined by “is this process up” and status codes
Now:
• Microservices!

12 Infrastructure Has Changed
Before:
• VMs + Nagios = Monitoring?
Now:
• Cloud native, containers, serverless
• More managed services

13 Infrastructure Has Changed
Before:
• Lower user expectations
• Smaller infrastructure footprint
Now:
• Apps generate more data, have more features
• More zones, regions, providers

14 How has monitoring accounted for this?
• Most common: Prometheus
  ◦ All-in-one: gather, query, alert
  ◦ Easy to get started
• Ecosystem: client emission + server-side collection
  ◦ https://prometheus.io/docs/instrumenting/exporters/
  ◦ JMX exporter, SNMP, etc.
• Discovers your applications

15 Prometheus: Limitations
• Prometheus is a great single-node solution
  ◦ (Should almost always start with it)
• Long term + scale-out storage are out of scope
• Similar constraints with other popular solutions

16 Challenges of Single-Node

21 M3
• Open source metrics platform
• Built to meet Uber’s scaling needs
  ◦ Hundreds of millions of samples per second
  ◦ Billions of timeseries stored
  ◦ Billions of datapoints queried per second

22 What is M3?
[Architecture diagram. Labels: Cloud Region #0 ... Cloud Region #N; M3 Query; Aggregation; M3 Coordinator; Prometheus; Graphite; Grafana (PromQL, Graphite); Alerting Engines; M3DB (three nodes); M3]

23 M3: Scale-Out Platform
• M3DB shards + replicates data
• More instances ⇨ more capacity
• Losing an instance doesn’t impact monitoring
• “Single pane of glass”
  ◦ Query all your data from one place
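The idea of sharding plus replication can be sketched as follows. This is a simplified illustration, not M3DB's actual hashing or placement logic: the MD5-based hash, the round-robin placement, and the instance names are all stand-ins chosen so the example runs with the standard library.

```python
import hashlib
from typing import List, Set


def shard_for(series_id: str, num_shards: int) -> int:
    """Map a series ID to a shard deterministically (stand-in hash)."""
    digest = hashlib.md5(series_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def replicas_for(shard: int, instances: List[str],
                 replication_factor: int) -> List[str]:
    """Place a shard on `replication_factor` distinct instances, round-robin."""
    if replication_factor > len(instances):
        raise ValueError("replication factor exceeds instance count")
    return [instances[(shard + i) % len(instances)]
            for i in range(replication_factor)]


def surviving_replicas(shard: int, instances: List[str],
                       replication_factor: int, down: Set[str]) -> List[str]:
    """Replicas of a shard that are still reachable when `down` are lost."""
    return [inst for inst in replicas_for(shard, instances, replication_factor)
            if inst not in down]


instances = ["m3db-0", "m3db-1", "m3db-2", "m3db-3"]  # hypothetical names
shard = shard_for("http_requests_total{service=checkout}", num_shards=64)
owners = replicas_for(shard, instances, replication_factor=3)
still_up = surviving_replicas(shard, instances, 3, down={"m3db-1"})
```

With three replicas per shard, losing any single instance still leaves at least two copies of every shard, which is why one failed node need not interrupt monitoring. Adding instances spreads the same shards over more machines, which is the "more instances, more capacity" point.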

24 M3: Scale-Out Platform
• Simplify operations
  ◦ No manual sharding of data
• Ingest once, store however you want
  ◦ Short term, high-resolution data (performance investigations)
  ◦ Long term, aggregated data (capacity planning)
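"Ingest once, store however you want" amounts to keeping raw high-resolution points for a short period while also writing a downsampled copy for long-term retention. The sketch below shows one possible downsampling step, averaging points into fixed-width buckets. The function name and the choice of mean as the aggregation are assumptions for illustration; a real deployment would configure this in the platform rather than in application code.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def downsample_mean(points: Iterable[Tuple[int, float]],
                    resolution_s: int) -> List[Tuple[int, float]]:
    """Average (timestamp, value) points into buckets of `resolution_s` seconds.

    Each output timestamp is the start of its bucket.
    """
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in points:
        bucket_start = ts - (ts % resolution_s)
        buckets[bucket_start].append(value)
    return [(start, sum(values) / len(values))
            for start, values in sorted(buckets.items())]


# 10-second samples (short-term, high resolution) ...
raw = [(0, 1.0), (10, 3.0), (50, 2.0), (60, 10.0), (70, 20.0)]
# ... reduced to 1-minute averages (long-term, aggregated).
per_minute = downsample_mean(raw, resolution_s=60)
```

The raw series suits performance investigations, while the aggregated series is far smaller and better suited to capacity planning over months or years.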

25 M3: Scale-Out Platform

29 M3: Flexible Deployment Model
• M3 makes few assumptions about how it’s deployed

30 Single-Zone

31 Single-Zone

32 Multi-Zone

33 Multi-Zone

34 Multi-Region

35 M3: Flexible Deployment Model
• Cloud: single AZ, multi-AZ, multi-region
• On-premise
• Kubernetes support
  ◦ M3DB Operator
  ◦ Translate user configurations into one of various deployment methods

36 M3 on Kubernetes

37 M3 on Kubernetes

38 M3: Native Integrations

39 M3: Native Integrations
• Support Prometheus and Carbon ingestion
• PromQL and Graphite querying
• Keep single pane of glass
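Because M3 supports PromQL querying, existing tooling that speaks the Prometheus HTTP API can in principle be pointed at it. The sketch below only builds a range-query URL; it does not send a request. The base URL `http://localhost:7201` and the `/api/v1/query_range` path are assumptions about a local M3 coordinator setup and should be checked against the M3 documentation for your version.

```python
from urllib.parse import urlencode


def query_range_url(base_url: str, promql: str,
                    start: int, end: int, step: str) -> str:
    """Build a Prometheus-style range query URL (no request is sent)."""
    params = urlencode({
        "query": promql,   # PromQL expression
        "start": start,    # Unix seconds
        "end": end,        # Unix seconds
        "step": step,      # resolution, e.g. "15s"
    })
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


# Assumed local coordinator address; adjust for a real deployment.
url = query_range_url("http://localhost:7201",
                      "rate(http_requests_total[5m])",
                      start=1573516800, end=1573520400, step="15s")
```

The same URL shape works against a plain Prometheus server, which is what lets dashboards keep a single query language while the storage behind them changes.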

40 Value: Single Metrics Platform

47 Single Metrics Platform
• Open your data to your entire organization
  ◦ Product teams can monitor feature rollouts
  ◦ Operations teams can monitor business metrics
• Can monitor and link failures across different domains

48 Single Metrics Platform

49 Single Metrics Platform
• Richer context
• Ex: more robust capacity planning
  ◦ Service-to-service metrics collocated with infrastructure utilization
  ◦ Can feed in business forecasting
• Long-term (1-5 years) historical lookbacks
  ◦ “Am I getting more efficient at running my business?”

50 How Can I Use M3?
• Entirely OSS
• Determine your needs + how to best leverage it
• Prometheus, Graphite integrations out of the box
• Building blocks for higher level use cases (designed as a platform)
  ◦ Ex: capacity planning

51 Summary
• Applications + monitoring requirements have changed
• M3 helps address needs, provides scalable metrics solution for globally distributed applications
• Single metrics platform, leverage data beyond just monitoring

52 Thank You! (+ Q&A)
• m3db.io
• github.com/m3db/m3
• chronosphere.io
