Codemotion
November 12, 2019

Uber-scale monitoring for everyone with m3_Matt Schallert_Codemotion Berlin 2019

The way we run software has changed dramatically over the past few years, and so has the way we monitor our applications. With the advantages of the cloud and other new technologies comes complexity that our monitoring systems must account for. In this talk, Matt will discuss the evolution of monitoring systems that led Uber to build M3, an open-source metrics platform. The talk will show how the community can use M3 to leverage Uber’s years of experience monitoring complex, globally distributed systems, and how M3 integrates with existing tools such as Prometheus and Graphite.

About: Matt Schallert, Senior Software Engineer - Chronosphere

Matt is a Senior Software Engineer at Chronosphere and works on M3, an open source metrics platform. Recently, his efforts have been focused on improving the operational experience for users of M3. Previously, Matt was a Senior Site Reliability Engineer at Uber where he helped launch M3, and prior to that he was an SRE at Tumblr. In his spare time, Matt can be found hiking, skiing, and building data centers in his apartment.

Transcript

  1. Background (@mattschallert)
     • Then: Senior SRE @ Uber
       ◦ 3 years on Observability team
       ◦ Scaled metrics + monitoring stack
       ◦ Helped release M3, OSS metrics platform
     • Now: Engineer at Chronosphere
  2. Agenda
     • Monitoring over the years
     • New challenges
     • How M3 addresses them
     • How to use M3
     • Summary
  3. Definitions
     • Metric
       ◦ Measurement emitted by your application / infrastructure
       ◦ “Request latency”, “# of errors”, “CPU usage”
       ◦ Often stored in a database for analysis
  4. Definitions
     • Monitoring
       ◦ Continuously checking data (logs, metrics)
       ◦ Take action when something is “wrong”
       ◦ Alerting often done on aggregates, metrics most efficient
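
To make "alerting on aggregates" concrete, here is a minimal Python sketch of a check that fires on the aggregate error ratio over a sliding window rather than on any single sample. The class name, the window size, and the threshold are hypothetical choices for illustration; they are not taken from the talk or from any particular alerting engine.

```python
from collections import deque
from typing import Deque, Tuple


class ErrorRateAlert:
    """Fires when the aggregate error ratio over the last `window`
    intervals exceeds `threshold`. Hypothetical helper for illustration."""

    def __init__(self, window: int, threshold: float) -> None:
        # Each entry is (requests, errors) for one reporting interval.
        self.samples: Deque[Tuple[int, int]] = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, requests: int, errors: int) -> bool:
        """Record one interval and return True if the alert should fire."""
        self.samples.append((requests, errors))
        total_requests = sum(r for r, _ in self.samples)
        total_errors = sum(e for _, e in self.samples)
        if total_requests == 0:
            return False  # no traffic in the window, nothing to alert on
        return total_errors / total_requests > self.threshold
```

Because the decision uses the summed counts for the whole window, one noisy interval only fires the alert if it is large enough to move the aggregate past the threshold.
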
  5. Infrastructure Has Changed
     • Then: Knowing if your application was healthy used to be simpler
     • Now: Applications have become more complex
  6. Infrastructure Has Changed
     • Then: Applications had fewer moving parts
     • Then: Health often determined by “is this process up” and status codes
     • Now: Microservices!
  7. Infrastructure Has Changed
     • Then: VMs + Nagios = Monitoring?
     • Now: Cloud native, containers, serverless
     • Now: More managed services
  8. Infrastructure Has Changed
     • Then: Lower user expectations
     • Then: Smaller infrastructure footprint
     • Now: Apps generate more data, have more features
     • Now: More zones, regions, providers
  9. How has monitoring accounted for this?
     • Most common: Prometheus
       ◦ All-in-one: gather, query, alert
       ◦ Easy to get started
     • Ecosystem: client emission + server-side collection
       ◦ https://prometheus.io/docs/instrumenting/exporters/
       ◦ JMX exporter, SNMP, etc.
     • Discovers your applications
  10. Prometheus: Limitations
      • Prometheus is a great single-node solution
        ◦ (Should almost always start with it)
      • Long term + scale-out storage are out of scope
      • Similar constraints with other popular solutions
  11. M3
      • Open source metrics platform
      • Built to meet Uber’s scaling needs
        ◦ Hundreds of millions of samples per-second
        ◦ Billions of timeseries stored
        ◦ Billions of datapoints queried per-second
  12. What is M3?
      [Architecture diagram. Labels: Cloud Region #0 through Cloud Region #N; Prometheus; Graphite; M3 Coordinator; Aggregation; M3DB (three instances); M3 Query; Grafana (PromQL, Graphite); Alerting Engines]
  13. M3: Scale-Out Platform
      • M3DB shards + replicates data
      • More instances ⇨ more capacity
      • Losing an instance doesn’t impact monitoring
      • “Single pane of glass”
        ◦ Query all your data from one place
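
The idea that sharding plus replication lets a cluster lose an instance without losing access to data can be sketched in a few lines of Python. This is an illustrative model only, not M3DB's actual placement algorithm: the shard count, replication factor, node names, round-robin assignment, and the use of CRC32 as the hash are all assumptions made for brevity, and real systems typically layer consistency rules on top.

```python
import zlib
from typing import List, Set

NUM_SHARDS = 8          # hypothetical shard count
REPLICATION_FACTOR = 3  # hypothetical number of copies per shard
NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical instances


def shard_for(series_id: str) -> int:
    """Map a series ID to a shard with a stable hash (CRC32, for brevity)."""
    return zlib.crc32(series_id.encode("utf-8")) % NUM_SHARDS


def replicas_for(shard: int) -> List[str]:
    """Assign a shard to REPLICATION_FACTOR distinct nodes, round-robin."""
    return [NODES[(shard + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]


def readable(series_id: str, down: Set[str]) -> bool:
    """In this model a series is readable while any replica of its shard is up."""
    return any(node not in down for node in replicas_for(shard_for(series_id)))
```

With these settings every shard lives on three of the four nodes, so `readable("cpu.usage", {"node-b"})` stays true for any series when a single node is down. Adding nodes to `NODES` spreads the same shards over more machines, which is the "more instances, more capacity" property in miniature.
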
  14. M3: Scale-Out Platform
      • Simplify operations
        ◦ No manual sharding of data
      • Ingest once, store however you want
        ◦ Short term, high-resolution data (performance investigations)
        ◦ Long term, aggregated data (capacity planning)
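
"Ingest once, store however you want" amounts to keeping the raw, high-resolution datapoints for a short time while also storing a coarser aggregate for long-term use. The Python sketch below shows one simple way to derive the coarser series by averaging into fixed time buckets. The function name, the choice of averaging, and the resolutions are assumptions for illustration; M3's own aggregation options are not described in the slides.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Datapoint = Tuple[int, float]  # (unix timestamp in seconds, value)


def downsample(points: List[Datapoint], resolution: int) -> List[Datapoint]:
    """Average datapoints into buckets of `resolution` seconds.

    Each output datapoint is stamped with the start of its bucket.
    """
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % resolution].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


# Raw samples every 10 seconds, rolled up to one datapoint per minute.
raw = [(0, 1.0), (10, 3.0), (20, 5.0), (60, 2.0), (70, 4.0)]
per_minute = downsample(raw, 60)  # [(0, 3.0), (60, 3.0)]
```

The raw list serves short-term performance investigations, while the per-minute list is far smaller and suits long lookbacks such as capacity planning.
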
  15. M3: Flexible Deployment Model
      • Cloud: single AZ, multi-AZ, multi-region
      • On-premise
      • Kubernetes support
        ◦ M3DB Operator
        ◦ Translate user configurations into one of various deployment methods
  16. M3: Native Integrations
      • Support Prometheus and Carbon ingestion
      • PromQL and Graphite querying
      • Keep single pane of glass
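
To illustrate what the two ingestion styles look like on the wire, the Python sketch below renders the same measurement as a Carbon plaintext line (dotted path, value, timestamp) and as a sample in the Prometheus text exposition format (name plus labels). The helper names and the example metric are hypothetical, label values are not escaped, and nothing here is M3-specific code; it only shows the two data models a platform has to reconcile.

```python
from typing import Dict


def to_carbon_line(path: str, value: float, timestamp: int) -> str:
    """Carbon plaintext protocol: metric path, value, unix timestamp, newline."""
    return f"{path} {value} {timestamp}\n"


def to_prometheus_sample(name: str, labels: Dict[str, str], value: float) -> str:
    """Prometheus text exposition format: name{label="value",...} value.

    Labels are sorted for stable output; label values are assumed to need
    no escaping in this sketch.
    """
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"


carbon = to_carbon_line("api.eu.request_latency", 0.25, 1573516800)
prom = to_prometheus_sample(
    "request_latency_seconds", {"service": "api", "region": "eu"}, 0.25
)
```

Graphite encodes dimensions positionally in the dotted path, while Prometheus carries them as named labels, which is why each ecosystem has its own query language.
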
  17. Single Metrics Platform
      • Open your data to your entire organization
        ◦ Product teams can monitor feature rollouts
        ◦ Operations teams can monitor business metrics
      • Can monitor and link failures across different domains
  18. Single Metrics Platform
      • Richer context
      • Ex: more robust capacity planning
        ◦ Service-to-service metrics collocated with infrastructure utilization
        ◦ Can feed in business forecasting
      • Long-term (1-5 years) historical lookbacks
        ◦ “Am I getting more efficient at running my business?”
  19. How Can I Use M3?
      • Entirely OSS
      • Determine your needs + how to best leverage it
      • Prometheus, Graphite integrations out of the box
      • Building blocks for higher level use cases (designed as a platform)
        ◦ Ex: capacity planning
  20. Summary
      • Applications + monitoring requirements have changed
      • M3 helps address needs, provides scalable metrics solution for globally distributed applications
      • Single metrics platform, leverage data beyond just monitoring