
Metrics at Uber, Monitorama 2018

Prateek Rungta

June 07, 2018

Transcript

  1. Metrics at Uber. Prateek Rungta (@prateekrungta), Engineer, M3 Team. Learnings, a few neat Observability Patterns, and our OSS metrics platform.
  2. Uber’s Architecture & Metrics
     - ~4K microservices
     - Central Observability platform; focus on Metrics today
     - Tracing: Yuri’s talk about Jaeger, Monitorama 2017
     - Used for all manner of things:
       - Capacity planning using system metrics (e.g. load average)
       - Real-time alerting using application metrics (e.g. p99 response time for ride requests)
       - Tracking business metrics (e.g. number of UberX riders in Portland)
       - … and plenty more …
  3. Developers! Developers! Developers!

     func myRPCHandler(param int, m MetricScope) {
         t := m.Timer("latency").Start()
         responseCode := client.Call(param)
         t.Stop()
         m.Tagged(map[string]string{"code": responseCode}).Counter("response").Inc(1)
     }
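     The MetricScope on the slide resembles a tally.Scope from Uber's open-source github.com/uber-go/tally library. A minimal sketch of the same instrumentation pattern using that library, assuming tally and treating the handler body (doWork, the status code) as illustrative stand-ins:

     package main

     import (
         "strconv"
         "time"

         "github.com/uber-go/tally"
     )

     func main() {
         // Root scope; in production this would be wired to an M3/StatsD
         // reporter. Buffered metrics are flushed every second.
         scope, closer := tally.NewRootScope(tally.ScopeOptions{Prefix: "myservice"}, time.Second)
         defer closer.Close()

         // Time a piece of work ...
         sw := scope.Timer("latency").Start()
         statusCode := doWork() // hypothetical stand-in for client.Call
         sw.Stop()

         // ... and count responses, tagged by response code.
         scope.Tagged(map[string]string{"code": strconv.Itoa(statusCode)}).
             Counter("response").Inc(1)
     }

     func doWork() int { return 200 }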
  4. “Golden Signals” - usually you want the same telemetry
     - SRE Book: Latency, traffic, errors, and saturation
     - USE Method: Utilisation, saturation, and errors
     - RED Method: Rate, errors, and duration
     - Shout out to Baron Schwartz’s work: video
  5. End User Code (“Biz logic”) / RPC / Storage (C*/Redis/…) / ...
     Library owners:
     - Dashboard panel template = f(serviceName)
     - Ensure the library emits metrics following the given template
     Application devs:
     - Service uses the library
     - Provide “serviceName” at time of generation
     (see the sketch below)
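     A minimal sketch of the pattern above, assuming the tally-style scope from slide 3; the RPCClient type and its metric names are hypothetical. The point is that the library fixes the metric template, so dashboard panels can be generated purely as a function of serviceName:

     package rpcmetrics

     import "github.com/uber-go/tally"

     // RPCClient is a hypothetical library type. Every metric it emits follows
     // the same template, parameterized only by the service name, so library
     // owners can ship a dashboard panel template = f(serviceName).
     type RPCClient struct {
         latency tally.Timer
         errors  tally.Counter
     }

     // Application devs only provide serviceName (and a root scope) when they
     // instantiate the library.
     func NewRPCClient(serviceName string, root tally.Scope) *RPCClient {
         scope := root.Tagged(map[string]string{"service": serviceName}).SubScope("rpc")
         return &RPCClient{
             latency: scope.Timer("latency"),
             errors:  scope.Counter("errors"),
         }
     }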
  6. Auto Alerting
     - Grafana : Dynamic Dashboard :: Manually Configured Alerts : ?
     - E.g. detect anomalies in latency per RPC endpoint
  7. Scale - Ingress: ~6B unique metric IDs (random week when I was making these slides)
  8. Scale - Egress: ~2.2K queries per second (9K Grafana dashboards, 150K real-time alerts) (random week when I was making these slides)
  9. Constantly growing
     - Persisted metrics: 20% uptick in the last quarter
     - Unique IDs: 50% uptick in the last half year
     - QPS: 100% uptick in the last year
     - Ingress traffic: 900x in the last 3 years
  10. A brief history of M3
     - 2014-2015: Graphite
       - No replication, operations were ‘cumbersome’
     - 2015-2016: Cassandra
       - 16x YoY growth
       - Expensive (>1500 Cassandra hosts)
       - “Technology Telemetry company”
       - Compactions ⇒ RF=2 ⇒ repairs too slow
     - 2016-Today: M3DB
  11. M3DB: an open source distributed time series database
     - Store datapoints of arbitrary timestamp precision at any resolution, for any retention
     - Optimized file-system storage with no need for compactions
     - Replicated with zone/rack-aware layout and configurable replication factor
     - Strongly consistent cluster membership backed by etcd
     - Fast streaming for node add/replace/remove by selecting the best peer for a series, while also repairing any mismatching series at time of streaming
  12. M3TSZ Overview
     - m3tsz = tsz + improvements (illustrated below)
     - More details to follow in a blog; for the curious: https://github.com/m3db/m3db/tree/master/src/dbnode/encoding/m3tsz

                                         TSZ      M3TSZ    Improvement
     Number of bytes / datapoint         2.42     1.45     40%
     Compression ratio                   6.56x    11x      40%
     Encoding time (ns) / datapoint      338      298      12%
     Decoding time (ns) / datapoint      347      300      14%

     These results apply the two different algorithms to Uber’s production data.
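     A small illustration (not M3DB code) of the XOR trick at the heart of TSZ-style float compression: consecutive, similar values XOR to words dominated by zero bits, which encode in very few bits; m3tsz layers further improvements on top of this scheme.

     package main

     import (
         "fmt"
         "math"
     )

     func main() {
         prev := math.Float64bits(42.0)
         curr := math.Float64bits(42.5)
         // Similar consecutive values share sign, exponent and most mantissa
         // bits, so their XOR has long runs of zeros (cheap to encode).
         fmt.Printf("%064b\n", prev^curr)
     }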
  13. M3TSZ Impact
     - Data volumes at time of migration (end of 2016):
       - Disk usage ~1.4PB for Cassandra at RF=2
       - Disk usage ~200TB for M3DB at RF=3
  14. Persistence
     - For each incoming write:
       - Data is stored in memory in compressed ‘n’-hour blocks
       - Data is appended to a commit log on disk (think WAL)
     - We periodically write the compressed blocks to disk as immutable fileset files (think snapshot files)
     (a sketch of this write path follows below)
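     A minimal sketch of that write path; the types and method names here are hypothetical illustrations of the description above, not M3DB's actual internals:

     package m3sketch

     import (
         "fmt"
         "os"
         "time"
     )

     // seriesBlock stands in for an in-memory, m3tsz-compressed 'n'-hour block.
     type seriesBlock struct{}

     func (b *seriesBlock) append(ts time.Time, v float64) { /* compress + buffer */ }

     type database struct {
         commitLog  *os.File                // append-only WAL on disk
         openBlocks map[string]*seriesBlock // open in-memory blocks keyed by series ID
     }

     func (d *database) Write(id string, ts time.Time, value float64) error {
         // 1. Append to the commit log so the write survives a process crash.
         if _, err := fmt.Fprintf(d.commitLog, "%s %d %g\n", id, ts.UnixNano(), value); err != nil {
             return err
         }
         // 2. Buffer the datapoint in the open in-memory block for this series.
         blk, ok := d.openBlocks[id]
         if !ok {
             blk = &seriesBlock{}
             d.openBlocks[id] = blk
         }
         blk.append(ts, value)
         return nil
     }

     // flush would run every 'n' hours, writing sealed blocks to disk as
     // immutable fileset files, after which covered commit logs can be removed.
     func (d *database) flush() error { return nil }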
  15. Layout on Disk (time →)
     - /var/lib/m3db/commitlogs/               a sequence of commit log files
     - /var/lib/m3db/data/namespace-a/shard-0  fileset file blocks (data)
     - /var/lib/m3db/index/namespace-a         index fileset file blocks
     Over the same span of time, commit log files cover the shortest windows, data fileset blocks longer ones, and index fileset blocks the longest.
  16. Fileset Files
     - Data is flushed from memory to disk every ‘n’ hours as block filesets
     - Two flavours:
       - Data fileset blocks contain compressed time-series data (m3tsz)
       - Index fileset blocks contain compressed reverse-indexing data (FSTs, postings lists, etc.)
     - Expired block filesets are periodically cleaned up in the background
  17. Commit Log
     - Uncompressed
     - Supports sync and async writes
       - Async for performance: buffer in memory & periodically flush batches (see the sketch below)
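     A minimal sketch of an async, batched commit log writer in that spirit; this is an illustration under the assumptions above, not M3DB's actual commit log implementation:

     package commitlog

     import (
         "bufio"
         "os"
         "time"
     )

     type commitLog struct {
         writes chan []byte
     }

     // newCommitLog starts a background goroutine that buffers queued entries
     // in memory and flushes them to disk in batches every flushEvery.
     func newCommitLog(f *os.File, flushEvery time.Duration) *commitLog {
         cl := &commitLog{writes: make(chan []byte, 4096)}
         go func() {
             buf := bufio.NewWriter(f)
             ticker := time.NewTicker(flushEvery)
             defer ticker.Stop()
             for {
                 select {
                 case entry := <-cl.writes:
                     buf.Write(entry) // buffer in memory
                 case <-ticker.C:
                     buf.Flush() // periodically flush the batch to disk
                 }
             }
         }()
         return cl
     }

     // WriteAsync queues an entry and returns immediately; durability is only
     // guaranteed after the next flush. A sync write would instead wait for
     // the flush before returning.
     func (cl *commitLog) WriteAsync(entry []byte) { cl.writes <- entry }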
  18. Topology & Consistency
     - Strongly consistent topology (using etcd)
     - Consistency managed via synchronous quorum writes and reads
     - Configurable consistency level
     - No hinted hand-off
     - Nodes bootstrap from peers at startup / topology change
     (quorum rule sketched below)
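     The quorum rule implied above, as a small illustration (the function name is ours, not M3DB's API): with replication factor RF, a majority-level write or read needs RF/2 + 1 acknowledgements.

     package quorum

     // majority returns the number of replica acknowledgements required for a
     // quorum read or write at a given replication factor.
     func majority(replicationFactor int) int {
         return replicationFactor/2 + 1
     }

     // e.g. majority(3) == 2: a quorum write succeeds once 2 of the 3 replicas
     // have acknowledged it, and a quorum read consults at least 2 replicas.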
  19. M3DB Impact
     - Increased replication: 2 -> 3x replication factor
     - Read performance improvements (p50 / p95 / p99):
       - C*:   8ms / 270ms / 500ms
       - M3DB: 0.2ms / 0.35ms / 5ms
     - Cheaper(!)
  20. What does production look like today? (per region)
     Components in the picture: hosts running Collectors alongside client applications, an Ingester, an Aggregation Tier, an Indexer backed by ES 5.x, M3DB clusters, and a Query Service, with read and write caches.
  21. OSS

  22. Caveat Emptor: Index & Coordinator
     - Coordinator & Index used in smaller deployments
     - Feature work to use them with multiple-M3DB-cluster deployments (like Uber’s production usage)
     - Index read performance improvements
  23. Where
     - All development on: github.com/m3db/m3db
     - Apache v2 - Contributions welcome!
     - Documentation: http://bit.ly/m3db-docs
     - Reach us via: http://bit.ly/m3db-forums
  24. What’s to come
     - M3DB:
       - Look out for a blog post to drop in July
       - Ability to backfill data
       - Index performance + multi-clustered index
       - Graphite support for M3Coordinator
       - … and plenty more …
     - Aggregator: github.com/m3db/m3aggregator
       - Packaging, documentation, etc.
     - Query Engine (and Query Language)
     - … and plenty more …
  25. Thank you! @prateekrungta
     - Code: github.com/m3db/m3db
     - Docs: http://bit.ly/m3db-docs
     - Forum: http://bit.ly/m3db-forums
     - Slides: http://bit.ly/m3db-monitorama2018