
Shipping Metrics from the Edge


Computing is getting pushed to the edge: it may be your car, TV, washing machine, or your toaster. All these devices have a lot of computing power these days. While extending the cloud to the edge is being addressed by projects like KubeEdge or k3s, in this talk we take a closer look at how to run Prometheus on these devices. We want to configure Prometheus so that it replicates its data to a central collection point running Thanos on Kubernetes in a replicated setup, and then make use of all the shipped metrics to efficiently query across the entire fleet.

Matthias Loibl

November 20, 2019

Transcript

  1. @ThanosMetrics
    Shipping Metrics from the Edge
    Matthias Loibl, Software Engineer (Red Hat)
    [email protected]
    KubeCon + CNCF San Diego - November 19, 2019
    metalmatze


  2. @ThanosMetrics
    Matthias Loibl
    ▪ Software Engineer @ Red Hat
    ▪ OpenShift Monitoring Team
    ▪ OSS Contributor
    ○ Prometheus Operator
    ○ Thanos
    ○ gopass
    ▪ Meetup Organiser
    ○ Prometheus Berlin
    ▪ Hobbies
    ○ Playing Drums & Bass


  3. @ThanosMetrics
    Problem
    We have
    ● many air-gapped devices
    ● many applications monitored independently
    We want
    ● to know if a specific device has a problem
    ● to know how everything as a whole is doing
    ● to alert on problems on the device
    ● to alert on problems across multiple devices


  4. @ThanosMetrics
    Problem


  5. @ThanosMetrics
    Problem


  6. @ThanosMetrics
    Problem


  7. @ThanosMetrics
    Alerting
    Client
    ● Devices alert individually – they are provisioned with alerting rules upon deployment
    ● Configured to send alerts to the central Alertmanager, tagged with a device ID (see the sketch below)
    Server
    ● Runs a cluster of Alertmanagers
    ● Receives and de-duplicates alerts, then notifies according to the routing configuration
    No alerting across all devices yet :/
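    A minimal sketch of the per-device Prometheus configuration described above; the hostnames, the
    device_id label, and the rule file path are assumptions made up for illustration:

      # prometheus.yml on a device
      global:
        external_labels:
          device_id: device-0042          # hypothetical ID injected at provisioning time
      rule_files:
        - /etc/prometheus/rules/*.yaml    # alerting rules shipped with the device
      alerting:
        alertmanagers:
          - static_configs:
              - targets:
                  - alertmanager.example.com:9093   # central Alertmanager cluster

    The external_labels are attached to every alert, so the central Alertmanager can route and group
    by device.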


  8. @ThanosMetrics
    Alerting
    (diagram: devices sending alerts to the central Alertmanager cluster)


  9. @ThanosMetrics
    Idea #1
    ● Run Prometheus on each device
    ● Run reverse proxy on device for access or access via SSH


  10. @ThanosMetrics
    Idea #1


  11. @ThanosMetrics
    Idea #1
    ● Run Prometheus on each device
    ● Run reverse proxy on device for access or access via SSH
    Problems
    ● We expose things to the outside – might not even be possible
    ● No way of aggregating across devices easily
    ○ Could scrape them individually and then aggregate :/


  12. @ThanosMetrics
    Idea #2
    ● Run Prometheus on each device
    ● Run reverse proxy on device for access
    ● Run central Prometheus that federates from individual devices


  13. @ThanosMetrics
    Idea #2


  14. @ThanosMetrics
    Idea #2
    ● Run Prometheus on each device
    ● Run a reverse proxy on the device for access
    ● Run a central Prometheus that federates from the individual devices (see the sketch below)
    Improvements
    ● We can aggregate and query across all devices with the central Prometheus
    Problems
    ● We expose things to the outside – might not even be possible
    ● We might only be able to federate every so often and lose samples in between
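    A sketch of the central Prometheus federating from the devices, assuming they are reachable at
    the example addresses below; the match[] selector would be narrowed down in practice:

      # central prometheus.yml
      scrape_configs:
        - job_name: federate-devices
          honor_labels: true
          metrics_path: /federate
          params:
            'match[]':
              - '{job!=""}'               # pull all series exposed via /federate
          static_configs:
            - targets:
                - device-0001.example.com:9090
                - device-0002.example.com:9090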


  15. @ThanosMetrics
    Idea #3
    ● Run Prometheus on each device
    ● Run an app/client that collects metrics from Prometheus, then forwards them to a central place via some protocol


  16. @ThanosMetrics
    Idea #3
    (diagram: a sidecar next to each device’s Prometheus)


  17. @ThanosMetrics
    Idea #3
    ● Run Prometheus on each device
    ● Run an app/client that collects metrics from Prometheus, then forwards them to a central place via some protocol
    Improvements
    ● We don’t expose Prometheus to the outside
    Problems
    ● Collect metrics from Prometheus every 1 min
    ○ Depending on the scrape interval of the Prometheus targets, we lose samples
    ● Most likely something custom-built


  18. @ThanosMetrics
    Idea #4
    ● Run Prometheus on each device
    ● Run the Thanos sidecar and upload written blocks to some central object storage (see the sketch below)
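    A sketch of Idea #4: the Thanos sidecar runs next to the device’s Prometheus and uploads finished
    TSDB blocks to object storage. The paths, bucket name, endpoint, and credentials are placeholders:

      # thanos sidecar \
      #   --tsdb.path=/prometheus \
      #   --prometheus.url=http://localhost:9090 \
      #   --objstore.config-file=/etc/thanos/bucket.yaml

      # bucket.yaml
      type: S3
      config:
        bucket: edge-metrics
        endpoint: s3.example.com
        access_key: <access-key>
        secret_key: <secret-key>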


  19. @ThanosMetrics
    Idea #4
    (diagram: sidecars uploading to Object Storage, queried via Store and Query)


  20. @ThanosMetrics
    Idea #4
    ● Run Prometheus on each device
    ● Run the Thanos sidecar and upload written blocks to some central object storage
    Problems
    ● Metrics show up with a lag: by default it takes at least 2h until a block is written
    ● We need to expose the object storage to the devices (not too bad, but we can do better)


  21. @ThanosMetrics


  22. @ThanosMetrics
    Prometheus Remote Write
    ● A queue per destination
    ● Each queue reads from the write-ahead log (WAL)
    ● Sends requests to the configured endpoint
    ● Retries sending upon failure
    (diagram: the WAL feeding shards 1…n, each sending to an endpoint)
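    On the device, shipping samples continuously only needs a remote_write section in the Prometheus
    configuration. The URL and the queue numbers below are assumptions for illustration;
    /api/v1/receive is the endpoint exposed by Thanos Receive:

      # prometheus.yml on a device
      remote_write:
        - url: https://thanos-receive.example.com/api/v1/receive
          queue_config:
            capacity: 2500                # samples buffered per shard
            max_shards: 30
            max_samples_per_send: 500
            batch_send_deadline: 5s       # flush even if the batch is not full
            min_backoff: 30ms             # retry backoff on failed sends
            max_backoff: 100ms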


  23. @ThanosMetrics
    Prometheus Remote Write


  24. @ThanosMetrics
    Thanos Receive
    (diagram: Monitoring Cluster, Query)

  25. @ThanosMetrics
    Thanos Receive
    (diagram: Monitoring Cluster, Query)

  26. @ThanosMetrics
    Thanos Receive
    (diagram: Monitoring Cluster, Query)

  27. @ThanosMetrics
    Thanos Receive
    (diagram)

  28. @ThanosMetrics
    Thanos Receive
    (diagram)

  29. @ThanosMetrics
    Thanos Receive
    ● Hashes are calculated from the entire label set
    ● The tenant’s ID helps to distribute the load across receivers
    ● The hashing function is xxHash, the same one Prometheus uses
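    A rough sketch of one Thanos Receive replica on the monitoring cluster. The flag names follow
    the Thanos receive component and should be treated as assumptions to be checked against your
    Thanos version; endpoints and paths are placeholders:

      # container args in the thanos-receive StatefulSet
      args:
        - receive
        - --tsdb.path=/var/thanos/receive
        - --remote-write.address=0.0.0.0:19291            # devices point remote_write here
        - --objstore.config-file=/etc/thanos/bucket.yaml  # long-term storage for written blocks
        - --receive.replication-factor=3                  # replicate each series to 3 receivers
        - --receive.hashrings-file=/etc/thanos/hashrings.json
        - --receive.local-endpoint=thanos-receive-0.thanos-receive:10901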


  30. @ThanosMetrics
    Thanos Receive
    (diagram: Monitoring Cluster, Query)


  31. @ThanosMetrics
    Thanos Receive
    (diagram: Monitoring Cluster with Receive, Object Storage, Store, Query)


  32. @ThanosMetrics
    Alerting
    (diagram: Receive, Object Storage, Store, Query, Rule)


  33. @ThanosMetrics
    Thanos Receive Tenancy
    (diagram)


  34. @ThanosMetrics
    Thanos Receive Controller
    ● Maps tenants to individual hashrings / StatefulSets
    ○ Populates the endpoints in a ConfigMap
    ○ Every Thanos receiver mounts the same ConfigMap (see the sketch below)
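    A sketch of the hashring ConfigMap the controller generates; the hashring name, tenants, and
    endpoints are made-up examples. Every receiver mounts this file and uses it to route and
    replicate incoming remote-write requests:

      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: thanos-receive-hashrings
      data:
        hashrings.json: |
          [
            {
              "hashring": "edge-devices",
              "tenants": ["tenant-a", "tenant-b"],
              "endpoints": [
                "thanos-receive-0.thanos-receive:10901",
                "thanos-receive-1.thanos-receive:10901",
                "thanos-receive-2.thanos-receive:10901"
              ]
            }
          ]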


  35. @ThanosMetrics
    DEMO
    https://github.com/metalmatze/talks/


  36. Demo


  37. Demo


  38. Demo


  39. @ThanosMetrics
    KubeCon + CloudNativeCon
    Thanos Deep Dive:
    Inside a Distributed Monitoring System
    ▪ Bartłomiej Płotka & Frederic Branczyk, Red Hat
    ▪ Wednesday November 20, 2019
    ▪ 5:20pm (Room 6C)


  40. @ThanosMetrics
    Summary
    ● We can run lightweight monitoring on the edge
    ● Make use of Prometheus’ built-in replication (remote write)
    ● Thanos Receive helps with air-gapped deployments
    ● Thanos Receive can scale to handle the ingestion load
    ● The Thanos Receive Controller helps with multi-tenancy


  41. @ThanosMetrics
    https://www.katacoda.com/thanos


  42. @ThanosMetrics
    Work with us
    Red Hat’s Observability Team is hiring!
    https://global-redhat.icims.com/jobs/74508/principal-software-engineer---prometheus/job


  43. @ThanosMetrics
    Thank You!
    https://thanos.io
