
Shipping Metrics from the Edge


Computing is getting pushed to the edge: it may be your car, TV, washing machine, or your toaster. All these devices have a lot of computing power these days. While extending the cloud to the edge is being addressed by projects like KubeEdge or k3s, in this talk we take a closer look at how to run Prometheus on these devices. We want to configure Prometheus so that it replicates its data to a central collection point running Thanos on Kubernetes in a replicated setup, and then make use of all the shipped metrics to efficiently query across the entire fleet.

Matthias Loibl

November 20, 2019

Transcript

  1. @ThanosMetrics
    Shipping Metrics from the Edge
    Matthias Loibl, Software Engineer (Red Hat)
    [email protected]
    KubeCon + CNCF San Diego - November 19, 2019
    metalmatze


  2. @ThanosMetrics
    Matthias Loibl
    ▪ Software Engineer @ Red Hat
    ▪ OpenShift Monitoring Team
    ▪ OSS Contributor
    ○ Prometheus Operator
    ○ Thanos
    ○ gopass
    ▪ Meetup Organiser
    ○ Prometheus Berlin
    ▪ Hobbies
    ○ Playing Drums & Bass


  3. @ThanosMetrics
    Problem
    We have
    ● many air-gapped devices
    ● many applications monitored independently
    We want
    ● to know if a specific device has a problem
    ● to know how everything as a whole is doing
    ● to alert on problems on the device
    ● to alert on problems across multiple devices


  4. @ThanosMetrics
    Problem


  5. @ThanosMetrics
    Problem


  6. @ThanosMetrics
    Problem


  7. @ThanosMetrics
    Alerting
    Client
    ● Devices alert individually – they are provisioned with alerting rules upon deployment
    ● Configured to send alerts to the central Alertmanager, tagged with a device ID (see the sketch below)
    Server
    ● Runs a cluster of Alertmanagers
    ● Receives and de-duplicates alerts, then notifies according to the routing configuration
    No alerting across all devices yet :/
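    A minimal sketch of the per-device Prometheus configuration described above; the hostnames, the
    device_id label, and the rule file path are assumptions made up for illustration:

      # prometheus.yml on a device
      global:
        external_labels:
          device_id: device-0042          # hypothetical ID injected at provisioning time
      rule_files:
        - /etc/prometheus/rules/*.yaml    # alerting rules shipped with the device
      alerting:
        alertmanagers:
          - static_configs:
              - targets:
                  - alertmanager.example.com:9093   # central Alertmanager cluster

    The external_labels are attached to every alert, so the central Alertmanager can route and group
    by device.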


  8. @ThanosMetrics
    Alerting
    (diagram: devices sending alerts to the central Alertmanager cluster)


  9. @ThanosMetrics
    Idea #1
    ● Run Prometheus on each device
    ● Run reverse proxy on device for access or access via SSH


  10. @ThanosMetrics
    Idea #1


  11. @ThanosMetrics
    Idea #1
    ● Run Prometheus on each device
    ● Run reverse proxy on device for access or access via SSH
    Problems
    ● We expose things to the outside – might not even be possible
    ● No way of aggregating across devices easily
    ○ Could scrape them individually and then aggregate :/


  12. @ThanosMetrics
    Idea #2
    ● Run Prometheus on each device
    ● Run reverse proxy on device for access
    ● Run central Prometheus that federates from individual devices


  13. @ThanosMetrics
    Idea #2


  14. @ThanosMetrics
    Idea #2
    ● Run Prometheus on each device
    ● Run a reverse proxy on the device for access
    ● Run a central Prometheus that federates from the individual devices (see the sketch below)
    Improvements
    ● We can aggregate and query across all devices with the central Prometheus
    Problems
    ● We expose things to the outside – might not even be possible
    ● We might only be able to federate every so often and lose samples in between
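    A sketch of the central Prometheus federating from the devices, assuming they are reachable at
    the example addresses below; the match[] selector would be narrowed down in practice:

      # central prometheus.yml
      scrape_configs:
        - job_name: federate-devices
          honor_labels: true
          metrics_path: /federate
          params:
            'match[]':
              - '{job!=""}'               # pull all series exposed via /federate
          static_configs:
            - targets:
                - device-0001.example.com:9090
                - device-0002.example.com:9090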


  15. @ThanosMetrics
    Idea #3
    ● Run Prometheus on each device
    ● Run an app/client that collects metrics from Prometheus, then forwards them to a central place via some protocol


  16. @ThanosMetrics
    Idea #3
    (diagram: a sidecar next to each device’s Prometheus)


  17. @ThanosMetrics
    Idea #3
    ● Run Prometheus on each device
    ● Run an app/client that collects metrics from Prometheus, then forwards them to a central place via some protocol
    Improvements
    ● We don’t expose Prometheus to the outside
    Problems
    ● Collect metrics from Prometheus every 1 min
    ○ Depending on the scrape interval of the Prometheus targets, we lose samples
    ● Most likely something custom-built


  18. @ThanosMetrics
    Idea #4
    ● Run Prometheus on each device
    ● Run the Thanos sidecar and upload written blocks to some central object storage (see the sketch below)
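    A sketch of Idea #4: the Thanos sidecar runs next to the device’s Prometheus and uploads finished
    TSDB blocks to object storage. The paths, bucket name, endpoint, and credentials are placeholders:

      # thanos sidecar \
      #   --tsdb.path=/prometheus \
      #   --prometheus.url=http://localhost:9090 \
      #   --objstore.config-file=/etc/thanos/bucket.yaml

      # bucket.yaml
      type: S3
      config:
        bucket: edge-metrics
        endpoint: s3.example.com
        access_key: <access-key>
        secret_key: <secret-key>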


  19. @ThanosMetrics
    Idea #4
    (diagram: sidecars uploading to Object Storage, queried via Store and Query)


  20. @ThanosMetrics
    Idea #4
    ● Run Prometheus on each device
    ● Run the Thanos sidecar and upload written blocks to some central object storage
    Problems
    ● Metrics show up with a lag: by default it takes at least 2h until a block is written
    ● We need to expose the object storage to the devices (not too bad, but we can do better)


  21. @ThanosMetrics


  22. @ThanosMetrics
    Prometheus Remote Write
    ● A queue per destination
    ● Each queue reads from the write-ahead log (WAL)
    ● Sends requests to the configured endpoint
    ● Retries sending upon failure
    (diagram: the WAL feeding shards 1…n, each sending to an endpoint)
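    On the device, shipping samples continuously only needs a remote_write section in the Prometheus
    configuration. The URL and the queue numbers below are assumptions for illustration;
    /api/v1/receive is the endpoint exposed by Thanos Receive:

      # prometheus.yml on a device
      remote_write:
        - url: https://thanos-receive.example.com/api/v1/receive
          queue_config:
            capacity: 2500                # samples buffered per shard
            max_shards: 30
            max_samples_per_send: 500
            batch_send_deadline: 5s       # flush even if the batch is not full
            min_backoff: 30ms             # retry backoff on failed sends
            max_backoff: 100ms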


  23. @ThanosMetrics
    Prometheus Remote Write


  24. @ThanosMetrics
    Thanos Receive
    (diagram: Monitoring Cluster, Query)

  25. @ThanosMetrics
    Thanos Receive
    (diagram: Monitoring Cluster, Query)

  26. @ThanosMetrics
    Thanos Receive
    (diagram: Monitoring Cluster, Query)

  27. @ThanosMetrics
    Thanos Receive
    (diagram)

  28. @ThanosMetrics
    Thanos Receive
    (diagram)

  29. @ThanosMetrics
    Thanos Receive
    ● Hashes are calculated from the entire label set
    ● The tenant’s ID helps to distribute the load across receivers
    ● The hashing function is xxHash, the same one Prometheus uses
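    A rough sketch of one Thanos Receive replica on the monitoring cluster. The flag names follow
    the Thanos receive component and should be treated as assumptions to be checked against your
    Thanos version; endpoints and paths are placeholders:

      # container args in the thanos-receive StatefulSet
      args:
        - receive
        - --tsdb.path=/var/thanos/receive
        - --remote-write.address=0.0.0.0:19291            # devices point remote_write here
        - --objstore.config-file=/etc/thanos/bucket.yaml  # long-term storage for written blocks
        - --receive.replication-factor=3                  # replicate each series to 3 receivers
        - --receive.hashrings-file=/etc/thanos/hashrings.json
        - --receive.local-endpoint=thanos-receive-0.thanos-receive:10901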


  30. @ThanosMetrics
    Thanos Receive
    (diagram: Monitoring Cluster, Query)


  31. @ThanosMetrics
    Thanos Receive
    (diagram: Monitoring Cluster with Receive, Object Storage, Store, Query)


  32. @ThanosMetrics
    Alerting
    (diagram: Receive, Object Storage, Store, Query, Rule)


  33. @ThanosMetrics
    Thanos Receive Tenancy
    (diagram)


  34. @ThanosMetrics
    Thanos Receive Controller
    ● Maps tenants to individual hashrings / StatefulSets
    ○ Populates the endpoints in a ConfigMap
    ○ Every Thanos receiver mounts the same ConfigMap (see the sketch below)
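    A sketch of the hashring ConfigMap the controller generates; the hashring name, tenants, and
    endpoints are made-up examples. Every receiver mounts this file and uses it to route and
    replicate incoming remote-write requests:

      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: thanos-receive-hashrings
      data:
        hashrings.json: |
          [
            {
              "hashring": "edge-devices",
              "tenants": ["tenant-a", "tenant-b"],
              "endpoints": [
                "thanos-receive-0.thanos-receive:10901",
                "thanos-receive-1.thanos-receive:10901",
                "thanos-receive-2.thanos-receive:10901"
              ]
            }
          ]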


  35. @ThanosMetrics
    DEMO
    https://github.com/metalmatze/talks/


  36. Demo


  37. Demo


  38. Demo


  39. @ThanosMetrics
    KubeCon + CloudNativeCon
    Thanos Deep Dive:
    Inside a Distributed Monitoring System
    ▪ Bartłomiej Płotka & Frederic Branczyk, Red Hat
    ▪ Wednesday November 20, 2019
    ▪ 5:20pm (Room 6C)


  40. @ThanosMetrics
    Summary
    ● We can run lightweight monitoring on the edge
    ● Make use of Prometheus’ built-in replication (remote write)
    ● Thanos Receive helps with air-gapped deployments
    ● Thanos Receive can scale to handle the ingestion load
    ● The Thanos Receive Controller helps with multi-tenancy


  41. @ThanosMetrics
    https://www.katacoda.com/thanos


  42. @ThanosMetrics
    Work with us
    Red Hat’s Observability Team is hiring!
    https://global-redhat.icims.com/jobs/74508/principal-software-engineer---prometheus/job


  43. @ThanosMetrics
    Thank You!
    https://thanos.io
