Secrets to Monitor Kubernetes Workloads

Prathamesh Sonpatki
February 04, 2024
Monitoring remains a critical challenge in the ever-evolving landscape of Kubernetes workloads. It is no longer an afterthought and goes hand in hand with development and deployment. Throw in data growth and high cardinality, and the complexity multiplies, not just in managing data growth but also in fatigue and cost.

Join us as we delve into the trenches of Kubernetes observability, uncovering the secrets to monitoring your workloads masterfully using open standards. We will explore advanced techniques and address the real-world challenges people encounter daily, with trade-offs from the trenches and lessons learned while building one of the biggest monitoring platforms.

Transcript

  1. whoami
    - Prathamesh Sonpatki
    - Developer turned Developer Evangelist at Last9
    - SRE Stories
    - O11y.wiki
    - Twitter, LinkedIn
  2. Agenda
    - Why should you care?
    - Why “Monitoring”?
    - Prometheus and Kubernetes
    - Challenges of Monitoring Kubernetes Workloads
    - Some Cool Stuff
  3. Why should you care?
    - Kubernetes is the de facto standard for managing and deploying modern microservices-based apps.
  4. Monitoring Kubernetes Workloads: Who is monitoring?
    - DevOps / SRE / Operators
    - App Developers
    - Product
    - Business Stakeholders
  5. Ideal Monitoring Wishlist*
    - Stages: Ingestion, Storage, Query, Alerting
    - Wishlist: Pre-Processing Pipelines, Instrumentation, Long Term Storage, Faster Access, Standard Query Interface, Standard Dashboards, Workload Isolation, Relevant Context, Automation, Coverage, Signal vs. Noise
  6. Monitoring Kubernetes Workloads (at bare minimum…)
    - Monitoring Kubernetes itself
    - Monitoring Apps, Services, Jobs running in Kubernetes
  7. Monitoring Kubernetes itself
    - Kubernetes exposes metrics in OpenMetrics format for different components
    - https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-usage-monitoring/
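To make the "metrics as text" idea concrete, here is a minimal sketch of reading the Prometheus-style text exposition format that a `/metrics` endpoint returns. The sample payload, metric names, and the `parse_exposition` helper are all hypothetical and only cover the simple cases (no escaping rules, exemplars, or timestamps); a real setup would rely on Prometheus or an official client library instead.

```python
import re

# Hypothetical sample payload; names and values are illustrative only,
# not captured from a real Kubernetes component.
SAMPLE = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{code="200",method="get"} 1027
http_requests_total{code="500",method="get"} 3
process_open_fds 42
"""

# metric name, optional {label block}, then the sample value
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# one label="value" pair inside the label block
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and HELP/TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


for name, labels, value in parse_exposition(SAMPLE):
    print(name, labels, value)
```

Each distinct combination of metric name and label values is one time series, which is why the label sets matter so much for the cardinality discussion later in the deck.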
  12. Monitoring Apps, Services on Kubernetes
    - The challenge of Service Discovery and Ephemeral resources
  13. Prometheus ❤ Kubernetes
    - Kubernetes was inspired by Borg, Google’s container orchestration system.
    - Prometheus was inspired by Borgmon, a system to monitor services on Borg.
    - Both are CNCF graduated projects.
  14. Monitoring Kubernetes workloads using Prometheus
    - Stages: Ingestion, Storage, Query, Alerting
    - Building blocks: OpenMetrics Format, Text Exposition Format, Pull based model, Instrumentation, Multidimensional data model, Long Term Storage, PromQL, Grafana, Built-in visualization, Standard Dashboards, AlertManager, Silencers, Inhibit Rules, Notification Destinations, Service Discovery, Standard K8S Metrics, Prometheus Exporters
  15. Setting up Kubernetes Monitoring using Prometheus
    - Avoid Google Search
    - Just use “kube-prometheus”
  16. Challenges in monitoring Kubernetes Workloads
    - Instrumentation
    - Ingestion: ?
    - Storage: ?
    - Query: ?
    - Alerting: ?
    - Operator ✅, Service Discovery ✅, Service Monitoring ✅
  17. Ingestion
    - Ingesting large volumes of data is rarely a problem.
    - As long as reads are not significant.
    - As long as the system is not already primed.
    - As long as storage is not for long durations.
  18. Ingestion: The High Cardinality Challenge
    - 4 (Clusters) x 1000 (Service) x 3 (Environment) x 10,000 (Pods) = 120,000,000 🙀
    - https://www.foomo.org/blog/prometheus-cardinality-issues
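The multiplication on this slide can be checked with a few lines of Python. This is only a worked version of the slide's arithmetic: the dictionary keys and the `worst_case_series` helper are names chosen for illustration, and the result is an upper bound that assumes every label combination actually occurs.

```python
from math import prod

# Label dimensions and their distinct-value counts, taken from the slide.
dimensions = {
    "clusters": 4,
    "services": 1000,
    "environments": 3,
    "pods": 10_000,
}


def worst_case_series(dims):
    """Upper bound on series count: the product of per-label value counts."""
    return prod(dims.values())


total = worst_case_series(dimensions)
print(f"{total:,}")  # 120,000,000

# Illustration only: the same bound when the pod label is not kept.
without_pods = {k: v for k, v in dimensions.items() if k != "pods"}
print(f"{worst_case_series(without_pods):,}")  # 12,000
```

The second figure shows why a single high-churn label such as the pod name tends to dominate the series count for one metric.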
  19. Challenges in monitoring Kubernetes Workloads
    - Instrumentation
    - Ingestion: Drop ✅, Relabel ✅, Rename ✅, Cardinality 😭, Compute 😭, Cost 😭
    - Storage: ?
    - Query: ?
    - Alerting: ?
    - Operator ✅, Service Discovery ✅, Service Monitoring ✅
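The Drop, Relabel, and Rename steps named on this slide can be pictured as a small filter applied to each series before it is ingested. The sketch below is plain Python, not Prometheus's `relabel_configs` syntax; the `apply_rules` function, its parameters, and the metric names are hypothetical, and "relabel" is narrowed here to simply removing labels.

```python
import re


def apply_rules(series, drop_metric_re=None, drop_labels=(), rename=None):
    """Apply drop, label-drop, and rename steps to one (name, labels) series.

    Returns the rewritten (name, labels) tuple, or None if the series is dropped.
    """
    name, labels = series
    # Drop: discard whole series whose metric name matches the pattern.
    if drop_metric_re and re.fullmatch(drop_metric_re, name):
        return None
    # Relabel (narrowed to label removal in this sketch).
    labels = {k: v for k, v in labels.items() if k not in drop_labels}
    # Rename: map an old metric name to a new one, if a mapping exists.
    name = (rename or {}).get(name, name)
    return name, labels


incoming = [
    ("go_gc_duration_seconds", {"pod": "api-7d9f"}),
    ("http_requests_total", {"pod": "api-7d9f", "service": "api"}),
]

kept = []
for series in incoming:
    result = apply_rules(
        series,
        drop_metric_re=r"go_gc_.*",                  # hypothetical drop rule
        drop_labels=("pod",),                        # hypothetical label removal
        rename={"http_requests_total": "requests_total"},
    )
    if result is not None:
        kept.append(result)

print(kept)
```

These steps reduce what is ingested, but as the slide notes, cardinality, compute, and cost remain pain points even with them in place.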
  20. Storage
    - Millions of metrics but mostly unused.
    - Data exists but not accessible when needed.
    - Lack of workload isolation
    - Alerting needs fresh data, SLOs need data on longer horizon.
    - Can do remote write / Thanos / Levitate
  21. Challenges in monitoring Kubernetes Workloads
    - Instrumentation
    - Ingestion: Drop ✅, Relabel ✅, Rename ✅, Cardinality 😭, Compute 😭, Cost 😭
    - Storage: PV ✅, Storage Cost 😭, Remote Write ✅
    - Query: ?
    - Alerting: ?
    - Operator ✅, Service Discovery ✅, Service Monitoring ✅
  22. Query
    - You are fine as long as you are not reading.
    - Storage: Single Point of Failure
  23. Query
    - You are fine as long as you are not reading.
    - Storage: Single Point of Failure
    - Compute cost skyrockets.
    - Solution: Limit instrumentation / Limit query range.
    - Lack of traffic shaping.
  24. Challenges in monitoring Kubernetes Workloads
    - Instrumentation
    - Ingestion: Drop ✅, Relabel ✅, Rename ✅, Cardinality 😭, Compute Cost 😭
    - Storage: PV ✅, Storage Cost 😭, Remote Write ✅
    - Query: PromQL ✅, Standard Dashboards ✅, Concurrent Access 😭, Compute Cost 😭
    - Alerting: ?
    - Operator ✅, Service Discovery ✅, Service Monitoring ✅
  25. Alerting: Dark Side of Sane Defaults
    - High Cardinality Labels in alerts and dashboards
    - Latent Dashboards
    - Invalid Alert Rules
  26. Challenges in monitoring Kubernetes Workloads
    - Instrumentation
    - Ingestion: Drop ✅, Relabel ✅, Rename ✅, Cardinality 😭, Compute Cost 😭
    - Storage: PV ✅, Storage Cost 😭, Remote Write ✅
    - Query: PromQL ✅, Standard Dashboards ✅, Concurrent Access 😭, Compute Cost 😭
    - Alerting: AlertManager ✅, Destinations ✅, Good Defaults 😐, No anomaly 😭, Basic Viz 😭
    - Operator ✅, Service Discovery ✅, Service Monitoring ✅
  27. Security
    - Prometheus Operator runs on the cluster itself!
    - Automatically gets the right credentials
    - Can reach targets without extra networking permissions
  28. Autoscale based on metrics
    - Did you know you can connect HPA to metrics?
    - Use prometheus-adapter
  29. Streaming Aggregations
    - Run aggregations before data is stored to reduce cardinality.
    - Ignore pods, instances when not needed.
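As a rough picture of what "aggregate before storing" means, the sketch below sums samples while ignoring the `pod` and `instance` labels, so only one series per remaining label set is kept. The `aggregate_without` helper and the sample data are hypothetical, and summing is an assumption that suits counts such as request totals; other metric types may need a different aggregation.

```python
from collections import defaultdict


def aggregate_without(samples, drop=("pod", "instance")):
    """Sum sample values per metric after removing the labels in `drop`."""
    totals = defaultdict(float)
    for name, labels, value in samples:
        # The grouping key is the metric name plus the labels that are kept.
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        totals[(name, kept)] += value
    return [(name, dict(kept), value) for (name, kept), value in totals.items()]


# Hypothetical incoming samples: one series per pod.
incoming = [
    ("http_requests_total", {"service": "api", "pod": "api-1"}, 10.0),
    ("http_requests_total", {"service": "api", "pod": "api-2"}, 15.0),
    ("http_requests_total", {"service": "api", "pod": "api-3"}, 5.0),
    ("http_requests_total", {"service": "web", "pod": "web-1"}, 7.0),
]

aggregated = aggregate_without(incoming)
print(len(incoming), "series in,", len(aggregated), "series stored")
for row in aggregated:
    print(row)
```

Four per-pod series collapse into two per-service series here; the trade-off is that per-pod detail is no longer available at query time.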
  30. Challenges in monitoring Kubernetes Workloads
    - Instrumentation
    - Ingestion: Drop ✅, Relabel ✅, Rename ✅, Cardinality 😭, Compute Cost 😭
    - Storage: PV ✅, Storage Cost 😭, Remote Write ✅
    - Query: PromQL ✅, Standard Dashboards ✅, Concurrent Access 😭, Compute Cost 😭
    - Alerting: AlertManager ✅, Destinations ✅, Good Defaults ✅, No anomaly 😭, Basic Viz 😭
    - Operator ✅, Service Discovery ✅, Service Monitoring ✅