Secrets to Monitor Kubernetes Workloads

Supported By Secrets to Monitor Kubernetes Workloads Prathamesh Sonpatki Developer
Evangelist, Last9

whoami Hello

whoami
- Prathamesh Sonpatki
- Developer turned Developer Evangelist at Last9
- SRE Stories
- O11y.wiki
- Twitter, LinkedIn

Supported By last9.io/friends

Agenda
- Why should you care?
- Why “Monitoring”?
- Prometheus and Kubernetes
- Challenges of Monitoring Kubernetes Workloads
- Some Cool Stuff

Why should you care? Kubernetes is the de facto standard for managing and
deploying modern microservices-based apps.

Why should you care? A big box of interconnected components.

Why should you care? - Fun to learn new things!

What is Monitoring? Radar vs. black boxes https://last9.io/blog/radar-and-black-boxes-for-software-observability/


Monitoring Kubernetes Workloads
Who is monitoring?
- DevOps / SRE / Operators
- App Developers
- Product
- Business Stakeholders

Monitoring Kubernetes Workloads
When are they monitoring?
- War Time
- Peace Time

Ideal Monitoring Wishlist*
Stages: Ingestion, Storage, Query, Alerting
- Pre-Processing Pipelines
- Instrumentation
- Long Term Storage
- Faster Access
- Standard Query Interface
- Standard Dashboards
- Workload Isolation
- Relevant Context
- Automation
- Coverage
- Signal vs. Noise

Monitoring Kubernetes Workloads (at bare minimum…)
- Monitoring Kubernetes itself
- Monitoring Apps, Services, Jobs running in Kubernetes

Monitoring Kubernetes itself
- Kubernetes exposes metrics in OpenMetrics format for different components
https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-usage-monitoring/
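
To make the metrics format concrete, here is a minimal sketch of reading the text-based exposition format that a component's metrics endpoint returns. The sample payload and the `parse_metrics` helper are illustrative assumptions, not part of the talk, and the parser deliberately ignores comments, timestamps, and exemplars.

```python
import re

# Hypothetical sample of what a /metrics endpoint might return.
SAMPLE = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{code="200",method="get"} 1027
http_requests_total{code="500",method="get"} 3
process_open_fds 12
"""

# metric_name, optional {labels}, then the sample value
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples, skipping comment lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE metadata and blank lines
        match = LINE_RE.match(line)
        if match is None:
            continue  # not a sample line this sketch understands
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

Each distinct combination of metric name and label values is one time series, which is why labels matter so much for the cardinality discussion later in the deck.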


Monitoring Apps, Services on Kubernetes - The challenge
of Service Discovery and Ephemeral resources


Monitoring Kubernetes using Prometheus

Prometheus ❤ Kubernetes
- Kubernetes was inspired by Borg, Google’s container orchestration system
- Prometheus was inspired by Borgmon, a system to monitor services on Borg.
- Both are CNCF graduated projects.

Monitoring Kubernetes workloads using Prometheus https://sysdig.com/blog/kubernetes-monitoring-prometheus/

Monitoring Kubernetes workloads using Prometheus
Stages: Ingestion, Storage, Query, Alerting
- OpenMetrics Format
- Text Exposition Format
- Pull based model
- Instrumentation
- Multidimensional data model
- Long Term Storage
- PromQL
- Grafana
- Built-in visualization
- Standard Dashboards
- AlertManager
- Silencers
- Inhibit Rules
- Notification Destinations
- Service Discovery
- Standard K8S Metrics
- Prometheus Exporters

Challenges in monitoring Kubernetes Workloads
- Instrumentation: ?
- Ingestion: ?
- Storage: ?
- Query: ?
- Alerting: ?

Let’s roll this

Setup & Instrumentation

Setting up Kubernetes Monitoring using Prometheus
- Avoid Google Search
- Just use “kube-prometheus”


Setting up Kubernetes Monitoring using Prometheus
https://training.promlabs.com/training/prometheus-and-kubernetes/prometheus-operator-introduction/operator-architecture

Challenges in monitoring Kubernetes Workloads
- Instrumentation: Operator ✅, Service Discovery ✅, Service Monitoring ✅
- Ingestion: ?
- Storage: ?
- Query: ?
- Alerting: ?

Ingestion

Ingestion
- Ingesting large volumes of data is rarely a problem.
- As long as reads are not significant.
- As long as system is not already primed.
- As long as storage is not for long durations.

Ingestion
The High Cardinality Challenge
4 (Clusters) x 1000 (Service) x 3 (Environment) x 10,000 (Pods) = 120,000,000 🙀
https://www.foomo.org/blog/prometheus-cardinality-issues
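
The multiplication on this slide can be checked with a short sketch. The label names and counts are the ones from the slide; the `worst_case_series` helper is a hypothetical name, and the result is an upper bound that assumes every label combination actually occurs.

```python
from math import prod

# Number of distinct values per label, taken from the slide.
label_values = {
    "cluster": 4,
    "service": 1000,
    "environment": 3,
    "pod": 10_000,
}


def worst_case_series(counts):
    """Upper bound on series for ONE metric: the product of label value counts."""
    return prod(counts.values())


print(f"{worst_case_series(label_values):,}")  # 120,000,000

# Without the per-pod label the same metric is far cheaper.
without_pod = {k: v for k, v in label_values.items() if k != "pod"}
print(f"{worst_case_series(without_pod):,}")  # 12,000
```

The pod label alone accounts for a factor of 10,000, which is why per-pod labels are usually the first candidates for dropping or aggregating.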

Solving High Cardinality Downsampling
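
As a rough illustration of downsampling, the sketch below collapses raw samples into fixed time buckets and keeps one averaged point per bucket. The `downsample` helper and the 60-second bucket are assumptions for the example; real systems may keep min, max, or sum alongside the mean. Note that this reduces the number of stored points per series rather than the number of series.

```python
from collections import defaultdict
from statistics import mean


def downsample(samples, bucket_seconds):
    """samples: list of (unix_ts, value). Returns one (bucket_start, mean) per bucket."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Six raw samples scraped every 15 seconds (hypothetical data).
raw = [(0, 1.0), (15, 3.0), (30, 5.0), (45, 7.0), (60, 10.0), (75, 20.0)]
print(downsample(raw, 60))  # [(0, 4.0), (60, 15.0)]
```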

Solving High Cardinality Drop labels
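
To see why dropping a label helps, this sketch counts distinct series before and after removing a per-pod label. The series data and the `drop_labels` helper are hypothetical; in Prometheus itself this kind of change is typically expressed through relabeling configuration rather than application code.

```python
def drop_labels(series, labels_to_drop):
    """series: iterable of label dicts. Returns the set of distinct label sets left."""
    remaining = set()
    for labels in series:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in labels_to_drop))
        remaining.add(kept)
    return remaining


# 100 pods of one service, each producing its own series (hypothetical data).
series = [
    {"service": "checkout", "env": "prod", "pod": f"checkout-{i}"}
    for i in range(100)
]

before = len({tuple(sorted(s.items())) for s in series})
after = len(drop_labels(series, {"pod"}))
print(before, after)  # 100 1
```

Series that collapse onto the same label set then need to be aggregated (for example summed) rather than overwritten, otherwise samples from different pods would collide.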

It is difficult to make predictions, especially about the future

Challenges in monitoring Kubernetes Workloads
- Instrumentation: Operator ✅, Service Discovery ✅, Service Monitoring ✅
- Ingestion: Drop ✅, Relabel ✅, Rename ✅, Cardinality 😭, Compute 😭, Cost 😭
- Storage: ?
- Query: ?
- Alerting: ?

Storage

Storage
- Millions of metrics but mostly unused.
- Data exists but not accessible when needed.
- Lack of workload isolation
- Alerting needs fresh data, SLOs need data over a longer horizon.
- Can do remote write / Thanos / Levitate

Challenges in monitoring Kubernetes Workloads
- Instrumentation: Operator ✅, Service Discovery ✅, Service Monitoring ✅
- Ingestion: Drop ✅, Relabel ✅, Rename ✅, Cardinality 😭, Compute 😭, Cost 😭
- Storage: PV ✅, Storage Cost 😭, Remote Write ✅
- Query: ?
- Alerting: ?

Query


Query
- You are fine as long as you are not reading.
- Storage - Single Point of Failure
- Compute cost skyrockets.
- Solution: Limit instrumentation / Limit query range.
- Lack of traffic shaping.
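
One way to picture "limit query range" is a thin guard in front of the query API that clamps overly long ranges before a request is sent. The sketch below builds a request URL for the Prometheus `query_range` HTTP endpoint; the six-hour cap, the helper names, and the `prometheus.example` host are assumptions made for illustration, and nothing is actually sent over the network.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

MAX_RANGE = timedelta(hours=6)  # assumed policy, not a Prometheus default


def clamp_range(start, end, max_range=MAX_RANGE):
    """Keep only the most recent max_range of the requested window."""
    if end <= start:
        raise ValueError("end must be after start")
    if end - start > max_range:
        start = end - max_range
    return start, end


def build_range_query(base_url, promql, start, end, step="60s"):
    """Return a query_range URL whose time window has been clamped."""
    start, end = clamp_range(start, end)
    params = urlencode({
        "query": promql,
        "start": start.timestamp(),
        "end": end.timestamp(),
        "step": step,
    })
    return f"{base_url}/api/v1/query_range?{params}"


end = datetime(2024, 1, 8, tzinfo=timezone.utc)
start = end - timedelta(days=7)  # a dashboard asking for a full week
print(build_range_query("http://prometheus.example:9090", "up", start, end))
```

Clamping silently is only one possible design; rejecting the request with an error makes the limit more visible to dashboard authors.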

Challenges in monitoring Kubernetes Workloads
- Instrumentation: Operator ✅, Service Discovery ✅, Service Monitoring ✅
- Ingestion: Drop ✅, Relabel ✅, Rename ✅, Cardinality 😭, Compute Cost 😭
- Storage: PV ✅, Storage Cost 😭, Remote Write ✅
- Query: PromQL ✅, Standard Dashboards ✅, Concurrent Access 😭, Compute Cost 😭
- Alerting: ?

Alerting

Alerting
- Fatigue
- Signal vs. Noise
- Historical Trends

Alerting
- Thresholds
- Seasonality
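
The tension between fixed thresholds and seasonality can be shown with a small sketch that compares the current value against what was seen in the same time slot in earlier periods. The `seasonal_alert` helper, the three-sigma tolerance, and the traffic numbers are all assumptions for illustration; this is not a feature of AlertManager.

```python
from statistics import mean, pstdev


def seasonal_alert(history, current, tolerance=3.0):
    """history: values seen at the same time slot in previous periods.

    Fires when the current value deviates from the historical mean by more
    than `tolerance` standard deviations.
    """
    baseline = mean(history)
    spread = max(pstdev(history), 1e-9)  # avoid a zero-width band
    return abs(current - baseline) > tolerance * spread


# Requests per second at Monday 09:00 over the last five weeks (hypothetical).
peak_history = [900, 950, 1000, 1050, 1100]
# Requests per second at 03:00 over the last three nights (hypothetical).
night_history = [90, 100, 110]

print(seasonal_alert(peak_history, 1080))  # False: normal for a Monday morning
print(seasonal_alert(night_history, 400))  # True: unusual for 03:00
```

A static threshold of 1000 requests per second would do the opposite in both cases: it would fire on the ordinary Monday peak and stay silent on the night-time spike.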

Alerting Sane Defaults https://samber.github.io/awesome-prometheus-alerts/rules#kubernetes

Alerting
Dark Side of Sane Defaults
- High Cardinality Labels in alerts and dashboards
- Latent Dashboards
- Invalid Alert Rules

Challenges in monitoring Kubernetes Workloads
- Instrumentation: Operator ✅, Service Discovery ✅, Service Monitoring ✅
- Ingestion: Drop ✅, Relabel ✅, Rename ✅, Cardinality 😭, Compute Cost 😭
- Storage: PV ✅, Storage Cost 😭, Remote Write ✅
- Query: PromQL ✅, Standard Dashboards ✅, Concurrent Access 😭, Compute Cost 😭
- Alerting: AlertManager ✅, Destinations ✅, Good Defaults 😐, No anomaly 😭, Basic Viz 😭

Security

Security
- Prometheus Operator runs on the cluster itself!
- Automatically gets the right credentials
- Can reach targets without extra networking permissions

Cool Stuff

Reduce Kubernetes Metrics
- Drop everything not used by Dashboards and Alerts.
- Mimirtool
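
The idea of dropping whatever dashboards and alerts never reference boils down to a set difference. The sketch below is a deliberately naive version: it tokenizes the query strings and checks which scraped metric names never appear. The metric names, queries, and the `unused_metrics` helper are hypothetical, and a tool such as Mimirtool is meant to do this analysis more rigorously against real dashboards and rules.

```python
import re

# Anything that looks like an identifier in a PromQL expression.
TOKEN = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def unused_metrics(scraped_names, queries):
    """Return scraped metric names that no dashboard or alert query mentions."""
    tokens = set()
    for query in queries:
        tokens.update(TOKEN.findall(query))
    return sorted(set(scraped_names) - tokens)


# Hypothetical inventory of scraped metrics.
scraped = [
    "container_cpu_usage_seconds_total",
    "kube_pod_status_phase",
    "apiserver_request_duration_seconds_bucket",
    "go_gc_duration_seconds",
]

# Hypothetical expressions collected from dashboards and alert rules.
queries = [
    'sum(rate(container_cpu_usage_seconds_total{namespace="prod"}[5m])) by (pod)',
    'kube_pod_status_phase{phase="Pending"} > 0',
]

print(unused_metrics(scraped, queries))
# ['apiserver_request_duration_seconds_bucket', 'go_gc_duration_seconds']
```

The names in the result are candidates for a drop rule at scrape time; it is worth confirming that ad hoc queries and recording rules do not depend on them before dropping anything.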

Autoscale based on metrics
- Did you know, you can connect HPA to Metrics?
- Use prometheus-adapter
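
Once a Prometheus metric reaches the Horizontal Pod Autoscaler through an adapter, the scaling decision follows the rule described in the Kubernetes documentation: desired replicas = ceil(current replicas x current metric value / target metric value). The sketch below reproduces that arithmetic; the function name and the request-rate numbers are illustrative, and the 10% tolerance is assumed here as the usual default.

```python
from math import ceil


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Replica count the HPA scaling rule would ask for, given one metric."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    return ceil(current_replicas * ratio)


# Target: 100 requests/second per pod (hypothetical custom metric).
print(desired_replicas(4, 150, 100))  # 6  (scale up)
print(desired_replicas(4, 105, 100))  # 4  (within tolerance)
print(desired_replicas(8, 50, 100))   # 4  (scale down)
```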

Streaming Aggregations
- Run Aggregations before data is stored to reduce cardinality.
- Ignore pods, instances when not needed.
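
A streaming aggregation can be sketched as a step that sits between ingestion and storage and sums incoming samples after discarding the labels nobody queries. The `aggregate_stream` helper and the sample data are hypothetical, and summing is only appropriate for values where a sum is meaningful (for example per-pod rates), not for every metric type.

```python
from collections import defaultdict


def aggregate_stream(samples, drop=("pod", "instance")):
    """samples: iterable of (labels, value) from one ingestion interval.

    Returns {reduced_label_set: summed_value}, so only the aggregated
    series ever reaches storage.
    """
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        totals[key] += value
    return dict(totals)


# Per-pod request rates arriving in one interval (hypothetical data).
incoming = [
    ({"service": "a", "pod": "a-1"}, 2.0),
    ({"service": "a", "pod": "a-2"}, 3.0),
    ({"service": "b", "pod": "b-1"}, 5.0),
]

print(aggregate_stream(incoming))
# {(('service', 'a'),): 5.0, (('service', 'b'),): 5.0}
```

Three incoming series become two stored series here; with thousands of pods per service the reduction is correspondingly larger.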

Challenges in monitoring Kubernetes Workloads
- Instrumentation: Operator ✅, Service Discovery ✅, Service Monitoring ✅
- Ingestion: Drop ✅, Relabel ✅, Rename ✅, Cardinality 😭, Compute Cost 😭
- Storage: PV ✅, Storage Cost 😭, Remote Write ✅
- Query: PromQL ✅, Standard Dashboards ✅, Concurrent Access 😭, Compute Cost 😭
- Alerting: AlertManager ✅, Destinations ✅, Good Defaults ✅, No anomaly 😭, Basic Viz 😭

Thank you last9.io/levitate-tsdb Prathamesh Sonpatki Developer Evangelist, Last9
