Agenda
- Why should you care?
- Why “Monitoring”?
- Prometheus and Kubernetes
- Challenges of Monitoring Kubernetes Workloads
- Some Cool Stuff
Supported By
Slide 6
Why should you care?
Kubernetes is the de facto standard for managing and deploying modern
microservices-based apps.
Slide 7
Why should you care?
A big box of interconnected components.
Slide 8
Why should you care?
- Fun to learn new things!
Slide 9
What is Monitoring?
https://last9.io/blog/radar-and-black-boxes-for-software-observability/
vs.
Slide 10
What is Monitoring?
Slide 11
Monitoring Kubernetes Workloads
Who is monitoring?
- DevOps / SRE / Operators
- App Developers
- Product
- Business Stakeholders
Slide 12
Monitoring Kubernetes Workloads
When are they monitoring?
- War Time
- Peace Time
Slide 13
Ideal Monitoring Wishlist*
- Pre-Processing Pipelines
- Instrumentation
- Long Term Storage
- Faster Access
- Standard Query Interface
- Standard Dashboards
- Workload Isolation
- Relevant Context
- Automation
- Coverage
- Signal vs. Noise
Stages: Ingestion, Storage, Query, Alerting
Slide 14
Monitoring Kubernetes Workloads (at bare minimum…)
- Monitoring Kubernetes itself
- Monitoring Apps, Services, and Jobs running in Kubernetes
Slide 15
Monitoring Kubernetes itself
- Kubernetes exposes Metrics in OpenMetrics format for different components
https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-usage-monitoring/
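A minimal sketch of what reading such a metrics endpoint involves, assuming the plain text exposition format. The sample payload below is made up for illustration (the label values and numbers are not from a real cluster), and the parser is deliberately simplified: it skips comment lines and ignores optional timestamps.

```python
import re

# Made-up sample payload in the text exposition format (values are illustrative).
SAMPLE = '''\
# HELP apiserver_request_total Counter of apiserver requests.
# TYPE apiserver_request_total counter
apiserver_request_total{verb="GET",code="200"} 1027
apiserver_request_total{verb="POST",code="201"} 3
process_cpu_seconds_total 12.5
'''

# metric_name{labels} value [timestamp]
LINE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if not match:
            continue  # simplified: silently ignore lines we cannot parse
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ''))
        samples.append((name, labels, float(value)))
    return samples


samples = parse_metrics(SAMPLE)
for name, labels, value in samples:
    print(name, labels, value)
```

In practice a Prometheus server or client library does this parsing; the sketch only shows that each line is a metric name, a set of labels, and a value.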
Slide 20
Monitoring Apps, Services on Kubernetes
Slide 21
Monitoring Apps, Services on Kubernetes
- The challenge of Service Discovery and Ephemeral resources
Slide 22
Monitoring Apps, Services on Kubernetes
Slide 23
Monitoring Kubernetes using Prometheus
Slide 24
Prometheus ❤ Kubernetes
- Kubernetes was inspired by Borg, Google’s container orchestration system.
- Prometheus was inspired by Borgmon, a system to monitor services on
Borg.
- Both are CNCF graduated projects.
Slide 25
Monitoring Kubernetes workloads using Prometheus
https://sysdig.com/blog/kubernetes-monitoring-prometheus/
Slide 26
Monitoring Kubernetes workloads using Prometheus
- OpenMetrics Format
- Text Exposition Format
- Pull based model
- Instrumentation
- Multidimensional data model
- Long Term Storage
- PromQL
- Grafana
- Built-in visualization
- Standard Dashboards
- AlertManager
- Silencers
- Inhibit Rules
- Notification Destinations
- Service Discovery
- Standard K8S Metrics
- Prometheus Exporters
Stages: Ingestion, Storage, Query, Alerting
Slide 27
Challenges in monitoring Kubernetes Workloads
- Instrumentation: ?
- Ingestion: ?
- Storage: ?
- Query: ?
- Alerting: ?
Slide 28
Let’s roll this
Slide 29
Setup & Instrumentation
Slide 30
Setting up Kubernetes Monitoring using Prometheus
- Avoid Google Search
- Just use “kube-prometheus”
Slide 31
Setting up Kubernetes Monitoring using Prometheus
Slide 32
Setting up Kubernetes Monitoring using Prometheus
https://training.promlabs.com/training/prometheus-and-kubernetes/prometheus-operator-introduction/operator-architecture
Slide 33
Challenges in monitoring Kubernetes Workloads
- Instrumentation: Operator ✅, Service Discovery ✅, Service Monitoring ✅
- Ingestion: ?
- Storage: ?
- Query: ?
- Alerting: ?
Slide 34
Ingestion
Slide 35
Ingestion
- Ingesting large volumes of data is rarely a problem, as long as:
- Reads are not significant.
- The system is not already primed.
- Data is not stored for long durations.
Slide 36
Ingestion
The High Cardinality Challenge
4 (Clusters) x
1,000 (Services) x
3 (Environments) x
10,000 (Pods)
= 120,000,000 🙀
https://www.foomo.org/blog/prometheus-cardinality-issues
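The arithmetic can be checked with a short sketch. Multiplying the four factors listed above gives 120,000,000 as the worst case for a single metric name, assuming every label combination actually occurs (real clusters usually see far fewer). The label names in the dictionary are illustrative.

```python
from math import prod

# Distinct values per label, taken from the factors on the slide.
label_cardinality = {
    "cluster": 4,
    "service": 1_000,
    "environment": 3,
    "pod": 10_000,
}


def worst_case_series(cardinalities):
    """Upper bound on time series for one metric: product of label cardinalities."""
    return prod(cardinalities.values())


total = worst_case_series(label_cardinality)
print(f"{total:,}")  # 120,000,000

# Without the pod label the same metric has at most 12,000 series.
without_pod = {k: v for k, v in label_cardinality.items() if k != "pod"}
print(f"{worst_case_series(without_pod):,}")  # 12,000
```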
Slide 37
Solving High Cardinality
Downsampling
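A minimal sketch of downsampling, assuming samples are (unix_timestamp, value) pairs and that averaging per fixed-width bucket is acceptable. The function name and bucket width are hypothetical. Note that this reduces the number of stored samples per series, not the number of series.

```python
from collections import defaultdict


def downsample(samples, bucket_seconds):
    """Average (timestamp, value) samples into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in samples:
        bucket_start = ts - ts % bucket_seconds  # align to the bucket boundary
        buckets[bucket_start].append(value)
    return [
        (start, sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]


# Six samples scraped every 15 seconds, reduced to one point per minute.
raw = [(0, 1.0), (15, 2.0), (30, 3.0), (45, 4.0), (60, 5.0), (75, 6.0)]
per_minute = downsample(raw, bucket_seconds=60)
print(per_minute)  # [(0, 2.5), (60, 5.5)]
```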
Slide 38
Solving High Cardinality
Drop labels
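A small sketch of why dropping a label helps: the number of distinct series is the number of distinct label sets, so removing a high-cardinality label such as `pod` collapses many series into one. The series below are made up for illustration. In a real pipeline the collapsed series also need their values aggregated, otherwise they collide.

```python
def distinct_series(series, drop=()):
    """Return the distinct label sets left after removing the given labels."""
    return {
        frozenset((k, v) for k, v in labels.items() if k not in drop)
        for labels in series
    }


# Hypothetical label sets for one metric.
series = [
    {"service": "api", "pod": "api-1"},
    {"service": "api", "pod": "api-2"},
    {"service": "web", "pod": "web-1"},
]

before = len(distinct_series(series))
after = len(distinct_series(series, drop=("pod",)))
print(before, after)  # 3 2
```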
Slide 39
It is difficult to make predictions,
especially about the future
Slide 40
Challenges in monitoring Kubernetes Workloads
- Instrumentation: Operator ✅, Service Discovery ✅, Service Monitoring ✅
- Ingestion: Drop ✅, Relabel ✅, Rename ✅, Cardinality 😭, Compute 😭, Cost 😭
- Storage: ?
- Query: ?
- Alerting: ?
Slide 41
Storage
Slide 42
Storage
- Millions of metrics, but mostly unused.
- Data exists but is not accessible when needed.
- Lack of workload isolation.
- Alerting needs fresh data; SLOs need data over a longer horizon.
- Options: remote write / Thanos / Levitate
Slide 43
Challenges in monitoring Kubernetes Workloads
- Instrumentation: Operator ✅, Service Discovery ✅, Service Monitoring ✅
- Ingestion: Drop ✅, Relabel ✅, Rename ✅, Cardinality 😭, Compute 😭, Cost 😭
- Storage: PV ✅, Storage Cost 😭, Remote Write ✅
- Query: ?
- Alerting: ?
Slide 44
Query
Slide 45
Query
- You are fine as long as you are not reading.
- Storage - Single Point of Failure
Slide 46
Query
- You are fine as long as you are not reading.
- Storage - Single Point of Failure
- Compute cost skyrockets.
- Solution: Limit instrumentation / Limit query range.
- Lack of traffic shaping.
Slide 47
Challenges in monitoring Kubernetes Workloads
- Instrumentation: Operator ✅, Service Discovery ✅, Service Monitoring ✅
- Ingestion: Drop ✅, Relabel ✅, Rename ✅, Cardinality 😭, Compute Cost 😭
- Storage: PV ✅, Storage Cost 😭, Remote Write ✅
- Query: PromQL ✅, Standard Dashboards ✅, Concurrent Access 😭, Compute Cost 😭
- Alerting: ?
Slide 48
Alerting
Slide 49
Alerting
- Fatigue
- Signal vs. Noise
- Historical Trends
Slide 50
Alerting
- Thresholds
- Seasonality
Slide 51
Alerting
Sane Defaults
https://samber.github.io/awesome-prometheus-alerts/rules#kubernetes
Slide 52
Alerting
Dark Side of Sane Defaults
- High Cardinality Labels in alerts and dashboards
- Latent Dashboards
- Invalid Alert Rules
Slide 53
Challenges in monitoring Kubernetes Workloads
- Instrumentation: Operator ✅, Service Discovery ✅, Service Monitoring ✅
- Ingestion: Drop ✅, Relabel ✅, Rename ✅, Cardinality 😭, Compute Cost 😭
- Storage: PV ✅, Storage Cost 😭, Remote Write ✅
- Query: PromQL ✅, Standard Dashboards ✅, Concurrent Access 😭, Compute Cost 😭
- Alerting: AlertManager ✅, Destinations ✅, Good Defaults 😐, No anomaly 😭, Basic Viz 😭
Slide 54
Security
Slide 55
Security
- Prometheus Operator runs on the cluster itself!
- Automatically gets the right credentials
- Can reach targets without extra networking permissions
Slide 56
Cool Stuff
Slide 57
Reduce Kubernetes Metrics
- Drop everything not used by Dashboards and Alerts.
- Mimirtool
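The idea behind this cleanup can be sketched as a set difference: metrics that are scraped but referenced by no dashboard and no alert are candidates for dropping. This is only an illustration of the principle, not how Mimirtool is implemented. The function name is hypothetical and the "used" sets are made-up inputs that a real analysis would extract from dashboard and rule definitions.

```python
def unused_metrics(scraped, used_in_dashboards, used_in_alerts):
    """Metric names that are scraped but referenced nowhere."""
    return sorted(set(scraped) - set(used_in_dashboards) - set(used_in_alerts))


# Illustrative inputs: which metrics are scraped and which are referenced.
scraped = [
    "apiserver_request_total",
    "container_cpu_usage_seconds_total",
    "go_gc_duration_seconds",
    "kube_pod_status_phase",
]
dashboards = {"container_cpu_usage_seconds_total"}
alerts = {"kube_pod_status_phase", "apiserver_request_total"}

candidates = unused_metrics(scraped, dashboards, alerts)
print(candidates)  # ['go_gc_duration_seconds']
```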
Slide 58
Autoscale based on metrics
- Did you know you can connect the HPA (Horizontal Pod Autoscaler) to metrics?
- Use prometheus-adapter
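Once a metric reaches the HPA, the scaling decision follows the formula in the Kubernetes documentation: desired replicas = ceil(current replicas x current metric value / target metric value), with no change when the ratio is within a tolerance (0.1 by default). A minimal sketch of that calculation, leaving out details such as min/max replica bounds and stabilization windows; the function name and example numbers are assumptions.

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Replica count from the HPA ratio formula (simplified sketch)."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: do not scale
    return math.ceil(current_replicas * ratio)


# Example: 4 pods, each seeing 200 requests/s against a target of 100 requests/s.
print(desired_replicas(4, 200, 100))  # 8
print(desired_replicas(4, 105, 100))  # 4 (within tolerance)
print(desired_replicas(4, 50, 100))   # 2
```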
Slide 59
Streaming Aggregations
- Run aggregations before data is stored to reduce cardinality.
- Drop pod and instance labels when they are not needed.
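A minimal sketch of the idea, assuming each incoming sample is a (labels, value) pair and that summing is the right aggregation for the metric (true for counters and request rates, not for every metric). The function name and sample data are hypothetical.

```python
from collections import defaultdict


def aggregate(samples, drop=("pod", "instance")):
    """Sum sample values per label set, ignoring the dropped labels."""
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        totals[key] += value
    return dict(totals)


# Three per-pod samples arrive; only two per-service series are stored.
incoming = [
    ({"service": "api", "pod": "api-1"}, 3.0),
    ({"service": "api", "pod": "api-2"}, 5.0),
    ({"service": "web", "pod": "web-1"}, 2.0),
]
stored = aggregate(incoming)
for key, value in stored.items():
    print(dict(key), value)
```

The aggregation happens on the write path, so the per-pod series never reach storage.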
Slide 60
Challenges in monitoring Kubernetes Workloads
- Instrumentation: Operator ✅, Service Discovery ✅, Service Monitoring ✅
- Ingestion: Drop ✅, Relabel ✅, Rename ✅, Cardinality 😭, Compute Cost 😭
- Storage: PV ✅, Storage Cost 😭, Remote Write ✅
- Query: PromQL ✅, Standard Dashboards ✅, Concurrent Access 😭, Compute Cost 😭
- Alerting: AlertManager ✅, Destinations ✅, Good Defaults ✅, No anomaly 😭, Basic Viz 😭
Slide 61
Thank you
last9.io/levitate-tsdb
Prathamesh Sonpatki
Developer Evangelist, Last9