Secrets to Monitor Kubernetes Workloads - Speaker Deck

Slide 1

Slide 1 text

Supported By Secrets to Monitor Kubernetes Workloads Prathamesh Sonpatki Developer Evangelist, Last9

Slide 2

Slide 2 text

whoami Hello

Slide 3

Slide 3 text

whoami
- Prathamesh Sonpatki
- Developer turned Developer Evangelist at Last9
- SRE Stories
- O11y.wiki
- Twitter, LinkedIn

Slide 4

Slide 4 text

Supported By last9.io/friends

Slide 5

Slide 5 text

Agenda
- Why should you care?
- Why “Monitoring”?
- Prometheus and Kubernetes
- Challenges of Monitoring Kubernetes Workloads
- Some Cool Stuff

Slide 6

Slide 6 text

Why should you care? Kubernetes is the de facto standard for managing and deploying modern microservices-based apps.

Slide 7

Slide 7 text

Why should you care? A big box of interconnected components.

Slide 8

Slide 8 text

Why should you care?
- Fun to learn new things!

Slide 9

Slide 9 text

What is Monitoring? https://last9.io/blog/radar-and-black-boxes-for-software-observability/

Slide 10

Slide 10 text

What is Monitoring?

Slide 11

Slide 11 text

Monitoring Kubernetes Workloads
Who is monitoring?
- DevOps / SRE / Operators
- App Developers
- Product
- Business Stakeholders

Slide 12

Slide 12 text

Monitoring Kubernetes Workloads
When are they monitoring?
- War Time
- Peace Time

Slide 13

Slide 13 text

Ideal Monitoring Wishlist* Pre-Processing Pipelines Instrumentation Long Term Storage Faster Access Standard Query Interface Standard Dashboards Workload Isolation Relevant Context Automation Coverage Signal vs. Noise Ingestion Storage Query Alerting

Slide 14

Slide 14 text

Monitoring Kubernetes Workloads (at bare minimum…)
- Monitoring Kubernetes itself
- Monitoring Apps, Services, Jobs running in Kubernetes

Slide 15

Slide 15 text

Monitoring Kubernetes itself
- Kubernetes exposes metrics in OpenMetrics format for different components
https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-usage-monitoring/
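
As a rough illustration of what such a text-based metrics format looks like, the sketch below parses a few sample lines using only the Python standard library. The sample metric names and values are made up for illustration, and the parser handles only the simple `name{labels} value` shape, not the full exposition grammar.

```python
import re

# Hypothetical sample of a metrics endpoint response (names and values are illustrative).
SAMPLE = '''\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{code="200",method="get"} 1027
http_requests_total{code="500",method="get"} 3
process_open_fds 42
'''

LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments such as HELP and TYPE
        match = LINE_RE.match(line)
        if match is None:
            continue  # shape not handled by this simplified parser
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ''))
        samples.append((name, labels, float(value)))
    return samples


samples = parse_samples(SAMPLE)
for name, labels, value in samples:
    print(name, labels, value)
```

Each label combination on a metric name is its own time series, which is why label choices matter so much later in the talk.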

Slide 20

Slide 20 text

Monitoring Apps, Services on Kubernetes

Slide 21

Slide 21 text

Monitoring Apps, Services on Kubernetes
- The challenge of Service Discovery and Ephemeral resources

Slide 22

Slide 22 text

Monitoring Apps, Services on Kubernetes

Slide 23

Slide 23 text

Monitoring Kubernetes using Prometheus

Slide 24

Slide 24 text

Prometheus ❤ Kubernetes
- Kubernetes was inspired by Borg, Google’s container orchestration system.
- Prometheus was inspired by Borgmon, a system to monitor services on Borg.
- Both are CNCF graduated projects.

Slide 25

Slide 25 text

Monitoring Kubernetes workloads using Prometheus https://sysdig.com/blog/kubernetes-monitoring-prometheus/

Slide 26

Slide 26 text

Monitoring Kubernetes workloads using Prometheus
- Instrumentation: Service Discovery, Standard K8S Metrics, Prometheus Exporters
- Ingestion: OpenMetrics Format, Text Exposition Format, Pull based model
- Storage: Multidimensional data model, Long Term Storage
- Query: PromQL, Grafana, Built-in visualization, Standard Dashboards
- Alerting: AlertManager, Silencers, Inhibit Rules, Notification Destinations

Slide 27

Slide 27 text

Challenges in monitoring Kubernetes Workloads
- Instrumentation: ?
- Ingestion: ?
- Storage: ?
- Query: ?
- Alerting: ?

Slide 28

Slide 28 text

Let’s roll this

Slide 29

Slide 29 text

Setup & Instrumentation

Slide 30

Slide 30 text

Setting up Kubernetes Monitoring using Prometheus
- Avoid Google Search
- Just use “kube-prometheus”

Slide 31

Slide 31 text

Setting up Kubernetes Monitoring using Prometheus

Slide 32

Slide 32 text

Setting up Kubernetes Monitoring using Prometheus https://training.promlabs.com/training/prometheus-and-kubernetes/prometheus-operator-introduction/operator-architecture

Slide 33

Slide 33 text

Challenges in monitoring Kubernetes Workloads
- Instrumentation: Operator ✅, Service Discovery ✅, Service Monitoring ✅
- Ingestion: ?
- Storage: ?
- Query: ?
- Alerting: ?

Slide 34

Slide 34 text

Ingestion

Slide 35

Slide 35 text

Ingestion
- Ingesting large volumes of data is rarely a problem.
- As long as reads are not significant.
- As long as the system is not already primed.
- As long as storage is not for long durations.

Slide 36

Slide 36 text

Ingestion
The High Cardinality Challenge
4 (Clusters) x 1000 (Service) x 3 (Environment) x 10,000 (Pods) = 120,000,000 🙀
https://www.foomo.org/blog/prometheus-cardinality-issues
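
The multiplication on this slide is the worst case: every combination of label values becomes its own time series. A minimal sketch of that arithmetic, using the counts from the slide (the label names are illustrative):

```python
from math import prod

# Number of distinct values per label, taken from the slide.
label_values = {
    "cluster": 4,
    "service": 1000,
    "environment": 3,
    "pod": 10_000,
}

# Worst case: every combination of label values produces its own time series.
worst_case_series = prod(label_values.values())
print(f"{worst_case_series:,} potential series for a single metric")
```

In practice not every combination exists, but adding one more label with many values multiplies the total in the same way.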

Slide 37

Slide 37 text

Solving High Cardinality: Downsampling
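
One common reading of downsampling is to keep fewer, coarser points per series by averaging raw samples into fixed time windows. The sketch below shows that idea under simple assumptions: the function name, the window size, and the sample data are all hypothetical, and real systems often keep min, max, and count alongside the mean.

```python
from statistics import mean


def downsample(points, window_seconds):
    """Average (timestamp, value) points into fixed windows.

    Returns one (window_start, mean_value) pair per non-empty window.
    """
    buckets = {}
    for timestamp, value in points:
        window_start = timestamp - (timestamp % window_seconds)
        buckets.setdefault(window_start, []).append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Hypothetical raw samples scraped every 15 seconds.
raw = [(0, 1.0), (15, 3.0), (30, 5.0), (45, 7.0), (60, 2.0), (75, 4.0)]
per_minute = downsample(raw, 60)
print(len(raw), "raw points ->", len(per_minute), "downsampled points")
```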

Slide 38

Slide 38 text

Solving High Cardinality: Drop labels
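
To see why dropping a label helps, the sketch below counts distinct label sets before and after removing one label. The label names and values are hypothetical. Note the assumption that series which become identical after the drop can safely be treated as one, which only holds when the remaining labels still identify what you care about.

```python
def series_count(label_sets, dropped=()):
    """Count distinct label sets after removing the given label names."""
    distinct = {
        tuple(sorted((k, v) for k, v in labels.items() if k not in dropped))
        for labels in label_sets
    }
    return len(distinct)


# Hypothetical label sets for one metric, one series per pod.
label_sets = [
    {"service": "checkout", "pod": "checkout-1"},
    {"service": "checkout", "pod": "checkout-2"},
    {"service": "search", "pod": "search-1"},
]
before = series_count(label_sets)
after = series_count(label_sets, dropped=("pod",))
print(before, "series before,", after, "after dropping 'pod'")
```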

Slide 39

Slide 39 text

It is difficult to make predictions, especially about the future

Slide 40

Slide 40 text

Challenges in monitoring Kubernetes Workloads
- Instrumentation: Operator ✅, Service Discovery ✅, Service Monitoring ✅
- Ingestion: Drop ✅, Relabel ✅, Rename ✅, Cardinality 😭, Compute 😭, Cost 😭
- Storage: ?
- Query: ?
- Alerting: ?

Slide 41

Slide 41 text

Storage

Slide 42

Slide 42 text

Storage
- Millions of metrics but mostly unused.
- Data exists but not accessible when needed.
- Lack of workload isolation.
- Alerting needs fresh data, SLOs need data over a longer horizon.
- Can do remote write / Thanos / Levitate.

Slide 43

Slide 43 text

Challenges in monitoring Kubernetes Workloads
- Instrumentation: Operator ✅, Service Discovery ✅, Service Monitoring ✅
- Ingestion: Drop ✅, Relabel ✅, Rename ✅, Cardinality 😭, Compute 😭, Cost 😭
- Storage: PV ✅, Storage Cost 😭, Remote Write ✅
- Query: ?
- Alerting: ?

Slide 44

Slide 44 text

Query

Slide 45

Slide 45 text

Query
- You are fine as long as you are not reading.
- Storage - Single Point of Failure

Slide 46

Slide 46 text

Query
- You are fine as long as you are not reading.
- Storage - Single Point of Failure
- Compute cost skyrockets.
- Solution: Limit instrumentation / Limit query range.
- Lack of traffic shaping.
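
Limiting the query range can be as simple as clamping the requested window before the query reaches storage. The sketch below is one hypothetical way to do that; the function name and the 7-day limit are assumptions for illustration, not a feature of any particular tool.

```python
def clamp_query_range(start, end, max_range_seconds):
    """Return (start, end) with the window shortened to at most max_range_seconds.

    The end of the window is kept, so the most recent data stays visible.
    """
    if end < start:
        raise ValueError("end must not be before start")
    return max(start, end - max_range_seconds), end


# Hypothetical request for 30 days of data, limited to 7 days (timestamps in seconds).
DAY = 24 * 60 * 60
start, end = clamp_query_range(0, 30 * DAY, 7 * DAY)
print((end - start) // DAY, "days will be read")
```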

Slide 47

Slide 47 text

Challenges in monitoring Kubernetes Workloads
- Instrumentation: Operator ✅, Service Discovery ✅, Service Monitoring ✅
- Ingestion: Drop ✅, Relabel ✅, Rename ✅, Cardinality 😭, Compute Cost 😭
- Storage: PV ✅, Storage Cost 😭, Remote Write ✅
- Query: PromQL ✅, Standard Dashboards ✅, Concurrent Access 😭, Compute Cost 😭
- Alerting: ?

Slide 48

Slide 48 text

Alerting

Slide 49

Slide 49 text

Alerting
- Fatigue
- Signal vs. Noise
- Historical Trends

Slide 50

Slide 50 text

Alerting
- Thresholds
- Seasonality
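
The tension between thresholds and seasonality can be sketched by comparing a fixed threshold with a check against the same time last week. Both functions, the tolerance, and the traffic numbers below are hypothetical, and real seasonality handling is usually more involved than a single week-over-week comparison.

```python
def static_alert(value, threshold):
    """Fire when the current value exceeds a fixed threshold."""
    return value > threshold


def seasonal_alert(value, same_time_last_week, tolerance=0.5):
    """Fire when the value exceeds last week's value by more than `tolerance` (a ratio)."""
    return value > same_time_last_week * (1 + tolerance)


# Hypothetical request rates where the weekly peak is normal for this service.
current, last_week = 900.0, 850.0
print("static threshold fires:", static_alert(current, threshold=500.0))
print("seasonal check fires:", seasonal_alert(current, last_week))
```

With these numbers the static threshold pages on an ordinary peak while the seasonal check stays quiet, which is one source of the alert fatigue mentioned earlier.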

Slide 51

Slide 51 text

Alerting: Sane Defaults https://samber.github.io/awesome-prometheus-alerts/rules#kubernetes

Slide 52

Slide 52 text

Alerting
Dark Side of Sane Defaults
- High Cardinality Labels in alerts and dashboards
- Latent Dashboards
- Invalid Alert Rules

Slide 53

Slide 53 text

Challenges in monitoring Kubernetes Workloads
- Instrumentation: Operator ✅, Service Discovery ✅, Service Monitoring ✅
- Ingestion: Drop ✅, Relabel ✅, Rename ✅, Cardinality 😭, Compute Cost 😭
- Storage: PV ✅, Storage Cost 😭, Remote Write ✅
- Query: PromQL ✅, Standard Dashboards ✅, Concurrent Access 😭, Compute Cost 😭
- Alerting: AlertManager ✅, Destinations ✅, Good Defaults 😐, No anomaly 😭, Basic Viz 😭

Slide 54

Slide 54 text

Security

Slide 55

Slide 55 text

Security
- Prometheus Operator runs on the cluster itself!
- Automatically gets the right credentials
- Can reach targets without extra networking permissions

Slide 56

Slide 56 text

Cool Stuff

Slide 57

Slide 57 text

Reduce Kubernetes Metrics
- Drop everything not used by Dashboards and Alerts.
- Mimirtool
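
The idea behind "drop everything not used" can be sketched as a set difference between the metrics a cluster exposes and the metric names referenced in dashboard and alert expressions. This is a simplified illustration, not how Mimirtool works internally; the metric list and the expressions are hypothetical inputs.

```python
import re

# Hypothetical inputs: metrics the cluster exposes, plus the PromQL
# expressions found in dashboards and alert rules.
exposed_metrics = {
    "container_cpu_usage_seconds_total",
    "container_memory_working_set_bytes",
    "kube_pod_status_phase",
    "apiserver_request_duration_seconds_bucket",
}
expressions = [
    'sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)',
    'kube_pod_status_phase{phase="Pending"} > 0',
]

IDENTIFIER_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")

# Treat any identifier in an expression that matches an exposed metric name as "used".
used = {
    token
    for expression in expressions
    for token in IDENTIFIER_RE.findall(expression)
    if token in exposed_metrics
}
unused = sorted(exposed_metrics - used)
print("candidates to drop:", unused)
```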

Slide 58

Slide 58 text

Autoscale based on metrics
- Did you know you can connect HPA to metrics?
- Use prometheus-adapter
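
Once a metric is available to the Horizontal Pod Autoscaler, the Kubernetes documentation describes the core scaling rule as desiredReplicas = ceil(currentReplicas x currentMetricValue / desiredMetricValue). The sketch below applies only that rule; it deliberately ignores tolerance, min/max replica bounds, and stabilization windows, and the numbers are hypothetical.

```python
from math import ceil


def desired_replicas(current_replicas, current_value, target_value):
    """Scale replicas in proportion to how far the metric is from its target."""
    return ceil(current_replicas * (current_value / target_value))


# Hypothetical example: 4 replicas each serving 150 requests/s against a target of 100.
print(desired_replicas(4, 150.0, 100.0), "replicas wanted")
```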

Slide 59

Slide 59 text

Streaming Aggregations
- Run aggregations before data is stored to reduce cardinality.
- Ignore pods, instances when not needed.
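
A minimal sketch of the idea, assuming a sum is the right aggregation for the metric: samples are combined as they arrive, with the `pod` and `instance` labels ignored, and only the aggregated series is handed to storage on flush. The class and metric names are hypothetical.

```python
from collections import defaultdict


class StreamingSum:
    """Sum incoming samples per (metric, kept labels), ignoring dropped labels."""

    def __init__(self, drop_labels):
        self.drop_labels = set(drop_labels)
        self.totals = defaultdict(float)

    def observe(self, metric, labels, value):
        kept = tuple(sorted(
            (k, v) for k, v in labels.items() if k not in self.drop_labels
        ))
        self.totals[(metric, kept)] += value

    def flush(self):
        """Return the aggregated series for this interval and reset state."""
        result = dict(self.totals)
        self.totals.clear()
        return result


agg = StreamingSum(drop_labels=["pod", "instance"])
agg.observe("http_requests", {"service": "api", "pod": "api-1"}, 3)
agg.observe("http_requests", {"service": "api", "pod": "api-2"}, 4)
stored = agg.flush()
print(len(stored), "series stored instead of 2")
```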

Slide 60

Slide 60 text

Challenges in monitoring Kubernetes Workloads
- Instrumentation: Operator ✅, Service Discovery ✅, Service Monitoring ✅
- Ingestion: Drop ✅, Relabel ✅, Rename ✅, Cardinality 😭, Compute Cost 😭
- Storage: PV ✅, Storage Cost 😭, Remote Write ✅
- Query: PromQL ✅, Standard Dashboards ✅, Concurrent Access 😭, Compute Cost 😭
- Alerting: AlertManager ✅, Destinations ✅, Good Defaults ✅, No anomaly 😭, Basic Viz 😭

Slide 61

Slide 61 text

Thank you
last9.io/levitate-tsdb
Prathamesh Sonpatki, Developer Evangelist, Last9