Drift Happens! 3 Kubernetes Drift Scenarios & How to Overcome Them

Slide 1

Slide 1 text

Drift Happens! Kubernetes Drift Scenarios & How to Overcome Them Tuesday April 22nd, 2025

Slide 2

Slide 2 text

Agenda
● Housekeeping & Introductions
● Why Drift Happens
● The Impact of Drift on Actual Environments
● Best Practices and Strategies
● Q and A

Slide 3

Slide 3 text

Housekeeping
● Yes, this webinar is recorded
● Use the Q+A section to ask questions
● ~45 minutes

Slide 4

Slide 4 text

Meet our speakers 👋
● Ilan Adler: Komodor PMM
● Chen Kubani: Product Manager

Slide 5

Slide 5 text

A Quick Poll! ● How does your team primarily detect potential configuration drift in Kubernetes today?

Slide 6

Slide 6 text

A Quick Poll!
A) Manual checks: comparing manifests, kubectl diff, regular reviews.
B) Reactively: usually only discovered when investigating an incident or failure.
C) Using built-in features of GitOps tools (like Argo CD, Flux).
D) We don't have a specific or consistent process for detecting drift.

Slide 7

Slide 7 text

Common Drift Concerns
● “We can’t track who changed what across our clusters”
● “Configuration drift between clusters is a constant problem”
● “Our GitOps workflow breaks down when changes that were meant for DEV end up in PROD”

Slide 8

Slide 8 text

Why Does Drift Happen?
K8s Estate Increases
● More clusters, more services, more issues and headaches
Manual Changes & Control
● Break-glass mechanisms are important but can be debilitating
Deployment Issues
● Large-scale, complex Kubernetes environments can suffer from inconsistent deployments “drifting” from baseline configurations

Slide 9

Slide 9 text

Tales of the Drift

Slide 10

Slide 10 text

01 Configuration Drift Across Environments
Inconsistent Behavior in a Service
● A service is deployed across two regions, Prod EU and Prod US, but runs smoothly only in EU.
The Culprit - Inconsistent Memory Limits
● Caused by a misconfiguration during deployment.
The Cost - 1 Hour of Troubleshooting
● It took the team an hour to identify the issue.
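A mismatch like the memory limits in this scenario can be surfaced by a small script. The following is a minimal sketch, not a definitive implementation: it assumes both Deployment manifests have already been loaded into Python dicts (for example from JSON output of the cluster), and the function names, the `api` container, and the sample values are all hypothetical.

```python
def container_limits(deployment):
    """Map container name -> resources.limits dict for a Deployment manifest."""
    containers = deployment["spec"]["template"]["spec"]["containers"]
    return {c["name"]: c.get("resources", {}).get("limits", {}) for c in containers}


def limit_drift(a, b):
    """Return {container: {resource: (value_in_a, value_in_b)}} for limits that differ."""
    limits_a, limits_b = container_limits(a), container_limits(b)
    drift = {}
    for name in sorted(set(limits_a) | set(limits_b)):
        la, lb = limits_a.get(name, {}), limits_b.get(name, {})
        diffs = {
            key: (la.get(key), lb.get(key))
            for key in sorted(set(la) | set(lb))
            if la.get(key) != lb.get(key)
        }
        if diffs:
            drift[name] = diffs
    return drift


def make_deployment(memory):
    # Hypothetical, trimmed-down manifest used only for illustration.
    return {"spec": {"template": {"spec": {"containers": [
        {"name": "api", "resources": {"limits": {"cpu": "500m", "memory": memory}}}
    ]}}}}


prod_eu = make_deployment("512Mi")
prod_us = make_deployment("256Mi")
print(limit_drift(prod_eu, prod_us))
# → {'api': {'memory': ('512Mi', '256Mi')}}
```

Values are compared as plain strings, so equivalent quantities written differently (such as `1Gi` and `1024Mi`) would be reported as drift; a fuller check would normalize quantities first.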

Slide 11

Slide 11 text

02 Managing a Large K8s Fleet
Degraded Cluster Performance
● Managing hundreds of services across multiple clusters.
The Culprit - Outdated Container Image
● An incomplete deployment process left the cluster with an outdated image.
The Cost - 4 Hours of Analysis
● Multiple team members spent hours trying to detect the root cause of the performance issues.
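An outdated image on one cluster is easy to miss when hundreds of services are involved. As a rough sketch, assuming a snapshot of which image each cluster runs for a given service has already been collected, outliers can be flagged against the most common image in the fleet. The cluster names, registry, and the `image_outliers` helper are hypothetical.

```python
from collections import Counter


def image_outliers(images_by_cluster):
    """Flag clusters whose image differs from the most common one in the fleet.

    Assumes the most common image is the intended one; in practice the
    expected image would more reliably come from Git or a release record.
    """
    if not images_by_cluster:
        return None, {}
    expected, _ = Counter(images_by_cluster.values()).most_common(1)[0]
    outliers = {
        cluster: image
        for cluster, image in images_by_cluster.items()
        if image != expected
    }
    return expected, outliers


# Hypothetical fleet snapshot: cluster name -> image running for one service.
fleet = {
    "prod-eu-1": "registry.example.com/checkout:1.8.2",
    "prod-eu-2": "registry.example.com/checkout:1.8.2",
    "prod-us-1": "registry.example.com/checkout:1.7.0",
}
expected, outliers = image_outliers(fleet)
print(expected)   # registry.example.com/checkout:1.8.2
print(outliers)   # {'prod-us-1': 'registry.example.com/checkout:1.7.0'}
```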

Slide 12

Slide 12 text

03 GitOps Workflow
Service Reliability Issues
● Pod crashes for a critical service, which started with a new feature rollout.
The Culprit - Liveness Probes Incorrectly Configured
● A container image with non-prod configurations was deployed due to GitOps workflows.
The Cost - 1 Full Day to Recover
● It took the developer and an escalated SRE engineer to identify and remediate the issue.

Slide 13

Slide 13 text

Understanding the Full Impact of Drift
Performance and Stability Issues
● Degraded service performance
● Increased failure rates and downtime
● Longer troubleshooting time due to hard-to-detect configuration discrepancies
Security Issues
● Vulnerabilities from outdated or misconfigured services
Cost and Inefficiency Issues
● Services running misaligned configurations can impact cloud costs

Slide 14

Slide 14 text

Recommendations and Techniques
Drift happens. Your ability to detect and react defines your resilience. Here are key strategies to proactively manage and reduce the risk of drift:
Set Guardrails where Possible
● Use policies and automation to limit risky manual changes and enforce best practices.
Move towards GitOps
● Use Git as the single source of truth for configurations. GitOps ensures visibility, consistency, and accountability across environments.
Automate Everything
● Proactively catch misconfigurations with automated alerts and self-healing mechanisms to reduce MTTR.
Integrate Drift into Troubleshooting
● Treat drift checks as a default part of incident response; doing so can dramatically speed up root cause identification.
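To illustrate what an automated drift check against Git might look like, here is a minimal sketch that compares a desired manifest with the live object. It is not how any particular GitOps tool works; it assumes both states are already loaded as Python dicts, and the `diff_state` helper and sample manifests are hypothetical.

```python
def diff_state(desired, live, path=""):
    """Yield (path, desired_value, live_value) for every field that differs.

    Only fields present in `desired` are checked, so values the cluster adds
    on its own (status, defaults, generated metadata) are not reported.
    Lists are compared as whole values.
    """
    for key, want in desired.items():
        here = f"{path}.{key}" if path else key
        have = live.get(key) if isinstance(live, dict) else None
        if isinstance(want, dict):
            yield from diff_state(want, have if isinstance(have, dict) else {}, here)
        elif want != have:
            yield here, want, have


# Hypothetical desired state (as stored in Git) and live state (from the cluster).
containers = [{"name": "api", "image": "api:2.1.0"}]
desired = {"spec": {"replicas": 3, "template": {"spec": {"containers": containers}}}}
live = {
    "spec": {"replicas": 5, "template": {"spec": {"containers": containers}}},
    "status": {"readyReplicas": 5},
}

drift = list(diff_state(desired, live))
print(drift)
# → [('spec.replicas', 3, 5)]
```

Running a check like this on a schedule, or as a first step during incident response, turns a manual replica change into a visible finding instead of something discovered hours into troubleshooting.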

Slide 15

Slide 15 text

Winning the Battle Against K8s Drift!
Easy to Use Visual Experience
● Intuitive and user-friendly view with detailed insights. Compare versions and resources on your Helm charts. Diff-only mode for changes in multiple services.
Accelerate Troubleshooting & Recovery
● Immediately identify root cause, and quickly resolve it. Easily edit the desired state, and enforce best practices with all resource types.
Detect Discrepancies
● Keep service configurations uniform across complex K8s environments. Flag deviations as reliability risks and standardize configs across the fleet.
Automate Drift Detection
● Automatically detect and remediate. Connect to GitOps tooling to maintain a consistent source of truth.