
Kubernetes self-healing of your workload


The presentation slides by Hung-Wei Chiu provide an in-depth operational discussion of Kubernetes self-healing and failover design, focusing on the differences between stateless and stateful applications. The deck explains that stateless pods will eventually be redeployed after a node failure, with a total recovery time of roughly 350 seconds by default, determined by node health detection (50 seconds) plus the pod eviction timeout (300 seconds). Stateful pods, by contrast, are not automatically redeployed after a node failure and require manual intervention because of challenges such as persistent volume unmounting. Key failover factors discussed include the time to declare a node unhealthy, driven by kubelet Lease objects, and the time until pods are evicted, driven by taints and tolerations. The deck also introduces the OutOfServiceTaint as a manual way to accelerate StatefulSet recovery by reducing the default six-minute volume unmount timeout during node downtime.


Hung-Wei Chiu

October 22, 2025



Transcript

  1. Preface • I recently shared a talk on operating large-scale

    Kubernetes clusters at the TSMC IT event. • https://www.youtube.com/playlist?list=PLT3USJy3vydAu1XUGO5dY30gd2RBw1QgT • That session covered several challenges encountered when clusters and workloads scale up to a massive level. • Today, we’ll focus on Kubernetes failover design — understanding its implementation can greatly simplify our daily operations and make Kubernetes more predictable in production.
  2. Case (1) • Node down (Stateless Pod) • docker stop

    kind-worker • How long does it take for the pod to be redeployed? • 1/6/12/18/35 mins? • (Diagram: KIND 1.34 cluster with a Deployment-managed pod running on the worker)
  3. Case (2) • Node down (Stateful Pod without PVC) •

    docker stop kind-worker • How long does it take for the pod to be redeployed? • 1/6/12/18/35/60 mins or others? • (Diagram: KIND 1.34 cluster topology)
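
A minimal way to reproduce cases (1) and (2), assuming a KIND cluster whose worker node is named kind-worker, as on the slides:

      # Simulate a node outage by stopping the KIND worker container
      docker stop kind-worker

      # Watch the node flip from Ready to NotReady (roughly 50 seconds by default)
      kubectl get nodes -w

      # Watch how long it takes for the replacement pod to show up
      kubectl get pods -o wide -w
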
  4. Kubernetes • Stateless application • Pod will eventually be redeployed

    • Is a ~6-minute redeploy time acceptable? • Stateful application • Pods aren’t redeployed automatically • Requires intervention -> unacceptable for cluster operations
  5. Kubernetes - Stateful • Stateful application • Each pod has

    a unique, persistent name • web-0 1/1 Terminating 0 56m • The pod isn’t redeployed until the existing one is deleted • During a node outage, the existing pod’s status may be inaccessible
  6. Kubernetes • (Diagram: two Kubernetes nodes with kubelets and the controller; the controller confirms the existing stateful pod on the failed node is terminated, then deploys the new pod with the same name on the healthy node)
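
A sketch of the manual intervention the deck alludes to, using the pod name web-0 from slide 5; force-deleting is only safe once you are sure the node is really down, otherwise two copies of the pod could run at once:

      # Remove the stuck pod object so the StatefulSet controller can recreate
      # a pod with the same name on a healthy node
      kubectl delete pod web-0 --force --grace-period=0
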
  7. Kubernetes Self-Healing • As a Kubernetes administrator • Understand Kubernetes

    internals to adjust and implement solutions for your environment • Set clear expectations of Kubernetes behavior
  8. K8s Failover - Self-Healing • Kubernetes self-healing(failover) behavior • Does

    failover meet your expectations? • 5 mins for stateless workloads • Never for stateful workloads • Options • Stateless apps -> handle with replication • Stateful apps -> SREs must monitor and fix ASAP • Is this realistic?
  9. Failover Key Factor • Critical Factor: Time to Restore Service

    • Determined by two main factors • When to declare a node unhealthy • When pods are evicted to healthy nodes
  10. When To Declare a Node Unhealthy • Detecting Node Outages

    in Distributed Systems • A node may experience issues such as • Agent failure (kubelet, kube-proxy) • Network issues • System/OS issues • Hardware issues • Some node issues impact containers directly, while others do not
  11. Node Health Detection • Node Health Monitoring in Kubernetes •

    It’s impossible to monitor and alert for all possible issues • Kubernetes relies on the kubelet reporting node health to the API server to detect problems • (Diagram: kubelets reporting to the API server)
  12. Kubelet Reporting & Node Health • Kubelet Node reporting •

    Kubelet reports to the API server periodically • Node health status -> every 10 seconds • System/workload information -> every 5 mins with conditions • CPU, disk, memory pressure, etc. • The controller checks health periodically • A node is marked unhealthy if its last heartbeat exceeds the defined period
  13. Node Heartbeat - Lease Object • Check heartbeat in Kubernetes

    via Lease objects in the kube-node-lease namespace • One Lease per node • Updated every 10 seconds by default
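
The per-node heartbeat can be inspected directly; for example, assuming a node named kind-worker:

      # One Lease object per node lives in the kube-node-lease namespace
      kubectl get leases -n kube-node-lease

      # renewTime is the last heartbeat; it should advance about every 10 seconds
      kubectl get lease kind-worker -n kube-node-lease -o yaml
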
  14. Heartbeat Update Interval • Update interval = 0.25 * leaseDuration

    (from the kubelet config) • Default leaseDuration = 40 seconds • Update interval = 40 * 0.25 = 10 seconds https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet.go#L862-L873?WT.mc_id=AZ-MVP-5003331
  15. Adjusting Lease Duration • Lease duration can be adjusted via

    • /var/lib/kubelet/config.yaml • nodeLeaseDurationSeconds
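
A minimal sketch of the kubelet configuration change, assuming you edit /var/lib/kubelet/config.yaml on each node and restart the kubelet afterwards; the heartbeat interval follows as 0.25 * leaseDuration:

      # /var/lib/kubelet/config.yaml (excerpt)
      apiVersion: kubelet.config.k8s.io/v1beta1
      kind: KubeletConfiguration
      # default is 40; with 20, heartbeats are sent every 5 seconds
      nodeLeaseDurationSeconds: 20
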
  16. Controller Node Timeout • The controller defines the node timeout

    period (default: 50 seconds) • Health check interval = 5 seconds • Cluster-wide setting • Cannot be changed in managed Kubernetes services
  17. Controller Configuration Options • It supports two options to

    configure this (flags sketched below): https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/
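
The two options referred to here appear to be the node health check interval and the grace period; a sketch with the default values stated on the slides (verify them against your Kubernetes version, and note that managed services usually do not expose these flags):

      # How often the controller checks node status, and how long it waits
      # before marking a node unhealthy
      kube-controller-manager \
        --node-monitor-period=5s \
        --node-monitor-grace-period=50s
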
  18. Kubelet Lease Update - HTTP/2 • Kubelet updates lease objects

    via HTTP/2 and the total timeout is 45 seconds • HTTP2_PING_TIMEOUT_SECONDS (30 secs) + • HTTP2_READ_IDLE_TIMEOUT_SECONDS (15 secs) • Kubernetes 1.31+ increased the node timeout from 40 to 50 seconds https://github.com/kubernetes/kubernetes/issues/121793
  19. Adjusting Timeout & Heartbeat Loss • Node Health Timeout &

    Considerations • The timeout can be adjusted for faster detection • Be careful of false alarms • Heartbeats can be lost due to • Network issues between the node, API server, etcd, and controller • Kubelet issues • Node issues
  20. Node Health Check Limitations • Health check = message to

    API server (“I’m alive”) • Can it cover all failure cases? • System issues • Hardware issues • Network issues • OS issues
  21. Challenges • (Diagram: the heartbeat path from node hardware (CPU, memory, disk, NIC), the kernel, and the kubelet, across the local area network and load balancer, to the API servers)
  22. Challenges • (Diagram: the same heartbeat path as the previous slide)
  23. Pod Eviction - Stateless Workloads • When will a pod

    be evicted after its node is marked unhealthy? • Kubernetes handles this via two mechanisms • Controller-based configuration • Taint-based configuration
  24. Adjusting Eviction Time - Controller • The controller supports pod-eviction-timeout to

    adjust the default timeout • Default = 300 seconds • Deprecated since v1.27 • https://github.com/kubernetes/website/commit/3a81c94ba8b6ada277bc5e5e44a4e7ce62c2cfa9
  25. Taint & Tolerations - Pod Eviction • Taints and tolerations

    adjust scheduling strategies • An important but often ignored attribute: tolerationSeconds • Only valid with the NoExecute effect • Affects pods already running on the node • Pods that tolerate the taint remain bound for the specified seconds • After tolerationSeconds, the controller evicts the pod from the node
  26. Default Pod Tolerations • Every pod gets the

    following two tolerations by default (sketched below)
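
For reference, these are the tolerations the DefaultTolerationSeconds admission plugin injects into pods that do not declare their own; they are visible with kubectl get pod <name> -o yaml:

      tolerations:
      - key: node.kubernetes.io/not-ready
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 300
      - key: node.kubernetes.io/unreachable
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 300
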
  27. Default Pod Eviction Timing • Pod Eviction Timing for Unhealthy

    Nodes • When a node has had the NotReady or Unreachable taint for 300 seconds • Timing breakdown • 50 seconds -> unhealthy node detection • 300 seconds -> node controller eviction • Total failover time ~= 350 seconds for a stateless pod
  28. Possible Solution • Detect node issues and trigger pod

    redeployment proactively • Drain the node to force pods to redeploy, making them available earlier • Stateless: 350 seconds • Stateful pods: never automatically redeployed • The administrator must intervene; otherwise the pod remains Terminating until the node recovers
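
A sketch of the proactive drain, assuming the failed node is kind-worker; deleting the pod objects lets the scheduler place replacements without waiting out the 300-second toleration:

      kubectl drain kind-worker --ignore-daemonsets --delete-emptydir-data --force
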
  29. Kubernetes Failover - Caveats • Kubernetes supports pod failover during

    node outages • May not meet expectations • Works for stateless, not fully for stateful • NodeNotReady relies on HTTP/2-based API calls • Cannot cover all node-level issues • Some issues affect container performance even if the node is marked healthy • Taint-based configuration affects the default failover time for stateless applications (example below)
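
As an example of the last point, the failover time for a stateless workload can be shortened by overriding tolerationSeconds in the pod template; the 30-second value below is an arbitrary illustration, and setting it too low risks evictions on brief network blips:

      # Deployment pod template excerpt
      spec:
        template:
          spec:
            tolerations:
            - key: node.kubernetes.io/not-ready
              operator: Exists
              effect: NoExecute
              tolerationSeconds: 30
            - key: node.kubernetes.io/unreachable
              operator: Exists
              effect: NoExecute
              tolerationSeconds: 30
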
  30. Stateful Applications - Failover Challenges • However, most StatefulSet applications

    will mount persistent storage • Will that change our expectations?
  31. Case (3) • Node down (Stateful Pod with PVC) •

    docker stop kind-worker • kubectl drain kind-worker --force • How long does it take for the pod to be redeployed? • 1/6/12/18/35/60 mins or others? • (Diagram: KIND 1.34 cluster topology)
  32. Kubernetes Failover - STS + PVC • We have to wait

    at least 6 mins after we drain the node with the force option • Does it matter? • We lose 1 stateful pod for 6+ mins • 50 seconds for node health detection • XXX seconds for draining the node • 6 mins of waiting
  33. Kubernetes Failover - STS + PVC • StatefulSet Volume Unmount

    Issue • Kubernetes cannot determine the mount status if the kubelet is down • By default, the controller waits 6 mins (timeout) • After the timeout • Assumes the volume is safely unmounted • Deletes the VolumeAttachment object • Deletes the StatefulSet pod
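
The VolumeAttachment objects the controller deletes after the timeout can be watched while the failover is in progress:

      # One VolumeAttachment per attached volume/node pair; it disappears once
      # the attach/detach controller gives up after the 6-minute timeout
      kubectl get volumeattachments -w
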
  34. OutOfServiceTaint • OutOfService Taint for statefulset • Feature to enhance

    availability of StatefulSets during node downtime • Stable in K8s 1.28 • 1.28 stable feature: 2268 - non-graceful-shutdown https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2268-non-graceful-shutdown
  35. OutOfServiceTaint • Improving StatefulSet Recovery • Reduce unmount timeout in

    the attachment controller • The StatefulSet recovers faster • Taint the node with a specific key/value to change its behavior https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2268-non-graceful-shutdown
  36. OutOfService Taint • OutOfService Taint • Designed for manual adjustment:

    the admin adds the taint to unhealthy nodes • Workflow leveraging the previous solution (commands sketched below) • Detect node health • Taint the node • Drain the node
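
A sketch of that workflow, assuming the dead node is kind-worker; the taint key and effect are the ones defined by the non-graceful node shutdown feature (KEP 2268):

      # 1. Confirm the node is genuinely down (NotReady and powered off or fenced)
      kubectl get node kind-worker

      # 2. Apply the out-of-service taint so volumes are force-detached without
      #    waiting for the 6-minute unmount timeout
      kubectl taint nodes kind-worker node.kubernetes.io/out-of-service=nodeshutdown:NoExecute

      # 3. Drain, then remove the taint once the node is repaired or removed
      kubectl drain kind-worker --ignore-daemonsets --force
      kubectl taint nodes kind-worker node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-
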
  37. Kubernetes Failover - Caveats • Kubernetes’s self-healing limitations • During a

    node outage, self-healing only covers stateless pods (~350 seconds) • For critical timing, administrators must implement additional tools to restore service faster • Behavior may differ if you use a managed K8s service • Check carefully to ensure you understand how your cluster operates