
Kubernetes self-healing of your workload


The presentation slides by Hung-Wei Chiu provide an in-depth operational discussion of Kubernetes self-healing and failover design, focusing on the differences between stateless and stateful applications. The deck explains that stateless pods will eventually be redeployed after a node failure, with a total recovery time of roughly 350 seconds by default, determined by node health detection (50 seconds) plus the pod eviction timeout (300 seconds). Stateful pods, by contrast, are not automatically redeployed after a node failure and require manual intervention because of challenges such as persistent volume unmounting. Key failover factors discussed include the time to declare a node unhealthy, driven by kubelet Lease objects, and the time until pods are evicted, driven by taints and tolerations. The deck also introduces the OutOfServiceTaint as a manual way to accelerate StatefulSet recovery by reducing the default six-minute volume unmount timeout during node downtime.


Hung-Wei Chiu

October 22, 2025



Transcript

  1. Preface • I recently shared a talk on operating large-scale

    Kubernetes clusters at the TSMC IT event. • https://www.youtube.com/playlist?list=PLT3USJy3vydAu1XUGO5dY30gd2RBw1QgT • That session covered several challenges encountered when clusters and workloads scale up to a massive level. • Today, we’ll focus on Kubernetes failover design — understanding its implementation can greatly simplify our daily operations and make Kubernetes more predictable in production.
  2. Case (1) • Node down (Stateless Pod) • docker stop

    kind-worker • How long does it take for the pod to be redeployed? • 1/6/12/18/35 mins? • (Diagram: KIND 1.34 cluster with a Deployment-managed pod running on the worker)
  3. Case (2) • Node down (Stateful Pod without PVC) •

    docker stop kind-worker • How long does it take for the pod to be redeployed? • 1/6/12/18/35/60 mins or others? • (Diagram: KIND 1.34 cluster topology)
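
A minimal way to reproduce cases (1) and (2), assuming a KIND cluster whose worker node is named kind-worker, as on the slides:

      # Simulate a node outage by stopping the KIND worker container
      docker stop kind-worker

      # Watch the node flip from Ready to NotReady (roughly 50 seconds by default)
      kubectl get nodes -w

      # Watch how long it takes for the replacement pod to show up
      kubectl get pods -o wide -w
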
  4. Kubernetes • Stateless application • Pod will eventually be redeployed

    • Is a ~6-minute redeploy time acceptable? • Stateful application • Pods aren’t redeployed automatically • Requires intervention -> unacceptable for cluster operations
  5. Kubernetes - Stateful • Stateful application • Each pod has

    a unique, persistent name • web-0 1/1 Terminating 0 56m • The pod isn’t redeployed until the existing one is deleted • During a node outage, the existing pod’s status may be inaccessible
  6. Kubernetes • (Diagram: two Kubernetes nodes with kubelets and the controller; the controller confirms the existing stateful pod on the failed node is terminated, then deploys the new pod with the same name on the healthy node)
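
A sketch of the manual intervention the deck alludes to, using the pod name web-0 from slide 5; force-deleting is only safe once you are sure the node is really down, otherwise two copies of the pod could run at once:

      # Remove the stuck pod object so the StatefulSet controller can recreate
      # a pod with the same name on a healthy node
      kubectl delete pod web-0 --force --grace-period=0
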
  7. Kubernetes Self-Healing • As a Kubernetes administrator • Understand Kubernetes

    internals to adjust and implement solutions for your environment • Set clear expectations of Kubernetes behavior
  8. K8s Failover - Self-Healing • Kubernetes self-healing(failover) behavior • Does

    failover meet your expectations? • 5 mins for stateless workloads • Never for stateful workloads • Options • Stateless apps -> handle with replication • Stateful apps -> SREs must monitor and fix ASAP • Is this realistic?
  9. Failover Key Factor • Critical Factor: Time to Restore Service

    • Determined by two main factors • When to declare a node unhealthy • When pods are evicted to healthy nodes
  10. When To Declare a Node Unhealthy • Detecting Node Outages

    in Distributed Systems • A node may experience issues such as • Agent failure (kubelet, kube-proxy) • Network issues • System/OS issues • Hardware issues • Some node issues impact containers directly, while others do not
  11. Node Health Detection • Node Health Monitoring in Kubernetes •

    It’s impossible to monitor and alert for all possible issues • Kubernetes relies on the kubelet reporting node health to the API server to detect problems • (Diagram: kubelets reporting to the API server)
  12. Kubelet Reporting & Node Health • Kubelet Node reporting •

    Kubelet reports to the API server periodically • Node health status -> every 10 seconds • System/workload information -> every 5 mins with conditions • CPU, disk, memory pressure, etc. • The controller checks health periodically • A node is marked unhealthy if its last heartbeat exceeds the defined period
  13. Node Heartbeat - Lease Object • Check heartbeat in Kubernetes

    via Lease objects in the kube-node-lease namespace • One Lease per node • Updated every 10 seconds by default
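
The per-node heartbeat can be inspected directly; for example, assuming a node named kind-worker:

      # One Lease object per node lives in the kube-node-lease namespace
      kubectl get leases -n kube-node-lease

      # renewTime is the last heartbeat; it should advance about every 10 seconds
      kubectl get lease kind-worker -n kube-node-lease -o yaml
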
  14. Heartbeat Update Interval • Update interval = 0.25 * leaseDuration

    (from the kubelet config) • Default leaseDuration = 40 seconds • Update interval = 40 * 0.25 = 10 seconds https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet.go#L862-L873?WT.mc_id=AZ-MVP-5003331
  15. Adjusting Lease Duration • Lease duration can be adjusted via

    • /var/lib/kubelet/config.yaml • nodeLeaseDurationSeconds
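
A minimal sketch of the kubelet configuration change, assuming you edit /var/lib/kubelet/config.yaml on each node and restart the kubelet afterwards; the heartbeat interval follows as 0.25 * leaseDuration:

      # /var/lib/kubelet/config.yaml (excerpt)
      apiVersion: kubelet.config.k8s.io/v1beta1
      kind: KubeletConfiguration
      # default is 40; with 20, heartbeats are sent every 5 seconds
      nodeLeaseDurationSeconds: 20
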
  16. Controller Node Timeout • The controller defines the node timeout

    period (default: 50 seconds) • Health check interval = 5 seconds • Cluster-wide setting • Cannot be changed in managed Kubernetes services
  17. Controller Configuration Options • It supports two options to

    configure this (flags sketched below): https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/
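
The two options referred to here appear to be the node health check interval and the grace period; a sketch with the default values stated on the slides (verify them against your Kubernetes version, and note that managed services usually do not expose these flags):

      # How often the controller checks node status, and how long it waits
      # before marking a node unhealthy
      kube-controller-manager \
        --node-monitor-period=5s \
        --node-monitor-grace-period=50s
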
  18. Kubelet Lease Update - HTTP/2 • Kubelet updates lease objects

    via HTTP/2 and the total timeout is 45 seconds • HTTP2_PING_TIMEOUT_SECONDS (30 secs) + • HTTP2_READ_IDLE_TIMEOUT_SECONDS (15 secs) • Kubernetes 1.31+ increased the node timeout from 40 to 50 seconds https://github.com/kubernetes/kubernetes/issues/121793
  19. Adjusting Timeout & Heartbeat Loss • Node Health Timeout &

    Considerations • The timeout can be adjusted for faster detection • Be careful of false alarms • Heartbeats can be lost due to • Network issues between the node, API server, etcd, and controller • Kubelet issues • Node issues
  20. Node Health Check Limitations • Health check = message to

    API server (“I’m alive”) • Can it cover all failure cases? • System issues • Hardware issues • Network issues • OS issues
  21. Challenges • (Diagram: the heartbeat path from node hardware (CPU, memory, disk, NIC), the kernel, and the kubelet, across the local area network and load balancer, to the API servers)
  22. Challenges • (Diagram: the same heartbeat path as the previous slide)
  23. Pod Eviction - Stateless Workloads • When will a pod

    be evicted after its node is marked unhealthy? • Kubernetes handles this via two mechanisms • Controller-based configuration • Taint-based configuration
  24. Adjusting Eviction Time - Controller • The controller supports pod-eviction-timeout to

    adjust the default timeout • Default = 300 seconds • Deprecated since v1.27 • https://github.com/kubernetes/website/commit/3a81c94ba8b6ada277bc5e5e44a4e7ce62c2cfa9
  25. Taint & Tolerations - Pod Eviction • Taints and tolerations

    adjust scheduling strategies • An important but often ignored attribute: tolerationSeconds • Only valid with the NoExecute effect • Affects pods already running on the node • Pods that tolerate the taint remain bound for the specified seconds • After tolerationSeconds, the controller evicts the pod from the node
  26. Default Pod Tolerations • Every pod gets the

    following two tolerations by default (sketched below)
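
For reference, these are the tolerations the DefaultTolerationSeconds admission plugin injects into pods that do not declare their own; they are visible with kubectl get pod <name> -o yaml:

      tolerations:
      - key: node.kubernetes.io/not-ready
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 300
      - key: node.kubernetes.io/unreachable
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 300
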
  27. Default Pod Eviction Timing • Pod Eviction Timing for Unhealthy

    Nodes • When a node has had the NotReady or Unreachable taint for 300 seconds • Timing breakdown • 50 seconds -> unhealthy node detection • 300 seconds -> node controller eviction • Total failover time ~= 350 seconds for a stateless pod
  28. Possible Solution • Detect node issues and trigger pod

    redeployment proactively • Drain the node to force pods to redeploy, making them available earlier • Stateless: 350 seconds • Stateful pods: never automatically redeployed • The administrator must intervene; otherwise the pod remains Terminating until the node recovers
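
A sketch of the proactive drain, assuming the failed node is kind-worker; deleting the pod objects lets the scheduler place replacements without waiting out the 300-second toleration:

      kubectl drain kind-worker --ignore-daemonsets --delete-emptydir-data --force
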
  29. Kubernetes Failover - Caveats • Kubernetes supports pod failover during

    node outages • May not meet expectations • Works for stateless, not fully for stateful • NodeNotReady relies on HTTP/2-based API calls • Cannot cover all node-level issues • Some issues affect container performance even if the node is marked healthy • Taint-based configuration affects the default failover time for stateless applications (example below)
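
As an example of the last point, the failover time for a stateless workload can be shortened by overriding tolerationSeconds in the pod template; the 30-second value below is an arbitrary illustration, and setting it too low risks evictions on brief network blips:

      # Deployment pod template excerpt
      spec:
        template:
          spec:
            tolerations:
            - key: node.kubernetes.io/not-ready
              operator: Exists
              effect: NoExecute
              tolerationSeconds: 30
            - key: node.kubernetes.io/unreachable
              operator: Exists
              effect: NoExecute
              tolerationSeconds: 30
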
  30. Stateful Applications - Failover Challenges • However, most StatefulSet applications

    will mount persistent storage • Will that change our expectations?
  31. Case (3) • Node down (Stateful Pod with PVC) •

    docker stop kind-worker • kubectl drain kind-worker --force • How long does it take for the pod to be redeployed? • 1/6/12/18/35/60 mins or others? • (Diagram: KIND 1.34 cluster topology)
  32. Kubernetes Failover - STS + PVC • We have to wait

    at least 6 mins after we drain the node with the force option • Does it matter? • We lose 1 stateful pod for 6+ mins • 50 seconds for node health detection • XXX seconds for draining the node • 6 mins of waiting
  33. Kubernetes Failover - STS + PVC • StatefulSet Volume Unmount

    Issue • Kubernetes cannot determine the mount status if the kubelet is down • By default, the controller waits 6 mins (timeout) • After the timeout • Assumes the volume is safely unmounted • Deletes the VolumeAttachment object • Deletes the StatefulSet pod
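
The VolumeAttachment objects the controller deletes after the timeout can be watched while the failover is in progress:

      # One VolumeAttachment per attached volume/node pair; it disappears once
      # the attach/detach controller gives up after the 6-minute timeout
      kubectl get volumeattachments -w
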
  34. OutOfServiceTaint • OutOfService Taint for statefulset • Feature to enhance

    availability of StatefulSets during node downtime • Stable in K8s 1.28 • 1.28 stable feature: 2268 - non-graceful-shutdown https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2268-non-graceful-shutdown
  35. OutOfServiceTaint • Improving StatefulSet Recovery • Reduce unmount timeout in

    the attachment controller • The StatefulSet recovers faster • Taint the node with a specific key/value to change its behavior https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2268-non-graceful-shutdown
  36. OutOfService Taint • OutOfService Taint • Designed for manual adjustment:

    the admin adds the taint to unhealthy nodes • Workflow leveraging the previous solution (commands sketched below) • Detect node health • Taint the node • Drain the node
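
A sketch of that workflow, assuming the dead node is kind-worker; the taint key and effect are the ones defined by the non-graceful node shutdown feature (KEP 2268):

      # 1. Confirm the node is genuinely down (NotReady and powered off or fenced)
      kubectl get node kind-worker

      # 2. Apply the out-of-service taint so volumes are force-detached without
      #    waiting for the 6-minute unmount timeout
      kubectl taint nodes kind-worker node.kubernetes.io/out-of-service=nodeshutdown:NoExecute

      # 3. Drain, then remove the taint once the node is repaired or removed
      kubectl drain kind-worker --ignore-daemonsets --force
      kubectl taint nodes kind-worker node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-
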
  37. Kubernetes Failover - Caveats • Kubernetes’s self-healing limitations • During a

    node outage, self-healing only covers stateless pods (~350 seconds) • For critical timing, administrators must implement additional tools to restore service faster • Behavior may differ if you use a managed K8s service • Check carefully to ensure you understand how your cluster operates