Ask an OpenShift Expert - Ep 159 - Workload Availability

Red Hat Livestreaming

October 22, 2025

Transcript

  1. Workload Availability for Red Hat OpenShift Virtualization. Andrew Sullivan, Director, Technical Marketing, OpenShift Virtualization.
  2. At-most-one semantics (diagram): a VM runs as a pod on an OpenShift node, backed by an RWX persistent volume, and needs StatefulSet-style at-most-one semantics. Only one instance of a VM can be running at a time ("at-most-one") to avoid data corruption (a sketch follows this slide).
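    As a minimal sketch of this layout, the manifests below show a KubeVirt VirtualMachine whose root disk is an RWX PersistentVolumeClaim; the names, size, and storage details are illustrative assumptions, not values from the deck.
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: example-vm-rootdisk                  # hypothetical PVC name
      spec:
        accessModes:
          - ReadWriteMany                          # RWX: the volume can be attached from any node
        resources:
          requests:
            storage: 30Gi
      ---
      apiVersion: kubevirt.io/v1
      kind: VirtualMachine
      metadata:
        name: example-vm
      spec:
        runStrategy: Always                        # request exactly one running instance ("at-most-one")
        template:
          spec:
            domain:
              devices:
                disks:
                  - name: rootdisk
                    disk:
                      bus: virtio
              memory:
                guest: 2Gi
            volumes:
              - name: rootdisk
                persistentVolumeClaim:
                  claimName: example-vm-rootdisk   # the RWX volume above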
  3. Node failure detection (diagram): the OpenShift control plane marks a node status=Unknown after missing 4 heartbeats x 10 seconds + 10 seconds = 50 seconds. The node must reach a safe state before workload recovery can begin; node failure detection is a prerequisite for remediation.
  4. Workload recovery (diagram): after fencing, once the node has reached a safe state (VM status=Down), workload recovery begins.
  5. Automatic Workload Recovery. Conservative values are configured by default. Simplified timeline:
     t1 (50s): Kubernetes control plane - the node's Ready condition switches to Unknown.
     t2 (300s): Node Health Check - if the Ready=Unknown condition is present for the configured duration, remediation starts.
     t3 (180s): Remediate host - the remediator (SNR or FAR) fences / isolates the node by rebooting it in order to reach a safe state; Remediate API - the remediator deletes resources to enable the rescheduling of the affected workload.
     t4 (15s): Workload restarted - the workload is rescheduled and restarted.
     Total: 50s + 300s + 180s + 15s = 545s. Note: this is a simplified example which does not cover all cases.
  6. OpenShift fault detection and remediation
     • OpenShift can detect node failures and automatically reschedule most workloads; however, virtual machines (VMs) in OpenShift Virtualization are not included in this mechanism because of their PVCs.
       ◦ VMs do not automatically fail over when their host node becomes unhealthy.
     • To enable failover for VMs, the Node Health Check (NHC), Self Node Remediation (SNR), and Fence Agents Remediation (FAR) operators are used (an installation sketch follows this slide).
     Step 1. Fault detection with the Node Health Check (NHC) Operator
     • Marks the node as "Unhealthy" when the configured threshold is reached.
     • By creating a custom resource, it signals that the node needs to be remediated.
     Step 2. Recovery with the Self Node Remediation (SNR) / Fence Agents Remediation (FAR) Operator
     • When the custom resource is detected, the node is restarted.
     • SNR/FAR deletes the VMIs of the virtual machines along with their respective processes.
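    As a minimal sketch of the setup, the operators can be installed from OperatorHub with OLM Subscriptions such as the one below; the channel, package name, and namespace shown are assumptions, so check OperatorHub on the cluster for the exact values, and repeat the pattern for the SNR and FAR operators. An OperatorGroup for the target namespace is also required and is omitted here.
      apiVersion: operators.coreos.com/v1alpha1
      kind: Subscription
      metadata:
        name: node-healthcheck-operator
        namespace: openshift-workload-availability   # assumed install namespace
      spec:
        channel: stable                               # assumed channel name
        name: node-healthcheck-operator               # assumed package name for NHC
        source: redhat-operators
        sourceNamespace: openshift-marketplace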
  7. Remediation Implementations. FAR is faster, SNR is (potentially) safer for workloads.
     Fence Agents Remediation (FAR)
     • Hardware/power based fencing
     • The cluster decides to fence
     • Confirmed when the node has reached the safe state (down)
     • The fence agent power-cycles the node via the BMC API
       ◦ Intelligent Platform Management Interface (IPMI)
       ◦ Redfish
     Self Node Remediation (SNR)
     • Software based fencing
     • The node decides to fence
     • Unconfirmed when the node has reached the safe state (down)
     • Runs as a DaemonSet, one pod per node
     • Can utilize a hardware watchdog
  8. Node HealthCheck Operator (NHC)
     • Node Health Check (NHC) continuously monitors the health of each node. If a node stays in a problematic state for longer than the configured time, NHC marks it as Unhealthy.
     • When a node is marked Unhealthy, NHC creates a custom resource that tells a remediation operator (such as SNR or FAR) which node needs remediation.
     Diagram: NHC watches node-01, node-02, and node-03 via their Ready conditions (status.conditions, type: Ready); nodes whose Ready condition has been "False" or "Unknown" for 30 seconds are considered Unhealthy.
     NodeHealthCheck custom resource example, specifying a custom resource template to remediate using FAR:
       apiVersion: remediation.medik8s.io/v1alpha1
       kind: NodeHealthCheck
       metadata:
         name: nhc-worker-vms
         namespace: openshift-operators
       spec:
         selector:
           matchLabels:
             node-role.kubernetes.io/worker: ""   # target node label
         unhealthyConditions:
           - type: Ready
             status: "Unknown"
             duration: 30s
           - type: Ready
             status: "False"
             duration: 30s
         minHealthy: 51%     # at least this much of the cluster must be healthy or do not continue
         maxUnhealthy: 1     # if more than this number of nodes is unhealthy, do not continue
                             # (use minHealthy or maxUnhealthy: one or the other, not both)
         remediationTemplate:
           apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
           kind: FenceAgentsRemediationTemplate
           name: fence-ipmilan-template
  9. Escalating Remediations
     • Escalating remediations allow the administrator to define multiple potential remediation actions.
       ◦ For example, attempt FAR first and, if it does not succeed, attempt SNR.
     • Allows the fastest method first (FAR) followed by a failsafe method (SNR) to force the workload to be rescheduled.
     Escalating remediations example:
       escalatingRemediations:
         - remediationTemplate:
             apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
             kind: FenceAgentsRemediationTemplate
             namespace: openshift-workload-availability
             name: far-template-fence-ipmilan
           order: 1
           timeout: 300s
         - remediationTemplate:
             apiVersion: self-node-remediation.medik8s.io/v1alpha1
             kind: SelfNodeRemediationTemplate
             namespace: openshift-workload-availability
             name: snr-template-outofservicetaint
           order: 2
           timeout: 30m
  10. Self Node Remediation Operator (SNR)
     • Self Node Remediation (SNR) runs an agent on every node, which can perform a soft reboot if the node becomes unhealthy.
     • The SNR agent communicates with the API server and peer nodes to confirm whether its own node is truly unhealthy. If confirmed, the agent reboots the node. At the same time, the peer nodes delete the VMIs of workloads that were on the unhealthy node.
     • This approach prioritizes safety and accuracy, reducing the risk of accidental or unnecessary remediation.
     How it works:
     • When the Node Health Check (NHC) creates an SNR custom resource (see the sketch after this slide), the API server notifies peer nodes.
       ◦ Each node regularly checks its health with the API server; if that fails, it double-checks with a peer node.
     • If the node confirms it is unhealthy, it restarts its own operating system (using a software watchdog).
     • Peer nodes then notify the API server to delete the VMIs from the unhealthy node.
       ◦ Because the unhealthy node is marked unschedulable, the VMs are scheduled to another node.
     • If a node cannot confirm its health with either the API server or a peer node, it will also restart.
     • Even after recovery, the node does not automatically fail back; workloads stay on the new node until moved, manually or via the descheduler.
     Diagram: ① node-03 is Unhealthy (NHC creates the SNR custom resource) ② node-03 is Unhealthy (peers are notified) ③ "Am I healthy?" ④ "No" ⑤ soft reboot ⑥ VMI deletion.
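    For illustration, the SNR custom resource that NHC creates in step ① might look like the sketch below; the node name is hypothetical, and in practice NHC generates the CR from the SelfNodeRemediationTemplate and names it after the unhealthy node.
      apiVersion: self-node-remediation.medik8s.io/v1alpha1
      kind: SelfNodeRemediation
      metadata:
        name: node-03                                # named after the unhealthy node
        namespace: openshift-workload-availability   # assumed: same namespace as the template
      spec:
        remediationStrategy: OutOfServiceTaint       # copied from the template's spec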
  11. Self Node Remediation Flows
     • If the failed node can communicate with the API server
     • If the failed node cannot communicate with the API server
     • If the failed node cannot communicate with either the API server or peer nodes
     Medik8s: How Self Node Remediation Works? https://www.medik8s.io/remediation/self-node-remediation/how-it-works/
  12. Example Self Node Remediation Template
       apiVersion: self-node-remediation.medik8s.io/v1alpha1
       kind: SelfNodeRemediationTemplate
       metadata:
         name: self-node-remediation-outofservicetaint-strategy-template
         namespace: openshift-workload-availability
       spec:
         template:
           spec:
             remediationStrategy: OutOfServiceTaint
  13. Example Self Node Remediation Config
       apiVersion: self-node-remediation.medik8s.io/v1alpha1
       kind: SelfNodeRemediationConfig
       metadata:
         name: self-node-remediation-config
         namespace: openshift-workload-availability
       spec:
         safeTimeToAssumeNodeRebootedSeconds: 180
         watchdogFilePath: /dev/watchdog
         isSoftwareRebootEnabled: true
         apiServerTimeout: 15s
         apiCheckInterval: 5s
         maxApiErrorThreshold: 3
         peerApiServerTimeout: 5s
         peerDialTimeout: 5s
         peerRequestTimeout: 5s
         peerUpdateInterval: 15m
         hostPort: 30001
  14. Example Node Health Check with Self Node Remediation
       apiVersion: remediation.medik8s.io/v1alpha1
       kind: NodeHealthCheck
       metadata:
         name: nhc-worker-self
       spec:
         minHealthy: 90%
         remediationTemplate:
           apiVersion: self-node-remediation.medik8s.io/v1alpha1
           kind: SelfNodeRemediationTemplate
           name: self-node-remediation-outofservicetaint-strategy-template
           namespace: openshift-workload-availability
         selector:
           matchExpressions:
             - key: node-role.kubernetes.io/control-plane
               operator: DoesNotExist
             - key: node-role.kubernetes.io/master
               operator: DoesNotExist
         unhealthyConditions:
           - duration: 60s
             status: 'False'
             type: Ready
           - duration: 60s
             status: 'Unknown'
             type: Ready
  15. Fence Agents Remediation Operator (FAR)
     • Fence Agents Remediation (FAR) uses a fence agent to isolate an unhealthy node and trigger a hard reboot.
     • In OpenShift Virtualization, this is typically done through a fence agent that communicates with the node's BMC (Baseboard Management Controller), using standards like IPMI or Redfish to send the reboot command.
     How it works:
     • When the Node Health Check (NHC) creates a FAR custom resource (see the sketch after this slide), the FAR controller immediately detects it and begins remediation.
     • The FAR controller fences the affected node, forcing a hard reboot.
       ◦ At the same time, the API server deletes the Virtual Machine Instances (VMIs) running on the unhealthy node.
     • Because the node is tainted, the new VMI is scheduled onto another node, and the virtual machine starts there.
     • If the original node later recovers, workloads do not automatically fail back; they remain running on the new node until moved, manually or via the descheduler.
     Diagram: ① node-03 is Unhealthy (NHC creates a FAR custom resource) ② FAR custom resource discovery ③ fence ④ hard reboot (power-cycle restart via the BMC) and VMI deletion.
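    For illustration, the FAR custom resource that NHC creates in step ① might look like the sketch below; the node name and BMC parameters are hypothetical, and in practice NHC copies the spec from the FenceAgentsRemediationTemplate and names the CR after the unhealthy node.
      apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
      kind: FenceAgentsRemediation
      metadata:
        name: node-03                                # named after the unhealthy node
        namespace: openshift-workload-availability   # assumed: same namespace as the template
      spec:
        agent: fence_ipmilan                         # fence agent that talks to the BMC over IPMI
        sharedparameters:
          '--action': reboot
          '--username': admin                        # placeholder BMC credentials
          '--password': password
        nodeparameters:
          '--ip':
            node-03: 192.168.123.3                   # hypothetical BMC address for the node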
  16. Fence Agents Remediation Flows
     Medik8s: How does Fence Agents Remediation work? https://www.medik8s.io/remediation/fence-agents-remediation/how-it-works/#how-does-fence-agents-remediation-work
     • When a node is marked unhealthy by the Node Health Check (NHC), a Fence Agents Remediation (FAR) custom resource (CR) is created. This triggers the remediation process, regardless of the node's current status.
     • The FAR controller works in an event-driven way: as soon as a FAR CR appears, it immediately initiates the fencing action. It does not perform additional health checks on the node before acting.
     • At the same time, the FAR controller applies an out-of-service taint to the unhealthy node. This forces workloads to be evicted and rescheduled onto healthy nodes, minimizing disruption.
     • Once the fencing and recovery steps are completed, the FAR CR is automatically deleted, signaling that remediation has finished.
     Sequence diagram: NHC creates the FAR CR; the FAR controller fences node A via the API server (reboot), taints the node, and node A's workloads are deleted and rescheduled onto healthy nodes; node A is then restored.
  17. Example Fence Agents Remediation Template
       apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
       kind: FenceAgentsRemediationTemplate
       metadata:
         name: far-template-fence-ipmilan
         namespace: openshift-workload-availability
       spec:
         template:
           spec:
             agent: fence_ipmilan
             nodeparameters:
               '--ip':
                 worker1: 192.168.123.1
                 worker2: 192.168.123.2
                 worker3: 192.168.123.3
             sharedparameters:
               '--action': reboot
               '--password': password
               '--username': admin
             retryCount: '5'
             retryInterval: '5s'
             timeout: '60s'
  18. Example Node Health Check with Fence Agents Remediation
       apiVersion: remediation.medik8s.io/v1alpha1
       kind: NodeHealthCheck
       metadata:
         name: nhc-worker-self
       spec:
         minHealthy: 90%
         remediationTemplate:
           apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
           kind: FenceAgentsRemediationTemplate
           name: far-template-fence-ipmilan
           namespace: openshift-workload-availability
         selector:
           matchExpressions:
             - key: node-role.kubernetes.io/worker
               operator: Exists
         unhealthyConditions:
           - duration: 60s
             status: 'False'
             type: Ready
           - duration: 60s
             status: 'Unknown'
             type: Ready
  19. Engineering benchmarks with pre-release code (chart, in seconds): VM recovery times are shorter with FAR.
     - OpenShift 4.14 on 6-node bare metal, average of best cases.
     - The times will vary depending on the hardware.
     - OpenShift Virtualization 4.16 is still under development.
  20. VM recovery time: Custom Aggressive (FAR) Configuration Example
     Phase                    Component    NHC Defaults   Aggressive FAR
     t4: Workload restarted   KubeVirt     15s            15s
     t3: Remediate Host       Medik8s      180s           16s
     t2: Medik8s HC           Medik8s      300s           1s
     t1: Health Check         Kubernetes   50s            50s
     VM recovery time                      545s           82s
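    The table suggests the aggressive profile mostly compresses t2, the NHC detection window, from 300s to about 1s. A minimal sketch of what such a NodeHealthCheck could look like follows, assuming FAR remediation with the fence_ipmilan template from the earlier example; the 1s durations and the resource name are illustrative assumptions, not a recommendation, since values this aggressive increase the risk of fencing nodes on transient problems.
      apiVersion: remediation.medik8s.io/v1alpha1
      kind: NodeHealthCheck
      metadata:
        name: nhc-worker-aggressive                  # hypothetical name
      spec:
        minHealthy: 90%
        selector:
          matchExpressions:
            - key: node-role.kubernetes.io/worker
              operator: Exists
        unhealthyConditions:
          - type: Ready
            status: 'Unknown'
            duration: 1s                             # t2: 300s default shortened to ~1s
          - type: Ready
            status: 'False'
            duration: 1s
        remediationTemplate:
          apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
          kind: FenceAgentsRemediationTemplate
          name: far-template-fence-ipmilan
          namespace: openshift-workload-availability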
  21. Self Node Remediation vs Fence Agents Remediation
     Fence Agents Remediation (FAR)
     + FAR has direct feedback from the API
     + Robustness
     + Speed: the reboot is acked by the API
     - Access to the BMC via the network is required
     - Increases configuration complexity
     - The BMC being unresponsive can delay recovery
     - The FAR agent pod being affected by node failure can delay recovery
     Self Node Remediation (SNR)
     + Does not require any management interface; a fallback if FAR is not remediating
     - No quick feedback about the reboot request
     - No ack; assume that the node is rebooting, with buffer time