Ask an OpenShift Expert - Ep 159 - Workload Availability

1 Workload Availability for Red Hat OpenShift Virtualization
Andrew Sullivan
Director, Technical Marketing OpenShift Virtualization

2 Stateful Set At-Most-One Semantics
[Diagram: two OpenShift nodes, a VM pod, and an RWX persistent volume]
Only one instance of a VM can be running at a time ("at-most-one") to avoid data corruption.

3 Node failure detection is a prerequisite for remediation
[Diagram: OpenShift control plane and two OpenShift nodes; the node running the VM reports status=Unknown]
• Missing 4 heartbeats x 10 seconds + 10 seconds = 50 seconds
• The node must reach a safe state before workload recovery can begin.
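The detection delay on this slide is simple arithmetic, and a minimal sketch makes it easy to see how it would change with other values. The function and parameter names below are hypothetical and only mirror the numbers shown on the slide; they are not Kubernetes or OpenShift settings.

```python
def detection_time_seconds(missed_heartbeats: int = 4,
                           heartbeat_interval_s: int = 10,
                           extra_delay_s: int = 10) -> int:
    """Seconds until the node is reported as status=Unknown.

    Defaults mirror the slide: 4 missed heartbeats x 10 seconds + 10 seconds.
    """
    return missed_heartbeats * heartbeat_interval_s + extra_delay_s


print(detection_time_seconds())  # 50
```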

4 After fencing, workload recovery begins
[Diagram: OpenShift control plane, two OpenShift nodes, VM, status=Down]
Node has reached a safe state, workload recovery begins.

5 Automatic Workload Recovery
Conservative values are configured by default.
• Kubernetes Control Plane: the node condition switches to Ready=Unknown.
• Node Health Check: if the Ready=Unknown condition is present for a configured duration, then start remediation.
• SNR Remediate Host: the remediator fences / isolates the node by rebooting it in order to reach a safe state (SNR or FAR).
• SNR Remediate API: the remediator deletes resources to enable the rescheduling of the affected workload.
• Workload restarted: the workload is rescheduled and restarted.
Timeline markers t1 to t4; durations 50s, 300s, 180s, 15s; total 545s.
Note: This is a simplified example which does not cover all cases.

6 OpenShift fault detection and remediation
• OpenShift can detect node failures and automatically reschedule most workloads; however, virtual machines (VMs) in OpenShift Virtualization are not included in this mechanism as a result of their PVCs.
◦ VMs do not automatically fail over when their host node becomes unhealthy.
• To enable failover for VMs, Node Health Check (NHC), Self Node Remediation (SNR), and Fence Agents Remediation (FAR) are used.
Step 1. Fault detection with the Node Health Check (NHC) Operator
• NHC marks the node as "Unhealthy" when the set threshold is reached.
• NHC creates a custom resource to signal that the node needs to be remediated.
Step 2. Remediation using the Self Node Remediation (SNR) Operator / Fence Agents Remediation (FAR) Operator
• When the custom resource is detected, the node is restarted.
• SNR/FAR deletes the VMIs of the affected virtual machines as part of their respective processes.

7 Remediation Implementations
FAR is faster, SNR is (potentially) safer for workloads.
Fence Agents Remediation (FAR)
• Hardware/Power based fencing
• Cluster decides to fence
• Confirmed when the node has reached the safe state (down)
• fence-agent is power-cycling via BMC API
◦ Intelligent Platform Management Interface (IPMI)
◦ Redfish
Self Node Remediation (SNR)
• Software based fencing
• Node decides to fence
• Unconfirmed when node has reached the safe state (down)
• Runs as a DaemonSet: one pod per node
• Can utilize hardware watchdog

8 Node Health Check

9 Node HealthCheck Operator (NHC)
• Node Health Check (NHC) continuously monitors the health of each node. If a node stays in a problematic state for longer than the configured time, NHC marks it as Unhealthy.
• When a node is marked Unhealthy, NHC creates a custom resource that tells a Remediation Operator (such as SNR or FAR) which node needs remediation.
[Diagram: NHC watching node-01, node-02, and node-03 (running VMs); one node reports Ready status "False", the other two report Ready status "True"]
Nodes that have been "False" or "Unknown" for 30 seconds are considered Unhealthy.
NodeHealthCheck custom resource example (specifying a custom resource template to remediate using FAR):
    apiVersion: remediation.medik8s.io/v1alpha1
    kind: NodeHealthCheck
    metadata:
      name: nhc-worker-vms
      namespace: openshift-operators
    spec:
      selector:
        matchLabels:
          node-role.kubernetes.io/worker: ""   # Target node label
      unhealthyConditions:
        - type: Ready
          status: "Unknown"
          duration: 30s
        - type: Ready
          status: "False"
          duration: 30s
      # Set minHealthy or maxUnhealthy: one or the other, not both.
      minHealthy: 51%    # At least this much of the cluster must be healthy, or do not continue
      maxUnhealthy: 1    # If more than this number of nodes is unhealthy, do not continue
      remediationTemplate:
        apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
        kind: FenceAgentsRemediationTemplate
        name: fence-ipmilan-template
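The minHealthy guard can be read as a simple gate: remediation only proceeds while enough of the selected nodes are still healthy. The sketch below is an illustration of that idea only. The function name is hypothetical, it ignores maxUnhealthy, and the way the operator itself rounds percentages is not covered by the slide, so the integer comparison here is an assumption.

```python
def remediation_allowed(total_nodes: int,
                        unhealthy_nodes: int,
                        min_healthy_percent: int) -> bool:
    """Return True if the healthy share of nodes meets the minHealthy percentage.

    Uses integer arithmetic to avoid floating point rounding surprises.
    """
    healthy_nodes = total_nodes - unhealthy_nodes
    return healthy_nodes * 100 >= min_healthy_percent * total_nodes


# With minHealthy: 51% on a 6 node cluster:
print(remediation_allowed(6, 1, 51))  # True: 5 of 6 nodes are healthy
print(remediation_allowed(6, 3, 51))  # False: only 3 of 6 nodes are healthy
```

A gate like this keeps a cluster-wide problem, such as a network outage, from triggering reboots of most of the nodes at once.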

10 Escalating Remediations
• Escalating remediations allow the administrator to define multiple potential remediation actions
◦ For example, attempt FAR first and, if no success, attempt SNR
• Allows for the fastest method first (FAR), followed by a failsafe method (SNR) to force the workload to be rescheduled
Escalating remediations example:
    escalatingRemediations:
      - remediationTemplate:
          apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
          kind: FenceAgentsRemediationTemplate
          namespace: openshift-workload-availability
          name: far-template-fence-ipmilan
        order: 1
        timeout: 300s
      - remediationTemplate:
          apiVersion: self-node-remediation.medik8s.io/v1alpha1
          kind: SelfNodeRemediationTemplate
          namespace: openshift-workload-availability
          name: snr-template-OutOfServiceTaint
        order: 2
        timeout: 30m
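The ordering and timeout fields above suggest a simple loop: try each remediation in ascending order and move on when one does not succeed within its timeout. The following is a minimal simulation of that idea, not the operator's implementation. The Remediation class, the attempt callback, and the simulated durations are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Remediation:
    name: str
    order: int
    timeout_s: int
    # Hypothetical stand-in for a real remediator: returns the seconds it took
    # to bring the node to a safe state, or None if it never succeeded.
    attempt: Callable[[], Optional[int]]


def escalate(remediations: List[Remediation]) -> Optional[str]:
    """Try remediations in ascending order; return the first that succeeds in time."""
    for remediation in sorted(remediations, key=lambda r: r.order):
        took = remediation.attempt()
        if took is not None and took <= remediation.timeout_s:
            return remediation.name
    return None


# Simulated scenario: the BMC is unreachable, so FAR never succeeds and SNR takes over.
plan = [
    Remediation("snr-template-OutOfServiceTaint", 2, 30 * 60, lambda: 180),
    Remediation("far-template-fence-ipmilan", 1, 300, lambda: None),
]
print(escalate(plan))  # snr-template-OutOfServiceTaint
```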

11 Self Node Remediation

12 Self Node Remediation Operator (SNR)
• Self Node Remediation (SNR) runs an agent on every node, which can perform a soft reboot if the node becomes unhealthy.
• The SNR agent communicates with the API server and peer nodes to confirm whether its own node is truly unhealthy. If confirmed, the agent reboots the node. At the same time, the peer nodes delete the VMIs of workloads that were on the unhealthy node.
• This approach prioritizes safety and accuracy, reducing the risk of accidental or unnecessary remediation.
• When the Node Health Check (NHC) creates an SNR custom resource, the API server notifies peer nodes.
◦ Each node regularly checks its health with the API server; if that fails, it double-checks with a peer node.
• If the node confirms it is unhealthy, it restarts its own operating system (using a software watchdog).
• Peer nodes then notify the API server to delete the VMIs from the unhealthy node.
◦ Because the unhealthy node is marked unschedulable, the VMs are scheduled to another node.
• If a node cannot confirm its health with either the API server or a peer node, it will also restart.
• Even after recovery, the node does not automatically fail back; workloads stay on the new node until moved, manually or via descheduler.
[Diagram: NHC, API server, and node-01, node-02, node-03, each running an agent; node-03 hosts the VMs]
① NHC: node-03 is Unhealthy (create SNR custom resource)
② API server: node-03 is Unhealthy
③ Agent: Am I healthy?
④ No
⑤ Soft reboot
⑥ VMI deletion
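The agent's self-check described on this slide can be summarized as a small decision function. This is a simplified sketch of the three cases listed (API server reachable, only a peer reachable, nothing reachable), not the SNR agent's code. The function name and the use of None for "could not be reached" are assumptions made for illustration.

```python
from typing import Optional


def should_reboot(api_says_unhealthy: Optional[bool],
                  peer_says_unhealthy: Optional[bool]) -> bool:
    """Decide whether the node should reboot itself.

    Each argument is True/False when that source answered, or None when it
    could not be reached.
    """
    if api_says_unhealthy is not None:
        # The API server answered: trust its view of this node.
        return api_says_unhealthy
    if peer_says_unhealthy is not None:
        # API server unreachable: double-check with a peer node.
        return peer_says_unhealthy
    # Neither the API server nor a peer could confirm health: reboot to be safe.
    return True


print(should_reboot(False, None))  # False: API server reports the node healthy
print(should_reboot(None, True))   # True: a peer confirms the node is unhealthy
print(should_reboot(None, None))   # True: the node is isolated
```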

13 Self Node Remediation Flows
• If the failed node can communicate with the API server
• If the failed node cannot communicate with the API server
• If the failed node cannot communicate with either the API server or peer nodes
Medik8s: How Self Node Remediation Works?
https://www.medik8s.io/remediation/self-node-remediation/how-it-works/

14 Example Self Node Remediation Template
    apiVersion: self-node-remediation.medik8s.io/v1alpha1
    kind: SelfNodeRemediationTemplate
    metadata:
      name: self-node-remediation-outofservicetaint-strategy-template
      namespace: openshift-workload-availability
    spec:
      template:
        spec:
          remediationStrategy: OutOfServiceTaint

15 Example Self Node Remediation Config
    apiVersion: self-node-remediation.medik8s.io/v1alpha1
    kind: SelfNodeRemediationConfig
    metadata:
      name: self-node-remediation-config
      namespace: openshift-workload-availability
    spec:
      safeTimeToAssumeNodeRebootedSeconds: 180
      watchdogFilePath: /dev/watchdog
      isSoftwareRebootEnabled: true
      apiServerTimeout: 15s
      apiCheckInterval: 5s
      maxApiErrorThreshold: 3
      peerApiServerTimeout: 5s
      peerDialTimeout: 5s
      peerRequestTimeout: 5s
      peerUpdateInterval: 15m
      hostPort: 30001

16 Example Node Health Check with Self Node Remediation
    apiVersion: remediation.medik8s.io/v1alpha1
    kind: NodeHealthCheck
    metadata:
      name: nhc-worker-self
    spec:
      minHealthy: 90%
      remediationTemplate:
        apiVersion: self-node-remediation.medik8s.io/v1alpha1
        kind: SelfNodeRemediationTemplate
        name: self-node-remediation-outofservicetaint-strategy-template
        namespace: openshift-workload-availability
      selector:
        matchExpressions:
          - key: node-role.kubernetes.io/control-plane
            operator: DoesNotExist
          - key: node-role.kubernetes.io/master
            operator: DoesNotExist
      unhealthyConditions:
        - duration: 60s
          status: 'False'
          type: Ready
        - duration: 60s
          status: 'Unknown'
          type: Ready

17 Fence Agents Remediation

18 Fence Agents Remediation Operator (FAR)
• Fence Agents Remediation (FAR) uses a Fence Agent to isolate an unhealthy node and trigger a hard reboot.
• In OpenShift Virtualization, this is typically done through a Fence Agent that communicates with the node's BMC (Baseboard Management Controller), using standards like IPMI or Redfish to send the reboot command.
• When the Node Health Check (NHC) creates a FAR custom resource, the FAR Controller immediately detects it and begins remediation.
• The FAR Controller fences the affected node, forcing a hard reboot.
◦ At the same time, the API server deletes the Virtual Machine Instance (VMI) running on the unhealthy node.
• Because the node is tainted, the new VMI is scheduled onto another node, and the virtual machine starts there.
• If the original node later recovers, workloads do not automatically fail back; they remain running on the new node until moved, manually or via descheduler.
[Diagram: NHC, API server, and node-01, node-02, node-03; node-03 hosts the VMs]
① NHC: node-03 is Unhealthy (create a FAR custom resource)
② FAR custom resource discovery
③ Fence
④ Hard reboot / delete VMI
*1 Power cycle restart

19 Fence Agents Remediation Flows
Medik8s: How does Fence Agents Remediation work?
https://www.medik8s.io/remediation/fence-agents-remediation/how-it-works/#how-does-fence-agents-remediation-work
[Sequence diagram between Node A, FAR Controller, API server, and NHC: Node A is unhealthy; create FAR CR (event driven); fence; reboot; taint node; delete Node A's workloads; reschedule Node A's workloads; Node A is restored]
• When a node is marked unhealthy by the Node Health Check (NHC), a Fence Agents Remediation (FAR) Custom Resource (CR) is created. This triggers the remediation process, regardless of the node's current status.
• The FAR Controller works in an event-driven way: as soon as a FAR CR appears, it immediately initiates the fencing action. It does not perform additional health checks on the node before acting.
• At the same time, the FAR Controller applies an out-of-service taint to the unhealthy node. This forces workloads to be evicted and rescheduled onto healthy nodes, minimizing disruption.
• Once the fencing and recovery steps are completed, the FAR CR is automatically deleted, signaling that remediation has finished.

20 Example Fence Agents Remediation Template
    apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
    kind: FenceAgentsRemediationTemplate
    metadata:
      name: far-template-fence-ipmilan
      namespace: openshift-workload-availability
    spec:
      template:
        spec:
          agent: fence_ipmilan
          nodeparameters:
            '--ip':
              worker1: 192.168.123.1
              worker2: 192.168.123.2
              worker3: 192.168.123.3
          sharedparameters:
            '--action': reboot
            '--password': password
            '--username': admin
          retryCount: '5'
          retryInterval: '5s'
          timeout: '60s'

21 Example Node Health Check with Fence Agents Remediation
    apiVersion: remediation.medik8s.io/v1alpha1
    kind: NodeHealthCheck
    metadata:
      name: nhc-worker-self
    spec:
      minHealthy: 90%
      remediationTemplate:
        apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
        kind: FenceAgentsRemediationTemplate
        name: far-template-fence-ipmilan
        namespace: openshift-workload-availability
      selector:
        matchExpressions:
          - key: node-role.kubernetes.io/worker
            operator: Exists
      unhealthyConditions:
        - duration: 60s
          status: 'False'
          type: Ready
        - duration: 60s
          status: 'Unknown'
          type: Ready

22 Summary

23 VM recovery times are shorter with FAR
Engineering benchmarks with pre-release code
[Chart: VM recovery time in seconds]
- OpenShift 4.14 on 6 node bare metal, average of best cases
- The times will vary depending on the hardware
- OpenShift Virtualization 4.16 is still under development.

24 VM recovery time: Custom Aggressive (FAR) Configuration Example
    Step                      Component     NHC Defaults    Aggressive FAR
    t1: Health Check          Kubernetes    50s             50s
    t2: Medik8s HC            Medik8s       300s            1s
    t3: Remediate Host        Medik8s       180s            16s
    t4: Workload restarted    KubeVirt      15s             15s
    VM recovery time                        545s            82s
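The totals on this slide are the sums of the four phases, which a short sketch can reproduce. The dictionary keys and the helper name are hypothetical labels; the durations are taken directly from the slide.

```python
# Phase durations in seconds, as listed on the slide.
NHC_DEFAULTS = {
    "t1 health check": 50,
    "t2 Medik8s HC": 300,
    "t3 remediate host": 180,
    "t4 workload restarted": 15,
}
AGGRESSIVE_FAR = {
    "t1 health check": 50,
    "t2 Medik8s HC": 1,
    "t3 remediate host": 16,
    "t4 workload restarted": 15,
}


def recovery_time(phases: dict) -> int:
    """Total VM recovery time is the sum of the phase durations."""
    return sum(phases.values())


print(recovery_time(NHC_DEFAULTS))    # 545
print(recovery_time(AGGRESSIVE_FAR))  # 82
print(recovery_time(NHC_DEFAULTS) - recovery_time(AGGRESSIVE_FAR))  # 463
```

In both columns the 50 second detection phase is unchanged, so the savings come entirely from the Medik8s health check and host remediation phases.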

25 Self Node Remediation vs Fence Agents Remediation
Fence Agents Remediation (FAR)
+ FAR has direct feedback from the API
+ Robustness
+ Speed: Reboot is acked by the API
- Access to the BMC via network is required
- Increases configuration complexity
- The BMC being unresponsive can delay recovery
- The FAR agent pod being affected by node failure can delay recovery
Self Node Remediation (SNR)
+ Does not require any management interface; fallback if FAR is not remediating
- No quick feedback about the reboot request
- No ack; the node is assumed to be rebooting after a buffer time

26 Thank you
linkedin.com/company/red-hat
youtube.com/user/RedHatVideos
facebook.com/redhatinc
twitter.com/RedHat

Red Hat Livestreaming
