(SNR) runs an agent on every node, which can perform a soft reboot if the node becomes unhealthy.
• The SNR agent communicates with the API server and peer nodes to confirm whether its own node is truly unhealthy. If confirmed, the agent reboots the node. At the same time, the peer nodes delete the VMIs of workloads that were on the unhealthy node.
• This approach prioritizes safety and accuracy, reducing the risk of accidental or unnecessary remediation.

Self Node Remediation Operator (SNR)
• When the Node Health Check (NHC) creates an SNR custom resource, the API server notifies peer nodes.
  ◦ Each node regularly checks its health with the API server; if that fails, it double-checks with a peer node.
• If the node confirms it is unhealthy, it restarts its own operating system (using a software watchdog).
• Peer nodes then notify the API server to delete the VMIs from the unhealthy node.
  ◦ Because the unhealthy node is marked unschedulable, the VMs are scheduled to another node.
• If a node cannot confirm its health with either the API server or a peer node, it will also restart.
• Even after recovery, the node does not automatically fail back; workloads stay on the new node until moved, manually or via descheduler.

[Diagram: node-01, node-02, node-03 (each running an agent), VMs, API Server, Node HealthCheck Operator (NHC)]
① node-03 is Unhealthy (Create SNR Custom Resource)
② node-03 is Unhealthy
③ Am I healthy?
④ No
⑤ Soft reboot
⑥ VMI deletion
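
The health-check flow described above can be sketched as a small decision function. This is a simplified illustration in Python, not the SNR agent's actual code: the names `Decision`, `Probe`, and `decide` are hypothetical, each probe stands in for a real call to the API server or to a peer agent, and the sketch assumes the first peer that answers settles the question.

```python
from enum import Enum
from typing import Callable, Optional, Sequence


class Decision(Enum):
    HEALTHY = "healthy"
    REBOOT = "reboot"


# Hypothetical probe type: returns True (node is healthy), False (node is
# reported unhealthy), or None (the party could not be reached).
Probe = Callable[[], Optional[bool]]


def decide(ask_api_server: Probe, ask_peers: Sequence[Probe]) -> Decision:
    """Decide whether this node should soft-reboot itself."""
    # Step 1: regular health check against the API server.
    answer = ask_api_server()
    if answer is True:
        return Decision.HEALTHY
    if answer is False:
        # The API server confirms the node is unhealthy
        # (for example, an SNR custom resource exists for it).
        return Decision.REBOOT

    # Step 2: the API server did not answer, so double-check with peers.
    # Assumption: the first peer that gives an answer is trusted.
    for ask_peer in ask_peers:
        peer_answer = ask_peer()
        if peer_answer is True:
            return Decision.HEALTHY
        if peer_answer is False:
            return Decision.REBOOT

    # Step 3: neither the API server nor any peer could confirm health,
    # so the node restarts to be safe.
    return Decision.REBOOT
```

For example, `decide(lambda: None, [lambda: None, lambda: False])` models a node that cannot reach the API server, gets no reply from its first peer, and is told by its second peer that it is unhealthy, so it reboots. Rebooting when nobody answers is what lets the peer nodes assume the workloads are really stopped before the VMIs are deleted.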