PremDay #2 - Prem'Fail - Hadoop and firmware update
As part of the Prem'Fail track, Vincent Minet from Criteo presents a real horror story in which every network adapter in the Hadoop cluster was soft-bricked.
data is saved in Hadoop
• Model training is done in Hadoop
• Losing Hadoop would make for a bad day for everyone

SSD
• Advisory from our SSD vendor: “In rare conditions, …., premature wear-out, …., update to the latest firmware”
• Hadoop’s namenodes have 6 SSDs

Hardware monitoring and repairs
• After careful review, 1 out of 3 SSDs on the namenodes… had already failed
• That firmware update thing seems important

Hadoop
• operations need to be orchestrated and take time
• Switching namenodes (a day), rebooting the cluster (weeks)

Firmware update infrastructure
• At the time, Criteo had no automation for pushing firmware
• Firmware updates were infrequent and done by hand
• Often, servers were rebooted later
all firmware at once
• A test was done in preprod
• All fine

Prod
• Let’s go
• Misunderstanding: the update is run on the whole cluster
• This is fine, what could possibly go wrong?
team pinged the Hardware team
• Rebooted nodes are not coming back
• Every node, no exception

Debugging
• Technically, nodes are rebooting fine
• But with no network
• Kernel console showed the following:

i40e 0000:5d:00.1: eeprom check failed (-5), Tx/Rx traffic disabled
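The console message quoted above is distinctive enough to search for across collected kernel logs. The following is only a minimal sketch, assuming the log text of each node is already available as a string; the function name `find_bricked_ports` and the regular expression are illustrative and were not part of the talk.

```python
import re

# Pattern for the i40e console message quoted above. The PCI address and the
# error code are captured so affected ports can be listed per node.
EEPROM_FAILURE = re.compile(
    r"i40e (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"eeprom check failed \((?P<code>-?\d+)\), Tx/Rx traffic disabled"
)


def find_bricked_ports(log_text):
    """Return a list of (pci_address, error_code) for each matching log line."""
    return [
        (match.group("pci"), int(match.group("code")))
        for match in EEPROM_FAILURE.finditer(log_text)
    ]
```

Run against the log of every rebooted node, a helper like this would give a quick count of affected ports without needing network access to the node itself.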
nodes are running Linux 4.13
• Test node was running a more recent version
• Quick diff between the two showed the following commit:

i40e: avoid NVM acquire deadlock during NVM update
“This resulted in us accidentally causing NVM acquire timeouts on all devices, causing failed firmware updates which left the eeprom in a corrupt state.”

Situation assessment
• All NICs in the Hadoop cluster are soft-bricked
• The cluster is still running fine
• Netboot still works
• An update from a fixed kernel repairs the NVM
Wouldn’t it be better to fix our running kernel?
• The kernel has live patching capabilities
• But we haven’t compiled them into our kernel
• Who needs that? We have the Ftrace infrastructure

Custom kernel module
• New module with the fixed EEPROM functions
• Resolve external refs to i40e functions
• Patch i40e_get_eeprom and i40e_set_eeprom

Learning from mistakes
• Progressive rollout of the custom module
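A progressive rollout can be sketched as growing waves that stop at the first node failing its health check. This is an illustration only, not Criteo's actual tooling: `apply_fix` and `is_healthy` are hypothetical callbacks supplied by the caller, and the wave sizes are arbitrary assumptions.

```python
def progressive_rollout(nodes, apply_fix, is_healthy, first_wave=1, growth=2):
    """Apply a change in growing waves and stop at the first unhealthy node.

    Returns (done, failed): the nodes updated successfully, and the node that
    failed its health check (None if the rollout completed).
    """
    if first_wave < 1 or growth < 1:
        raise ValueError("first_wave and growth must be >= 1")

    done = []
    remaining = list(nodes)
    wave_size = first_wave
    while remaining:
        # Take the next wave off the front of the queue.
        wave, remaining = remaining[:wave_size], remaining[wave_size:]
        for node in wave:
            apply_fix(node)
            if not is_healthy(node):
                # Halt immediately: untouched nodes stay untouched.
                return done, node
            done.append(node)
        # Only widen the blast radius after a fully healthy wave.
        wave_size *= growth
    return done, None
```

The point of the design is that a bad change reaches at most the nodes processed so far. For a change like the one in this story, the health check would presumably need to include a reboot, since the failure only showed up once nodes restarted.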