PremDay
April 07, 2025

PremDay #2 - Prem'Fail - Hadoop and firmware update

As part of the Prem'Fail track, Vincent Minet from Criteo presents a real horror story in which all network adapters in the Hadoop cluster were soft-bricked.

Transcript

  1. Hadoop is at the core of Criteo

    • Tracking data is saved in Hadoop
    • Model training is done in Hadoop
    • Losing Hadoop would make for a bad day for everyone

    SSD
    • Advisory from our SSD vendor: “In rare conditions, …., premature wear-out, …., update to the latest firmware”
    • Hadoop’s namenodes have 6 SSDs

    Hardware monitoring and repairs
    • After careful review, 1 out of 3 SSDs on the namenodes… had already failed
    • That firmware update thing seems important

    Hadoop
  2. Fixing the namenodes

    Hadoop orchestration
    • At scale, all operations need to be orchestrated and take time
    • Switching namenodes (a day), rebooting the cluster (weeks)

    Firmware update infrastructure
    • At the time, Criteo had no automation for pushing firmware
    • Firmware updates were infrequent and done by hand
    • Often, servers were rebooted later
  3. Update!

    Preprod
    • The decision was taken to update all firmware at once
    • A test was done in preprod
    • All fine

    Prod
    • Let’s go
    • Misunderstanding: the update is run on the whole cluster
    • This is fine, what could possibly go wrong?
  4. This is fine

    A few days later
    • The Hadoop team pinged the Hardware team
    • Rebooted nodes are not coming back
    • Every node, no exception

    Debugging
    • Technically, nodes are rebooting fine
    • But with no network
    • The kernel console showed the following:

    i40e 0000:5d:00.1: eeprom check failed (-5), Tx/Rx traffic disabled
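
To find every node in this state, the console message above can be searched for in collected kernel logs. The following is a minimal sketch, not the tooling used at Criteo: the function name `find_bricked_nics` and the regular expression are assumptions modeled on the single log line quoted in the talk, and the exact wording may differ between driver versions.

```python
import re

# Pattern modeled on the one console line quoted in the talk (assumption:
# other i40e driver versions may word this message differently).
EEPROM_FAILURE = re.compile(
    r"i40e (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"eeprom check failed \((?P<code>-?\d+)\)"
)


def find_bricked_nics(log_lines):
    """Return (pci_address, error_code) for each matching kernel log line."""
    hits = []
    for line in log_lines:
        match = EEPROM_FAILURE.search(line)
        if match:
            hits.append((match.group("pci"), int(match.group("code"))))
    return hits


sample = [
    "i40e 0000:5d:00.1: eeprom check failed (-5), Tx/Rx traffic disabled",
    "i40e 0000:5d:00.0: some unrelated message",
]
print(find_bricked_nics(sample))
```

In practice the input would be the lines of a captured kernel log or console dump from each node; since the affected nodes have no network, the log has to be collected out of band.
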
  5. It’s always the network

    Digging deeper
    • All affected nodes are running Linux 4.13
    • The test node was running a more recent version
    • A quick diff between the two showed the following commit:

    i40e: avoid NVM acquire deadlock during NVM update
    This resulted in us accidentally causing NVM acquire timeouts on all devices, causing failed firmware updates which left the eeprom in a corrupt state.

    Situation assessment
    • All NICs in the Hadoop cluster are soft-bricked
    • The cluster is still running fine
    • Netboot still works
    • An update from a fixed kernel repairs the NVM
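
Since the failure depended on the running kernel, a firmware push could be gated on the kernel release reported by each node. The sketch below is illustrative only: the names `parse_kernel_release`, `AFFECTED_SERIES` and `is_risky_for_nvm_update` are hypothetical, and treating exactly the 4.13 series as affected is an assumption taken from the slide, not a statement about which upstream releases carry the bug or the fix.

```python
import re


def parse_kernel_release(release):
    """Extract (major, minor) from a `uname -r` style string."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


# Assumption: only the series named in the talk is listed here. The real
# set of affected kernels would have to be checked against the driver history.
AFFECTED_SERIES = {(4, 13)}


def is_risky_for_nvm_update(release, affected=AFFECTED_SERIES):
    """True if the kernel series is in the set considered unsafe."""
    return parse_kernel_release(release) in affected


print(is_risky_for_nvm_update("4.13.0-45-generic"))
print(is_risky_for_nvm_update("5.4.0"))
```

On a live node the release string could come from `platform.release()` in the Python standard library.
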
  6. Failing with style!

    Live patching our kernel
    • Wouldn’t it be better to fix our running kernel?
    • The kernel has live patching capabilities
    • But we haven’t compiled them into our kernel
    • Who needs that? We have the Ftrace infrastructure

    Custom kernel module
    • New module with the fixed EEPROM functions
    • Resolve external refs to i40e functions
    • Patch i40e_get_eeprom and i40e_set_eeprom

    Learning from mistakes
    • Progressive rollout of the custom module
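
A progressive rollout of the kind mentioned in the last bullet can be sketched as waves of growing size, with a health check between waves. This is a minimal illustration under stated assumptions, not the process Criteo used: `plan_waves`, `rollout`, the doubling factor, and the `apply` and `healthy` callbacks are all hypothetical. Given that the original failure only showed up after a reboot, a realistic `healthy` check would probably need to include one.

```python
def plan_waves(hosts, first_wave=1, growth=2):
    """Split hosts into consecutive waves whose size grows geometrically."""
    if first_wave < 1 or growth < 1:
        raise ValueError("first_wave and growth must be >= 1")
    waves = []
    size = first_wave
    index = 0
    while index < len(hosts):
        waves.append(hosts[index:index + size])
        index += size
        size *= growth
    return waves


def rollout(hosts, apply, healthy, first_wave=1, growth=2):
    """Apply a change wave by wave; stop after the first unhealthy wave.

    Returns (hosts_touched, hosts_that_failed_the_check).
    """
    touched = []
    for wave in plan_waves(hosts, first_wave, growth):
        for host in wave:
            apply(host)
        touched.extend(wave)
        failed = [host for host in wave if not healthy(host)]
        if failed:
            return touched, failed
    return touched, []


nodes = [f"node{i}" for i in range(7)]
print([len(wave) for wave in plan_waves(nodes)])
```

With seven hosts and the defaults, the waves have sizes 1, 2 and 4, so a bad change is caught after touching at most a few machines rather than the whole cluster.
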