PremDay #2 - Prem'Fail - Hadoop and firmware update

Hadoop and firmware update • Vincent Minet • Criteo's Hardware team

2 Hadoop is at the core of Criteo
Hadoop
• Tracking data is saved in Hadoop
• Model training is done in Hadoop
• Losing Hadoop would make for a bad day for everyone
SSD
• Advisory from our SSD vendor: “In rare conditions, …., premature wear-out, …., update to the latest firmware”
• Hadoop’s namenodes have 6 SSDs
Hardware monitoring and repairs
• After careful review, 1 out of 3 SSDs on the namenodes… had already failed
• That firmware update thing seems important

3 Fixing the namenodes
Hadoop orchestration
• At scale, all operations need to be orchestrated and take time
• Switching namenodes (a day), rebooting the cluster (weeks)
Firmware update infrastructure
• At the time, Criteo had no automation for pushing firmware
• Firmware updates were infrequent and done by hand
• Often, servers were rebooted later

4 Update !
Preprod
• Decision was taken to update all firmware at once
• A test was done in preprod
• All fine
Prod
• Let’s go
• Misunderstanding: the update is run on the whole cluster
• This is fine, what could possibly go wrong?

5 This is fine
A few days later
• Hadoop team pinged the Hardware team
• Rebooted nodes are not coming back
• Every node, no exception
Debugging
• Technically, nodes are rebooting fine
• But with no network
• Kernel console showed the following:
i40e 0000:5d:00.1: eeprom check failed (-5), Tx/Rx traffic disabled
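A fleet-wide check for this symptom could be sketched as a scan of kernel logs for the console line quoted above. This is only an illustration, not the tooling Criteo used: the function name `find_bricked_nics`, the regular expression, and the second PCI address in the sample are assumptions made for the example.

```python
import re

# Pattern modeled on the console line quoted on the slide. Other driver
# versions may word the message differently (assumption).
EEPROM_FAIL = re.compile(
    r"i40e (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"eeprom check failed \((?P<code>-?\d+)\)"
)


def find_bricked_nics(log_text):
    """Return the sorted, de-duplicated PCI addresses whose EEPROM check failed."""
    return sorted({m.group("pci") for m in EEPROM_FAIL.finditer(log_text)})


# The 0000:5d:00.1 line is the one from the slide; 0000:5d:00.0 is a
# hypothetical second port added for illustration.
sample = (
    "i40e 0000:5d:00.1: eeprom check failed (-5), Tx/Rx traffic disabled\n"
    "i40e 0000:5d:00.0: eeprom check failed (-5), Tx/Rx traffic disabled\n"
    "i40e 0000:5d:00.1: eeprom check failed (-5), Tx/Rx traffic disabled\n"
    "an unrelated kernel message\n"
)

print(find_bricked_nics(sample))
```

Feeding it the output of `dmesg` collected from each host would give a per-host list of affected ports.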

6 It’s always the network
Digging deeper
• All affected nodes are running Linux 4.13
• Test node was running a more recent version
• Quick diff between the two showed the following commit:
i40e: avoid NVM acquire deadlock during NVM update
This resulted in us accidentally causing NVM acquire timeouts on all devices, causing failed firmware updates which left the eeprom in a corrupt state.
Situation assessment
• All NICs in the Hadoop cluster are soft bricked
• The cluster is still running fine
• Netboot still works
• An update from a fixed kernel repairs the NVM
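The gap between the test node and production can be caught before a rollout by comparing kernel series. The sketch below is a minimal illustration under stated assumptions: the host names and the exact release strings are hypothetical (the slides only say production ran Linux 4.13 and the test node ran "a more recent version"), and `kernel_series` and `untested_hosts` are names invented for this example.

```python
import re


def kernel_series(release):
    """Extract (major, minor) from a release string such as '4.13.0-45-generic'."""
    m = re.match(r"(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(m.group(1)), int(m.group(2))


def untested_hosts(tested_release, fleet):
    """Return hosts whose kernel series differs from the one the change was tested on."""
    tested = kernel_series(tested_release)
    return sorted(h for h, rel in fleet.items() if kernel_series(rel) != tested)


# Hypothetical inventory: host name -> output of `uname -r`.
fleet = {
    "hadoop-001": "4.13.16",
    "hadoop-002": "4.13.16",
    "preprod-001": "4.19.0",
}

# The change was validated on the preprod kernel only.
print(untested_hosts("4.19.0", fleet))
```

Any host returned by `untested_hosts` runs a kernel the change was never exercised on, which is the situation the slide describes.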

7 Failing with style !
Live patching our kernel
• Wouldn’t it be better to fix our running kernel?
• The kernel has live patching capabilities
• But we haven’t compiled them in our kernel
• Who needs that? We have the Ftrace infrastructure
Custom kernel module
• New module with the fixed EEPROM functions
• Resolve external refs to i40e functions
• Patch i40e_get_eeprom and i40e_set_eeprom
Learning from mistakes
• Progressive rollout of the custom module
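The slides do not say how the progressive rollout was implemented. One common shape, sketched here purely as an assumption, is to apply the change in growing batches and stop as soon as a health check fails. `progressive_rollout`, `apply`, `healthy`, and the node names are hypothetical placeholders for whatever loads the module and verifies the NIC.

```python
def progressive_rollout(hosts, apply, healthy, first_batch=1, growth=2):
    """Apply a change in growing batches, stopping at the first unhealthy batch.

    Returns (updated, failed): every host the change was applied to, and the
    hosts of the last batch that failed the health check (empty on success).
    """
    if first_batch < 1 or growth < 1:
        raise ValueError("first_batch and growth must be at least 1")
    updated = []
    start, size = 0, first_batch
    while start < len(hosts):
        batch = hosts[start:start + size]
        for host in batch:
            apply(host)
        updated.extend(batch)
        failed = [h for h in batch if not healthy(h)]
        if failed:
            # Stop here: the remaining hosts are left untouched.
            return updated, failed
        start += size
        size *= growth
    return updated, []


# Illustrative run: ten hypothetical nodes, one of which fails its check.
nodes = [f"node{i}" for i in range(10)]
touched = []
updated, failed = progressive_rollout(
    nodes,
    apply=touched.append,             # stand-in for loading the module
    healthy=lambda h: h != "node3",   # stand-in for a NIC health check
)
print(updated, failed)
```

With batches of 1, 2, then 4 hosts, the failure in the third batch stops the rollout before the last three nodes are touched, which limits the blast radius compared with updating the whole cluster at once.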

8 Thanks !
