who develop QCS and its automation HXP (hardware expertise): Hardware experts who qualify new hardware and maintain fleet knowledge HSU (hardware support): Hardware specialists who maintain our fleet and handle the RUN Context HXP Validation & Design HDV QCS & Automation Integration & Exploitation HSU Maintenance & Run Within Operations is Hardware Engineering
between assembly and shipment to the datacenter. Everything before creates the blueprint. QCS industrializes it. Context Design (HXP) V Integration (HDV) Manufacturing (Artisans) V [[ QCS (HDV) ]] Discovery/Validation/Inventory Ship to DC (Deploy/LOG) Network Config (NET) Platform Bootstrap (ENG) Usage Maintenance (HSU)
→ each zone independently configurable • port-based VLAN assignment → correct image and postboot served per physical zone • PXE boot → ephemeral Ubuntu live environment built fresh daily • `rack-live` / `bench-live` frontends → all monitoring/interactions in one place Artisan workflow becomes: plug it in, power it on, wait for completion The Atelier: Our Controlled Environment Itʼs plugʼnʼplay
recycling old servers - stale Ubuntu/Windows EFI entries existed in BIOS • Impact: broke BIOS configuration task • Fix: implemented boot entry wipe and/or reset tasks Feedback Loops • Assumed: one NIC per machine; lexicographic MAC order = physical port order • Reality: a batch of 20 sleds with two NICs where MAC order was reversed across cards • Impact: cards swapped in DCIM - wrong MACs mapped to wrong physical slots • Fix: PCI address → physical slot mapping Case 2: The EFI Boot Entries Case 1: The MAC Address Inversion • Fans not checked until they arrived dead • Checking our network devicesʼ airflow direction • NIC link tests not exhaustive until a bad cable batch slipped through • Downgraded PCI link status on GPUs Honorable mentions When QCS Was Insufficient (The Honest Slide) Unforeseen issues: Response SLA: local fix within 1 working day, full patch deployed within 2 days.
the improvement Before QCS With QCS Time-to-rack up to several weeks ≤ 2 working days Peak throughput 50 / week 600 / week Hardware errors reaching production multiple near 0
the value Requirements Hardware Validation (HXP) Design (HXP) V Integration (HDV) Manufacturing (Artisans) V [[ QCS (HDV) ]] Discovery/Validation/Inventory Ship to DC (Deploy/LOG) V QCS Data to DCIM (Deploy) Network Config (NET) (optional QCS re-validation) V Platform Bootstrap (ENG) Usage V QCS in tasks (ENG) Maintenance (HSU) V QCS in diag (HSU) The stack is now active across the full hardware lifecycle - not just initial deployment
issues in a controlled environment before production is orders of magnitude cheaper than finding them after deployment. Abstraction is a maintenance contract. Being vendor-agnostic doesn't mean ignoring vendor complexity - it means owning it deliberately, with architecture that isolates change. The feedback loop is the product. Every production escape that becomes a check makes the system more complete. Build the loop, measure the loop. Unified tooling drives velocity. A single team owning all hardware-related tooling improves Time-To-Market at every stage of a CSPʼs activity. Moving forward, openFW and openBMC initiatives are keys in natively reducing these toolingsʼ complexity. What a ride Close