Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Premday #3 - Rack and Roll: How Our Tool (QCS) ...

Premday #3 - Rack and Roll: How Our Tool (QCS) Accelerates Rack & Server Time-to-Market

Scaleway presents how their tool, named QCS, accelerates rack & server time-to-market

Avatar for Premday

Premday

June 08, 2026

More Decks by Premday

Other Decks in Technology

Transcript

  1. Raymond GRAHAM Hardware Development Tech Lead @ Scaleway “Rack &

    Roll” How Our Tooling Accelerates Rack & Server Time-to-Market (QCS) May 5th 2026
  2. 2 The World Before (The Ugly Truth) Context Where we

    were Requirements Hardware Validation (HXP) Design (HXP) Manufacturing (Artisans) Ship to DC (Deploy/LOG) Network Config (NET) Inventory/Testing (ENG) V Debug/Smart Hands (ENG) V Initial Discovery (ENG) V Platform Bootstrap (ENG) Usage V Multiple task implem (ENG) Maintenance (HSU) V Manual diag (HSU)
  3. 3 Who We Are HDV (hardware development): Multi-role software engineers

    who develop QCS and its automation HXP (hardware expertise): Hardware experts who qualify new hardware and maintain fleet knowledge HSU (hardware support): Hardware specialists who maintain our fleet and handle the RUN Context HXP Validation & Design HDV QCS & Automation Integration & Exploitation HSU Maintenance & Run Within Operations is Hardware Engineering
  4. 4 Requirements Hardware Validation (HXP) Where QCS Lives QCS sits

    between assembly and shipment to the datacenter. Everything before creates the blueprint. QCS industrializes it. Context Design (HXP) V Integration (HDV) Manufacturing (Artisans) V [[ QCS (HDV) ]] Discovery/Validation/Inventory Ship to DC (Deploy/LOG) Network Config (NET) Platform Bootstrap (ENG) Usage Maintenance (HSU)
  5. 5 The Matrix: What We're Actually Dealing With Scale &

    Complexity • 3 OS+kernel combinations • 30+ binaries, roughly half vendor-specific ◦ 90+ when you count versions • 242 firmware files qualified • 205 configuration files • 247 materials profiles • 100s of validation profiles: ◦ 186 at chassis-level ◦ 196 at rack-level Automation coverage: Hundreds of component part numbers resulting in 198 component-specific worker classes being integrated into the stack. Component Manufacturers Models Chassis 12 59 Network devices 2 11 CPUs 4 72 Disks 16 112 Memory 11 144 NICs 3 45 GPUs 1 8 Partial count of components under our management:
  6. 6 Architecture • 7 chassis-level and 4 rack-level test zones

    → each zone independently configurable • port-based VLAN assignment → correct image and postboot served per physical zone • PXE boot → ephemeral Ubuntu live environment built fresh daily • `rack-live` / `bench-live` frontends → all monitoring/interactions in one place Artisan workflow becomes: plug it in, power it on, wait for completion The Atelier: Our Controlled Environment Itʼs plugʼnʼplay
  7. 7 Boot to Validated: The QCS Execution Flow 1. PXE

    Boot 2. Fetch selected profile from bench/rack-backend 3. Auto-discovery & inventory 4. Hardware validation 5. Task list loaded from profile 6. Execute tasks (task-level safeties exist) 7. Write inventory to database Inside QCS (automated): Human safeguards (post-process): Architecture 8. Artisans trigger certification on batch completion → validates actions, updates Jira ticket 9. Project manager sign-off
  8. 8 The Agnostic Architecture: How We Tame 30+ Binaries A

    very limited schema Architecture ⚠ Simplified representation - actual architecture contains additional layers and cross-cutting concerns. Key design decisions: • Design for maintainability not cleverness • Defensive Python stack • Limited OS+kernel combinations • Versioned binary wrappers New binary?  Backport or requalify New hardware variant?  New worker class. (human-readable intent) (abstracted action) (map action/target to a worker) (component/model specific) (subprocess/data handling) Profile ↓ Task ↓ Dispatchers ↓ Workers ↓ Wrappers
  9. 9 The HXP Pipeline: Knowledge into Automation Architecture → Engineering

    Samples (early) → HXP 2 weeks qualification → HDV 2 weeks integration & testing → Manufacturing & QCS Artisans autonomous on arrival Fast-track for urgent projects: 23 days (parallelized) Standard NPI flow:
  10. 10 • Assumed: always handling new machines • Reality: began

    recycling old servers - stale Ubuntu/Windows EFI entries existed in BIOS • Impact: broke BIOS configuration task • Fix: implemented boot entry wipe and/or reset tasks Feedback Loops • Assumed: one NIC per machine; lexicographic MAC order = physical port order • Reality: a batch of 20 sleds with two NICs where MAC order was reversed across cards • Impact: cards swapped in DCIM - wrong MACs mapped to wrong physical slots • Fix: PCI address → physical slot mapping Case 2: The EFI Boot Entries Case 1: The MAC Address Inversion • Fans not checked until they arrived dead • Checking our network devicesʼ airflow direction • NIC link tests not exhaustive until a bad cable batch slipped through • Downgraded PCI link status on GPUs Honorable mentions When QCS Was Insufficient (The Honest Slide) Unforeseen issues: Response SLA: local fix within 1 working day, full patch deployed within 2 days.
  11. 11 The Numbers That Matter Impact A vague idea of

    the improvement Before QCS With QCS Time-to-rack up to several weeks ≤ 2 working days Peak throughput 50 / week 600 / week Hardware errors reaching production multiple near 0
  12. 12 What QCS Enables Downstream Impact Internal use that proves

    the value Requirements Hardware Validation (HXP) Design (HXP) V Integration (HDV) Manufacturing (Artisans) V [[ QCS (HDV) ]] Discovery/Validation/Inventory Ship to DC (Deploy/LOG) V QCS Data to DCIM (Deploy) Network Config (NET) (optional QCS re-validation) V Platform Bootstrap (ENG) Usage V QCS in tasks (ENG) Maintenance (HSU) V QCS in diag (HSU) The stack is now active across the full hardware lifecycle - not just initial deployment
  13. 13 Key Takeaways Validate early, at the source. Catching hardware

    issues in a controlled environment before production is orders of magnitude cheaper than finding them after deployment. Abstraction is a maintenance contract. Being vendor-agnostic doesn't mean ignoring vendor complexity - it means owning it deliberately, with architecture that isolates change. The feedback loop is the product. Every production escape that becomes a check makes the system more complete. Build the loop, measure the loop. Unified tooling drives velocity. A single team owning all hardware-related tooling improves Time-To-Market at every stage of a CSPʼs activity. Moving forward, openFW and openBMC initiatives are keys in natively reducing these toolingsʼ complexity. What a ride Close
  14. Raymond GRAHAM Hardware Development Tech Lead @ Scaleway “Rack &

    Roll” How Our Tooling Accelerates Rack & Server Time-to-Market (QCS) Close
  15. Q&A