Premday #3 - Toolbox for product lifecycle investigation
Criteo & Fastly present their toolbox for product lifecycle and performance investigation including hwbench, bbench, Phoronix Test Suite, Spirent, Yokogawa
Requirements ◦ Numbers-driven server flavor choice or creation • New Design Initiation ◦ Business Case & Requirements ◦ Data-Driven Selection ◦ Navigating GPU-Era Lead Times • Existing design ◦ Business Case & Requirements ◦ Data-Driven Selection ◦ Performance Delta (Better, Worse, Or Equal?) How the hardware lifecycle begins…
Custom load generators and appliances (Spirent) • Power Validation: Physical meter integration (Yokogawa) • The Human Element: Manual testing (ad- hoc/exploratory) Hardware Selection and Tooling Standardized synthetic benchmarking: ◦ hwbench ▪ Performance: CPU, (GPU soon) ▪ Power measurement under different stress scenarios ▪ Thermal response ◦ bbbench ▪ Performance per block device ▪ Performance of all block devices Spreadsheet engineering ◦ (C|G)PU sheets ◦ HDD and SSD sheets ◦ TCO over lifetime Internal APIs ◦ Flavor API and replacement ratios
the Phoronix Test Suite (PTS) for portable, reproducible performance evaluation. • Traffic Generation: Stress-testing architectures using home-grown load generators and specialized test appliances. • Power Validation: Integration of physical power meters (currently in development) to measure true hardware draw. • The Human Element: Critical manual sanity testing to catch intermittent, edge-case anomalies that automated suites miss.
at the OS/Application layer. • The Reality: Intermittent latency, throughput "walls", or unusual behavior are sometimes rooted in silicon (thermal constraints, firmware bugs, power delivery). • The Requirement: A specialized diagnostic arsenal that translates raw, bare-metal telemetry into actionable insights. Application/OS BMC/Firmware/Silicon The Observability Blind Spot
Grafana for fleet-wide performance forensics. • Real-Time Telemetry: Drilling into out-of-band management (OEM iDRAC/iLO and ODM OpenBMC). • Diagnostic Orchestration: Custom manual scripts for automated, low-level data collection during active investigations.
Anomalies • The Issue: Unexplained power spikes impacting system stability. • The Investigation: Utilizing real-time iLO telemetry to isolate the NVMe power draw. • The Validation: Confirming the anomaly via CLI and RESTful APIs (ilorest). 980 W! *Source: IPMI v2.0 Specification*
(Standard SKUs) ◦ Logical and Physical site decom ◦ Automated data wiping (Compliance Standards) ◦ Predictable timelines for scheduling • The "Snowflake" Impact (Non-Standard SKUs) ◦ Logical and Physical site decom ◦ Custom hardware impacts standard tooling ◦ Requires manual intervention and process updates ◦ Alters predictable decom timelines
assets with lifecycle status and many tools for operations: flavor catalog & comparator, racks, network circuits, statistics, stock management, bom generator for projects, rampup/decom/resell, APIs Rackconf computes the cheapest possible rack compositions that respect all physical constraints of racks and DCs. BareMetal Requester (BMR) is a webUI allowing infra owners to declare budgets milestones. Users can then declare forecasted needs for said milestone. The budgeted milestone is then reviewed and adjusted and transformed into volumes.