Premday #3 - Toolbox for product lifecycle investigation

Criteo & Fastly present their toolbox for product lifecycle and performance investigation, including hwbench, bbbench, Phoronix Test Suite, Spirent, and Yokogawa.

Premday

June 08, 2026

Transcript

  1. How the hardware lifecycle begins…
     • Capacity planning and new projects
       ◦ Business Case & Requirements
       ◦ Numbers-driven server flavor choice or creation
     • New Design Initiation
       ◦ Business Case & Requirements
       ◦ Data-Driven Selection
       ◦ Navigating GPU-Era Lead Times
     • Existing design
       ◦ Business Case & Requirements
       ◦ Data-Driven Selection
       ◦ Performance Delta (Better, Worse, Or Equal?)
  2. Hardware Selection and Tooling
     • Standardized Benchmarking: Phoronix Test Suite (PTS)
     • Traffic Generation: Custom load generators and appliances (Spirent)
     • Power Validation: Physical meter integration (Yokogawa)
     • The Human Element: Manual testing (ad-hoc/exploratory)
     • Standardized synthetic benchmarking:
       ◦ hwbench
         ▪ Performance: CPU, (GPU soon)
         ▪ Power measurement under different stress scenarios
         ▪ Thermal response
       ◦ bbbench
         ▪ Performance per block device
         ▪ Performance of all block devices
     • Spreadsheet engineering
       ◦ (C|G)PU sheets
       ◦ HDD and SSD sheets
       ◦ TCO over lifetime
     • Internal APIs
       ◦ Flavor API and replacement ratios
  3. Establishing the Baseline Before the Edge
     • Standardized Benchmarking: Utilizing the Phoronix Test Suite (PTS) for portable, reproducible performance evaluation.
     • Traffic Generation: Stress-testing architectures using home-grown load generators and specialized test appliances.
     • Power Validation: Integration of physical power meters (currently in development) to measure true hardware draw.
     • The Human Element: Critical manual sanity testing to catch intermittent, edge-case anomalies that automated suites miss.
  4. Benchmarking servers at
     • Thousands of testing scenarios for a small team require automation
     • Scientific protocol requires reproducibility and precision
     • Technical partnership means transparency, as in open source
  5. github.com/premday/bbbench
     bbbench is a block-device benchmark tool. It discovers drives, generates fio job files from templates, runs benchmarks synchronously across multiple drives, and serves results via a web UI.
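The "generate fio job files from templates" step can be pictured with a minimal sketch. This is not bbbench's actual code: the template text, the function names (`render_fio_job`, `render_jobs`) and the default parameters are assumptions chosen for illustration. Only the fio job-file options themselves (`ioengine`, `direct`, `time_based`, `runtime`, `bs`, `iodepth`, `rw`, `filename`) are standard fio settings.

```python
from string import Template

# Hypothetical template; bbbench's real templates may look different.
FIO_JOB_TEMPLATE = Template("""\
[global]
ioengine=libaio
direct=1
time_based
runtime=$runtime
bs=$block_size
iodepth=$iodepth
rw=$rw

[$job_name]
filename=$device
""")


def render_fio_job(device: str, rw: str = "randread", block_size: str = "4k",
                   iodepth: int = 32, runtime: int = 60) -> str:
    """Render the text of one fio job file for a single block device."""
    # Build a readable job name from the workload and the device path.
    job_name = f"{rw}-{device.strip('/').replace('/', '-')}"
    return FIO_JOB_TEMPLATE.substitute(
        device=device, rw=rw, block_size=block_size,
        iodepth=iodepth, runtime=runtime, job_name=job_name,
    )


def render_jobs(devices: list[str], **params) -> dict[str, str]:
    """Map each device path to its rendered job file (nothing is executed)."""
    return {dev: render_fio_job(dev, **params) for dev in devices}


if __name__ == "__main__":
    # Device names are placeholders. Write workloads against raw block
    # devices destroy their contents, so this sketch only prints the job text.
    jobs = render_jobs(["/dev/nvme0n1", "/dev/nvme1n1"], rw="randwrite")
    print(jobs["/dev/nvme0n1"])
```

Rendering one job file per drive from a shared template keeps every drive on identical parameters, which is what makes per-device results comparable.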
  6. The Observability Gap
     • The Limitation: Traditional observability frameworks stop at the OS/Application layer.
     • The Reality: Intermittent latency, throughput "walls", or unusual behavior are sometimes rooted in silicon (thermal constraints, firmware bugs, power delivery).
     • The Requirement: A specialized diagnostic arsenal that translates raw, bare-metal telemetry into actionable insights.
     [Diagram labels: Application/OS; BMC/Firmware/Silicon; The Observability Blind Spot]
  7. The Fastly Hardware Toolbox
     • Historical Macro-Data: Thanos Query and Grafana for fleet-wide performance forensics.
     • Real-Time Telemetry: Drilling into out-of-band management (OEM iDRAC/iLO and ODM OpenBMC).
     • Diagnostic Orchestration: Custom manual scripts for automated, low-level data collection during active investigations.
  8. Real-Time Forensics: NVMe Power Spiking
     Case Study: Diagnosing NVMe Power Anomalies
     • The Issue: Unexplained power spikes impacting system stability.
     • The Investigation: Utilizing real-time iLO telemetry to isolate the NVMe power draw.
     • The Validation: Confirming the anomaly via CLI and RESTful APIs (ilorest).
     [Figure annotation: 980 W! Source: IPMI v2.0 Specification]
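As a rough illustration of the validation step, the sketch below flags power readings that stand far above the typical draw. It is not the tooling shown in the talk: the function names, the 1.5x-median threshold and the sample readings are all assumptions. The payload helper assumes a Redfish-style Power resource, where the reading sits under `PowerControl[0].PowerConsumedWatts`; how the payload is fetched (ilorest, curl, a client library) is left out.

```python
from statistics import median


def watts_from_power_payload(payload: dict) -> float:
    """Extract the consumed power from a Redfish-style Power resource.

    Assumes the payload carries a PowerControl array whose first entry
    exposes PowerConsumedWatts.
    """
    return payload["PowerControl"][0]["PowerConsumedWatts"]


def find_power_spikes(samples, ratio=1.5):
    """Return the (timestamp, watts) samples above ratio x the median draw.

    The median is used as the baseline so that a few spikes do not
    inflate the reference level the way a mean would.
    """
    if not samples:
        return []
    baseline = median(watts for _, watts in samples)
    return [(ts, watts) for ts, watts in samples if watts > ratio * baseline]


if __name__ == "__main__":
    # Illustrative readings only, not measurements from the case study.
    readings = [(0, 400), (1, 410), (2, 980), (3, 405), (4, 395)]
    print(find_power_spikes(readings))
```

A relative threshold is used here because absolute power levels differ widely between server flavors; a fixed wattage limit would need tuning per flavor.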
  9. From Tool Queries to Physical Interventions
     • Bridging the "Query Gap": Simplified Grafana graphs that track long-term power consumption trends.
     • Physical Layer Forensics: When Out-of-Band (OOB) management fails, physical intervention is required.
     • Direct Hardware Interfacing: Utilizing vendor-specific physical harnesses (e.g., Mellanox MTUSB cables) for low-level debugging.
  10. The End of The Line - Decommissioning
      • Mainline Decommissioning (Standard SKUs)
        ◦ Logical and Physical site decom
        ◦ Automated data wiping (Compliance Standards)
        ◦ Predictable timelines for scheduling
      • The "Snowflake" Impact (Non-Standard SKUs)
        ◦ Logical and Physical site decom
        ◦ Custom hardware impacts standard tooling
        ◦ Requires manual intervention and process updates
        ◦ Alters predictable decom timelines
  11. CMDB and Forecasting tools
      Rackguru is a super-CMDB for bare-metal assets with lifecycle status and many tools for operations: flavor catalog & comparator, racks, network circuits, statistics, stock management, BOM generator for projects, ramp-up/decom/resell, APIs.
      Rackconf computes the cheapest possible rack compositions that respect all physical constraints of racks and DCs.
      BareMetal Requester (BMR) is a web UI allowing infra owners to declare budget milestones. Users can then declare forecasted needs for a given milestone. The budgeted milestone is then reviewed, adjusted, and transformed into volumes.
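The idea behind Rackconf, finding the cheapest rack composition that respects physical constraints, can be illustrated with a toy exhaustive search. This is a sketch under stated assumptions, not Rackconf's algorithm: the `Flavor` fields, the `cheapest_composition` function, the choice of rack units and power as the only constraints, and the example numbers are all hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Flavor:
    name: str
    cost: int        # cost per server, arbitrary currency units
    rack_units: int  # height in U
    watts: int       # budgeted power draw per server
    cores: int       # capacity delivered per server


def cheapest_composition(flavors, rack_units, rack_watts, min_cores):
    """Return (cost, {flavor name: count}) for the cheapest valid rack.

    Enumerates every count combination that could fit in the rack and
    keeps the cheapest one satisfying the space, power and capacity
    constraints. Returns None when no combination is feasible.
    """
    best = None
    # A flavor can appear at most rack_units // its height times.
    ranges = [range(rack_units // f.rack_units + 1) for f in flavors]
    for counts in product(*ranges):
        units = sum(c * f.rack_units for c, f in zip(counts, flavors))
        watts = sum(c * f.watts for c, f in zip(counts, flavors))
        cores = sum(c * f.cores for c, f in zip(counts, flavors))
        if units > rack_units or watts > rack_watts or cores < min_cores:
            continue
        cost = sum(c * f.cost for c, f in zip(counts, flavors))
        if best is None or cost < best[0]:
            best = (cost, {f.name: c for f, c in zip(flavors, counts)})
    return best


if __name__ == "__main__":
    # Made-up flavors and limits, for illustration only.
    catalog = [
        Flavor("small", cost=5, rack_units=1, watts=300, cores=32),
        Flavor("big", cost=9, rack_units=2, watts=500, cores=64),
    ]
    print(cheapest_composition(catalog, rack_units=4, rack_watts=1200,
                               min_cores=128))
```

Exhaustive enumeration is only practical for a handful of flavors and small racks; a real planner with many constraints would more likely rely on an integer programming or constraint solver.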