Premday - Challenges in using server automation for maximal efficiency

Challenges in using server automation for maximal efficiency Prem’Day 2024
Anisse Astier

2 Criteo (NASDAQ: CRTO) is the global commerce media company that enables marketers and media owners to drive better commerce outcomes. Its industry-leading Commerce Media Platform connects thousands of marketers and media owners to deliver richer consumer experiences from product discovery to purchase. By powering trusted and impactful advertising, Criteo supports an open internet that encourages discovery, innovation, and choice. For more information, please visit www.criteo.com.

3 Anisse Astier
• Personal website: https://anisse.astier.eu
• 14 years as an embedded engineer
• Staff SRE in Criteo Hardware Team

4 Criteo Infrastructure
• 60 people, 9 in the Hardware Team
• ~40k servers across 3 continents
• 1M+ CPU cores, ~700 PiB
• 4 different server vendors
• 7 server flavors

5 Agenda: through the server lifecycle
1. Selection
2. Production
3. Support
4. Repair
5. Decommission

Selection

7 Traditional approach: an error-prone process
• Get a new server
• Do hours-long, manual tests
• Realize you picked the wrong BIOS option
• Start over
• Now you need to test another part
• Start over
• BMC power report inconsistent with PDU
• Start over

8 Combinatorics play against us
• Lots of dimensions
  o Chassis type
  o Components (CPU, memory, disks)
  o Vendor
• Need to reduce this (pre-selection)
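
To make the combinatorics concrete, here is a minimal sketch of how quickly the number of candidate configurations grows, and how much pre-selection shrinks it. The dimension values and the helper names `configurations` and `preselect` are hypothetical and are not taken from the talk.

```python
from itertools import product

# Hypothetical candidate options; none of these values come from the talk.
dimensions = {
    "vendor": ["vendor_a", "vendor_b", "vendor_c", "vendor_d"],
    "chassis": ["1U", "2U", "multi-node"],
    "cpu": ["cpu_x", "cpu_y", "cpu_z"],
    "memory": ["256GiB", "512GiB"],
    "disks": ["nvme", "sata_ssd"],
}


def configurations(dims):
    """Yield every combination as a dict with one key per dimension."""
    names = list(dims)
    for values in product(*(dims[name] for name in names)):
        yield dict(zip(names, values))


def preselect(dims, keep):
    """Return a copy of dims restricted to the values allowed in keep.

    Dimensions that are absent from keep are left untouched.
    """
    return {
        name: [v for v in values if v in keep.get(name, values)]
        for name, values in dims.items()
    }


full = sum(1 for _ in configurations(dimensions))
reduced_dims = preselect(
    dimensions, {"vendor": ["vendor_a", "vendor_b"], "chassis": ["1U"]}
)
reduced = sum(1 for _ in configurations(reduced_dims))
print(f"{full} configurations before pre-selection, {reduced} after")
```

With these made-up options, every extra dimension multiplies the number of servers to test, which is why reducing even one or two dimensions up front pays off.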

9 Our approach: automated benchmarking
• Tooling to orchestrate benchmarks
  o Reproducible
  o Modular
  o Automated
• Measure environment: power, thermal, fans
• Compare results across servers, chassis

10 Introducing hwbench
• Open source
• On-prem users: use it, give us feedback
• Vendors: try it, criticize and give advice
• Exchange traceable results and reproducible tests
• Contributions welcome! https://github.com/criteo/hwbench

11 Hwbench graphs (just a small sample): spiking

12 Hwbench graphs (just a small sample): compare inside chassis

13 Hwbench graphs (just a small sample): max perf

14 Hwbench graphs (just a small sample): max perf per core per watt
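
The "max perf per core per watt" metric named on the last graph can be sketched as a simple normalization. This is an illustrative calculation only: the `BenchResult` type, the server names, and the numbers are hypothetical, and the code does not use or reproduce the hwbench API.

```python
from dataclasses import dataclass


@dataclass
class BenchResult:
    server: str
    score: float        # benchmark score, higher is better (arbitrary unit)
    cores: int          # number of CPU cores used by the benchmark
    avg_power_w: float  # average power drawn during the run, in watts


def perf_per_watt(result):
    """Score normalized by average power draw."""
    return result.score / result.avg_power_w


def perf_per_core_per_watt(result):
    """Score normalized by both core count and average power draw."""
    return result.score / result.cores / result.avg_power_w


# Hypothetical measurements for two servers.
results = [
    BenchResult("server_a", score=12000.0, cores=64, avg_power_w=400.0),
    BenchResult("server_b", score=15000.0, cores=96, avg_power_w=500.0),
]

best = max(results, key=perf_per_core_per_watt)
for r in results:
    print(r.server, perf_per_watt(r), perf_per_core_per_watt(r))
print("best perf per core per watt:", best.server)
```

In this made-up case both servers have the same perf per watt, and only the per-core normalization separates them, which is one reason to look at several metrics side by side.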

Production

16 Criteo production layers
• Business apps
• High-level platforms
• Container platforms
• OS & low-level services
• Hardware & firmware

17 Happy path vs long tail
• Biggest part of infra:
  o Regular updates of firmware
  o Auto-detection of hardware problems
  o Auto-release of servers and ticket fixing
• Small number of customers:
  o Windows
  o Servers as pets, not cattle

18 Firmware management is hard
• Impossible tasks out of the box:
  o Listing all firmware, versions and available updates on a given server
  o Getting an accurate changelog
  o Common tooling to do updates
  o Precise control of firmware versions for canary testing
• See @Erwan’s OSFC talk: https://www.osfc.io/2022/talks/open-firmware-on-your-infrastructure-not-only-for-hyperscalers/
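
As an illustration of what "precise control of firmware versions for canary testing" could look like once an inventory is available, here is a minimal sketch. It assumes the installed versions have already been collected into a plain dict by some other tool; the component names, version strings, and the helpers `firmware_drift` and `in_canary` are hypothetical and do not describe Criteo's tooling.

```python
import hashlib


def firmware_drift(installed, target):
    """Return {component: (installed_version, target_version)} for mismatches.

    A component missing from the installed inventory is reported with None.
    """
    drift = {}
    for component, wanted in target.items():
        current = installed.get(component)
        if current != wanted:
            drift[component] = (current, wanted)
    return drift


def in_canary(hostname, percent):
    """Deterministically place roughly `percent`% of hosts in the canary group.

    Hashing the hostname keeps group membership stable across runs.
    """
    digest = hashlib.sha256(hostname.encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100 < percent


# Hypothetical inventory and target versions.
installed = {"bios": "1.4.2", "bmc": "2.10", "nic": "22.31.6"}
target = {"bios": "1.5.0", "bmc": "2.10", "nvme": "3B2QGXA7"}

for component, (current, wanted) in firmware_drift(installed, target).items():
    print(f"{component}: installed={current} target={wanted}")
```

Comparing versions as opaque strings is a deliberate simplification here, since firmware version formats differ between vendors and components.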

19 Firmware wishlist
• Vendor-agnostic and automated firmware distribution, targeting SREs (automation)
  o Toolbox (API) vs packaged tool
• Participate in open-source tooling:
  o LVFS/fwupd
  o dmidecode (HPE), smartmontools, nvme-cli
• Open source is a major differentiator

20 Production: Monitoring
Redfish: useful but inadequate
• New specs are slow to arrive on servers (e.g. PLDM)
• No certification, only test suites and profiles
• Often metrics need vendor-specific handling and reverse engineering (undocumented)
• Easy to find inconsistencies in implementations
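
The kind of per-implementation handling mentioned above can be sketched as a small normalization function for power readings. This is a sketch under assumptions: it presumes the payload is an already-fetched Redfish resource parsed into a dict, and that the property names (`PowerWatts.Reading` in an EnvironmentMetrics resource, `PowerControl[].PowerConsumedWatts` in the older Power resource) match what a given BMC actually returns, which should be checked against the DMTF schemas and the real hardware. The function name `power_watts` is hypothetical.

```python
def power_watts(payload):
    """Extract a power reading in watts from a parsed Redfish resource.

    Returns None when no usable reading is found, so callers can tell
    "no data" apart from a genuine reading of zero.
    """
    # Assumed newer layout: EnvironmentMetrics with PowerWatts.Reading.
    reading = (payload.get("PowerWatts") or {}).get("Reading")
    if reading is not None:
        return float(reading)

    # Assumed older layout: Power resource with a PowerControl array.
    # Taking the first non-null entry is a simplification; some systems
    # expose several entries with different meanings.
    for control in payload.get("PowerControl") or []:
        consumed = control.get("PowerConsumedWatts")
        if consumed is not None:
            return float(consumed)

    return None


# Hypothetical payloads from two different implementations.
print(power_watts({"PowerWatts": {"Reading": 312.5}}))
print(power_watts({"PowerControl": [{"PowerConsumedWatts": 298}]}))
```

Returning None instead of raising keeps the collection loop running when one implementation omits or nulls a field, at the cost of having to count and alert on missing readings separately.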

21 Monitoring
In-band or out-of-band?
• Out-of-band does not impact customer workloads
• Untestable and undocumented error cases (e.g. undecoded MCEs or BMC errors)
  o Hard to automate detection
  o No indication of severity or action to take
• In-band still necessary (disks, CPU perf throttling...)

22 Monitoring
Vendors can't see
• A vendor engineer only has access to a limited set of servers
• We monitor and see issues on thousands of a single type of server
  o « in rare conditions »
  o A few percent is a lot of impacted servers
• Let’s (transparently) share the monitoring metrics!
  o Vendors can anticipate support
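
A short calculation shows why an issue that looks rare in a vendor lab is routine at fleet scale. The fleet size, lab size, and failure rate below are hypothetical, and the model assumes each server is affected independently with the same probability.

```python
def expected_impacted(servers, rate):
    """Expected number of affected servers for a given per-server rate."""
    return servers * rate


def chance_of_seeing_it(servers, rate):
    """Probability that at least one of `servers` machines is affected."""
    return 1.0 - (1.0 - rate) ** servers


# Hypothetical numbers: a 2% issue, a 20-server lab, a 5000-server fleet.
rate = 0.02
lab_servers = 20
fleet_servers = 5000

print(f"fleet: about {expected_impacted(fleet_servers, rate):.0f} servers impacted")
print(f"lab: {chance_of_seeing_it(lab_servers, rate):.0%} chance of seeing it at all")
```

Under these assumptions the fleet operator deals with the problem on many machines, while a small lab would more often than not never observe it, which is the argument for sharing monitoring metrics.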

Support

24 Server generations
• 4 people keep hardware working & maintained
• Criteo maintains and updates software, OS, etc.
• Multiple server generations co-exist
• Vendors don't upgrade tools & firmware

25 Software maintenance business model
• EOL at vendors differs from EOL at customers
  o No more bugfixes (see gen+1)
  o No more new features (see gen+2)
• Clear communication in advance
• Pay to have software maintenance throughout server lifetime?

26 Engineering support
• Find a hardware-related bug
• Spend hours / days debugging it
• Write a detailed report
• Open a vendor ticket

27 Tech support

28 Support
• Hard to talk to engineering
• Opening doors does not scale for all customers
• Lack of hardware-specific tools to extract information (traces, crash logs…)
• Scale not considered

29 Getting bugs fixed
• « in rare conditions »
• Sometimes changing parts at scale is impractical
• (Good) written reports have the most impact in the long term, as they get passed around
  o But you might not see the effect for a while (around 18 months)

Repair

31 Repair process
• Semi-automated (not fully) pipeline at Criteo
• RMA-oriented
• Throughput-optimized, not latency (for now)

32 Repair process: future changes
• Densification (vertical scaling) might change the equation
• Longer server lifetime (and warranties) mean more parts to change
• APIs!

33 Best RMA is no RMA
It can be avoided
• JEDEC Post Package Repair (PPR) for DIMMs
• Firmware remediations
  o Bad diagnostic in monitoring
  o Software bug already fixed after bump
  o Software bug to report

34 Back from repair
• How to verify?
  o We have in-house tools
  o We lack tools to identify « working hardware »
• How identical do we want it to be?
  o Functional equivalents are not always equivalent
  o No automatable checks

Decommission

36 Decommission Sustainability

37 Decommission: value preservation
• We resell decommissioned servers
• We care about root-of-trust transfer of ownership for BMCs (see talk by @Vincent on OpenBMC)
• Processor fusing isn’t properly thought out
  o Security could be kept without value destruction

38 Decommission • Data confidentiality

Summary

40 Operating principles manifesto
https://github.com/criteo/hardware-manifesto
• We want open-source software everywhere
  o We aim at working only with companies that offer and support open-source software
• We want to improve things for everyone
  o If ideas and processes can help similar actors, we are happy to share mostly everything.
• We want privileged access to engineering
  o Support starts at L3+
• We want everything to be automation-ready
  o Infra-as-code approach, usually via API
• We question spec sheets and put promises to the test
  o NDA level data sheets >> marketing presentations
• We want to see what the future holds
  o Knowledge is key when planning months/years ahead. Best would be to contribute to designs & roadmaps.
• We want to use our assets for as long as possible
  o 5-year strict minimum. 7 to 9-year is our target
• We consider all costs
  o TCO is key (power consumption, performance, cost, long-term support,…).
• We truly care about our environmental footprint
  o Given similar products, environmental impact would be a strong tie-breaker.
Openness, Expertise, Efficiency orientation

Thank you!
• Hardware manifesto: https://github.com/criteo/hardware-manifesto
• hwbench: https://github.com/criteo/hwbench
