Slide 1

Slide 1 text

Challenges in using server automation for maximal efficiency
Prem’Day 2024
Anisse Astier

Slide 2

Slide 2 text

Criteo (NASDAQ: CRTO) is the global commerce media company that enables marketers and media owners to drive better commerce outcomes. Its industry-leading Commerce Media Platform connects thousands of marketers and media owners to deliver richer consumer experiences from product discovery to purchase. By powering trusted and impactful advertising, Criteo supports an open internet that encourages discovery, innovation, and choice. For more information, please visit www.criteo.com.

Slide 3

Slide 3 text

Anisse Astier
• Personal website: https://anisse.astier.eu
• 14 years as embedded engineer
• Staff SRE in Criteo Hardware Team

Slide 4

Slide 4 text

Criteo Infrastructure
• 60 people, 9 in the Hardware Team
• ~40k servers across 3 continents
• 1M+ CPU cores, ~700 PiB
• 4 different server vendors
• 7 server flavors

Slide 5

Slide 5 text

Agenda
Through the server lifecycle
1 Selection
2 Production
3 Support
4 Repair
5 Decommission

Slide 6

Slide 6 text

Selection

Slide 7

Slide 7 text

Traditional approach
An error-prone process
• Get a new server
• Do hours-long manual tests
• Realize you picked the wrong BIOS option
• Start over
• Now you need to test another part
• Start over
• BMC power report inconsistent with PDU
• Start over
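The "BMC power report inconsistent with PDU" step is the kind of check that can be scripted instead of discovered late. A minimal sketch, assuming both readings are already available as watts; the function name and the 10% tolerance are hypothetical choices for illustration, not values from the talk:

```python
def power_mismatch(bmc_watts: float, pdu_watts: float, tolerance: float = 0.10) -> bool:
    """Return True when the BMC and PDU readings disagree by more than
    `tolerance`, expressed as a fraction of the PDU reading.

    The PDU is used as the reference here; that is an assumption.
    """
    if pdu_watts <= 0:
        raise ValueError("PDU reading must be positive")
    return abs(bmc_watts - pdu_watts) / pdu_watts > tolerance


# Example: a 70 W gap on a 320 W PDU reading is about 22%, so it is flagged.
print(power_mismatch(300.0, 310.0))  # small gap, within tolerance
print(power_mismatch(250.0, 320.0))  # large gap, flagged
```

Running such a check at the start of a test campaign avoids finding the inconsistency only after hours of benchmarks.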

Slide 8

Slide 8 text

Combinatorics play against us
• Lots of dimensions
  o Chassis type
  o Components (CPU, memory, disks)
  o Vendor
• Need to reduce this (pre-selection)
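To make the combinatorics concrete, here is a small sketch that counts candidate configurations before and after pre-selection. Every option name and count below is made up for illustration; none of it comes from Criteo's actual catalogue:

```python
from itertools import product

# Hypothetical options per dimension (assumption, for illustration only).
dimensions = {
    "vendor": ["vendor_a", "vendor_b", "vendor_c", "vendor_d"],
    "chassis": ["1U", "2U", "multi-node"],
    "cpu": ["cpu_x", "cpu_y", "cpu_z"],
    "memory": ["256G", "512G"],
    "disks": ["nvme", "sata"],
}


def count_configurations(dims):
    """Count every combination of one option per dimension."""
    return sum(1 for _ in product(*dims.values()))


def preselect(dims, keep):
    """Restrict some dimensions to a shortlist; untouched ones stay whole."""
    return {
        name: [v for v in values if v in keep.get(name, values)]
        for name, values in dims.items()
    }


shortlist = preselect(dimensions, {"vendor": ["vendor_a", "vendor_b"], "cpu": ["cpu_x"]})
print(count_configurations(dimensions))  # 4 * 3 * 3 * 2 * 2 = 144
print(count_configurations(shortlist))   # 2 * 3 * 1 * 2 * 2 = 24
```

Even with these small made-up numbers, shortlisting two dimensions cuts the test matrix by a factor of six, which is why pre-selection comes before any benchmarking.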

Slide 9

Slide 9 text

Our approach: automated benchmarking
• Tooling to orchestrate benchmarks
  o Reproducible
  o Modular
  o Automated
• Measure environment: power, thermal, fans
• Compare results across servers, chassis

Slide 10

Slide 10 text

Introducing hwbench
• Open source
• On-prem users: use it, give us feedback
• Vendors: try it, criticize and give advice
• Exchange traceable results and reproducible tests
• Contributions welcome!
https://github.com/criteo/hwbench

Slide 11

Slide 11 text

hwbench graphs
Just a small sample: spiking
https://github.com/criteo/hwbench

Slide 12

Slide 12 text

hwbench graphs
Just a small sample: compare inside chassis
https://github.com/criteo/hwbench

Slide 13

Slide 13 text

hwbench graphs
Just a small sample: max perf
https://github.com/criteo/hwbench

Slide 14

Slide 14 text

hwbench graphs
Just a small sample: max perf per core per watt
https://github.com/criteo/hwbench
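A "perf per core per watt" figure can be derived from a benchmark score, a core count and a measured average power. The sketch below assumes the simplest possible definition (score divided by cores, divided by watts); it is not taken from hwbench, whose exact metric definition is not given in these slides, and the `BenchResult` type and sample numbers are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class BenchResult:
    name: str
    score: float        # benchmark score, higher is better (arbitrary unit)
    cores: int          # number of cores used by the benchmark
    avg_power_w: float  # average power measured during the run, in watts


def perf_per_core_per_watt(result: BenchResult) -> float:
    """Assumed metric: score / cores / average watts."""
    if result.cores <= 0 or result.avg_power_w <= 0:
        raise ValueError("cores and power must be positive")
    return result.score / result.cores / result.avg_power_w


def rank(results):
    """Sort results from most to least efficient under the assumed metric."""
    return sorted(results, key=perf_per_core_per_watt, reverse=True)


server_a = BenchResult("server_a", score=6400.0, cores=64, avg_power_w=400.0)
server_b = BenchResult("server_b", score=9600.0, cores=128, avg_power_w=500.0)
for r in rank([server_a, server_b]):
    print(r.name, perf_per_core_per_watt(r))
```

With these made-up numbers, server_b has the higher raw score but server_a ranks first on the normalized metric, which is the kind of reversal the comparison graphs are meant to expose.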

Slide 15

Slide 15 text

Production

Slide 16

Slide 16 text

Criteo production layers
• Business apps
• High level platforms
• Container platforms
• OS & low-level services
• Hardware & firmware

Slide 17

Slide 17 text

Happy path vs long tail
• Biggest part of infra:
  o Regular updates of firmware
  o Auto-detection of hardware problems
  o Auto-release of servers and ticket fixing
• Small number of customers:
  o Windows
  o Servers as pets, not cattle

Slide 18

Slide 18 text

Firmware management is hard
• Impossible tasks out of the box:
  o Listing all firmware, versions and available updates on a given server
  o Getting an accurate changelog
  o Common tooling to do updates
  o Precise control of firmware versions for canary testing
• See @Erwan’s OSFC talk: https://www.osfc.io/2022/talks/open-firmware-on-your-infrastructure-not-only-for-hyperscalers/
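Part of the first item does have a standard building block: Redfish defines a firmware inventory collection under `/redfish/v1/UpdateService/FirmwareInventory`, whose members expose `Id`, `Name`, `Version` and `Updateable`. How complete that collection is depends on the vendor implementation, and it says nothing about available updates or changelogs. The sketch below walks the collection and compares it to pinned versions; `fetch` is a hypothetical callable (for example a wrapper around an authenticated HTTP GET) and `version_drift` is an illustrative helper, not an existing tool:

```python
def list_firmware(fetch):
    """Walk the Redfish firmware inventory collection.

    `fetch` is an assumed callable: it takes a Redfish path and returns
    the decoded JSON document as a dict.
    """
    collection = fetch("/redfish/v1/UpdateService/FirmwareInventory")
    inventory = []
    for member in collection.get("Members", []):
        item = fetch(member["@odata.id"])
        inventory.append({
            "id": item.get("Id"),
            "name": item.get("Name"),
            "version": item.get("Version"),
            # Optional property; None when the BMC does not report it.
            "updateable": item.get("Updateable"),
        })
    return inventory


def version_drift(inventory, pinned):
    """Return {id: (found, expected)} for components that differ from the
    pinned versions, e.g. to check a canary group before a wider rollout."""
    return {
        item["id"]: (item["version"], pinned[item["id"]])
        for item in inventory
        if item["id"] in pinned and item["version"] != pinned[item["id"]]
    }


# Canned responses standing in for a real BMC (made-up values).
base = "/redfish/v1/UpdateService/FirmwareInventory"
fake_bmc = {
    base: {"Members": [{"@odata.id": base + "/BIOS"}, {"@odata.id": base + "/BMC"}]},
    base + "/BIOS": {"Id": "BIOS", "Name": "System BIOS", "Version": "1.2.3", "Updateable": True},
    base + "/BMC": {"Id": "BMC", "Name": "BMC Firmware", "Version": "4.5"},
}
inventory = list_firmware(fake_bmc.__getitem__)
print(version_drift(inventory, {"BIOS": "1.2.4", "BMC": "4.5"}))
```

Passing `fetch` in as a parameter keeps the walk testable against canned responses, which matters when each vendor's BMC answers slightly differently.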

Slide 19

Slide 19 text

Firmware wishlist
• Vendor-agnostic and automated firmware distribution, targeting SREs (automation)
  o Toolbox (API) vs packaged tool
• Participate in open-source tooling:
  o LVFS/fwupd
  o dmidecode (HPE), smartmontools, nvme-cli
• Open source is a major differentiator

Slide 20

Slide 20 text

Production: Monitoring
Redfish: useful but inadequate
• New specs are slow to arrive on servers (e.g. PLDM)
• No certification, only test suites and profiles
• Metrics often need vendor-specific handling and reverse engineering (undocumented)
• Easy to find inconsistencies in implementations

Slide 21

Slide 21 text

Monitoring
In-band or out-of-band?
• Out-of-band does not impact customer workloads
• Untestable and undocumented error cases (e.g. undecoded MCEs or BMC errors)
  o Hard to automate detection
  o No indication of severity or action to take
• In-band still necessary (disks, CPU perf throttling...)

Slide 22

Slide 22 text

Monitoring
Vendors can't see
• A vendor engineer only has access to a limited set of servers
• We monitor and see issues on thousands of a single type of server
  o "in rare conditions"
  o A few percent is a lot of impacted servers
• Let’s (transparently) share the monitoring metrics!
  o Vendors can anticipate support

Slide 23

Slide 23 text

Support

Slide 24

Slide 24 text

Server generations
• 4 people keep hardware working & maintained
• Criteo maintains and updates software, OS, etc.
• Multiple server generations co-exist
• Vendors don't upgrade tools & firmware

Slide 25

Slide 25 text

Software maintenance business model
• Vendor EOL differs from customer EOL
  o No more bugfixes (see gen+1)
  o No more new features (see gen+2)
• Clear communication in advance
• Pay to have software maintenance throughout server lifetime?

Slide 26

Slide 26 text

Engineering support
• Find a hardware-related bug
• Spend hours / days debugging it
• Write a detailed report
• Open a vendor ticket

Slide 27

Slide 27 text

Tech support

Slide 28

Slide 28 text

Support
• Hard to talk to engineering
• Opening doors does not scale for all customers
• Lack of hardware-specific tools to extract information (traces, crash logs…)
• Scale not considered

Slide 29

Slide 29 text

Getting bugs fixed
• "in rare conditions"
• Sometimes changing parts at scale is impractical
• (Good) Written reports have the most impact in the long term, as they get passed around
  o But you might not see this (around 18 months)

Slide 30

Slide 30 text

Repair

Slide 31

Slide 31 text

Repair process
• Semi-automated (not fully) pipeline at Criteo
• RMA-oriented
• Throughput-optimized, not latency (for now)

Slide 32

Slide 32 text

Repair process
Future changes
• Densification (vertical scaling) might change the equation
• Longer server lifetime (and warranties) mean more parts to change
• APIs!

Slide 33

Slide 33 text

Best RMA is no RMA
It can be avoided
• JEDEC Post Package Repair (PPR) for DIMMs
• Firmware remediations
  o Bad diagnostic in monitoring
  o Software bug already fixed after bump
  o Software bug to report

Slide 34

Slide 34 text

Back from repair
• How to verify?
  o We have in-house tools
  o We lack tools to identify "working hardware"
• How identical do we want it to be?
  o Functional equivalents are not always equivalent
  o No automatable checks
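One piece of the "how identical" question can at least be made explicit: comparing a component inventory taken before the repair with one taken after. The sketch below is hypothetical and is not Criteo's in-house tooling; it assumes each inventory is a mapping from slot name to a part descriptor string, and it only reports differences. Deciding whether a swapped part is truly equivalent remains a human decision:

```python
def diff_inventory(before, after):
    """Return {slot: (old_part, new_part)} for every slot whose part changed.

    A part that is missing on one side is reported as None.
    """
    changes = {}
    for slot in sorted(set(before) | set(after)):
        old, new = before.get(slot), after.get(slot)
        if old != new:
            changes[slot] = (old, new)
    return changes


# Made-up inventories for illustration.
before_repair = {
    "DIMM_A1": "vendor_x 32G 3200",
    "DIMM_A2": "vendor_x 32G 3200",
    "NVME_0": "model_q 1.9T",
}
after_repair = {
    "DIMM_A1": "vendor_y 32G 3200",  # functional equivalent from another maker
    "DIMM_A2": "vendor_x 32G 3200",
}
for slot, (old, new) in diff_inventory(before_repair, after_repair).items():
    print(slot, old, "->", new)
```

A diff like this flags the swapped DIMM and the missing drive, but it cannot say whether the replacement DIMM behaves the same under load, which is the gap the slide points at.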

Slide 35

Slide 35 text

Decommission

Slide 36

Slide 36 text

Decommission
Sustainability

Slide 37

Slide 37 text

Decommission
Value preservation
• We resell decommissioned servers
• We care about root-of-trust transfer of ownership for BMCs (see talk by @Vincent on OpenBMC)
• Processor fusing isn’t properly thought through
  o Security could be kept without value destruction

Slide 38

Slide 38 text

Decommission
• Data confidentiality

Slide 39

Slide 39 text

Summary

Slide 40

Slide 40 text

Operating principles manifesto
https://github.com/criteo/hardware-manifesto
• We want open-source software everywhere
  o We aim at working only with companies that offer and support open-source software
• We want to improve things for everyone
  o If ideas and processes can help similar actors, we are happy to share mostly everything.
• We want privileged access to engineering
  o Support starts at L3+
• We want everything to be automation-ready
  o Infra-as-code approach, usually via API
• We question spec sheets and put promises to the test
  o NDA level data sheets >> marketing presentations
• We want to see what the future holds
  o Knowledge is key when planning months/years ahead. Best would be to contribute to designs & roadmaps.
• We want to use our assets for as long as possible
  o 5-year strict minimum. 7 to 9-year is our target
• We consider all costs
  o TCO is key (power consumption, performance, cost, long-term support,…).
• We truly care about our environmental footprint
  o Given similar products, environmental impact would be a strong tie-breaker.
Openness, Expertise, Efficiency orientation

Slide 41

Slide 41 text

Thank you!
Hardware manifesto: https://github.com/criteo/hardware-manifesto
hwbench: https://github.com/criteo/hwbench