Premday - Challenges in using server automation for maximal efficiency

Anisse Astier from Criteo presents 'Challenges in using server automation for maximal efficiency'

PremDay

July 01, 2024

Transcript

  1. 2 Criteo (NASDAQ: CRTO) is the global commerce media company that enables marketers and media owners to drive better commerce outcomes. Its industry-leading Commerce Media Platform connects thousands of marketers and media owners to deliver richer consumer experiences from product discovery to purchase. By powering trusted and impactful advertising, Criteo supports an open internet that encourages discovery, innovation, and choice. For more information, please visit www.criteo.com.
  2. 3 Anisse Astier
    • Personal website: https://anisse.astier.eu
    • 14 years as embedded engineer
    • Staff SRE in Criteo Hardware Team
  3. 4 Criteo Infrastructure
    • 60 persons, 9 in the Hardware Team
    • ~40k servers across 3 continents
    • 1M+ CPU cores, ~700 PiB
    • 4 different server vendors
    • 7 server flavors
  4. 5 Agenda: through the server lifecycle
    1. Selection
    2. Production
    3. Support
    4. Repair
    5. Decommission
  5. 7 Traditional approach: an error-prone process
    • Get a new server
    • Do hours-long manual tests
    • Realize you picked the wrong BIOS option
    • Start over
    • Now you need to test another part
    • Start over
    • BMC power report inconsistent with PDU
    • Start over
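The last failure mode in the list above (BMC power report inconsistent with PDU) is the kind of check that can be automated rather than discovered by hand. A minimal sketch, assuming both readings are already available as average watts over the same time window; the function name and the 10% tolerance are hypothetical and not taken from the talk:

```python
def power_readings_consistent(bmc_watts: float, pdu_watts: float,
                              tolerance: float = 0.10) -> bool:
    """Return True when the BMC reading is within `tolerance` of the PDU reading.

    The PDU is used as the reference here; that choice and the default
    tolerance are illustrative assumptions.
    """
    if pdu_watts <= 0:
        raise ValueError("PDU reading must be positive")
    relative_gap = abs(bmc_watts - pdu_watts) / pdu_watts
    return relative_gap <= tolerance


# Hypothetical readings: a 2.5% gap passes, a 25% gap does not.
print(power_readings_consistent(410.0, 400.0))
print(power_readings_consistent(300.0, 400.0))
```

Running such a check at the start of a test campaign avoids finding the discrepancy only after hours of benchmarks.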
  6. 8 Combinatorics play against us
    • Lots of dimensions
      o Chassis type
      o Components (CPU, memory, disks)
      o Vendor
    • Need to reduce this (pre-selection)
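To make the combinatorics concrete, here is a small sketch that counts candidate configurations before and after pre-selection. All dimension values are invented for illustration; the slides only name the dimensions (chassis type, components, vendor).

```python
from itertools import product

# Hypothetical values for each dimension named on the slide.
dimensions = {
    "chassis": ["1U", "2U"],
    "cpu": ["cpu-a", "cpu-b", "cpu-c"],
    "memory": ["256GiB", "512GiB"],
    "disks": ["nvme", "sata-ssd"],
    "vendor": ["vendor-1", "vendor-2", "vendor-3", "vendor-4"],
}


def configurations(dims):
    """Enumerate every combination of one value per dimension."""
    keys = list(dims)
    return [dict(zip(keys, combo)) for combo in product(*dims.values())]


all_configs = configurations(dimensions)

# Pre-selection: narrow some dimensions before any benchmark is run.
shortlist = {**dimensions, "cpu": ["cpu-a"], "vendor": ["vendor-1", "vendor-2"]}
preselected = configurations(shortlist)

print(len(all_configs), len(preselected))  # 96 16
```

Even with these small made-up lists, the full product has 96 configurations; narrowing two dimensions brings it down to 16, which is why manual testing of every combination does not scale.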
  7. 9 Our approach: automated benchmarking
    • Tooling to orchestrate benchmarks
      o Reproducible
      o Modular
      o Automated
    • Measure environment: power, thermal, fans
    • Compare results across servers, chassis
  8. 10 Introducing hwbench
    • Open source
    • On-prem users: use it, give us feedback
    • Vendors: try it, criticize and give advice
    • Exchange traceable results and reproducible tests
    • Contributions welcome! https://github.com/criteo/hwbench
  9. 14 Hwbench graphs
    Just a small sample: max perf per core per watt
    https://github.com/criteo/hwbench
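The slide does not say how hwbench derives "max perf per core per watt", so the following is only a sketch of one plausible reading: divide a benchmark score by the number of loaded cores and by the average power draw, then keep the best run per server. The `Run` class, its fields, and all numbers are hypothetical and unrelated to hwbench's actual data model.

```python
from dataclasses import dataclass


@dataclass
class Run:
    server: str
    cores: int        # number of cores loaded during the run
    score: float      # benchmark score (arbitrary unit)
    avg_watts: float  # average power draw measured during the run

    @property
    def perf_per_core_per_watt(self) -> float:
        return self.score / self.cores / self.avg_watts


# Made-up results for two servers.
runs = [
    Run("server-a", cores=64, score=12800.0, avg_watts=400.0),
    Run("server-b", cores=96, score=17280.0, avg_watts=600.0),
    Run("server-a", cores=32, score=7680.0, avg_watts=300.0),
]


def best_per_server(all_runs):
    """Keep, for each server, the run with the highest perf per core per watt."""
    best = {}
    for run in all_runs:
        current = best.get(run.server)
        if current is None or run.perf_per_core_per_watt > current.perf_per_core_per_watt:
            best[run.server] = run
    return best


for server, run in sorted(best_per_server(runs).items()):
    print(server, run.cores, round(run.perf_per_core_per_watt, 3))
```

With these numbers, server-a peaks on its 32-core run, which shows why the metric has to be computed per run rather than from a single full-load figure.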
  10. 16 Criteo production layers
    • Business apps
    • High-level platforms
    • Container platforms
    • OS & low-level services
    • Hardware & firmware
  11. 17 Happy path vs long tail
    • Biggest part of infra:
      o Regular updates of firmware
      o Auto-detection of hardware problems
      o Auto-release of servers and ticket fixing
    • Small number of customers:
      o Windows
      o Servers as pets, not cattle
  12. 18 Firmware management is hard
    • Impossible tasks out of the box:
      o Listing all firmware, versions and available updates on a given server
      o Getting an accurate changelog
      o Common tooling to do updates
      o Precise control of firmware versions for canary testing
    • See @Erwan’s OSFC talk: https://www.osfc.io/2022/talks/open-firmware-on-your-infrastructure-not-only-for-hyperscalers/
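The canary requirement above boils down to comparing what is installed against what is expected for each server. A minimal sketch, assuming a firmware inventory has already been collected somehow (which the slide notes is itself hard); the server names, component names, versions, and data layout are all hypothetical:

```python
# Hypothetical collected inventory: server -> component -> installed version.
inventory = {
    "server-001": {"bios": "2.4.1", "bmc": "1.10", "nic": "22.5"},
    "server-002": {"bios": "2.3.0", "bmc": "1.10", "nic": "22.5"},
    "server-003": {"bios": "2.4.1", "bmc": "1.09", "nic": "22.5"},
}

baseline = {"bios": "2.3.0", "bmc": "1.10", "nic": "22.5"}  # fleet-wide pins
canary_overrides = {"bios": "2.4.1"}                        # version under test
canary_servers = {"server-001"}


def expected_versions(server):
    """Baseline pins, with canary overrides applied to canary servers only."""
    target = dict(baseline)
    if server in canary_servers:
        target.update(canary_overrides)
    return target


def drift(inv):
    """Map each drifting server to {component: (installed, expected)}."""
    report = {}
    for server, installed in inv.items():
        expected = expected_versions(server)
        diffs = {
            component: (installed.get(component), version)
            for component, version in expected.items()
            if installed.get(component) != version
        }
        if diffs:
            report[server] = diffs
    return report


print(drift(inventory))
```

Here only server-003 is reported: it runs the canary BIOS without being in the canary group, and its BMC lags behind the baseline.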
  13. 19 Firmware wishlist
    • Vendor-agnostic and automated firmware distribution, targeting SREs (automation)
      o Toolbox (API) vs packaged tool
    • Participate in open-source tooling:
      o LVFS/fwupd
      o dmidecode (HPE), smartmontools, nvme-cli
    • Open source is a major differentiator
  14. 20 Production: Monitoring
    Redfish: useful but inadequate
    • New specs are slow to arrive on servers (e.g. PLDM)
    • No certification, only test suites and profiles
    • Metrics often need vendor-specific handling and reverse engineering (undocumented)
    • Easy to find inconsistencies in implementations
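As an illustration of the vendor-specific handling mentioned above, the sketch below normalizes CPU temperature readings from two trimmed, made-up payloads shaped like a Redfish `Thermal` resource. The property names (`Temperatures`, `Name`, `ReadingCelsius`, `PhysicalContext`, `Status.State`) come from the Redfish schema; the payload contents, the vendor differences, and the name-based fallback are assumptions made for this example.

```python
def cpu_temperatures(thermal: dict) -> list[float]:
    """Extract CPU temperature readings, tolerating missing optional fields."""
    readings = []
    for sensor in thermal.get("Temperatures", []):
        value = sensor.get("ReadingCelsius")
        state = (sensor.get("Status") or {}).get("State")
        if value is None or state == "Absent":
            continue  # empty socket or sensor without a reading
        context = sensor.get("PhysicalContext")
        name = sensor.get("Name", "")
        # Fall back to the sensor name when PhysicalContext is not provided.
        if context == "CPU" or (context is None and "cpu" in name.lower()):
            readings.append(float(value))
    return readings


# Hypothetical vendor A: fills PhysicalContext and reports an absent CPU2.
vendor_a = {"Temperatures": [
    {"Name": "CPU1 Temp", "ReadingCelsius": 54, "PhysicalContext": "CPU",
     "Status": {"State": "Enabled"}},
    {"Name": "CPU2 Temp", "ReadingCelsius": None, "PhysicalContext": "CPU",
     "Status": {"State": "Absent"}},
    {"Name": "Inlet Temp", "ReadingCelsius": 23, "PhysicalContext": "Intake",
     "Status": {"State": "Enabled"}},
]}

# Hypothetical vendor B: omits PhysicalContext and Status entirely.
vendor_b = {"Temperatures": [
    {"Name": "Temp_CPU0", "ReadingCelsius": 61.5},
    {"Name": "Temp_Ambient", "ReadingCelsius": 25},
]}

print(cpu_temperatures(vendor_a), cpu_temperatures(vendor_b))
```

The name-matching fallback is exactly the kind of fragile heuristic that ends up in monitoring code when implementations diverge.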
  15. 21 Monitoring
    In-band or out-of-band?
    • Out-of-band does not impact customer workloads
    • Untestable and undocumented error cases (e.g. undecoded MCEs or BMC errors)
      o Hard to automate detection
      o No indication of severity or action to take
    • In-band still necessary (disks, CPU perf throttling...)
  16. 22 Monitoring
    Vendors can't see
    • A vendor engineer only has access to a limited set of servers
    • We monitor and see issues on thousands of servers of a single type
      o "in rare conditions"
      o A few percent is a lot of impacted servers
    • Let’s (transparently) share the monitoring metrics!
      o Vendors can anticipate support
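A quick back-of-the-envelope calculation shows why a "rare" issue looks so different from each side. The sketch below assumes a hypothetical 2% failure rate, a fleet of 5,000 servers of one type, a vendor lab of 20 servers, and independent failures; none of these figures come from the talk.

```python
failure_rate = 0.02   # hypothetical: 2% of servers hit the issue
fleet_size = 5000     # hypothetical fleet of a single server type
lab_size = 20         # hypothetical number of servers a vendor engineer can test

# Expected number of impacted servers in the fleet.
expected_impacted = round(fleet_size * failure_rate)

# Probability that at least one lab server shows the issue,
# assuming failures are independent across servers.
p_seen_in_lab = 1 - (1 - failure_rate) ** lab_size

print(expected_impacted)
print(f"{p_seen_in_lab:.0%}")
```

Under these assumptions the operator deals with about a hundred impacted servers, while the vendor lab has roughly a one-in-three chance of ever seeing the problem.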
  17. 24 Server generations
    • 4 people keep hardware working & maintained
    • Criteo maintains and updates software, OS, etc.
    • Multiple server generations co-exist
    • Vendors don't upgrade tools & firmware
  18. 25 Software maintenance business model
    • EOL at vendors differs from EOL at customers
      o No more bugfixes (see gen+1)
      o No more new features (see gen+2)
    • Clear communication in advance
    • Pay to have software maintenance throughout server lifetime?
  19. 26 Engineering support
    • Find a hardware-related bug
    • Spend hours / days debugging it
    • Write a detailed report
    • Open a vendor ticket
  20. 28 Support
    • Hard to talk to engineering
    • Opening doors does not scale for all customers
    • Lack of hardware-specific tools to extract information (traces, crash logs…)
    • Scale not considered
  21. 29 Getting bugs fixed
    • "in rare conditions"
    • Sometimes changing parts at scale is impractical
    • (Good) written reports have the most impact in the long term, as they get passed around
      o But you might not see this (around 18 months)
  22. 31 Repair process
    • Semi-automated (not fully) pipeline at Criteo
    • RMA-oriented
    • Throughput-optimized, not latency-optimized (for now)
  23. 32 Repair process
    Future changes
    • Densification (vertical scaling) might change the equation
    • Longer server lifetime (and warranties) mean more parts to change
    • APIs!
  24. 33 Best RMA is no RMA
    It can be avoided
    • JEDEC Post Package Repair (PPR) for DIMMs
    • Firmware remediations
      o Bad diagnostic in monitoring
      o Software bug already fixed after bump
      o Software bug to report
  25. 34 Back from repair
    • How to verify?
      o We have in-house tools
      o We lack tools to identify "working hardware"
    • How identical do we want it to be?
      o Functional equivalents are not always equivalent
      o No automatable checks
  26. 37 Decommission
    Value preservation
    • We resell decommissioned servers
    • We care about root-of-trust transfer of ownership for BMCs (see talk by @Vincent on OpenBMC)
    • Processor fusing isn’t properly thought out
      o Security could be kept without value destruction
  27. 40 Operating principles manifesto
    https://github.com/criteo/hardware-manifesto
    • We want open-source software everywhere
      o We aim at working only with companies that offer and support open-source software
    • We want to improve things for everyone
      o If ideas and processes can help similar actors, we are happy to share almost everything
    • We want privileged access to engineering
      o Support starts at L3+
    • We want everything to be automation-ready
      o Infra-as-code approach, usually via API
    • We question spec sheets and put promises to the test
      o NDA-level data sheets >> marketing presentations
    • We want to see what the future holds
      o Knowledge is key when planning months/years ahead. Best would be to contribute to designs & roadmaps
    • We want to use our assets for as long as possible
      o 5-year strict minimum. 7 to 9 years is our target
    • We consider all costs
      o TCO is key (power consumption, performance, cost, long-term support,…)
    • We truly care about our environmental footprint
      o Given similar products, environmental impact would be a strong tie-breaker
    Openness, Expertise, Efficiency orientation
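The "use assets for as long as possible" and "TCO is key" principles can be tied together with a simple cost model. The sketch below is illustrative only: the purchase price, power draw, electricity price, and support cost are invented, and the model ignores cooling overhead, performance differences, and resale value.

```python
HOURS_PER_YEAR = 8760


def total_cost_of_ownership(purchase: float, avg_watts: float,
                            price_per_kwh: float, support_per_year: float,
                            years: int) -> float:
    """Purchase price plus energy and support costs over the lifetime."""
    energy_per_year = avg_watts / 1000 * HOURS_PER_YEAR * price_per_kwh
    return purchase + years * (energy_per_year + support_per_year)


# Hypothetical server: same hardware kept for 5 years vs 8 years.
params = dict(purchase=10_000.0, avg_watts=400.0,
              price_per_kwh=0.15, support_per_year=300.0)

tco_5 = total_cost_of_ownership(years=5, **params)
tco_8 = total_cost_of_ownership(years=8, **params)

print(round(tco_5 / 5, 2), round(tco_8 / 8, 2))  # cost per year of service
```

With these made-up inputs the cost per year of service drops when the lifetime is extended, because the purchase price is spread over more years; a real comparison would also weigh the efficiency gains of newer generations.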