PremDay - Challenges and solutions to operate a large and diverse on-prem hardware range

Stéphane Dutilleul from Scaleway presents: "Challenges and solutions to operate a large and diverse on-prem hardware range"

PremDay

July 02, 2024

Transcript

  1. Prem’Day 2024, Stéphane DUTILLEUL, May 16th 2024
    Challenges and solutions to operate a large and diverse on-prem hardware range
  2. Table of contents
    01 Scaleway at a Glance
    02 The Form Factor perspective
    03 The Density perspective
    04 The Maintainability perspective
  3. Who is Scaleway? (Scaleway at a glance)
    • A European cloud provider, founded in 1999. The most complete cloud ecosystem in Europe
    • Multi-AZ redundancy
    • Operates sustainable data centers in France, the Netherlands and Poland
    • 2nd life: servers reused for 10 years (vs industry standard of 3-4): reduces CO2 and e-waste
  4. Who is Scaleway? (Scaleway at a glance)
    • Late 2023:
      ◦ joined the OCP Community
      ◦ opened an OCP Experience Center
      ◦ https://eu.osfci.tech/ci/
  5. Who is Stéphane? (Scaleway at a glance)
    • Joined Scaleway early 2018 after 19 years at Sun Microsystems (and Oracle)
    • VP Hardware Engineering
    • Part of the Operations department
    • Scope of responsibilities
      ◦ HW architecture / solutions and validations
      ◦ Rack Level Design
      ◦ In-house tools dev (configuration, quality, manufacturing)
      ◦ HW installed base maintenance
  6. On premises installed base (Scaleway at a glance)
    • In quantity: tens of thousands of servers
    • Geographically: operated in several locations / regions -> multi-AZ
      ◦ And different DC infrastructures
    • Products: covering a large variety of products and so several types of servers (intensive compute, AI, storage, commodity hardware, …)
    • Sustainability / Free and adiabatic cooling
    • Continuous deployment
    Large …
  7. On premises installed base (Scaleway at a glance)
    • Servers in production and maintained for 10 years
      ◦ Sometimes up to 14 years …
    • Multi-vendor, with a mix of numerous generations of chassis/servers
      ◦ Up to 37 types of chassis at some point in time
    • Mix of many flavors and versions of firmware
    • Mix of several technologies (DDR, HDD, SSD, HW RAID …)
    … and diverse
  8. Challenges (The Form Factor perspective)
    • Many products serving different purposes → diversity of chassis is often a requirement
      ◦ Use several chassis for a SCW product
      ◦ A chassis server must fit as many customer requirements as possible
      ◦ However, we would like to limit the number of chassis types
    • One-size-fits-all is not a solution for us
      ◦ -> 1U, 2U, 3U, 4U, …
    • Rear IO not ideal
      ◦ Esp. when operating with hot aisle / adiabatic
  9. Challenges (The Form Factor perspective)
    • Obvious impact on the cooling capacity and power consumption
      ◦ Chassis size (2U)
      ◦ Fan size
    • A lot of possible configurations/form factors
      ◦ Generally overkill
      ◦ But hard to find suitable configurations
    • Personalization and customization require the capability to be flexible
      ◦ Baremetal and Cloud servers don’t always have the same requirements
  10. Solutions (The Form Factor perspective)
    • Front IO -> better for
      ◦ Rack level density
      ◦ Space optimization
      ◦ Cabling
      ◦ Serviceability
      ◦ Cooling
    • Configurations flexibility -> build a “catalogue”
      ◦ Diskless + bracket
        ▪ For a better control on the disks and DIMMs references
        ▪ For spare parts management as well
      ◦ Manufacturing Level 6 + CPU (incl. tests) in some cases
  11. Solutions (The Form Factor perspective)
    • DC-MHS + DC-SCM for better chassis configurations
      ◦ Interoperability across platform / DC-SCM
      ◦ Consistent form factors and interfaces
      ◦ Improve modularity
      ◦ And helps to limit the number of chassis “flavors”
    • HW replacement → RMA
      ◦ Process TBC
      ◦ Esp. for DC-SCM / interoperability?
    • Chassis/server customization / de-feature can provide simpler configuration
      ◦ Cheaper but also …
      ◦ Fewer components / fewer defects
      ◦ Easier maintenance
  12. Challenges (The Density perspective)
    • Various products -> density at the rack level rather than chassis/server (core) level
    • We are not the “end customer” -> for power consumption purposes, we have to predict and anticipate
      ◦ The workload average
      ◦ The usage rate
      ◦ Redundancy -> availability
      ◦ etc…
  13. Challenges (The Density perspective)
    • Dual socket MB is not a suitable solution for our business
      ◦ Dedicated to specific cases
      ◦ Blast radius
    • Cooling → mix air / DLC
      ◦ Low end -> air cooling
      ◦ High end -> DLC
    • Impact the Net ports costs and availability
    • Standard when dealing with constraints
      ◦ Different power designs
  14. Solutions (The Density perspective)
    • Always define the trade-off between
      ◦ The workload average
      ◦ The usage rate
      ◦ Power redundancy
    • Mix of air and DLC cooling
      ◦ Driven by the large and diverse installed base
      ◦ Product
      ◦ Cost effectiveness
      ◦ DC infra readiness / room level
      ◦ Vs adaptability
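
The trade-off between workload average, usage rate and power redundancy can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and does not come from the talk: the names `ServerProfile`, `expected_server_watts` and `servers_per_rack`, the linear power model, and the idea of two equally sized power feeds per rack.

```python
from dataclasses import dataclass


@dataclass
class ServerProfile:
    """Hypothetical per-server power profile (all values are illustrative)."""
    idle_watts: float   # power drawn when the server is idle
    peak_watts: float   # power drawn at full load
    avg_load: float     # expected average workload, 0.0 to 1.0
    usage_rate: float   # fraction of servers actually in use, 0.0 to 1.0


def expected_server_watts(p: ServerProfile) -> float:
    """Linear model: servers in use draw idle power plus a load-proportional share;
    servers not in use are assumed to sit at idle power."""
    in_use = p.idle_watts + p.avg_load * (p.peak_watts - p.idle_watts)
    return p.usage_rate * in_use + (1.0 - p.usage_rate) * p.idle_watts


def servers_per_rack(p: ServerProfile, feed_watts: float, redundant_feeds: bool) -> int:
    """Number of servers that fit in the rack power budget.

    With redundant feeds, a single feed must be able to carry the whole rack,
    so only one feed counts as budget; otherwise both feeds add up.
    """
    budget = feed_watts if redundant_feeds else 2 * feed_watts
    return int(budget // expected_server_watts(p))
```

With a hypothetical profile of 100 W idle, 300 W peak, 50% average load and 50% usage rate, the model gives 150 W per server, so a rack with two 3000 W feeds holds 20 servers with power redundancy and 40 without. The point of the sketch is only that the three inputs have to be chosen together; real sizing would also need peak and failure scenarios.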
  15. Solutions (The Density perspective)
    • More and more cores in a CPU with the resulting power consumption is not THE solution
    • Open Rack v3 (incl. Front IO)
      ◦ Density
      ◦ Assembly
      ◦ Maintenance / S12Y
      ◦ Power efficiency
  16. Challenges (The Maintainability perspective)
    • FW management
      ◦ Outdated IPMI interface
      ◦ Transition to Redfish is not straightforward
        ▪ For instance, inband comm is not standardized … yet
      ◦ Compatibility / interoperability / proprietary stack
    • Multi chassis + continuous deployment
      ◦ Management using different FW flavors
      ◦ Generations, versions
  17. Challenges (The Maintainability perspective)
    • Support model is generally not appropriate
      ◦ Difficult access to engineering in the early days / validation steps
      ◦ Access to spare parts through level of support
    • Access to parts
      ◦ Extending server life makes spare parts availability more challenging
      ◦ Maintaining many chassis types implies many spare parts SKUs
        ▪ No compatibility across manufacturers and across chassis
        ▪ Complete systems do not help to limit the number of references
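
The link between chassis types and spare parts SKUs can be made concrete with a small counting sketch. The chassis names, part kinds and part references below are entirely hypothetical; the function `spare_sku_count` is an illustration, not a tool described in the talk.

```python
def spare_sku_count(chassis_parts: dict[str, dict[str, str]]) -> int:
    """Count the distinct spare-part references needed across all chassis types.

    `chassis_parts` maps a chassis type to its {part kind: part reference}.
    Two chassis share a spare only if both kind and reference are identical.
    """
    return len({
        (kind, ref)
        for parts in chassis_parts.values()
        for kind, ref in parts.items()
    })


# Hypothetical fleet where nothing is compatible across chassis.
proprietary = {
    "chassis-a": {"psu": "psu-a", "fan": "fan-a"},
    "chassis-b": {"psu": "psu-b", "fan": "fan-b"},
    "chassis-c": {"psu": "psu-c", "fan": "fan-c"},
}

# Same fleet, assuming a single PSU reference fits every chassis.
shared_psu = {
    name: {**parts, "psu": "psu-common"} for name, parts in proprietary.items()
}
```

In this made-up fleet, `spare_sku_count(proprietary)` is 6 and `spare_sku_count(shared_psu)` is 4: each part kind that becomes common across chassis removes one reference per additional chassis type.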
  18. Challenges (The Maintainability perspective)
    • Maintain the required in-house tools to control and configure the HW
      ◦ -> tools must be agnostic to the HW flavor
    • Complete systems imply no control on the disks and DIMMs
      ◦ For short and long term support, this is not helpful for maintenance
    • "Packaging" is not ideal for sustainability
      ◦ Too many pallets / too much waste
      ◦ Shipment costs
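
One way to keep tools agnostic to the HW flavor is to code against a small interface and hide each management stack behind it. The sketch below is an assumption about how that could look, not Scaleway's tooling: the names `BmcBackend`, `RedfishBackend`, `IpmiTextBackend` and `outdated` are hypothetical, and the backends parse already-fetched data (a JSON payload carrying a `Version` property, as Redfish software inventory resources do, and `key : value` text of the kind `ipmitool mc info` prints) rather than talking to a real BMC.

```python
from typing import Protocol


class BmcBackend(Protocol):
    """Interface the tooling codes against, whatever the hardware flavor."""

    def bmc_firmware_version(self) -> str: ...


class RedfishBackend:
    """Reads the version from a Redfish-style software inventory payload."""

    def __init__(self, payload: dict):
        self.payload = payload

    def bmc_firmware_version(self) -> str:
        return str(self.payload["Version"])


class IpmiTextBackend:
    """Reads the version from 'key : value' text output."""

    def __init__(self, text: str):
        self.text = text

    def bmc_firmware_version(self) -> str:
        for line in self.text.splitlines():
            key, sep, value = line.partition(":")
            if sep and key.strip() == "Firmware Revision":
                return value.strip()
        raise ValueError("no 'Firmware Revision' field found")


def outdated(fleet: dict[str, BmcBackend], wanted: dict[str, str]) -> list[str]:
    """Return the hosts whose BMC firmware differs from the wanted version."""
    return sorted(
        host for host, backend in fleet.items()
        if backend.bmc_firmware_version() != wanted[host]
    )
```

The calling code (`outdated`) never needs to know which stack a given server uses, so adding a new chassis generation means adding one backend class rather than touching every tool.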
  19. Solutions (The Maintainability perspective)
    • OpenBMC
      ◦ Native Redfish implementation
      ◦ Allows adding new features to address own use-cases
      ◦ Helps to optimize and standardize the interface / interaction with FW
        ▪ Requires in-house skills
        ▪ Helps with internal HW solution
      ◦ Pending questions
        ▪ Source code delivery and RMA (rollback)?
        ▪ Adoption and compliance?
        ▪ Transfer of ownership / signature?
  20. Solutions (The Maintainability perspective)
    • Experience Center
      ◦ https://eu.osfci.tech/ci/
    • M-CRPS
      ◦ Example of how to reduce the number of PSU references
    • Bulk packaging (no cardboard, cords, DVD, guides …)
  21. Solutions (The Maintainability perspective)
    • Adapt the service model to be more appropriate
      ◦ Engineering access in the early days
      ◦ Spare parts available for a more appropriate period
      ◦ During “RUN” time, skip the 1st level of support to get access to spare parts
      ◦ L6 / diskless + brackets
        ▪ Helps to control the spare parts stocks and references
      ◦ Customization / de-feature -> fewer components, fewer failures
  22. Q&A