A European cloud provider, founded in 1999. The most complete cloud ecosystem in Europe
• Multi-AZ redundancy
• Operates sustainable data centers in France, the Netherlands and Poland
• 2nd life: servers reused for 10 years (vs. an industry standard of 3-4), reducing CO2 emissions and e-waste
Joined Scaleway in early 2018 after 19 years at Sun Microsystems (and Oracle)
• VP Hardware Engineering
• Part of the Operations department
• Scope of responsibilities:
  ◦ HW architecture / solutions and validation
  ◦ Rack-level design
  ◦ In-house tools development (configuration, quality, manufacturing)
  ◦ Maintenance of the HW installed base
• In quantity: tens of thousands of servers
• Geographically: operated in several locations / regions -> multi-AZ
  ◦ And different DC infrastructures
• Products: covering a large variety of products, hence several server typologies (intensive compute, AI, storage, commodity hardware, …)
• Sustainability / free and adiabatic cooling
• Continuous deployment
Large …
• Servers in production and maintained for 10 years
  ◦ Sometimes up to 14 years …
• Multi-vendor, with a mix of numerous generations of chassis/servers
  ◦ Up to 37 chassis types at some point in time
• Mix of many firmware flavors and versions
• Mix of several technologies (DDR, HDD, SSD, HW RAID, …)
… and diverse
… different purposes → chassis diversity is often a requirement
  ◦ Several chassis may be used for one SCW product
  ◦ A chassis/server must fit as many customer requirements as possible
  ◦ However, we would like to limit the number of chassis types
• One-size-fits-all is not a solution for us
  ◦ -> 1U, 2U, 3U, 4U, …
• Rear IO is not ideal
  ◦ Esp. when operating with hot aisle / adiabatic cooling
… the cooling capacity and power consumption
  ◦ Chassis size (2U)
  ◦ Fan size
• A lot of possible configurations/form factors
  ◦ Generally overkill
  ◦ But hard to find suitable configurations
• Personalization and customization require the capability to be flexible
  ◦ Bare-metal and cloud servers don’t always have the same requirements
… better for
  ◦ Rack-level density
  ◦ Space optimization
  ◦ Cabling
  ◦ Serviceability
  ◦ Cooling
• Configuration flexibility -> build a “catalogue” (see the sketch after this list)
  ◦ Diskless + bracket
    ▪ For better control of the disk and DIMM references
    ▪ For spare parts management as well
  ◦ Manufacturing Level 6 + CPU (incl. tests) in some cases
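To illustrate the “catalogue” idea, such a configuration catalogue could be modeled roughly as below. This is a minimal sketch: all class names, fields and entries are hypothetical, not Scaleway’s actual tooling.

```python
from dataclasses import dataclass, field
from enum import Enum


class Cooling(Enum):
    AIR = "air"
    DLC = "dlc"


@dataclass
class ChassisConfig:
    """One entry of a (hypothetical) chassis catalogue."""
    name: str
    form_factor: str            # "1U", "2U", "3U", "4U", ...
    front_io: bool
    diskless: bool              # shipped diskless + bracket, disks added in-house
    manufacturing_level: int    # e.g. 6 = L6 barebone (CPU added separately)
    cooling: Cooling = Cooling.AIR
    products: list[str] = field(default_factory=list)  # SCW products it serves


CATALOGUE = [
    ChassisConfig("compute-2u-a", "2U", front_io=True, diskless=True,
                  manufacturing_level=6, products=["baremetal", "cloud"]),
    ChassisConfig("storage-4u-a", "4U", front_io=True, diskless=False,
                  manufacturing_level=6, products=["storage"]),
]


def candidates(product: str) -> list[ChassisConfig]:
    """All catalogue entries able to serve a given product."""
    return [c for c in CATALOGUE if product in c.products]
```

A structure like this makes the stated trade-off explicit: one chassis entry can serve several products, while the catalogue as a whole stays small enough to limit chassis types.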
… + DC-SCM for better chassis configurations
  ◦ Interoperability across platforms / DC-SCM
  ◦ Consistent form factors and interfaces
  ◦ Improved modularity
  ◦ Also helps to limit the number of chassis “flavors”
• HW replacement → RMA
  ◦ Process TBC
  ◦ Esp. for DC-SCM / interoperability?
• Chassis/server customization / de-featuring can provide simpler configurations
  ◦ Cheaper, but also …
  ◦ Fewer components / fewer defects
  ◦ Easier maintenance
… at the rack level rather than at the chassis/server (core) level
• We are not the “end customer” -> for power consumption purposes, we have to predict and anticipate (a rough sketch follows this list)
  ◦ The workload average
  ◦ The usage rate
  ◦ Redundancy -> availability
  ◦ etc.
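As a toy example of that kind of anticipation, a rack-level power estimate might combine those inputs as follows. The formula and every number are illustrative assumptions, not Scaleway’s actual model.

```python
def rack_power_estimate(
    n_servers: int,
    idle_w: float,          # per-server idle power (W)
    peak_w: float,          # per-server peak power (W)
    workload_avg: float,    # average load of a busy server, 0..1
    usage_rate: float,      # fraction of servers actually in use, 0..1
) -> dict[str, float]:
    # Expected draw: idle floor plus a load-proportional share of the
    # dynamic range for the servers that are in use.
    per_server = idle_w + (peak_w - idle_w) * workload_avg
    expected = n_servers * (usage_rate * per_server + (1 - usage_rate) * idle_w)
    # Provisioned power must assume worst case; redundancy (e.g. 2N feeds)
    # then multiplies what the DC infrastructure has to supply.
    provisioned = n_servers * peak_w
    return {"expected_w": expected, "provisioned_w": provisioned}


print(rack_power_estimate(48, idle_w=120, peak_w=450,
                          workload_avg=0.35, usage_rate=0.8))
```

The gap between `expected_w` and `provisioned_w` is exactly why predicting the workload average and usage rate matters when you are not the end customer.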
… not a suitable solution for our business
  ◦ Dedicated to specific cases
  ◦ Blast radius
• Cooling → mix of air / DLC
  ◦ Low end -> air cooling
  ◦ High end -> DLC
• Impacts the network ports’ cost and availability
• Standard when dealing with constraints
  ◦ Different power designs
… between
  ◦ The workload average
  ◦ The usage rate
  ◦ Power redundancy
• Mix of air and DLC cooling (a toy decision rule follows this list)
  ◦ Driven by the large and diverse installed base
  ◦ Product
  ◦ Cost effectiveness
  ◦ DC infra readiness / room level
  ◦ Vs. adaptability
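One way to picture the air/DLC split is as a simple per-chassis decision rule. This is purely illustrative: the threshold and inputs are assumptions, and the real trade-off also weighs product, cost effectiveness and room readiness as listed above.

```python
def pick_cooling(chassis_kw: float, room_has_dlc: bool,
                 dlc_threshold_kw: float = 1.0) -> str:
    """Choose a cooling technology for one chassis (toy decision rule).

    chassis_kw       -- thermal design power of the chassis, in kW
    room_has_dlc     -- whether the DC room is plumbed for direct liquid cooling
    dlc_threshold_kw -- above this density, air cooling is assumed insufficient
    """
    if chassis_kw > dlc_threshold_kw:
        return "DLC" if room_has_dlc else "defer: room not DLC-ready"
    return "air"
```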
… in a CPU, with the resulting power consumption, is not THE solution
• Open Rack v3 (incl. front IO)
  ◦ Density
  ◦ Assembly
  ◦ Maintenance / S12Y (serviceability)
  ◦ Power efficiency
The Maintainability perspective
• Transition to Redfish is not straightforward (see the sketch after this list)
  ◦ For instance, in-band communication is not standardized … yet
  ◦ Compatibility / interoperability / proprietary stacks
• Multi-chassis + continuous deployment
  ◦ Management using different FW flavors
  ◦ Generations, versions
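To make the Redfish point concrete: it exposes BMC management as a REST/JSON API with resource paths standardized by the DMTF spec, so one tool can in principle drive a multi-vendor fleet out-of-band. A minimal sketch; the host and credentials are placeholders:

```python
import requests

BASE = "https://bmc.example.internal"   # placeholder BMC address
AUTH = ("admin", "password")            # placeholder credentials


def get(path: str) -> dict:
    # verify=False only for a lab sketch; production tooling should verify TLS.
    r = requests.get(BASE + path, auth=AUTH, verify=False, timeout=10)
    r.raise_for_status()
    return r.json()


# Standard entry point: enumerate the systems behind this BMC.
for member in get("/redfish/v1/Systems")["Members"]:
    system = get(member["@odata.id"])
    # These properties are defined by the Redfish ComputerSystem schema.
    print(system.get("Model"), system.get("PowerState"), system.get("BiosVersion"))
```

This uniformity is the promise; the pain points above come from vendors implementing different subsets and versions of the schema, and from in-band access not being standardized yet.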
• Difficult access to engineering in the early days / validation steps
  ◦ Access to spare parts gated by the level of support
• Access to parts
  ◦ Extending the servers’ life makes spare parts availability more challenging
  ◦ Maintaining many chassis types implies many spare parts SKUs (see the sketch after this list)
    ▪ No compatibility across manufacturers and across chassis
    ▪ Complete systems do not help to limit the number of references
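The SKU inflation is easy to quantify: without cross-chassis compatibility, every (chassis type, part class) pair becomes its own reference. A back-of-the-envelope sketch; the 37 chassis types come from the slide above, the other numbers are invented:

```python
# Hypothetical fleet: 37 chassis types, each stocking the same 8 part classes
# (PSU, fan, backplane, riser, ...). Without cross-chassis compatibility every
# pair is a distinct spare-part SKU to source, stock and track.
chassis_types = 37
part_classes = 8

skus_no_compat = chassis_types * part_classes   # 296 references
# If parts were shared across families of ~5 compatible chassis instead:
families = -(-chassis_types // 5)               # ceil(37 / 5) = 8
skus_with_compat = families * part_classes      # 64 references

print(skus_no_compat, skus_with_compat)
```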
… and configure the HW
  ◦ -> tools must be agnostic to the HW flavor (see the sketch after this list)
• Complete systems imply no control over the disk and DIMM references
  ◦ For short- and long-term support, this is not helpful for maintenance
• “Packaging” is not ideal for sustainability
  ◦ Too many pallets / too much waste
  ◦ Shipment costs
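One common way to keep tools agnostic to the HW flavor is to hide vendor and generation differences behind a single interface. This is a sketch of the pattern only; the class names and vendor specifics are hypothetical:

```python
from abc import ABC, abstractmethod


class BMC(ABC):
    """The only surface the in-house tools are allowed to depend on."""

    @abstractmethod
    def power_state(self) -> str: ...

    @abstractmethod
    def set_boot_device(self, dev: str) -> None: ...


class RedfishBMC(BMC):
    """Any chassis with a compliant Redfish stack."""
    def power_state(self) -> str:
        ...  # GET /redfish/v1/Systems/<id>, read PowerState

    def set_boot_device(self, dev: str) -> None:
        ...  # PATCH Boot.BootSourceOverrideTarget


class LegacyIpmiBMC(BMC):
    """Older generations still in the installed base."""
    def power_state(self) -> str:
        ...  # e.g. shell out to `ipmitool chassis power status`

    def set_boot_device(self, dev: str) -> None:
        ...  # e.g. `ipmitool chassis bootdev <dev>`


def reinstall(bmc: BMC) -> None:
    # Tooling logic written once, regardless of chassis generation/vendor.
    bmc.set_boot_device("pxe")
```

With this shape, adding a 38th chassis type means writing one adapter, not touching every configuration, quality and manufacturing tool.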
Redfish implementation
  ◦ Allows adding new features to address our own use-cases (see the sketch after this list)
  ◦ Helps to optimize and standardize the interface / interaction with FW
    ▪ Requires in-house skills
    ▪ Helps with internal HW solutions
  ◦ Pending questions
    ▪ Source code delivery and RMA (rollback)?
    ▪ Adoption and compliance?
    ▪ Transfer of ownership / signatures?
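On “adding new features”: the Redfish data model reserves an "Oem" property on resources for vendor extensions, which is one standard-compliant place to surface in-house features. A hypothetical example; the "Scaleway" payload below is invented for illustration:

```python
import requests

# Placeholder BMC endpoint/credentials, as in the earlier sketch.
r = requests.get("https://bmc.example.internal/redfish/v1/Systems/1",
                 auth=("admin", "password"), verify=False, timeout=10)
system = r.json()

# "Oem" is the spec-defined location for vendor-specific data; everything
# under it (here a made-up "Scaleway" section) is whatever the FW implements.
oem = system.get("Oem", {}).get("Scaleway", {})
print(oem.get("RackPosition"), oem.get("MaintenanceWindow"))
```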
… to be more appropriate
  ◦ Engineering access in the early days
  ◦ Spare parts available for a more appropriate period
  ◦ During “RUN” time, skip the 1st level of support to get access to spare parts
  ◦ L6 / diskless + brackets
    ▪ Helps to control the spare parts stocks and references
  ◦ Customization / de-featuring -> fewer components, fewer failures