Services for 3D rendering, CFD and other numerical simulation, banks, …
Focused on low carbon footprint and general resource efficiency, in particular through waste heat reuse.
Heat reuse
• We build chassis/racks, tailor-made to optimize heat reuse
• We install them in a distributed way, on small/medium sites, where heat can be used
• We use them for our HPC cloud, operating them remotely, over the Internet or private networks, to offer our cloud services
• We reuse the heat: we sell heat to heat networks, for domestic applications, industries, swimming pools, …
Financial simulation • 3D animation • biotechnology • CFD
Certification and labels
80% reduction in carbon footprint of our clients' calculations on average
More than 3,000 customers
2010 creation date
70 employees
2.1 GWh of computer heat recovered in 2022
More than 70K computing cores
Heavy focus on heat reuse | How it shapes what we do
Capture valuable heat, as hot as possible, where it's needed, using existing infrastructure:
district or domestic heating networks, existing buildings, existing heating facilities
Heavy focus on heat reuse | How it shapes what we do
Capture valuable heat, using existing infrastructures ⇒ custom rack / chassis design, custom cooling
• Efficient: extract as much energy as possible, up to 95% efficiency (see the sizing sketch below)
• Hot: domestic hot water needs > 60°C for sanitary purposes
• Operating on tap water, can be included in existing water circuits
• Plumbing can be operated by a plumber
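As a rough illustration of what these constraints mean in practice, here is a back-of-the-envelope sizing sketch. Only the up-to-95% capture figure and the >60°C target come from the slides; the rack power, loop temperatures and resulting flow are assumed example values.

```python
# Back-of-the-envelope sizing sketch: how much usable heat a rack yields
# and what hot-water flow it can sustain. The 95% capture figure is from
# the slide; the rack power and water temperatures are assumed examples.
IT_POWER_KW = 20.0         # assumed electrical power drawn by one rack
CAPTURE_EFFICIENCY = 0.95  # up to 95% of that power recovered as heat
T_IN, T_OUT = 40.0, 65.0   # assumed water loop temperatures (°C)
CP_WATER = 4.186           # specific heat of water, kJ/(kg·K)

heat_kw = IT_POWER_KW * CAPTURE_EFFICIENCY         # ≈ 19 kW of usable heat
flow_kg_s = heat_kw / (CP_WATER * (T_OUT - T_IN))  # Q = m_dot * cp * ΔT
# For water, 1 kg ≈ 1 L, so kg/s converts directly to L/s.
print(f"{heat_kw:.1f} kW recovered, ≈ {flow_kg_s * 3600:.0f} L/h of water at {T_OUT}°C")
```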
Heavy focus on heat reuse | How it shapes what we do
Where heat is needed, using existing infrastructures ⇒ widespread geo-distribution on many small and medium-sized sites
Very few people need 1 GW of heat @ 65°C in one place; lots of people need 500 kW of heat @ 65°C
Some impacts from custom cooling: schematics, thermics, mechanics
Some impacts from distribution: all remote, secure
We're also (mostly) normal people: most needs are not specific to us
Custom cooling | Know what needs to be cooled, and by how much
Our cooling is based on cold plates that replace vendor cooling systems: stripping the machine down and replacing, removing fans and cooling blocks, and putting ours instead
• Not just the CPU, but also smaller chips, NICs, …
• The hotter it is, the more value the heat has: high-temp components
• Precise schematics: knowing where every chip that may need cooling is located, to the mm. Measuring is possible, but time-consuming / expensive.
• Stable design: no moving elements around between revisions. If unavoidable, we need to know.
• Thermal requirements of everything: how hot it can get, what power it dissipates.
• Mechanical requirements related to cooling, e.g. pressure between CPU and heat sink.
• Early on: trial & error is possible, but time-consuming / expensive, and finding errors late in the cycle is very expensive.
Knowing early on ⇒ faster time-to-prod for the HW ⇒ faster time-to-market for the cloud service ⇒ more informed decisions, so better TCO
Taking it apart should be easy
• OCP's tool-less "touch the green" motto is WONDERFUL
• Better yet: we just want the motherboard, naked. We don't need a chassis or heat sinks.
• Waste we're not using ⇒ degrades environmental footprint
• Waste we have to dispose of ⇒ cost
• Human work to strip it ⇒ cost
• We WILL remove vendor cooling: "warranty void if cooling is taken apart" is almost a no-go for us. No warranty void!
Warmer is more valuable
• We prefer CPUs with a high TCase, as we can heat water hotter, enabling applications without a heat pump
• All other components too: one single 1¢ component dying at 50°C can be a no-go for us. We prefer the 10¢ version that can take 90°C.
IT'S NOT JUST US: "traditional" DCs benefit too. Higher temperature also means easier free cooling and less AC, so low PUE without adiabatic cooling (water waste).
GEO DISTRIBUTION | Wide-scale distribution of small units
• We already have 10s of sites
• Soon we'll have 100s
• In 10 years, 1000s or maybe 10,000s
IT'S NOT JUST US: "edge computing" is a trend that won't disappear, independent from heat reuse. Small/medium scalers with 1,000s of edge DCs will be a normal thing.
• Small/medium actors will have 10s of them
• Big players will have 100s
• Hyperscalers may have 1000s
On-site intervention is not an option
• Even just one, not only at scale: pushing a button is very costly
• It must be fully remote, it must work, it must be forgiving
• BMC, BIOS, NICs, … firmware and config: firmware upgrades must be rock-solid
Varying levels of physical security across sites
• Small/medium sites: physical security not worth the cost
• Using existing facilities: physical security not always possible to retrofit
• Using heat facilities: sometimes no choice but to share access with "heating" personnel
• Need to compensate with logical security: for some customers and some usages, it will be acceptable if done properly. Creates manageability challenges.
We want to move the trust decision away from the machine; we don't want to deploy crypto/PKI everywhere.
Something like TPM remote attestation, with an implementation and tooling making it manageable at scale: zero-touch, reset-able, forgiving.
• Stable: it must not change randomly, and not change if we reflash the same FW.
• Predictable: PCR values should be computable offline. Given firmwares, config, and physically plugged HW, we should be able to know the PCR values even without looking at the real thing.
Build new FW ⇒ compute new PCR values ⇒ flash FW ⇒ check PCR values (see the sketch below)
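A minimal sketch of what "computable offline" could look like, assuming SHA-256 PCR banks and whole-file measurements. Real firmware measurements follow the TCG event-log format and may hash specific regions rather than whole files, and the file names below are hypothetical.

```python
# Predicting a TPM 2.0 PCR value offline from a list of expected
# measurements (hashes of firmware components), assuming SHA-256 PCRs
# that start at all-zeros and are extended in a known order.
import hashlib

def extend(pcr: bytes, measurement: bytes) -> bytes:
    # TPM extend operation: new_PCR = H(old_PCR || measurement)
    return hashlib.sha256(pcr + measurement).digest()

def predict_pcr(artifacts: list[str]) -> bytes:
    pcr = bytes(32)  # SHA-256 PCRs start at 32 zero bytes
    for path in artifacts:
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).digest()
        pcr = extend(pcr, digest)
    return pcr

# Hypothetical build artifacts, measured in boot order:
expected = predict_pcr(["bios.rom", "bootloader.bin", "kernel.img"])
print("expected PCR:", expected.hex())
# Compare against the quoted PCR value reported by the real machine.
```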
Even though “traditional” DCs have a more consistent physical security scheme, an additional layer of logical security is still desired. In any case, it must be manageable at scale.
We do a bit of unusual stuff:
• We strip down servers
• Custom cooling at machine level
• Integrate with unusual (for a DC) cooling/heating networks
• Operate in unusual (for a DC) and "non-consistent" places
At the end of the day, we're just a cloud provider operating edge DCs. Other on-prem actors do all or part of the "weird" stuff that we do. We have 90% the same challenges: make the servers work, and make them work at scale.
ON-PREM OPERATOR | Make the servers work
Mostly important during bringup. Can be useful after firmware upgrades.
• We hit complex problems
• We know how to provide valuable bug reports, minimal reproductions, run experiments, …
• We did try to turn it off and on again
ON-PREM OPERATOR | Make the servers work
If firmwares are open source, with a usable build+flash+debug toolchain, and hardware specs easily accessible and understandable, we can fix things ourselves. Even more so if based on Linux, which we're familiar with.
OpenBMC, Coreboot, OSS firmwares, and the tooling
ON-PREM OPERATOR | Make the servers work
We deal with a diversity of hardware. For manageability, we need to bring back unity as low in the stack as possible. Redfish is not enough: too much diversity.
OpenBMC, Coreboot, OSS firmwares
• Modular: swappable BMC modules 😍
• PXE-bootable BMCs 😍😍
ON-PREM OPERATOR | Make the servers work AT SCALE
• The HTTP API is the ubiquitous building block
• We can use it to integrate with off-the-shelf tools, or homemade tools (inventory, deployment, …), as in the sketch below
⇒ If management interfaces are API-first, we'll like it
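A minimal sketch of the kind of homemade inventory tooling an API-first management interface enables, assuming a Redfish-compatible BMC. The BMC address and credentials are hypothetical placeholders.

```python
# Collect basic inventory from a Redfish-compatible BMC over its HTTP API.
import requests

BMC = "https://bmc42.example.net"   # hypothetical BMC address
AUTH = ("admin", "secret")          # hypothetical credentials

def list_systems() -> list[dict]:
    """Return model / serial / power state for every system behind the BMC."""
    root = requests.get(f"{BMC}/redfish/v1/Systems", auth=AUTH, verify=False).json()
    inventory = []
    for member in root.get("Members", []):
        system = requests.get(f"{BMC}{member['@odata.id']}", auth=AUTH, verify=False).json()
        inventory.append({
            "model": system.get("Model"),
            "serial": system.get("SerialNumber"),
            "power": system.get("PowerState"),
        })
    return inventory

if __name__ == "__main__":
    for entry in list_systems():
        print(entry)
```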
ON-PREM OPERATOR | Make the servers work AT SCALE
We already know how to deploy software at scale. Firmware should be no different:
• open source with tooling, so we can customize it, build it, package it
• tooling and deployment APIs so we can deploy it at scale with the usual deploy strategies (blue/green, canary, rolling, …), as sketched below
CI/CD for firmware. Server fleet management is a devops job and should use devops tools and flows.
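A minimal sketch of a canary-style firmware rollout driven over HTTP, reusing the same flow we would use for any software deployment. The fleet list, flash_firmware() endpoint and health_check() endpoint are hypothetical placeholders, not a specific vendor API.

```python
# Canary rollout of a firmware image across a fleet of BMCs.
import time
import requests

FLEET = ["bmc01.example.net", "bmc02.example.net", "bmc03.example.net"]  # hypothetical
IMAGE_URI = "https://repo.example.net/firmware/bmc-v2.13.tar"            # hypothetical

def flash_firmware(host: str) -> None:
    # Hypothetical API-first endpoint that triggers a firmware update.
    requests.post(f"https://{host}/api/firmware/update",
                  json={"image": IMAGE_URI}, timeout=30).raise_for_status()

def health_check(host: str) -> bool:
    # Hypothetical health endpoint checked after the upgrade.
    try:
        return requests.get(f"https://{host}/api/health", timeout=10).ok
    except requests.RequestException:
        return False

def canary_rollout(fleet: list[str], canary_count: int = 1) -> None:
    canaries, rest = fleet[:canary_count], fleet[canary_count:]
    for host in canaries:
        flash_firmware(host)
    time.sleep(600)  # soak period before judging the canaries
    if not all(health_check(h) for h in canaries):
        raise RuntimeError("canary failed, aborting rollout")
    for host in rest:  # then roll out to the rest, one by one
        flash_firmware(host)
        if not health_check(host):
            raise RuntimeError(f"{host} unhealthy, pausing rollout")

canary_rollout(FLEET)
```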
ON-PREM OPERATOR | Make the servers work AT SCALE
Our target for keeping hardware is 10 years, not 3. So:
• long-term firmware upgrades (hint: if OSS, we may maintain it ourselves, alone or with others)
• upgradable crypto: for crypto chips, BMC auth, …
ON-PREM OPERATOR | Make the servers work AT SCALE
We're ready to pay a bit more for the hardware if we save on integration costs and on day-to-day and long-term operational costs, or gain an increased lifetime, …
ON-PREM OPERATOR | Make the servers work in the future
And tomorrow, we'll want things we didn't even think about. Being able to fork and modify the firmware is insurance for the future.
OpenBMC, Coreboot, OSS firmwares, and the tooling
Let's talk! Contacts
Victor - Cloud platform team
Charles - Lead of the IT team
Yoann - Lead HW Eng, Racks/Chassis design team - [email protected]
Alexis - SW Eng, Cloud platform team
Clément - CTO - [email protected]