PremDay #2 - DLC deployment: implementation challenges, risks & benefits

PremDay
April 18, 2025

Alexis Carrion from BNP Paribas shares his experience on three Direct Liquid Cooling (DLC) deployment projects and details the associated challenges, failures and success factors.
He covers the chosen technical solution, the rationale for in-rack vs. end-of-row CDUs, and a financial analysis of implementation vs. operating costs and ROI.


Transcript

  1. CIB ITO HPC GRID COMPUTATION AND DLC REX APRIL 2025

    ALEXIS CARRION LIMITED BROADCAST
  2. Classification : Internal AGENDA

    1 Introduction to Finance Pricing & Hedging
    2 Efficiency-driven transformation
    3 Introduction to DLC
    4 REX on DLC setup & operations
    5 Questions
  3. Classification : Internal INTRODUCTION TO FINANCE PRICING & HEDGING 1st April 2025

    Grid computing, or High Performance Computing (HPC), is a group of networked computers working together as a virtual supercomputer to perform large tasks, such as computing large-scale simulations.

    Business context
    ▪ Global Markets sells financial derivatives, which entail a level of risk that needs to be managed throughout the lifetime of the transactions
    ▪ When selling a product, GM must know how much it should be sold for; the product is then hedged by Trading during its lifetime, and Risk then monitors its sensitivity to market parameters
    ▪ All price and risk computations are performed by a piece of in-house software developed by the Global Markets Quantitative Research (GMQR) team, called a Pricer. Its purpose is to compute financial valuations of derivatives and the associated risks
    ▪ In practice, the size and complexity of BNPP's markets activities (especially market making and real-time answers to client quotes) imply very large-scale computations, distributed over several datacentres, and the associated resources (> 260,000 physical cores for GM Pricers)
    ▪ Computations are currently distributed over 5 geographical areas (FR, UK, Iceland, Sweden, Norway). The loss of one geographical area carries a significant impact (overloading the other sites and degrading performance); hence the need for several distinct compute farms
    ▪ As part of our enhanced resiliency & elasticity strategy: implementation of a spawning capability on a virtual cloud-based datacentre, and an increase of the overall number of HPC locations by 1 to 2 additional sites/year

    Computation timeliness expectations
    ▪ Sales responding to clients: a few seconds
    ▪ Traders for pricing / hedging: a few minutes, up to 10 minutes
    ▪ RISK (market risk monitoring) / regulatory requirements: a few hours

    Continuous optimization of HPC software, hardware and hosting is a key lever to maintain our high competitiveness on HPC and make it a strong asset for the business. Resiliency and agility force us to run several distinct compute farms. As part of BNPP's strategy on environmental impacts (CPU cooling technology, heat recycling), we carefully select sites, hardware and cooling technologies to make our compute more and more sustainable (aiming for carbon neutrality by 2030 on the HPC scope). Load is spread across 24 hours, and capacity access priority is given to Sales over Traders and to Traders over Regulators, which allows an optimized use of the cores (a minimal scheduling sketch follows below).
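The capacity-priority rule above maps naturally onto a priority queue. A minimal sketch, assuming a heap-based dispatcher: the deck does not describe the actual scheduler, and the job classes, function names and FIFO tie-breaking are illustrative only.

```python
import heapq
import itertools

# Hypothetical dispatcher for the stated priority rule: Sales jobs are
# served before Traders', Traders' before Regulators'. Not BNPP's code.
PRIORITY = {"sales": 0, "trader": 1, "regulator": 2}  # lower = served first

_counter = itertools.count()  # tie-breaker: FIFO within a priority class
_queue: list = []

def submit(job_class: str, job_id: str) -> None:
    """Queue a pricing job under the Sales > Traders > Regulators rule."""
    heapq.heappush(_queue, (PRIORITY[job_class], next(_counter), job_id))

def next_job() -> str:
    """Pop the highest-priority pending job."""
    return heapq.heappop(_queue)[2]

submit("regulator", "overnight-risk-run")   # few-hours SLA
submit("sales", "client-quote")             # few-seconds SLA
submit("trader", "hedge-reprice")           # few-minutes SLA
assert next_job() == "client-quote"         # Sales served first
```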
  4. Classification : Internal COST DRIVEN TRANSFORMATION 1st April 2025

    Rack power density per site:
    ▪ BNPP DC: 3-8 kW/rack
    ▪ Nanterre: 22 kW/rack
    ▪ UK: 10 kW/rack
    ▪ Iceland: 10 kW/rack
    ▪ Sweden: 40 kW/rack
    ▪ Norway: 60 kW/rack
    ▪ Paris (new 2025): 90 kW/rack
    ▪ Nordics (new 2025): 100 kW/rack

    ▪ Progressively shift to last-generation sites allowing:
      ▪ renewable energy at a low price
      ▪ waste-heat recovery (negative carbon footprint)
      ▪ a cold climate
      ▪ liquid cooling availability
      ▪ ultra-high density (up to 200 kW/rack)
    ▪ Multiply the number of sites: a new site is opened every 65k cores
    ▪ Leverage the refresh process to populate and re-balance the size of the sites
    ▪ Generalize efficient servers (last-generation AMD CPUs, currently the EPYC 9005 "Turin" series, 384 cores/server)
    ▪ Savings on hosting and power compensate for the major financial effort of the hardware refresh (an indicative footprint calculation follows below)
    Site types: liquid-cooling DC, MS Azure Cloud, air-cooling DC
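To give a feel for what densification means in racks: the deck states 384 cores/server and > 260,000 cores but no per-server power, so the 1.5 kW/server figure below is an assumed placeholder, not a deck figure.

```python
# Rough densification arithmetic; a sketch only.
TOTAL_CORES = 260_000          # > 260k physical cores for GM Pricers
CORES_PER_SERVER = 384         # dual-socket AMD EPYC 9005 "Turin"
KW_PER_SERVER = 1.5            # hypothetical average draw under load

servers = -(-TOTAL_CORES // CORES_PER_SERVER)  # ceiling division -> 678

for site, kw_per_rack in {"BNPP DC": 8, "Norway": 60, "Nordics 2025": 100}.items():
    servers_per_rack = int(kw_per_rack // KW_PER_SERVER)
    racks = -(-servers // servers_per_rack)
    print(f"{site:>12}: {servers_per_rack:3d} servers/rack -> {racks} racks")
```

Under these assumptions the same fleet shrinks from ~136 racks at legacy BNPP DC densities to ~11 racks at Nordics 2025 densities, which is what makes the hosting savings possible.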
  5. Classification : Internal COST DRIVEN TRANSFORMATION 1st April 2025

    ▪ Optimization of the infrastructure and migration to the Nordics sites divided our grid operational costs by 2 even though the core volume doubled (cheapest power costs and the much higher power efficiency of the new sites, with a lever effect)
    ▪ Power consumption covers not only the servers' direct consumption but also the datacenter cooling systems and all other security/operations infrastructure. Each datacenter contractually provides a power efficiency ratio (PUE), used to add this overhead power (cooling, UPS, etc.) to the power consumed by the hosted IT devices
    ▪ Datacenters with a PUE below 1.2 are considered very efficient (all new HPC datacenters are now below 1.2)
    ▪ Cost optimisation relies on several optimisations that multiply one another's effects (approximate Nordics vs. FR/UK ratios, multiplied out in the sketch below): low PUE (~0.7) x lower power consumption per core (~0.85) x more computation power per core (~0.85) x cheaper kWh (~0.4) x cheaper housing for higher density (~0.7, up to 0.4)
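Multiplying the quoted Nordics-vs-FR/UK ratios shows why the effects "lever" one another. The pairing of each ratio with each factor follows the slide's order and is my reading of a flattened table, so treat the exact mapping as approximate.

```python
# Compound effect of the five multiplicative optimisations above.
factors = {
    "low PUE": 0.7,
    "lower power consumption per core": 0.85,
    "more computation power per core": 0.85,
    "cheaper kWh": 0.4,
    "cheaper housing for higher density": 0.7,  # "up to 0.4" per the slide
}

relative_cost = 1.0
for name, ratio in factors.items():
    relative_cost *= ratio

print(f"Nordics cost per unit of compute ~ {relative_cost:.2f}x FR/UK")
# ~0.14x, i.e. roughly a 7x reduction per unit of compute; consistent
# with total costs halving while core volume doubled, with margin to spare.
```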
  6. Classification : Internal DLC PRINCIPLES 1st April 2025

    ▪ The Cooling Distribution Unit (CDU) can be either end-of-row (cooling up to 1.2 MW of devices) or in-rack (dedicated to the devices hosted in the same rack)
    ▪ DLC removes 70 to 80% of the heat; air cooling must remain for the other 20 to 30%, at a higher temperature than classical air cooling (free chilling up to 35-40 °C)
    ▪ Liquid-cooled doors can remove the remaining air-cooled calories but need a cooler inbound temperature (chilled loop)
    ▪ Average DLC PUE is between 1.07 and 1.15 (vs. 1.5 to 1.8 for air cooling without adiabatic assistance); a back-of-the-envelope comparison follows below
    ▪ DLC can cool up to 200 kW/rack
    ▪ Servers can be natively DLC-ready or retrofitted (air-cooled and DLC server prices are similar)
    ▪ DLC provides excellent power efficiency mainly when connected to a 'tempered loop' (no active chilling)
    Pictured: in-rack CDU (up to 200 kW/CDU), DLC server, CPU cold plates, RAM cold plates (optional), dripless Stäubli connectors, end-of-row CDU
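A back-of-the-envelope comparison of the PUE ranges quoted above, using mid-range values. The 60 kW rack load matches the Norway deployment on the next slide; everything else is rounded, so this is a sketch, not a facility model.

```python
# Facility draw implied by a contractual PUE, DLC vs. air cooling.
IT_LOAD_KW = 60.0  # one DLC rack, as deployed in Norway

def facility_power(it_kw: float, pue: float) -> float:
    """Total site power implied by a given PUE for a given IT load."""
    return it_kw * pue

air = facility_power(IT_LOAD_KW, 1.65)   # mid-range air-cooled PUE (1.5-1.8)
dlc = facility_power(IT_LOAD_KW, 1.10)   # mid-range DLC PUE (1.07-1.15)
print(f"air-cooled: {air:.0f} kW, DLC: {dlc:.0f} kW "
      f"({air - dlc:.0f} kW of overhead avoided per rack)")

liquid_share = 0.75                      # mid-point of the 70-80% range
print(f"heat to liquid loop: {IT_LOAD_KW * liquid_share:.0f} kW, "
      f"residual to air: {IT_LOAD_KW * (1 - liquid_share):.0f} kW")
```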
  7. Classification : Internal DLC IN REAL LIFE 1st April 2025

    Norway datacenter
    ▪ Contractual PUE capped at 1.15
    ▪ Power cost: 0.06 € / kWh (100% renewable)
    ▪ Current power density: 60 kW / rack (up to 200 kW/rack available on request); an indicative cost calculation follows below
    Pictured: liquid-cooled doors, 2 end-of-row 700 kW CDUs, PDUs and the inside of a DLC rack, cabling, patching & DLC plugging
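From the contractual figures above one can sketch an indicative yearly power bill per rack. The flat-out 100% utilisation assumed below is mine, not the deck's, and overstates a real duty cycle.

```python
# Indicative annual power cost for one Norway rack; a sketch, not billing.
RACK_KW = 60.0        # current power density per rack
PUE_CAP = 1.15        # contractual PUE cap
EUR_PER_KWH = 0.06    # 100% renewable power cost
HOURS_PER_YEAR = 8760

annual_cost = RACK_KW * PUE_CAP * EUR_PER_KWH * HOURS_PER_YEAR
print(f"~{annual_cost:,.0f} EUR/year per rack")  # ~36,266 EUR/year
```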
  8. Classification : Internal RETURN ON EXPERIENCE 1st April 2025

    Power efficiency
    ▪ The announced PUE figures (1.07 to 1.15) are realistic but need a dedicated tempered loop to be reached
    ▪ Liquid cooling is very efficient at keeping the temperature inside the CPUs stable
    ▪ Liquid cooling removes between 70 and 80% of the heat for standard systems (CoolIT) but can reach up to 95%
    ▪ The remaining waste heat can be removed with liquid-cooled doors plugged into a chilled loop
    ▪ DLC RAM cooling does not provide much additional power efficiency (~5%)
    ▪ DLC-cooled IT rooms are much quieter than classical air-cooled rooms
    ▪ Self-funded business case for the hardware refresh (densification & power consumption); ROI reached between 3 years (Nordics) and 5 years (in-situ refresh); a toy payback calculation follows below

    Liquid cooling secondary loop
    ▪ Softened/pure-water circuits need strict, efficient and permanent water quality control
    ▪ It is very difficult, if not impossible, to recover from a major water quality imbalance
    ▪ 'Coolant'-based circuits are 3% less energy efficient but much more robust against chemical/bacterial/pH/hardness variations
    ▪ A regular water/liquid analysis service must be included in the contract covering the DLC infrastructure, or in the datacenter contract (when delegated)
    ▪ The CDU must be monitored as part of the critical infrastructure (preferably by the datacenter supplier)
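The deck reports the 3-to-5-year ROI outcome but no underlying figures; the capex and annual saving below are invented placeholders, and only the payback arithmetic itself is real.

```python
# Simple-payback sketch for the self-funded refresh business case.
CAPEX_EUR = 1_000_000          # hypothetical hardware refresh cost
ANNUAL_SAVING_EUR = 300_000    # hypothetical hosting + power saving

years = CAPEX_EUR / ANNUAL_SAVING_EUR
print(f"simple payback: {years:.1f} years")  # ~3.3 years, Nordics-like
```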
  9. Classification : Internal RETURN ON EXPERIENCE 1st April 2025

    Warranty, maintenance
    ▪ The hardware supplier must carry the secondary DLC circuit, including the CDUs (this eases overall responsibility discussions in case of a leak)
    ▪ The secondary liquid cooling circuit needs regular maintenance, which thus needs to be included in the hardware contract to avoid triggering additional Smart Hands costs
    ▪ The CDU order includes a training service for the datacenter teams (done during the initial CDU installation)
    ▪ DLC/CDU maintenance is announced for 5 years. Some companies propose maintenance services beyond the initial 5 years at a reasonable cost (this service has not been tested yet, as our first DLC infrastructure has been deployed for less than 5 years)

    Datacenter services suppliers & hardware
    ▪ DC housing services suppliers usually dedicate a CDU per client and advise clients to have a secondary CDU. The client CDU's inbound & outbound interfaces are then the responsibility demarcation line
    ▪ DLC server cost is similar to air-cooled server cost
  10. Classification : Internal SUCCESS FACTORS 1st April 2025

    ▪ Rely on leading DLC experts and strictly follow their instructions for both facilities and hardware requirements
    ✓ Although a DLC system seems quite simple in its principles, many counter-intuitive elements can lead to an improper implementation
    ▪ Don't blindly trust your suppliers on the way they set up their DLC infrastructure (exhaustiveness of components, operations teams' skills)
    ✓ With DLC expert support, audit the technical teams' skills and the secondary loop setup to check that they comply with the DLC state of the art
    ▪ Include the CDU/DLC infrastructure monitoring service in the datacenter housing suppliers' contracts
    ✓ Most DC suppliers are not used to CDU operations yet, while cooling should be considered a facility management function
    ✓ Get a quote from the datacenter supplier to integrate the CDU into their monitoring system, and set up alert thresholds with the CDU supplier
    ✓ Ensure the CDU supplier includes the datacenter-related maintenance costs in its contract (a few Smart Hands hours/month)
    ▪ Don't mix DLC hardware from different suppliers in the same DLC loop
    ✓ The demarcation point is the CDU inbound, and maintenance & warranty are voided if the same loop irrigates hardware from diverse suppliers
    ✓ Consider a rack (+ in-rack CDU) as the smallest scale for hardware differentiation
    ▪ Don't 'over-DLC' from day 1, as not all hardware is DLC-ready yet, and DLC for small footprints (<20 kW) does not make sense
    ✓ 30% of residual air cooling is needed for current DLC hardware, so focus on hybrid air/liquid datacenters for the medium term
    ✓ Conversion is progressive: pick a realistic forecast and prefer a scalable approach
    ✓ But anticipate a fast-moving and dynamic market trend …
  11. Classification : Confidential THANK YOU merci danke GRAZIE

    chokrane mèsitak GRACIAS ευχαριστώ dhanyavad ARIGATÔ spas dziękuję спасибо NANDRI teşekkür ederim JËRËJËF MAHALO OBRIGADO
  12. Classification : Internal Disclaimer This presentation has been prepared by

    BNP PARIBAS for informational purposes only. Although the information contained in this presentation has been obtained from sources which BNP PARIBAS believes to be reliable, it has not been independently verified and no representation or warranty, express or implied, is made and no responsibility is or will be accepted by BNP PARIBAS as to or in relation to the accuracy, reliability or completeness of any such information. Opinions expressed herein reflect the judgement of BNP PARIBAS as of the date of this presentation and may be subject to change without notice if BNP PARIBAS becomes aware of any information, whether specific or general, which may have a material impact on any such opinions. BNP PARIBAS will not be responsible for any consequences resulting from the use of this presentation as well as the reliance upon any opinion or statement contained herein or for any omission. This presentation is confidential and may not be reproduced (in whole or in part) nor summarised or distributed without the prior written permission of BNP PARIBAS. © BNP PARIBAS. All rights reserved.