
The Latest from NVIDIA Supporting Computational Mechanics Simulation / 2023-10-26 CMD2023

Shinnosuke Furuya

October 26, 2023

Transcript

  1. Agenda
     • GPU Classification and Generation
     • NVIDIA Grace CPU
     • NVIDIA Grace CPU Superchip
     • NVIDIA Grace Hopper Superchip
     • NVIDIA Modulus

  2. NVIDIA GPUs at a Glance
     • Fermi (2010): Data Center M2090 | Quadro 6000 | GeForce GTX 580
     • Kepler (2012): Data Center K80, K1 | Quadro K6000 | GeForce GTX 780
     • Maxwell (2014): Data Center M40, M10 | Quadro M6000 | GeForce GTX 980
     • Pascal (2016): Data Center P100 | Quadro P5000, GP100 | GeForce GTX 1080, TITAN Xp
     • Volta (2017): Data Center V100 | Quadro GV100 | GeForce TITAN V
     • Turing (2018): Data Center T4 | Quadro RTX 8000 | GeForce RTX 2080 Ti, TITAN RTX
     • Ampere (2020): Data Center A100, A30, A40, A2, A16 | RTX A6000 | GeForce RTX 3090 Ti
     • Hopper (2022): Data Center H100
     • Ada Lovelace (2022): Data Center L40, L4 | RTX 6000 Ada Gen | GeForce RTX 4090

  3. NVIDIA GPUs at a Glance (by workload and precision)
     The same lineup as the previous slide, grouped by target workload:
     • HPC/AI Computing | FP64, FP32: M2090, K80, P100, V100, A100, A30, H100
     • HPC/AI Computing | FP32: M40, T4, A40, A2, L40, L4
     • VDI | FP32: K1, M10, A16
     • Professional Visualization | FP32: the RTX / Quadro line (6000 through RTX 6000 Ada Gen)
     • Gaming | FP32: the GeForce line (GTX 580 through RTX 4090)

  4. Agenda (repeated as a section divider; next up: NVIDIA Grace CPU)

  5. NVIDIA Grace CPU: Building Block of the Superchip
     • High Performance, Power Efficient Cores: 72 flagship Arm Neoverse V2 cores with SVE2 (4x128b SIMD per core)
     • Fast On-Chip Fabric: 3.2 TB/s of bisection bandwidth connects CPU cores, NVLink-C2C, memory, and system IO
     • High-Bandwidth, Low-Power Memory: up to 480 GB of data-center-enhanced LPDDR5X memory delivering up to 500 GB/s of memory bandwidth
     • Coherent Chip-to-Chip Connections: NVLink-C2C with 900 GB/s of bandwidth for coherent connection to CPU or GPU
     • Industry-Leading Performance per Watt: up to 2X perf/W over today's leading servers

  6. NVLink-C2C: High-Speed Chip-to-Chip Interconnect
     • Creates the Grace Hopper and Grace Superchips
     • Removes the typical cross-socket bottlenecks
     • Up to 900 GB/s of raw bidirectional bandwidth, the same as GPU-to-GPU NVLink on Hopper
     • Low-power interface: 1.3 pJ/bit, more than 5x more power efficient than PCIe
     • Enables coherency for both the Grace and Grace Hopper Superchips
     [Diagram: two Grace CPUs linked by NVLink-C2C at 900 GB/s, each with LPDDR5X memory at ≤ 512 GB/s]

  7. NVIDIA Grace for Cloud, AI, and HPC Infrastructure
     • GH200 Grace Hopper Superchip (Large-Scale AI & HPC): accelerated applications where CPU performance and system memory size and bandwidth are critical; tightly coupled CPU & GPU for flagship AI & HPC. The most versatile compute platform for scale-out.
     • Grace CPU Superchip (CPU Computing): CPU-based applications where absolute performance, energy efficiency, and data center density matter, such as scientific computing, data analytics, and enterprise and hyperscale computing applications.

  8. Agenda (repeated as a section divider; next up: NVIDIA Grace CPU Superchip)

  9. NVIDIA Grace CPU Superchip: 2X Performance at the Same Power for the Modern Data Center
     • High Performance, Power Efficient Cores: 144 flagship Arm Neoverse V2 cores with SVE2 (4x128b SIMD per core)
     • Fast On-Chip Fabric: 3.2 TB/s of bisection bandwidth connects CPU cores, NVLink-C2C, memory, and system IO
     • High-Bandwidth, Low-Power Memory: up to 960 GB of data-center-enhanced LPDDR5X memory delivering up to 1 TB/s of memory bandwidth
     • Fast and Flexible CPU IO: up to 8x PCIe Gen5 x16 interfaces; PCIe Gen 5 delivers up to 128 GB/s, 2X more bandwidth than PCIe Gen 4
     • Full NVIDIA Software Stack: AI, Omniverse

  10. NVIDIA Grace is a Compute and Data Movement Architecture
      NVIDIA Scalable Coherency Fabric and Distributed Cache Design
      • 72 high-performance Arm Neoverse V2 cores with 4x128b SVE2
      • 3.2 TB/s bisection bandwidth
      • 117 MB of L3 cache
      • Local caching of remote-die memory
      • Background data movement via the Cache Switch Network

  11. Low-Power, High-Bandwidth Memory Subsystem
      LPDDR5X Data-Center-Enhanced Memory
      • Optimal balance between bandwidth, energy efficiency, and capacity
      • Up to 1 TB/s of raw bidirectional bandwidth
      • 1/8th the power per GB/s vs. conventional DDR memory
      • Similar cost per bit to conventional DDR memory
      • Data-center-class memory with error-correcting code (ECC)
      [Diagram: two Grace CPUs linked by NVLink-C2C at 900 GB/s, each with LPDDR5X memory interfaces at ≤ 512 GB/s]

  12. NVIDIA Grace CPU Delivers 1.9X HPC Data Center Throughput at the Same Power
      Breakthrough Performance and Efficiency
      [Two charts, normalized to AMD Genoa = 1.0X, covering Weather (WRF), MD (CP2K), Climate (NEMO), and CFD (OpenFOAM). Server performance: Intel SPR at 0.6-0.7X, AMD Genoa at 1.0X, NVIDIA Grace at 1.0-1.1X. Data center throughput at the same power: Intel SPR at 0.6-0.7X, AMD Genoa at 1.0X, NVIDIA Grace at 1.7-1.9X.]
      Data-center-level projection of the NVIDIA Grace Superchip vs. flagship x86 2-socket data center systems (AMD EPYC 9654 and Intel Xeon 8480+). MD: CP2K RPA 2023.1. Climate: NEMO Gyre_Pisces v4.2.0. Weather: WRF CONUS12, 24-hour simulation, v4.4.2. CFD: OpenFOAM Motorbike (Large), v2212. NVIDIA Grace Superchip performance based on engineering measurements. Results subject to change.

  13. Agenda (repeated as a section divider; next up: NVIDIA Grace Hopper Superchip)

  14. NVIDIA Grace CPU: Building Block of the Superchip
      (Repeat of slide 5, reintroducing the Grace CPU: 72 Arm Neoverse V2 cores with SVE2, 3.2 TB/s on-chip fabric, up to 480 GB LPDDR5X at up to 500 GB/s, 900 GB/s NVLink-C2C, up to 2X perf/W over today's leading servers.)

  15. NVIDIA Hopper H100 GPU: Breakthrough Performance and Efficiency for the Modern Data Center
      • Highest AI and HPC Performance: 4 PF FP8 (6X) | 2 PF FP16 (3X) | 1 PF TF32 (3X) | 67 TF FP64 (3.4X); 4 TB/s (2X), 96 GB HBM3 memory
      • Transformer Engine: 4th-generation Tensor Core optimized for Transformer models; 6X faster on the largest transformer models
      • Highest Utilization Efficiency and Security: 7 fully isolated and secured instances with 2nd-gen MIG
      • Fastest, Scalable Interconnect: 4th-gen NVLink with 900 GB/s GPU-to-GPU connectivity, up to 256 linked GPUs with the NVLink Switch System

  16. NVIDIA GH200 Grace Hopper Superchip: Processor for the Era of Accelerated Computing and Generative AI
      • GH200 with HBM3: 72-core Grace CPU | 4 PFLOPS Hopper GPU | 96 GB HBM3 | 4 TB/s | 900 GB/s NVLink-C2C. 7X the bandwidth to the GPU vs. PCIe Gen 5; combined 576 GB of fast memory (480 GB LPDDR5X + 96 GB HBM3); 1.2x the capacity and bandwidth vs. H100; full NVIDIA compute stack.
      • GH200 with HBM3e: 72-core Grace CPU | 4 PFLOPS Hopper GPU | 144 GB HBM3e | 5 TB/s | 900 GB/s NVLink-C2C. World's first HBM3e GPU; combined 624 GB of fast memory (480 GB LPDDR5X + 144 GB HBM3e); 1.7x the capacity and 1.5x the bandwidth vs. H100; full NVIDIA compute stack.
      • NVLink Dual-GH200 System: 144-core Grace CPU | 8 PFLOPS Hopper GPU | 288 GB HBM3e | 10 TB/s | 900 GB/s NVLink-C2C. Simple-to-deploy MGX-compatible design; combined 1.2 TB of fast memory; 3.5x the capacity and 3x the bandwidth vs. H100; full NVIDIA compute stack.

  17. Energy Efficient Design: More Efficient Computation and Data Movement
      x86+Hopper baseline vs. Grace Hopper:
      • System memory: DDR5 at 35 pJ/bit vs. LPDDR5X at 5 pJ/bit (7X less energy)
      • CPU-GPU link: PCIe Gen 5 at 6.5 pJ/bit vs. NVLink-C2C at 1.3 pJ/bit (5X less energy)
      • CPU compute: 99 pJ/FLOP (DP) vs. 62 pJ/FLOP (DP) on Grace (1.6X less energy)
      • GPU compute: 12 pJ/FLOP (DP) on both (equal energy)

  18. Optimizing Performance Through Power Steering: Getting the Most Out of Provisioned Power
      [Chart: CPU vs. GPU power split within an initial provisioned power of 650 W, comparing an x86+H100 GPU-heavy phase without power steering against a GH200 GPU-heavy phase with power steering.]
      • 200 W of CPU power is shifted to the GPU to maximize application performance
      • Total power stays fixed at the provisioned power limit (650 W)

  19. GH200 Grace Hopper HPC Platform: Unified Memory and Cache Coherence for Next-Gen HPC Performance
      • Fast-access memory: 624 GB | Memory bandwidth: 5 TB/s
      • Partially GPU-accelerated apps: big performance gains with no code changes
      • No more PCIe bottleneck: NVLink-C2C is 7X PCIe bandwidth
      • CPU & GPU cache coherence: incremental code changes yield big gains (see the sketch below)
      HPC: preliminary results comparing DGX A100, DGX H100, and GH200-with-HBM3 systems. OpenFOAM based on MotorBike, NAMD with Colvars, CP2K with RPA,

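      The coherence claims above can be exercised from high-level code. Below is a minimal sketch, assuming CuPy is installed, that routes array allocations through CUDA managed memory (cudaMallocManaged); the array size and arithmetic are arbitrary illustration values, not from the slide. On a coherent platform such as GH200, the GPU reaches this memory over NVLink-C2C rather than PCIe, which is what makes the "incremental code changes" path attractive.

          import cupy as cp

          # Back every CuPy allocation with CUDA managed memory so the same
          # allocation is usable from both CPU and GPU code paths.
          cp.cuda.set_allocator(cp.cuda.MemoryPool(cp.cuda.malloc_managed).malloc)

          x = cp.random.random(100_000_000)  # managed allocation (illustrative size)
          y = cp.sqrt(x).sum()               # GPU kernels operate on managed memory
          print(float(y))                    # result is read back on the host
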
  20. Agenda (repeated as a section divider; next up: NVIDIA Modulus)

  21. AI/ML for Physical Systems: What Is Different?
      Physics-based machine learning. Physical system ≈ a system governed by first principles or by governing partial differential equations (an illustrative example follows).

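      One typical governing PDE for such a system (an illustrative example of mine; the slide names none) is the transient heat equation

          ∂u/∂t = α ∇²u

      where u(x, t) is the temperature field and α the thermal diffusivity. A physics-based model is constrained to satisfy this equation rather than to fit data alone.
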
  22. Promise of PINNs: Data Only vs. PINN, Solving the Data Problem
      • Neural networks are functions that can be modified to represent almost any other function
      • Target function: f(x); NN to approximate it: u(x;W) ≅ f(x)
      • Training: find weights W that minimize the mismatch at selected data points, i.e. drive Σ_i (u(x_i;W) − f(x_i))² toward 0
      • Given enough data, neural networks can approximate almost any function to any degree of accuracy
      • But collecting field data may not always be possible
      • If we understand the physical laws behind the data, we can generate the missing training signal from the equations themselves: a PINN also penalizes the residual of the governing equation, driving it to 0 (see the sketch below)

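      To make that loss structure concrete, here is a minimal PINN sketch in PyTorch (an assumption: the slide shows no code, and Modulus wraps these details). It fits u(x;W) to a few samples of a hypothetical target f(x) = sin(x) while also penalizing the residual of u'' + u = 0, an equation that f satisfies:

          import torch

          # Small MLP u(x; W) approximating the target function.
          net = torch.nn.Sequential(
              torch.nn.Linear(1, 32), torch.nn.Tanh(),
              torch.nn.Linear(32, 32), torch.nn.Tanh(),
              torch.nn.Linear(32, 1),
          )

          # A few "measured" samples of the hypothetical target f(x) = sin(x).
          x_data = torch.linspace(0.0, 3.0, 5).reshape(-1, 1)
          y_data = torch.sin(x_data)

          # Collocation points where the residual u'' + u = 0 is enforced.
          x_col = torch.linspace(0.0, 3.0, 64).reshape(-1, 1).requires_grad_(True)

          opt = torch.optim.Adam(net.parameters(), lr=1e-3)
          for step in range(5000):
              opt.zero_grad()
              # Data term: mismatch (u(x_i; W) - f(x_i))^2 at the data points.
              loss_data = ((net(x_data) - y_data) ** 2).mean()
              # Physics term: autograd supplies u' and u'' at collocation points.
              u = net(x_col)
              du = torch.autograd.grad(u.sum(), x_col, create_graph=True)[0]
              d2u = torch.autograd.grad(du.sum(), x_col, create_graph=True)[0]
              loss_pde = ((d2u + u) ** 2).mean()  # drive the residual toward 0
              (loss_data + loss_pde).backward()
              opt.step()

      The physics term lets the network learn f between and beyond the five data points, which is exactly the "solving the data problem" argument on the slide.
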
  23. Modulus: Open-Source AI Toolkit for Physics-Based ML
      • Python-based APIs for ease of use
      • Import your PyTorch model* (an illustrative sketch follows)
      • Reference case studies and recipes as starting points
      • Facilitates open collaboration within the Physics-ML scientific community
      • Model architecture zoo
      • Well-documented features and functionality for ease of use
      • Source code: https://github.com/NVIDIA/modulus

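      Because Modulus builds on PyTorch, the model you "import" is an ordinary torch.nn.Module. The class below is an illustrative assumption (not a Modulus API), showing the shape of such a model for a coordinates-to-scalar-field surrogate:

          import torch

          class SurrogateNet(torch.nn.Module):
              """Plain PyTorch module: the kind of model Modulus can consume."""
              def __init__(self, in_dim: int = 3, out_dim: int = 1, width: int = 128):
                  super().__init__()
                  self.body = torch.nn.Sequential(
                      torch.nn.Linear(in_dim, width), torch.nn.SiLU(),
                      torch.nn.Linear(width, width), torch.nn.SiLU(),
                      torch.nn.Linear(width, out_dim),
                  )

              def forward(self, x: torch.Tensor) -> torch.Tensor:
                  # x: spatial coordinates (e.g. x, y, z) -> predicted field value.
                  return self.body(x)

          model = SurrogateNet()
          print(model(torch.rand(8, 3)).shape)  # torch.Size([8, 1])
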
  24. To Learn More about Modulus
      • Book: はじめてのNVIDIA Modulus (First Steps with NVIDIA Modulus: Engineering Simulation with Physics-ML, Machine Learning Based on Physics)
        • Author: 柴田良一 (Ryoichi Shibata), supervised by NVIDIA
        • Publisher: Kohgakusha
        • Release date: July 26, 2023
        • https://honto.jp/netstore/pd-book_32621454.html
      • CAE Konwakai "Kaiseki-juku" course: Introduction to Modulus
        • Date: December 7, 2023 (Thu), 10:00-16:00, also available on demand
        • Instructors: Prof. 柴田良一 (NIT, Gifu College) and 丹愛彦 (NVIDIA)
        • Venue: Shin-Osaka
        • Fee: paid course
        • http://www.cae21.org/kaisekijuku2023/modulus_2023.shtml

  25. NVIDIA at SC23
      November 12-17, 2023 | Colorado Convention Center, Denver, Colorado
      Many NVIDIA sessions:
      • 7 Tutorials
      • 5 Workshops
      • 4 Exhibitor Forums
      • 4 Papers
      • 15 Birds of a Feather (BoF) sessions
      • 2 Panels
      https://www.nvidia.com/en-us/events/supercomputing/

  26. NVIDIA GTC: The In-Person GTC Experience Is Back
      March 18-21, 2024 | San Jose Convention Center, San Jose, CA, and Virtual
      Come to GTC, the conference for the era of AI, to connect with a dream team of industry luminaries, developers, researchers, and business experts shaping what's next in AI and accelerated computing. From the highly anticipated keynote by NVIDIA CEO Jensen Huang to over 600 inspiring sessions, 200+ exhibits, and tons of networking events, GTC delivers something for every technical level and interest area. Be sure to save your spot for this transformative event. You can even take advantage of early-bird pricing when you register by February 7.
      www.nvidia.com/gtc