Slide 1


The Latest NVIDIA Technologies Supporting Computational Mechanics Simulation | Shinnosuke Furuya, Ph.D., HPC Developer Relations | 2023/10/26

Slide 2


Agenda
• GPU Classification and Generation
• NVIDIA Grace CPU
• NVIDIA Grace CPU Superchip
• NVIDIA Grace Hopper Superchip
• NVIDIA Modulus

Slide 3


NVIDIA GPUs at a Glance
• Fermi (2010): M2090 | Quadro 6000 | GTX 580
• Kepler (2012): K80, K1 | K6000 | GTX 780
• Maxwell (2014): M40, M10 | M6000 | GTX 980
• Pascal (2016): P100 | P5000, GP100 | GTX 1080, TITAN Xp
• Volta (2017): V100 | GV100 | TITAN V
• Turing (2018): T4 | RTX 8000 | RTX 2080 Ti, TITAN RTX
• Ampere (2020): A100, A30, A40, A2, A16 | RTX A6000 | RTX 3090 Ti
• Hopper (2022): H100
• Ada Lovelace (2022): L40, L4 | RTX 6000 Ada Gen | RTX 4090
(Each generation lists its Data Center GPU | RTX / Quadro | GeForce products.)

Slide 4


NVIDIA GPUs at a Glance (the same generation timeline as the previous slide, annotated by workload class):
• Professional Visualization | FP32 (the RTX / Quadro line)
• Gaming | FP32 (the GeForce line)
• VDI | FP32
• HPC/AI Computing | FP32
• HPC/AI Computing | FP64, FP32

Slide 5


Agenda
• GPU Classification and Generation
• NVIDIA Grace CPU
• NVIDIA Grace CPU Superchip
• NVIDIA Grace Hopper Superchip
• NVIDIA Modulus

Slide 6


NVIDIA Grace CPU: Building Block of the Superchip
• High Performance Power Efficient Cores: 72 flagship Arm Neoverse V2 cores with SVE2 4x128b SIMD per core
• Fast On-Chip Fabric: 3.2 TB/s of bisection bandwidth connects CPU cores, NVLink-C2C, memory, and system IO
• High-Bandwidth Low-Power Memory: up to 480 GB of data center enhanced LPDDR5X memory delivering up to 500 GB/s of memory bandwidth
• Coherent Chip-to-Chip Connections: NVLink-C2C with 900 GB/s bandwidth for coherent connection to CPU or GPU
• Industry Leading Performance Per Watt: up to 2X perf/W over today's leading servers

Slide 7


NVLink-C2C: High Speed Chip to Chip Interconnect
• Creates the Grace Hopper and Grace Superchips
• Removes the typical cross-socket bottlenecks
• Up to 900 GB/s of raw bidirectional bandwidth, the same as GPU-to-GPU NVLink on Hopper
• Low power interface at 1.3 pJ/bit, more than 5X more power efficient than PCIe
• Enables coherency for both the Grace and Grace Hopper Superchips
(Diagram: a Grace CPU with its LPDDR5X memory at up to 512 GB/s, linked over NVLink-C2C at 900 GB/s.)

Slide 8


NVIDIA Grace for Cloud, AI and HPC Infrastructure
• GH200 Grace Hopper Superchip (Large Scale AI & HPC): accelerated applications where CPU performance and system memory size and bandwidth are critical; tightly coupled CPU & GPU for flagship AI & HPC; the most versatile compute platform for scale out.
• Grace CPU Superchip (CPU Computing): CPU-based applications where absolute performance, energy efficiency, and data center density matter, such as scientific computing, data analytics, enterprise and hyperscale computing applications.

Slide 9


Agenda
• GPU Classification and Generation
• NVIDIA Grace CPU
• NVIDIA Grace CPU Superchip
• NVIDIA Grace Hopper Superchip
• NVIDIA Modulus

Slide 10


NVIDIA Grace CPU Superchip: 2X Performance at the Same Power for the Modern Data Center
• High Performance Power Efficient Cores: 144 flagship Arm Neoverse V2 cores with SVE2 4x128b SIMD per core
• Fast On-Chip Fabric: 3.2 TB/s of bisection bandwidth connects CPU cores, NVLink-C2C, memory, and system IO
• High-Bandwidth Low-Power Memory: up to 960 GB of data center enhanced LPDDR5X memory delivering up to 1 TB/s of memory bandwidth
• Fast and Flexible CPU IO: up to 8x PCIe Gen 5 x16 interfaces; PCIe Gen 5 delivers up to 128 GB/s, 2X more bandwidth than PCIe Gen 4 (see the arithmetic sketch below)
• Full NVIDIA Software Stack: AI, Omniverse
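A quick sanity check of the PCIe figures above, using my own arithmetic rather than anything stated on the slide (a Gen 5 lane signals at 32 GT/s with 128b/130b encoding):

```python
# Back-of-envelope check of "PCIe Gen 5 up to 128 GB/s" for a single x16 link,
# counting both directions of traffic.
GT_PER_S = 32            # PCIe Gen 5 signaling rate per lane
ENCODING = 128 / 130     # 128b/130b line encoding overhead
LANES = 16

per_direction_gb_s = GT_PER_S * ENCODING / 8 * LANES   # ~63 GB/s each way
print(round(2 * per_direction_gb_s))                    # ~126 GB/s bidirectional, i.e. roughly 128 GB/s
# PCIe Gen 4 signals at 16 GT/s per lane, hence the "2X more bandwidth" comparison.
```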

Slide 11


NVIDIA Grace is a Compute and Data Movement Architecture: NVIDIA Scalable Coherency Fabric and Distributed Cache Design
• 72 high performance Arm Neoverse V2 cores with 4x128b SVE2
• 3.2 TB/s bisection bandwidth
• 117 MB of L3 cache
• Local caching of remote die memory
• Background data movement via the Cache Switch Network

Slide 12


Low-Power High-Bandwidth Memory Subsystem: LPDDR5X Data Center Enhanced Memory
• Optimal balance between bandwidth, energy efficiency and capacity
• Up to 1 TB/s of raw bidirectional bandwidth
• 1/8th the power per GB/s vs conventional DDR memory
• Similar cost per bit to conventional DDR memory
• Data center class memory with error correction code (ECC)
(Diagram: the Grace Superchip, two Grace CPU dies joined by NVLink-C2C at 900 GB/s, each with its own LPDDR5X at up to 512 GB/s.)

Slide 13


NVIDIA Grace CPU Delivers 1.9X HPC Data Center Throughput at the Same Power: Breakthrough Performance and Efficiency
• Server performance (normalized to AMD Genoa = 1.0X): Intel SPR roughly 0.6X to 0.7X and NVIDIA Grace roughly 1.0X to 1.1X across Weather (WRF), MD (CP2K), Climate (NEMO), and CFD (OpenFOAM).
• Data center throughput at the same power (normalized to AMD Genoa = 1.0X): Intel SPR roughly 0.6X to 0.7X and NVIDIA Grace 1.7X to 1.9X across the same four workloads.
Data center level projection of the NVIDIA Grace Superchip vs x86 flagship 2-socket data center systems (AMD EPYC 9654 and Intel Xeon 8480+). MD: CP2K RPA 2023.1. Climate: NEMO Gyre_Pisces v4.2.0. Weather: WRF CONUS12, 24 hr simulation, 4.4.2. CFD: OpenFOAM Motorbike Large v2212. NVIDIA Grace Superchip performance based on engineering measurements; results subject to change.

Slide 14


Agenda
• GPU Classification and Generation
• NVIDIA Grace CPU
• NVIDIA Grace CPU Superchip
• NVIDIA Grace Hopper Superchip
• NVIDIA Modulus

Slide 15


NVIDIA Grace CPU: Building Block of the Superchip
• High Performance Power Efficient Cores: 72 flagship Arm Neoverse V2 cores with SVE2 4x128b SIMD per core
• Fast On-Chip Fabric: 3.2 TB/s of bisection bandwidth connects CPU cores, NVLink-C2C, memory, and system IO
• High-Bandwidth Low-Power Memory: up to 480 GB of data center enhanced LPDDR5X memory delivering up to 500 GB/s of memory bandwidth
• Coherent Chip-to-Chip Connections: NVLink-C2C with 900 GB/s bandwidth for coherent connection to CPU or GPU
• Industry Leading Performance Per Watt: up to 2X perf/W over today's leading servers

Slide 16


NVIDIA Hopper H100 GPU: Breakthrough Performance and Efficiency for the Modern Data Center
• Highest AI and HPC Performance: 4 PF FP8 (6X) | 2 PF FP16 (3X) | 1 PF TF32 (3X) | 67 TF FP64 (3.4X), speedups relative to the prior generation; 4 TB/s (2X), 96 GB HBM3 memory
• Transformer Engine: 4th generation Tensor Core optimized for Transformer models, 6X faster on the largest transformer models
• Highest Utilization Efficiency and Security: 7 fully isolated and secured instances with 2nd Gen MIG
• Fastest, Scalable Interconnect: 4th Gen NVLink with 900 GB/s GPU-to-GPU connectivity, up to 256 linked GPUs with the NVLink Switch System

Slide 17


NVIDIA GH200 Grace Hopper Superchip: Processor for the Era of Accelerated Computing and Generative AI
• GH200 with HBM3: 72-core Grace CPU | 4 PFLOPS Hopper GPU | 96 GB HBM3 | 4 TB/s | 900 GB/s NVLink-C2C; 7X bandwidth to the GPU vs PCIe Gen 5; combined 576 GB of fast memory; 1.2x the capacity and bandwidth of H100; full NVIDIA compute stack
• GH200 with HBM3e: 72-core Grace CPU | 4 PFLOPS Hopper GPU | 144 GB HBM3e | 5 TB/s | 900 GB/s NVLink-C2C; world's first HBM3e GPU; combined 624 GB of fast memory; 1.7x the capacity and 1.5x the bandwidth of H100; full NVIDIA compute stack
• NVLink Dual-GH200 System: 144-core Grace CPU | 8 PFLOPS Hopper GPU | 288 GB HBM3e | 10 TB/s | 900 GB/s NVLink-C2C; simple to deploy MGX-compatible design; combined 1.2 TB of fast memory; 3.5x the capacity and 3x the bandwidth of H100; full NVIDIA compute stack
(The combined-memory figures are spelled out below.)
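The "combined fast memory" capacities are consistent with simply adding the Grace CPU's LPDDR5X (480 GB per Grace die, 960 GB in the dual configuration) to the Hopper GPU's HBM; this is my reading of the figures, not something the slide states explicitly:

```latex
576~\mathrm{GB} = 480~\mathrm{GB}~(\text{LPDDR5X}) + 96~\mathrm{GB}~(\text{HBM3})
\qquad
624~\mathrm{GB} = 480~\mathrm{GB}~(\text{LPDDR5X}) + 144~\mathrm{GB}~(\text{HBM3e})
\qquad
1.2~\mathrm{TB} \approx 2 \times (480~\mathrm{GB} + 144~\mathrm{GB}) = 1248~\mathrm{GB}
```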

Slide 18


Energy Efficient Design: More Efficient Computation and Data Movement
• System memory: DDR5 at 35 pJ/bit vs LPDDR5X at 5 pJ/bit, 7X less energy
• CPU-GPU interconnect: PCIe Gen 5 at 6.5 pJ/bit vs NVLink-C2C at 1.3 pJ/bit, 5X less energy
• CPU compute: 99 pJ/FLOP DP vs 62 pJ/FLOP DP, 1.6X less energy
• GPU compute: 12 pJ/FLOP DP vs 12 pJ/FLOP DP, equal energy
(A quick sense of what these pJ/bit figures mean in watts is sketched below.)
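To make the pJ/bit figures concrete, here is a small illustrative calculation (mine, not from the slide) converting energy per bit into watts at a given sustained bandwidth:

```python
# power [W] = bandwidth [bytes/s] * 8 [bits/byte] * energy [J/bit]
def data_movement_power_w(bandwidth_gb_s: float, pj_per_bit: float) -> float:
    return bandwidth_gb_s * 1e9 * 8 * pj_per_bit * 1e-12

print(data_movement_power_w(900, 1.3))   # NVLink-C2C at its full 900 GB/s: ~9.4 W
print(data_movement_power_w(900, 6.5))   # the same traffic at PCIe Gen 5's 6.5 pJ/bit: ~46.8 W
print(data_movement_power_w(500, 5.0))   # 500 GB/s of LPDDR5X traffic at 5 pJ/bit: ~20 W
print(data_movement_power_w(500, 35.0))  # the same traffic from DDR5 at 35 pJ/bit: ~140 W
```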

Slide 19


Optimizing Performance Through Power Steering: Getting the Most Out of Provisioned Power
With a fixed provisioned power limit of 650 W, an x86 + H100 system without power steering leaves CPU power stranded during a GPU-heavy phase. GH200 with power steering shifts 200 W of CPU power to the GPU to maximize application performance while total power stays fixed at the provisioned limit.

Slide 20


GH200 Grace Hopper HPC Platform: Unified Memory and Cache Coherence for Next Gen HPC Performance
• Fast access memory: 624 GB, with 5 TB/s of memory bandwidth
• Partially GPU accelerated apps: no more PCIe bottleneck; NVLink-C2C is 7X PCIe bandwidth
• CPU & GPU cache coherence: big performance gains with no code changes; incremental code changes yield big gains
HPC: preliminary results comparing DGX A100, DGX H100 and GH200 with HBM3 systems. OpenFOAM based on MotorBike, NAMD with Colvars, CP2K with RPA.

Slide 21


Agenda
• GPU Classification and Generation
• NVIDIA Grace CPU
• NVIDIA Grace CPU Superchip
• NVIDIA Grace Hopper Superchip
• NVIDIA Modulus

Slide 22


AI/ML for Physical Systems: What is Different?
Physics-based machine learning. A physical system ≈ a system governed by first principles or governing partial differential equations (one concrete example is given below).
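One concrete example (my illustration, not from the slide): transient heat conduction is a physical system in this sense, because the temperature field u(x, t) is governed by the heat equation

```latex
\frac{\partial u}{\partial t} = \alpha \, \nabla^{2} u ,
```

where α is the thermal diffusivity. Physics-based ML trains a model whose output respects this equation together with its boundary and initial conditions, rather than fitting measured data alone.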

Slide 23


Promise of PINNs: Data Only vs PINN, Solving the Data Problem
• Neural networks are functions that can be modified to represent almost any other function
• Target function: f(x); NN to approximate it: u(x; W) ≈ f(x)
• Training: find weights W that minimize the mismatch at selected data points
• Given enough data, neural networks can approximate almost any function to any degree of accuracy
• But collecting field data may not always be possible
• If we understand the physical laws behind the data, then we can generate enough training signal from the governing equations themselves (see the PyTorch sketch below)
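A minimal sketch of that idea in plain PyTorch (my illustration, not code from the slide), assuming a toy 1D problem u''(x) = sin(x) with only a handful of "measured" points: the loss combines the data mismatch with the PDE residual obtained by automatic differentiation, so the physics supplies training signal wherever data is missing.

```python
import torch

# u(x; W): a small fully connected network approximating the unknown solution.
net = torch.nn.Sequential(
    torch.nn.Linear(1, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 1),
)

x_data = torch.rand(16, 1)          # sparse "field data" locations
u_data = -torch.sin(x_data)         # synthetic observations consistent with u'' = sin(x)
x_col = torch.rand(256, 1, requires_grad=True)   # collocation points where the PDE is enforced

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(2000):
    opt.zero_grad()
    # Data loss: mismatch at the measured points.
    loss_data = torch.mean((net(x_data) - u_data) ** 2)
    # Physics loss: residual u''(x) - sin(x) at the collocation points, via autograd.
    u = net(x_col)
    du = torch.autograd.grad(u, x_col, torch.ones_like(u), create_graph=True)[0]
    d2u = torch.autograd.grad(du, x_col, torch.ones_like(du), create_graph=True)[0]
    loss_pde = torch.mean((d2u - torch.sin(x_col)) ** 2)
    (loss_data + loss_pde).backward()
    opt.step()
```

Modulus packages these same ingredients (networks, PDE residuals, constraints) behind higher-level APIs instead of hand-written training loops.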

Slide 24


Modulus: Open-Source AI Toolkit for Physics-Based ML
• Python based APIs for ease of use
• Import your PyTorch model* (a minimal sketch follows below)
• Reference case studies and recipes as starting points
• Facilitates open collaboration within the Physics-ML scientific community
• Model architecture zoo
• Well documented features and functionality for ease of use
• Source code: https://github.com/NVIDIA/modulus
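As a hedged illustration of the "import your PyTorch model" bullet, this is the kind of plain torch.nn.Module a user might bring; the class name SurrogateNet is hypothetical, and the Modulus-side wrapping/registration call is deliberately omitted because it depends on the Modulus version.

```python
import torch

class SurrogateNet(torch.nn.Module):
    """Hypothetical user-defined architecture of the kind the slide says can be imported."""
    def __init__(self, in_features: int = 3, out_features: int = 1, width: int = 128):
        super().__init__()
        self.layers = torch.nn.Sequential(
            torch.nn.Linear(in_features, width), torch.nn.SiLU(),
            torch.nn.Linear(width, width), torch.nn.SiLU(),
            torch.nn.Linear(width, out_features),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)

model = SurrogateNet()
print(model(torch.rand(8, 3)).shape)   # torch.Size([8, 1])
```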

Slide 25


To Learn More about Modulus
• Book: 「はじめてのNVIDIA Modulus –Physics-ML 物理に基づいた機械学習による工学シミュレーション–」 (Getting Started with NVIDIA Modulus: Engineering Simulation with Physics-Based Machine Learning)
• Author: 柴田 良一 (Ryoichi Shibata); supervised by NVIDIA
• Publisher: Kohgakusha
• Release date: July 26, 2023
• https://honto.jp/netstore/pd-book_32621454.html
• CAE Konwakai "Kaiseki Juku" training course: Introduction to Modulus
• Date: Thursday, December 7, 2023, 10:00-16:00 / on demand
• Instructors: Prof. 柴田 良一 (National Institute of Technology, Gifu College) and 丹 愛彦 (NVIDIA)
• Venue: Shin-Osaka
• Fee: paid course
• http://www.cae21.org/kaisekijuku2023/modulus_2023.shtml

Slide 26


Additional information

Slide 27


NVIDIA at SC23
November 12-17, 2023 | Colorado Convention Center, Denver, Colorado
Many NVIDIA sessions:
• 7 Tutorials
• 5 Workshops
• 4 Exhibitor Forums
• 4 Papers
• 15 Birds of a Feather (BoF) sessions
• 2 Panels
https://www.nvidia.com/en-us/events/supercomputing/

Slide 28


NVIDIA GTC
March 18-21, 2024 | San Jose Convention Center, San Jose, CA and Virtual
The In-Person GTC Experience Is Back. Come to GTC, the conference for the era of AI, to connect with a dream team of industry luminaries, developers, researchers, and business experts shaping what's next in AI and accelerated computing. From the highly anticipated keynote by NVIDIA CEO Jensen Huang to over 600 inspiring sessions, 200+ exhibits, and tons of networking events, GTC delivers something for every technical level and interest area. Be sure to save your spot for this transformative event. You can even take advantage of early-bird pricing when you register by February 7.
March 18-21, 2024 | www.nvidia.com/gtc

Slide 29


No content