
Computational Mechanics Simulation Accelerated by GPUs (GPU で加速される計算力学シミュレーション) / 2022-11-16 CMD2022

Shinnosuke Furuya

November 16, 2022
Transcript

  1. NVIDIA GPUs at a Glance

     Architecture timeline: Fermi (2010), Kepler (2012), Maxwell (2014), Pascal (2016), Volta (2017), Turing (2018), Ampere (2020), Hopper (2022), Ada Lovelace (2022)
     • Data Center GPU: M2090, K80, K1, M40, M10, P100, T4, V100, A100, A30, A40, A2, A16, H100, L40
     • RTX / Quadro: 6000, K6000, M6000, P5000, GP100, GV100, RTX 8000, RTX A6000, RTX 6000 Ada Generation
     • GeForce: GTX 580, GTX 780, GTX 980, GTX 1080, TITAN Xp, TITAN V, RTX 2080 Ti, TITAN RTX, RTX 3090 Ti, RTX 4090
  2. NVIDIA GPUs at a Glance

     Same architecture and product timeline as the previous slide, annotated by market segment and precision focus:
     Computing | FP64, FP32; VDI | FP32; Professional Visualization | FP32; Gaming | FP32; Computing | FP32
  3. NVIDIA Hopper

     The engine for the world's AI infrastructure: the world's most advanced chip, with 4th Gen NVLink, Transformer Engine, 2nd Gen MIG, Confidential Computing, and DPX Instructions.
     Form factors: H100 SXM and H100 PCIE. H100 PCIE includes NVIDIA AI Enterprise (a 5-year subscription).
  4. NVIDIA H100

     Unprecedented performance, scalability, and security for every data center. FP8, FP16, and TF32 figures include sparsity; X-factors are relative to A100 (see the dense-vs-sparse sketch below).
     • Highest AI and HPC performance: 4 PF FP8 (6X) | 2 PF FP16 (3X) | 1 PF TF32 (3X) | 60 TF FP64 (3.4X); 3.35 TB/s memory bandwidth (1.5X), 80 GB HBM3 memory
     • Transformer model optimizations: 6X faster on the largest transformer models
     • Highest utilization efficiency and security: 7 fully isolated and secured instances with guaranteed QoS (2nd Gen MIG | Confidential Computing)
     • Fastest, scalable interconnect: 900 GB/s GPU-to-GPU connectivity (1.5X), up to 256 GPUs with NVLink Switch | 128 GB/s PCIe Gen5
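The sparsity caveat matters when comparing peak numbers: the deck's own comparison table (slide 13, note 2) states that the dense Tensor Core figure is half of the with-sparsity figure. A minimal Python sketch, using the H100 SXM values quoted on this slide (transcribed by hand, not queried from any API), shows the conversion:

```python
# Convert with-sparsity Tensor Core peaks to dense peaks (dense = sparse / 2,
# per note 2 of the data center GPU comparison slide). Values are the H100 SXM
# figures quoted above, in teraFLOPS.
H100_SXM_PEAK_WITH_SPARSITY_TFLOPS = {
    "FP8": 4000.0,   # "4PF FP8"
    "FP16": 2000.0,  # "2PF FP16"
    "TF32": 1000.0,  # "1PF TF32"
}

def dense_peak(sparse_tflops: float) -> float:
    """Dense (non-sparse) peak is half of the structured-sparsity peak."""
    return sparse_tflops / 2.0

for precision, sparse in H100_SXM_PEAK_WITH_SPARSITY_TFLOPS.items():
    print(f"{precision}: {sparse:.0f} TFLOPS with sparsity -> {dense_peak(sparse):.0f} TFLOPS dense")
```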
  5. NVIDIA HGX H100

     The world's most advanced enterprise AI infrastructure. Tensor Core FLOPS shown with sparsity; speedups compared to the prior generation.
     • Highest performance for AI and HPC: 4-way / 8-way H100 GPUs with 32 PetaFLOPS FP8; 3.6 TFLOPS FP16 in-network SHARP compute; NVIDIA-Certified high-performance offerings from all makers
     • Fastest, scalable interconnect: 4th Gen NVLink with 3X faster all-reduce communications; 3.6 TB/s bisection bandwidth; NVLink Switch System option scales up to 256 GPUs (a back-of-envelope all-reduce estimate follows this slide)
     • Secure computing: first HGX system with Confidential Computing
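To put the all-reduce and bandwidth figures into perspective, a standard ring all-reduce cost model says each of N GPUs moves roughly 2(N-1)/N times the message size over its links. The sketch below is a rough estimate only: the 900 GB/s per-GPU NVLink figure is taken from the H100 slide above, and the efficiency factor is an illustrative assumption, not a measured number.

```python
# Back-of-envelope ring all-reduce time on an 8-GPU HGX H100 node.
# Assumptions (not measurements): 900 GB/s per-GPU NVLink bandwidth from the
# H100 slide, and an illustrative 0.8 efficiency factor for protocol overhead.
def ring_allreduce_seconds(message_gb: float, num_gpus: int,
                           link_gb_per_s: float = 900.0,
                           efficiency: float = 0.8) -> float:
    """Classic ring all-reduce moves 2*(N-1)/N of the message per GPU."""
    traffic_gb = 2.0 * (num_gpus - 1) / num_gpus * message_gb
    return traffic_gb / (link_gb_per_s * efficiency)

# Example: all-reduce of 10 GB of data across 8 GPUs.
print(f"{ring_allreduce_seconds(10.0, 8) * 1e3:.1f} ms")
```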
  6. NVIDIA H100 PCIE

     Unprecedented performance, scalability, and security for mainstream servers. FP8, FP16, and TF32 figures include sparsity; X-factors are relative to A100.
     • Highest AI and HPC mainstream performance: 3 PF FP8 (5X) | 1.5 PF FP16 (2.4X) | 756 TF TF32 (2.4X) | 51 TF FP64 (2.6X); 6X faster dynamic programming with DPX Instructions (the recurrence pattern is sketched after this slide); 2 TB/s memory bandwidth, 80 GB HBM2e memory
     • Highest compute energy efficiency: configurable TDP from 200 W to 350 W; 2-slot FHFL mainstream form factor
     • Highest utilization efficiency and security: 7 fully isolated and secured instances with guaranteed QoS (2nd Gen MIG | Confidential Computing)
     • Highest-performing server connectivity: 128 GB/s PCIe Gen5; 600 GB/s GPU-to-GPU connectivity (5X PCIe Gen5) for up to 2 GPUs with NVLink Bridge
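DPX instructions accelerate the fused min/max-plus updates at the heart of dynamic programming (for example Smith-Waterman alignment or shortest-path relaxation). The snippet below only illustrates the scalar recurrence pattern that such hardware speeds up; it is plain Python, not NVIDIA's DPX API, and the scoring parameters are arbitrary examples.

```python
# Scalar illustration of the dynamic-programming cell update that DPX-style
# instructions fuse in hardware (max of several additions, clamped at zero),
# as used in Smith-Waterman local sequence alignment. Plain Python only.
def sw_cell(diag: int, up: int, left: int, match: bool,
            match_score: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Score of one alignment-matrix cell from its three neighbors."""
    substitution = diag + (match_score if match else mismatch)
    return max(0, substitution, up + gap, left + gap)

# Example update: neighbors scored 3 (diagonal), 1 (up), 2 (left), bases match.
print(sw_cell(3, 1, 2, match=True))  # -> 5
```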
  7. Production AI with NVIDIA H100 and NVIDIA AI Enterprise

     Develop and deploy enterprise AI with unmatched performance, security, and scalability.
     • 5-year subscription of NVIDIA AI Enterprise: a cloud-native software suite for development and deployment of AI
     • NVIDIA Enterprise Support: includes access to NVIDIA AI experts and priority notifications of the latest security fixes and maintenance releases
     • Enterprise training services: developers, data scientists, and IT professionals learn how to get the most out of the NVIDIA AI platform
     • Software activation: www.nvidia.com/activate-h100
  8. NVIDIA Ada Lovelace GPU Architecture

     NVIDIA RTX 6000 Ada Generation and NVIDIA L40 highlights:
     • 3rd Gen RT Cores, 4th Gen Tensor Cores
     • Up to 2X faster graphics and AI training performance than Ampere
     • Up to 2X the single-precision floating-point throughput of Ampere, plus support for the FP8 format (see the range sketch after this slide)
     • 3X encode and 3X decode, with support for AV1 encode/decode
     • 48 GB of GDDR6 ECC memory for working with the largest 3D models, renderings, simulations, and AI datasets
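The FP8 support mentioned above refers to 8-bit floating-point formats; the two variants commonly documented for Hopper/Ada Tensor Cores are E4M3 and E5M2. As a rough orientation, the sketch below compares their dynamic range with FP16; the values are hardcoded from the published format definitions, not queried from any library.

```python
# Largest finite values of FP16 and the two 8-bit floating-point variants
# (E4M3, E5M2) commonly documented for Hopper/Ada Tensor Cores. Numbers are
# hardcoded from the published format definitions for illustration only.
MAX_FINITE = {
    "FP16 (E5M10)": 65504.0,
    "FP8 E5M2": 57344.0,   # wider range, less precision
    "FP8 E4M3": 448.0,     # narrower range, more precision
}

for fmt, max_val in MAX_FINITE.items():
    print(f"{fmt:>12}: max finite value {max_val:g}")
```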
  9. NVIDIA L40

     Unprecedented visual computing performance for the data center. Available starting in December 2022. *Preliminary specifications, subject to change.
     • Next-generation graphics performance: 2X ray tracing performance vs. the Ampere generation
     • Powerful compute and AI: accelerated rendering, training, and inference
     • Data center ready: secure and measured boot with root of trust (RoT)
  10. NVIDIA RTX 6000 Ada Generation

     Ada for the enterprise. *Preliminary specifications, subject to change.
     • Built for the largest, most demanding industries and workloads
     • The most powerful professional GPU ever: 2X the performance of the RTX A6000
     • The RTX A6000 will remain available for customers needing NVLink or the largest possible GPU memory
     • Available from channel partners starting in December; OEM availability anticipated in early 2023
  11. Hopper GPU Architecture: Streaming Multiprocessor (SM)

     GH100 (Hopper) SM: 64x FP64 | 128x FP32 | 64x INT32 | 4x 4th Gen Tensor Cores | 256 KB L1
     GA100 (Ampere) SM: 32x FP64 | 64x FP32 | 64x INT32 | 4x 3rd Gen Tensor Cores | 192 KB L1
  12. Ada GPU Architecture: Streaming Multiprocessor (SM)

     AD102 (Ada Lovelace) SM: 64x FP32 | 64x FP32/INT32 | 4x 4th Gen Tensor Cores | 128 KB L1 | 1x 3rd Gen RT Core
     GA102 (Ampere) SM: 64x FP32 | 64x FP32/INT32 | 4x 3rd Gen Tensor Cores | 128 KB L1 | 1x 2nd Gen RT Core
  13. Data Center GPU Comparison (Sept '22)

     GPUs: H100 (SXM5 / PCIe), A100 (SXM4 / PCIe), A30, A2, L40, A40, A10, A16
     Design: Highest Perf AI, Big NLP, HPC, DA; High Perf Compute; Mainstream Compute; Entry-Level Small Footprint; Powerful Universal Graphics + AI; High Perf Graphics; Mainstream Graphics & Video with AI; High Density Virtual Desktop
     Form Factor: SXM5 / x16 PCIe Gen5 2-slot FHFL, 3 NVLink Bridge; SXM4 / x16 PCIe Gen4 2-slot FHFL, 3 NVLink Bridge; x16 PCIe Gen4 2-slot FHFL, 1 NVLink Bridge; x8 PCIe Gen4 1-slot LP; x16 PCIe Gen4 2-slot FHFL; x16 PCIe Gen4 2-slot FHFL, 1 NVLink Bridge; x16 PCIe Gen4 1-slot LP; x16 PCIe Gen4 2-slot FHFL
     Max Power: 700W / 350W; 500W / 300W; 165W; 40-60W; 300W; 300W; 150W; 250W
     FP64 TC | FP32 TFLOPS (2): 67 | 67, 51 | 51; 19.5 | 19.5; 10 | 10; NA | 4.5; NA | TBD (3); NA | 37; NA | 31; NA | 4x4.5
     TF32 TC | FP16 TC TFLOPS (2): 989 | 1979, 756 | 1513; 312 | 624; 165 | 330; 18 | 36; TBD | TBD (3); 150 | 300; 125 | 250; 4x18 | 4x36
     FP8 TC | INT8 TC TFLOPS/TOPS (2): 3958 | 3958, 3026 | 3026; NA | 1248; NA | 661; NA | 72; TBD | TBD (3); NA | 600; NA | 500; NA | 4x72
     GPU Memory / Speed: 80GB HBM3, 80GB HBM2e; 80GB HBM2e; 24GB HBM2; 16GB GDDR6; 48GB GDDR6; 48GB GDDR6; 24GB GDDR6; 4x 16GB GDDR6
     Multi-Instance GPU (MIG): Up to 7; Up to 7; Up to 4; -; -; -; -; -
     NVLink Connectivity: Up to 256 / 2; Up to 8 / 2; 2; -; -; 2; -; -
     Media Acceleration: 7 JPEG Decoder, 7 Video Decoder; 1 JPEG Decoder, 5 Video Decoder; 1 JPEG Decoder, 4 Video Decoder; 1 Video Encoder, 2 Video Decoder (+AV1 decode); 3 Video Encoder, 3 Video Decoder, 4 JPEG Decoder; 1 Video Encoder, 2 Video Decoder (+AV1 decode); 4 Video Encoder, 8 Video Decoder (+AV1 decode)
     Ray Tracing: -; -; Yes; Yes
     Transformer Engine: Yes; -; -; -; -; -
     DPX Instructions: Yes; -; -; -; -; -
     Graphics: For in-situ visualization (no NVIDIA vPC or RTX vWS); For in-situ visualization (no NVIDIA vPC or RTX vWS); Good; Top-of-Line; Best; Better; Good
     vGPU: Yes; Yes*; Yes
     Hardware Root of Trust: Internal and External; Internal with Option for External; Internal; Internal with Option for External
     Confidential Computing: Yes (1); -; -; -; -; -; -
     NVIDIA AI Enterprise: Add-on; Included; Add-on; Add-on
     Notes: (1) Supported on Azure NVIDIA A100, with reduced performance compared to A100 without Confidential Computing or H100 with Confidential Computing. (2) All Tensor Core numbers are with sparsity; without sparsity the value is halved. (3) Per-precision TFLOPS figures will be added in a future update.
  14. Accelerated Apps Catalog: Computational Electromagnetics, Computational Fluid Dynamics & Computational Structural Mechanics

     • Actran: Discontinuous Galerkin Method solver
     • Altair AcuSolve: Linear solvers for flow, temperature, turbulence model, and mesh movement equations
     • Altair EDEM: EDEM Simulator, a DEM solver | Integration with Ansys and Abaqus for FEA for bulk material simulation | Integration with Adams, Siemens and RecurDyn for multi-body dynamics | Integration with Ansys Fluent for particle-fluid systems
     • Altair nanoFluidX: Extremely fast | Single and multiphase flows | Arbitrary motion definition | Time-dependent acceleration | Inlets/outlets | Surface tension and adhesion | Steady-state thermal solutions through coupling
     • Altair OptiStruct: Direct solver | Eigenvalue solvers | Iterative solver
     • Altair ultraFluidX: CUDA-accelerated high-fidelity flow field computations based on the Lattice Boltzmann method | CUDA-aware MPI support for multi-GPU and multi-node usage | Efficient implementation of tailor-made automotive features, including rotating wheels, belt systems, boundary layer suction and porous media support
     • Ansys Fluent: Linear equation solver | Radiation heat transfer model | Discrete Ordinates radiation model
     • Ansys Icepak: Linear equation solver
     • Ansys Mechanical: Direct and iterative solvers
     • Ansys Polyflow: Direct solvers
     • MSC Nastran: Direct sparse solver
     • Particleworks: Explicit and implicit methods
     • Rocky DEM: Explicit DEM solver | 1-way & 2-way coupling with ANSYS Fluent and ANSYS Mechanical
     • Simcenter STAR-CCM+: Steady and unsteady, constant-density flows using the segregated flow solver on Linux only | Compatible with most turbulence models, RANS, DDES and Reynolds Stress Models | OpenGL-based rendering
     • Simulia Abaqus/Standard: Direct sparse solver | AMS solver | Steady-state dynamics
     • XFlow: Single & multiphase flow | Enforced motion | Adaptive refinement
     https://www.nvidia.com/en-us/gpu-accelerated-applications/
  15. Siemens Simcenter STAR-CCM+: NVIDIA 8x A100 is ~20X faster than a CPU server

     STAR-CCM+ 2210 CPU & GPU performance, model: DrivAer LES 128M. Speedup relative to AMD Milan with 128 CPU cores, higher is better (reproduced in the sketch after this slide):
     • 32 CPU cores, AMD Milan: 0.67x
     • 64 CPU cores, AMD Milan: 0.87x
     • 128 CPU cores, AMD Milan: 1.00x
     • 128 CPU cores + 6x NVIDIA A100 PCIe: 16.54x
     • 128 CPU cores + 8x NVIDIA A100 PCIe: 19.81x
     Tests run on NVIDIA servers. AMD Milan: 2x AMD EPYC 7763 (64-core processor), Samsung SSD 860 EVO 2TB, Rocky 8.6, 512 GB RAM. 8x NVIDIA A100 PCIe: 2x AMD EPYC 7742 (64-core processor), Samsung SSD 860 EVO 2TB, NVIDIA A100-PCIE-80GB, driver 520.50, Rocky 8.6, 512 GB RAM. Benchmark model: DrivAer LES 128M, 20 iterations. Source: https://blogs.sw.siemens.com/simcenter/les-on-gpus/
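The "~20X" headline follows directly from the bar values on the slide, which are speedups normalized to the 128-core AMD Milan run. A minimal Python sketch reproduces the comparison from those published numbers only (no new measurements):

```python
# Speedups from the STAR-CCM+ 2210 DrivAer LES 128M chart, normalized to the
# 128-core AMD Milan baseline (values transcribed from the slide).
speedup_vs_128_cores = {
    "32 CPU cores (AMD Milan)": 0.67,
    "64 CPU cores (AMD Milan)": 0.87,
    "128 CPU cores (AMD Milan)": 1.00,
    "128 CPU cores + 6x A100 PCIe": 16.54,
    "128 CPU cores + 8x A100 PCIe": 19.81,
}

baseline = speedup_vs_128_cores["128 CPU cores (AMD Milan)"]
best_gpu = speedup_vs_128_cores["128 CPU cores + 8x A100 PCIe"]
print(f"8x A100 vs 128-core CPU server: {best_gpu / baseline:.1f}x")  # ~19.8x, i.e. ~20X
print(f"8x A100 vs 6x A100: {best_gpu / 16.54:.2f}x")                 # ~1.20x from 2 extra GPUs
```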
  16. Ansys Fluent: Ansys Blog

     • Unleashing the Full Power of GPUs for Ansys Fluent, Part 1
       https://www.ansys.com/blog/unleashing-the-full-power-of-gpus-for-ansys-fluent
       32X speedup for automotive external aerodynamics | Laminar flow over a sphere | Backward-facing step
     • Unleashing the Full Power of GPUs for Ansys Fluent, Part 2
       https://www.ansys.com/blog/unleashing-the-full-power-of-gpus-for-ansys-fluent-part-2
       Speeding up CFD simulations of all sizes | Air flow through a porous filter | Thermal management using conjugate heat transfer (CHT) modeling | Water-cooled traction inverter | Louvered-fin heat exchanger | Vertically mounted heat sink
     • Revolutionizing CFD Simulations Through GPUs
  17. NVIDIA 冬の HPC Weeks (NVIDIA Winter HPC Weeks)

     • Format: online
     • Fee: free
     • Organizer: NVIDIA (エヌビディア合同会社)
     • In cooperation with: Global Scientific Information and Computing Center, Tokyo Institute of Technology; Information Technology Center, Nagoya University; GPU Computing Research Group (GPU コンピューティング研究会)
     • Week 1: Tuesday, November 29, 2022, 15:30-19:00
     • Week 2: Thursday, December 8, 2022, 15:30-19:00
     https://events.nvidia.com/nvidiahpcweeks
     Speakers: Hatem Ltaief (KAUST); 青木 尊之 (Takayuki Aoki), Tokyo Institute of Technology; 森 健策 (Kensaku Mori), Nagoya University | National Institute of Informatics
  18. Resources

     • NVIDIA Hopper Architecture: https://www.nvidia.com/en-us/data-center/technologies/hopper-architecture/
     • NVIDIA H100 Tensor Core GPU: https://www.nvidia.com/en-us/data-center/h100/
     • NVIDIA HGX AI Supercomputer: https://www.nvidia.com/en-us/data-center/hgx/
     • NVIDIA AI Enterprise: https://www.nvidia.com/en-us/data-center/products/ai-enterprise/
     • NVIDIA Ada Lovelace Architecture for Professional Visualization: https://www.nvidia.com/en-us/design-visualization/ada-lovelace-architecture/
     • NVIDIA L40 GPU: https://www.nvidia.com/en-us/data-center/l40/
     • NVIDIA RTX 6000 Ada Generation Graphics Card: https://www.nvidia.com/en-us/design-visualization/rtx-6000/
     • Accelerated Apps Catalog: https://www.nvidia.com/en-us/gpu-accelerated-applications/
     • NVIDIA 冬の HPC Weeks: https://events.nvidia.com/nvidiahpcweeks