
Materials Simulation, Accelerated! The Latest in GPU Computing / 2022-10-27 SCSK


2022/10/27 - Latest GPU / Materials Simulation Seminar: Accelerating Materials Development

Shinnosuke Furuya

October 27, 2022

Transcript

  1. NVIDIA GPUs at a Glance
     GPU architecture timeline: Fermi (2010), Kepler (2012), Maxwell (2014), Pascal (2016), Volta (2017), Turing (2018), Ampere (2020), Hopper (2022), Ada Lovelace (2022)
     [Chart: representative products per generation across the Data Center GPU, RTX / Quadro, and GeForce lines, including M2090, K80, M40, M10, K1, P100, V100, T4, A100, A30, A40, A2, A16, H100, and L40; 6000, K6000, M6000, P5000, GP100, GV100, RTX 8000, RTX A6000, and RTX 6000 Ada Generation; and GTX 580, GTX 780, GTX 980, GTX 1080, TITAN Xp, TITAN V, RTX 2080 Ti, TITAN RTX, RTX 3090 Ti, and RTX 4090]
  2. NVIDIA Hopper
     The engine for the world's AI infrastructure
     • World's most advanced chip
     • 4th Gen NVLink | Transformer Engine | 2nd Gen MIG | Confidential Computing | DPX Instructions
     • H100 SXM and H100 PCIe; H100 PCIe includes NVIDIA AI Enterprise (a 5-year subscription)
  3. NVIDIA H100
     Unprecedented performance, scalability, and security for every data center
     • Highest AI and HPC performance: 4 PF FP8 (6X) | 2 PF FP16 (3X) | 1 PF TF32 (3X) | 60 TF FP64 (3.4X); 3.35 TB/s (1.5X), 80 GB HBM3 memory
     • Transformer model optimizations: 6X faster on the largest transformer models
     • Highest utilization efficiency and security: 7 fully isolated and secured instances with guaranteed QoS; 2nd Gen MIG | Confidential Computing
     • Fastest, scalable interconnect: 900 GB/s GPU-to-GPU connectivity (1.5X), up to 256 GPUs with NVLink Switch | 128 GB/s PCIe Gen5
     FP8, FP16, and TF32 performance include sparsity; X-factors are compared to A100.
  4. NVIDIA HGX H100
     The world's most advanced enterprise AI infrastructure
     • Highest performance for AI and HPC: 4-way / 8-way H100 GPUs with 32 PetaFLOPS FP8; 3.6 TFLOPS FP16 in-network SHARP compute; NVIDIA-Certified high-performance offerings from all makers
     • Fastest, scalable interconnect: 4th Gen NVLink with 3X faster all-reduce communications; 3.6 TB/s bisection bandwidth; NVLink Switch System option scales up to 256 GPUs
     • Secure computing: first HGX system with Confidential Computing
     Tensor Core FLOPS shown with sparsity; speedups are compared to the prior generation.
  5. NVIDIA H100 PCIe
     Unprecedented performance, scalability, and security for mainstream servers
     • Highest mainstream AI and HPC performance: 3 PF FP8 (5X) | 1.5 PF FP16 (2.4X) | 756 TF TF32 (2.4X) | 51 TF FP64 (2.6X); 6X faster dynamic programming with DPX Instructions; 2 TB/s, 80 GB HBM2e memory
     • Highest compute energy efficiency: configurable TDP of 200 W to 350 W; 2-slot FHFL mainstream form factor
     • Highest utilization efficiency and security: 7 fully isolated and secured instances with guaranteed QoS; 2nd Gen MIG | Confidential Computing
     • Highest-performing server connectivity: 128 GB/s PCIe Gen5; 600 GB/s GPU-to-GPU connectivity (5X PCIe Gen5), up to 2 GPUs with NVLink Bridge
     FP8, FP16, and TF32 performance include sparsity; X-factors are compared to A100.
  6. Production AI with NVIDIA H100 and NVIDIA AI Enterprise
     Develop and deploy enterprise AI with unmatched performance, security, and scalability
     • 5-year subscription of NVIDIA AI Enterprise: a cloud-native software suite for development and deployment of AI
     • NVIDIA Enterprise Support: including access to NVIDIA AI experts and priority notifications of the latest security fixes and maintenance releases
     • Enterprise training services: developers, data scientists, and IT professionals learn how to get the most out of the NVIDIA AI platform
     Software activation: www.nvidia.com/activate-h100
  7. NVIDIA Ada Lovelace GPU Architecture
     NVIDIA RTX 6000 Ada Generation and NVIDIA L40 highlights:
     • 3rd Gen RT Cores, 4th Gen Tensor Cores
     • Up to 2x faster graphics and AI training performance than Ampere
     • Up to 2x the single-precision floating-point throughput of Ampere, with support for the FP8 format
     • 3x encode and 3x decode, with support for AV1 encode/decode
     • 48 GB of GDDR6 ECC memory for working with the largest 3D models, renderings, simulations, and AI datasets
  8. NVIDIA RTX 6000 Ada Generation
     Ada for the enterprise
     • Built for the largest, most demanding industries and workloads
     • Most powerful professional GPU ever; 2X the performance of the A6000
     • RTX A6000 will still be available for customers needing NVLink or the largest possible GPU memory
     • Available from channel partners starting in December; OEM availability anticipated in early 2023
     *Preliminary specifications, subject to change
  9. NVIDIA L40
     Unprecedented visual computing performance for the data center
     • Next-generation graphics performance: 2x ray tracing performance vs. the Ampere generation
     • Powerful compute and AI: accelerated rendering, training, and inference
     • Data center ready: secure and measured boot with root of trust (RoT)
     Available starting in December 2022
     *Preliminary specifications, subject to change
  10. Data Center GPU Comparison (Sept '22)
     Columns, in order: H100, A100, A30, A2, L40, A40, A10, A16. Rows listing two values for H100 or A100 give the SXM and PCIe variants; some rows carry merged cells on the original slide and are reproduced as shown.
     • Design: Highest Perf AI, Big NLP, HPC, DA; High Perf Compute; Mainstream Compute; Entry-Level Small Footprint; Powerful Universal Graphics + AI; High Perf Graphics; Mainstream Graphics & Video with AI; High Density Virtual Desktop
     • Form Factor: SXM5 / x16 PCIe Gen5, 2-slot FHFL, 3 NVLink Bridge; SXM4 / x16 PCIe Gen4, 2-slot FHFL, 3 NVLink Bridge; x16 PCIe Gen4, 2-slot FHFL, 1 NVLink Bridge; x8 PCIe Gen4, 1-slot LP; x16 PCIe Gen4, 2-slot FHFL; x16 PCIe Gen4, 2-slot FHFL, 1 NVLink Bridge; x16 PCIe Gen4, 1-slot LP; x16 PCIe Gen4, 2-slot FHFL
     • Max Power: 700 W / 350 W; 500 W / 300 W; 165 W; 40-60 W; 300 W; 300 W; 150 W; 250 W
     • FP64 TC | FP32 TFLOPS (2): 67 | 67 / 51 | 51; 19.5 | 19.5; 10 | 10; NA | 4.5; NA | TBD (3); NA | 37; NA | 31; NA | 4x4.5
     • TF32 TC | FP16 TC TFLOPS (2): 989 | 1979 / 756 | 1513; 312 | 624; 165 | 330; 18 | 36; TBD | TBD (3); 150 | 300; 125 | 250; 4x18 | 4x36
     • FP8 TC | INT8 TC TFLOPS/TOPS (2): 3958 | 3958 / 3026 | 3026; NA | 1248; NA | 661; NA | 72; TBD | TBD (3); NA | 600; NA | 500; NA | 4x72
     • GPU Memory / Speed: 80 GB HBM3 / 80 GB HBM2e; 80 GB HBM2e; 24 GB HBM2; 16 GB GDDR6; 48 GB GDDR6; 48 GB GDDR6; 24 GB GDDR6; 4x 16 GB GDDR6
     • Multi-Instance GPU (MIG): Up to 7; Up to 7; Up to 4; -; -; -; -; -
     • NVLink Connectivity: Up to 256 / 2; Up to 8 / 2; 2; -; -; 2; -; -
     • Media Acceleration: 7 JPEG decoders, 7 video decoders / 1 JPEG decoder, 5 video decoders / 1 JPEG decoder, 4 video decoders / 1 video encoder, 2 video decoders (+AV1 decode) / 3 video encoders, 3 video decoders, 4 JPEG decoders / 1 video encoder, 2 video decoders (+AV1 decode) / 4 video encoders, 8 video decoders (+AV1 decode)
     • Ray Tracing: - - Yes Yes
     • Transformer Engine: Yes - - - - -
     • DPX Instructions: Yes - - - - -
     • Graphics: For in-situ visualization (no NVIDIA vPC or RTX vWS) / For in-situ visualization (no NVIDIA vPC or RTX vWS) / Good / Top-of-Line / Best / Better / Good
     • vGPU: Yes Yes* Yes
     • Hardware Root of Trust: Internal and External / Internal with Option for External / Internal / Internal with Option for External
     • Confidential Computing: Yes (1) - - - - - -
     • NVIDIA AI Enterprise: Add-on / Included / Add-on / Add-on
     1. Supported on Azure NVIDIA A100 with reduced performance compared to A100 without Confidential Computing or H100 with Confidential Computing.
     2. All Tensor Core numbers are with sparsity; without sparsity, the value is half.
     3. Precision TFLOPS performance will be added in a future update.
  11. Accelerated Apps Catalog: Material Science, Molecular Dynamics & Quantum Chemistry
     • Amber: PMEMD explicit solvent and GB implicit solvent
     • DeePMD-kit: TensorFlow-based, high-performance classical MD and quantum (path-integral) MD package; Deep Potential series models; MPI and GPU support
     • GROMACS: implicit (5x) and explicit (2x) solvent
     • LAMMPS: Lennard-Jones, Gay-Berne, Tersoff
     • NAMD: full electrostatics with PME and most simulation features; 100M-atom capable
     • CP2K: DBCSR (sparse matrix multiply library)
     • GAMESS-US: Libqc with the Rys quadrature algorithm; Hartree-Fock; MP2 and CCSD
     • NWChem: triples part of Reg-CCSD(T); CCSD and EOMCCSD task schedulers
     • Quantum ESPRESSO: PWscf package: linear algebra (matrix multiply), explicit computational kernels, 3D FFTs
     • VASP: standard and hybrid DFT, including meta-GGA and vdW-DFT; all solvers (Davidson, RMM-DIIS, direct optimizers, linear response); real- and reciprocal-space projection schemes; all executable flavors (vasp_std, vasp_gam, and vasp_ncl)
     https://www.nvidia.com/en-us/gpu-accelerated-applications/
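Most of the packages above pick up GPU acceleration through runtime options rather than source changes. As one concrete illustration (not taken from this deck), the sketch below launches a GROMACS run with the nonbonded, PME, and update work offloaded to a GPU; the input name topol.tpr and the thread counts are placeholders to adapt to a real system and host.

```python
# Minimal sketch: running GROMACS with GPU offload from Python.
# Assumes a CUDA-enabled GROMACS build ("gmx" on PATH) and an existing run
# input file; "topol.tpr" and the thread counts below are placeholders.
import subprocess

cmd = [
    "gmx", "mdrun",
    "-s", "topol.tpr",   # run input file (placeholder name)
    "-nb", "gpu",        # offload short-range nonbonded interactions
    "-pme", "gpu",       # offload PME long-range electrostatics
    "-update", "gpu",    # offload integration and constraints
    "-ntmpi", "1",       # one thread-MPI rank per GPU in this sketch
    "-ntomp", "8",       # OpenMP threads per rank; tune for the host CPU
]
subprocess.run(cmd, check=True)
```

A multi-GPU run would use more thread-MPI ranks (for example -ntmpi 4 with a dedicated PME rank), and GROMACS 2022 can additionally enable direct GPU-to-GPU communication via the GMX_ENABLE_DIRECT_GPU_COMM environment variable.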
  12. NGC: Portal to Enterprise Services, Software, Support
     Artificial Intelligence | Metaverse | HPC (ngc.nvidia.com)
     Services (on-prem, cloud, edge):
     • NeMo LLM Service: LLM customization | managed API | playground
     • Riva Studio: custom voice | no-code GUI
     • BioNeMo Service: predict 3D protein structures and properties
     • Base Command: training tools | workflow management | monitoring and reporting
     • Fleet Command: managed edge AI | layered security | monitoring and reporting
     Software (Enterprise Catalog: NVIDIA AI software | support | proprietary software):
     • Containers: performance-optimized AI
     • Models: speech, CV, data science, and more
     • Jupyter Notebooks: end-to-end example workflows
     • Helm Charts: automate deployments on K8s
     Support: NVIDIA AI Enterprise Support (Business Standard | Business Critical | Dedicated TAM)
     Containers!!
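The NGC catalog distributes the HPC applications above as prebuilt containers. The sketch below pulls and runs one with GPU access via Docker and the NVIDIA Container Toolkit; the image path and tag are illustrative only (look up the exact name and current tag in the catalog at ngc.nvidia.com), and the host work directory is a placeholder.

```python
# Minimal sketch: pulling and running an NGC container with GPU access.
# The image name/tag is a hypothetical example; check the NGC catalog for the
# exact path, and replace the host directory with one holding your inputs.
import subprocess

image = "nvcr.io/hpc/gromacs:2022.3"   # hypothetical tag; verify in the catalog

subprocess.run(["docker", "pull", image], check=True)
subprocess.run([
    "docker", "run", "--rm",
    "--gpus", "all",                   # expose all GPUs (NVIDIA Container Toolkit)
    "-v", "/path/to/workdir:/work",    # placeholder host directory with inputs
    "-w", "/work",
    image,
    "gmx", "--version",                # sanity check that the container sees gmx
], check=True)
```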
  13. GPU Spec
     GPU          FP64         FP32         Memory        Memory BW
     H100 SXM     34 TFLOPS    67 TFLOPS    80GB HBM3     3.35 TB/s
     H100 PCIe    26 TFLOPS    51 TFLOPS    80GB HBM2e    2 TB/s
     A100 SXM     9.7 TFLOPS   19.5 TFLOPS  80GB HBM2e    2039 GB/s
     A100 PCIe    9.7 TFLOPS   19.5 TFLOPS  80GB HBM2e    1935 GB/s
     A30          5.2 TFLOPS   10.3 TFLOPS  24GB HBM2     933 GB/s
     A40          -            37.4 TFLOPS  48GB GDDR6    696 GB/s
     V100 SXM     7.8 TFLOPS   15.7 TFLOPS  32GB HBM2     900 GB/s
     V100S PCIe   8.2 TFLOPS   16.4 TFLOPS  32GB HBM2     1134 GB/s
     T4           -            8.1 TFLOPS   16GB GDDR6    300 GB/s
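The FP64 and FP32 columns are theoretical peaks, which can be reproduced from SM count, per-SM FMA throughput, and boost clock. The sketch below checks the A100 row; the 108 SMs, 1410 MHz boost clock, and per-SM FMA counts are NVIDIA's published A100 figures, not numbers taken from this slide.

```python
# Minimal sketch: reproducing theoretical peak TFLOPS from architecture figures.
# The A100 inputs (108 SMs, 1.410 GHz boost clock, 32 FP64 / 64 FP32 FMA units
# per SM) are NVIDIA's published A100 specifications, not values from this deck.

def peak_tflops(sms: int, fma_per_sm: int, clock_ghz: float) -> float:
    """Peak TFLOPS = SMs * FMA units per SM * 2 FLOP per FMA * clock (GHz) / 1000."""
    return sms * fma_per_sm * 2 * clock_ghz / 1000.0

a100_fp64 = peak_tflops(sms=108, fma_per_sm=32, clock_ghz=1.410)   # ~9.7 TFLOPS
a100_fp32 = peak_tflops(sms=108, fma_per_sm=64, clock_ghz=1.410)   # ~19.5 TFLOPS
print(f"A100 FP64 ~ {a100_fp64:.1f} TFLOPS, FP32 ~ {a100_fp32:.1f} TFLOPS")
```

Both results match the A100 SXM/PCIe entries in the table above.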
  14. Benchmark Datasets: Molecular Dynamics
     • JAC (DHFR; dihydrofolate reductase): 23,588 atoms
     • FactorIX: 90,906 atoms
     • Cellulose: 408,609 atoms
     • STMV (satellite tobacco mosaic virus): 1,067,095 atoms
     • ADH Dodec: 95,561 atoms
     • ApoA1 (apolipoprotein A1): 92,224 atoms
  15. Amber
     Version: 22.0-AT_22.3 (Intel CPU: 20.12-AT_21.12)
     [Charts: DC-Cellulose_NVE and DC-Cellulose_NPT throughput in ns/day on A100 SXM, A100 PCIe, A30, A40, V100 SXM, V100S PCIe, and T4, with 0, 1, 2, 4, and 8 GPUs]
     https://developer.nvidia.com/hpc-application-performance
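The MD benchmarks on this and the following slides report throughput either as ns/day (Amber, GROMACS, NAMD) or as atom-timesteps per second (LAMMPS). The sketch below shows how both metrics follow from the wall-clock stepping rate; the 2 fs timestep and the steps-per-second value are example inputs, while the atom count is the Cellulose system from the dataset slide.

```python
# Minimal sketch: converting an MD stepping rate into the throughput metrics
# used in these benchmark charts. The 2 fs timestep and the steps/s value are
# example inputs; the atom count is the Cellulose system listed on slide 14.

SECONDS_PER_DAY = 86_400

def ns_per_day(steps_per_second: float, timestep_fs: float) -> float:
    """Simulated nanoseconds per wall-clock day (1 ns = 1e6 fs)."""
    return steps_per_second * timestep_fs * SECONDS_PER_DAY / 1e6

def atom_timesteps_per_second(steps_per_second: float, n_atoms: int) -> float:
    """LAMMPS-style metric: atom-timesteps advanced per wall-clock second."""
    return steps_per_second * n_atoms

steps = 1200.0                # example wall-clock stepping rate (steps/s)
print(ns_per_day(steps, timestep_fs=2.0))                 # ~207 ns/day
print(atom_timesteps_per_second(steps, n_atoms=408_609))  # ~4.9e8 atom-steps/s
```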
  16. GROMACS
     Version: 2022.2
     [Charts: Cellulose and ADH Dodec throughput in ns/day on A100 SXM, A100 PCIe, A30, A40, V100 SXM, V100S PCIe, and T4, with 0, 1, 2, 4, and 8 GPUs]
     https://developer.nvidia.com/hpc-application-performance
  17. LAMMPS
     Version: stable_23Jun2022_update1
     [Charts: ReaxFF/C and SNAP throughput in atom-timesteps/s on A100 SXM, A100 PCIe, A30, A40, V100 SXM, V100S PCIe, and T4, with 0, 1, 2, 4, and 8 GPUs]
     https://developer.nvidia.com/hpc-application-performance
  18. NAMD
     Version: 3.0a13 (GPU, AMD CPU); 2.15a AVX512 (Intel CPU)
     [Charts: apoa1_npt_cuda and stmv_npt_cuda average throughput in ns/day on A100 SXM, A100 PCIe, A30, A40, V100 SXM, V100S PCIe, and T4, with 0, 1, 2, 4, and 8 GPUs]
     https://developer.nvidia.com/hpc-application-performance
  19. Quantum ESPRESSO
     Version: 7.0 (CPU); 7.1 (GPU)
     [Charts: AUSURF112-jR total CPU time in seconds and speedup on A100 SXM, A100 PCIe, A30, A40, V100 SXM, V100S PCIe, and T4, with 0, 1, 2, 4, and 8 GPUs]
     https://developer.nvidia.com/hpc-application-performance
  20. Features Available and Accelerated in VASP 6.3
     (Slide legend: existing acceleration / new acceleration / acceleration work in progress / on acceleration roadmap)
     • Levels of theory: standard DFT (incl. meta-GGA, vdW-DFT); hybrid DFT (double buffered); cubic-scaling RPA (ACFDT, GW); Bethe-Salpeter equations (BSE)
     • Projection scheme: real space; reciprocal space
     • Solvers / main algorithm: Davidson (+ adaptively compressed exchange); RMM-DIIS; Davidson+RMM-DIIS; direct optimizers (Damped, All); linear response
     • Executable flavors: standard variant; gamma-point simplification variant; non-collinear spin variant
  21. Details on Dataset Si-Huge
     • Cell size: 15.4 x 30.7 x 30.7 Å
     • Atoms: 512 Si
     • 14 k-points, 1281 bands, 245.4 eV energy cutoff, 89,614 plane waves
     • Standard DFT (GGA: PW91)
     • Algo = Normal (Davidson)
     • Real-space projection scheme
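These settings map onto a handful of VASP INCAR tags. Below is a minimal sketch of an INCAR matching the Si-Huge description (Davidson algorithm, PW91 GGA, 245.4 eV cutoff, real-space projection); tags the slide does not mention (smearing, convergence criteria, parallelization) are left out here and would need to be set for a production run.

```python
# Minimal sketch: an INCAR matching the Si-Huge settings listed above.
# Only tags implied by the slide are set; everything else (ISMEAR, EDIFF, ...)
# is left at VASP defaults here and would be tuned for a real calculation.
incar = {
    "SYSTEM": "Si-Huge (512 Si atoms)",
    "ALGO":   "Normal",   # blocked-Davidson diagonalization
    "GGA":    "91",       # PW91 exchange-correlation functional
    "ENCUT":  245.4,      # plane-wave cutoff energy in eV
    "NBANDS": 1281,       # number of bands from the slide
    "LREAL":  "Auto",     # real-space projection scheme
}

with open("INCAR", "w") as f:
    for tag, value in incar.items():
        f.write(f"{tag} = {value}\n")
```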
  22. Details on Dataset CuC_vdW
     • Cell size: 10.3 x 10.3 x 31.5 Å
     • Atoms: 96 Cu, 2 C (98 total)
     • 5 k-points, 638 bands, 400 eV energy cutoff, 52,405 plane waves
     • Standard DFT (GGA: PBE)
     • Algo = VeryFast (RMM-DIIS)
     • Real-space projection scheme
  23. Details on Dataset GaAsBi_512
     • Cell size: 22.6 x 22.6 x 22.6 Å
     • Atoms: 256 Ga, 255 As, 1 Bi (512 total)
     • 4 k-points, 1536 bands, 313 eV energy cutoff, 145,484 plane waves
     • Standard DFT (GGA: PBE)
     • Algo = Fast (Davidson + RMM-DIIS)
     • Real-space projection scheme
  24. Details on Dataset Si256_VJT_HSE06
     • Cell size: 18.9 x 18.9 x 18.9 Å
     • Atoms: 256 Si
     • 1 k-point (Γ), 640 bands, 250 eV energy cutoff, 23,589 plane waves
     • Hybrid DFT (PBE0)
     • Algo = Damped (direct minimizer)
     • Real-space projection scheme
  25. VASP 6.2.0 vs 6.1.2
     Dataset: Si256_VJT_HSE06; CPU-only baseline: 2x EPYC (Rome 7742); GPUs: A100 SXM4 80 GB
     [Chart: speedup relative to 6.1.2 on EPYC for the CPU-only baseline and for 1, 2, 4, and 8 A100 GPUs, comparing versions 6.1.2 and 6.2.0; 6.2.0 shows better than a 22% improvement]