Are GPUs Useful for Computational Mechanics Simulation? / 2021-09-21 CMD2021

2021/09/21 - The Japan Society of Mechanical Engineers, 34th Computational Mechanics Conference (CMD2021)

Shinnosuke Furuya

September 21, 2021

Transcript

  1. Are GPUs Useful for Computational Mechanics Simulation?

    Shinnosuke Furuya, Ph.D., HPC Developer Relations, NVIDIA. 2021/09/21
  2. Founded in 1993

    Jensen Huang, Founder & CEO. 19,000 employees. $16.7B in FY21. Santa Clara, Tokyo, 50+ offices.
  3. NVIDIA GPUS

  4. NVIDIA GPUS AT A GLANCE

    Architectures: Fermi (2010), Kepler (2012), Maxwell (2014), Pascal (2016), Volta (2017), Turing (2018), Ampere (2020)
    Data Center GPU: M2090, K80, K1, M40, M10, P100, P4, V100, T4, A100, A30, A40, A10, A16
    RTX / Quadro: 6000, K6000, M6000, P5000, GP100, GV100, RTX 8000, RTX A6000
    GeForce: GTX 580, GTX 780, GTX 980, GTX 1080, TITAN Xp, TITAN V, RTX 2080 Ti, RTX 3090
  5. AMPERE GPU ARCHITECTURE

    A100 Tensor Core GPU: 7 GPCs, 7 or 8 TPCs/GPC, 2 SMs/TPC (108 SMs/GPU), 5 HBM2 stacks, 12 NVLink links
    GPC: GPU Processing Cluster, TPC: Texture Processing Cluster, SM: Streaming Multiprocessor
  6. AMPERE GPU ARCHITECTURE: Streaming Multiprocessor (SM)

    GA100 (A100, A30): 32 FP64 CUDA Cores, 64 FP32 CUDA Cores, 4 Tensor Cores
    GA102 (A40, A10): Up to 128 FP32 CUDA Cores, 1 RT Core, 4 Tensor Cores
  7. DATA CENTER PRODUCT COMPARISON (SEPT 2021)

    | Performance | A100* | A30* | A40 | A10 | T4 |
    |---|---|---|---|---|---|
    | FP64 (no Tensor Core) | 9.7 TFlops | 5.2 TFlops | - | - | - |
    | FP64 Tensor Core | 19.5 TFlops | 10.3 TFlops | N/A | N/A | N/A |
    | FP32 (no Tensor Core) | 19.5 TFlops | 10.3 TFlops | 37.4 TFlops | 31.2 TFlops | 8.1 TFlops |
    | TF32 Tensor Core | 156 / 312 TFlops* | 82 / 165 TFlops* | 74.8 / 149.6 TFlops* | 62.5 / 125 TFlops* | N/A |
    | FP16 Tensor Core | 312 / 624 TFlops* | 165 / 330 TFlops* | 149.7 / 299.4 TFlops* | 125 / 250 TFlops* | 65 TFlops |
    | BFloat16 Tensor Core | 312 / 624 TFlops* | 165 / 330 TFlops* | 149.7 / 299.4 TFlops* | 125 / 250 TFlops* | N/A |
    | Int8 Tensor Core | 624 / 1248 TOPS* | 330 / 661 TOPS* | 299.3 / 598.6 TOPS* | 250 / 500 TOPS* | 130 TOPS |
    | Int4 Tensor Core | 1248 / 2496 TOPS* | 661 / 1321 TOPS* | 598.7 / 1197.4 TOPS* | 500 / 1000 TOPS* | 260 TOPS |

    * Performance with structured sparse matrix

    | | A100 (SXM4) | A100 (PCIe) | A30 | A40 | A10 | T4 |
    |---|---|---|---|---|---|---|
    | Form Factor | SXM4 module on baseboard | x16 PCIe Gen4, 2 Slot FHFL, 3 NVLINK bridges | x16 PCIe Gen4, 2 Slot FHFL, 1 NVLINK bridge | x16 PCIe Gen4, 2 Slot FHFL, 1 NVLINK bridge | x16 PCIe Gen4, 1 Slot FHFL | x16 PCIe Gen3, 1 Slot LP |
    | GPU Memory | 40 / 80 GB HBM2e | 40 / 80 GB HBM2e | 24 GB HBM2 | 48 GB GDDR6 | 24 GB GDDR6 | 16 GB GDDR6 |
    | GPU Memory Bandwidth | 1555 / 2039 GB/s | 1555 / 1935 GB/s | 933 GB/s | 696 GB/s | 600 GB/s | 300 GB/s |
    | Multi-Instance GPU | Up to 7 | Up to 7 | Up to 4 | N/A | N/A | N/A |
    | Media Acceleration | 1 JPEG Decoder, 5 Video Decoder | 1 JPEG Decoder, 5 Video Decoder | 1 JPEG Decoder, 4 Video Decoder | 1 Video Encoder, 2 Video Decoder (+AV1 decode) | 1 Video Encoder, 2 Video Decoder (+AV1 decode) | 1 Video Encoder, 2 Video Decoder |
    | Ray Tracing | No | No | No | Yes | Yes | Yes |
    | Graphics | For in-situ visualization (no vPC/vQuadro) | For in-situ visualization (no vPC/vQuadro) | For in-situ visualization (no vPC/vQuadro) | Best | Better | Good |
    | Max Power | 400 W | 250 / 300 W | 165 W | 300 W | 150 W | 70 W |
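The non-Tensor-Core A100 figures on slide 7 can be cross-checked against the SM layout given on slides 5 and 6. The sketch below is only a back-of-the-envelope check: it assumes a boost clock of 1.41 GHz, which is not stated anywhere in the deck, and it counts one fused multiply-add as two floating-point operations. The function name `peak_tflops` is illustrative.

```python
# Theoretical peak = SMs x CUDA cores per SM x FLOPs per FMA x clock.
A100_SMS = 108            # slide 5: 108 SMs/GPU
FP64_CORES_PER_SM = 32    # slide 6: GA100
FP32_CORES_PER_SM = 64    # slide 6: GA100
FLOPS_PER_FMA = 2         # a fused multiply-add counts as two operations
BOOST_CLOCK_GHZ = 1.41    # ASSUMPTION: clock is not given in the deck


def peak_tflops(sms: int, cores_per_sm: int, clock_ghz: float) -> float:
    """Return the theoretical peak in TFlops for the given SM layout."""
    gflops = sms * cores_per_sm * FLOPS_PER_FMA * clock_ghz
    return gflops / 1000.0


fp64 = peak_tflops(A100_SMS, FP64_CORES_PER_SM, BOOST_CLOCK_GHZ)
fp32 = peak_tflops(A100_SMS, FP32_CORES_PER_SM, BOOST_CLOCK_GHZ)

print(f"A100 FP64 peak: {fp64:.1f} TFlops")  # prints 9.7
print(f"A100 FP32 peak: {fp32:.1f} TFlops")  # prints 19.5
```

Under these assumptions the results round to 9.7 TFlops (FP64) and 19.5 TFlops (FP32), which agrees with the A100 entries in the comparison.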
  8. TF32 TENSOR CORE TO SPEED UP FP32

    Range of FP32 and precision of FP16. Input in FP32 and accumulation in FP32.
    Data flow: FP32 matrices are formatted to TF32 and multiplied, accumulated in FP32, and output as an FP32 matrix.

    | Format | Sign | Range (exponent) | Precision (mantissa) |
    |---|---|---|---|
    | FP32 | 1 bit | 8 bits | 23 bits |
    | TF32 | 1 bit | 8 bits | 10 bits |
    | FP16 | 1 bit | 5 bits | 10 bits |
    | BFloat16 | 1 bit | 8 bits | 7 bits |
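To make the TF32 layout on slide 8 concrete, the following sketch emulates it in software by keeping the 8-bit FP32 exponent and rounding the 23-bit FP32 mantissa to 10 bits. It is an illustration, not a description of what the Tensor Cores do internally: the round-to-nearest-even mode is an assumption, NaN and infinity are not handled, and `to_tf32` is a hypothetical helper name.

```python
import math
import struct


def to_tf32(x: float) -> float:
    """Round x to FP32, then keep only 10 explicit mantissa bits."""
    # Reinterpret the FP32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # 23 - 10 = 13 low mantissa bits are discarded.
    # ASSUMPTION: round to nearest, ties to even.
    lsb = (bits >> 13) & 1
    bits = (bits + 0x0FFF + lsb) & 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


pi_tf32 = to_tf32(math.pi)
rel_err = abs(pi_tf32 - math.pi) / math.pi
print(pi_tf32)            # prints 3.140625
print(rel_err < 2**-11)   # prints True
```

The relative error stays below 2^-11, which is what 10 explicit mantissa bits allow, while the exponent range is unchanged from FP32.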
  9. GPU ACCELERATED APPLICATION PERFORMANCE

  10. GPU ACCELERATED APPS

    GPU Applications: https://www.nvidia.com/en-us/gpu-accelerated-applications/
    Annotations: "Search from here", "Supported features"
  11. GPU ACCELERATED APPS

    | GPU Applications | GPU scaling | Supported features |
    |---|---|---|
    | LS-DYNA | - | - |
    | ABAQUS/STANDARD | Multi-GPU / Multi-Node | Direct sparse solver; AMS Solver; Steady State Dynamics |
    | STAR-CCM+ | Single GPU / Single Node | Rendering |
    | Fluent | Multi-GPU / Multi-Node | Linear equation solver; Radiation heat transfer model; Discrete Ordinate Radiation model |
    | Nastran | - | - |
    | Particleworks | Multi-GPU / Multi-Node | Explicit and Implicit methods |
  12. GPU ACCELERATED APPS

    GPU Applications Catalog: https://images.nvidia.com/aem-dam/Solutions/Data-Center/tesla-product-literature/gpu-applications-catalog.pdf

  13. GPU ACCELERATED APPS

    HPC Application Performance: https://developer.nvidia.com/hpc-application-performance/

  14. DS SIMULIA CST STUDIO

    Time Domain Solver, FIT Performance (higher is better).
    Chart: throughput (Mcells/sec, axis 0 to 25000) versus simulation size (100^3, 150^3, 200^3, 300^3) for A6000_1GPU, A6000_2GPU, A100_1GPU and A100_2GPU. Speedup labels on the chart: 3.5x, 3.8x, 2.5x, 3.2x.
    A100 system: Dual Xeon Gold 6148 (2x20 cores), 384GB RAM, DDR4-2666M, 2 A100-PCIE-40GB GPUs, RHEL7.7
    A6000 system: Dual AMD EPYC 7F72 (24 cores), 1TB RAM, A6000 GPUs, NV Driver 455.45.01, Win 10
  15. ALTAIR CFD (SPH SOLVER)

    Ampere PCIe scaling performance. Aerospace Gearbox, size: ~21M fluid particles (~26.7M total), 1000 timesteps.
    Chart: Aerospace Gearbox 26M, Altair CFD SPH Solver (Altair® nanoFluidX®) on NVIDIA EGX Server. Relative performance (axis 0X to 7X, higher is better):

    | GPUs | A100 80GB PCIe | A100 40GB PCIe | A30 PCIe | A40 PCIe |
    |---|---|---|---|---|
    | 1 GPU | 1.0 | 1.0 | 0.5 | 1.0 |
    | 2 GPUs | 1.8 | 1.8 | 1.0 | 1.7 |
    | 4 GPUs | 3.5 | 3.5 | 1.9 | 3.4 |
    | 8 GPUs | 6.2 | 6.3 | 3.6 | 6.1 |

    Tests run on a server with 2x AMD EPYC 7742 @ 2.25GHz (3.4GHz Turbo, Rome), 64-core CPU, Driver 465.19.01, 512GB RAM, 8x NVIDIA GPUs, Ubuntu 20.04, ECC off, HT off.
    Relative performance calculated based on the average model performance (nanoseconds/particles/timesteps) on Altair nanoFluidX 2021.0.
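The relative-performance labels on slide 15 can be turned into a parallel efficiency, that is, the measured speedup divided by the ideal speedup for the same GPU count. The sketch below assumes that the sixteen labels read as four series (1, 2, 4 and 8 GPUs) over four GPU models, which is an interpretation of the flattened chart rather than something the transcript states; the names `RELATIVE_PERF` and `parallel_efficiency` are illustrative.

```python
GPU_COUNTS = [1, 2, 4, 8]

# ASSUMPTION: chart labels grouped per GPU model, ordered by GPU count.
RELATIVE_PERF = {
    "A100 80GB PCIe": [1.0, 1.8, 3.5, 6.2],
    "A100 40GB PCIe": [1.0, 1.8, 3.5, 6.3],
    "A30 PCIe": [0.5, 1.0, 1.9, 3.6],
    "A40 PCIe": [1.0, 1.7, 3.4, 6.1],
}


def parallel_efficiency(perf, counts):
    """Speedup over the smallest run divided by the ideal speedup."""
    per_gpu_base = perf[0] / counts[0]
    return [p / (n * per_gpu_base) for p, n in zip(perf, counts)]


efficiency = {
    gpu: parallel_efficiency(perf, GPU_COUNTS)
    for gpu, perf in RELATIVE_PERF.items()
}

for gpu, eff in efficiency.items():
    print(gpu, [f"{e:.0%}" for e in eff])
```

With this reading, every model keeps roughly 76% to 90% efficiency at 8 GPUs. The labels are rounded to one decimal, so the percentages are only indicative.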
  16. ROCKY DEM 4.4 ROTATING DRUM

    Benchmark with polyhedron and spherical shaped particles (higher is better).
    Charts: "Polyhedron Cells Performance on V100" (1, 2 and 4 x V100) and "Polyhedron Cells Performance on A100" (1, 2 and 4 x A100). Vertical axis: speed-up relative to 8 CPU cores of an Intel Xeon Gold 6230 at 2.10 GHz (0 to 100). Horizontal axis: number of particles per GPU (x1000): 31.25, 62.5, 125, 250, 500, 1000, 2000.
    38x speedup for 1xV100 when compared with an 8-core CPU Intel Xeon Gold 6230 @ 2.10GHz
    47x speedup for 1xA100 when compared with an 8-core CPU Intel Xeon Gold 6230 @ 2.10GHz
    Case description: particles in a drum rotating at 1 rev/sec, simulated for 20,000 solver iterations
    Hardware on Oracle cloud: CPU: Intel Xeon Gold 6230 @ 2.1 GHz (8 cores); GPU: NVIDIA A100 and V100
  17. PARAVIEW WITH GPU ACCELERATION

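  18. SCIENTIFIC VISUALIZATION WITH PARAVIEW AND NVIDIA OPTIX

    * Ray tracing is one of the NVIDIA RTX technologies, and optimized ray tracing APIs such as NVIDIA OptiX are available
    * NVIDIA OptiX is integrated into ParaView
    * On GPUs equipped with RT Cores, ray tracing is accelerated further
    * Rendering takes time
    * Once you start caring about image quality, you re-render repeatedly while changing settings such as light sources and textures
    * Ray-traced visualization of large datasets, such as those produced by scientific computing, is stressful
    * Using NVIDIA OptiX through ParaView, with ray tracing further accelerated by RT Cores, should be a welcome improvement
    From an article on the #NVIDIA技術ブログ (NVIDIA technical blog): https://medium.com/nvidiajapan/62b7c70e732a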
  19. SCIENTIFIC VISUALIZATION WITH PARAVIEW AND NVIDIA OPTIX

    From an article on the #NVIDIA技術ブログ (NVIDIA technical blog): https://medium.com/nvidiajapan/62b7c70e732a
  20. GPU COMPUTING IN FUTURE

  21. GIANT MODELS PUSHING LIMITS OF EXISTING ARCHITECTURE

    Requires a New Architecture. 100 TRILLION PARAMETER MODELS BY 2023.
    System bandwidth bottleneck (diagram: x86 CPU with DDR4, four GPUs with HBM2e):
    * GPU: 8,000 GB/sec
    * CPU: 200 GB/sec
    * PCIE Gen4 (Effective Per GPU): 16 GB/sec
    * Mem-to-GPU: 64 GB/sec
    Chart: model size (trillions of parameters, log axis from 0.00001 to 1000) versus year (2018 to 2023), showing ELMo (94M), BERT-Large (340M), GPT-2 (1.5B), Megatron-LM (8.3B), T5 (11B), Turing-NLG (17.2B) and GPT-3 (175B).
  22. NVIDIA GRACE

    Breakthrough CPU Designed for Giant-Scale AI and HPC Applications. Availability 2023.
    * FASTEST INTERCONNECTS: >900 GB/s Cache Coherent NVLink CPU To GPU (14x); >600GB/s CPU To CPU (2x)
    * NEXT GENERATION ARM NEOVERSE CORES: >300 SPECrate2017_int_base est.
    * HIGHEST MEMORY BANDWIDTH: >500GB/s LPDDR5x w/ ECC; >2x Higher B/W; 10x Higher Energy Efficiency
  23. TURBOCHARGED TERABYTE SCALE ACCELERATED COMPUTING

    Evolving Architecture For New Workloads.

    | | CURRENT x86 ARCHITECTURE | INTEGRATED CPU-GPU ARCHITECTURE |
    |---|---|---|
    | Diagram labels | x86 with 4 GPUs | 4 GRACE with 4 GPUs |
    | CPU memory / GPU memory | DDR4 / HBM2e | LPDDR5x / HBM2e |
    | GPU | 8,000 GB/sec | 8,000 GB/sec |
    | CPU | 200 GB/sec | 500 GB/sec |
    | CPU-GPU link | PCIE Gen4 (Effective Per GPU) 16 GB/sec | NVLink 500 GB/sec |
    | Mem-to-GPU | 64 GB/sec | 2,000 GB/sec |
    | Transfer 2TB | in 30 secs | in 1 sec |

    * 3 DAYS FROM 1 MONTH: Fine-Tune Training of 1T Model
    * REAL-TIME INFERENCE ON 0.5T MODEL: Interactive Single Node NLP Inference
    Bandwidth claims rounded to nearest hundred for illustration. Performance results based on projections on these configurations. Grace: 8xGrace and 8xA100 with 4th Gen NVIDIA NVLink connection between CPU and GPU; x86: DGX A100. Training: 1 month of training is fine-tuning a 1T parameter model on a large custom data set on 64xGrace+64xA100 compared to 8xDGXA100 (16xX86+64xA100). Inference: 530B parameter model on 8xGrace+8xA100 compared to DGXA100.
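The "Transfer 2TB" figures on slide 23 follow from dividing the data size by the Mem-to-GPU bandwidth quoted for each architecture. The sketch below is a rough check that assumes 1 TB = 1000 GB and ignores latency and protocol overhead; `transfer_seconds` is an illustrative helper, not something taken from the deck.

```python
GB_PER_TB = 1000  # ASSUMPTION: decimal units

# Mem-to-GPU bandwidth in GB/sec, as quoted on slide 23.
MEM_TO_GPU_GB_PER_S = {
    "current x86 architecture": 64,
    "integrated CPU-GPU architecture": 2000,
}


def transfer_seconds(size_tb: float, bandwidth_gb_per_s: float) -> float:
    """Ideal time to move size_tb terabytes at the given bandwidth."""
    return size_tb * GB_PER_TB / bandwidth_gb_per_s


times = {
    name: transfer_seconds(2, bw) for name, bw in MEM_TO_GPU_GB_PER_S.items()
}

for name, seconds in times.items():
    print(f"{name}: {seconds:.2f} s")
# prints 31.25 s for the x86 case and 1.00 s for the integrated case
```

The x86 case comes out at about 31 seconds, which the slide rounds to 30, and the integrated case at 1 second.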
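  24. NVIDIA Autumn HPC Weeks

    * Week 1: Monday, October 11, 2021. GPU Computing & Network Deep Dive
    * Week 2: Monday, October 18, 2021. HPC + Machine Learning
    * Week 3: Monday, October 25, 2021. GPU Applications
    https://events.nvidia.com/hpcweek/
    Speakers: Stephen W. Keckler (NVIDIA), Torsten Hoefler (ETH Zürich), Takayuki Aoki (青木 尊之, Tokyo Institute of Technology), Tobias Weinzierl (Durham University), James Legg (University College London), Mark Turner (Durham University), Daisuke Okanohara (岡野原 大輔, Preferred Networks), Rio Yokota (横田 理央, Tokyo Institute of Technology), Kazuki Yoshizoe (美添 一樹, Kyushu University), Yutaka Akiyama (秋山 泰, Tokyo Institute of Technology), Tsuyoshi Ichimura (市村 強, The University of Tokyo), Tomohiro Takaki (高木 知弘, Kyoto Institute of Technology)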
  25. SUMMARY

    * Current NVIDIA data center GPUs: A100 & A30 for FP64, A40 & A10 for FP32
    * GPU accelerated application performance: Simulia CST Studio, Altair CFD and Rocky DEM achieve excellent performance on GPUs
    * ParaView with GPU acceleration: ray tracing is accelerated with RT Cores
    * In future: the Grace CPU improves memory bandwidth between CPU and GPU