
Various Applications Accelerated by GPUs / 2021-11-25 GPU2021

2021/11/25 - GPU Computing Workshop for Advanced Manufacturing (GPU2021)

Shinnosuke Furuya

November 25, 2021

Transcript

  1. Supercomputer ranking TOP500: 7 of the top 10 systems use NVIDIA GPUs

    Source: https://www.top500.org

    | Rank | System | Configuration | Site | Performance (TFlops) |
    |---|---|---|---|---|
    | 2 | Summit | IBM POWER, NVIDIA V100, NVIDIA Mellanox IB EDR | USA | 148,600.0 |
    | 3 | Sierra | IBM POWER, NVIDIA V100, NVIDIA Mellanox IB EDR | USA | 94,640.0 |
    | 5 | Perlmutter | AMD EPYC, NVIDIA A100, HPE Slingshot | USA | 70,870.0 |
    | 6 | Selene | AMD EPYC, NVIDIA A100, NVIDIA Mellanox IB HDR | USA | 63,460.0 |
    | 8 | JUWELS Booster Module | AMD EPYC, NVIDIA A100, NVIDIA Mellanox IB HDR | Germany | 44,120.0 |
    | 9 | HPC5 | Intel Xeon, NVIDIA V100, NVIDIA Mellanox IB HDR | Italy | 35,450.0 |
    | 10 | Voyager-EUS2 | AMD EPYC, NVIDIA A100, NVIDIA Mellanox IB HDR | USA | 30,050.0 |
  2. Supercomputer ranking GREEN500: 9 of the top 10 systems use NVIDIA GPUs

    Source: https://www.top500.org

    | Rank | System | Configuration | Site | Power efficiency (GFlops/W) |
    |---|---|---|---|---|
    | 2 | SSC-21 Scalable Module | AMD EPYC, NVIDIA A100, NVIDIA Mellanox IB HDR200 | South Korea | 33.983 |
    | 3 | Tethys | AMD EPYC, NVIDIA A100, NVIDIA Mellanox IB HDR | USA | 31.538 |
    | 4 | Wilkes-3 | AMD EPYC, NVIDIA A100, NVIDIA Mellanox IB HDR200 | UK | 30.797 |
    | 5 | HiPerGator AI | AMD EPYC, NVIDIA A100, NVIDIA Mellanox IB HDR | USA | 29.521 |
    | 6 | Snellius Phase 1 GPU | Intel Xeon, NVIDIA A100, NVIDIA Mellanox IB HDR | Netherlands | 29.046 |
    | 7 | Perlmutter | AMD EPYC, NVIDIA A100, HPE Slingshot | USA | 27.374 |
    | 8 | Karolina, GPU partition | AMD EPYC, NVIDIA A100, NVIDIA Mellanox IB HDR200 | Czech Republic | 27.213 |
    | 9 | MeluXina - Accelerator Module | AMD EPYC, NVIDIA A100, NVIDIA Mellanox IB HDR | Luxembourg | 26.957 |
    | 10 | NVIDIA DGX SuperPOD | AMD EPYC, NVIDIA A100, NVIDIA Mellanox IB HDR | USA | 26.195 |
  3. Supercomputer ranking TOP500: the accelerator trend is NVIDIA GPUs

    [Chart: TOP500 lists from Jun-11 to Nov-21 on the horizontal axis, vertical axis from 0 to 160; series: NVIDIA, Other]
  4. ANNOUNCING NVIDIA CUNUMERIC: Accelerated Computing At-Scale for PyData and NumPy Ecosystem

    - Transparently accelerates and scales NumPy workflows
    - Zero code changes
    - Automatic parallelism and acceleration for multi-GPU, multi-node systems
    - Scales to 1,000s of GPUs
    - Available now on GitHub and Conda

    NVIDIA Python data science and machine learning ecosystem: cuDF (Pandas), cuML (Scikit-Learn), cuGraph (NetworkX), cuNumeric (NumPy)
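The "zero code changes" claim on the cuNumeric slide can be illustrated with a minimal sketch. It assumes cuNumeric is importable under the module name `cunumeric` and falls back to NumPy when it is not, so the same script runs either way. The Jacobi stencil and the grid size are illustrative choices, not taken from the slides, and launching on multiple GPUs or nodes is not shown.

```python
# Assumption: cuNumeric is installed as the module `cunumeric`.
# If it is not, fall back to NumPy so the script still runs unchanged.
try:
    import cunumeric as np
except ImportError:
    import numpy as np


def jacobi(grid, iterations):
    """Relax the interior of a 2D grid toward the average of its four neighbors."""
    for _ in range(iterations):
        # The right-hand side is evaluated into a temporary before assignment.
        grid[1:-1, 1:-1] = 0.25 * (
            grid[:-2, 1:-1] + grid[2:, 1:-1] + grid[1:-1, :-2] + grid[1:-1, 2:]
        )
    return grid


grid = np.zeros((64, 64))
grid[0, :] = 1.0  # hot top edge; the other boundaries stay at 0
result = jacobi(grid, iterations=100)
print(float(result[1:-1, 1:-1].mean()))
```

Only the import line differs between the NumPy and cuNumeric versions; the array code itself is ordinary NumPy-style slicing and arithmetic.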
  5. ANNOUNCING NVIDIA MODULUS: Physics-ML Neural Simulation Framework

    - Framework for developing physics-ML models
    - Train physics-ML models using governing physics, simulation, and observed data
    - Multi-GPU, multi-node training
    - 1,000-100,000X speed models, ideal for digital twins
    - Available now: developer.nvidia.com/modulus

    Diagram labels: SymPy Equation; Model Library (SiREN, PINO, PINN, MESHFREE); Multi-Node Multi-GPU Training Engine; Numerical Optimization Plans; Geometry; ICs & BCs; Observations; Computational Graph Compiler
  6. HPC PROGRAMMING IN ISO FORTRAN: ISO is the place for portable concurrency and parallelism

    Standards shown: Fortran 2018, Fortran 202x

    - Array syntax and intrinsics
      - NVFORTRAN 20.5
      - Accelerated matmul, reshape, spread, …
    - DO CONCURRENT
      - NVFORTRAN 20.11
      - Auto-offload & multi-core
    - Co-arrays
      - Coming soon
      - Accelerated co-array images
    - DO CONCURRENT reductions
      - NVFORTRAN 21.11
      - REDUCE subclause added
      - Support for +, *, MIN, MAX, IAND, IOR, IEOR
      - Support for .AND., .OR., .EQV., .NEQV. on LOGICAL values
      - Atomics

    Preview support available now in NVFORTRAN
  7. NVIDIA A2: Versatile Entry-Level GPU Bringing NVIDIA AI to Any Server

    - Compact, entry-level inference: single-slot LP, lower power; fits any server and is optimal for thermally constrained systems
    - Latest Ampere architecture features: 3rd gen Tensor cores, 2nd gen RT cores, Secure RoT
    - Higher Intelligent Video Analytics (IVA) performance: 1.3X better performance vs T4
    - Up to 20X higher performance versus CPU: speedups for AI inference and cloud gaming

    | NVIDIA A2 | |
    |---|---|
    | GPU Architecture | Ampere |
    | GPU Memory | 16GB GDDR6 |
    | Interconnect | PCIe Gen 4 (x8) |
    | Form Factor | Low-Profile (1-slot) |
    | Max Power | 40-60W |
    | Schedule* | Mass Production Nov '21 |

    * Server availability 1-3 months later
  8. NVIDIA GPUS AT A GLANCE

    | Architecture | Data Center GPU | RTX / Quadro | GeForce |
    |---|---|---|---|
    | Fermi (2010) | M2090 | 6000 | GTX 580 |
    | Kepler (2012) | K80, K1 | K6000 | GTX 780 |
    | Maxwell (2014) | M40, M10 | M6000 | GTX 980 |
    | Pascal (2016) | P100, P4 | P5000, GP100 | GTX 1080, TITAN Xp |
    | Volta (2017) | V100 | GV100 | TITAN V |
    | Turing (2018) | T4 | RTX 8000 | RTX 2080 Ti |
    | Ampere (2020) | A100, A30, A40, A2, A16 | RTX A6000 | RTX 3090 |
  9. DATA CENTER PRODUCT COMPARISON (NOV 2021)

    | | A100 | A30 | A40 | A2 |
    |---|---|---|---|---|
    | FP64 (no Tensor Core) | 9.7 TFlops | 5.2 TFlops | - | - |
    | FP64 Tensor Core | 19.5 TFlops | 10.3 TFlops | - | - |
    | FP32 (no Tensor Core) | 19.5 TFlops | 10.3 TFlops | 37.4 TFlops | 4.5 TFlops |
    | TF32 Tensor Core | 156 / 312* TFlops | 82 / 165* TFlops | 74.8 / 149.6* TFlops | 9 / 18* TFlops |
    | FP16 Tensor Core | 312 / 624* TFlops | 165 / 330* TFlops | 149.7 / 299.4* TFlops | 18 / 36* TFlops |
    | BF16 Tensor Core | 312 / 624* TFlops | 165 / 330* TFlops | 149.7 / 299.4* TFlops | 18 / 36* TFlops |
    | Int8 Tensor Core | 624 / 1248* TOPS | 330 / 661* TOPS | 299.3 / 598.6* TOPS | 36 / 72* TOPS |
    | Int4 Tensor Core | 1248 / 2496* TOPS | 661 / 1321* TOPS | 598.7 / 1197.4* TOPS | 72 / 144* TOPS |
    | Form Factor | SXM4 module on base board; x16 PCIe Gen4, 2 Slot FHFL, 3 NVLink bridge | x16 PCIe Gen4, 2 Slot FHFL, 1 NVLink bridge | x16 PCIe Gen4, 2 Slot FHFL, 1 NVLink bridge | x8 PCIe Gen4, 1 Slot LP |
    | GPU Memory | 80 GB HBM2e; 80 GB HBM2e | 24 GB HBM2 | 48 GB GDDR6 | 16 GB GDDR6 |
    | GPU Memory Bandwidth | 1935 GB/s; 2039 GB/s | 933 GB/s | 696 GB/s | 200 GB/s |
    | Multi-Instance GPU | Up to 7; Up to 7 | Up to 4 | - | - |
    | Media Acceleration | 1 JPEG Decoder, 5 Video Decoder | 1 JPEG Decoder, 4 Video Decoder | 1 Video Encoder, 2 Video Decoder (+AV1 decode) | 1 Video Encoder, 2 Video Decoder (+AV1 decode) |
    | Ray Tracing | No; No | No | Yes | Yes |
    | Graphics | For in-situ visualization (no NVIDIA vPC or RTX vWS) | For in-situ visualization (no NVIDIA vPC or RTX vWS) | Best | Good |
    | Max Power | 400 W; 300 W | 165 W | 300 W | 40-60 W |

    * Performance with structured sparse matrix

    Where the A100 column lists two values separated by a semicolon, the slide gives one per A100 form factor (SXM4 and PCIe).
  10. INTRODUCTION TO VASP: Scientific Background

    - Most widely used GPU-accelerated software for electronic structure of solids, surfaces, and interfaces
    - Generates
      - Chemical and physical properties
      - Reaction paths
    - Capabilities
      - First principles scaled to 1000s of atoms
      - Materials and properties: liquids, crystals, magnetism, semiconductors/insulators, surfaces, catalysts
      - Solves many-body Schrödinger equation
    - Quantum-mechanical methods and solvers
      - Density Functional Theory (DFT)
      - Plane-wave based framework
      - New implementations for hybrid DFT (HF exact exchange)
  11. FEATURES AVAILABLE AND ACCELERATED IN VASP 6.2

    Legend: Existing acceleration; New acceleration; Acceleration work in progress; On acceleration roadmap

    - Levels of theory: Standard DFT (incl. meta-GGA, vdW-DFT); Hybrid DFT (double buffered); Cubic-scaling RPA (ACFDT, GW); Bethe-Salpeter Equations (BSE); …
    - Projection scheme: Real space; Reciprocal space
    - Executable flavors: Standard variant; Gamma-point simplification variant; Non-collinear spin variant
    - Solvers / main algorithm: Davidson (+Adaptively Compressed Exch.); RMM-DIIS; Davidson+RMM-DIIS; Direct optimizers (Damped, All); Linear response
  12. VASP VERSION UPDATES BRING NEW ACCELERATION

    [Chart: speedup relative to VASP 6.1.2 on EPYC (CPU only, 2x EPYC Rome 7742) versus number of GPUs (1x, 2x, 4x, 8x A100 SXM4 80 GB); series: VASP 6.1.2 and 6.2.0; dataset: Si256_VJT_HSE06; callout: "Better than 22% improvement"]
  13. INTRODUCTION TO LAMMPS: Scientific Background

    - LAMMPS stands for “Large-scale Atomic/Molecular Massively Parallel Simulator”. It is an open-source molecular dynamics application for materials modeling of both solid-state and soft matter, and it can also run coarse-grained simulations of larger particles. Development is funded by DOE, and the primary developers are at Sandia National Laboratories. It is used worldwide across a wide range of industries, including semiconductors and pharmaceuticals.
    - LAMMPS distributions
      - GitHub
      - NGC container
      - MedeA by Materials Design

    [Image: lipids immobilizing water into droplets]
  14. GPU ACCELERATED FEATURES IN LAMMPS: Primary GPU Acceleration Enabled in KOKKOS

    - Virtually all features in LAMMPS are accelerated on NVIDIA GPUs using Kokkos. Performance varies by input and method.
    - Most users will be familiar with the capabilities involved in “interatomic potentials”, such as
      - Pairwise potentials like Lennard-Jones
      - Many-body potentials like EAM, ReaxFF, SNAP
      - Long-range interactions like PPPM for Ewald / particle mesh Ewald
      - Compatibility with force fields from CHARMM, AMBER, GROMACS, COMPASS
    - Ongoing development of NVIDIA acceleration capabilities happens through a partnership between the LAMMPS developers and the NVIDIA Devtech organization. NVIDIA leads include Evan Weinberg and Kamesh Arumugam. Each release brings additional enhancements, so keep a lookout for these updates.
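As a concrete picture of the "pairwise potentials like Lennard-Jones" mentioned on the LAMMPS features slide, the sketch below evaluates the standard 12-6 Lennard-Jones energy with a plain distance cutoff. It is a pure-Python illustration in reduced units, not LAMMPS code; the function names and the default cutoff of 2.5 are assumptions chosen to mirror the "LJ 2.5" benchmark name, and production codes use neighbor lists rather than this all-pairs loop.

```python
import math


def lj_pair_energy(r, epsilon=1.0, sigma=1.0, r_cut=2.5):
    """12-6 Lennard-Jones energy for one pair; zero at or beyond the cutoff."""
    if r >= r_cut:
        return 0.0
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def total_energy(positions, **kwargs):
    """Sum the pair energy over all unique pairs (O(N^2), illustration only)."""
    energy = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            r = math.dist(positions[i], positions[j])
            energy += lj_pair_energy(r, **kwargs)
    return energy


# Three atoms on a line: distances 1.0, 2.0, and 3.0 (the last is beyond the cutoff).
atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
print(total_energy(atoms))
```

The pair energy is zero at r = sigma and reaches its minimum of -epsilon at r = 2^(1/6) sigma, which gives a quick sanity check for any implementation.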
  15. LAMMPS CPU & GPU COMPARISON

    [Charts, two panels: timestep / sec (higher is better) versus number of GPUs (0, 1, 2, 4, 8) for EPYC 7742 (AMD), A10, A40, A30, A100-PCIE-40GB, A100-SXM4-40GB, and A100-SXM-80GB; LAMMPS patch_10Feb2021. Panel 1 dataset: SNAP. Panel 2 dataset: Atomic Fluid Lennard-Jones 2.5 cutoff.]
  16. OTHER HPC APPLICATIONS

    NVIDIA HPC Application Performance | NVIDIA Developer: https://developer.nvidia.com/hpc-application-performance/
  17. AMBER

    Speedup by number of GPUs, AMBER 20.12-AT_21.10, dataset: DC-Cellulose_NPT

    | GPU | 0 GPU | 1 GPU | 2 GPU | 4 GPU | 8 GPU |
    |---|---|---|---|---|---|
    | A40 | 1 | 21 | 41 | 82 | 164 |
    | A30 | 1 | 18 | 37 | 74 | 147 |
    | A100 PCIe 80GB | 1 | 33 | 67 | 134 | 268 |
    | A100 SXM 80GB | 1 | 34 | 69 | 137 | 274 |
    | V100 SXM 32GB | 1 | 22 | 44 | 87 | 174 |

    Speedup by number of GPUs, AMBER 20.12-AT_21.10, dataset: DC-STMV_NPT

    | GPU | 0 GPU | 1 GPU | 2 GPU | 4 GPU | 8 GPU |
    |---|---|---|---|---|---|
    | A40 | 1 | 22 | 44 | 89 | 178 |
    | A30 | 1 | 21 | 41 | 82 | 165 |
    | A100 PCIe 80GB | 1 | 37 | 75 | 150 | 300 |
    | A100 SXM 80GB | 1 | 39 | 78 | 156 | 311 |
    | V100 SXM 32GB | 1 | 23 | 46 | 92 | 183 |
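One way to read speedup charts like the AMBER ones is to convert them into a scaling efficiency relative to a single GPU. The sketch below does that arithmetic for the A100 SXM 80GB labels on the DC-Cellulose_NPT chart (34, 69, 137, 274 for 1, 2, 4, 8 GPUs), assuming the chart labels are listed series by series in the order of the GPU categories. The function name is hypothetical.

```python
def parallel_efficiency(speedups):
    """Map GPU count -> speedup divided by (GPU count x single-GPU speedup)."""
    base = speedups[1]
    return {n: s / (n * base) for n, s in speedups.items()}


# Speedups over the 0 GPU run, as read from the chart labels (rounded values).
a100_sxm_cellulose = {1: 34, 2: 69, 4: 137, 8: 274}

efficiency = parallel_efficiency(a100_sxm_cellulose)
for n, eff in efficiency.items():
    print(f"{n} GPU(s): efficiency {eff:.3f}")
```

Values close to 1 indicate near-linear scaling with GPU count. Because the chart labels are rounded to integers, results slightly above 1 can be a rounding effect rather than genuine super-linear scaling.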
  18. GROMACS

    [Chart: speedup versus number of GPUs (0, 1, 2, 4, 8) for A40, A30, A100 PCIe 80GB, A100 SXM 80GB, V100 SXM 32GB; GROMACS 2021.3; dataset: Cellulose]

    Speedup by number of GPUs, GROMACS 2021.3, dataset: STMV

    | GPU | 0 GPU | 1 GPU | 2 GPU | 4 GPU | 8 GPU |
    |---|---|---|---|---|---|
    | A40 | 1 | 5 | 10 | 14 | 17 |
    | A30 | 1 | 3 | 5 | 9 | 14 |
    | A100 PCIe 80GB | 1 | 6 | 11 | 15 | 22 |
    | A100 SXM 80GB | 1 | 6 | 11 | 16 | 31 |
    | V100 SXM 32GB | 1 | 4 | 7 | 12 | 14 |
  19. NAMD

    Speedup by number of GPUs, NAMD V3.0a9, V2.15a (CPU only), dataset: apoa1_npt_cuda

    | GPU | 0 GPU | 1 GPU | 2 GPU | 4 GPU | 8 GPU |
    |---|---|---|---|---|---|
    | A40 | 1 | 4 | 9 | 18 | 36 |
    | A30 | 1 | 4 | 8 | 15 | 30 |
    | A100 PCIe 80GB | 1 | 6 | 12 | 25 | 50 |
    | A100 SXM 80GB | 1 | 6 | 13 | 26 | 51 |
    | V100 SXM 32GB | 1 | 5 | 10 | 20 | 39 |

    Speedup by number of GPUs, NAMD V3.0a9, V2.15a (CPU only), dataset: stmv_npt_cuda

    | GPU | 0 GPU | 1 GPU | 2 GPU | 4 GPU | 8 GPU |
    |---|---|---|---|---|---|
    | A40 | 1 | 4 | 8 | 16 | 32 |
    | A30 | 1 | 4 | 7 | 14 | 28 |
    | A100 PCIe 80GB | 1 | 6 | 13 | 26 | 52 |
    | A100 SXM 80GB | 1 | 7 | 13 | 27 | 54 |
    | V100 SXM 32GB | 1 | 4 | 9 | 17 | 34 |
  20. LAMMPS

    [Charts, two panels: speedup versus number of GPUs (0, 1, 2, 4, 8) for A40, A30, A100 PCIe 80GB, A100 SXM 80GB, V100 SXM 32GB; LAMMPS stable_29Sep2021. Panel 1 dataset: LJ 2.5. Panel 2 dataset: ReaxFF/C.]
  21. QUANTUM ESPRESSO

    [Chart: speedup versus number of GPUs (0, 1, 2, 4, 8) for A40, A30, A100 PCIe 80GB, A100 SXM 80GB, V100 SXM 32GB; Quantum Espresso 6.8, 6.7 (CPU only); dataset: AUSURF112-jR]
  22. FUN3D

    Speedup by number of GPUs, FUN3D 13.7 (update 1), dataset: dpw_wbt0_crs-3.6Mn_5

    | GPU | 0 GPU | 1 GPU | 2 GPU | 4 GPU | 8 GPU |
    |---|---|---|---|---|---|
    | A40 | 1 | 2 | 5 | 11 | 20 |
    | A30 | 1 | 5 | 12 | 22 | 36 |
    | A100 PCIe 80GB | 1 | 12 | 23 | 40 | 53 |
    | A100 SXM 80GB | 1 | 13 | 24 | 41 | 57 |
    | V100 SXM 32GB | 1 | 6 | 13 | 25 | 43 |
  23. SPECFEM3D

    Speedup by number of GPUs, SPECFEM3D devel_0a5acff9, dataset: four_material_simple_model

    | GPU | 0 GPU | 1 GPU | 2 GPU | 4 GPU | 8 GPU |
    |---|---|---|---|---|---|
    | A40 | 1 | 11 | 22 | 42 | 77 |
    | A30 | 1 | 15 | 28 | 54 | 98 |
    | A100 PCIe 80GB | 1 | 29 | 56 | 103 | 151 |
    | A100 SXM 80GB | 1 | 30 | 56 | 104 | 159 |
    | V100 SXM 32GB | 1 | 14 | 28 | 52 | 90 |
  24. CHROMA

    [Chart: speedup versus number of GPUs (0, 1, 2, 4, 8) for A40, A30, A100 PCIe 80GB, A100 SXM 80GB, V100 SXM 32GB; Chroma V 2021.05; dataset: szscl21_24_128]
  25. ICON

    [Charts, two panels: speedup versus number of GPUs (0, 1, 2, 4, 8) for A40, A30, A100 PCIe 80GB, A100 SXM 80GB, V100 SXM 32GB; ICON 2.6.2+RRTMGP. Panel 1 dataset: SLAM 191 levels 160 km resolution without radiation. Panel 2 dataset: SLAM 191 levels 160 km resolution with radiation.]
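  26. Announcement: PC Cluster Consortium 20th Anniversary PC Cluster Symposium

    - NVIDIA Computing Platform Update | Kuninobu Sasaki (佐々木邦暢, NVIDIA)
      - NVIDIA, which provides three kinds of processors (CPU, GPU, and DPU) together with the software that draws out their performance, is transforming into a "computing platform company". Drawing on the announcements made at NVIDIA GTC in November and at SC21, this session presents the latest information on GPU computing and networking.
    - The History and Outlook of Workloads and Acceleration as Seen from the GPU | Takeshi Izaki (井﨑武士, NVIDIA)
      - Since CUDA was announced in 2006, the performance gains in GPU-centered hardware and software have tracked the history of changing workloads. Since deep learning came into the spotlight in 2012, workloads have shifted from HPC-centric to a hybrid of HPC and AI / deep learning, and NVIDIA has kept innovating in both hardware and software accordingly. This session introduces NVIDIA's efforts, covering the history so far, the present, and the outlook ahead.

    | | |
    |---|---|
    | Event | PCCC21 "'PC Cluster': The Next 10 Years" |
    | Dates | December 8 (Wed) to 9 (Thu), 2021 |
    | Format | Online viewing via Zoom / member web exhibition / GatherTown networking session |
    | Organizer | PC Cluster Consortium |
    | Fee | Free (advance registration required) |
    | Registration deadline | Noon, Monday, December 6, 2021 |
    | Details | https://www.pccluster.org/ja/event/pccc21/ |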