
The Latest from NVIDIA Supporting Computational Mechanics Simulation / 2023-10-26 CMD2023

Shinnosuke Furuya

October 26, 2023

Transcript

  1. Agenda
     • GPU Classification and Generation
     • NVIDIA Grace CPU
     • NVIDIA Grace CPU Superchip
     • NVIDIA Grace Hopper Superchip
     • NVIDIA Modulus

  2. NVIDIA GPUs at a Glance
     • Fermi (2010): Data Center M2090 | Quadro 6000 | GeForce GTX 580
     • Kepler (2012): Data Center K80, K1 | Quadro K6000 | GeForce GTX 780
     • Maxwell (2014): Data Center M40, M10 | Quadro M6000 | GeForce GTX 980
     • Pascal (2016): Data Center P100 | Quadro P5000, GP100 | GeForce GTX 1080, TITAN Xp
     • Volta (2017): Data Center V100 | Quadro GV100 | GeForce TITAN V
     • Turing (2018): Data Center T4 | Quadro RTX 8000 | GeForce RTX 2080 Ti, TITAN RTX
     • Ampere (2020): Data Center A100, A30, A40, A2, A16 | RTX A6000 | GeForce RTX 3090 Ti
     • Hopper (2022): Data Center H100
     • Ada Lovelace (2022): Data Center L40, L4 | RTX 6000 Ada Gen | GeForce RTX 4090

  3. NVIDIA GPUs at a Glance (by workload and precision)
     The same lineup as the previous slide, grouped by target workload:
     • HPC/AI Computing | FP64, FP32: M2090, K80, P100, V100, A100, A30, H100
     • HPC/AI Computing | FP32: M40, T4, A40, A2, L40, L4
     • VDI | FP32: K1, M10, A16
     • Professional Visualization | FP32: the RTX / Quadro line (6000 through RTX 6000 Ada Gen)
     • Gaming | FP32: the GeForce line (GTX 580 through RTX 4090)

  4. Agenda (repeated as a section divider; next up: NVIDIA Grace CPU)

  5. NVIDIA Grace CPU: Building Block of the Superchip
     • High Performance, Power Efficient Cores: 72 flagship Arm Neoverse V2 cores with SVE2 (4x128b SIMD per core)
     • Fast On-Chip Fabric: 3.2 TB/s of bisection bandwidth connects CPU cores, NVLink-C2C, memory, and system IO
     • High-Bandwidth, Low-Power Memory: up to 480 GB of data-center-enhanced LPDDR5X memory delivering up to 500 GB/s of memory bandwidth
     • Coherent Chip-to-Chip Connections: NVLink-C2C with 900 GB/s of bandwidth for coherent connection to CPU or GPU
     • Industry-Leading Performance per Watt: up to 2X perf/W over today's leading servers

  6. NVLink-C2C: High-Speed Chip-to-Chip Interconnect
     • Creates the Grace Hopper and Grace Superchips
     • Removes the typical cross-socket bottlenecks
     • Up to 900 GB/s of raw bidirectional bandwidth, the same as GPU-to-GPU NVLink on Hopper
     • Low-power interface: 1.3 pJ/bit, more than 5x more power efficient than PCIe
     • Enables coherency for both the Grace and Grace Hopper Superchips
     [Diagram: two Grace CPUs linked by NVLink-C2C at 900 GB/s, each with LPDDR5X memory at ≤ 512 GB/s]

  7. NVIDIA Grace for Cloud, AI, and HPC Infrastructure
     • GH200 Grace Hopper Superchip (Large-Scale AI & HPC): accelerated applications where CPU performance and system memory size and bandwidth are critical; tightly coupled CPU & GPU for flagship AI & HPC. The most versatile compute platform for scale-out.
     • Grace CPU Superchip (CPU Computing): CPU-based applications where absolute performance, energy efficiency, and data center density matter, such as scientific computing, data analytics, and enterprise and hyperscale computing applications.

  8. Agenda (repeated as a section divider; next up: NVIDIA Grace CPU Superchip)

  9. NVIDIA Grace CPU Superchip: 2X Performance at the Same Power for the Modern Data Center
     • High Performance, Power Efficient Cores: 144 flagship Arm Neoverse V2 cores with SVE2 (4x128b SIMD per core)
     • Fast On-Chip Fabric: 3.2 TB/s of bisection bandwidth connects CPU cores, NVLink-C2C, memory, and system IO
     • High-Bandwidth, Low-Power Memory: up to 960 GB of data-center-enhanced LPDDR5X memory delivering up to 1 TB/s of memory bandwidth
     • Fast and Flexible CPU IO: up to 8x PCIe Gen5 x16 interfaces; PCIe Gen 5 delivers up to 128 GB/s, 2X more bandwidth than PCIe Gen 4
     • Full NVIDIA Software Stack: AI, Omniverse

  10. NVIDIA Grace is a Compute and Data Movement Architecture
      NVIDIA Scalable Coherency Fabric and Distributed Cache Design
      • 72 high-performance Arm Neoverse V2 cores with 4x128b SVE2
      • 3.2 TB/s bisection bandwidth
      • 117 MB of L3 cache
      • Local caching of remote-die memory
      • Background data movement via the Cache Switch Network

  11. Low-Power, High-Bandwidth Memory Subsystem
      LPDDR5X Data-Center-Enhanced Memory
      • Optimal balance between bandwidth, energy efficiency, and capacity
      • Up to 1 TB/s of raw bidirectional bandwidth
      • 1/8th the power per GB/s vs. conventional DDR memory
      • Similar cost per bit to conventional DDR memory
      • Data-center-class memory with error-correcting code (ECC)
      [Diagram: two Grace CPUs linked by NVLink-C2C at 900 GB/s, each with LPDDR5X memory interfaces at ≤ 512 GB/s]

  12. NVIDIA Grace CPU Delivers 1.9X HPC Data Center Throughput at the Same Power
      Breakthrough Performance and Efficiency
      [Two charts, normalized to AMD Genoa = 1.0X, covering Weather (WRF), MD (CP2K), Climate (NEMO), and CFD (OpenFOAM). Server performance: Intel SPR at 0.6-0.7X, AMD Genoa at 1.0X, NVIDIA Grace at 1.0-1.1X. Data center throughput at the same power: Intel SPR at 0.6-0.7X, AMD Genoa at 1.0X, NVIDIA Grace at 1.7-1.9X.]
      Data-center-level projection of the NVIDIA Grace Superchip vs. flagship x86 2-socket data center systems (AMD EPYC 9654 and Intel Xeon 8480+). MD: CP2K RPA 2023.1. Climate: NEMO Gyre_Pisces v4.2.0. Weather: WRF CONUS12, 24-hour simulation, v4.4.2. CFD: OpenFOAM Motorbike (Large), v2212. NVIDIA Grace Superchip performance based on engineering measurements. Results subject to change.

  13. Agenda (repeated as a section divider; next up: NVIDIA Grace Hopper Superchip)

  14. NVIDIA Grace CPU: Building Block of the Superchip
      (Repeat of slide 5, reintroducing the Grace CPU: 72 Arm Neoverse V2 cores with SVE2, 3.2 TB/s on-chip fabric, up to 480 GB LPDDR5X at up to 500 GB/s, 900 GB/s NVLink-C2C, up to 2X perf/W over today's leading servers.)

  15. NVIDIA Hopper H100 GPU: Breakthrough Performance and Efficiency for the Modern Data Center
      • Highest AI and HPC Performance: 4 PF FP8 (6X) | 2 PF FP16 (3X) | 1 PF TF32 (3X) | 67 TF FP64 (3.4X); 4 TB/s (2X), 96 GB HBM3 memory
      • Transformer Engine: 4th-generation Tensor Core optimized for Transformer models; 6X faster on the largest transformer models
      • Highest Utilization Efficiency and Security: 7 fully isolated and secured instances with 2nd-gen MIG
      • Fastest, Scalable Interconnect: 4th-gen NVLink with 900 GB/s GPU-to-GPU connectivity, up to 256 linked GPUs with the NVLink Switch System

  16. NVIDIA GH200 Grace Hopper Superchip: Processor for the Era of Accelerated Computing and Generative AI
      • GH200 with HBM3: 72-core Grace CPU | 4 PFLOPS Hopper GPU | 96 GB HBM3 | 4 TB/s | 900 GB/s NVLink-C2C. 7X the bandwidth to the GPU vs. PCIe Gen 5; combined 576 GB of fast memory (480 GB LPDDR5X + 96 GB HBM3); 1.2x the capacity and bandwidth vs. H100; full NVIDIA compute stack.
      • GH200 with HBM3e: 72-core Grace CPU | 4 PFLOPS Hopper GPU | 144 GB HBM3e | 5 TB/s | 900 GB/s NVLink-C2C. World's first HBM3e GPU; combined 624 GB of fast memory (480 GB LPDDR5X + 144 GB HBM3e); 1.7x the capacity and 1.5x the bandwidth vs. H100; full NVIDIA compute stack.
      • NVLink Dual-GH200 System: 144-core Grace CPU | 8 PFLOPS Hopper GPU | 288 GB HBM3e | 10 TB/s | 900 GB/s NVLink-C2C. Simple-to-deploy MGX-compatible design; combined 1.2 TB of fast memory; 3.5x the capacity and 3x the bandwidth vs. H100; full NVIDIA compute stack.

  17. Energy Efficient Design: More Efficient Computation and Data Movement
      x86+Hopper baseline vs. Grace Hopper:
      • System memory: DDR5 at 35 pJ/bit vs. LPDDR5X at 5 pJ/bit (7X less energy)
      • CPU-GPU link: PCIe Gen 5 at 6.5 pJ/bit vs. NVLink-C2C at 1.3 pJ/bit (5X less energy)
      • CPU compute: 99 pJ/FLOP (DP) vs. 62 pJ/FLOP (DP) on Grace (1.6X less energy)
      • GPU compute: 12 pJ/FLOP (DP) on both (equal energy)

  18. Optimizing Performance Through Power Steering: Getting the Most Out of Provisioned Power
      [Chart: CPU vs. GPU power split within an initial provisioned power of 650 W, comparing an x86+H100 GPU-heavy phase without power steering against a GH200 GPU-heavy phase with power steering.]
      • 200 W of CPU power is shifted to the GPU to maximize application performance
      • Total power stays fixed at the provisioned power limit (650 W)

  19. GH200 Grace Hopper HPC Platform: Unified Memory and Cache Coherence for Next-Gen HPC Performance
      • Fast-access memory: 624 GB | Memory bandwidth: 5 TB/s
      • Partially GPU-accelerated apps: big performance gains with no code changes
      • No more PCIe bottleneck: NVLink-C2C is 7X PCIe bandwidth
      • CPU & GPU cache coherence: incremental code changes yield big gains (see the sketch below)
      HPC: preliminary results comparing DGX A100, DGX H100, and GH200-with-HBM3 systems. OpenFOAM based on MotorBike, NAMD with Colvars, CP2K with RPA,

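      The coherence claims above can be exercised from high-level code. Below is a minimal sketch, assuming CuPy is installed, that routes array allocations through CUDA managed memory (cudaMallocManaged); the array size and arithmetic are arbitrary illustration values, not from the slide. On a coherent platform such as GH200, the GPU reaches this memory over NVLink-C2C rather than PCIe, which is what makes the "incremental code changes" path attractive.

          import cupy as cp

          # Back every CuPy allocation with CUDA managed memory so the same
          # allocation is usable from both CPU and GPU code paths.
          cp.cuda.set_allocator(cp.cuda.MemoryPool(cp.cuda.malloc_managed).malloc)

          x = cp.random.random(100_000_000)  # managed allocation (illustrative size)
          y = cp.sqrt(x).sum()               # GPU kernels operate on managed memory
          print(float(y))                    # result is read back on the host
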
  20. Agenda (repeated as a section divider; next up: NVIDIA Modulus)

  21. AI/ML for Physical Systems: What Is Different?
      Physics-based machine learning. Physical system ≈ a system governed by first principles or by governing partial differential equations (an illustrative example follows).

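      One typical governing PDE for such a system (an illustrative example of mine; the slide names none) is the transient heat equation

          ∂u/∂t = α ∇²u

      where u(x, t) is the temperature field and α the thermal diffusivity. A physics-based model is constrained to satisfy this equation rather than to fit data alone.
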
  22. Promise of PINNs: Data Only vs. PINN, Solving the Data Problem
      • Neural networks are functions that can be modified to represent almost any other function
      • Target function: f(x); NN to approximate it: u(x;W) ≅ f(x)
      • Training: find weights W that minimize the mismatch at selected data points, i.e. drive Σ_i (u(x_i;W) − f(x_i))² toward 0
      • Given enough data, neural networks can approximate almost any function to any degree of accuracy
      • But collecting field data may not always be possible
      • If we understand the physical laws behind the data, we can generate the missing training signal from the equations themselves: a PINN also penalizes the residual of the governing equation, driving it to 0 (see the sketch below)

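      To make that loss structure concrete, here is a minimal PINN sketch in PyTorch (an assumption: the slide shows no code, and Modulus wraps these details). It fits u(x;W) to a few samples of a hypothetical target f(x) = sin(x) while also penalizing the residual of u'' + u = 0, an equation that f satisfies:

          import torch

          # Small MLP u(x; W) approximating the target function.
          net = torch.nn.Sequential(
              torch.nn.Linear(1, 32), torch.nn.Tanh(),
              torch.nn.Linear(32, 32), torch.nn.Tanh(),
              torch.nn.Linear(32, 1),
          )

          # A few "measured" samples of the hypothetical target f(x) = sin(x).
          x_data = torch.linspace(0.0, 3.0, 5).reshape(-1, 1)
          y_data = torch.sin(x_data)

          # Collocation points where the residual u'' + u = 0 is enforced.
          x_col = torch.linspace(0.0, 3.0, 64).reshape(-1, 1).requires_grad_(True)

          opt = torch.optim.Adam(net.parameters(), lr=1e-3)
          for step in range(5000):
              opt.zero_grad()
              # Data term: mismatch (u(x_i; W) - f(x_i))^2 at the data points.
              loss_data = ((net(x_data) - y_data) ** 2).mean()
              # Physics term: autograd supplies u' and u'' at collocation points.
              u = net(x_col)
              du = torch.autograd.grad(u.sum(), x_col, create_graph=True)[0]
              d2u = torch.autograd.grad(du.sum(), x_col, create_graph=True)[0]
              loss_pde = ((d2u + u) ** 2).mean()  # drive the residual toward 0
              (loss_data + loss_pde).backward()
              opt.step()

      The physics term lets the network learn f between and beyond the five data points, which is exactly the "solving the data problem" argument on the slide.
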
  23. Modulus: Open-Source AI Toolkit for Physics-Based ML
      • Python-based APIs for ease of use
      • Import your PyTorch model* (an illustrative sketch follows)
      • Reference case studies and recipes as starting points
      • Facilitates open collaboration within the Physics-ML scientific community
      • Model architecture zoo
      • Well-documented features and functionality for ease of use
      • Source code: https://github.com/NVIDIA/modulus

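      Because Modulus builds on PyTorch, the model you "import" is an ordinary torch.nn.Module. The class below is an illustrative assumption (not a Modulus API), showing the shape of such a model for a coordinates-to-scalar-field surrogate:

          import torch

          class SurrogateNet(torch.nn.Module):
              """Plain PyTorch module: the kind of model Modulus can consume."""
              def __init__(self, in_dim: int = 3, out_dim: int = 1, width: int = 128):
                  super().__init__()
                  self.body = torch.nn.Sequential(
                      torch.nn.Linear(in_dim, width), torch.nn.SiLU(),
                      torch.nn.Linear(width, width), torch.nn.SiLU(),
                      torch.nn.Linear(width, out_dim),
                  )

              def forward(self, x: torch.Tensor) -> torch.Tensor:
                  # x: spatial coordinates (e.g. x, y, z) -> predicted field value.
                  return self.body(x)

          model = SurrogateNet()
          print(model(torch.rand(8, 3)).shape)  # torch.Size([8, 1])
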
  24. To Learn More about Modulus
      • Book: はじめてのNVIDIA Modulus (First Steps with NVIDIA Modulus: Engineering Simulation with Physics-ML, Machine Learning Based on Physics)
        • Author: 柴田良一 (Ryoichi Shibata), supervised by NVIDIA
        • Publisher: Kohgakusha
        • Release date: July 26, 2023
        • https://honto.jp/netstore/pd-book_32621454.html
      • CAE Konwakai "Kaiseki-juku" course: Introduction to Modulus
        • Date: December 7, 2023 (Thu), 10:00-16:00, also available on demand
        • Instructors: Prof. 柴田良一 (NIT, Gifu College) and 丹愛彦 (NVIDIA)
        • Venue: Shin-Osaka
        • Fee: paid course
        • http://www.cae21.org/kaisekijuku2023/modulus_2023.shtml

  25. NVIDIA at SC23
      November 12-17, 2023 | Colorado Convention Center, Denver, Colorado
      Many NVIDIA sessions:
      • 7 Tutorials
      • 5 Workshops
      • 4 Exhibitor Forums
      • 4 Papers
      • 15 Birds of a Feather (BoF) sessions
      • 2 Panels
      https://www.nvidia.com/en-us/events/supercomputing/

  26. NVIDIA GTC: The In-Person GTC Experience Is Back
      March 18-21, 2024 | San Jose Convention Center, San Jose, CA, and Virtual
      Come to GTC, the conference for the era of AI, to connect with a dream team of industry luminaries, developers, researchers, and business experts shaping what's next in AI and accelerated computing. From the highly anticipated keynote by NVIDIA CEO Jensen Huang to over 600 inspiring sessions, 200+ exhibits, and tons of networking events, GTC delivers something for every technical level and interest area. Be sure to save your spot for this transformative event. You can even take advantage of early-bird pricing when you register by February 7.
      www.nvidia.com/gtc