
GTC 2022 Re:Cap / 2022-04-20 JAWS-UG-HPC

Shinnosuke Furuya

April 20, 2022


1. "Alexa, what kind of person is Furuya?"
- Name: Shinnosuke Furuya
- Job: HPC DevRel @NVIDIA <- SA/SE @ARGOGRAPHICS <- PD @U-Tokyo
- Twitter: @sfuruyaz
- Recent activities:
  - [Article] HPC and Physics: The New World Opened Up by GPU Computing | Journal of the Physical Society of Japan
    https://www.jps.or.jp/books/gakkaishi/2021/12/76-12.php
  - [Talk] Various Applications Accelerated by GPUs | GDEP Solutions GPU2021
    https://www.gdep-sol.co.jp/gpu2021-day3.html
  - [Panel] The Current State of and Expectations for C++ and GPU Programming | PC Cluster Consortium WS
    https://www.pccluster.org/ja/event/2022/03/220415-hpcoss-ws.html
  - [Organizer] NVIDIA Autumn HPC Weeks
    https://nvidia.connpass.com/event/225000/
  - [Organizer] GPU mini-camps (hands-on meetups on GPU supercomputers):
    U-Tokyo (Reedbush, Wisteria/BDEC-01), Tokyo Tech (TSUBAME), U. Tsukuba (Cygnus), Nagoya U. (Flow), Osaka U. (SQUID), Kyushu U. (ITO), AIST (ABCI)
2. AGENDA
- NVIDIA HOPPER
- NVIDIA GRACE
- GPU AND CPU PORTFOLIO AVAILABILITY
- DGX SYSTEMS
- NVIDIA NETWORKING
- NVIDIA HPC PLATFORM
- GPU PROGRAMMING
- SUMMARY
3. NEXT WAVE OF AI REQUIRES PERFORMANCE AND SCALABILITY
- Transformers transforming AI: SegFormer (semantic segmentation), Decision Transformer (reinforcement learning), MegaMolBART (drug discovery with AI), SuperGLUE leaderboard (difficult NLU tasks); 70% of AI papers in the last 2 years discuss Transformer models
- Exploding computational requirements (chart: training PetaFLOPS vs year, 2012-2022, from AlexNet, VGG-19, and ResNet through GPT-3, Switch Transformer 1.6T, and Megatron-Turing NLG 530B): Transformer AI models grow 275x every 2 years; AI models excluding Transformers grow 8x every 2 years
- Higher performance and scalability needed: GPT-3 (175B parameters) takes 3.5 months to train on 128x A100
- Sources: MegaMolBART: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/models/megamolbart | SegFormer: https://arxiv.org/abs/2105.15203 | Decision Transformer: https://arxiv.org/pdf/2106.01345.pdf | SuperGLUE: https://super.gluebenchmark.com/leaderboard | Exploding computational requirements: NVIDIA analysis and https://github.com/amirgholami/ai_and_memory_wall
4. ANNOUNCING NVIDIA HOPPER
The New Engine for the World's AI Infrastructure
- World's most advanced chip: custom 4N TSMC process | 80 billion transistors
- 4th Gen NVLink
- Transformer Engine
- 2nd Gen MIG
- Confidential Computing
- DPX Instructions
5. TRANSFORMER ENGINE
Tensor Core optimized for Transformer models
- 6X faster training and inference of Transformer models
- NVIDIA-tuned adaptive range optimization across 16-bit and 8-bit math
- Configurable macro blocks deliver performance without accuracy loss
[Figure: statistics and adaptive range tracking steer computation between 8-bit and 16-bit precision]
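NVIDIA has not published the Transformer Engine's internals, but the "statistics and adaptive range tracking" bullet can be illustrated with a generic per-tensor range-scaling sketch. This is my own illustration in plain C++ using int8 storage; real FP8 formats (E4M3/E5M2) and NVIDIA's tuning heuristics differ, but the idea is the same: track the live value range, then scale so 8 bits cover it.

```cpp
// Illustrative per-tensor range scaling for 8-bit math.
// NOT NVIDIA's Transformer Engine algorithm -- a generic sketch of
// why range statistics matter when narrowing from 16-bit to 8-bit.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    std::vector<float> t = {0.02f, -1.7f, 3.4f, -0.003f};

    // 1) Track range statistics (here: the absolute maximum).
    float amax = 0.0f;
    for (float v : t) amax = std::max(amax, std::fabs(v));

    // 2) Derive a scale so the largest value maps near the int8 limit.
    const float scale = 127.0f / amax;

    // 3) Quantize, then dequantize to inspect the representation error.
    for (float v : t) {
        auto q = static_cast<int8_t>(std::lround(v * scale));
        std::printf("%+.4f -> %4d -> %+.4f\n", v, q, q / scale);
    }
    return 0;
}
```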
6. HOPPER TECHNOLOGICAL BREAKTHROUGHS
- NEW DYNAMIC PROGRAMMING INSTRUCTIONS (DPX): accelerate dynamic programming algorithms across a broad range of use cases: optimization, omics, graph analytics, data processing
- Real-time performance with DPX: vs CPU (1X), up to 35X in genomics and 40X in routing optimization (HGX H100 4-GPU vs dual-socket 32-core Ice Lake)
- MULTI-INSTANCE GPU: 7 secure tenants on 1 GPU
- CONFIDENTIAL COMPUTING: secure data and AI models in use
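To make "dynamic programming instructions" concrete: DP algorithms such as Smith-Waterman sequence alignment (the genomics benchmark cited on slide 9) spend their time in a max-of-sums recurrence. Below is a minimal scalar C++ sketch of that inner loop, purely illustrative; it uses no DPX intrinsics, it just shows the fused max/add pattern the hardware accelerates.

```cpp
// Scalar Smith-Waterman local alignment (illustrative only).
// The recurrence is the max-of-sums pattern DPX instructions target:
//   H[i][j] = max(0,
//                 H[i-1][j-1] + s(a_i, b_j),  // match / mismatch
//                 H[i-1][j]   - gap,          // deletion
//                 H[i][j-1]   - gap)          // insertion
#include <algorithm>
#include <string>
#include <vector>
#include <cstdio>

int main() {
    const std::string a = "GATTACA", b = "GCATGCU";
    const int match = 3, mismatch = -3, gap = 2;

    std::vector<std::vector<int>> H(a.size() + 1,
                                    std::vector<int>(b.size() + 1, 0));
    int best = 0;
    for (std::size_t i = 1; i <= a.size(); ++i)
        for (std::size_t j = 1; j <= b.size(); ++j) {
            int s = (a[i - 1] == b[j - 1]) ? match : mismatch;
            H[i][j] = std::max({0, H[i - 1][j - 1] + s,
                                H[i - 1][j] - gap, H[i][j - 1] - gap});
            best = std::max(best, H[i][j]);
        }
    std::printf("best local alignment score = %d\n", best);
    return 0;
}
```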
7. ANNOUNCING NVIDIA H100
Unprecedented performance, scalability, and security for every data center
- HIGHEST AI AND HPC PERFORMANCE: 4PF FP8 (6X) | 2PF FP16 (3X) | 1PF TF32 (3X) | 60TF FP64 (3X); 3TB/s (1.5X), 80GB HBM3 memory
- TRANSFORMER MODEL OPTIMIZATIONS: 6X faster on the largest transformer models
- HIGHEST UTILIZATION EFFICIENCY AND SECURITY: 7 fully isolated and secured instances with guaranteed QoS; 2nd Gen MIG | Confidential Computing
- FASTEST, SCALABLE INTERCONNECT: 900 GB/s GPU-to-GPU connectivity (1.5X), up to 256 GPUs with NVLink Switch | 128GB/s PCIe Gen5
FP8, FP16, and TF32 performance include sparsity. X-factors compared to A100.
8. NVLINK SWITCH SYSTEM
Enabling multi-node NVLink, up to 256 GPUs
- 4th GEN NVLINK: 900 GB/s from 18 bi-directional ports at 25 GB/s per direction; GPU-to-GPU connectivity across nodes
- 3rd GEN NVSWITCH: all-to-all NVLink switching for 8-256 GPUs; accelerates collectives (multicast and SHARP)
- NVLINK SWITCH: 128-port cross-connect based on NVSwitch
- H100 CLUSTER (1 scalable unit): 57,600 GB/s all-to-all bandwidth; 32 servers | 18 NVLink switches | 1,152 NVLink optical cables
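As a consistency check on the 57,600 GB/s figure: 256 GPUs each drive 900 GB/s bidirectional NVLink, i.e. 450 GB/s per direction, and half of them sit on either side of any bisection:

\[
\frac{256}{2}\ \text{GPUs} \times 450\ \frac{\text{GB/s}}{\text{GPU}} = 57{,}600\ \text{GB/s}
\]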
9. H100 BRINGS ORDER-OF-MAGNITUDE LEAP IN PERFORMANCE
Performance and scalability for next-generation breakthroughs
- HPC: up to 7X higher app performance vs A100 (3D FFT 6X, genome sequencing 7X)
- TRAINING: up to 9X more throughput vs A100 (chart: throughput vs #GPUs at 128, 4,000, 8,000); Mixture of Experts (395B) trains in 20 hrs on H100 vs 7 days on A100
- INFERENCE: up to 30X more throughput vs A100 on a Megatron 530B-based chatbot (16X, 20X, and 30X across the 2 s, 1.5 s, and 1 s latency targets)
Projected performance, subject to change. Training: Mixture of Experts (MoE) Transformer Switch-XXL variant with 395B parameters on a 1T-token dataset. Inference: Megatron 530B parameter model-based chatbot, input sequence length 128, output sequence length 20. HPC: 3D FFT (4K^3) throughput; A100 cluster: HDR IB network; H100 cluster: NVLink Switch System, NDR IB. HPC genome sequencing (Smith-Waterman): 1x A100 vs 1x H100.
10. ANNOUNCING HGX H100
The world's most advanced enterprise AI infrastructure
- HIGHEST PERFORMANCE FOR AI AND HPC: 4-way / 8-way H100 GPUs with 32 PetaFLOPS FP8; 3.6 TFLOPS FP16 in-network SHARP compute; NVIDIA-Certified high-performance offering from all makers
- FASTEST, SCALABLE INTERCONNECT: 4th Gen NVLink with 3X faster all-reduce communications; 3.6 TB/s bisection bandwidth; NVLink Switch System option scales up to 256 GPUs
- SECURE COMPUTING: first HGX system with Confidential Computing
Tensor Core FLOPS shown with sparsity | Speedups compared to prior generation.
11. NVIDIA H100 PCIE
Unprecedented performance, scalability, and security for mainstream servers
- HIGHEST AI AND HPC MAINSTREAM PERFORMANCE: 3.2PF FP8 (5X) | 1.6PF FP16 (2.5X) | 800TF TF32 (2.5X) | 48TF FP64 (2.5X); 6X faster dynamic programming with DPX instructions; 2TB/s, 80GB HBM2e memory
- HIGHEST COMPUTE ENERGY EFFICIENCY: configurable TDP from 150W to 350W; 2-slot FHFL mainstream form factor
- HIGHEST UTILIZATION EFFICIENCY AND SECURITY: 7 fully isolated and secured instances with guaranteed QoS; 2nd Gen MIG | Confidential Computing
- HIGHEST-PERFORMING SERVER CONNECTIVITY: 128GB/s PCIe Gen5; 600 GB/s GPU-to-GPU connectivity (5X PCIe Gen5) for up to 2 GPUs with NVLink Bridge
FP8, FP16, and TF32 performance include sparsity. X-factors compared to A100.
12. ANNOUNCING H100 CNX CONVERGED ACCELERATOR
Delivering high-speed GPU-network I/O to mainstream servers
- Pairs an H100 GPU with a ConnectX-7 NIC on a single board optimized for accelerated computing; in a traditional server the GPU and NIC attach separately to the CPU over PCIe Gen4, whereas H100 CNX uses PCIe Gen5 within the board and to the host
- 350W | 80GB | 400 Gb/s Ethernet or InfiniBand
- 2-slot FHFL | NVLink
13. FASTER DATA SPEEDS WITH CONVERGED ARCHITECTURE
Ideal for mainstream servers
- H100 with a separate ConnectX-7: control and data planes both traverse the host PCIe system, causing bandwidth limits and contention with other PCIe devices
- H100 CNX: a dedicated PCIe Gen5 switch on the board carries the data plane between H100 and ConnectX-7, enabling a high transfer rate and predictable latency
- Both configurations: 1x 400 or 2x 200 Gbps Ethernet or HDR InfiniBand
14. ANNOUNCING GRACE HOPPER
CPU+GPU designed for giant-scale AI and HPC
- 600GB-memory GPU for giant models
- New 900 GB/s coherent interface
- 30X higher system memory bandwidth to the GPU in a server
- Runs NVIDIA computing stacks
- Available 1H 2023
15. ANNOUNCING GRACE CPU SUPERCHIP
The full power of Grace
- HIGHEST CPU PERFORMANCE: superchip design with 144 high-performance Armv9 cores; estimated SPECrate2017_int_base of over 740
- HIGHEST MEMORY BANDWIDTH: world's first LPDDR5x memory with ECC; 1TB/s memory bandwidth
- HIGHEST ENERGY EFFICIENCY: 2X perf/watt; CPU cores + memory in 500W
- 2X PACKING DENSITY: 2x the density of DIMM-based designs
- RUNS FULL NVIDIA COMPUTING STACKS: RTX, HPC, AI, Omniverse
- AVAILABLE 1H 2023
16. PORTFOLIO AVAILABILITY
Ampere today, Hopper coming soon (roadmap: Today -> Q3 2022 -> Q4 2022 -> 1H 2023)
- Training, HPC, data analytics (GPU + x86): A100 today; HGX H100 8-GPU, HGX H100 4-GPU, and HGX H100 with external NVLink coming
- Inference, mainstream compute (high-performance & mainstream): A30, A16, A10, A2 today; H100 PCIe coming
- Converged accelerators (GPU + NIC): A30X/A100X today; H100 CNX coming
- Arm: Grace CPU Superchip (high performance and power efficient) and Grace Hopper Superchip (giant AI and HPC workloads), 1H 2023
17. DELIVERING THE AI CENTER OF EXCELLENCE FOR ENTERPRISE
Best-of-breed infrastructure for AI development, built on NVIDIA DGX
- ANNOUNCING NVIDIA DGX H100: the world's first AI system with NVIDIA H100; 4th generation of the world's most successful platform purpose-built for enterprise AI; 8x NVIDIA H100 | 32 PFLOPS FP8 (6X) | 0.5 PFLOPS FP64 (3X); 640 GB HBM3 | 3.6 TB/s (1.5X) bisection bandwidth
- ANNOUNCING DGX SuperPOD WITH DGX H100: 1 ExaFLOPS of AI performance in 32 nodes; scale as large as needed in 32-node increments; 32 DGX H100 | 1 EFLOPS AI; NVLink Switch System | Quantum-2 IB | 20TB HBM3 | 70 TB/s bisection bandwidth (11X)
- COMING LATE 2022
X-factors compare performance to a DGX SuperPOD with DGX A100 supercomputer configuration with the same number of nodes.
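The 1 EFLOPS headline follows directly from the per-GPU FP8 (with sparsity) figure quoted on slide 7:

\[
8\ \text{GPUs} \times 4\ \text{PFLOPS} = 32\ \text{PFLOPS per DGX H100}, \qquad
32\ \text{nodes} \times 32\ \text{PFLOPS} = 1{,}024\ \text{PFLOPS} \approx 1\ \text{EFLOPS}
\]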
18. ANNOUNCING NVIDIA EOS SUPERCOMPUTER
The world's most advanced AI infrastructure
- Cloud native | Performance isolation | Multi-tenant
- NVIDIA Eos DGX SuperPOD: powered by 576 DGX H100 systems | 500 Quantum-2 IB switches | 360 NVLink switches
- FP8: 18 EFLOPS (6X) | FP16: 9 EFLOPS (3X) | FP64: 275 PFLOPS (3X)
- In-network compute: 3.7 PFLOPS (36X) | Bisection bandwidth: 230 TB/s (2X) | NVLink domain: 256 GPUs (32X)
- Blueprint for OEM and cloud partner offerings
X-factors compare performance to a DGX A100 SuperPOD-based supercomputer configuration with the same number of nodes.
19. NVIDIA QUANTUM-2 IN-NETWORK COMPUTING
- QUANTUM-2 SWITCH: SHARPv3 small-message and large-message data reductions
- CONNECTX-7 INFINIBAND: 16-core / 256-thread datapath accelerator; full transport offload and telemetry; hardware-based RDMA / GPUDirect; MPI tag matching and all-to-all
- BLUEFIELD-3 INFINIBAND: 16 Arm 64-bit cores; 16-core / 256-thread datapath accelerator; full transport offload and telemetry; hardware-based RDMA / GPUDirect; computational storage; security engines; MPI tag matching and all-to-all
- Optimized multi-tenant in-network computing
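SHARP in-network reductions are transparent to applications: when SHARP is enabled on the fabric, the MPI stack can offload standard collectives to the switches with no source changes. A minimal sketch in plain MPI (whether SHARP is actually used is a library/fabric configuration matter, not anything in this code):

```cpp
// Minimal MPI allreduce sketch (MPI C API, compiled as C++).
// On a Quantum InfiniBand fabric with SHARP enabled, this same
// call can be offloaded to in-network reduction by the MPI library.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    std::vector<double> local(1 << 20, rank + 1.0); // per-rank data
    std::vector<double> global(local.size());

    // The reduction itself; any SHARP offload happens below this API.
    MPI_Allreduce(local.data(), global.data(),
                  static_cast<int>(local.size()),
                  MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) std::printf("global[0] = %f\n", global[0]);
    MPI_Finalize();
    return 0;
}
```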
20. HPC+AI: MODULUS
Framework for developing physics machine-learning models
- Develop high-fidelity surrogates; adopted by leading research institutions; collaboration partners
- New Modulus features: FNO and AFNO for accelerating scientific simulations; Omniverse integration for interactive visualization; PyTorch integration to support the growing ecosystem
- Available early April
- Up to 4,000x speedup for simulating wind farms
21. CUQUANTUM
Research the computer of tomorrow with the most powerful computer today
- Announcing new releases and an expanding ecosystem: cuQuantum SDK GA release; cuQuantum DGX Appliance beta (optimized simulation stack); integrations with leading quantum computing frameworks
- Up to ~70X speedup on the cuQuantum appliance: 68X (Sycamore circuit), 79X (QFT), 88X (Shor's); DGX Appliance gains over dual-core CPU
- Enabling scientific breakthroughs: QC Ware QChem simulation on Perlmutter; Rigetti & NERSC quantum ML climate modeling on Perlmutter
22. NVIDIA GPUS AT A GLANCE

| Architecture | Data Center GPU | RTX / Quadro | GeForce / TITAN |
|---|---|---|---|
| Fermi (2010) | M2090 | 6000 | GTX 580 |
| Kepler (2012) | K80, K1 | K6000 | GTX 780 |
| Maxwell (2014) | M40, M10 | M6000 | GTX 980 |
| Pascal (2016) | P100 | P5000, GP100 | GTX 1080, TITAN Xp |
| Volta (2017) | V100 | GV100 | TITAN V |
| Turing (2018) | T4 | RTX 8000 | RTX 2080 Ti, TITAN RTX |
| Ampere (2020) | A100, A30, A40, A2, A16 | RTX A6000 | RTX 3090 Ti |
| Hopper (2022) | H100 | | |
23. DATA CENTER PRODUCT COMPARISON (MAR 2022)

| | A100 (SXM4) | A100 (PCIe) | A30 | A40 | H100 (SXM5)¶ | H100 (PCIe)¶ |
|---|---|---|---|---|---|---|
| FP64 (TFLOPS) | 9.7 | 9.7 | 5.2 | - | 30 | 24 |
| FP64 Tensor Core (TFLOPS) | 19.5 | 19.5 | 10.3 | - | 60 | 48 |
| FP32 (TFLOPS) | 19.5 | 19.5 | 10.3 | 37.4 | 60 | 48 |
| TF32 Tensor Core (TFLOPS) | 156 / 312* | 156 / 312* | 82 / 165* | 74.8 / 149.6* | 500 / 1000* | 400 / 800* |
| BFLOAT16 Tensor Core (TFLOPS) | 312 / 624* | 312 / 624* | 165 / 330* | 149.7 / 299.4* | 1000 / 2000* | 800 / 1600* |
| FP16 Tensor Core (TFLOPS) | 312 / 624* | 312 / 624* | 165 / 330* | 149.7 / 299.4* | 1000 / 2000* | 800 / 1600* |
| FP8 Tensor Core (TFLOPS) | - | - | - | - | 2000 / 4000* | 1600 / 3200* |
| INT8 Tensor Core (TOPS) | 624 / 1248* | 624 / 1248* | 330 / 661* | 299.3 / 598.6* | 2000 / 4000* | 1600 / 3200* |
| INT4 Tensor Core (TOPS) | 1248 / 2496* | 1248 / 2496* | 661 / 1321* | 598.7 / 1197.4* | - | - |
| Form factor | SXM4 module on baseboard | x16 PCIe Gen4, 2-slot FHFL, 3 NVLink bridges | x16 PCIe Gen4, 2-slot FHFL, 1 NVLink bridge | x16 PCIe Gen4, 2-slot FHFL, 1 NVLink bridge | SXM5 | PCIe Gen5 |
| GPU memory | 40 GB HBM2 / 80 GB HBM2e | 40 GB HBM2 / 80 GB HBM2e | 24 GB HBM2 | 48 GB GDDR6 | 80 GB HBM3 | 80 GB HBM2e |
| GPU memory bandwidth | 1555† / 2039‡ GB/s | 1555† / 1935‡ GB/s | 933 GB/s | 696 GB/s | 3 TB/s | 2 TB/s |
| Multi-Instance GPU | up to 7 MIGs @ 5† / 10‡ GB | up to 7 MIGs @ 5† / 10‡ GB | up to 4 MIGs @ 6 GB | - | up to 7 MIGs @ 10 GB | up to 7 MIGs @ 10 GB |
| Interconnect BW (bidirectional) | 600 GB/s NVLink; 64 GB/s PCIe Gen4 | 600 GB/s NVLink; 64 GB/s PCIe Gen4 | 200 GB/s NVLink; 64 GB/s PCIe Gen4 | 112.5 GB/s NVLink; 64 GB/s PCIe Gen4 | 900 GB/s NVLink; 128 GB/s PCIe Gen5 | 600 GB/s NVLink; 128 GB/s PCIe Gen5 |
| Media acceleration | 1 JPEG decoder, 5 video decoders | 1 JPEG decoder, 5 video decoders | 1 JPEG decoder, 4 video decoders | 1 video encoder, 2 video decoders (+AV1 decode) | 7 JPEG decoders, 7 video decoders | 7 JPEG decoders, 7 video decoders |
| Ray tracing | No | No | No | Yes | No | No |
| Max thermal design power | 400 W | 250† / 300‡ W | 165 W | 300 W | 700 W | 350 W |

* With sparsity | † 40 GB HBM2 | ‡ 80 GB HBM2e | ¶ Preliminary specifications, may be subject to change
24. RESOURCES
- NVIDIA GTC: https://www.nvidia.com/gtc/
- NVIDIA On-Demand: https://www.nvidia.com/en-us/on-demand/
- NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog: https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/
- NVIDIA H100 Tensor Core GPU: https://www.nvidia.com/en-us/data-center/h100/
- NVIDIA Grace CPU: https://www.nvidia.com/en-us/data-center/grace-cpu/
- NVIDIA DGX H100: https://www.nvidia.com/en-us/data-center/dgx-h100/
- NVIDIA Modulus: https://developer.nvidia.com/modulus
- NVIDIA cuQuantum SDK: https://developer.nvidia.com/cuquantum-sdk
- NVIDIA Quantum-2 InfiniBand Platform: https://www.nvidia.com/en-us/networking/quantum2/
- Developing Accelerated Code with Standard Language Parallelism | NVIDIA Technical Blog: https://developer.nvidia.com/blog/developing-accelerated-code-with-standard-language-parallelism/
- Multi-GPU Programming with Standard Parallel C++, Part 1 | NVIDIA Technical Blog: https://developer.nvidia.com/blog/multi-gpu-programming-with-standard-parallel-c-part-1/
- Multi-GPU Programming with Standard Parallel C++, Part 2 | NVIDIA Technical Blog: https://developer.nvidia.com/blog/multi-gpu-programming-with-standard-parallel-c-part-2/
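The last three links cover GPU programming with ISO C++ standard parallelism. For flavor, here is a minimal sketch of the pattern those posts build on (my own example, not taken from the posts): a standard parallel algorithm that NVIDIA's nvc++ compiler offloads to the GPU when built with -stdpar=gpu.

```cpp
// SAXPY via ISO C++ standard parallelism.
// Build for GPU:        nvc++ -stdpar=gpu saxpy.cpp
// Build for multicore:  nvc++ -stdpar=multicore saxpy.cpp
#include <algorithm>
#include <execution>
#include <vector>
#include <cstdio>

int main() {
    const std::size_t n = 1 << 20;
    const float a = 2.0f;
    std::vector<float> x(n, 1.0f), y(n, 3.0f);

    // With -stdpar=gpu this parallel algorithm runs on the GPU;
    // the source is plain ISO C++17, no vendor extensions.
    std::transform(std::execution::par_unseq,
                   x.begin(), x.end(), y.begin(), y.begin(),
                   [a](float xi, float yi) { return a * xi + yi; });

    std::printf("y[0] = %f\n", y[0]); // expect 5.0
    return 0;
}
```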
25. GTC 2022 Technical Follow-up Seminar
https://nvidia.connpass.com/event/242525/
- In this seminar, NVIDIA engineers summarize and introduce the next-generation architectures announced at GTC 2022, Hopper (GPU), Grace (CPU), and BlueField (DPU), along with the programming models, SDKs, and networking technologies that draw out their performance.
- Date: April 26 (Tue), 10:00-16:30
- Venue: online (Zoom)
- Host: NVIDIA G.K.
- Fee: free (registration required)
- Registration: sign up via the registration page
- Deadline: April 25 (Mon), 18:00
Program:
- What changes with the Hopper architecture, and what stays the same | 成瀬彰, Senior Developer Technology Engineer
- A quick skim through NVIDIA HPC software | 丹愛彦, HPC Solution Architect
- You hear "HPC+AI" a lot, but what is it, really? | 山崎和博, Deep Learning Solution Architect
- Network infrastructure for the age of the data explosion | 愛甲浩史, Marketing Manager, Networking
- The NVIDIA cuQuantum SDK, as told by its developers | 森野慎也, Senior Math Libraries Engineer, Quantum Computing
- NVIDIA Modulus: a framework for physics-ML development | 丹愛彦, HPC Solution Architect
- The latest on Magnum IO GPUDirect Storage | 佐々木邦暢, Senior Solution Architect