
An Ecosystem Built on ARM Cores, from Embedded to HPC / 2021-08-26 Arm Workshop


2021/08/26 - Arm HPC Platform Workshop

Shinnosuke Furuya

August 26, 2021



Transcript

  1. 3 “World’s Best Performing CEO” (HARVARD BUSINESS REVIEW), “100 Best Companies to Work For” (FORTUNE), “Best Places to Work in 2021” (GLASSDOOR), “World’s Best CEOs” (BARRON’S), “50 Smartest Companies” (MIT TECH REVIEW), “Most Innovative Companies” (FAST COMPANY). Founded in 1993. Jensen Huang, Founder & CEO. 19,000 employees. $16.7B in FY21.
  2. 5 TEGRA SOC GENERATIONS
     Tegra K1: CPU Arm Cortex A15 (4 cores); GPU Kepler (192 cores); Products: NVIDIA SHIELD Tablet, NVIDIA Jetson TK1, Google Chromebook.
     Tegra X1: CPU Arm Cortex A57 (4 cores); GPU Maxwell (256 cores); Products: Nintendo Switch, Nintendo Switch Lite, NVIDIA SHIELD TV, NVIDIA Jetson TX1, NVIDIA Jetson Nano.
     Parker: CPU NVIDIA Denver 2 (2 cores) + Arm Cortex A57 (4 cores); GPU Pascal (256 cores); Products: Tesla, Mercedes Benz, Magic Leap 1, NVIDIA Jetson TX2.
     Xavier: CPU NVIDIA Carmel (8 cores); GPU Volta (384 cores); Products: Toyota, NVIDIA Jetson AGX Xavier, NVIDIA Jetson Xavier NX, NVIDIA DRIVE Pegasus.
     Orin: CPU Arm Cortex A78AE (12 cores); GPU next generation; Products: Mercedes Benz, Volvo.
     Atlan: CPU new Arm; GPU next generation; Products: (many automotive companies).
  3. 6 NVIDIA ORIN: Advanced, Software-Defined Platform for Autonomous Machines
     24.5 billion transistors; 12 A78 (Hercules) ARM64 CPUs; 254 INT8 TOPS (CUDA Tensor Core GPU + DLA); 205 GB/s memory bandwidth; 4x 10 Gbps Ethernet; 8K30 decode | 4K60 encode (H.264 / H.265 / VP9); 4 R52 lock-step pairs in an integrated ASIL-D safety island; secure key storage; FUSA: ASIL-B chip | ASIL-D systematic.
  4. 7 NVIDIA DRIVE ATLAN Fusing Next Generation AI and BlueField

    Industry’s First 1,000 TOPS SoC 400 Gbps Networking with Secure Gateway ASIL-D Safety Island TOPS is the New Horsepower
  5. 8 HYPERION 8 AV PLATFORM: State-of-the-Art Advances for Data Collection, Development and Testing
     2x Orin AV computer; 1x Orin IX computer; 4x Orin + 4x MLNX 3D GT data recorder.
     Sensor suite: 8 cameras [8MP], 4 fisheyes [3MP], 3 in-cabin, 9 radar, 2 lidar.
  6. 12 GIANT MODELS PUSHING LIMITS OF EXISTING ARCHITECTURE: Requires a New Architecture
     System bandwidth bottleneck on the current x86 + 4x GPU node (DDR4 / HBM2e): GPU 8,000 GB/sec; CPU 200 GB/sec; PCIe Gen4 (effective per GPU) 16 GB/sec; Mem-to-GPU 64 GB/sec.
     [Chart: Model Size (Trillions of Parameters) by year, 2018-2023: ELMo (94M), BERT-Large (340M), GPT-2 (1.5B), Megatron-LM (8.3B), T5 (11B), Turing-NLG (17.2B), GPT-3 (175B)]
     100 TRILLION PARAMETER MODELS BY 2023
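As a rough check on the trend behind the "100 trillion parameter models by 2023" projection, the sketch below computes the year-over-year growth implied by the model sizes listed on the slide. The release years used here are approximate and assumed for illustration; the slide itself only plots the trend.

```python
# Back-of-envelope growth check for the model sizes quoted on the slide:
# ELMo (94M, ~2018) through GPT-3 (175B, ~2020); years assumed for illustration.
first_params, first_year = 94e6, 2018     # ELMo
last_params, last_year = 175e9, 2020      # GPT-3

growth_per_year = (last_params / first_params) ** (1 / (last_year - first_year))
print(f"observed growth 2018-2020: ~{growth_per_year:.0f}x per year")

# The slide projects 100 trillion parameters by 2023, starting from GPT-3 in 2020.
target_params, target_year = 100e12, 2023
implied = (target_params / last_params) ** (1 / (target_year - last_year))
print(f"implied growth for 100T by 2023: ~{implied:.1f}x per year")
```

This prints roughly 43x/year observed and about 8x/year required for the projection, which is the scale of growth the slide argues existing system bandwidth cannot keep up with.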
  7. 13 ANNOUNCING NVIDIA GRACE: Breakthrough CPU Designed for Giant-Scale AI and HPC Applications
     Fastest interconnects: >900 GB/s cache-coherent NVLink CPU-to-GPU (14x); >600 GB/s CPU-to-CPU (2x).
     Next-generation Arm Neoverse cores: >300 SPECrate2017_int_base (est.).
     Highest memory bandwidth: >500 GB/s LPDDR5x w/ ECC; >2x higher bandwidth; 10x higher energy efficiency.
     Availability 2023.
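A quick consistency check on the multipliers quoted above. It assumes the "current x86" figures from the previous slide (64 GB/sec effective PCIe Gen4 per GPU, 200 GB/sec CPU memory) as the baselines; those baselines are my reading of the deck, not stated on this slide.

```python
# Ratio check for the Grace interconnect and memory claims.
# Baselines are taken from the "current x86 architecture" figures earlier in the deck.
nvlink_cpu_to_gpu = 900    # GB/s, ">900 GB/s cache-coherent NVLink" CPU-to-GPU
pcie_gen4_per_gpu = 64     # GB/s, effective per GPU on the x86 node (assumed baseline)
lpddr5x = 500              # GB/s, ">500 GB/s LPDDR5x w/ ECC"
ddr4_cpu = 200             # GB/s, x86 CPU memory bandwidth (assumed baseline)

print(f"CPU-to-GPU: ~{nvlink_cpu_to_gpu / pcie_gen4_per_gpu:.0f}x vs PCIe Gen4")  # ~14x
print(f"CPU memory: ~{lpddr5x / ddr4_cpu:.1f}x vs DDR4")                          # ~2.5x, i.e. >2x
```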
  8. 14 TURBOCHARGED TERABYTE SCALE ACCELERATED COMPUTING: Evolving Architecture for New Workloads
     Current x86 architecture (x86 + 4x GPU, DDR4 / HBM2e): GPU 8,000 GB/sec; CPU 200 GB/sec; PCIe Gen4 (effective per GPU) 16 GB/sec; Mem-to-GPU 64 GB/sec; transfer 2TB in 30 sec.
     Integrated CPU-GPU architecture (4x Grace + 4x GPU, LPDDR5x / HBM2e): GPU 8,000 GB/sec; CPU 500 GB/sec; NVLink 500 GB/sec; Mem-to-GPU 2,000 GB/sec; transfer 2TB in 1 sec.
     3 DAYS FROM 1 MONTH: fine-tune training of a 1T model. REAL-TIME INFERENCE ON 0.5T MODEL: interactive single-node NLP inference.
     Bandwidth claims rounded to the nearest hundred for illustration. Performance results based on projections on these configurations. Grace: 8x Grace and 8x A100 with 4th Gen NVIDIA NVLink connection between CPU and GPU; x86: DGX A100. Training: 1 month of training is fine-tuning a 1T parameter model on a large custom data set on 64x Grace + 64x A100 compared to 8x DGX A100 (16x x86 + 64x A100). Inference: 530B parameter model on 8x Grace + 8x A100 compared to DGX A100.
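The "2TB in 30 sec" versus "2TB in 1 sec" comparison follows directly from the memory-to-GPU bandwidth figures on the slide. A minimal sketch of that arithmetic, assuming the per-GPU memory-to-GPU path is the limiting one, as the slide's diagram suggests:

```python
# Transfer-time arithmetic behind the x86 vs Grace comparison on this slide.
data_gb = 2 * 1000                 # 2 TB of model/data, decimal units as on the slide

mem_to_gpu_x86 = 64                # GB/s, memory-to-GPU over PCIe Gen4 (x86 node)
mem_to_gpu_grace = 2000            # GB/s, memory-to-GPU via NVLink (Grace node)

print(f"x86:   ~{data_gb / mem_to_gpu_x86:.0f} s to move 2 TB")    # ~31 s, i.e. "30 secs"
print(f"Grace: ~{data_gb / mem_to_gpu_grace:.0f} s to move 2 TB")  # ~1 s
```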
  9. 15 ANNOUNCING THE WORLD’S FASTEST SUPERCOMPUTER FOR AI 20 Exaflops

    of AI Accelerated w/ NVIDIA Grace CPU and NVIDIA GPU HPC and AI For Scientific and Commercial Apps Advance Weather, Climate, and Material Science
  10. 17 INTRODUCING NVIDIA BLUEFIELD-3 DPU First 400Gb/s Data Processing Unit

    Offloads and Accelerates Data Center Infrastructure Isolates Application from Control and Management Plane Powerful CPU – 16x Arm A78 Cores Datapath Accelerator – 16x Cores, 256 Threads Process Networking, Storage, and Security at 400 Gbps
  11. 18 INTRODUCING NVIDIA BLUEFIELD-3 DPU: First 400Gb/s Data Processing Unit
     22 billion transistors; 400Gb/s Ethernet & InfiniBand connectivity; 400Gb/s crypto acceleration; 18M IOP/s elastic block storage; 300 equivalent x86 cores.
     [Block diagram: CONNECTX-7, DATA PATH ACCELERATOR, PCIe GEN 5.0, DDR5 MEMORY INTERFACE, ARM CORES, ACCELERATION ENGINES]
  12. 19 BLUEFIELD DPU GENERATIONS
     BlueField: port speed 2x 100Gb/s InfiniBand and Ethernet; performance: bandwidth 200Gb/s, DPDK max msg rate 150Mpps, RDMA max msg rate 200Mpps; modulation NRZ; DDR channels: DDR4-2400MT/s dual channel; max Arm cores: 16x A72; embedded ASIC: ConnectX-5; PCIe Gen3.0 x32 / Gen4.0 x16.
     BlueField-2: port speed 2x 100Gb/s, 1x 200Gb/s InfiniBand and Ethernet; performance: bandwidth 200Gb/s, DPDK max msg rate 215Mpps, RDMA max msg rate 215Mpps; modulation NRZ & 50G PAM4; DDR channels: DDR4-3200MT/s single channel; max Arm cores: 8x A72; embedded ASIC: ConnectX-6 Dx; PCIe Gen4.0 x16.
     BlueField-3: port speed 1x 400Gb/s, 2x 200Gb/s InfiniBand and Ethernet; performance: bandwidth 400Gb/s, DPDK max msg rate 250Mpps, RDMA max msg rate 330Mpps; modulation NRZ & 100G PAM4; DDR channels: 2x DDR5-5600 interfaces; max Arm cores: 16x A78 (Hercules); embedded ASIC: ConnectX-7; PCIe Gen5.0 x32.
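To put the BlueField-3 message-rate figures in perspective, the back-of-envelope sketch below converts the table's numbers into a per-message time budget and the average message size needed to saturate the 400 Gb/s line rate at that rate. Only the two figures from the table are used; everything else is straightforward unit conversion.

```python
# Per-message budget implied by the BlueField-3 row of the table above.
line_rate_gbps = 400          # Gb/s line rate
rdma_msg_rate_mpps = 330      # RDMA max message rate, million messages per second

ns_per_msg = 1e9 / (rdma_msg_rate_mpps * 1e6)
bytes_per_msg_at_line_rate = (line_rate_gbps * 1e9 / 8) / (rdma_msg_rate_mpps * 1e6)

print(f"time budget per message: ~{ns_per_msg:.1f} ns")                        # ~3.0 ns
print(f"avg message size to saturate 400 Gb/s: ~{bytes_per_msg_at_line_rate:.0f} B")  # ~150 B
```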
  13. 20 NVIDIA DOCA: Enabling Broad BlueField Partner Ecosystem
     Software development framework for BlueField DPUs. Offload, accelerate, and isolate infrastructure processing. Support for hyperscale, enterprise, supercomputing and hyperconverged infrastructure. Software compatibility for generations of BlueField DPUs. DOCA is for DPUs what CUDA is for GPUs.
     [Stack diagram: CYBER SECURITY, EDGE, STORAGE, PLATFORM INFRASTRUCTURE, ORCHESTRATION, MANAGEMENT, TELEMETRY, SECURITY, NETWORKING, STORAGE, ACCELERATION LIBRARIES, DOCA]
  14. 21 BLUEFIELD-3 USE CASES: Unprecedented Innovation for Modern Data Centers
     Cloud Computing: Bare-Metal | Virtualized | Containerized; Private | Public | Hybrid Cloud.
     Cyber Security: Distributed Security | NGFW | Micro-segmentation.
     HPC & AI: Cloud-Native Supercomputing | Accelerated DLRM.
     Telco & Edge: Telco Cloud | CloudRAN | Edge Compute.
     Media Streaming: Visual High Quality | 8K Video | CDN.
     Data Storage: HCI | Elastic Block Storage | Instance Storage.
  15. 22 BLUEFIELD ENABLES CLOUD-NATIVE SUPERCOMPUTING
     Collective offload with UCC accelerator: smart MPI progression, user-defined algorithms, 1.4x higher application performance. Multi-tenancy with zero-trust security.
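The "smart MPI progression" point is about letting the DPU progress collective operations while the host CPUs keep computing. Below is a minimal mpi4py sketch of that overlap pattern; it assumes mpi4py, NumPy, and an MPI launcher are installed, and the script name in the comment is hypothetical. Whether the collective is actually offloaded to BlueField depends on the underlying MPI/UCC stack, not on this code, which only shows the communication-computation overlap that offload accelerates.

```python
# Overlap pattern that DPU collective offload targets: start a nonblocking
# allreduce, keep computing on the host, then wait for the result.
# Run with e.g.: mpirun -np 4 python overlap_allreduce.py   (script name assumed)
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

sendbuf = np.full(1 << 20, rank, dtype=np.float64)   # 1M doubles per rank
recvbuf = np.empty_like(sendbuf)

req = comm.Iallreduce(sendbuf, recvbuf, op=MPI.SUM)  # nonblocking collective

# Useful host work overlapped with the collective; with DPU offload, progressing
# the collective does not consume these host CPU cycles.
local = float(np.sin(sendbuf).sum())

req.Wait()
if rank == 0:
    print("allreduce[0] =", recvbuf[0], "| overlapped local work =", local)
```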
  16. 23 NVIDIA DPU ROADMAP: Exponential Growth in Data Center Infrastructure Processing (1X in 2020, 10X in 2022, 100X in 2024); DOCA: one architecture.
     BlueField-2 (2020): 7B transistors, 9 SPECint, 0.7 TOPS, 200 Gbps.
     BlueField-3 (2022): 22B transistors, 42 SPECint, 1.5 TOPS, 400 Gbps.
     BlueField-4 (2024): 64B transistors, 160 SPECint, 1000 TOPS, 800 Gbps.
     * BlueField-4 product to include opt-in GPU and non-GPU configurations
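The generation-over-generation ratios implied by the roadmap numbers can be read off with a small back-of-envelope sketch; the 1X/10X/100X curve on the slide is an overall infrastructure-processing claim, so the individual metrics below scale at different rates.

```python
# Ratios between BlueField generations, using only the figures from the roadmap slide.
gens = {
    "BlueField-2 (2020)": {"transistors_B": 7,  "SPECint": 9,   "TOPS": 0.7,  "Gbps": 200},
    "BlueField-3 (2022)": {"transistors_B": 22, "SPECint": 42,  "TOPS": 1.5,  "Gbps": 400},
    "BlueField-4 (2024)": {"transistors_B": 64, "SPECint": 160, "TOPS": 1000, "Gbps": 800},
}

names = list(gens)
for prev, nxt in zip(names, names[1:]):
    ratios = " ".join(f"{k}: {gens[nxt][k] / gens[prev][k]:.1f}x" for k in gens[prev])
    print(f"{prev} -> {nxt}: {ratios}")
```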
  17. 26 NVIDIA ECOSYSTEM PLATFORM
     Application frameworks: MEGATRON, cuQUANTUM, MORPHEUS, MERLIN, MAXINE, CLARA, METROPOLIS, ISAAC, DRIVE, AERIAL, RIVA.
     Platform software: FLEET COMMAND, OMNIVERSE, AI ENTERPRISE, VGPU.
     Chips & systems: GPU, CPU, DPU, GEFORCE, AGX, DGX, RTX, EGX, HGX.
  18. 27 SUMMARY
     • Tegra SoCs have a long history, and that experience has been applied to the current Xavier, the next-generation Orin, and the following generation, Atlan
     • The future car is software-defined, and NVIDIA provides the whole ecosystem, such as DRIVE Hyperion and DGX systems
     • Grace CPU is designed for giant-scale AI and HPC applications
     • BlueField-3 DPU is the first 400 Gb/s data processing unit
     • DOCA enables a broad BlueField ecosystem
     • GPU, CPU, and DPU chips make yearly leaps within one architecture