HPC and AI Powered by GPU Computing / 2019-10-23 Dell Technologies Forum 2019

Shinnosuke Furuya, Ph.D., HPC Developer Relations, NVIDIA, 10/23/2019
HPC and AI Powered by GPU Computing

2 NVIDIA: AI Computing Company
1993 CEO 12,000 2018 97 1600

3 Categorize Carsales 20,000 Carsales Cyclops GPU AI Cyclops 55

4 GPU Smart Blue River Technology (John Deere)
GPU Blue River Technology See & Spray 90%

5 CT scans are increasingly obtained today to diagnose lung cancer, adding to an already unmanageable workload for radiologists. Further, very small pulmonary nodules are difficult to spot with the human eye. Powered by NVIDIA GPUs on the NVIDIA Clara platform, 12 Sigma Technologies’ σ-Discover/Lung system automatically detects lung nodules as small as 0.01% of an image, analyzes malignancy with >90% accuracy, and provides a decision support tool to radiologists. When optimized on an NVIDIA T4 cluster, the system runs up to 18x faster.

6 TESLA PLATFORM

7 NVIDIA GPU
Architectures: Fermi (2010), Kepler (2012), Maxwell (2014), Pascal (2016), Volta (2017), Turing (2018)
Tesla (HPC, DL, GRID): M2070, K80, K2, K520, M40, M60, M6, M10, P100, P40, P4, P6, V100, T4
GeForce: GTX 580, GTX 780, GTX 980, GTX 1080, TITAN X, TITAN V, RTX 2080 SUPER, TITAN RTX
Quadro: 6000, K6000, M6000, GP100, P5000, GV100, RTX 8000, RTX 6000

Fermi: FP64
Pascal: FP16, with FP16 at 2x the FP32 rate
Volta / Turing: Tensor Cores (FP16 and FP32). Tesla V100: FP32 15.7 TFLOPS, Tensor Core 125 TFLOPS (8x). Turing: INT8 and INT4. Tesla T4: INT4 260 TOPS

13 NVIDIA DGX SYSTEMS
DGX-2: 2 PFLOPS (Mixed Precision), 16x Tesla V100, NVSwitch, 8x IB EDR | 2x 100GbE
DGX Station: 500 TFLOPS (Mixed Precision), 4x Tesla V100, NVLink, 2x 10GbE
DGX-1: 1 PFLOPS (Mixed Precision), 8x Tesla V100, NVLink, 4x IB EDR | 2x 10GbE
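
The mixed-precision figures quoted for the DGX systems appear to be the per-GPU Tensor Core peak multiplied by the GPU count. A minimal sketch of that arithmetic, assuming the 125 TFLOPS Tesla V100 figure applies to every system (the function and dictionary names are illustrative, not from the slides):

```python
# Per-GPU mixed-precision peak quoted for Tesla V100 (TFLOPS).
V100_TENSOR_TFLOPS = 125

# GPU counts per system, as listed on the DGX systems slide.
DGX_GPU_COUNTS = {"DGX Station": 4, "DGX-1": 8, "DGX-2": 16}


def system_peak_tflops(gpu_count, per_gpu_tflops=V100_TENSOR_TFLOPS):
    """Aggregate peak, assuming it scales linearly with the GPU count."""
    return gpu_count * per_gpu_tflops


if __name__ == "__main__":
    for name, count in DGX_GPU_COUNTS.items():
        print(name, system_peak_tflops(count), "TFLOPS")
```

This is a peak-rate estimate only; it says nothing about sustained application performance.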

14 VOLTA HPC P100
HPC System Config Info: 2X Xeon E5-2690 v4, 2.6GHz, w/ 1X Tesla P100 or V100. V100 measured on pre-production hardware.
Summit Supercomputer: 200 PetaFlops, 4,608 Nodes

15 VOLTA: deep learning, P100 vs. V100
Training (Images per Second): V100 (Tensor) 2.4x faster than P100 (FP32)
Inference (Images per Second): V100 (Tensor) 3.7x faster than P100 (FP16), TensorRT, 7ms Latency
(*) DL: ResNet50

16 TENSOR CORES
D = A x B + C
A, B: FP16. C, D: FP32 (or FP16).
A, B and C are 4x4 matrices with elements A0,0 ... A3,3, B0,0 ... B3,3, C0,0 ... C3,3.
4x4 matrix FMA (Fused Multiply-Add) in 1 clock: 128 operations/clock/Tensor Core, 1024 operations/clock/SM
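
The Tensor Core operation above can be emulated on a CPU to see why accumulating in FP32 matters. The following is a sketch in NumPy, not how the hardware is programmed. The 8 Tensor Cores per SM follow from the slide's 1024 / 128, while the 80 SMs and 1.53 GHz boost clock for Tesla V100 are assumptions that do not appear on the slide. All function names are illustrative.

```python
import numpy as np


def tensor_core_fma(a, b, c):
    """Emulate D = A x B + C with FP16 inputs and FP32 accumulation."""
    a16 = np.asarray(a, dtype=np.float16)
    b16 = np.asarray(b, dtype=np.float16)
    c32 = np.asarray(c, dtype=np.float32)
    # Widening the FP16 inputs to FP32 keeps every product and the running
    # sum in FP32, which mimics FP32 accumulation.
    return a16.astype(np.float32) @ b16.astype(np.float32) + c32


def fp16_accumulate(values):
    """Sequential sum with an FP16 accumulator, shown for contrast."""
    acc = np.float16(0)
    for v in values:
        acc = np.float16(acc + np.float16(v))
    return acc


# A 4x4x4 matrix FMA is 64 multiply-adds, i.e. 128 operations per clock.
OPS_PER_TENSOR_CORE_PER_CLOCK = 4 * 4 * 4 * 2
TENSOR_CORES_PER_SM = 8   # implied by 1024 / 128 on the slide
SM_COUNT = 80             # assumption: Tesla V100
BOOST_CLOCK_HZ = 1.53e9   # assumption: Tesla V100 boost clock


def peak_tensor_tflops():
    """Peak Tensor Core rate from the per-clock figures."""
    ops_per_sm = OPS_PER_TENSOR_CORE_PER_CLOCK * TENSOR_CORES_PER_SM
    return ops_per_sm * SM_COUNT * BOOST_CLOCK_HZ / 1e12


if __name__ == "__main__":
    a = np.arange(16).reshape(4, 4)
    print(tensor_core_fma(a, np.eye(4), np.ones((4, 4))))
    # The FP16 accumulator stops growing at 2048, because 2049 is not
    # representable in FP16 and rounds back to 2048.
    print(fp16_accumulate([1.0] * 4096))
    print(peak_tensor_tflops())
```

Under these assumptions the per-clock figures give roughly 125 TFLOPS, which is consistent with the Tesla V100 Tensor Core peak quoted in the deck.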

17 HPC AI
AI: data patterns, neural network, response
HPC: algorithm, model
• 90% Prediction Accuracy, Published in Nature April 2019
• Tensor Cores Achieved 1.13 EF, 2018 Gordon Bell Winner
• Orders Of Magnitude Speedup, 3M New Compounds In 1 Day
• Time-to-solution Reduced From Weeks To 2 Hours

18 TENSOR CORES
Tesla V100 peak performance (TFLOPS): FP64 7.8, FP32 15.7, Tensor Core 125
• Plasma (Oak Ridge): FP16 Solver, 3.5x faster
• Simulation: FP16-FP21-FP32-FP64, 25x faster
• (Oxford): FP16/FP32/FP64, 4x faster

19 HPL-AI solver: 3x the performance on SUMMIT with HPL-AI
HPL-AI: HPC and AI
• HPC (Simulation): FP64
• AI (Machine Learning): FP16, FP32
• Tensor Core GPUs
Summit: FP64 (HPL) 149 PF, Mixed Precision (HPL-AI) 445 PF, about 3x
Proposed by Prof. Jack Dongarra, et al.
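
The slide does not spell out how a mixed-precision solver recovers FP64 accuracy. One common technique is iterative refinement: factorize once in low precision, then correct the solution using residuals computed in FP64. The sketch below illustrates that idea only and is not the HPL-AI reference implementation. It uses FP32 as a stand-in for FP16 to avoid range scaling, and it assumes a diagonally dominant matrix so that LU without pivoting is safe. All names are illustrative.

```python
import numpy as np


def lu_nopivot(a, dtype):
    """LU factorization without pivoting, carried out in the given dtype."""
    lu = np.array(a, dtype=dtype)
    n = lu.shape[0]
    for k in range(n - 1):
        lu[k + 1:, k] /= lu[k, k]
        lu[k + 1:, k + 1:] -= np.outer(lu[k + 1:, k], lu[k, k + 1:])
    return lu


def lu_solve(lu, b):
    """Forward and back substitution in the precision of the factors."""
    n = lu.shape[0]
    x = np.array(b, dtype=lu.dtype)
    for i in range(1, n):                 # L has a unit diagonal
        x[i] -= lu[i, :i] @ x[:i]
    for i in range(n - 1, -1, -1):
        x[i] = (x[i] - lu[i, i + 1:] @ x[i + 1:]) / lu[i, i]
    return x


def solve_refined(a, b, iterations=3):
    """Low-precision factorization plus FP64 iterative refinement."""
    a64 = np.asarray(a, dtype=np.float64)
    b64 = np.asarray(b, dtype=np.float64)
    lu32 = lu_nopivot(a64, np.float32)    # factorize once in low precision
    x = lu_solve(lu32, b64).astype(np.float64)
    for _ in range(iterations):
        r = b64 - a64 @ x                 # residual in FP64
        x += lu_solve(lu32, r).astype(np.float64)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 64
    a = rng.uniform(-1, 1, (n, n)) + 2 * n * np.eye(n)  # diagonally dominant
    x_true = rng.uniform(-1, 1, n)
    x = solve_refined(a, a @ x_true)
    print(np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
    print(445 / 149)  # ratio of the two Summit results quoted on the slide
```

The expensive factorization runs in the fast low-precision format, and only the cheaper residual and update steps use FP64.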

20 Models
2015 - Microsoft ResNet: 7 ExaFLOPS, 1 GPU: 1
2016 - Baidu Deep Speech 2: 20 ExaFLOPS, 1 GPU: 2.5
2017 - Google NMT: 105 ExaFLOPS, 1 GPU: 1

21 [Diagram: GPU1 and GPU2]

22 (Data parallel) ImageNet + ResNet50
Organization | Processor | DL framework | Time
Microsoft | Tesla P100 x8 | Caffe | 29 hours
Facebook | Tesla P100 x256 | Caffe2 | 1 hour
Google | TPUv2 x256 | TensorFlow | 30 mins
PFN | Tesla P100 x1024 | Chainer | 15 mins
Tencent | Tesla P40 x2048 | TensorFlow | 6.6 mins
SONY | Tesla V100 x2176 | NNL | 3.7 mins
Google | TPUv3 x1024 | TensorFlow | 2.2 mins
- | Tesla V100 x2048 | MxNet | 75 sec
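
The rows that use the same GPU (Tesla P100) allow a rough estimate of data-parallel scaling efficiency. The sketch below compares the achieved speedup with the ideal speedup implied by the GPU count. It is only indicative, because the rows differ in framework, batch size and other settings. The function name is illustrative.

```python
def scaling_efficiency(base_gpus, base_hours, gpus, hours):
    """Achieved speedup divided by the ideal (linear) speedup."""
    speedup = base_hours / hours
    ideal = gpus / base_gpus
    return speedup / ideal


if __name__ == "__main__":
    # Baseline: Tesla P100 x8, 29 hours.
    print(scaling_efficiency(8, 29.0, 256, 1.0))    # Tesla P100 x256, 1 hour
    print(scaling_efficiency(8, 29.0, 1024, 0.25))  # Tesla P100 x1024, 15 mins
```

Both comparisons come out at about 0.91, i.e. roughly 91% of linear scaling relative to the 8-GPU run.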

23 NGC

25 Performance
• MxNet: Images/Second at 18.02, 18.09, 19.02 (Mixed Precision | 128 Batch Size | ResNet-50 Training | 8x V100)
• PyTorch: Tokens/Second at 18.05, 18.09, 19.02 (Mixed Precision | 128 Batch Size | GNMT | 8x V100)
• TensorFlow: Images/Second at 18.02, 18.09, 19.02 (Mixed Precision | 256 Batch Size | ResNet-50 Training | 8x V100)
• HPC Applications: speedup at Mar '18, Nov '18, Mar '19 across Chroma, GROMACS, LAMMPS, QE, MILC, VASP, SPECFEM3D, NAMD, AMBER, GTC, RTM | 4x V100 v. Dual-Skylake | CUDA 9 for Mar '18 & Nov '18, CUDA 10 for Mar '19

26 NGC: HPC, GPU, NGC 50, DL, ML, HPC
Innovate Faster | Deploy Anywhere | Simplify Deployments

27 NGC model scripts
Tensor 18
• Tensor
• AMP
• NADIA
• Tensor SOTA
• NVIDIA NGC https://ngc.nvidia.com/catalog/model-scripts
• GitHub https://www.github.com/NVIDIA/deeplearningexamples
• NVIDIA NGC Framework containers https://ngc.nvidia.com/catalog/containers

28 Model scripts https://developer.nvidia.com/deep-learning-examples
Computer Vision
• SSD PyTorch
• SSD TensorFlow
• UNET-Industrial TensorFlow
• UNET-Medical TensorFlow
• ResNet-50 v1.5 MXNet
• ResNet-50 PyTorch
• ResNet-50 TensorFlow
• Mask R-CNN PyTorch
Speech & NLP
• GNMT v2 TensorFlow
• GNMT v2 PyTorch
• Transformer PyTorch
• BERT (Pre-training and Q&A) TensorFlow
Recommender Systems
• NCF PyTorch
• NCF TensorFlow
Text to Speech
• Tacotron2 and WaveGlow PyTorch

29 NGC model files
• TensorRT, TensorFlow, PyTorch, MXNet
• ImageNet, MSCOCO, LibriSpeech, Wikipedia/BookCorpus
• FP32, FP16, and INT8
• Key customer benefits:

30 NGC CONTAINER REPLICATOR
• NGC
• Singularity
• Github | How-to guide
[Diagram: over time, a cron job copies new NGC container versions (v1, v2) into Docker and Singularity images]

31 NGC 200 800 Singularity 10 jobs 80% NGC containers
https://blogs.nvidia.co.jp/2019/06/19/abci-adopts-ngc/

32 NGC Container User Guide https://www.nvidia.com/content/dam/en-zz/ja/Solutions/cloud/NGC-User-Guide_JA.pdf

33 AI reference architecture

34 AI 15% 4 40% AI
source: 2018 CTA Market Research

35 Scale, AI
Rack | Network | Storage | Facility | Software
• DL
• HPC
• Ethernet / IB based fabric
• 100Gbps interconnect
• IOPS
• per Watt DC
• 1TB / hr
• 500 PB
• RN50: 113
• 7
• 6 = 97

36 NVIDIA DGX POD™
• NVIDIA® DGX-1™
• NVIDIA DGX SATURNV
• NVIDIA DGX-2™
• POD
• AI

37 Reference architecture
DGX-1 Servers (9)
• Tesla V100 GPUs x 8
• NVIDIA GPUDirect™ over RDMA support
• Run at MaxQ
• 100 GbE networking (up to 4 x 100 GbE)
Storage nodes (12)
• 192 GB RAM
• 3.8 TB SSD
• 100 TB HDD (1.2 PB Total HDD)
• 50 GbE networking
Network
• In-rack: 100 GbE to DGX-1 servers
• In-rack: 50 GbE to storage nodes
• Out-of-rack: 4 x 100 GbE (up to 8)
Rack
• 35 kW Power
• 42U x 1200 mm x 700 mm (minimum)
• Rear Door Cooler
POD: DGX-1 POD • NVIDIA DGX POD • SATURNV

38 DGX POD: DGX-1 Reference Architecture in a Single 35 kW High-Density Rack
Fit within a standard-height 42 RU data center rack
• Nine DGX-1 servers (9 x 3 RU = 27 RU)
• Twelve storage servers (12 x 1 RU = 12 RU)
• 10 GbE (min) storage and management switch (1 RU)
• Mellanox 100 Gbps intra-rack high speed network switches (1 or 2 RU)
In real-life DL application development, one to two DGX-1 servers per developer are often required
• One DGX POD supports five developers (AV workload)
• Each developer works on two experiments per day
• One DGX-1/developer/experiment/day*
*300,000 to 0.5M images * 120 epochs @ 480 images/sec, ResNet-18 backbone detection network per experiment
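
The footnote on experiment size implies a training time per experiment. The sketch below works it out, assuming that 480 images/sec is the sustained training throughput of the system running the experiment. The slide does not say which system that is, and the function name is illustrative.

```python
def training_hours(images, epochs, images_per_sec):
    """Wall-clock hours to process images * epochs at a fixed throughput."""
    return images * epochs / images_per_sec / 3600


if __name__ == "__main__":
    for dataset_size in (300_000, 500_000):
        print(dataset_size, round(training_hours(dataset_size, 120, 480), 1), "hours")
```

Under that assumption an experiment takes roughly 21 to 35 hours, which is on the order of the one experiment per day quoted on the slide.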

39 DGX POD: DGX-2 Reference Architecture in a Single 35 kW High-Density Rack
Fit within a standard-height 48 RU data center rack
• Three DGX-2 servers (3 x 10 RU = 30 RU)
• Twelve storage servers (12 x 1 RU = 12 RU)
• 10 GbE (min) storage and management switch (1 RU)
• Mellanox 100 Gbps intra-rack high speed network switches (1 or 2 RU)
In real-life DL application development, one DGX-2 per developer minimizes model training time
• One DGX POD supports at least three developers (AV workload)
• Each developer works on two experiments per day
• One DGX-2/developer/2 experiments/day*
*300,000 to 0.5M images * 120 epochs @ 480 images/sec, ResNet-18 backbone detection network per experiment
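
The rack-unit budgets of the two DGX POD designs can be checked with a few lines of arithmetic. The sketch below takes the worst case of 2 RU for the high speed network switches. The function name and its parameters are illustrative.

```python
def rack_units(servers, ru_per_server, storage_servers=12,
               mgmt_switch_ru=1, network_switch_ru=2):
    """Total rack units: compute servers, 1 RU storage servers, switches."""
    return (servers * ru_per_server
            + storage_servers * 1
            + mgmt_switch_ru
            + network_switch_ru)


if __name__ == "__main__":
    print("DGX-1 POD:", rack_units(9, 3), "of 42 RU")
    print("DGX-2 POD:", rack_units(3, 10), "of 48 RU")
```

With the 2 RU network option the DGX-1 POD fills its 42 RU rack exactly, while the DGX-2 POD uses 45 of 48 RU.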

40 NVIDIA DGX-READY SOLUTION PARTNERS
NVIDIA Partners Ready to Build Your DGX SuperPODs

41 NVIDIA DGX SUPERPOD: AI LEADERSHIP REQUIRES AI INFRASTRUCTURE LEADERSHIP
Test Bed for Highest Performance Scale-Up Systems
• 9.4 PF on HPL | ~200 AI PF | #22 on Top500 list
• <2 mins To Train RN-50
Modular & Scalable GPU SuperPOD Architecture
• Built in 3 Weeks
• Optimized For Compute, Networking, Storage & Software
Integrates Fully Optimized Software Stacks
• Freely Available Through NGC
96 DGX-2H • 10 Mellanox EDR IB per node • 1,536 V100 Tensor Core GPUs • 1 megawatt of power
Autonomous Vehicles | Speech AI | Healthcare | Graphics | HPC

42 NVIDIA DGX SUPERPOD: Mellanox EDR 100G InfiniBand Network
Mellanox Smart Director Switches
• In-Network Computing Acceleration Engines
• Fast and Efficient Storage Access with RDMA
• Up to 130Tb/s Switching Capacity per Switch
• Ultra-Low Latency of 300ns
• Integrated Network Manager
• Terabit-Speed InfiniBand Networking per Node
[Diagram: Rack 1 … Rack 16, 64 DGX-2, GPFS, Compute Backplane Switch, Storage Backplane Switch, 200 Gb/s per node, 800 Gb/s per node]
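
The headline DGX SuperPOD figures are mutually consistent, which a short calculation shows. This sketch assumes 16 V100 GPUs per DGX-2H, 100 Gb/s per EDR InfiniBand link, and the 125 TFLOPS Tesla V100 Tensor Core peak. The constant and function names are illustrative.

```python
NODES = 96                 # DGX-2H systems
GPUS_PER_NODE = 16         # assumption: V100 GPUs per DGX-2H
EDR_LINKS_PER_NODE = 10    # Mellanox EDR IB links per node
EDR_GBPS = 100             # assumption: Gb/s per EDR link
V100_TENSOR_TFLOPS = 125   # assumption: per-GPU mixed-precision peak


def total_gpus():
    return NODES * GPUS_PER_NODE


def node_bandwidth_gbps():
    return EDR_LINKS_PER_NODE * EDR_GBPS


def ai_petaflops():
    return total_gpus() * V100_TENSOR_TFLOPS / 1000


if __name__ == "__main__":
    print(total_gpus())           # GPU count across all nodes
    print(node_bandwidth_gbps())  # equals the 800 + 200 Gb/s per node
    print(ai_petaflops())         # compare with "~200 AI PF"
```

The per-node total of 1000 Gb/s matches both the sum of the two per-node figures in the diagram and the "Terabit-Speed InfiniBand Networking per Node" claim.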

43 MLPERF 2019: NVIDIA DGX SUPERPOD BREAKS AI RECORDS
Record Type | Benchmark | Record
Max Scale (Minutes to Train)
• Object Detection (Heavy Weight) Mask R-CNN | 18.47 Mins
• Translation (Recurrent) GNMT | 1.8 Mins
• Reinforcement Learning (MiniGo) | 13.57 Mins
Per Accelerator (Hours to Train)
• Object Detection (Heavy Weight) Mask R-CNN | 25.39 Hrs
• Object Detection (Light Weight) SSD | 3.04 Hrs
• Translation (Recurrent) GNMT | 2.63 Hrs
• Translation (Non-recurrent) Transformer | 2.61 Hrs
• Reinforcement Learning (MiniGo) | 3.65 Hrs
Per Accelerator comparison using reported performance for MLPerf 0.6 NVIDIA DGX-2H (16 V100s) compared to other submissions at same scale, except for MiniGo, where the NVIDIA DGX-1 (8 V100s) submission was used | MLPerf ID Max Scale: Mask R-CNN: 0.6-23, GNMT: 0.6-26, MiniGo: 0.6-11 | MLPerf ID Per Accelerator: Mask R-CNN, SSD, GNMT, Transformer: all use 0.6-20, MiniGo: 0.6-10

44 DGX SUPERPOD AI software stack
• DGX SuperPOD Multi-Node
• NVIDIA GPU Containers: Docker
• NVIDIA GPU Cloud Applications
• Kubernetes & Slurm
• DGX SuperPOD Mgmt Software: Cluster Mgmt, Orchestration, Workload Scheduler
• GPU Enabled


46 SNS
Facebook: NVIDIA AI Japan | https://www.facebook.com/NVIDIAAI.JP
Twitter: @NVIDIAAIJP | https://twitter.com/NVIDIAAIJP
Follow and Like us!

48 WEBINAR: AUTOMATIC MIXED PRECISION
Volta / Turing, Tensor, Automatic Mixed Precision (AMP)
https://info.nvidia.com/jp-amp-webinar-reg-page.html

49 WEBINAR: HPC, OpenACC, GPU
https://info.nvidia.com/intro-openacc-jp-reg-page.html
