
HPC and AI Powered by GPU Computing / 2019-10-23 Dell Technologies Forum 2019


Shinnosuke Furuya

October 23, 2019


Transcript

  1. CT scans are increasingly obtained today to diagnose lung cancer, adding to an
    already unmanageable workload for radiologists. Further, very small pulmonary nodules
    are difficult to spot with the human eye. Powered by NVIDIA GPUs on the NVIDIA Clara
    platform, 12 Sigma Technologies' σ-Discover/Lung system automatically detects lung
    nodules as small as 0.01% of an image, analyzes malignancy with >90% accuracy, and
    provides a decision-support tool to radiologists. When optimized on an NVIDIA T4
    cluster, the system runs up to 18x faster.
  2. Major NVIDIA GPUs (excerpt), by architecture generation, across the HPC, GRID, and
    DL product lines:
    Fermi (2010): M2070, Quadro 6000, GTX 580. Kepler (2012): K80, K2, K520, K6000,
    GTX 780. Maxwell (2014): M40, M60, M6, M10, M6000, GTX 980. Pascal (2016): P100,
    P40, P4, P6, GP100, P5000, GTX 1080, TITAN X. Volta (2017): Tesla V100, GV100,
    TITAN V. Turing (2018): T4, RTX 8000, RTX 6000, RTX 2080 SUPER, TITAN RTX.
    Brands shown: Tesla, GRID, GeForce GTX, TITAN, Quadro.
  3. The same GPU roadmap, annotated to highlight that Fermi (2010) brought FP64
    (double-precision) arithmetic to the lineup.
  4. The same GPU roadmap, annotated to highlight that Pascal (2016) added FP16
    arithmetic at twice the FP32 throughput.
  5. The same GPU roadmap, annotated to highlight that Volta and Turing added Tensor
    Cores (FP16 inputs with FP32 accumulation): Tesla V100 delivers 15.7 TFLOPS FP32 and
    125 TFLOPS Tensor (8x). Turing further added INT8 and INT4 inference; Tesla T4
    reaches 260 TOPS at INT4.
  6. NVIDIA TESLA V100: 21 billion transistors | TSMC 12nm FFN | 815 mm² die.
    5,120 CUDA cores | 640 Tensor Cores.
    FP64 (double): 7.8 TFLOPS | FP32 (single): 15.7 TFLOPS | FP16/FP32 (mixed): 125 TFLOPS.
    20 MB register file | 16 MB cache. 32 GB HBM2 at 900 GB/s. 300 GB/s NVLink.
    The Volta GPU with Tensor Cores, built for AI and HPC. (A quick check of the peak
    figures follows below.)
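Those peak numbers follow directly from the core counts, assuming the V100 SXM2 boost clock of roughly 1,530 MHz (my assumption; the slide does not state the clock). Each CUDA core retires one FMA (2 FLOPs) per clock, and each Tensor Core retires a 4x4x4 matrix FMA (64 multiply-adds = 128 FLOPs) per clock:

```latex
\[
\underbrace{5120 \times 2 \times 1.53\,\text{GHz}}_{\text{CUDA cores, FP32}} \approx 15.7\ \text{TFLOPS},
\qquad
\underbrace{640 \times 128 \times 1.53\,\text{GHz}}_{\text{Tensor Cores, FP16/FP32}} \approx 125\ \text{TFLOPS}.
\]
```

The ratio of the two is the "8x" called out on the roadmap slide above.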
  7. TESLA T4: 320 Turing Tensor Cores | 2,560 CUDA cores.
    FP32 (single): 8.1 TFLOPS | FP16/FP32 (mixed): 65 TFLOPS | INT8: 130 TOPS | INT4: 260 TOPS.
    16 GB GDDR6 at 320 GB/s. 70 W.
    A low-power universal GPU for deep learning training and inference, HPC, and graphics.
  8. NVIDIA DGX SYSTEMS:
    DGX-2: 2 PFLOPS (mixed precision) | 16x Tesla V100 | NVSwitch | 8x IB EDR, 2x 100 GbE.
    DGX-1: 1 PFLOPS (mixed precision) | 8x Tesla V100 | NVLink | 4x IB EDR, 2x 10 GbE.
    DGX Station: 500 TFLOPS (mixed precision) | 4x Tesla V100 | NVLink | 2x 10 GbE.
  9. Volta speeds up HPC applications relative to P100. System config info: 2x Xeon
    E5-2690 v4 (2.6 GHz) with 1x Tesla P100 or V100; V100 measured on pre-production
    hardware. Summit supercomputer: 200 PetaFLOPS, 4,608 nodes.
  10. Volta sharply improves deep learning performance over P100 (ResNet-50, images per
    second): training 2.4x faster (P100 FP32 vs. V100 Tensor FP16); inference 3.7x faster
    (TensorRT, P100 FP16 vs. V100 Tensor, both at 7 ms latency).
  11. TENSOR CORES: each computes D = A × B + C as one fused multiply-add (FMA) over
    4x4 matrices per clock, where A and B are FP16 and the accumulator C (and result D)
    can be FP16 or FP32. Throughput: 128 FLOPS per clock per Tensor Core, 1,024 FLOPS
    per clock per SM (8 Tensor Cores per SM). A minimal code sketch follows below.
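For illustration, here is a minimal CUDA sketch (mine, not from the deck) of how this mixed-precision FMA is exposed to programmers through the WMMA API. It assumes a Volta-or-later GPU (compile with `nvcc -arch=sm_70`) and works on one 16x16x16 tile, the granularity the API exposes on top of the hardware's 4x4 step:

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes D = A*B + C for a single 16x16 tile on Tensor Cores.
// A and B are FP16; C and D are FP32 accumulators. Row-major, leading dimension 16.
__global__ void wmma_tile(const half *A, const half *B, const float *C, float *D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> accFrag;

    wmma::load_matrix_sync(aFrag, A, 16);                         // FP16 inputs
    wmma::load_matrix_sync(bFrag, B, 16);
    wmma::load_matrix_sync(accFrag, C, 16, wmma::mem_row_major);  // FP32 accumulator
    wmma::mma_sync(accFrag, aFrag, bFrag, accFrag);               // D = A*B + C
    wmma::store_matrix_sync(D, accFrag, 16, wmma::mem_row_major);
}

// Launch with exactly one warp: wmma_tile<<<1, 32>>>(dA, dB, dC, dD);
```

In practice one rarely writes WMMA directly; cuBLAS, cuDNN, and the framework libraries pick Tensor Core kernels automatically when the data types allow it.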
  12. The convergence of HPC and AI.
    AI: neural networks that learn patterns from data, driven by large datasets and compute.
    HPC: algorithms grounded in physical laws, simulated with validated models.
    Results of combining the two: 90% prediction accuracy, published in Nature, April
    2019; Tensor Cores achieved 1.13 EF, 2018 Gordon Bell winner; orders-of-magnitude
    speedup, 3M new compounds screened in 1 day; time-to-solution reduced from weeks to
    2 hours.
  13. Use of Tensor Cores in scientific computing. Tesla V100 peak throughput: FP64
    7.8 TFLOPS | FP32 15.7 TFLOPS | Tensor 125 TFLOPS.
    Application examples: plasma simulation (Oak Ridge), FP16 solver, 3.5x faster;
    city-scale earthquake simulation (University of Tokyo), mixed FP16-FP21-FP32-FP64
    precision, 25x faster; weather/climate modeling (Oxford), FP16/FP32/FP64, 4x faster.
  14. HPL-AI and iterative-refinement solvers: on Summit, HPL-AI recorded roughly 3x the
    FP64 performance. HPL-AI is a benchmark bridging HPC and AI workloads: HPC
    (simulation) runs in FP64, while AI (machine learning) runs in FP16/FP32. Exploiting
    Tensor Core GPUs, Summit achieved 445 PF mixed precision (HPL-AI) versus 149 PF FP64
    (HPL). Proposed by Prof. Jack Dongarra et al. (A sketch of the underlying refinement
    idea follows below.)
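How a low-precision solve can still deliver an FP64-quality answer: factor once in fast low precision, then recover accuracy with cheap high-precision corrections. The schematic below is classical mixed-precision iterative refinement, my summary rather than the deck's; HPL-AI specifically pairs a low-precision LU factorization with GMRES-based refinement:

```latex
\[
A \approx LU \ \ \text{(factor once in FP16 on Tensor Cores)};\qquad
\text{repeat:}\ \ r_k = b - A x_k \ \text{(FP64)},\quad
LU\,d_k = r_k,\quad x_{k+1} = x_k + d_k .
\]
```

The expensive $O(n^3)$ work (the factorization) runs at Tensor Core speed; only the cheap $O(n^2)$ residual and update steps run in FP64.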
  15. Models keep growing. Compute per training run: 2015, Microsoft ResNet: 7
    ExaFLOPS; 2016, Baidu Deep Speech 2: 20 ExaFLOPS; 2017, Google NMT: 105 ExaFLOPS.
    Each of these runs is far beyond what a single GPU can finish in practical time.
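To put 105 ExaFLOPS in perspective, a rough bound using the Tesla V100 figures from the earlier slides (my arithmetic, assuming perfectly sustained peak throughput, which real training never reaches):

```latex
\[
\frac{105\times 10^{18}\ \text{FLOPs}}{15.7\times 10^{12}\ \text{FLOP/s (FP32)}}
\approx 77\ \text{days},
\qquad
\frac{105\times 10^{18}\ \text{FLOPs}}{125\times 10^{12}\ \text{FLOP/s (Tensor)}}
\approx 10\ \text{days}.
\]
```

Hence the large-scale data-parallel training runs shown on the next slide.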
  16. Distributed (data-parallel) training records, ImageNet + ResNet-50:
    Processor | DL framework | Time
    Microsoft: Tesla P100 x8 | Caffe | 29 hours
    Facebook: Tesla P100 x256 | Caffe2 | 1 hour
    Google: TPUv2 x256 | TensorFlow | 30 mins
    PFN: Tesla P100 x1024 | Chainer | 15 mins
    Tencent: Tesla P40 x2048 | TensorFlow | 6.6 mins
    SONY: Tesla V100 x2176 | NNL | 3.7 mins
    Google: TPUv3 x1024 | TensorFlow | 2.2 mins
    Tesla V100 x2048 | MXNet | 75 sec
  17. NGC: the hub for GPU-optimized software. 50+ containers spanning:
    Deep learning: TensorFlow | PyTorch | more
    Machine learning: RAPIDS | H2O | more
    Inference: TensorRT | DeepStream | more
    Genomics: Parabricks
    Visualization: ParaView | IndeX | more
    HPC: NAMD | GROMACS | more
  18. Continuous performance improvements across NGC container releases (charts, 18.02
    through 19.02):
    MXNet: images/second | mixed precision | 128 batch size | ResNet-50 training | 8x V100
    PyTorch: tokens/second | mixed precision | 128 batch size | GNMT | 8x V100
    TensorFlow: images/second | mixed precision | 256 batch size | ResNet-50 training | 8x V100
    HPC applications: speedup up to ~18x across Chroma, GROMACS, LAMMPS, QE, MILC, VASP,
    SPECFEM3D, NAMD, AMBER, GTC, RTM | 4x V100 vs. dual Skylake | CUDA 9 for Mar '18 &
    Nov '18, CUDA 10 for Mar '19
  19. Why NGC: its 50+ DL, ML, and HPC containers bring GPU software to HPC sites so
    users can innovate faster, deploy anywhere, and simplify deployments.
  20. NGC model scripts: 18 Tensor Core-optimized models.
    • Optimized for Tensor Cores
    • Automatic Mixed Precision (AMP) enabled
    • Maintained by NVIDIA
    • State-of-the-art (SOTA) accuracy
    Available from:
    • NVIDIA NGC: https://ngc.nvidia.com/catalog/model-scripts
    • GitHub: https://www.github.com/NVIDIA/deeplearningexamples
    • NVIDIA NGC framework containers: https://ngc.nvidia.com/catalog/containers
  21. Main model scripts: https://developer.nvidia.com/deep-learning-examples
    Computer vision: SSD (PyTorch) | SSD (TensorFlow) | UNET-Industrial (TensorFlow) |
    UNET-Medical (TensorFlow) | ResNet-50 v1.5 (MXNet) | ResNet-50 (PyTorch) |
    ResNet-50 (TensorFlow) | Mask R-CNN (PyTorch)
    Speech & NLP: GNMT v2 (TensorFlow) | GNMT v2 (PyTorch) | Transformer (PyTorch) |
    BERT pre-training and Q&A (TensorFlow)
    Recommender systems: NCF (PyTorch) | NCF (TensorFlow)
    Text to speech: Tacotron 2 and WaveGlow (PyTorch)
  22. NGC pre-trained model files:
    • Frameworks: TensorRT, TensorFlow, PyTorch, MXNet
    • Trained on: ImageNet, MS COCO, LibriSpeech, Wikipedia/BookCorpus, and more
    • Precisions: FP32, FP16, and INT8
    • Key customer benefits:
  23. NGC CONTAINER REPLICATOR
    • Periodically mirrors NGC containers to local storage via a cron job
    • Picks up new container versions (v1 → v2) automatically on each run
    • Converts Docker images to Singularity images
    • Available on GitHub with a how-to guide
  24. Growing NGC adoption: on Japan's ABCI supercomputer, NGC containers run under
    Singularity, and 80% of jobs on the system use them.
    https://blogs.nvidia.co.jp/2019/06/19/abci-adopts-ngc/
  25. The challenge of AI adoption is designing and building suitable infrastructure.
    AI deployment among companies is expanding rapidly (adoption figures of 15% and 40%
    are cited), which makes AI-ready infrastructure the deciding factor.
    Source: 2018 CTA Market Research.
  26. Designing scalable AI infrastructure: rack design | networking | storage |
    facility | software.
    • Software: DL frameworks and HPC stacks
    • Networking: Ethernet / IB based fabric; 100 Gbps interconnect
    • Storage: IOPS-oriented; a sizing example cites 1 TB/hr ingest, 500 PB capacity,
      and ResNet-50 throughput figures
    • Facility: performance per watt; data-center power and cooling
  27. NVIDIA DGX POD™
    • A rack-level reference architecture built on NVIDIA® DGX-1™ servers
    • Based on NVIDIA's operating experience with its own DGX SATURNV cluster
    • A DGX-2™ based POD variant is also defined
    • Designed for AI infrastructure at data-center scale
  28. A reference architecture that removes the guesswork (per 35 kW rack):
    DGX-1 servers x9: Tesla V100 GPUs x8 each | NVIDIA GPUDirect™ over RDMA support |
    run at MaxQ | 100 GbE networking (up to 4x 100 GbE)
    Storage nodes x12: 192 GB RAM | 3.8 TB SSD | 100 TB HDD each (1.2 PB total HDD) |
    50 GbE networking
    Networking: in-rack 100 GbE to DGX-1 servers | in-rack 50 GbE to storage nodes |
    out-of-rack 4x 100 GbE (up to 8)
    Rack: 35 kW power | 42U x 1200 mm x 700 mm (minimum) | rear-door cooler
    Patterned on the racks of NVIDIA's own SATURNV cluster.
  29. DGX POD — DGX-1 reference architecture in a single 35 kW high-density rack. Fits
    within a standard-height 42 RU data center rack:
    • Nine DGX-1 servers (9 x 3 RU = 27 RU)
    • Twelve storage servers (12 x 1 RU = 12 RU)
    • 10 GbE (min) storage and management switch (1 RU)
    • Mellanox 100 Gbps intra-rack high-speed network switches (1 or 2 RU)
    In real-life DL application development, one to two DGX-1 servers per developer are
    often required. One DGX POD supports five developers (AV workload); each developer
    works on two experiments per day; one DGX-1 per developer per experiment per day.*
    (The arithmetic behind this sizing is checked below.)
    *300,000–0.5M images x 120 epochs @ 480 images/sec, ResNet-18 backbone detection
    network per experiment
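The footnote's sizing is self-consistent at the lower end of the stated image-count range (my arithmetic, using only the slide's own numbers):

```latex
\[
\frac{300{,}000\ \text{images} \times 120\ \text{epochs}}{480\ \text{images/s}}
= 75{,}000\ \text{s} \approx 21\ \text{h}
\approx \text{one experiment per DGX-1 per day}.
\]
```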
  30. DGX POD — DGX-2 reference architecture in a single 35 kW high-density rack. Fits
    within a standard-height 48 RU data center rack:
    • Three DGX-2 servers (3 x 10 RU = 30 RU)
    • Twelve storage servers (12 x 1 RU = 12 RU)
    • 10 GbE (min) storage and management switch (1 RU)
    • Mellanox 100 Gbps intra-rack high-speed network switches (1 or 2 RU)
    In real-life DL application development, one DGX-2 per developer minimizes model
    training time. One DGX POD supports at least three developers (AV workload); each
    developer works on two experiments per day; one DGX-2 per developer per 2
    experiments per day.*
    *300,000–0.5M images x 120 epochs @ 480 images/sec, ResNet-18 backbone detection
    network per experiment
  31. NVIDIA DGX SUPERPOD: AI leadership requires AI infrastructure leadership.
    Test bed for highest-performance scale-up systems: 9.4 PF on HPL | ~200 AI PF | #22
    on the Top500 list | <2 mins to train ResNet-50.
    Modular and scalable GPU SuperPOD architecture: built in 3 weeks | optimized for
    compute, networking, storage, and software.
    Integrates fully optimized software stacks, freely available through NGC.
    96 DGX-2H | 10 Mellanox EDR IB links per node | 1,536 V100 Tensor Core GPUs |
    1 megawatt of power.
    Workloads: autonomous vehicles | speech AI | healthcare | graphics | HPC
  32. NVIDIA DGX SUPERPOD network: Mellanox EDR 100G InfiniBand.
    • Mellanox smart director switches with in-network computing acceleration engines
    • Fast and efficient storage access with RDMA
    • Up to 130 Tb/s switching capacity per switch | ultra-low latency of 300 ns |
      integrated network manager
    • Terabit-speed InfiniBand networking per node
    Topology: 16 racks (64 DGX-2) connected to a compute backplane switch (800 Gb/s per
    node) and a storage backplane switch (GPFS, 200 Gb/s per node).
  33. MLPERF 2019: NVIDIA DGX SUPERPOD breaks AI records.
    Max scale (minutes to train): object detection, heavy weight (Mask R-CNN): 18.47
    mins | translation, recurrent (GNMT): 1.8 mins | reinforcement learning (MiniGo):
    13.57 mins
    Per accelerator (hours to train): object detection, heavy weight (Mask R-CNN):
    25.39 hrs | object detection, light weight (SSD): 3.04 hrs | translation, recurrent
    (GNMT): 2.63 hrs | translation, non-recurrent (Transformer): 2.61 hrs |
    reinforcement learning (MiniGo): 3.65 hrs
    Per-accelerator comparison uses reported performance for MLPerf 0.6 on NVIDIA
    DGX-2H (16 V100s) compared to other submissions at the same scale, except MiniGo,
    where the NVIDIA DGX-1 (8 V100s) submission was used. MLPerf IDs, max scale:
    Mask R-CNN 0.6-23, GNMT 0.6-26, MiniGo 0.6-11. MLPerf IDs, per accelerator:
    Mask R-CNN, SSD, GNMT, Transformer all 0.6-20; MiniGo 0.6-10.
  34. DGX SUPERPOD AI software stack, top to bottom:
    Applications
    Kubernetes & Slurm — cluster management, orchestration, and workload scheduling
    (DGX SuperPOD management software)
    NVIDIA GPU containers via Docker and NVIDIA GPU Cloud (NGC)
    GPU-enabled, multi-node DGX SuperPOD hardware
  35. WEBINAR — AUTOMATIC MIXED PRECISION: training with Tensor Cores on Volta / Turing
    GPUs using Automatic Mixed Precision (AMP).
    https://info.nvidia.com/jp-amp-webinar-reg-page.html
  36. WEBINAR — OpenACC for HPC: an introduction to directive-based GPU programming
    with OpenACC and how to get started with GPU acceleration.
    https://info.nvidia.com/intro-openacc-jp-reg-page.html