
The Latest in GPU Computing Supporting HPC / AI — 2019-09-10 HPE HPC & AI Forum 2019


Shinnosuke Furuya

September 10, 2019

Transcript

  1. 5 GPUs in smart agriculture: Blue River Technology (a John Deere company) applies GPU-powered deep learning in its See & Spray system, which targets herbicide at individual weeds and can cut herbicide use by up to 90%.
  2. 6 CT scans are increasingly obtained today to diagnose lung cancer, adding to an already unmanageable workload for radiologists. Further, very small pulmonary nodules are difficult to spot with the human eye. Powered by NVIDIA GPUs on the NVIDIA Clara platform, 12 Sigma Technologies' σ-Discover/Lung system automatically detects lung nodules as small as 0.01% of an image, analyzes malignancy with >90% accuracy, and provides a decision-support tool to radiologists. When optimized on an NVIDIA T4 cluster, the system runs up to 18x faster.
  3. 8 NVIDIA GPU lineup by architecture (excerpt). Fermi (2010): Tesla M2070, GeForce GTX 580, Quadro 6000. Kepler (2012): Tesla K80, GRID K2/K520, GeForce GTX 780, Quadro K6000. Maxwell (2014): Tesla M40/M60, GRID M6/M10, GeForce GTX 980, Quadro M6000. Pascal (2016): Tesla P100/P40/P4/P6, GeForce GTX 1080/TITAN X, Quadro P5000/GP100. Volta (2017): Tesla V100, GeForce TITAN V, Quadro GV100. Turing (2018): Tesla T4, GeForce RTX 2080 SUPER/TITAN RTX, Quadro RTX 8000/RTX 6000.
  4. 9 The same lineup, highlighting Fermi and its FP64 (double-precision) support.
  5. 10 The same lineup, highlighting Pascal: FP16 arithmetic at 2x the FP32 rate.
  6. 11 The same lineup, highlighting Volta / Turing: Tensor Cores that multiply in FP16 and accumulate in FP32. Tesla V100 delivers 15.7 TFLOPS FP32 and 125 TFLOPS with Tensor Cores (8x). Turing adds INT8 and INT4 support; Tesla T4 reaches 260 TOPS at INT4.
  7. 12 NVIDIA TESLA V100: 21 billion transistors | TSMC 12 nm FFN | 815 mm² die. 5,120 CUDA cores and 640 Tensor Cores. 7.8 TFLOPS FP64 | 15.7 TFLOPS FP32 | 125 Tensor TFLOPS. 20 MB register file | 16 MB cache. 32 GB HBM2 at 900 GB/s, plus 300 GB/s NVLink. The Volta Tensor Core GPU for AI and HPC.
  8. 13 TESLA T4: 2,560 CUDA cores and 320 Turing Tensor Cores. 8.1 TFLOPS FP32 | 65 TFLOPS FP16 | 130 TOPS INT8 | 260 TOPS INT4. 16 GB GDDR6 at 320 GB/s. A 70 W, low-profile GPU for deep learning, HPC, and graphics.
  9. 14 Volta HPC application speedup over P100. System config: 2x Xeon E5-2690 v4 @ 2.6 GHz with 1x Tesla P100 or V100; V100 measured on pre-production hardware. Summit supercomputer: 200 PetaFLOPS across 4,608 nodes.
  10. 15 Volta deep learning performance vs. P100 (ResNet-50): training is 2.4x faster using Tensor Core FP16 instead of FP32, and inference with TensorRT is 3.7x faster at a 7 ms latency target.
  11. 16 Tensor Core operation: D = A × B + C, where A and B are 4x4 FP16 matrices and the accumulators C and D are FP32 (or FP16). One fused multiply-add (FMA) over a 4x4 tile completes per clock: 128 FLOPS/clock per Tensor Core, or 1,024 FLOPS/clock per SM.
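The Tensor Core operation above can be sketched in plain Python: a 4x4 matrix FMA where A and B are rounded to FP16 before multiplying, while the accumulator stays in higher precision. This is a numerical illustration only, using `struct`'s binary16 format to emulate FP16 storage; it is not how the hardware is actually programmed.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to IEEE binary16 precision (emulates FP16 storage)."""
    return struct.unpack("e", struct.pack("e", x))[0]

def tensor_core_fma(A, B, C):
    """D = A x B + C for 4x4 matrices: FP16 inputs, full-precision accumulate.
    One such tile FMA is 64 multiplies + 64 adds = 128 ops, matching the
    128 FLOPS/clock per Tensor Core figure on the slide."""
    D = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            acc = C[i][j]  # accumulator kept in high precision (FP32 on hardware)
            for k in range(4):
                acc += to_fp16(A[i][k]) * to_fp16(B[k][j])
            D[i][j] = acc
    return D

I = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
C = [[0.5] * 4 for _ in range(4)]
D = tensor_core_fma(I, I, C)
print(D[0][0])  # identity x identity + 0.5 -> 1.5
```

On a V100 SM, eight Tensor Cores each retire one such 4x4 FMA per clock, which is where the 1,024 FLOPS/clock per SM figure comes from.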
  12. 17 The convergence of HPC and AI, with AI augmenting simulation and simulation feeding AI: 90% prediction accuracy, published in Nature, April 2019; Tensor Cores achieved 1.13 EF, winning the 2018 Gordon Bell Prize; orders-of-magnitude speedups, with 3M new compounds screened in 1 day; time-to-solution reduced from weeks to 2 hours.
  13. 18 Using Tensor Cores in mixed-precision computing. Tesla V100 peak throughput: 7.8 TFLOPS FP64, 15.7 TFLOPS FP32, 125 TFLOPS Tensor. Application examples: an FP16 solver running 3.5x faster; a simulation mixing FP16-FP21-FP32-FP64 running 25x faster; an FP16/FP32/FP64 code running 4x faster.
  14. 19 HPL-AI on the world's fastest supercomputer: Summit ran HPL-AI about 3x faster than HPL. HPL-AI is a mixed-precision benchmark bridging HPC and AI, proposed by Prof. Jack Dongarra et al. HPC (simulation) relies on FP64, while AI (machine learning) uses FP16 and FP32; exploiting Tensor Core GPUs, Summit reached 149 PF on FP64 HPL versus 445 PF on mixed-precision HPL-AI.
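The idea behind HPL-AI can be illustrated with mixed-precision iterative refinement: solve the system cheaply in low precision, then repeatedly correct it using a residual computed in high precision. A minimal sketch for a 2x2 system, emulating FP16 with Python's binary16 struct format; the real benchmark uses an FP16 LU factorization on Tensor Cores rather than the Cramer's-rule stand-in below.

```python
import struct

def fp16(x):
    """Round to IEEE binary16 precision, emulating FP16 arithmetic."""
    return struct.unpack("e", struct.pack("e", x))[0]

def solve2x2_fp16(A, b):
    """Solve a 2x2 system by Cramer's rule with results rounded to FP16
    (stands in for the low-precision LU solve in HPL-AI)."""
    a, b_, c, d = (fp16(v) for v in (A[0][0], A[0][1], A[1][0], A[1][1]))
    det = fp16(fp16(a * d) - fp16(b_ * c))
    r0, r1 = fp16(b[0]), fp16(b[1])
    return [fp16((fp16(d * r0) - fp16(b_ * r1)) / det),
            fp16((fp16(a * r1) - fp16(c * r0)) / det)]

def refine(A, b, iters=5):
    """Iterative refinement: low-precision solve, high-precision residual update."""
    x = solve2x2_fp16(A, b)
    for _ in range(iters):
        # residual r = b - A x, computed in full (FP64) precision
        r = [b[i] - sum(A[i][j] * x[j] for j in range(2)) for i in range(2)]
        dx = solve2x2_fp16(A, r)  # cheap correction from the low-precision solver
        x = [x[i] + dx[i] for i in range(2)]
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = refine(A, b)
# exact solution: x = [1/11, 7/11]
print(x)
```

For a well-conditioned matrix, each refinement pass shrinks the error by roughly the low-precision unit roundoff, so a handful of cheap FP16 solves recovers an answer far more accurate than any single FP16 solve.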
  15. 20 Ever-larger models: 2015, Microsoft ResNet, about 7 ExaFLOPS to train; 2016, Baidu Deep Speech 2, about 20 ExaFLOPS; 2017, Google NMT, about 105 ExaFLOPS. On a single GPU, training times stretch to weeks or months.
  16. 22 The race to train ResNet-50 on ImageNet (data-parallel scaling):
     Microsoft: Tesla P100 x8, Caffe, 29 hours
     Facebook: Tesla P100 x256, Caffe2, 1 hour
     Google: TPUv2 x256, TensorFlow, 30 mins
     PFN: Tesla P100 x1024, Chainer, 15 mins
     Tencent: Tesla P40 x2048, TensorFlow, 6.6 mins
     SONY: Tesla V100 x2176, NNL, 3.7 mins
     Google: TPUv3 x1024, TensorFlow, 2.2 mins
     Tesla V100 x2048, MXNet, 75 sec
  17. 24 NGC: a hub of GPU-optimized software, with 50+ containers. Deep learning: TensorFlow | PyTorch | more. Machine learning: RAPIDS | H2O | more. Inference: TensorRT | DeepStream | more. HPC: NAMD | GROMACS | more. Genomics: Parabricks. Visualization: ParaView | IndeX | more.
  18. 25 Monthly performance improvements across NGC container releases (chart summary): MXNet images/second, ResNet-50 training, mixed precision, batch size 128, 8x V100, rising from the 18.02 to the 19.02 container; PyTorch tokens/second, GNMT, mixed precision, batch size 128, 8x V100, rising from 18.05 to 19.02; TensorFlow images/second, ResNet-50 training, mixed precision, batch size 256, 8x V100, rising from 18.02 to 19.02; HPC applications: speedup across Chroma, GROMACS, LAMMPS, QE, MILC, VASP, SPECFEM3D, NAMD, AMBER, GTC, and RTM, 4x V100 vs. dual Skylake, CUDA 9 for Mar '18 & Nov '18 and CUDA 10 for Mar '19.
  19. 26 NGC in summary: 50+ GPU-optimized containers for DL, ML, and HPC. Innovate faster, deploy anywhere, simplify deployments.
  20. 27 NGC model scripts: 18 Tensor Core-optimized models maintained by NVIDIA, with Automatic Mixed Precision (AMP) support and state-of-the-art (SOTA) accuracy. Available from: NVIDIA NGC https://ngc.nvidia.com/catalog/model-scripts, GitHub https://www.github.com/NVIDIA/deeplearningexamples, and the NVIDIA NGC framework containers https://ngc.nvidia.com/catalog/containers.
  21. 28 Key model scripts (https://developer.nvidia.com/deep-learning-examples). Computer vision: SSD (PyTorch), SSD (TensorFlow), UNET-Industrial (TensorFlow), UNET-Medical (TensorFlow), ResNet-50 v1.5 (MXNet), ResNet-50 (PyTorch), ResNet-50 (TensorFlow), Mask R-CNN (PyTorch). Speech & NLP: GNMT v2 (TensorFlow), GNMT v2 (PyTorch), Transformer (PyTorch), BERT pre-training and Q&A (TensorFlow). Recommender systems: NCF (PyTorch), NCF (TensorFlow). Text to speech: Tacotron 2 and WaveGlow (PyTorch).
  22. 29 NGC pre-trained models. Frameworks: TensorRT, TensorFlow, PyTorch, MXNet. Datasets: ImageNet, MS COCO, LibriSpeech, Wikipedia/BookCorpus, and more. Precision: FP32, FP16, and INT8. Key customer benefit: start from tuned weights instead of training from scratch.
  23. 30 NGC Container Replicator: keeps a local mirror of NGC containers in sync. A cron job periodically checks NGC for new container versions, pulls anything missing, and can convert Docker images to Singularity images for HPC sites. Available on GitHub with a how-to guide.
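The replicator's core decision, which image tags exist upstream but are missing locally, reduces to a set difference plus one pull per missing tag. A hypothetical sketch of that cron-job pass; the function names and the `pull` callback are illustrative, not the tool's actual API, and the real replicator talks to the NGC registry and handles Singularity conversion.

```python
def tags_to_sync(remote_tags, local_tags):
    """Return NGC tags present upstream but absent from the local mirror,
    sorted so the mirror catches up in release order."""
    return sorted(set(remote_tags) - set(local_tags))

def replicate(image, remote_tags, local_tags, pull):
    """One sync pass over a single image: pull every missing tag.
    `pull` is a callback, e.g. wrapping `docker pull nvcr.io/nvidia/<image>:<tag>`."""
    pulled = []
    for tag in tags_to_sync(remote_tags, local_tags):
        pull(f"nvcr.io/nvidia/{image}:{tag}")
        pulled.append(tag)
    return pulled

# Dry run: upstream has a new 19.02 release that the mirror lacks.
remote = ["18.02", "18.09", "19.02"]
local = ["18.02", "18.09"]
log = []
print(replicate("tensorflow", remote, local, log.append))  # ['19.02']
```

Making the pass idempotent in this way is what lets it run safely from cron: a pass against an already up-to-date mirror simply pulls nothing.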
  24. 31 Growing adoption of NGC: Japan's ABCI supercomputer makes NGC containers available to its users through Singularity, and roughly 80% of its jobs use containers from NGC. Details: https://blogs.nvidia.co.jp/2019/06/19/abci-adopts-ngc/
  25. 34 CLARA AI toolkit: NVIDIA's platform for building and deploying medical imaging AI. developer.nvidia.com/clara
  26. 35 Clara Developer Platform: an intelligent compute platform for medical imaging. Hardware: Tesla GPUs, the NVIDIA DGX family, OEM systems, and cloud. Software stack: CUDA; compute libraries (cuBLAS, cuFFT, NPP, NCCL); AI libraries (cuDNN, DALI, TensorRT); visualization (OptiX, IndeX, NVENC); Kubernetes. Clara Train SDK: pre-trained models, transfer learning, AI-assisted annotation, DICOM-to-NIfTI conversion, and sample training pipelines. Clara Deploy SDK: inference pipeline manager, streaming render, data/memory optimization, DICOM adapter, sample deployment pipelines, and a dashboard.
  27. 37 AI-ASSISTED ANNOTATION, with APIs to plug into any existing medical viewer (up to 10X). Auto segmentation: segmentation is applied automatically across all slices, and the user can interactively correct the segmentation extreme points. Interactive annotation: 6-click organ annotation; the user applies auto-segmentation and then corrects the extreme points in interactive mode. The annotation and segmentation models learn continuously from user inputs.
  28. 38 TRANSFER LEARNING: a tool to create a new network from an existing DNN. Transfers weights and learned features without sharing the source data. Optimized pre-trained models with state-of-the-art augmentation and data transforms. Get started quickly with Kubeflow training pipelines created by an NVIDIA data scientist.
  29. 39 Questions? Contact our Life Science / Medical Imaging Solution Architect: Colleen Ruan, Deep Learning Solution Architect, [email protected]
  30. 43 Webinar: AUTOMATIC MIXED PRECISION (AMP), mixed-precision training on Volta / Turing Tensor Cores. https://info.nvidia.com/jp-amp-webinar-reg-page.html
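The core trick AMP automates is loss scaling: multiply the loss by a large constant so small gradients survive FP16's limited range, then divide the gradients back before the weight update. A framework-free sketch of just that arithmetic, with FP16 emulated via Python's binary16 struct format; real AMP also keeps an FP32 master copy of the weights and adjusts the scale dynamically.

```python
import struct

def fp16(x):
    """Round to IEEE binary16 precision, emulating FP16 storage."""
    return struct.unpack("e", struct.pack("e", x))[0]

def grad_in_fp16(g, scale=1.0):
    """Store a gradient in FP16 after scaling the loss by `scale`
    (gradients scale linearly with the loss), then unscale in FP32."""
    return fp16(g * scale) / scale

# a gradient below half of FP16's smallest subnormal (~6e-8) rounds to zero
tiny = 2e-8
print(grad_in_fp16(tiny))               # 0.0 -> the update is silently lost
print(grad_in_fp16(tiny, scale=1024))   # survives: unscaling recovers ~2e-8
```

Because the scale factor cancels out on unscaling, a well-chosen scale changes nothing about the converged result; it only shifts small gradients into FP16's representable range.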
  31. 44 Webinar: GPU programming for HPC with OpenACC, an introduction to directive-based GPU acceleration using OpenACC in C. https://info.nvidia.com/intro-openacc-jp-reg-page.html