
The Latest in GPU Computing Supporting HPC / AI — 2019-09-10 HPE HPC & AI Forum 2019


Shinnosuke Furuya

September 10, 2019

Transcript

  1. 5 GPUs in smart agriculture: Blue River Technology (a John Deere company) applies GPU-powered deep learning in its See & Spray system, which targets herbicide at individual weeds and can cut herbicide use by up to 90%.
  2. 6 CT scans are increasingly obtained today to diagnose lung cancer, adding to an already unmanageable workload for radiologists. Further, very small pulmonary nodules are difficult to spot with the human eye. Powered by NVIDIA GPUs on the NVIDIA Clara platform, 12 Sigma Technologies' σ-Discover/Lung system automatically detects lung nodules as small as 0.01% of an image, analyzes malignancy with >90% accuracy, and provides a decision-support tool to radiologists. When optimized on an NVIDIA T4 cluster, the system runs up to 18x faster.
  3. 8 NVIDIA GPU lineup by architecture (excerpt). Fermi (2010): Tesla M2070, GeForce GTX 580, Quadro 6000. Kepler (2012): Tesla K80, GRID K2/K520, GeForce GTX 780, Quadro K6000. Maxwell (2014): Tesla M40/M60, GRID M6/M10, GeForce GTX 980, Quadro M6000. Pascal (2016): Tesla P100/P40/P4/P6, GeForce GTX 1080/TITAN X, Quadro P5000/GP100. Volta (2017): Tesla V100, GeForce TITAN V, Quadro GV100. Turing (2018): Tesla T4, GeForce RTX 2080 SUPER/TITAN RTX, Quadro RTX 8000/RTX 6000.
  4. 9 The same lineup, highlighting Fermi and its FP64 (double-precision) support.
  5. 10 The same lineup, highlighting Pascal: FP16 arithmetic at 2x the FP32 rate.
  6. 11 The same lineup, highlighting Volta / Turing: Tensor Cores that multiply in FP16 and accumulate in FP32. Tesla V100 delivers 15.7 TFLOPS FP32 and 125 TFLOPS with Tensor Cores (8x). Turing adds INT8 and INT4 support; Tesla T4 reaches 260 TOPS at INT4.
  7. 12 NVIDIA TESLA V100: 21 billion transistors | TSMC 12 nm FFN | 815 mm² die. 5,120 CUDA cores and 640 Tensor Cores. 7.8 TFLOPS FP64 | 15.7 TFLOPS FP32 | 125 Tensor TFLOPS. 20 MB register file | 16 MB cache. 32 GB HBM2 at 900 GB/s, plus 300 GB/s NVLink. The Volta Tensor Core GPU for AI and HPC.
  8. 13 TESLA T4: 2,560 CUDA cores and 320 Turing Tensor Cores. 8.1 TFLOPS FP32 | 65 TFLOPS FP16 | 130 TOPS INT8 | 260 TOPS INT4. 16 GB GDDR6 at 320 GB/s. A 70 W, low-profile GPU for deep learning, HPC, and graphics.
  9. 14 Volta HPC application speedup over P100. System config: 2x Xeon E5-2690 v4 @ 2.6 GHz with 1x Tesla P100 or V100; V100 measured on pre-production hardware. Summit supercomputer: 200 PetaFLOPS across 4,608 nodes.
  10. 15 Volta deep learning performance vs. P100 (ResNet-50): training is 2.4x faster using Tensor Core FP16 instead of FP32, and inference with TensorRT is 3.7x faster at a 7 ms latency target.
  11. 16 Tensor Core operation: D = A × B + C, where A and B are 4x4 FP16 matrices and the accumulators C and D are FP32 (or FP16). One fused multiply-add (FMA) over a 4x4 tile completes per clock: 128 FLOPS/clock per Tensor Core, or 1,024 FLOPS/clock per SM.
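The Tensor Core operation above can be sketched in plain Python: a 4x4 matrix FMA where A and B are rounded to FP16 before multiplying, while the accumulator stays in higher precision. This is a numerical illustration only, using `struct`'s binary16 format to emulate FP16 storage; it is not how the hardware is actually programmed.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to IEEE binary16 precision (emulates FP16 storage)."""
    return struct.unpack("e", struct.pack("e", x))[0]

def tensor_core_fma(A, B, C):
    """D = A x B + C for 4x4 matrices: FP16 inputs, full-precision accumulate.
    One such tile FMA is 64 multiplies + 64 adds = 128 ops, matching the
    128 FLOPS/clock per Tensor Core figure on the slide."""
    D = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            acc = C[i][j]  # accumulator kept in high precision (FP32 on hardware)
            for k in range(4):
                acc += to_fp16(A[i][k]) * to_fp16(B[k][j])
            D[i][j] = acc
    return D

I = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
C = [[0.5] * 4 for _ in range(4)]
D = tensor_core_fma(I, I, C)
print(D[0][0])  # identity x identity + 0.5 -> 1.5
```

On a V100 SM, eight Tensor Cores each retire one such 4x4 FMA per clock, which is where the 1,024 FLOPS/clock per SM figure comes from.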
  12. 17 The convergence of HPC and AI, with AI augmenting simulation and simulation feeding AI: 90% prediction accuracy, published in Nature, April 2019; Tensor Cores achieved 1.13 EF, winning the 2018 Gordon Bell Prize; orders-of-magnitude speedups, with 3M new compounds screened in 1 day; time-to-solution reduced from weeks to 2 hours.
  13. 18 Using Tensor Cores in mixed-precision computing. Tesla V100 peak throughput: 7.8 TFLOPS FP64, 15.7 TFLOPS FP32, 125 TFLOPS Tensor. Application examples: an FP16 solver running 3.5x faster; a simulation mixing FP16-FP21-FP32-FP64 running 25x faster; an FP16/FP32/FP64 code running 4x faster.
  14. 19 HPL-AI on the world's fastest supercomputer: Summit ran HPL-AI about 3x faster than HPL. HPL-AI is a mixed-precision benchmark bridging HPC and AI, proposed by Prof. Jack Dongarra et al. HPC (simulation) relies on FP64, while AI (machine learning) uses FP16 and FP32; exploiting Tensor Core GPUs, Summit reached 149 PF on FP64 HPL versus 445 PF on mixed-precision HPL-AI.
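The idea behind HPL-AI can be illustrated with mixed-precision iterative refinement: solve the system cheaply in low precision, then repeatedly correct it using a residual computed in high precision. A minimal sketch for a 2x2 system, emulating FP16 with Python's binary16 struct format; the real benchmark uses an FP16 LU factorization on Tensor Cores rather than the Cramer's-rule stand-in below.

```python
import struct

def fp16(x):
    """Round to IEEE binary16 precision, emulating FP16 arithmetic."""
    return struct.unpack("e", struct.pack("e", x))[0]

def solve2x2_fp16(A, b):
    """Solve a 2x2 system by Cramer's rule with results rounded to FP16
    (stands in for the low-precision LU solve in HPL-AI)."""
    a, b_, c, d = (fp16(v) for v in (A[0][0], A[0][1], A[1][0], A[1][1]))
    det = fp16(fp16(a * d) - fp16(b_ * c))
    r0, r1 = fp16(b[0]), fp16(b[1])
    return [fp16((fp16(d * r0) - fp16(b_ * r1)) / det),
            fp16((fp16(a * r1) - fp16(c * r0)) / det)]

def refine(A, b, iters=5):
    """Iterative refinement: low-precision solve, high-precision residual update."""
    x = solve2x2_fp16(A, b)
    for _ in range(iters):
        # residual r = b - A x, computed in full (FP64) precision
        r = [b[i] - sum(A[i][j] * x[j] for j in range(2)) for i in range(2)]
        dx = solve2x2_fp16(A, r)  # cheap correction from the low-precision solver
        x = [x[i] + dx[i] for i in range(2)]
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = refine(A, b)
# exact solution: x = [1/11, 7/11]
print(x)
```

For a well-conditioned matrix, each refinement pass shrinks the error by roughly the low-precision unit roundoff, so a handful of cheap FP16 solves recovers an answer far more accurate than any single FP16 solve.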
  15. 20 Ever-larger models: 2015, Microsoft ResNet, about 7 ExaFLOPS to train; 2016, Baidu Deep Speech 2, about 20 ExaFLOPS; 2017, Google NMT, about 105 ExaFLOPS. On a single GPU, training times stretch to weeks or months.
  16. 22 The race to train ResNet-50 on ImageNet (data-parallel scaling):
     Microsoft: Tesla P100 x8, Caffe, 29 hours
     Facebook: Tesla P100 x256, Caffe2, 1 hour
     Google: TPUv2 x256, TensorFlow, 30 mins
     PFN: Tesla P100 x1024, Chainer, 15 mins
     Tencent: Tesla P40 x2048, TensorFlow, 6.6 mins
     SONY: Tesla V100 x2176, NNL, 3.7 mins
     Google: TPUv3 x1024, TensorFlow, 2.2 mins
     Tesla V100 x2048, MXNet, 75 sec
  17. 24 NGC: a hub of GPU-optimized software, with 50+ containers. Deep learning: TensorFlow | PyTorch | more. Machine learning: RAPIDS | H2O | more. Inference: TensorRT | DeepStream | more. HPC: NAMD | GROMACS | more. Genomics: Parabricks. Visualization: ParaView | IndeX | more.
  18. 25 Monthly performance improvements across NGC container releases (chart summary): MXNet images/second, ResNet-50 training, mixed precision, batch size 128, 8x V100, rising from the 18.02 to the 19.02 container; PyTorch tokens/second, GNMT, mixed precision, batch size 128, 8x V100, rising from 18.05 to 19.02; TensorFlow images/second, ResNet-50 training, mixed precision, batch size 256, 8x V100, rising from 18.02 to 19.02; HPC applications: speedup across Chroma, GROMACS, LAMMPS, QE, MILC, VASP, SPECFEM3D, NAMD, AMBER, GTC, and RTM, 4x V100 vs. dual Skylake, CUDA 9 for Mar '18 & Nov '18 and CUDA 10 for Mar '19.
  19. 26 NGC in summary: 50+ GPU-optimized containers for DL, ML, and HPC. Innovate faster, deploy anywhere, simplify deployments.
  20. 27 NGC model scripts: 18 Tensor Core-optimized models maintained by NVIDIA, with Automatic Mixed Precision (AMP) support and state-of-the-art (SOTA) accuracy. Available from: NVIDIA NGC https://ngc.nvidia.com/catalog/model-scripts, GitHub https://www.github.com/NVIDIA/deeplearningexamples, and the NVIDIA NGC framework containers https://ngc.nvidia.com/catalog/containers.
  21. 28 Key model scripts (https://developer.nvidia.com/deep-learning-examples). Computer vision: SSD (PyTorch), SSD (TensorFlow), UNET-Industrial (TensorFlow), UNET-Medical (TensorFlow), ResNet-50 v1.5 (MXNet), ResNet-50 (PyTorch), ResNet-50 (TensorFlow), Mask R-CNN (PyTorch). Speech & NLP: GNMT v2 (TensorFlow), GNMT v2 (PyTorch), Transformer (PyTorch), BERT pre-training and Q&A (TensorFlow). Recommender systems: NCF (PyTorch), NCF (TensorFlow). Text to speech: Tacotron 2 and WaveGlow (PyTorch).
  22. 29 NGC pre-trained models. Frameworks: TensorRT, TensorFlow, PyTorch, MXNet. Datasets: ImageNet, MS COCO, LibriSpeech, Wikipedia/BookCorpus, and more. Precision: FP32, FP16, and INT8. Key customer benefit: start from tuned weights instead of training from scratch.
  23. 30 NGC Container Replicator: keeps a local mirror of NGC containers in sync. A cron job periodically checks NGC for new container versions, pulls anything missing, and can convert Docker images to Singularity images for HPC sites. Available on GitHub with a how-to guide.
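The replicator's core decision, which image tags exist upstream but are missing locally, reduces to a set difference plus one pull per missing tag. A hypothetical sketch of that cron-job pass; the function names and the `pull` callback are illustrative, not the tool's actual API, and the real replicator talks to the NGC registry and handles Singularity conversion.

```python
def tags_to_sync(remote_tags, local_tags):
    """Return NGC tags present upstream but absent from the local mirror,
    sorted so the mirror catches up in release order."""
    return sorted(set(remote_tags) - set(local_tags))

def replicate(image, remote_tags, local_tags, pull):
    """One sync pass over a single image: pull every missing tag.
    `pull` is a callback, e.g. wrapping `docker pull nvcr.io/nvidia/<image>:<tag>`."""
    pulled = []
    for tag in tags_to_sync(remote_tags, local_tags):
        pull(f"nvcr.io/nvidia/{image}:{tag}")
        pulled.append(tag)
    return pulled

# Dry run: upstream has a new 19.02 release that the mirror lacks.
remote = ["18.02", "18.09", "19.02"]
local = ["18.02", "18.09"]
log = []
print(replicate("tensorflow", remote, local, log.append))  # ['19.02']
```

Making the pass idempotent in this way is what lets it run safely from cron: a pass against an already up-to-date mirror simply pulls nothing.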
  24. 31 Growing adoption of NGC: Japan's ABCI supercomputer makes NGC containers available to its users through Singularity, and roughly 80% of its jobs use containers from NGC. Details: https://blogs.nvidia.co.jp/2019/06/19/abci-adopts-ngc/
  25. 34 CLARA AI toolkit: NVIDIA's platform for building and deploying medical imaging AI. developer.nvidia.com/clara
  26. 35 Clara Developer Platform: an intelligent compute platform for medical imaging. Hardware: Tesla GPUs, the NVIDIA DGX family, OEM systems, and cloud. Software stack: CUDA; compute libraries (cuBLAS, cuFFT, NPP, NCCL); AI libraries (cuDNN, DALI, TensorRT); visualization (OptiX, IndeX, NVENC); Kubernetes. Clara Train SDK: pre-trained models, transfer learning, AI-assisted annotation, DICOM-to-NIfTI conversion, and sample training pipelines. Clara Deploy SDK: inference pipeline manager, streaming render, data/memory optimization, DICOM adapter, sample deployment pipelines, and a dashboard.
  27. 37 AI-ASSISTED ANNOTATION, with APIs to plug into any existing medical viewer (up to 10X). Auto segmentation: segmentation is applied automatically across all slices, and the user can interactively correct the segmentation extreme points. Interactive annotation: 6-click organ annotation; the user applies auto-segmentation and then corrects the extreme points in interactive mode. The annotation and segmentation models learn continuously from user inputs.
  28. 38 TRANSFER LEARNING: a tool to create a new network from an existing DNN. Transfers weights and learned features without sharing the source data. Optimized pre-trained models with state-of-the-art augmentation and data transforms. Get started quickly with Kubeflow training pipelines created by an NVIDIA data scientist.
  29. 39 Questions? Contact our Life Science / Medical Imaging Solution Architect: Colleen Ruan, Deep Learning Solution Architect, [email protected]
  30. 43 Webinar: AUTOMATIC MIXED PRECISION (AMP), mixed-precision training on Volta / Turing Tensor Cores. https://info.nvidia.com/jp-amp-webinar-reg-page.html
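The core trick AMP automates is loss scaling: multiply the loss by a large constant so small gradients survive FP16's limited range, then divide the gradients back before the weight update. A framework-free sketch of just that arithmetic, with FP16 emulated via Python's binary16 struct format; real AMP also keeps an FP32 master copy of the weights and adjusts the scale dynamically.

```python
import struct

def fp16(x):
    """Round to IEEE binary16 precision, emulating FP16 storage."""
    return struct.unpack("e", struct.pack("e", x))[0]

def grad_in_fp16(g, scale=1.0):
    """Store a gradient in FP16 after scaling the loss by `scale`
    (gradients scale linearly with the loss), then unscale in FP32."""
    return fp16(g * scale) / scale

# a gradient below half of FP16's smallest subnormal (~6e-8) rounds to zero
tiny = 2e-8
print(grad_in_fp16(tiny))               # 0.0 -> the update is silently lost
print(grad_in_fp16(tiny, scale=1024))   # survives: unscaling recovers ~2e-8
```

Because the scale factor cancels out on unscaling, a well-chosen scale changes nothing about the converged result; it only shifts small gradients into FP16's representable range.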
  31. 44 Webinar: GPU programming for HPC with OpenACC, an introduction to directive-based GPU acceleration using OpenACC in C. https://info.nvidia.com/intro-openacc-jp-reg-page.html