Deep Learning Environment Setup and Model Training in Practice

Slide 1

Slide 1 text

2017 Taiwan AI Conference (台灣人工智慧年會)
Deep Learning Environment Setup and Model Training in Practice
宋政隆 ([email protected])

Slide 2

Slide 2 text

About Me
● 2001 - 2010
  ○ IASL, IIS, SINICA
● 2010 - 2012
  ○ MAGICLabs, HTC
● 2012 - 2014
  ○ Cloudioh (start-up)
● 2014 -
  ○ Studio Engineering, HTC
  ○ Healthcare, HTC
[Diagram labels: Neural Network, Network & AI, Cloud Computing, Big Data, Deep Neural Network, NLP, AI & everything...]

Slide 3

Slide 3 text

Tricorder
Ref: https://www.htc.com/tw/about/newsroom/2017/2017-04-13/

Slide 4

Slide 4 text

Cloud Service Infrastructure of Tricorder
● API Frontend (Golang)
  ○ Receive image
  ○ Analysis
  ○ Data sync upon notification of result
● Data model stored in AWS S3
● Database: AWS EMR (HBase)
● Service infrastructure:
  ○ ZooKeeper (Service Discovery)
  ○ NATS Messaging
  ○ Redis

Slide 5

Slide 5 text

CDC1922
Ref: https://deepq.com/article/CDC1922Bot

Slide 6

Slide 6 text

Cloud Service Infrastructure of CDC1922 LINE Bot
● Webhook Endpoint Service (Golang)
  ○ Flu vaccine FAQ
  ○ Guide to clinics offering publicly funded and self-paid vaccination
  ○ Side-effect tracking
● Database: GCP Cloud SQL (MySQL)
● Service infrastructure:
  ○ Etcd (Service Discovery)
  ○ GKE (container deployment, scaling, and management)
  ○ Redis

Slide 7

Slide 7 text

ai.deepq.com

Slide 8

Slide 8 text


Slide 9

Slide 9 text


Slide 10

Slide 10 text

Cloud Computing to Help Deep Learning

Slide 11

Slide 11 text

What Makes Deep Learning Successful
[Diagram labels: Data, Computing, Algorithm]

Slide 12

Slide 12 text

Our Deep Learning Teams
[Diagram labels: Deeper Models; Big Data; Deep Learning Application; VR / AR / Healthcare; Successful Deep Learning Application; Hard to learn; Computationally intensive; Deep Learning Algorithm; New Learning Techniques; Deep Learning Infrastructure; Parallel + multiple GPUs; Specific Knowledge; Adopt & Integrate]
https://research.htc.com/

Slide 13

Slide 13 text

Deep Learning Workflow
[Diagram labels: Get Data, Process Data, Training, Network Design, Parameters, Evaluate, Deployment]

Slide 14

Slide 14 text

Challenges in Training Models
• Algorithmic aspect
  – Finding a good set of parameters
  – Data augmentation / Initialization / Pre-processing / …
• System aspect
  – Managing lots of computing resources
  – Performance tuning

Slide 15

Slide 15 text

What Makes Deep Learning Engineering Hard
[Diagram labels: Cheap, Quality, Fast]

Slide 16

Slide 16 text

The Importance of Model Training
● Scale of data and scale of computation infrastructures together enable the current deep learning renaissance
● Fast model training is the key to the success of deep learning algorithm development
● Efficient training demands both algorithmic improvement and careful system configuration

Slide 17

Slide 17 text

Algorithmic Approach
● Optimizer (Gradient Descent / Adam / Adagrad / …)
● Adjust hyper-parameters of gradient-based training
● Faster kernel implementations
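To make the optimizer bullet concrete, here is a minimal sketch comparing plain gradient descent with Adam on a toy one-dimensional loss. The loss function, the learning rates, and the step counts are assumptions chosen for illustration; they are not taken from the talk.

```python
import math


def grad(w):
    # Gradient of the toy loss f(w) = (w - 3)^2, whose minimum is at w = 3.
    return 2.0 * (w - 3.0)


def gradient_descent(w, lr=0.1, steps=300):
    # Plain gradient descent: step against the gradient with a fixed learning rate.
    for _ in range(steps):
        w -= lr * grad(w)
    return w


def adam(w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=300):
    # Adam: keep running averages of the gradient (m) and squared gradient (v),
    # correct their start-up bias, and scale each step by sqrt(v_hat).
    m = 0.0
    v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


if __name__ == "__main__":
    print("gradient descent:", gradient_descent(0.0))
    print("adam:            ", adam(0.0))
```

Both optimizers should end close to w = 3 on this toy problem. On real networks the choice of optimizer and its hyper-parameters changes how many iterations training needs, which is the point of the slide.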

Slide 18

Slide 18 text

System Approach
● Leverage computing resources to speed up the training process
  ○ Parallelize the computations
  ○ Reduce computation overhead

Slide 19

Slide 19 text

Generic Deep Learning System Architecture

Slide 20

Slide 20 text

Performance Bottlenecks
● Computation
  ○ Gradient computation inside GPU
  ○ Data preparation (I/O, preprocessing)
● Synchronization of model parameters
  ○ GPU to GPU
  ○ Parameter Server

Slide 21

Slide 21 text

Dual Impacts of Mini-batch Size

Slide 22

Slide 22 text

Mini-batch size
● Batch size and network decide GPU memory usage
● Trade-off between speed and time
● Batch size selection can be formulated as an optimization problem*
* Distributed Training Large-Scale Deep Architectures, Mina Zou, et al., HTC Research, ADMA 2017.
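The first bullet can be illustrated with a rough memory model. This is a back-of-the-envelope sketch, not the formulation from the cited paper: it assumes memory is dominated by the weights, one gradient per weight, and the activations kept for backpropagation, and it ignores optimizer state and framework workspace. The function names and the example numbers are hypothetical.

```python
def estimate_training_memory_bytes(num_params, activations_per_sample,
                                   batch_size, bytes_per_value=4):
    # Weights plus one gradient value per weight (optimizer state is ignored).
    model_bytes = 2 * num_params * bytes_per_value
    # Activations stored for backpropagation grow linearly with the batch size.
    activation_bytes = batch_size * activations_per_sample * bytes_per_value
    return model_bytes + activation_bytes


def largest_batch_that_fits(num_params, activations_per_sample,
                            memory_budget_bytes, bytes_per_value=4):
    # Invert the estimate above: how many samples fit in the remaining memory?
    model_bytes = 2 * num_params * bytes_per_value
    free_bytes = memory_budget_bytes - model_bytes
    if free_bytes <= 0:
        return 0
    return free_bytes // (activations_per_sample * bytes_per_value)


if __name__ == "__main__":
    # Hypothetical network: 60M parameters, 2M activation values per sample,
    # trained in float32 on a GPU with 4 GiB of memory.
    budget = 4 * 1024 ** 3
    print(largest_batch_that_fits(60_000_000, 2_000_000, budget))
```

Under these assumptions the network fixes the per-sample cost and the batch size multiplies it, so a memory budget translates directly into an upper bound on the batch size.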

Slide 23

Slide 23 text

Tuning for Performance
● Adjust batch size
  ○ Smaller batch size with similar convergence quality
● Fine-tune your network
  ○ Larger stride for convolution
  ○ Smaller fully connected layer
  ○ …
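A short calculation shows why a larger convolution stride and a smaller fully connected layer reduce cost. The sketch below uses the standard output-size formula for a convolution; the layer sizes (224-pixel input, 3x3 kernel, 64 channels, 4096 outputs) are assumptions picked for illustration.

```python
def conv_output_size(input_size, kernel_size, stride, padding=0):
    # Standard formula for the spatial output size of a convolution.
    return (input_size + 2 * padding - kernel_size) // stride + 1


def fc_param_count(in_features, out_features):
    # Weight matrix plus one bias per output unit.
    return in_features * out_features + out_features


if __name__ == "__main__":
    channels = 64
    for stride in (1, 2):
        side = conv_output_size(224, kernel_size=3, stride=stride, padding=1)
        features = channels * side * side
        params = fc_param_count(features, 4096)
        print(f"stride={stride}: {side}x{side} map, "
              f"{features} features, {params} FC parameters")
```

Doubling the stride halves each spatial dimension, so the feature map and the fully connected layer that follows it shrink by roughly a factor of four. Whether accuracy holds up after such a change has to be checked by training.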

Slide 24

Slide 24 text

I/O Handling
● Data/Computation pipelining
  ○ Number of data threads
  ○ Disk I/O is expensive
[Figure panels: Not good, Good]
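Data/computation pipelining can be sketched with background loader threads that fill a bounded buffer while the training loop consumes from it. This is a minimal standard-library illustration, not the platform's loader: `prefetch`, `load_fn`, `num_threads`, and `buffer_size` are hypothetical names, and batches may arrive out of order when more than one thread is used.

```python
import queue
import threading

_DONE = object()  # marker a loader thread puts on the queue when it finishes


def prefetch(items, load_fn, num_threads=2, buffer_size=4):
    """Yield load_fn(item) for every item while threads keep loading ahead."""
    todo = queue.Queue()
    for item in items:
        todo.put(item)
    ready = queue.Queue(maxsize=buffer_size)  # bounded, so loaders cannot run far ahead

    def loader():
        while True:
            try:
                item = todo.get_nowait()
            except queue.Empty:
                break
            ready.put(load_fn(item))  # blocks while the buffer is full
        ready.put(_DONE)

    threads = [threading.Thread(target=loader, daemon=True)
               for _ in range(num_threads)]
    for thread in threads:
        thread.start()

    finished = 0
    while finished < num_threads:
        batch = ready.get()
        if batch is _DONE:
            finished += 1
        else:
            yield batch


if __name__ == "__main__":
    # load_fn stands in for disk reads and preprocessing.
    for batch in prefetch(range(5), lambda i: [i] * 3, num_threads=2):
        print(batch)
```

The number of loader threads corresponds to the "number of data threads" knob on the slide. The goal is to have the next batch ready by the time the GPU finishes the current one, so that disk I/O is hidden behind computation.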

Slide 25

Slide 25 text

Scale the Training
● Multi-GPU training
  ○ Demands faster data provision and peer-to-peer data transfer
  ○ Peer-to-peer parameter synchronization
● Distributed training
  ○ Increase computation time per update (e.g. with a larger batch size) to hide transmission overhead and reduce the number of updates
  ○ Network capacity of the parameter server could be the bottleneck
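The synchronization pattern behind distributed training can be sketched as synchronous data parallelism: each worker computes a gradient on its own shard and a parameter server averages the results before updating the model. The sketch below runs the workers sequentially in one process and uses a one-parameter linear model; the function names and the data are assumptions for illustration only.

```python
def gradient(w, batch):
    # Mean gradient of 0.5 * (w * x - y)^2 over the batch for a 1-D linear model.
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def shard(data, num_workers):
    # Split the data into equal contiguous shards, one per worker.
    size = len(data) // num_workers
    return [data[i * size:(i + 1) * size] for i in range(num_workers)]


def synchronous_step(w, data, num_workers, lr):
    # Every worker computes a gradient on its shard (done sequentially here);
    # the parameter server averages them and applies a single update.
    worker_grads = [gradient(w, part) for part in shard(data, num_workers)]
    avg_grad = sum(worker_grads) / num_workers
    return w - lr * avg_grad


if __name__ == "__main__":
    data = [(float(x), 2.0 * x) for x in range(1, 9)]  # samples of y = 2x
    w = 0.0
    for _ in range(100):
        w = synchronous_step(w, data, num_workers=4, lr=0.01)
    print(w)
```

With equal shards the averaged gradient matches the gradient of the combined batch, so adding workers behaves like enlarging the batch. Each update still costs one round of communication, which is why the slide suggests doing more computation per update.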

Slide 26

Slide 26 text

DeepQ AI Platform

Slide 27

Slide 27 text

Training a Model From Scratch
● Hyper-parameter tuning
● Data preprocessing
● Performance tuning
● Model optimization
● Task management
● Configuration
● Software update
● Resource management
● Operations (Ops)

Slide 28

Slide 28 text

DeepQ Open AI Platform
[Layer diagram]
● Infrastructure: GPU, CPU, Storage, Network
● Container: Docker
● Services: Web Admin, Task Manager, Resource Manager, Service Discovery
● Computation Framework: CNN, Reinforcement Learning, LSTM, RNN, …

Slide 29

Slide 29 text


Slide 30

Slide 30 text

DeepQ Open AI Platform Features
● On-demand resource allocation
● Support multiple deep learning frameworks
● Configuration advisor
● Status monitoring
● Task management
● Scalable training infrastructure

Slide 31

Slide 31 text

Workflow
[Diagram components: User, Frontend, Task Queue, Resource Manager, Worker, Storage, Monitor, Alert, Logger, Platform Config, User code]
1: Request
2: Submit Tasks
3: Get Tasks
4: Assign Tasks
5: Do Tasks
6: Notify when any task is done
7: Get Result from Storage
8: Response
[Other edge labels: Logging, Check/Report Status, Get Data / Save Result]
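The numbered steps of this workflow can be sketched as a small in-process task queue. This is only an illustration of the request / submit / assign / notify / fetch sequence: the real platform uses separate services for these roles, and `run_platform` with its task tuple format is a hypothetical stand-in.

```python
import queue
import threading


def run_platform(tasks, num_workers=2):
    """Run (task_id, fn, arg) tasks on worker threads and return their results."""
    task_queue = queue.Queue()      # step 2: the frontend submits tasks here
    done = queue.Queue()            # step 6: workers notify when a task is done
    storage = {}                    # stands in for the storage service
    storage_lock = threading.Lock()

    def worker():
        while True:
            task = task_queue.get()            # steps 3-4: get an assigned task
            if task is None:                   # shutdown marker
                break
            task_id, fn, arg = task
            result = fn(arg)                   # step 5: do the task
            with storage_lock:
                storage[task_id] = result      # save the result to storage
            done.put(task_id)                  # step 6: notify

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for task in tasks:
        task_queue.put(task)
    for _ in threads:
        task_queue.put(None)

    finished = [done.get() for _ in tasks]     # wait for every notification
    for thread in threads:
        thread.join()
    # steps 7-8: fetch results from storage and build the response
    return {task_id: storage[task_id] for task_id in finished}


if __name__ == "__main__":
    jobs = [("train-a", lambda lr: lr * 10, 0.1),
            ("train-b", lambda lr: lr * 10, 0.5)]
    print(run_platform(jobs))
```

Decoupling submission from execution through a queue lets the frontend return quickly while the number of workers is scaled independently of the number of users.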

Slide 32

Slide 32 text

Service stacks of DeepQ AI Platform
● Web frontend (NodeJS)
● Worker
  ○ Smart mode model trainer (Python)
  ○ Job dispatcher (Golang)
● Service components (Golang)
  ○ Task queue (based on disque)
  ○ Resource manager
● Database: Firebase
● Storage: GCS, AWS S3
● Service infrastructure:
  ○ Consul (Service Discovery)
  ○ Redis

Slide 33

Slide 33 text

Configuration Management

Slide 34

Slide 34 text

Ref: http://www.slideshare.net/arthurlutz/debian-meetup-nantes-2015-salt-pour-grer-de-nombreux-serveurs-debian

Slide 35

Slide 35 text

Salt-Cloud
● Provision systems on cloud providers and hypervisors

Slide 36

Slide 36 text

Salt-Cloud (AWS)

Provider configuration:

    ec2-us-west-2-public:
      minion:
        master: ip-172-31-30-32
      id: 'AWS id'
      key: 'AWS key+iIP21RaHNBq1DOMaQMkOAgF'
      private_key: /etc/salt/secret
      keyname: csiuser-dl-oregon
      ssh_interface: public_ips
      securitygroup: security
      location: us-west-2
      iam_profile: arn:iam_role
      driver: ec2
      del_root_vol_on_destroy: True
      del_all_vols_on_destroy: True
      rename_on_destroy: True

Profile:

    gpu:
      image: ami-d732f0b7
      size: g2.2xlarge
      location: us-west-2
      network: default
      grains:
        role: gpu
      tags: {'Environment': 'dev'}
      del_root_vol_on_destroy: True
      block_device_mappings:
        - DeviceName: /dev/sda1
          Ebs.VolumeSize: 120
          Ebs.VolumeType: gp2
      del_all_vols_on_destroy: True
      ssh_username: ubuntu
      make_master: False
      sync_after_install: grains
      provider: ec2-us-west-2-public

Commands:

    # Create the instance gpuwork1 from the gpu profile
    % salt-cloud -p gpu gpuwork1
    # Destroy the instance gpuwork1
    % salt-cloud -d gpuwork1

Slide 37

Slide 37 text

Service Discovery

Slide 38

Slide 38 text

Monitoring

Slide 39

Slide 39 text

Selecting a Computation Framework
● Easy to present your thoughts
  ○ Programming models
  ○ Support of new optimization techniques and operators
● Performance
● Scalability
● Popularity
  ○ Community support
● Third-party support
  ○ Visualization library
● Deployment
  ○ Multi-platform support

Slide 40

Slide 40 text

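GitHub statistics
Ref: "Messy docs, hard to debug... TensorFlow has so many flaws, so why do we still treat it like our first love?" (文档乱、调试难…TensorFlow有那么多缺点，但为何我们依然待它如初恋？)
https://www.leiphone.com/news/201709/3T4pwc5UBLtRuKvx.html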

Slide 41

Slide 41 text

Deep Learning Frameworks on arXiv
Ref: https://medium.com/@karpathy/a-peek-at-trends-in-machine-learning-ab8a1085a106

Slide 42

Slide 42 text

AWS, Azure or GCP

Slide 43

Slide 43 text

AWS
• Best performance among current providers
• The most mature GPU instances
• The GPU sometimes disappears due to hardware/software/firmware issues
• If you are doing research, AWS may be a good choice, because it has the best GPU performance

Slide 44

Slide 44 text

Azure
• Lowest price on the market
• CNTK and Cognitive Services integration
• Unfriendly (Linux) user interface
• If you are doing research, Azure may be the best choice, because it gives lots of credits

Slide 45

Slide 45 text

GCP
• Flexible instance configuration
• May support TPU in the future
• GPU performance is not optimized (yet!)
• In our experience, the GPU does not disappear
• If you are creating a product, GCP may be the best choice, because it has a faster development cycle and more stable instances

Slide 46

Slide 46 text

Brief Comparison

Service Provider   Azure          AWS             GCP
Price              Lowest         Highest         Medium
Performance        Third          Best            Second
Stability*         Third          Second          Best
Maturity           Third          Best            Second
Flexibility        Same as AWS    Same as Azure   Best

Slide 47

Slide 47 text

Thank you!