深度學習環境建置與模型訓練實務

2017 台灣人工智慧年會深度學習環境建置與模型訓練實務宋政隆 ([email protected])

2017 台灣人工智慧年會 About Me • 2001 - 2010 ◦ IASL,
IIS, SINICA • 2010 - 2012 ◦ MAGICLabs, HTC • 2012 - 2014 ◦ Cloudioh (start-up) • 2014 - ◦ Studio Engineering, HTC ◦ Healthcare, HTC Neural Network Network & AI Cloud Computing Big Data Neural Network AI & everything... NLP Deep

2017 台灣人工智慧年會 Tricorder Ref: https://www.htc.com/tw/about/newsroom/2017/2017-04-13/

2017 台灣人工智慧年會 Cloud Service Infrastructure of Tricorder • API Frontend
(Golang) ◦ Receive image ◦ Analysis ◦ Data sync upon notification of result • Data model stored in AWS S3 • Database: AWS MR (HBase) • Service Infras: ◦ ZooKeeper (Service Discovery) ◦ NATS Messaging ◦ Redis

2017 台灣人工智慧年會 CDC1922 Ref: https://deepq.com/article/CDC1922Bot

2017 台灣人工智慧年會 Cloud Service Infrastructure of CDC1922 LINE Bot •
Webhook Endpoint Service (Golang) ◦ 流感疫苗常見問答 ◦ 公自費診所指引 ◦ 副作用追踪 • Database: GCP Cloud SQL (MySQL) • Service Infras: ◦ Etcd (Service Discovery) ◦ GKE (container deployment, scaling, and management) ◦ Redis

2017 台灣人工智慧年會 ai.deepq.com

2017 台灣人工智慧年會

2017 台灣人工智慧年會 Cloud Computing to help Deep Learning

2017 台灣人工智慧年會 What Makes Deep Learning Success Data Computing Algorithm

2017 台灣人工智慧年會 Our Deep Learning Teams Deeper Models Big Data
Deep Learning Application VR/ AR/ Healthcare Successful Deep Learning Application Hard to learn Computationally intensive Deep Learning Algorithm New Learning Techniques Deep Learning Infrastructure Parallel + multiple GPUs Specific Knowledge Adopt & Integrate https://research.htc.com/

2017 台灣人工智慧年會 Deep Learning Workflow Get Data Process Data Training
Network Design Parameters Evaluate Deployment

2017 台灣人工智慧年會 Challenges on Training Models • Algorithmic aspect –
Finding a good set of parameters – Data augmentation / Initialization / Pre-processing / … • System aspect – Managing lots of computing resources – Performance tuning

2017 台灣人工智慧年會 What Makes Deep Learning Engineering Hard Cheap Quality
Fast

2017 台灣人工智慧年會 • Scale of data and scale of computation
infrastructures together enable the current deep learning renaissance • Fast model training is the key to the success of deep learning algorithm development • Efficient training demands both algorithmic improvement and careful system configuration The Importance of Model Training

2017 台灣人工智慧年會 • Optimizer (Gradient Descent / Adam / Adagrad
/ …) • Adjust hyper-parameters of gradient-based training • Faster kernel implementations Algorithmic Approach

2017 台灣人工智慧年會 • Leverage computing resources to speed up the
training process ◦ Parallelize the computations ◦ Reduce computation overhead System Approach

2017 台灣人工智慧年會 Generic Deep Learning System Architecture

2017 台灣人工智慧年會 Performance Bottlenecks • Computation ◦ Gradient computation inside
GPU ◦ Data preparation (I/O, preprocessing) • Synchronization of model parameters ◦ GPU to GPU ◦ Parameter Server

2017 台灣人工智慧年會 Dual Impacts of Mini-batch Size

2017 台灣人工智慧年會 Mini-batch size • Batch size and network decide
GPU memory usage • Trade-off between speed and time • Batch size selection can be formulated as an optimization problem* * Distributed Training Large-Scale Deep Architectures, Mina Zou, et al., HTC Research, ADMA 2017.

2017 台灣人工智慧年會 • Adjust batch size ◦ Smaller batch size
with similar convergence quality • Fine-tune your network ◦ Larger stride for convolution ◦ Smaller fully connected layer ◦ … Tuning for Performance

2017 台灣人工智慧年會 • Data/Computation pipelining ◦ Number of data threads
◦ Disk I/O is expensive I/O Handling Not good Good

2017 台灣人工智慧年會 • Multi-GPU training ◦ demands for faster data
provision peer to peer data transfer ◦ Peer to peer parameter synchronization • Distributed training ◦ Increase computation time (e.g. larger batch size) to hide transmission efforts and reduce # of updates ◦ Network capacity of parameter server could be the bottleneck Scale the Training

2017 台灣人工智慧年會 DeepQ AI Platform

2017 台灣人工智慧年會 Training a Model From Scratch • Hyper parameter
tuning • Data preprocessing • Performance tuning • Model optimization • Task management • Configuration • Software update • Resource management • Operations (Ops)

2017 台灣人工智慧年會 DeepQ Open AI Platform Infrastructure Container Computation Framework
Services GPU CPU Storage Network Docker Web Admin Task Manager Resource Manager Service Discovery CNN Reinforcement Learning LSTM … RNN

2017 台灣人工智慧年會

2017 台灣人工智慧年會 DeepQ Open AI Platform Features • On-demand resource
allocation • Support multiple deep learning frameworks • Configuration advisor • Status monitoring • Task management • Scalable training infrastructure

2017 台灣人工智慧年會 Monitor Frontend Alert Logger Storage User Task Queue
Resource Manager Worker Platform Config User code 1:Request 8:Response 2:Submit Tasks 3:Get Tasks 6:Notify when any task is done 4:Assign Tasks 7:Get Result from Storage Logging 5:Do Tasks Check/Report Status Get Data / Save Result Workflow

2017 台灣人工智慧年會 Service stacks of DeepQ AI Platform • Web
frontend (NodeJS) • Worker ◦ Smart mode model trainer (Python) ◦ Job dispatcher (Golang) • Service components (Golang) ◦ Task queue (based on disque) ◦ Resource manager • Database: Firebase • Storage: GCS, AWS S3 • Service Infras: ◦ Consul (Service Discovery) ◦ Redis

2017 台灣人工智慧年會 Configuration Management

2017 台灣人工智慧年會 Ref: http://www.slidesshre.net/arthurlutz/debian-meetup-nantes-2015-salt-pour-grer-de-nombreux-serveurs-debian

2017 台灣人工智慧年會 Salt-Cloud • Provision systems on cloud providers, hypervisors

2017 台灣人工智慧年會 Salt-Cloud (AWS) ec2-us-west-2-public: minion: master: ip-172-31-30-32 id: 'AWS
id' key: 'AWS key+iIP21RaHNBq1DOMaQMkOAgF' private_key: /etc/salt/secret keyname: csiuser-dl-oregon ssh_interface: public_ips securitygroup: security location: us-west-2 iam_profile: arn:iam_role driver: ec2 del_root_vol_on_destroy: True del_all_vols_on_destroy: True rename_on_destroy: True % salt-cloud -p gpu gpuwork1 % salt-cloud -d gpuwork1 gpu: image: ami-d732f0b7 size: g2.2xlarge location: us-west-2 network: default grains: role: gpu tags: {'Environment', 'dev'} del_root_vol_on_destroy: True block_device_mappings: - DeviceName: /dev/sda1 Ebs.VolumeSize: 120 Ebs.VolumeType: gp2 del_all_vol_on_destroy: True ssh_username: ubuntu make_master: False sync_after_install: grains provider: ec2-us-west-2-public

2017 台灣人工智慧年會 Service Discovery

2017 台灣人工智慧年會 Monitoring

2017 台灣人工智慧年會 • Easy to present your thoughts ◦ programming
models ◦ support of new optimization techniques and operators • Performance • Scalability • Popularity ◦ community support • Third party support ◦ visualization library • Deployment ◦ multi-platform support Selecting a Computation Frameworks

2017 台灣人工智慧年會文档乱、调试难…TensorFlow有那么多缺点，但为何我们依然待它如初恋？ https://www.leiphone.com/news/201709/3T4pwc5UBLtRuKvx.html Github statistics

2017 台灣人工智慧年會 Deep Learning Framework on arxiv https://medium.com/@karpathy/a-peek-at-trends-in-machine-learning-ab8a1085a106

2017 台灣人工智慧年會 AWS, Azure or GCP

2017 台灣人工智慧年會 AWS • Best performance among current providers •
The most mature GPU instance • The GPU disappears due to hw/sw/fw issues sometimes • If you are doing research, AWS may be the good choice, because it has best GPU performance

2017 台灣人工智慧年會 Azure • Lowest price in market • CNTK
and Cognitive service integration • Unfriendly (Linux) user interface • If you are doing research, Azure may be the best choice, because it gives lots of credits

2017 台灣人工智慧年會 GCP • Flexible instance configuration • May support
TPU in the future • GPU performance is not optimized (yet!) • The GPU won't disappear based on our experiences • If you are creating a product, GCP may be best choice, because it has faster development cycle and stabler instance

2017 台灣人工智慧年會 Brief Comparison Service Provider Azure AWS GCP Price
Lowest Highest Medium Performance Third Best Second Stability* Third Second Best Maturity Third Best Second Flexibility Same as AWS Same as Azure Best

2017 台灣人工智慧年會 Thank you!

深度學習環境建置與模型訓練實務

深度學習環境建置與模型訓練實務

More Decks by Cheng-Lung Sung

Other Decks in Technology

Featured

Transcript