
深度學習環境建置與模型訓練實務 (Deep Learning Environment Setup and Model Training in Practice)

Taiwan AI Annual Meeting (台灣人工智慧年會), Day 1, 11/9 (Thu) 15:25-16:10, Room A1

Cheng-Lung Sung

November 09, 2017


Transcript

  1. About Me
     • 2001 - 2010: IASL, IIS, SINICA
     • 2010 - 2012: MAGICLabs, HTC
     • 2012 - 2014: Cloudioh (start-up)
     • 2014 - : Studio Engineering, HTC; Healthcare, HTC
     (Interest tag cloud on the slide: Neural Networks, AI, Cloud Computing, Big Data, NLP, Deep Learning)
  2. Cloud Service Infrastructure of Tricorder
     • API Frontend (Golang)
       ◦ Receive image
       ◦ Analysis
       ◦ Data sync upon notification of result
     • Data model stored in AWS S3
     • Database: AWS EMR (HBase)
     • Service Infras:
       ◦ ZooKeeper (Service Discovery)
       ◦ NATS Messaging
       ◦ Redis
  3. Cloud Service Infrastructure of CDC1922 LINE Bot
     • Webhook Endpoint Service (Golang)
       ◦ Flu vaccine FAQ
       ◦ Directory of publicly funded and self-paid vaccination clinics
       ◦ Side-effect tracking
     • Database: GCP Cloud SQL (MySQL)
     • Service Infras:
       ◦ Etcd (Service Discovery)
       ◦ GKE (container deployment, scaling, and management)
       ◦ Redis
  4. Our Deep Learning Teams
     (Slide diagram: successful deep learning applications — VR / AR / Healthcare — rest on three pillars: Deep Learning Application (specific knowledge; adopt & integrate), Deep Learning Algorithm (new learning techniques; hard to learn), and Deep Learning Infrastructure (parallel + multiple GPUs; computationally intensive), all built on deeper models and big data.)
     https://research.htc.com/
  5. Challenges on Training Models
     • Algorithmic aspect
       – Finding a good set of parameters
       – Data augmentation / Initialization / Pre-processing / …
     • System aspect
       – Managing lots of computing resources
       – Performance tuning
  6. The Importance of Model Training
     • Scale of data and scale of computation infrastructure together enable the current deep learning renaissance
     • Fast model training is the key to the success of deep learning algorithm development
     • Efficient training demands both algorithmic improvement and careful system configuration
  7. Algorithmic Approach
     • Optimizer (Gradient Descent / Adam / Adagrad / …)
     • Adjust hyper-parameters of gradient-based training
     • Faster kernel implementations
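As a toy illustration of the optimizer choices listed above, here is a minimal pure-Python sketch of one plain gradient-descent step next to one Adam step. The function names, the hyper-parameter defaults, and the quadratic test objective are illustrative assumptions, not from the talk:

```python
import math

def sgd_step(w, grad, lr=0.1):
    # Plain gradient descent: move each weight against its gradient.
    return [wi - lr * gi for wi, gi in zip(w, grad)]

def adam_step(w, grad, state, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # Adam keeps running first/second moments of the gradient and
    # rescales each coordinate's step size individually.
    state["t"] += 1
    t = state["t"]
    new_w = []
    for i, (wi, gi) in enumerate(zip(w, grad)):
        state["m"][i] = b1 * state["m"][i] + (1 - b1) * gi
        state["v"][i] = b2 * state["v"][i] + (1 - b2) * gi * gi
        m_hat = state["m"][i] / (1 - b1 ** t)   # bias correction
        v_hat = state["v"][i] / (1 - b2 ** t)
        new_w.append(wi - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_w

# Minimize f(w) = w0^2 + w1^2; the gradient is (2*w0, 2*w1).
w = [1.0, -1.0]
for _ in range(100):
    w = sgd_step(w, [2 * w[0], 2 * w[1]])
print(w)  # both coordinates close to 0
```

Real frameworks implement these updates as fused GPU kernels; the "faster kernel implementations" bullet refers to exactly that layer.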
  8. System Approach
     • Leverage computing resources to speed up the training process
       ◦ Parallelize the computations
       ◦ Reduce computation overhead
  9. Performance Bottlenecks
     • Computation
       ◦ Gradient computation inside GPU
       ◦ Data preparation (I/O, preprocessing)
     • Synchronization of model parameters
       ◦ GPU to GPU
       ◦ Parameter Server
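The usual fix for the data-preparation bottleneck is to overlap I/O and preprocessing with the gradient computation. A toy producer/consumer sketch of that overlap, with a sleep standing in for disk reads and a doubling standing in for preprocessing (all names and numbers are mine, not from the talk):

```python
import queue
import threading
import time

def producer(batches, out_q):
    # Data preparation (I/O + preprocessing) runs in a background
    # thread, so the consumer (the "GPU") rarely waits for it.
    for b in batches:
        time.sleep(0.001)                  # stand-in for disk read / decode
        out_q.put([x * 2 for x in b])      # stand-in for preprocessing
    out_q.put(None)                        # end-of-stream marker

def train(batches):
    q = queue.Queue(maxsize=4)             # bounded prefetch buffer
    threading.Thread(target=producer, args=(batches, q), daemon=True).start()
    steps = 0
    while True:
        batch = q.get()
        if batch is None:
            break
        steps += 1                         # stand-in for one gradient step
    return steps

print(train([[1, 2], [3, 4], [5, 6]]))     # 3
```

The bounded queue is the design point: it lets preparation run ahead of training by a few batches without letting memory grow without limit.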
  10. Mini-batch Size
     • Batch size and network architecture determine GPU memory usage
     • Trade-off between per-step throughput and time to convergence
     • Batch size selection can be formulated as an optimization problem*
     * Distributed Training Large-Scale Deep Architectures, Zou et al., HTC Research, ADMA 2017.
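A deliberately crude back-of-envelope model of the first bullet: parameters (and their gradients) cost a fixed amount of GPU memory, while activations grow linearly with the batch size. This simplified accounting, and every function and number in it, is my own illustration rather than the paper's formulation:

```python
def batch_memory_bytes(batch_size, activation_floats_per_sample,
                       param_count, bytes_per_float=4):
    # Crude model: weights + gradients are paid once; activations
    # scale linearly with the batch size.
    params = 2 * param_count * bytes_per_float
    acts = batch_size * activation_floats_per_sample * bytes_per_float
    return params + acts

def largest_batch(mem_budget, activation_floats_per_sample, param_count):
    # Largest batch size whose estimate still fits in the budget.
    b = 1
    while batch_memory_bytes(b + 1, activation_floats_per_sample,
                             param_count) <= mem_budget:
        b += 1
    return b

# Tiny made-up network: 1000 parameters, 1000 activation floats/sample,
# and a 20 KB "GPU" budget.
print(largest_batch(20_000, 1000, 1000))  # 3
```

Real memory usage also depends on optimizer state, workspace buffers, and framework caching, which is why the cited paper treats batch-size selection as an optimization problem rather than a closed-form calculation.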
  11. Tuning for Performance
     • Adjust batch size
       ◦ Smaller batch size with similar convergence quality
     • Fine-tune your network
       ◦ Larger stride for convolution
       ◦ Smaller fully connected layer
       ◦ …
  12. Scale the Training
     • Multi-GPU training
       ◦ Demands faster data provisioning
       ◦ Peer-to-peer data transfer
       ◦ Peer-to-peer parameter synchronization
     • Distributed training
       ◦ Increase computation per update (e.g. larger batch size) to hide transmission costs and reduce the number of updates
       ◦ Network capacity of the parameter server could be the bottleneck
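The synchronous parameter-server pattern mentioned above can be sketched in a few lines of plain Python. The server averages the gradients reported by all workers and applies one update to the global weights; every name and number here is illustrative, not from the platform:

```python
def parameter_server_round(global_w, worker_grads, lr=0.1):
    # Synchronous update: average the workers' gradients, then
    # take one gradient step on the global weights.
    n = len(worker_grads)
    avg = [sum(g[i] for g in worker_grads) / n
           for i in range(len(global_w))]
    return [w - lr * g for w, g in zip(global_w, avg)]

w = [1.0, 1.0]
grads = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # three workers' gradients
w = parameter_server_round(w, grads)
print(w)  # [0.9, 0.9]
```

In a real deployment every worker ships its full gradient to the server each round, which is why the server's network capacity can become the bottleneck as workers are added.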
  13. Training a Model From Scratch
     • Hyper-parameter tuning
     • Data preprocessing
     • Performance tuning
     • Model optimization
     • Task management
     • Configuration
     • Software update
     • Resource management
     • Operations (Ops)
  14. DeepQ Open AI Platform
     (Slide diagram — platform layers, bottom to top:)
     • Infrastructure: GPU, CPU, Storage, Network
     • Container: Docker
     • Computation Framework: CNN, RNN, LSTM, Reinforcement Learning, …
     • Services: Web Admin, Task Manager, Resource Manager, Service Discovery
  15. DeepQ Open AI Platform Features
     • On-demand resource allocation
     • Support for multiple deep learning frameworks
     • Configuration advisor
     • Status monitoring
     • Task management
     • Scalable training infrastructure
  16. Workflow
     (Slide diagram — components: Frontend, Task Queue, Resource Manager, Worker, Storage, Monitor, Alert, Logger, Platform Config, User code)
     1: User sends a request to the Frontend
     2: Frontend submits tasks to the Task Queue
     3: Resource Manager gets tasks from the queue
     4: Resource Manager assigns tasks to a Worker
     5: Worker does the tasks (getting data from and saving results to Storage, checking/reporting status, logging)
     6: Frontend is notified when any task is done
     7: Frontend gets the result from Storage
     8: Frontend sends the response back to the user
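The numbered steps above can be mirrored by a toy in-process pipeline: an in-memory queue plays the Task Queue and a dict plays Storage. This is a sketch of the control flow only, with invented task contents; the real platform runs these roles as separate services (disque-backed queue, GCS/S3 storage):

```python
import queue

def run_workflow(user_tasks):
    # Steps follow the slide numbering:
    # 2: submit tasks, 3/4: get + assign, 5: do task and save result,
    # 6: note completion, 7/8: fetch results and respond.
    task_q = queue.Queue()
    storage = {}
    for t in user_tasks:                       # 2: submit tasks
        task_q.put(t)
    done = []
    while not task_q.empty():
        name, payload = task_q.get()           # 3/4: get + assign a task
        storage[name] = payload * payload      # 5: do task, save to storage
        done.append(name)                      # 6: task-done notification
    return {name: storage[name] for name in done}  # 7/8: fetch + respond

print(run_workflow([("a", 2), ("b", 3)]))      # {'a': 4, 'b': 9}
```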
  17. Service Stacks of DeepQ AI Platform
     • Web frontend (NodeJS)
     • Worker
       ◦ Smart mode model trainer (Python)
       ◦ Job dispatcher (Golang)
     • Service components (Golang)
       ◦ Task queue (based on disque)
       ◦ Resource manager
     • Database: Firebase
     • Storage: GCS, AWS S3
     • Service Infras:
       ◦ Consul (Service Discovery)
       ◦ Redis
  18. Salt-Cloud (AWS)

     Provider configuration:

       ec2-us-west-2-public:
         minion:
           master: ip-172-31-30-32
         id: 'AWS id'
         key: 'AWS key+iIP21RaHNBq1DOMaQMkOAgF'
         private_key: /etc/salt/secret
         keyname: csiuser-dl-oregon
         ssh_interface: public_ips
         securitygroup: security
         location: us-west-2
         iam_profile: arn:iam_role
         driver: ec2
         del_root_vol_on_destroy: True
         del_all_vols_on_destroy: True
         rename_on_destroy: True

     Profile (GPU instance):

       gpu:
         provider: ec2-us-west-2-public
         image: ami-d732f0b7
         size: g2.2xlarge
         location: us-west-2
         network: default
         grains:
           role: gpu
         tags: {'Environment': 'dev'}
         ssh_username: ubuntu
         make_master: False
         sync_after_install: grains
         del_root_vol_on_destroy: True
         del_all_vols_on_destroy: True
         block_device_mappings:
           - DeviceName: /dev/sda1
             Ebs.VolumeSize: 120
             Ebs.VolumeType: gp2

     Spin up / tear down a GPU worker:

       % salt-cloud -p gpu gpuwork1
       % salt-cloud -d gpuwork1
  19. Selecting a Computation Framework
     • Easy to express your ideas
       ◦ Programming models
       ◦ Support for new optimization techniques and operators
     • Performance
     • Scalability
     • Popularity
       ◦ Community support
     • Third-party support
       ◦ Visualization libraries
     • Deployment
       ◦ Multi-platform support
  20. AWS
     • Best performance among current providers
     • The most mature GPU instances
     • GPUs sometimes disappear due to hw/sw/fw issues
     • If you are doing research, AWS may be a good choice, because it has the best GPU performance
  21. Azure
     • Lowest price on the market
     • CNTK and Cognitive Services integration
     • Unfriendly (Linux) user interface
     • If you are doing research, Azure may be the best choice, because it gives lots of credits
  22. GCP
     • Flexible instance configuration
     • May support TPUs in the future
     • GPU performance is not optimized (yet!)
     • GPUs don't disappear, in our experience
     • If you are building a product, GCP may be the best choice, because it has a faster development cycle and more stable instances
  23. Brief Comparison

     Service Provider | Azure       | AWS           | GCP
     -----------------+-------------+---------------+--------
     Price            | Lowest      | Highest       | Medium
     Performance      | Third       | Best          | Second
     Stability*       | Third       | Second        | Best
     Maturity         | Third       | Best          | Second
     Flexibility      | Same as AWS | Same as Azure | Best