深度學習環境建置與模型訓練實務 (Deep Learning Environment Setup and Model Training in Practice)

台灣人工智慧年會 (Taiwan AI Annual Conference), Day 1, 11/9 (Thu) 15:25-16:10, A1

Cheng-Lung Sung

November 09, 2017

Transcript

  1. 2017 台灣人工智慧年會 — 深度學習環境建置與模型訓練實務 (Deep Learning Environment Setup and Model Training in Practice) — 宋政隆 (Cheng-Lung Sung, cl_sung@htc.com)

  2. About Me
     • 2001–2010: IASL, IIS, SINICA
     • 2010–2012: MAGICLabs, HTC
     • 2012–2014: Cloudioh (start-up)
     • 2014– : Studio Engineering, HTC; Healthcare, HTC
     (Topic word cloud: Neural Networks, AI, Cloud Computing, Big Data, NLP, Deep Learning)
  3. Tricorder — Ref: https://www.htc.com/tw/about/newsroom/2017/2017-04-13/

  4. Cloud Service Infrastructure of Tricorder
     • API Frontend (Golang)
       ◦ Receive image
       ◦ Analysis
       ◦ Data sync upon notification of result
     • Data model stored in AWS S3
     • Database: AWS EMR (HBase)
     • Service Infras: ZooKeeper (Service Discovery), NATS Messaging, Redis
  5. CDC1922 — Ref: https://deepq.com/article/CDC1922Bot

  6. Cloud Service Infrastructure of CDC1922 LINE Bot
     • Webhook Endpoint Service (Golang)
       ◦ Flu vaccine FAQ
       ◦ Guide to publicly funded and self-paid clinics
       ◦ Side-effect tracking
     • Database: GCP Cloud SQL (MySQL)
     • Service Infras: Etcd (Service Discovery), GKE (container deployment, scaling, and management), Redis
  7. ai.deepq.com

  8. (image slide)

  9. (image slide)

  10. Cloud Computing to Help Deep Learning

  11. What Makes Deep Learning Successful: Data, Computing, Algorithm

  12. Our Deep Learning Teams (https://research.htc.com/)
      • Deeper models (hard to learn) → Deep Learning Algorithm: new learning techniques
      • Big data (computationally intensive) → Deep Learning Infrastructure: parallel + multiple GPUs
      • Deep learning applications — VR / AR / Healthcare (specific knowledge) → Adopt & Integrate
      Together these yield a successful deep learning application.
  13. Deep Learning Workflow
      Get Data → Process Data → Training (Network Design / Parameters) → Evaluate → Deployment
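The workflow stages above can be sketched end to end in miniature. This is a toy illustration, not code from the talk: all function names are invented, and the "model" is a one-parameter linear fit trained with plain gradient descent.

```python
def get_data():
    # toy dataset: (x, y) pairs following y = 2x
    return [(x, 2 * x) for x in range(10)]

def process_data(data):
    # preprocessing step: normalize inputs to [0, 1]
    xmax = max(x for x, _ in data)
    return [(x / xmax, y) for x, y in data]

def train(data, lr=0.1, epochs=200):
    # fit y = w * x with per-sample gradient descent
    w = 0.0
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x   # d/dw of (w*x - y)^2
            w -= lr * grad
    return w

def evaluate(w, data):
    # mean squared error over the dataset
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

data = process_data(get_data())
w = train(data)
print(evaluate(w, data))
```

In a real pipeline each stage (data ingestion, preprocessing, training, evaluation, deployment) is its own subsystem; the point here is only the shape of the flow.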
  14. Challenges on Training Models
      • Algorithmic aspect
        – Finding a good set of parameters
        – Data augmentation / initialization / pre-processing / …
      • System aspect
        – Managing lots of computing resources
        – Performance tuning
  15. What Makes Deep Learning Engineering Hard: Cheap / Quality / Fast
  16. The Importance of Model Training
      • Scale of data and scale of computation infrastructure together enable the current deep learning renaissance
      • Fast model training is the key to the success of deep learning algorithm development
      • Efficient training demands both algorithmic improvements and careful system configuration
  17. Algorithmic Approach
      • Optimizer (Gradient Descent / Adam / Adagrad / …)
      • Adjust hyper-parameters of gradient-based training
      • Faster kernel implementations
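To make the optimizer choices above concrete, here is a minimal pure-Python sketch of SGD and Adam updates minimizing a toy quadratic. The `Adam` class follows the standard bias-corrected update rule; the hyper-parameter values are illustrative, not recommendations from the talk.

```python
import math

def sgd_step(w, g, lr=0.01):
    # plain gradient descent: step against the gradient
    return w - lr * g

class Adam:
    # standard Adam: first/second moment estimates with bias correction
    def __init__(self, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
        self.lr, self.b1, self.b2, self.eps = lr, b1, b2, eps
        self.m = self.v = 0.0
        self.t = 0

    def step(self, w, g):
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * g
        self.v = self.b2 * self.v + (1 - self.b2) * g * g
        mhat = self.m / (1 - self.b1 ** self.t)
        vhat = self.v / (1 - self.b2 ** self.t)
        return w - self.lr * mhat / (math.sqrt(vhat) + self.eps)

# minimize f(w) = (w - 3)^2 with each optimizer; gradient is 2(w - 3)
w_sgd, w_adam, adam = 0.0, 0.0, Adam()
for _ in range(1000):
    w_sgd = sgd_step(w_sgd, 2 * (w_sgd - 3), lr=0.1)
    w_adam = adam.step(w_adam, 2 * (w_adam - 3))
```

Both reach the minimum here; on real networks the choice of optimizer and its hyper-parameters changes convergence speed dramatically, which is why the slide lists it first.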
  18. System Approach
      • Leverage computing resources to speed up the training process
        ◦ Parallelize the computations
        ◦ Reduce computation overhead
  19. Generic Deep Learning System Architecture

  20. Performance Bottlenecks
      • Computation
        ◦ Gradient computation inside GPU
        ◦ Data preparation (I/O, preprocessing)
      • Synchronization of model parameters
        ◦ GPU to GPU
        ◦ Parameter server
  21. Dual Impacts of Mini-batch Size

  22. Mini-batch Size
      • Batch size and network architecture determine GPU memory usage
      • Trade-off between per-step speed and time to convergence
      • Batch size selection can be formulated as an optimization problem*
      * Distributed Training Large-Scale Deep Architectures, Mina Zou, et al., HTC Research, ADMA 2017.
  23. Tuning for Performance
      • Adjust batch size
        ◦ Smaller batch size with similar convergence quality
      • Fine-tune your network
        ◦ Larger stride for convolution
        ◦ Smaller fully connected layer
        ◦ …
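Since batch size and network size together determine GPU memory usage, a back-of-the-envelope estimate of the largest batch that fits a memory budget is easy to compute. This is a rough sketch under strong assumptions: it counts only a single layer's activations (batch × H × W × C floats) and ignores weights, gradients, optimizer state, and framework overhead, all of which shrink the real budget.

```python
def activation_bytes(batch, h, w, channels, dtype_bytes=4):
    # activations of one feature map: batch x H x W x C values
    return batch * h * w * channels * dtype_bytes

def max_batch(budget_bytes, h, w, channels, dtype_bytes=4):
    # largest batch whose single-layer activations fit in the budget
    per_sample = h * w * channels * dtype_bytes
    return budget_bytes // per_sample

# e.g. a 224x224x64 feature map against a 2 GiB activation budget
print(max_batch(2 * 1024**3, 224, 224, 64))
```

Estimates like this explain the slide's tuning knobs: a larger convolution stride or a smaller fully connected layer shrinks activations, which frees memory for a larger batch.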
  24. I/O Handling
      • Data/computation pipelining ("not good" vs. "good" overlap of the two)
        ◦ Number of data threads matters
        ◦ Disk I/O is expensive
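The pipelining idea above can be illustrated with a bounded producer-consumer queue: a loader thread prefetches batches while the training loop consumes them, so I/O overlaps with computation. This is a pure-Python stand-in; the `sleep` simulates slow disk reads, and the arithmetic stands in for gradient computation.

```python
import queue
import threading
import time

def loader(q, n_batches):
    # producer: simulates slow disk I/O, then hands a batch to the queue
    for i in range(n_batches):
        time.sleep(0.01)              # pretend disk read
        q.put(list(range(i, i + 4)))  # a toy "batch" of 4 samples
    q.put(None)                       # sentinel: no more data

def train_loop(q):
    # consumer: computes on one batch while the loader prefetches the next
    total = 0
    while True:
        batch = q.get()
        if batch is None:
            break
        total += sum(batch)           # pretend gradient computation
    return total

q = queue.Queue(maxsize=2)            # bounded queue = prefetch buffer
t = threading.Thread(target=loader, args=(q, 5))
t.start()
result = train_loop(q)
t.join()
print(result)
```

The bounded `maxsize` is the prefetch depth: too small and the GPU starves ("not good"), large enough and I/O hides behind computation ("good"). Real frameworks expose the same knob as the number of data-loading threads.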
  25. Scale the Training
      • Multi-GPU training
        ◦ Demands faster data provision
        ◦ Peer-to-peer data transfer
        ◦ Peer-to-peer parameter synchronization
      • Distributed training
        ◦ Increase computation time (e.g. larger batch size) to hide transmission efforts and reduce the number of updates
        ◦ Network capacity of the parameter server can be the bottleneck
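The synchronous data-parallel scheme implied above can be sketched in a few lines: each simulated worker computes a gradient on its own data shard, the gradients are averaged (a stand-in for peer-to-peer all-reduce or a parameter server), and one shared model is updated. All names and numbers here are illustrative.

```python
def grad(w, shard):
    # mean gradient of squared error for y = w * x on one shard
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(grads):
    # stand-in for peer-to-peer gradient averaging across workers
    return sum(grads) / len(grads)

# toy data following y = 3x, split evenly across 4 simulated workers
data = [(x, 3 * x) for x in range(1, 17)]
shards = [data[i::4] for i in range(4)]

w, lr = 0.0, 0.005
for step in range(200):
    g = allreduce_mean([grad(w, s) for s in shards])  # sync point
    w -= lr * g                                       # single model update
```

Because shards are equal-sized, the averaged shard gradients equal the full-batch gradient, so the parallel run matches single-worker training step for step. The sync point is exactly where the slide's bottlenecks live: every update pays a communication cost, which is why larger batches (fewer updates) help hide transmission time.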
  26. DeepQ AI Platform

  27. Training a Model From Scratch
      • Hyper-parameter tuning
      • Data preprocessing
      • Performance tuning
      • Model optimization
      • Task management
      • Configuration
      • Software update
      • Resource management
      • Operations (Ops)
  28. DeepQ Open AI Platform (layered architecture)
      • Services: Web Admin / Task Manager / Resource Manager / Service Discovery
      • Computation Framework: CNN / RNN / LSTM / Reinforcement Learning / …
      • Container: Docker
      • Infrastructure: GPU / CPU / Storage / Network
  29. (image slide)

  30. DeepQ Open AI Platform Features
      • On-demand resource allocation
      • Support for multiple deep learning frameworks
      • Configuration advisor
      • Status monitoring
      • Task management
      • Scalable training infrastructure
  31. Workflow
      Components: Frontend, Task Queue, Resource Manager, Worker, Storage, Monitor / Alert / Logger, Platform Config, User code
      1: Request → 2: Submit Tasks → 3: Get Tasks → 4: Assign Tasks → 5: Do Tasks → 6: Notify when any task is done → 7: Get Result from Storage → 8: Response
      (Workers check/report status, get data / save results, and log throughout.)
  32. Service Stacks of DeepQ AI Platform
      • Web frontend (NodeJS)
      • Worker
        ◦ Smart-mode model trainer (Python)
        ◦ Job dispatcher (Golang)
      • Service components (Golang)
        ◦ Task queue (based on disque)
        ◦ Resource manager
      • Database: Firebase
      • Storage: GCS, AWS S3
      • Service Infras: Consul (Service Discovery), Redis
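The submit/dispatch/result flow of the platform can be mimicked in miniature with an in-process task queue. This is a toy stand-in, not the actual DeepQ implementation (which uses a disque-based queue and Golang workers); every name and payload here is invented for illustration.

```python
import queue

task_queue = queue.Queue()   # stand-in for the platform's task queue
results = {}                 # stand-in for result storage

def submit(task_id, payload):
    # step 2 on the workflow slide: a user submits a task
    task_queue.put((task_id, payload))

def worker():
    # steps 3-5: a worker pulls tasks and executes them until drained
    while not task_queue.empty():
        task_id, payload = task_queue.get()
        results[task_id] = payload * 2   # pretend training job
        task_queue.task_done()

submit("job-1", 21)
submit("job-2", 10)
worker()
print(results)
```

The real system adds what this sketch omits: a resource manager assigning tasks to machines, completion notifications back to the frontend, and durable storage for results.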
  33. Configuration Management

  34. Ref: http://www.slideshare.net/arthurlutz/debian-meetup-nantes-2015-salt-pour-grer-de-nombreux-serveurs-debian

  35. Salt-Cloud
      • Provisions systems on cloud providers and hypervisors
  36. Salt-Cloud (AWS)

      Provider configuration:

        ec2-us-west-2-public:
          minion:
            master: ip-172-31-30-32
          id: 'AWS id'
          key: 'AWS key+iIP21RaHNBq1DOMaQMkOAgF'
          private_key: /etc/salt/secret
          keyname: csiuser-dl-oregon
          ssh_interface: public_ips
          securitygroup: security
          location: us-west-2
          iam_profile: arn:iam_role
          driver: ec2
          del_root_vol_on_destroy: True
          del_all_vols_on_destroy: True
          rename_on_destroy: True

      Profile configuration:

        gpu:
          image: ami-d732f0b7
          size: g2.2xlarge
          location: us-west-2
          network: default
          grains:
            role: gpu
          tags: {'Environment': 'dev'}
          del_root_vol_on_destroy: True
          block_device_mappings:
            - DeviceName: /dev/sda1
              Ebs.VolumeSize: 120
              Ebs.VolumeType: gp2
          del_all_vol_on_destroy: True
          ssh_username: ubuntu
          make_master: False
          sync_after_install: grains
          provider: ec2-us-west-2-public

      Usage:

        % salt-cloud -p gpu gpuwork1    (create instance "gpuwork1" from the "gpu" profile)
        % salt-cloud -d gpuwork1        (destroy it)
  37. Service Discovery

  38. Monitoring

  39. Selecting a Computation Framework
      • Easy to express your ideas
        ◦ Programming models
        ◦ Support for new optimization techniques and operators
      • Performance
      • Scalability
      • Popularity (community support)
      • Third-party support (visualization libraries)
      • Deployment (multi-platform support)
  40. Github statistics
      "Messy documentation, painful debugging… TensorFlow has so many flaws, so why do we still treat it like a first love?"
      https://www.leiphone.com/news/201709/3T4pwc5UBLtRuKvx.html

  41. Deep Learning Frameworks on arXiv — https://medium.com/@karpathy/a-peek-at-trends-in-machine-learning-ab8a1085a106

  42. AWS, Azure or GCP

  43. AWS
      • Best performance among current providers
      • The most mature GPU instances
      • GPUs sometimes disappear due to hw/sw/fw issues
      • If you are doing research, AWS may be a good choice because it has the best GPU performance
  44. Azure
      • Lowest price on the market
      • CNTK and Cognitive Services integration
      • Unfriendly (Linux) user interface
      • If you are doing research, Azure may be the best choice because it gives lots of credits
  45. GCP
      • Flexible instance configuration
      • May support TPUs in the future
      • GPU performance is not optimized (yet!)
      • GPUs don't disappear, based on our experience
      • If you are building a product, GCP may be the best choice because it has a faster development cycle and more stable instances
  46. Brief Comparison

      Service Provider   Azure          AWS            GCP
      Price              Lowest         Highest        Medium
      Performance        Third          Best           Second
      Stability*         Third          Second         Best
      Maturity           Third          Best           Second
      Flexibility        Same as AWS    Same as Azure  Best
  47. Thank you!