(Golang)
  ◦ Receive image
  ◦ Analysis
  ◦ Data sync upon notification of result
• Data model stored in AWS S3
• Database: AWS EMR (HBase)
• Service infrastructure:
  ◦ ZooKeeper (service discovery)
  ◦ NATS messaging
  ◦ Redis
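The slide's receive → analyze → sync pipeline can be sketched in Go. This is a minimal stand-in, not the production service: channels play the role of NATS subjects, and the `analyze` function, `Result` type, and label values are all illustrative names invented here.

```go
package main

import "fmt"

// Result of analyzing one image. In the real service, NATS would carry
// these messages and S3/HBase/Redis would hold the actual payloads.
type Result struct {
	ImageID string
	Label   string
}

// analyze is a placeholder for the model-inference step.
func analyze(imageID string) Result {
	return Result{ImageID: imageID, Label: "cat"} // placeholder label
}

func main() {
	images := make(chan string, 3)  // stands in for the "receive image" subject
	results := make(chan Result, 3) // stands in for the result-notification subject

	// Analysis stage: consume images, publish results.
	go func() {
		for id := range images {
			results <- analyze(id)
		}
		close(results)
	}()

	for _, id := range []string{"img-1", "img-2", "img-3"} {
		images <- id
	}
	close(images)

	// Data-sync stage: act on each result notification.
	for r := range results {
		fmt.Printf("sync %s -> %s\n", r.ImageID, r.Label)
	}
}
```

In production the buffered channels would be replaced by durable messaging (NATS) so that the stages can run as separate services discovered via ZooKeeper.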
[Diagram: successful deep learning applications (VR / AR / healthcare) require both deep learning algorithms (hard to learn; new learning techniques) and deep learning infrastructure (computationally intensive; parallel + multiple GPUs) — specific knowledge that must be adopted and integrated. https://research.htc.com/]
The Importance of Model Training
• …infrastructures together enable the current deep learning renaissance
• Fast model training is the key to the success of deep learning algorithm development
• Efficient training demands both algorithmic improvement and careful system configuration
• GPU memory usage
• Trade-off between speed and time
• Batch size selection can be formulated as an optimization problem*

* Distributed Training Large-Scale Deep Architectures, Mina Zou et al., HTC Research, ADMA 2017.
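A simple way to see batch size selection as a constrained optimization: pick the largest batch whose memory footprint fits on the GPU. The sketch below is a greedy stand-in for the formulation in the cited paper, and every constant in it (model size, per-sample activation memory, GPU capacity) is an illustrative assumption, not a measurement.

```go
package main

import "fmt"

// memNeeded estimates GPU memory (MB) for a batch: model weights plus
// per-sample activation memory. Constants are illustrative, not measured.
func memNeeded(batch, modelMB, perSampleMB int) int {
	return modelMB + batch*perSampleMB
}

// bestBatch picks the largest power-of-two batch size whose estimated
// footprint fits in gpuMB — larger batches raise throughput but also
// raise memory use, which is the trade-off the slide refers to.
func bestBatch(gpuMB, modelMB, perSampleMB int) int {
	best := 0
	for b := 1; b <= 1024; b *= 2 {
		if memNeeded(b, modelMB, perSampleMB) <= gpuMB {
			best = b
		}
	}
	return best
}

func main() {
	// e.g. an 11 GB card, 500 MB of weights, 40 MB of activations per sample
	fmt.Println(bestBatch(11264, 500, 40)) // prints 256
}
```

A real formulation (as in the cited work) would also model convergence behavior, not just memory, but the memory constraint alone already rules out most candidate batch sizes.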
Scale the Training
• …provision peer-to-peer data transfer
  ◦ Peer-to-peer parameter synchronization
• Distributed training
  ◦ Increase computation time (e.g., larger batch size) to hide transmission cost and reduce the number of updates
  ◦ The network capacity of the parameter server can become the bottleneck
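The parameter server's core synchronization step can be sketched as gradient averaging. This is a toy illustration with made-up gradient values; the real system would move these vectors over the network, which is exactly why the server's link capacity matters.

```go
package main

import "fmt"

// average combines gradients from all workers — the parameter server's
// core job. With W workers and P parameters the server transfers
// O(W*P) data per update, which is why its network link can become
// the bottleneck the slide warns about.
func average(grads [][]float64) []float64 {
	out := make([]float64, len(grads[0]))
	for _, g := range grads {
		for i, v := range g {
			out[i] += v
		}
	}
	for i := range out {
		out[i] /= float64(len(grads))
	}
	return out
}

func main() {
	// Three workers, each with a 2-parameter gradient (illustrative values).
	grads := [][]float64{{1, 2}, {3, 4}, {5, 6}}
	fmt.Println(average(grads)) // prints [3 4]
}
```

Raising the batch size lengthens each worker's compute phase relative to this synchronization phase, hiding transmission cost and reducing how often `average` must run.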
Workflow
[Diagram: interaction among User code, Platform, Config, Resource Manager, and Workers, with logging throughout]
1. Request (user code → platform)
2. Submit tasks
3. Get tasks
4. Assign tasks (resource manager → workers)
5. Do tasks (workers check/report status; get data / save result)
6. Notify when any task is done
7. Get result from storage
8. Response
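The numbered steps above can be sketched as a small task queue: tasks are submitted (step 2), workers get and do them (steps 3–5), and results are collected for the response (steps 6–8). All names here are illustrative; the real platform assigns tasks through a resource manager and stores results externally.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// runTasks queues the submitted tasks, lets a pool of workers consume
// them, and collects the finished results — a minimal stand-in for the
// platform / resource-manager / worker workflow on the slide.
func runTasks(tasks []string, workers int) []string {
	queue := make(chan string, len(tasks))
	results := make(chan string, len(tasks))
	var wg sync.WaitGroup

	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for t := range queue { // 3: get task, 5: do task
				results <- t + ":done" // stand-in for "save result to storage"
			}
		}()
	}
	for _, t := range tasks { // 2: submit tasks
		queue <- t
	}
	close(queue)
	wg.Wait()
	close(results)

	var out []string
	for r := range results { // 6-7: collect results for the response
		out = append(out, r)
	}
	sort.Strings(out) // worker completion order is nondeterministic
	return out
}

func main() {
	fmt.Println(runTasks([]string{"a", "b", "c"}, 2)) // prints [a:done b:done c:done]
}
```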
…models
  ◦ Support of new optimization techniques and operators
• Performance
• Scalability
• Popularity
  ◦ Community support
• Third-party support
  ◦ Visualization library
• Deployment
  ◦ Multi-platform support
Selecting a Computation Framework
…TPU in the future
• GPU performance is not optimized (yet!)
• Based on our experience, the GPU won't disappear
• If you are building a product, GCP may be the best choice: it has a faster development cycle and more stable instances