
Accelerating Machine Learning I/O by Overlapping Data Staging and Mini-batch Generations


The slides of my presentation at BDCAT '19.
https://dl.acm.org/doi/10.1145/3365109.3368768

Kazuhiro Serizawa

December 04, 2019


Transcript

  1. Accelerating Machine Learning I/O by Overlapping Data Staging and Mini-batch

    Generations Kazuhiro Serizawa 1), Osamu Tatebe 2)  1) Graduate School of Systems and Information Engineering, University of Tsukuba, Japan  2) Center for Computational Sciences, University of Tsukuba, Japan  Kazuhiro Serizawa and Osamu Tatebe. 2019. Accelerating Machine Learning I/O by Overlapping Data Staging and Mini-batch Generations. In Proceedings of the 6th IEEE/ACM International Conference on Big Data Computing, Applications and Technologies (BDCAT '19). ACM, New York, NY, USA, 31-34. DOI: https://doi.org/10.1145/3365109.3368768  2019.12.04 @ Auckland University of Technology, New Zealand
  2. INTRODUCTION (1/2) Training datasets used for Deep Neural Network (DNN) training keep growing. => Read I/O becomes a bottleneck in DNN training. => Methods to improve read I/O in DNN training are required.
    Dataset name | Total data size | Data type
    ImageNet (used in ILSVRC) | … GB | Image
    KITTI dataset | … GB | Image
    YouTube-8M | … TB | Image
    Previous study [1] | … TB | Weather simulation result
    [1] Thorsten Kurth et al., Exascale Deep Learning for Climate Analytics. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC '18, 2018.
  3. INTRODUCTION (2/2) In an HPC cluster, the compute nodes have two types of storage: 1. node-local storage, e.g. NVMe SSD, and 2. a parallel file system, e.g. Lustre, NFS, and so on. => But using node-local storage requires a copy of the training dataset. (Figure: two approaches on a compute node — using node-local storage after copying ("staging"), vs. using the parallel file system directly.)
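To make the staging step concrete, here is a minimal sketch (not from the paper) of what staging amounts to: copying the whole training dataset from the parallel file system to node-local storage before training starts. The paths and the use of shutil are illustrative assumptions.

    import shutil
    import time

    LUSTRE_DIR = "/lustre/datasets/imagenet"  # dataset on the parallel file system (hypothetical path)
    LOCAL_DIR = "/nvme/imagenet"              # node-local NVMe SSD (hypothetical path)

    start = time.time()
    shutil.copytree(LUSTRE_DIR, LOCAL_DIR)    # copying the whole dataset is the staging overhead
    print("staging took %.1f sec" % (time.time() - start))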
  4. BACKGROUND (1/2) To improve read I/O performance, Chainer, an OSS machine learning framework, provides a function to read training data in parallel (fork-join style) and generate a mini-batch from it. (Figure: the "fetch data" part of MultiprocessIterator — an index list such as [1, 2, 3, 4, 5, 6, 7, 8] is split among several "file reader processes"; each process resolves the file path, reads the file, and converts it into a numpy.ndarray; the resulting data1, data2, data3, ... are assembled into a mini-batch and enqueued into an in-memory mini-batch queue.)
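As a concrete illustration, a minimal sketch of driving Chainer's MultiprocessIterator follows; MultiprocessIterator, LabeledImageDataset, and the n_processes argument are part of Chainer's public API, while the dataset file names and paths are assumptions.

    from chainer.datasets import LabeledImageDataset
    from chainer.iterators import MultiprocessIterator

    # Each line of train.txt lists "<image path> <label>" (hypothetical file).
    dataset = LabeledImageDataset("train.txt", root="/nvme/imagenet")

    # n_processes controls how many "file reader processes" read and decode
    # files in parallel (fork-join style) to build each mini-batch.
    iterator = MultiprocessIterator(dataset, batch_size=32, n_processes=8)
    batch = iterator.next()  # a list of (image, label) pairs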
  5. BACKGROUND (2/2) Preliminary experiment: generating 1,000 mini-batches with MultiprocessIterator. The node-local SSD result is faster than the Lustre one. => But there is a staging overhead when using the node-local SSD. (Figure: elapsed time [sec] vs. number of "file reader processes" (2, 4, 8, 12, 16), broken down into "fetch data", "initialize", and "others", for Lustre and for the node-local SSD (ideal environment). The SSD results do not include the time for staging (about … minutes).)
  6. PROBLEM and MOTIVATION Node-local storage on an HPC cluster can improve read I/O performance, => but "staging" of the whole training dataset is always required first. => Our motivation is to conceal and reduce this staging overhead. (Figure: timeline comparison — MultiprocessIterator with node-local storage: staging the whole training dataset into node-local storage is a very big overhead, followed by fast mini-batch generations during training; MultiprocessIterator with Lustre: no staging, but slow mini-batch generations during training.)
  7. PROPOSED METHOD (1/2) So we thought that overlapping the copying of data, in mini-batch units and in parallel with mini-batch generations, may achieve both a small overhead and fast mini-batch generations. (Figure: timeline of our proposed method using both Lustre and node-local storage — the training dataset is staged into node-local storage in mini-batch units, in parallel with mini-batch generations and training: small overhead and fast mini-batch generations.)
  8. PROPOSED METHOD (2/2) The basic idea of our proposed method is overlapping the copying of the training dataset with mini-batch generation. We designed it as a three-stage pipeline running in parallel, and implemented this pipeline by extending the original MultiprocessIterator. The stages are: 1. the generating-index-lists stage, 2. the prefetching-data stage, and 3. the generating-mini-batches ("fetch data") stage. (Figure: the stages are connected by index list queue 1 and index list queue 2 and overlap along the time axis.)
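The sketch below illustrates the three-stage pipeline idea in plain Python; it is a simplification, not the authors' implementation (the reference code is linked in the appendix), and the queue wiring and helper names are illustrative.

    import os
    import random
    import shutil

    def generate_index_lists(n_batches, batch_size, n_samples, index_queue):
        # Stage 1: produce a shuffled index list for each mini-batch.
        indices = list(range(n_samples))
        random.shuffle(indices)
        for i in range(n_batches):
            index_queue.put(indices[i * batch_size:(i + 1) * batch_size])
        index_queue.put(None)  # sentinel: no more mini-batches

    def prefetch_data(paths, local_dir, index_queue, prefetched_queue):
        # Stage 2: copy the files of one mini-batch at a time to node-local storage.
        while True:
            index_list = index_queue.get()
            if index_list is None:
                break
            local_paths = []
            for i in index_list:
                dst = os.path.join(local_dir, os.path.basename(paths[i]))
                shutil.copy(paths[i], dst)
                local_paths.append(dst)
            prefetched_queue.put(local_paths)
        prefetched_queue.put(None)

    def generate_minibatches(load_fn, prefetched_queue, batch_queue):
        # Stage 3 ("fetch data"): read the staged files and build mini-batches.
        while True:
            local_paths = prefetched_queue.get()
            if local_paths is None:
                break
            batch_queue.put([load_fn(p) for p in local_paths])
        batch_queue.put(None)

Each stage would run in its own process (for example via multiprocessing.Process), and the two queues between the stages play the role of index list queue 1 and 2 in the figure, so staging of later mini-batches overlaps with the generation of earlier ones.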
  9. ABOUT EVALUATION (1/2) We conducted two experiments to evaluate the speed of the proposed method (mini-batch size is 32). 1. 1,000 mini-batches benchmark with ImageNet. • Evaluates mini-batch generation only, repeated 1,000 times. (Figure: the proposed method or MultiprocessIterator reads training data from storage and enqueues mini-batches into an in-memory mini-batch queue, for 1,000 mini-batches.)
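A hedged sketch of what this benchmark loop boils down to (the timer placement and function name are assumptions, not the exact benchmark script):

    import time

    def benchmark_minibatch_generation(iterator, n_batches=1000):
        # Time how long generating n_batches mini-batches (of size 32) takes.
        start = time.time()
        for _ in range(n_batches):
            iterator.next()  # generate / dequeue one mini-batch
        return time.time() - start

    # e.g. benchmark_minibatch_generation(MultiprocessIterator(dataset, 32, n_processes=8))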
  10. ABOUT EVALUATION (2/2) We conducted two experiments to evaluate the speed of the proposed method (mini-batch size is 32). 2. Data-parallel training of ResNet-50 for 2 epochs with ImageNet. • Evaluates the end-to-end time of an actual whole DNN training run. • Using 1 to 16 nodes, with 4 training processes per node. (Figure: on each node, the proposed method or MultiprocessIterator reads training data from storage and enqueues mini-batches into an in-memory queue; the training processes fetch mini-batches from these queues for 2 epochs, about 80,000 iterations per training process.)
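For orientation, a rough sketch of a ChainerMN data-parallel setup of the kind described above; create_communicator, create_multi_node_optimizer, and ResNet50Layers are real Chainer/ChainerMN APIs, but the communicator choice, optimizer, and other details of the actual training script are assumptions.

    import chainer
    import chainermn
    from chainer.links import Classifier, ResNet50Layers

    comm = chainermn.create_communicator("pure_nccl")  # one communicator per training process
    device = comm.intra_rank                           # one GPU per process (4 processes per node)

    model = Classifier(ResNet50Layers(pretrained_model=None))
    chainer.cuda.get_device_from_id(device).use()
    model.to_gpu()

    # Gradients are all-reduced across all training processes every iteration.
    optimizer = chainermn.create_multi_node_optimizer(chainer.optimizers.MomentumSGD(), comm)
    optimizer.setup(model)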
  11. EVALUATION ENVIRONMENT All evaluations were conducted on the HPC cluster "Cygnus", whose compute nodes have an NVMe SSD as node-local storage and Lustre as the parallel file system. Compute node specification in Cygnus:
    • CPU: Intel Xeon Gold 6126 Processor (12C/2.6GHz) x 2
    • Memory: 192 GiB (16 GiB DDR4-2666 ECC RDIMM x 12)
    • Node-local storage: Intel SSD DC P4610 Series 3.2 TB x 1
    • Parallel file system: Lustre (DDN EXAScaler), 2.5 PB in total
    • GPU: NVIDIA Tesla V100 32 GiB HBM2 PCIe 3.0 x 4
    • OS: CentOS Linux release 7.6.1810
    • Python: 3.6.8
  12. EVALUATION SETTINGS • We specified the "forkserver" option as the "start_method" of "multiprocessing.Process". • This option tends to increase the CPU overhead of creating processes. • We chose it because there is a possibility of a process crash when using MultiprocessIterator together with InfiniBand (as noted in the Chainer documentation). • https://docs.chainer.org/en/v6.5.0/chainermn/tutorial/tips_faqs.html • All images we used were resized to 256 x 256.
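A minimal sketch of the two settings above; multiprocessing.set_start_method and the PIL resize call are standard APIs, while how they were wired into the actual scripts is an assumption.

    import multiprocessing
    from PIL import Image

    # Use "forkserver" instead of the default "fork" to avoid the process-crash
    # issue with MultiprocessIterator + InfiniBand noted in the Chainer FAQ.
    multiprocessing.set_start_method("forkserver")

    def resize_to_256(src_path, dst_path):
        # All training images are resized to 256 x 256 beforehand.
        Image.open(src_path).convert("RGB").resize((256, 256)).save(dst_path)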
  13. EVALUATION RESULT 1 (1/4) In the result of the "1,000 mini-batches benchmark", the elapsed time for mini-batch generations decreased as the number of prefetcher processes increased. => Parallelized prefetching can conceal the staging time and leverage the node-local SSD. (Figure: the result of the "1,000 mini-batches benchmark" of the proposed method with 2 file reader processes fixed — elapsed time [sec] vs. number of "prefetcher processes" (2, 4, 8, 12, 16), broken down into "dequeue an index list", "fetch data", "initialize", and "others".) • With a low number of prefetcher processes, the time for prefetching is the main bottleneck. • With 8 or more prefetcher processes, the time for prefetching is almost concealed.
  14. EVALUATION RESULT 1 (2/4) • Compared with MultiprocessIterator with Lustre, the proposed method is 1.38 to 6.19 times faster. • Compared with MultiprocessIterator with SSD, the proposed method is not as fast, but the SSD result does not include the staging time. (Figure: the result of the "1,000 mini-batches benchmark" for the proposed method and MultiprocessIterator with Lustre and with SSD — elapsed time [sec] vs. number of "file reader processes" (2, 4, 8, 12, 16); annotated speedups of 1.38 and 6.19 times over Lustre.) • In this comparison, the number of "prefetcher processes" is fixed to 12 in the proposed method.
  15. EVALUATION RESULT 1 (3/4) The results of the proposed method are almost the same across file reader process counts, because the time for "initialize" increases as the number of file reader processes increases. (Figure: the result of the "1,000 mini-batches benchmark" of the proposed method — elapsed time [sec] vs. number of "file reader processes" (2, 4, 8, 12, 16), broken down into "dequeue an index list", "fetch data", "initialize", and "others".)
  16. EVALUATION RESULT 1 (4/4) At the beginning, it takes longer to dequeue an index list from index list queue 2. This is because at the start no mini-batches are buffered yet, and copying the files of the first mini-batch takes a long time. (Figure: breakdown of the time to dequeue index lists from the index list queue in the proposed method, plotted against the count of loaded mini-batches (0 to 1,000), for 2 to 16 prefetcher processes (p2, p4, p8, p12, p16).)
  17. EVALUATION RESULT 2 Mini-batch generation throughput affects the overall time of actual data-parallel training. => The performance of read I/O has a big influence on actual DNN training time. (Figure: the result of "data-parallel training for 2 epochs" for the proposed method and MultiprocessIterator with Lustre and with SSD, with 2 file reader processes and 2 prefetcher processes fixed — elapsed time [sec] vs. number of nodes (1, 2, 4, 8, 16; PPR=4); annotated speedups of 1.74 and 1.26 times.)
  18. RELATED WORK • Entropy-Aware I/O Pipelining for Large-Scale Deep Learning on HPC Systems (Zhu et al., 2018 [2]) • The use of a local cache is a point of similarity with our proposed method. • Characterizing Deep-Learning I/O Workloads in TensorFlow (Chowdhury et al., 2018 [3]) • An examination of the effects of parallelizing the reading of training data and buffering mini-batches in TensorFlow. Some studies have proposed methods to accelerate the reading of training data, or have examined the effect of read I/O improvements.
    [2] Y. Zhu et al., Entropy-Aware I/O Pipelining for Large-Scale Deep Learning on HPC Systems. In 2018 IEEE 26th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), 2018.
    [3] Fahim Chowdhury et al., Characterizing Deep-Learning I/O Workloads in TensorFlow. In 2018 IEEE/ACM 3rd International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems (PDSW-DISCS), 2018.
  19. CONCLUSION • Our proposed method can accelerate the reading of training data by overlapping staging and mini-batch generations. • Our proposed method achieved 1.38 to 6.19 times better bandwidth compared with reading the ImageNet dataset directly from Lustre. • Future work: implementations on other machine learning frameworks (e.g. TensorFlow, MXNet, etc.)
  20. APPENDIX • The manuscript of this study • https://dl.acm.org/doi/10.1145/3365109.3368768 • The source code of the reference implementation of this study • https://github.com/serihiro/chainer_prefetch_multiprocess_iterator • About "Cygnus", the HPC cluster used in the evaluations of this study • https://www.ccs.tsukuba.ac.jp/wp-content/uploads/sites/… (About Cygnus PDF) • About "Chainer" • https://chainer.org/