Distributed Deep Learning with Chainer and Hadoop

UENISHI Kota
December 04, 2019

Transcript

  1. 2019/12/4 Hadoop SCR #27 Distributed Deep Learning with Chainer and

    Hadoop (How Fast Can Data be Delivered to DNNs?) Kota UENISHI Preferred Networks, Inc. 1
  2. Preferred Networks, Inc. Since 2014 Mission - Make the Real

    World Computable - Rapid Realization of Cutting-Edge Technologies - Robot for Everyone 2
  3. Our Storage System History

    2017 (MN-1): NFS with 3-5 storage systems running, ~200TB
    2018 (MN-1): NFS with 2 storage servers, ~400TB; HDFS with 5 DataNodes, ~500TB
    2019 (MN-2): NFS with 2 HDD-based and 2 NVMe-based storage servers, ~600TB; HDFS with 20 DataNodes, ~3PB
    Disclaimer: numbers are very rough
  4. Why Hadoop?

    ❖ Ecosystem
      ➢ Troubleshooting by Googling
      ➢ Various operational tools
    ❖ Balance between a newer architecture and historical reliability
    ❖ Operational ease for PFN
      ➢ User-land system
      ➢ Only part-time volunteers; no full-time team
  5. Deep Learning

    Diagram: Forward produces a prediction [fish, dog, …] that is compared against the ground truth [cat, dog, …] to compute the loss [1.0, 0.0, …]; Backward computes the parameter gradients (grad), and Optimize updates the parameters (update param).
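    The Forward/Backward/Optimize loop above maps directly onto Chainer's API. A minimal sketch; the toy linear model, batch shapes, and learning rate are illustrative, not from the talk:

    import numpy as np
    import chainer
    import chainer.links as L

    # Toy classifier standing in for a real DNN (illustrative only).
    model = L.Classifier(L.Linear(None, 10))
    optimizer = chainer.optimizers.SGD(lr=0.01)
    optimizer.setup(model)

    x = np.random.rand(32, 784).astype(np.float32)            # one minibatch of inputs
    t = np.random.randint(0, 10, size=32).astype(np.int32)    # ground-truth labels

    loss = model(x, t)   # Forward: predict and compare against the ground truth (loss)
    model.cleargrads()   # reset gradients from the previous iteration
    loss.backward()      # Backward: compute parameter gradients (grad)
    optimizer.update()   # Optimize: update the parameters with the gradients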
  6. SGD with Minibatch

    Repeat: Shuffle, then draw Minibatches
    • MNIST: 60k images, typical batch size 10~100
    • CIFAR10: 60k images, typical batch size 64
    • ImageNet-1k: 1580k images, typical batch size 32
    • Open Images Dataset (OID): 8M images, typical batch size ???
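    A minimal sketch of the Shuffle / Minibatch / Repeat loop on the slide; the dataset size and batch size are illustrative:

    import numpy as np

    dataset = np.arange(60000)      # stand-in for 60k images (e.g. MNIST indices)
    batch_size = 100

    for epoch in range(10):
        order = np.random.permutation(len(dataset))      # Shuffle once per epoch
        for i in range(0, len(order), batch_size):
            batch = dataset[order[i:i + batch_size]]     # draw one minibatch
            # Forward / Backward / Optimize on `batch` goes here (Repeat)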
  7. Distributed Deep Learning

    Diagram: each worker runs Forward, Backward, and Optimize on its own minibatch; gradients are combined across workers with All-Reduce using NCCL.
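    In ChainerMN this pattern is expressed by wrapping the optimizer with a multi-node optimizer. A minimal sketch, assuming one MPI process per GPU; the toy model and learning rate are illustrative:

    import chainer
    import chainer.links as L
    import chainermn

    # One MPI process per GPU; gradients are all-reduced with NCCL.
    comm = chainermn.create_communicator('pure_nccl')
    device = comm.intra_rank                      # local GPU id for this process

    model = L.Classifier(L.Linear(None, 10))      # toy model, illustrative only
    chainer.cuda.get_device_from_id(device).use()
    model.to_gpu()

    # Wrapping the optimizer makes it all-reduce gradients across processes
    # before the Optimize step updates the parameters.
    optimizer = chainermn.create_multi_node_optimizer(
        chainer.optimizers.MomentumSGD(lr=0.1), comm)
    optimizer.setup(model)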
  8. Our Clusters

    Preferred Networks builds MN-2, a state-of-the-art supercomputer powered by NVIDIA GPUs.
    • MN-1a: NVIDIA P100 (PCIe, 12GB/16GB), 1024 GPUs, 2048 CPUs, InfiniBand FDR x2 (128Gbps) interconnect, 1GbE network, ~800GB local storage, 19.1 PFLOPS catalog performance
    • MN-1b: NVIDIA V100 (PCIe, 32GB), 512 GPUs, 2304 CPUs, InfiniBand EDR x2 (200Gbps) interconnect, 10GbE network, ~3.2TB local storage, 57.3 PFLOPS catalog performance
    • MN-2: NVIDIA V100 (SXM2, 32GB/16GB), 1024 GPUs, 5760 CPUs, RoCEv2 interconnect, 100GbE x4 network, ~3.2TB local storage, 128 PFLOPS catalog performance
  9. MN-1 © NTT Communications

  10. Why Do We Build On-prem Clusters?

    A: A faster cycle of training and testing is crucial to the speed of our R&D.
  11. How Fast is ImageNet Delivered to Accelerators?

    • Facebook: 1 hour [1]; 39.5k images/sec total; 154.3 images/sec per GPU/TPU; internal cluster
    • PFN: 15 minutes [2]; 158k images/sec total; 154.3 images/sec per GPU/TPU; MN-1a
    • SONY: 224 seconds [3]; 635k images/sec total; 291.7 images/sec per GPU/TPU; ABCI
    • Google: 88 seconds [4]; 1.63M images/sec total; 1578.0 images/sec per GPU/TPU; internal TPU Pod
    • Fujitsu: 74.7 seconds [5]; 1.90M images/sec total; 929.5 images/sec per GPU/TPU; ABCI
    ImageNet:
    • (Originally) A WordNet-like annotated set of images
    • (Narrow context) A classification subset and task for the 2012 and 2015 competitions
    • (Narrower context) The de facto DL performance benchmark
  12. How Fast Can You Deliver?

    Diagram: Dataset in Disks → Node Memory → GPU.
  13. 1. Naively Read from NFS

    Pros
    ❖ Easy
    ❖ No porting
    ❖ Automatically cached
    Cons
    ❖ Not that fast
      ➢ >10^2 ms per read
    ❖ NFS can easily be overloaded
    ❖ Not sustainable
      ➢ Capacity is limited
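    A minimal sketch of this naive approach, assuming a Chainer DatasetMixin whose every access opens a file on an NFS mount; the mount path, file list, and class name are illustrative:

    import os
    import numpy as np
    from PIL import Image
    import chainer

    class NFSImageDataset(chainer.dataset.DatasetMixin):
        """Reads each image directly from a shared NFS mount on every access."""

        def __init__(self, root='/mnt/nfs/imagenet', paths=()):
            self.root = root
            self.paths = list(paths)      # relative image paths under the mount

        def __len__(self):
            return len(self.paths)

        def get_example(self, i):
            # One open()/read() against NFS per sample: easy and needs no porting,
            # but each read can take >10^2 ms and NFS is easily overloaded.
            with Image.open(os.path.join(self.root, self.paths[i])) as img:
                return np.asarray(img.convert('RGB'), dtype=np.float32)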
  14. 2. Copy to Local in Advance and Read Locally

    Pros
    ❖ Fastest and easy
      ➢ 10^1~10^-1 ms per read depending on local media
    ❖ No porting
    ❖ Automatically cached
    ❖ I/O load distributed
    Cons
    ❖ Needs data distribution before computation
    ❖ Local disk space is limited
    ❖ Not sustainable
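    A minimal sketch of this copy-to-local staging step, with illustrative paths; training afterwards only touches the local copy:

    import os
    import shutil

    SRC = '/mnt/nfs/imagenet'      # shared storage (illustrative path)
    DST = '/scratch/imagenet'      # local NVMe/SSD (illustrative path)

    # Data distribution step that has to finish before computation starts.
    if not os.path.exists(DST):
        shutil.copytree(SRC, DST)

    # Training then reads only the local copy, e.g.:
    # dataset = NFSImageDataset(root=DST, paths=paths)   # ~10^1-10^-1 ms per read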
  15. 3. Super-scalable and Fast Storage

    Pros
    ❖ Easy and fast
    ❖ No porting
    ❖ Automatically cached
    Cons
    ❖ No such storage
    ❖ Not a poor-man's solution
  16. 4. Local HDFS

    Inspired by MapReduce-style I/O
    Pros
    ❖ Fast thanks to direct local reads
    Cons
    ❖ It's very difficult (see the next slide)
  17. Why Local HDFS is not Practical

    ❖ The amount of data scattered to each node is not equal in HDFS
    ❖ NCCL, the de facto key library for fast allreduce, assumes equal data sizes across GPUs
    ❖ Deploying a kubelet and an HDFS DataNode on the same host is not practical either
    ❖ It is difficult to allocate pods/GPUs where the data is; this requires tight coupling of the scheduler and data distribution
  18. 5. Remote HDFS

    Pros
    ❖ (Maybe) sustainable
    ❖ Almost infinite capacity
    ❖ Enough throughput
    ❖ A poor-man's solution
    Cons
    ❖ Too many small files
    ❖ High latency
      ➢ 10^-1~10^3 seconds
    ❖ Needs application porting
    ❖ ...and more!
  19. Mitigating Application Porting: ChainerIO

    ❖ An API abstraction library that lets the same code run against the local filesystem and HDFS
    ❖ Standing on the shoulders of PyArrow's HDFS client

    $ sudo apt-get install openjdk-8-jdk libhdfs0
    $ pip install --user chainerio
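    A minimal sketch of the abstraction, assuming chainerio.open dispatches on the path scheme so the same code handles a local path and an hdfs:// URI; the paths themselves are illustrative:

    import chainerio

    # Same call for local and HDFS files; the hdfs:// case goes through
    # PyArrow's HDFS client (libhdfs) underneath.
    with chainerio.open('/mnt/local/dataset/train.zip', 'rb') as f:
        head_local = f.read(16)

    with chainerio.open('hdfs:///user/someone/dataset/train.zip', 'rb') as f:
        head_hdfs = f.read(16)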
  20. Mitigating High Latency 0/3: Prefetching

    • Chainer's MultithreadIterator and MultiprocessIterator enable prefetching
      – Iterators know SGD's random choice of batches in advance
      – A thread/process pool prefetches batches in a multiplexed way
    Diagram: without prefetching, the main process, storage, and GPU work strictly in turn; with prefetching, the pool overlaps storage reads with GPU computation.
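    A minimal sketch of prefetching with MultiprocessIterator; worker processes read the next batches while the GPU works on the current one. The dummy dataset, batch size, and pool sizes are illustrative:

    import numpy as np
    import chainer
    from chainer.iterators import MultiprocessIterator

    # Small dummy dataset standing in for image/label pairs (illustrative only).
    dataset = chainer.datasets.TupleDataset(
        np.random.rand(1000, 3072).astype(np.float32),
        np.random.randint(0, 10, size=1000).astype(np.int32))

    # The iterator knows which indices the next batches will use, so its
    # process pool can fetch them ahead of time (multiplexed prefetching).
    it = MultiprocessIterator(dataset, batch_size=32,
                              n_processes=8, n_prefetch=4, shuffle=True)

    batch = it.next()      # this batch was already prefetched by the pool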
  21. Mitigating High Latency 1/3: ZIP

    • Put all images into one or a few large ZIP files
      – Bonus: the small-file problem is resolved
      – Cons: building the ZIP is hard
    • Without ZIP: the number of open calls is proportional to the number of images
    • With ZIP: just a few open calls
    ➢ To read the first byte of a file, 2 round trips of the Hadoop protocol (client, NameNode, DataNode) are required,
    ➢ … which involve 2 TLS handshakes in a secure cluster,
    ➢ … which involve 12 IP packet round trips in total,
    ➢ plus 2x2 round trips for the Kerberos (krb5) ticket check
    ➢ This may or may not be the culprit
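    A minimal sketch of the ZIP approach: the archive is opened once, and each image is then read out of it without another HDFS/NFS open per file. The class name is illustrative:

    import io
    import zipfile
    import numpy as np
    from PIL import Image
    import chainer

    class ZipImageDataset(chainer.dataset.DatasetMixin):
        """Serves images out of one large ZIP archive instead of many small files."""

        def __init__(self, zip_path):
            self.zf = zipfile.ZipFile(zip_path)    # just a few open calls in total
            self.names = [n for n in self.zf.namelist() if not n.endswith('/')]

        def __len__(self):
            return len(self.names)

        def get_example(self, i):
            # Reads bytes from inside the already-open archive: no per-image
            # open, hence no extra NameNode/DataNode/Kerberos round trips.
            data = self.zf.read(self.names[i])
            with Image.open(io.BytesIO(data)) as img:
                return np.asarray(img.convert('RGB'), dtype=np.float32)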
  22. Mitigating High Latency 2/3: Preload/Cache

    ChainerMN's scatter_dataset assigns a shard of the dataset to each GPU process, and ChainerIO's FileCache caches the assigned data on local disk.
    Pros
    ❖ The cache is not capped by RAM size
    Cons
    ❖ The first epoch is still slow
      ➢ Preload
    ❖ Application porting is still required
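    A minimal sketch combining ChainerMN's scatter_dataset with a local-disk cache. The chainerio.cache import path and the FileCache get/put interface shown here are assumptions, and full_dataset stands for any dataset object built beforehand:

    import pickle
    import chainer
    import chainermn
    from chainerio.cache import FileCache    # assumed import path

    comm = chainermn.create_communicator('pure_nccl')
    # Each GPU process receives only its shard of the dataset.
    subset = chainermn.scatter_dataset(full_dataset, comm, shuffle=True)

    class CachedDataset(chainer.dataset.DatasetMixin):
        """Caches samples on local disk so epochs after the first avoid HDFS."""

        def __init__(self, base):
            self.base = base
            self.cache = FileCache(len(base))     # on-disk, so not capped by RAM

        def __len__(self):
            return len(self.base)

        def get_example(self, i):
            data = self.cache.get(i)              # assumed: returns None on a miss
            if data is None:
                data = pickle.dumps(self.base[i]) # slow path: remote HDFS read
                self.cache.put(i, data)           # cached locally for later epochs
            return pickle.loads(data)

    train = CachedDataset(subset)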
  23. Typical Workload in MN-2

    • Job A: ImageNet; 32~256 GPUs (nodes); training time 1 day ~ 30 min; 90 epochs; preload takes ~hours
    • Job B: OID (detection); 32~512 GPUs (nodes); training time days ~ hours; 10 epochs; preload takes ~hours
    • Job C: OID (full set); 32~256 GPUs (nodes); training time days ~ hours; 50~100 epochs; preload takes ~hours
  24. Mitigating High Latency 3/3: More Ideas Coming

    Problems
    • Preloading overhead
    • Application portability on non-POSIX implementations
    • Python-JVM data copying
    • multiprocessing fork vs JVM
    • Huge datasets that don't fit local storage
    Ideas
    • An original file format faster than ZIP
    • Ozone and a flash-based cluster
    • Distributed caching
    • And more
  25. Summary

    ❖ Speed is crucial in machine learning R&D
    ❖ The DL workload is random reads with a low-latency requirement
    ❖ Sustainability and performance (especially latency) are in a trade-off
    ❖ HDFS strikes a better balance in that trade-off
    ❖ Workarounds and mitigations improve the balance further
      ➢ Abstraction library, caching, and ZIP
  26. Questions? 26