
Distributed Deep Learning with Chainer and Hadoop

UENISHI Kota
December 04, 2019


Transcript

  1. 2019/12/4 Hadoop SCR #27 Distributed Deep Learning with Chainer and

    Hadoop (How Fast Can Data be Delivered to DNNs?) Kota UENISHI, Preferred Networks, Inc.
  2. Preferred Networks, Inc. Since 2014 Mission - Make the Real

    World Computable - Rapid Realization of Cutting-Edge Technologies - Robot for Everyone
  3. Our Storage System History

    2017 (MN-1): NFS: 3-5 storage systems running, ~200TB
    2018 (MN-1): NFS: 2 storage servers, ~400TB; HDFS: 5 DataNodes, ~500TB
    2019 (MN-2): NFS: 2 HDD-based + 2 NVMe-based storage servers, ~600TB; HDFS: 20 DataNodes, ~3PB
    Disclaimer: numbers are very rough
  4. Why Hadoop?

    ❖ Ecosystem
      ➢ Troubleshooting by Googling
      ➢ Various operational tools
    ❖ Balance of newer architecture vs. historical reliability
    ❖ Operational ease for PFN
      ➢ User-land system
      ➢ Only part-time volunteers; no full-time team
  5. Deep Learning

    [Diagram] Forward produces a prediction (e.g. [fish, dog, …]) that is compared with the ground truth ([cat, dog, …]) to compute a loss ([1.0, 0.0, …]); Backward computes the gradients; Optimize updates the parameters.
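
    [Illustrative sketch, not part of the original deck] The Forward/Backward/Optimize cycle above corresponds to one training step in Chainer; `model`, `optimizer`, and the minibatch (x, t) are assumed to be prepared elsewhere.

        import chainer.functions as F

        def train_step(model, optimizer, x, t):
            y = model(x)                          # Forward: compute the prediction
            loss = F.softmax_cross_entropy(y, t)  # compare with the ground truth labels
            model.cleargrads()
            loss.backward()                       # Backward: compute gradients
            optimizer.update()                    # Optimize: update parameters
            return float(loss.array)
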
  6. SGD with Minibatch

    Shuffle → Minibatch → Repeat
    MNIST: 60k images, typical batch size 10~100
    CIFAR10: 60k images, typical batch size 64
    ImageNet-1k: 1580k images, typical batch size 32
    Open Images Dataset (OID): 8M images, typical batch size ???
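
    [Illustrative sketch, not part of the original deck] The Shuffle/Minibatch/Repeat cycle in plain Python/NumPy; `dataset` and `train_step` are placeholders.

        import numpy as np

        def run_epochs(dataset, batch_size, n_epochs, train_step):
            n = len(dataset)
            for epoch in range(n_epochs):                 # Repeat
                order = np.random.permutation(n)          # Shuffle
                for i in range(0, n, batch_size):
                    # Minibatch: gather the next batch_size samples in shuffled order
                    batch = [dataset[j] for j in order[i:i + batch_size]]
                    train_step(batch)
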
  7. Distributed Deep Learning

    [Diagram] Every GPU runs its own Forward and Backward pass; gradients are combined with an All-Reduce (via NCCL) before each Optimize step.
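
    [Illustrative sketch, not part of the original deck] Data-parallel training with NCCL-based all-reduce in ChainerMN; the model and the base optimizer settings (e.g. lr=0.1) are placeholders assumed to be defined elsewhere.

        import chainer
        import chainermn

        comm = chainermn.create_communicator('pure_nccl')  # NCCL all-reduce backend
        device = comm.intra_rank                            # one GPU per process
        model.to_gpu(device)

        # The wrapper all-reduces gradients across workers before every update.
        optimizer = chainermn.create_multi_node_optimizer(
            chainer.optimizers.MomentumSGD(lr=0.1), comm)
        optimizer.setup(model)
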
  8. Our Clusters

    MN-1a: NVIDIA P100 (PCIe, 12GB/16GB), 1024 GPUs, 2048 CPUs, InfiniBand FDR x2 (128Gbps) interconnect, 1GbE network, ~800GB local storage, 19.1PFLOPS catalog performance
    MN-1b: NVIDIA V100 (PCIe, 32GB), 512 GPUs, 2304 CPUs, InfiniBand EDR x2 (200Gbps) interconnect, 10GbE network, ~3.2TB local storage, 57.3PFLOPS catalog performance
    MN-2: NVIDIA V100 (SXM2, 32GB/16GB), 1024 GPUs, 5760 CPUs, RoCEv2 interconnect, 100GbE x4 network, ~3.2TB local storage, 128PFLOPS catalog performance
    Preferred Networks builds MN-2, a state-of-the-art supercomputer powered with NVIDIA GPUs.
  9. Why Do We Build On-prem Clusters?

    A. A faster cycle of training & testing is crucial to the speed of our R&D.
  10. How Fast is ImageNet Delivered to Accelerators

    Affiliation | Duration | Total | Per GPU/TPU | Computer
    Facebook | 1 hour [1] | 39.5k images/sec | 154.3 images/sec | (internal cluster)
    PFN | 15 minutes [2] | 158k images/sec | 154.3 images/sec | MN-1a
    SONY | 224 seconds [3] | 635k images/sec | 291.7 images/sec | ABCI
    Google | 88 seconds [4] | 1.63M images/sec | 1578.0 images/sec | internal TPU Pod
    Fujitsu | 74.7 seconds [5] | 1.90M images/sec | 929.5 images/sec | ABCI
    ImageNet:
    • (Originally) A WordNet-like annotated set of images
    • (Narrow context) A classification subset and task for the 2012 and 2015 competitions
    • (Narrower context) The de facto DL performance benchmark
  11. 1. Naively Read from NFS

    Pros:
    ❖ Easy
    ❖ No porting
    ❖ Automatically cached
    Cons:
    ❖ Not that fast
      ➢ >10^2 ms on read
    ❖ NFS can easily be overloaded
    ❖ Not sustainable
      ➢ Capacity limited
  12. 2. Copy to Local in Advance and Read Locally

    Pros:
    ❖ Fastest and easy
      ➢ 10^1~10^-1 ms on read, depending on local media
    ❖ No porting
    ❖ Automatically cached
    ❖ I/O load distributed
    Cons:
    ❖ Needs data distribution before computation
    ❖ Local disk space is limited
    ❖ Not sustainable
  13. 3. Super-scalable and Fast Storage

    Pros:
    ❖ Easy and fast
    ❖ No porting
    ❖ Automatically cached
    Cons:
    ❖ No such storage exists
    ❖ Not a poor man's solution
  14. 4. Local HDFS

    Inspired by the MapReduce I/O style.
    Pros:
    ❖ Fast thanks to direct local reads
    Cons:
    ❖ It's very difficult.
  15. Why is Local HDFS Not Practical?

    ❖ The amount of data scattered to each node is not equal in HDFS
    ❖ NCCL, the de facto key library for fast all-reduce, assumes equal data sizes across GPUs
    ❖ Deploying kubelet and an HDFS DataNode on the same host is not practical either
    ❖ It is difficult to allocate a pod/GPU where the data is; this requires tight coupling of the scheduler and data distribution
  16. 5. Remote HDFS

    Pros:
    ❖ (Maybe) sustainable
    ❖ Almost infinite capacity
    ❖ Enough throughput
    ❖ A poor man's solution
    Cons:
    ❖ Too many small files
    ❖ High latency
      ➢ 10^-1~10^3 seconds
    ❖ Needs application porting
    ❖ ...and more!
  17. Mitigating Application Porting: ChainerIO

    ❖ An API abstraction library that lets the same code run against the local filesystem and HDFS
    ❖ Standing on the shoulders of PyArrow's HDFS client

    $ sudo apt-get install openjdk-8-jdk libhdfs0
    $ pip install --user chainerio
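
    [Illustrative sketch, not part of the original deck] How such an abstraction is typically used: the same read code takes either a local path or an hdfs:// URI. The exact call shown here (chainerio.open) is an assumption based on the "same code for local filesystem & HDFS" claim; check the library documentation for the current API.

        import chainerio

        def read_sample(path):
            # `path` may be a local path such as "data/train/000001.jpg"
            # or an HDFS URI such as "hdfs:///user/foo/train/000001.jpg";
            # the application code stays identical.
            with chainerio.open(path, 'rb') as f:
                return f.read()
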
  18. Mitigating High Latency 0/3: Prefetching

    • Chainer's MultithreadIterator and MultiprocessIterator enable prefetching
      – The iterators know SGD's random choice of batches in advance
      – A thread/process pool prefetches batches in a multiplexed way
    [Diagram] Without prefetching, the main process waits on storage before each GPU step; with prefetching, a worker pool overlaps storage reads with GPU computation.
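
    [Illustrative sketch, not part of the original deck] Switching a training script to a prefetching iterator; `train_dataset`, the batch size, and the worker counts are placeholder values.

        from chainer import iterators

        # Each of the n_threads workers fetches samples for upcoming batches in the
        # background, so storage latency overlaps with GPU computation.
        train_iter = iterators.MultithreadIterator(
            train_dataset, batch_size=32, repeat=True, shuffle=True, n_threads=8)

        # MultiprocessIterator is the process-pool variant (avoids the GIL):
        # train_iter = iterators.MultiprocessIterator(
        #     train_dataset, batch_size=32, n_processes=8)
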
  19. Mitigating High Latency 1/3: ZIP

    • Put all images into one or a few large ZIP files
      – Bonus: the small-file problem is resolved
      – Cons: making the ZIP is hard
    • Without ZIP: the number of open calls is proportional to the number of images
    • With ZIP: just a few open calls
    [Diagram: Client, NameNode, DataNode, krb5]
    ➢ Reading the first byte of a file requires 2 roundtrips in the Hadoop protocol,
    ➢ … which involve 2 TLS handshakes in a secure cluster,
    ➢ … which involve 12 IP packet roundtrips in total,
    ➢ plus 2x2 roundtrips for the Kerberos ticket check
    ➢ But this may or may not be the culprit
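
    [Illustrative sketch, not part of the original deck] Reading images out of one large ZIP with Python's standard zipfile module: the archive is opened once, and each sample is then served from it without a per-file open on the storage side. The archive name is a placeholder, and the sketch reads a local file; combined with an abstraction like ChainerIO the same archive could live on HDFS.

        import zipfile

        zf = zipfile.ZipFile('train_images.zip')   # one open call for the whole dataset
        names = zf.namelist()                      # index of all member images

        def load_image_bytes(i):
            # Reads one member from the already-open archive; no extra open() per image.
            return zf.read(names[i])
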
  20. Mitigating High Latency 2/3: Preload/Cache

    Chainer's scatter_dataset assigns data to each GPU process, and ChainerIO's FileCache caches the assigned data on local disk.
    Pros:
    ❖ No cache cap by RAM size
    Cons:
    ❖ The first epoch is still slow
      ➢ Preload
    ❖ Application porting is still required
    [Diagram: a local disk cache per GPU process]
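
    [Illustrative sketch, not part of the original deck] The scatter-then-cache pattern. chainermn.scatter_dataset is a real ChainerMN call; CachedDataset below is a hypothetical wrapper standing in for ChainerIO's FileCache, writing fetched samples to local disk. `full_dataset` and the cache directory are placeholders.

        import os
        import pickle
        import chainermn
        from chainer.dataset import DatasetMixin

        class CachedDataset(DatasetMixin):
            def __init__(self, base, cache_dir='/tmp/dataset-cache'):
                self.base = base
                self.cache_dir = cache_dir
                os.makedirs(cache_dir, exist_ok=True)

            def __len__(self):
                return len(self.base)

            def get_example(self, i):
                path = os.path.join(self.cache_dir, '%d.pkl' % i)
                if os.path.exists(path):          # hit: read from local disk
                    with open(path, 'rb') as f:
                        return pickle.load(f)
                sample = self.base[i]             # miss: fetch from remote storage
                with open(path, 'wb') as f:
                    pickle.dump(sample, f)
                return sample

        comm = chainermn.create_communicator('pure_nccl')
        # Each GPU process receives (and later caches) only its shard of the dataset.
        local_subset = chainermn.scatter_dataset(full_dataset, comm, shuffle=True)
        train_data = CachedDataset(local_subset)
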
  21. Typical Workload in MN-2

    Job A: ImageNet, 32~256 GPUs (nodes), training time 1 day ~ 30 min, 90 epochs, preload ~hours
    Job B: OID (detection), 32~512 GPUs (nodes), training time days ~ hours, 10 epochs, preload ~hours
    Job C: OID (full set), 32~256 GPUs (nodes), training time days ~ hours, 50~100 epochs, preload ~hours
  22. Mitigating High Latency 3/3: More Ideas Coming

    Problems:
    • Preloading overhead
    • Application portability on non-POSIX implementations
    • Python-JVM data copying
    • multiprocessing fork vs. JVM
    • Huge datasets that don't fit local storage
    Ideas:
    • An original file format faster than ZIP
    • Ozone & a flash-based cluster
    • Distributed caching
    • And more
  23. Summary

    ❖ Speed is crucial in machine learning R&D
    ❖ The DL workload is random reads with a low-latency requirement
    ❖ Sustainability and performance (esp. latency) are in a trade-off
    ❖ HDFS strikes a better balance in that trade-off
    ❖ Workarounds and mitigations improve the trade-off balance
      ➢ Abstraction library, caching, and ZIP