Distributed Deep Learning with Chainer and Hadoop

UENISHI Kota
December 04, 2019

Transcript

  1. 2019/12/4 Hadoop SCR #27 Distributed Deep Learning with Chainer and

    Hadoop (How Fast Can Data be Delivered to DNNs?) Kota UENISHI Preferred Networks, Inc. 1
  2. Preferred Networks, Inc. Since 2014 Mission - Make the Real

    World Computable - Rapid Realization of Cutting-Edge Technologies - Robot for Everyone 2
  3. Our Storage System History

    2017 (MN-1): NFS with 3-5 storage systems running, ~200TB
    2018 (MN-1): NFS with 2 storage servers, ~400TB; HDFS with 5 DataNodes, ~500TB
    2019 (MN-2): NFS with 2 HDD-based and 2 NVMe-based storage servers, ~600TB; HDFS with 20 DataNodes, ~3PB
    Disclaimer: numbers are very rough
  4. Why Hadoop?

    ❖ Ecosystem
      ➢ Troubleshooting by Googling
      ➢ Various operational tools
    ❖ Balance between a newer architecture and historical reliability
    ❖ Operational ease for PFN
      ➢ User-land system
      ➢ Only part-time volunteers; no full-time team
  5. Deep Learning

    Diagram: Forward produces a prediction [fish, dog, …] that is compared against the ground truth [cat, dog, …] to compute the loss [1.0, 0.0, …]; Backward computes the parameter gradients (grad), and Optimize updates the parameters (update param).
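    The Forward/Backward/Optimize loop above maps directly onto Chainer's API. A minimal sketch; the toy linear model, batch shapes, and learning rate are illustrative, not from the talk:

    import numpy as np
    import chainer
    import chainer.links as L

    # Toy classifier standing in for a real DNN (illustrative only).
    model = L.Classifier(L.Linear(None, 10))
    optimizer = chainer.optimizers.SGD(lr=0.01)
    optimizer.setup(model)

    x = np.random.rand(32, 784).astype(np.float32)            # one minibatch of inputs
    t = np.random.randint(0, 10, size=32).astype(np.int32)    # ground-truth labels

    loss = model(x, t)   # Forward: predict and compare against the ground truth (loss)
    model.cleargrads()   # reset gradients from the previous iteration
    loss.backward()      # Backward: compute parameter gradients (grad)
    optimizer.update()   # Optimize: update the parameters with the gradients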
  6. SGD with Minibatch

    Repeat: Shuffle, then draw Minibatches
    • MNIST: 60k images, typical batch size 10~100
    • CIFAR10: 60k images, typical batch size 64
    • ImageNet-1k: 1580k images, typical batch size 32
    • Open Images Dataset (OID): 8M images, typical batch size ???
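    A minimal sketch of the Shuffle / Minibatch / Repeat loop on the slide; the dataset size and batch size are illustrative:

    import numpy as np

    dataset = np.arange(60000)      # stand-in for 60k images (e.g. MNIST indices)
    batch_size = 100

    for epoch in range(10):
        order = np.random.permutation(len(dataset))      # Shuffle once per epoch
        for i in range(0, len(order), batch_size):
            batch = dataset[order[i:i + batch_size]]     # draw one minibatch
            # Forward / Backward / Optimize on `batch` goes here (Repeat)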
  7. Distributed Deep Learning

    Diagram: each worker runs Forward, Backward, and Optimize on its own minibatch; gradients are combined across workers with All-Reduce using NCCL.
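    In ChainerMN this pattern is expressed by wrapping the optimizer with a multi-node optimizer. A minimal sketch, assuming one MPI process per GPU; the toy model and learning rate are illustrative:

    import chainer
    import chainer.links as L
    import chainermn

    # One MPI process per GPU; gradients are all-reduced with NCCL.
    comm = chainermn.create_communicator('pure_nccl')
    device = comm.intra_rank                      # local GPU id for this process

    model = L.Classifier(L.Linear(None, 10))      # toy model, illustrative only
    chainer.cuda.get_device_from_id(device).use()
    model.to_gpu()

    # Wrapping the optimizer makes it all-reduce gradients across processes
    # before the Optimize step updates the parameters.
    optimizer = chainermn.create_multi_node_optimizer(
        chainer.optimizers.MomentumSGD(lr=0.1), comm)
    optimizer.setup(model)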
  8. Our Clusters

    Preferred Networks builds MN-2, a state-of-the-art supercomputer powered by NVIDIA GPUs.
    • MN-1a: NVIDIA P100 (PCIe, 12GB/16GB), 1024 GPUs, 2048 CPUs, InfiniBand FDR x2 (128Gbps) interconnect, 1GbE network, ~800GB local storage, 19.1 PFLOPS catalog performance
    • MN-1b: NVIDIA V100 (PCIe, 32GB), 512 GPUs, 2304 CPUs, InfiniBand EDR x2 (200Gbps) interconnect, 10GbE network, ~3.2TB local storage, 57.3 PFLOPS catalog performance
    • MN-2: NVIDIA V100 (SXM2, 32GB/16GB), 1024 GPUs, 5760 CPUs, RoCEv2 interconnect, 100GbE x4 network, ~3.2TB local storage, 128 PFLOPS catalog performance
  9. MN-1 © NTT Communications

  10. Why Do We Build On-prem Clusters?

    A: A faster cycle of training and testing is crucial to the speed of our R&D.
  11. How Fast is ImageNet Delivered to Accelerators?

    • Facebook: 1 hour [1]; 39.5k images/sec total; 154.3 images/sec per GPU/TPU; internal cluster
    • PFN: 15 minutes [2]; 158k images/sec total; 154.3 images/sec per GPU/TPU; MN-1a
    • SONY: 224 seconds [3]; 635k images/sec total; 291.7 images/sec per GPU/TPU; ABCI
    • Google: 88 seconds [4]; 1.63M images/sec total; 1578.0 images/sec per GPU/TPU; internal TPU Pod
    • Fujitsu: 74.7 seconds [5]; 1.90M images/sec total; 929.5 images/sec per GPU/TPU; ABCI
    ImageNet:
    • (Originally) A WordNet-like annotated set of images
    • (Narrow context) A classification subset and task for the 2012 and 2015 competitions
    • (Narrower context) The de facto DL performance benchmark
  12. How Fast Can You Deliver?

    Diagram: Dataset in Disks → Node Memory → GPU.
  13. 1. Naively Read from NFS

    Pros
    ❖ Easy
    ❖ No porting
    ❖ Automatically cached
    Cons
    ❖ Not that fast
      ➢ >10^2 ms per read
    ❖ NFS can easily be overloaded
    ❖ Not sustainable
      ➢ Capacity is limited
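    A minimal sketch of this naive approach, assuming a Chainer DatasetMixin whose every access opens a file on an NFS mount; the mount path, file list, and class name are illustrative:

    import os
    import numpy as np
    from PIL import Image
    import chainer

    class NFSImageDataset(chainer.dataset.DatasetMixin):
        """Reads each image directly from a shared NFS mount on every access."""

        def __init__(self, root='/mnt/nfs/imagenet', paths=()):
            self.root = root
            self.paths = list(paths)      # relative image paths under the mount

        def __len__(self):
            return len(self.paths)

        def get_example(self, i):
            # One open()/read() against NFS per sample: easy and needs no porting,
            # but each read can take >10^2 ms and NFS is easily overloaded.
            with Image.open(os.path.join(self.root, self.paths[i])) as img:
                return np.asarray(img.convert('RGB'), dtype=np.float32)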
  14. 2. Copy to Local in Advance and Read Locally

    Pros
    ❖ Fastest and easy
      ➢ 10^1~10^-1 ms per read depending on local media
    ❖ No porting
    ❖ Automatically cached
    ❖ I/O load distributed
    Cons
    ❖ Needs data distribution before computation
    ❖ Local disk space is limited
    ❖ Not sustainable
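    A minimal sketch of this copy-to-local staging step, with illustrative paths; training afterwards only touches the local copy:

    import os
    import shutil

    SRC = '/mnt/nfs/imagenet'      # shared storage (illustrative path)
    DST = '/scratch/imagenet'      # local NVMe/SSD (illustrative path)

    # Data distribution step that has to finish before computation starts.
    if not os.path.exists(DST):
        shutil.copytree(SRC, DST)

    # Training then reads only the local copy, e.g.:
    # dataset = NFSImageDataset(root=DST, paths=paths)   # ~10^1-10^-1 ms per read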
  15. 3. Super-scalable and Fast Storage

    Pros
    ❖ Easy and fast
    ❖ No porting
    ❖ Automatically cached
    Cons
    ❖ No such storage
    ❖ Not a poor-man's solution
  16. 4. Local HDFS

    Inspired by MapReduce-style I/O
    Pros
    ❖ Fast thanks to direct local reads
    Cons
    ❖ It's very difficult (see the next slide)
  17. Why Local HDFS is not Practical

    ❖ The amount of data scattered to each node is not equal in HDFS
    ❖ NCCL, the de facto key library for fast allreduce, assumes equal data sizes across GPUs
    ❖ Deploying a kubelet and an HDFS DataNode on the same host is not practical either
    ❖ It is difficult to allocate pods/GPUs where the data is; this requires tight coupling of the scheduler and data distribution
  18. 5. Remote HDFS

    Pros
    ❖ (Maybe) sustainable
    ❖ Almost infinite capacity
    ❖ Enough throughput
    ❖ A poor-man's solution
    Cons
    ❖ Too many small files
    ❖ High latency
      ➢ 10^-1~10^3 seconds
    ❖ Needs application porting
    ❖ ...and more!
  19. Mitigating Application Porting: ChainerIO

    ❖ An API abstraction library that lets the same code run against the local filesystem and HDFS
    ❖ Standing on the shoulders of PyArrow's HDFS client

    $ sudo apt-get install openjdk-8-jdk libhdfs0
    $ pip install --user chainerio
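    A minimal sketch of the abstraction, assuming chainerio.open dispatches on the path scheme so the same code handles a local path and an hdfs:// URI; the paths themselves are illustrative:

    import chainerio

    # Same call for local and HDFS files; the hdfs:// case goes through
    # PyArrow's HDFS client (libhdfs) underneath.
    with chainerio.open('/mnt/local/dataset/train.zip', 'rb') as f:
        head_local = f.read(16)

    with chainerio.open('hdfs:///user/someone/dataset/train.zip', 'rb') as f:
        head_hdfs = f.read(16)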
  20. Mitigating High Latency 0/3: Prefetching

    • Chainer's MultithreadIterator and MultiprocessIterator enable prefetching
      – Iterators know SGD's random choice of batches in advance
      – A thread/process pool prefetches batches in a multiplexed way
    Diagram: without prefetching, the main process, storage, and GPU work strictly in turn; with prefetching, the pool overlaps storage reads with GPU computation.
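    A minimal sketch of prefetching with MultiprocessIterator; worker processes read the next batches while the GPU works on the current one. The dummy dataset, batch size, and pool sizes are illustrative:

    import numpy as np
    import chainer
    from chainer.iterators import MultiprocessIterator

    # Small dummy dataset standing in for image/label pairs (illustrative only).
    dataset = chainer.datasets.TupleDataset(
        np.random.rand(1000, 3072).astype(np.float32),
        np.random.randint(0, 10, size=1000).astype(np.int32))

    # The iterator knows which indices the next batches will use, so its
    # process pool can fetch them ahead of time (multiplexed prefetching).
    it = MultiprocessIterator(dataset, batch_size=32,
                              n_processes=8, n_prefetch=4, shuffle=True)

    batch = it.next()      # this batch was already prefetched by the pool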
  21. Mitigating High Latency 1/3: ZIP

    • Put all images into one or a few large ZIP files
      – Bonus: the small-file problem is resolved
      – Cons: building the ZIP is hard
    • Without ZIP: the number of open calls is proportional to the number of images
    • With ZIP: just a few open calls
    ➢ To read the first byte of a file, 2 round trips of the Hadoop protocol (client, NameNode, DataNode) are required,
    ➢ … which involve 2 TLS handshakes in a secure cluster,
    ➢ … which involve 12 IP packet round trips in total,
    ➢ plus 2x2 round trips for the Kerberos (krb5) ticket check
    ➢ This may or may not be the culprit
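    A minimal sketch of the ZIP approach: the archive is opened once, and each image is then read out of it without another HDFS/NFS open per file. The class name is illustrative:

    import io
    import zipfile
    import numpy as np
    from PIL import Image
    import chainer

    class ZipImageDataset(chainer.dataset.DatasetMixin):
        """Serves images out of one large ZIP archive instead of many small files."""

        def __init__(self, zip_path):
            self.zf = zipfile.ZipFile(zip_path)    # just a few open calls in total
            self.names = [n for n in self.zf.namelist() if not n.endswith('/')]

        def __len__(self):
            return len(self.names)

        def get_example(self, i):
            # Reads bytes from inside the already-open archive: no per-image
            # open, hence no extra NameNode/DataNode/Kerberos round trips.
            data = self.zf.read(self.names[i])
            with Image.open(io.BytesIO(data)) as img:
                return np.asarray(img.convert('RGB'), dtype=np.float32)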
  22. Mitigating High Latency 2/3: Preload/Cache

    ChainerMN's scatter_dataset assigns a shard of the dataset to each GPU process, and ChainerIO's FileCache caches the assigned data on local disk.
    Pros
    ❖ The cache is not capped by RAM size
    Cons
    ❖ The first epoch is still slow
      ➢ Preload
    ❖ Application porting is still required
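    A minimal sketch combining ChainerMN's scatter_dataset with a local-disk cache. The chainerio.cache import path and the FileCache get/put interface shown here are assumptions, and full_dataset stands for any dataset object built beforehand:

    import pickle
    import chainer
    import chainermn
    from chainerio.cache import FileCache    # assumed import path

    comm = chainermn.create_communicator('pure_nccl')
    # Each GPU process receives only its shard of the dataset.
    subset = chainermn.scatter_dataset(full_dataset, comm, shuffle=True)

    class CachedDataset(chainer.dataset.DatasetMixin):
        """Caches samples on local disk so epochs after the first avoid HDFS."""

        def __init__(self, base):
            self.base = base
            self.cache = FileCache(len(base))     # on-disk, so not capped by RAM

        def __len__(self):
            return len(self.base)

        def get_example(self, i):
            data = self.cache.get(i)              # assumed: returns None on a miss
            if data is None:
                data = pickle.dumps(self.base[i]) # slow path: remote HDFS read
                self.cache.put(i, data)           # cached locally for later epochs
            return pickle.loads(data)

    train = CachedDataset(subset)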
  23. Typical Workload in MN-2

    • Job A: ImageNet; 32~256 GPUs (nodes); training time 1 day ~ 30 min; 90 epochs; preload takes ~hours
    • Job B: OID (detection); 32~512 GPUs (nodes); training time days ~ hours; 10 epochs; preload takes ~hours
    • Job C: OID (full set); 32~256 GPUs (nodes); training time days ~ hours; 50~100 epochs; preload takes ~hours
  24. Mitigating High Latency 3/3: More Ideas Coming

    Problems
    • Preloading overhead
    • Application portability on non-POSIX implementations
    • Python-JVM data copying
    • multiprocessing fork vs JVM
    • Huge datasets that don't fit local storage
    Ideas
    • An original file format faster than ZIP
    • Ozone and a flash-based cluster
    • Distributed caching
    • And more
  25. Summary

    ❖ Speed is crucial in machine learning R&D
    ❖ The DL workload is random reads with a low-latency requirement
    ❖ Sustainability and performance (especially latency) are in a trade-off
    ❖ HDFS strikes a better balance in that trade-off
    ❖ Workarounds and mitigations improve the balance further
      ➢ Abstraction library, caching, and ZIP
  26. Questions? 26