Slide 1

Distributed Deep Learning with Chainer and Hadoop (How Fast Can Data Be Delivered to DNNs?)
Kota UENISHI, Preferred Networks, Inc.
Hadoop SCR #27, 2019/12/4

Slide 2

Preferred Networks, Inc. (since 2014)
Mission:
- Make the Real World Computable
- Rapid Realization of Cutting-Edge Technologies
- Robot for Everyone

Slide 3

Our Storage System History
2017 (MN-1): NFS with 3-5 storage systems running, ~200TB
2018 (MN-1): NFS with 2 storage servers, ~400TB; HDFS with 5 DataNodes, ~500TB
2019 (MN-2): NFS with 2 HDD-based and 2 NVMe-based storage servers, ~600TB; HDFS with 20 DataNodes, ~3PB
Disclaimer: numbers are very rough.

Slide 4

Why Hadoop?
❖ Ecosystem
  ➢ Troubleshooting by Googling
  ➢ Various operational tools
❖ Balance of newer architecture vs. historical reliability
❖ Operational ease for PFN
  ➢ User-land system
  ➢ Only part-time volunteers; no full-time team

Slide 5

Deep Learning
[Diagram: one training iteration. Forward produces a prediction (e.g. [fish, dog, …]), which is compared against the ground truth (e.g. [cat, dog, …]) to compute a loss (e.g. [1.0, 0.0, …]); Backward computes the parameter gradients; Optimize updates the parameters.]
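A minimal sketch of this Forward/Backward/Optimize iteration in Chainer, assuming a toy linear classifier and random data rather than any of the presenter's actual models:

import numpy as np
import chainer
import chainer.links as L

model = L.Classifier(L.Linear(None, 10))   # Forward returns the softmax cross-entropy loss
optimizer = chainer.optimizers.MomentumSGD(lr=0.01)
optimizer.setup(model)

x = np.random.rand(32, 784).astype(np.float32)           # one minibatch of inputs
t = np.random.randint(0, 10, size=32).astype(np.int32)   # ground-truth labels

loss = model(x, t)    # Forward: prediction compared against the ground truth
model.cleargrads()
loss.backward()       # Backward: compute gradients of the loss w.r.t. parameters
optimizer.update()    # Optimize: update parameters using the gradients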

Slide 6

SGD with Minibatch
Shuffle the dataset, split it into minibatches, and repeat.
● MNIST: 60k images, typical batch size 10~100
● CIFAR10: 60k images, typical batch size 64
● ImageNet-1k: 1580k images, typical batch size 32
● Open Images Dataset (OID): 8M images, typical batch size ???
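A small sketch of this access pattern, assuming a generic indexable dataset; the point is that every epoch issues random reads spread across the whole dataset:

import numpy as np

def minibatches(dataset, batch_size, n_epochs):
    n = len(dataset)
    for _ in range(n_epochs):                     # Repeat
        order = np.random.permutation(n)          # Shuffle once per epoch
        for start in range(0, n, batch_size):     # Minibatch
            indices = order[start:start + batch_size]
            yield [dataset[i] for i in indices]   # random reads over the dataset

for batch in minibatches(list(range(10)), batch_size=4, n_epochs=2):
    print(batch)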

Slide 7

Distributed Deep Learning
[Diagram: each GPU runs Forward and Backward on its own minibatch; gradients are combined with an all-reduce over NCCL; then every worker runs Optimize.]
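A hedged sketch of the usual ChainerMN pattern for this (the communicator, optimizer wrapper, and dataset scattering below follow the standard ChainerMN APIs; the presenter's actual training scripts may differ). Launched with mpiexec, one process per GPU:

import chainer
import chainermn

comm = chainermn.create_communicator('pure_nccl')   # NCCL-based all-reduce
device = comm.intra_rank                             # one GPU per MPI process

model = chainer.links.Classifier(chainer.links.Linear(None, 10))
model.to_gpu(device)

# Wrapping the optimizer makes update() all-reduce the gradients across
# workers before applying them, so every replica stays in sync.
optimizer = chainermn.create_multi_node_optimizer(
    chainer.optimizers.MomentumSGD(lr=0.01), comm)
optimizer.setup(model)

# Each worker receives an equal-sized shard of the dataset; equal shard sizes
# matter because the NCCL all-reduce assumes equal work per GPU (see the
# local-HDFS discussion later in this deck).
# dataset = chainermn.scatter_dataset(dataset, comm, shuffle=True)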

Slide 8

Our Clusters
MN-1a: GPUs: 1024x NVIDIA P100 (PCIe, 12GB/16GB); CPUs: 2048; Interconnect: InfiniBand FDR x2 (128Gbps); Network: 1GbE; Local storage: ~800GB; Catalog perf.: 19.1 PFLOPS
MN-1b: GPUs: 512x NVIDIA V100 (PCIe, 32GB); CPUs: 2304; Interconnect: InfiniBand EDR x2 (200Gbps); Network: 10GbE; Local storage: ~3.2TB; Catalog perf.: 57.3 PFLOPS
MN-2: GPUs: 1024x NVIDIA V100 (SXM2, 32GB/16GB); CPUs: 5760; Interconnect: RoCEv2; Network: 100GbE x4; Local storage: ~3.2TB; Catalog perf.: 128 PFLOPS
Preferred Networks built MN-2, a state-of-the-art supercomputer powered by NVIDIA GPUs.

Slide 9

MN-1
[Photos of the MN-1 cluster, © NTT Communications]

Slide 10

Why Do We Build On-Prem Clusters?
A. A faster cycle of training and testing is crucial to the speed of our R&D.

Slide 11

How Fast Is ImageNet Delivered to Accelerators?
● Facebook: 1 hour [1], 39.5k images/sec total, 154.3 images/sec per GPU, internal cluster
● PFN: 15 minutes [2], 158k images/sec total, 154.3 images/sec per GPU, MN-1a
● SONY: 224 seconds [3], 635k images/sec total, 291.7 images/sec per GPU, ABCI
● Google: 88 seconds [4], 1.63M images/sec total, 1578.0 images/sec per TPU, internal TPU Pod
● Fujitsu: 74.7 seconds [5], 1.90M images/sec total, 929.5 images/sec per GPU, ABCI
ImageNet:
● (Originally) A WordNet-like annotated set of images
● (Narrow context) A classification subset and task for the 2012 and 2015 competitions
● (Narrower context) The de facto DL performance benchmark
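The total-throughput figures above can be sanity-checked against numbers elsewhere in this deck: total throughput is roughly (images per epoch x number of epochs) / training duration. A quick check in Python, assuming ~1.58M images per epoch (as on the minibatch slide) and 90 epochs (as in the MN-2 workload table); the remaining rows follow the same arithmetic:

images_per_epoch = 1_580_000
epochs = 90
for name, seconds in [("PFN", 15 * 60), ("SONY", 224), ("Fujitsu", 74.7)]:
    rate = images_per_epoch * epochs / seconds
    print(f"{name}: {rate / 1000:.0f}k images/sec")
# PFN: 158k, SONY: 635k, Fujitsu: 1904k images/sec, matching the table.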

Slide 12

How Fast Can You Deliver?
[Diagram: the dataset on disks must be delivered into GPU node memory.]

Slide 13

1. Naively Read from NFS
Pros
❖ Easy
❖ No porting
❖ Automatically cached
Cons
❖ Not that fast
  ➢ >10^2 ms on read
❖ NFS can easily be overloaded
❖ Not sustainable
  ➢ Capacity limited

Slide 14

2. Copy to Local in Advance and Read Locally
Pros
❖ Fastest and easy
  ➢ 10^1~10^-1 ms on read depending on local media
❖ No porting
❖ Automatically cached
❖ I/O load distributed
Cons
❖ Needs data distribution before computation
❖ Local disk space is limited
❖ Not sustainable

Slide 15

3. Super-Scalable and Fast Storage
Pros
❖ Easy and fast
❖ No porting
❖ Automatically cached
Cons
❖ No such storage
❖ Not a poor-man's solution

Slide 16

4. Local HDFS
Inspired by MapReduce I/O style
Pros
❖ Fast by direct local read
Cons
❖ It's very difficult.

Slide 17

Why Is Local HDFS Not Practical?
❖ The amount of data scattered to each node is not equal in HDFS
❖ NCCL, the de facto key library for fast all-reduce, assumes equal data sizes across GPUs
❖ Deploying a kubelet and an HDFS DataNode on the same host is not practical either
❖ It is difficult to allocate a pod/GPU where the data is; this requires tight coupling of the scheduler and data distribution

Slide 18

5. Remote HDFS
Pros
❖ (Maybe) sustainable
❖ Almost infinite capacity
❖ Enough throughput
❖ Poor-man's solution
Cons
❖ Too many small files
❖ High latency
  ➢ 10^-1~10^3 seconds
❖ Needs application porting
❖ ...and more!

Slide 19

Mitigating Application Porting: ChainerIO
❖ API abstraction library providing the same code path for the local filesystem and HDFS
❖ Standing on the shoulders of PyArrow's HDFS client

$ sudo apt-get install openjdk-8-jdk libhdfs0
$ pip install --user chainerio
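A hedged sketch of what the abstraction buys you. It assumes ChainerIO exposes an open()-style call that accepts both local paths and hdfs:// URIs (exact names may differ between ChainerIO versions), and the paths below are hypothetical:

import chainerio

def load_sample(path):
    # Identical application code whether the data sits on local disk or HDFS.
    with chainerio.open(path, 'rb') as f:
        return f.read()

local_bytes = load_sample('/mnt/nfs/dataset/sample0.jpg')        # hypothetical local path
hdfs_bytes = load_sample('hdfs:///user/me/dataset/sample0.jpg')  # hypothetical HDFS path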

Slide 20

Mitigating High Latency 0/3: Prefetching
• Chainer's MultithreadIterator and MultiprocessIterator enable prefetching
  – Iterators know SGD's random choice of a batch in advance
  – A thread/process pool prefetches batches in a multiplexed way
[Diagram: without prefetching, the main process waits on storage before each GPU step; with prefetching, the pool overlaps storage reads with GPU computation.]
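A minimal sketch of prefetching with Chainer's built-in iterators, assuming a toy in-memory dataset; with a real remote-storage dataset the worker threads overlap storage reads with GPU computation:

import numpy as np
from chainer.iterators import MultithreadIterator

dataset = [np.random.rand(3, 224, 224).astype(np.float32) for _ in range(256)]

# n_threads workers fetch upcoming batches in the background while the
# current batch is being consumed by the training loop.
it = MultithreadIterator(dataset, batch_size=32, repeat=True, shuffle=True,
                         n_threads=8)

batch = it.next()   # batches are prepared ahead of time by the thread pool
print(len(batch))   # 32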

Slide 21

Mitigating High Latency 1/3: ZIP
• Put all images into one or a few large ZIP files
  – Bonus: the small-file problem is resolved
  – Cons: hard to make the ZIP
• Without ZIP: the number of open calls is proportional to the number of images
• With ZIP: just a few open calls
[Diagram: client, NameNode (NN), DataNode (DN), and krb5 message exchanges on open/read]
➢ To read the first byte of a file, 2 roundtrips of the Hadoop protocol are required,
➢ … which involve 2 TLS handshakes in a secure cluster,
➢ … which involve 12 IP packet roundtrips in total,
➢ plus 2x2 roundtrips for the Kerberos ticket check
➢ But this may or may not be the culprit
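A sketch of the ZIP approach with Python's standard zipfile module (the archive and member names are hypothetical). The archive is opened once, and each image read becomes a seek inside one large file instead of a fresh open against the storage system:

import io
import zipfile

with zipfile.ZipFile('train.zip') as archive:   # one open for the whole dataset
    names = archive.namelist()                  # index of all member images
    for name in names[:32]:                     # e.g. one minibatch worth
        data = archive.read(name)               # bytes of one image
        stream = io.BytesIO(data)               # decode downstream (PIL, cv2, ...)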

Slide 22

Mitigating High Latency 2/3: Preload/Cache
ChainerMN's scatter_dataset assigns data to each GPU process, and ChainerIO's FileCache caches the assigned data on local disk.
Pros
❖ No cache cap by RAM size
Cons
❖ First epoch still slow
  ➢ Preload
❖ Application porting still required
[Diagram: each GPU process keeps its own local-disk cache.]
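A hedged sketch of per-worker caching. It assumes chainermn.scatter_dataset as in ChainerMN, and a FileCache keyed by dataset index with get/put methods, roughly as described for ChainerIO; the exact module path and signature are assumptions and may differ:

import chainermn
from chainerio.cache import FileCache   # assumed import path

class CachedDataset:
    # Wraps an indexable dataset: each sample is fetched from remote storage
    # once (slow first epoch), then served from a local-disk cache afterwards.
    def __init__(self, dataset):
        self.dataset = dataset
        self.cache = FileCache(len(dataset), do_pickle=True)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        sample = self.cache.get(i)
        if sample is None:               # cache miss: read from HDFS
            sample = self.dataset[i]
            self.cache.put(i, sample)
        return sample

# comm = chainermn.create_communicator('pure_nccl')
# shard = chainermn.scatter_dataset(full_dataset, comm, shuffle=True)
# train_data = CachedDataset(shard)   # only this worker's shard is cached locally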

Slide 23

Typical Workload in MN-2
● Job A: dataset ImageNet, 32~256 GPUs (nodes), training time 1 day ~ 30 min, 90 epochs, preload ~hours
● Job B: dataset OID (detection), 32~512 GPUs (nodes), training time days ~ hours, 10 epochs, preload ~hours
● Job C: dataset OID (full set), 32~256 GPUs (nodes), training time days ~ hours, 50~100 epochs, preload ~hours

Slide 24

Mitigating High Latency 3/3: More Ideas Coming
Problems
• Preloading overhead
• Application portability on non-POSIX implementations
• Python-JVM data copying
• multiprocessing fork vs JVM
• Huge datasets that don't fit local storage
Ideas
• An original file format faster than ZIP
• Ozone and a flash-based cluster
• Distributed caching
• And more

Slide 25

Summary
❖ Speed is crucial in machine learning R&D
❖ The DL workload is random reads with a low-latency requirement
❖ Sustainability and performance (esp. latency) are in a trade-off
❖ HDFS strikes a better balance in that trade-off
❖ Workarounds and mitigations improve the balance
  ➢ Abstraction library, caching, and ZIP

Slide 26

Questions?