operational tools ❖ Balance between newer architecture and historical reliability ❖ Operational ease for PFN ➢ User-land system ➢ Only part-time volunteers; no full-time team
Pros ❖ Fast and easy ➢ roughly 0.1–10 ms per read, depending on the local media ❖ No porting needed ❖ Automatically cached ❖ I/O load distributed
Cons ❖ Needs data distribution before computation ❖ Local disk space is limited ❖ Not sustainable
❖ The amount of scattered data on each node is not equal in HDFS ❖ NCCL, the de facto key library for fast allreduce, assumes equal data size among GPUs ❖ Deploying kubelet and an HDFS DataNode on the same host is not practical either ❖ Difficult to allocate a pod/GPU where the data is; it requires tight coupling of the scheduler and data distribution
❖ Same code for local filesystem & HDFS ❖ Standing on the shoulders of PyArrow’s HDFS client
$ sudo apt-get install openjdk-8-jdk libhdfs0
$ pip install --user chainerio
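The abstraction makes switching backends a matter of changing the URI. A minimal sketch, assuming chainerio.open dispatches on the path scheme as described in the ChainerIO README; the two paths below are hypothetical examples.

import chainerio

# The same read code works for both backends; only the URI differs.
# "data/sample.jpg" and the hdfs:// path are hypothetical examples.
for path in ["data/sample.jpg", "hdfs:///user/me/data/sample.jpg"]:
    with chainerio.open(path, "rb") as f:
        payload = f.read()
    print(path, len(payload), "bytes")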
enables prefetching – Iterators know SGD’s random choice of batches in advance – A thread/process pool prefetches batches in a multiplexed way (a sketch follows below)
[Diagram: timeline of main thread, pool, storage, and GPU, with vs. without prefetching]
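One way such a read-ahead pool can be wired up, as a stand-alone sketch: load_batch and the read-ahead depth n_prefetch are illustrative assumptions, not ChainerIO’s actual implementation.

from collections import deque
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def load_batch(dataset, indices):
    # Stand-in for the real I/O: read every sample of one batch from storage.
    return [dataset[i] for i in indices]

def prefetching_iterator(dataset, batch_size, n_prefetch=4, seed=0):
    # SGD's random order is drawn up front, so upcoming batches are already known.
    order = np.random.default_rng(seed).permutation(len(dataset))
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    with ThreadPoolExecutor(max_workers=n_prefetch) as pool:
        window = deque()
        for indices in batches:
            window.append(pool.submit(load_batch, dataset, indices))
            if len(window) > n_prefetch:   # keep a bounded read-ahead window
                yield window.popleft().result()
        while window:
            yield window.popleft().result()

The training loop then consumes batches from prefetching_iterator(...) while the pool reads the next ones ahead of the GPU.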
one or a few large ZIP files – bonus: the small-file problem is resolved – cons: making the ZIP is hard • Without ZIP – number of open() calls ∝ number of images • With ZIP – just a few open() calls (see the sketch after this slide)
[Diagram: Client, NameNode (NN), DataNode (DN), krb5]
➢ To read the first byte of a file, 2 roundtrips of the Hadoop protocol are required,
➢ … which involve 2 TLS handshakes in a secure cluster,
➢ … which amount to 12 IP packet roundtrips in total,
➢ plus 2×2 roundtrips for the Kerberos ticket check
➢ But this may or may not be the culprit
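A stdlib-only illustration of why packing the dataset into a ZIP cuts the number of open() calls; dataset.zip and its member names are hypothetical, and ChainerIO exposes a comparable container abstraction for archives.

import zipfile

# One open() on the archive instead of one open() per image.
# "dataset.zip" and its contents are hypothetical.
with zipfile.ZipFile("dataset.zip") as archive:
    names = archive.namelist()      # index of all images, read once
    for name in names:
        data = archive.read(name)   # served from the already-open archive
        # decode and feed the sample to training here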
each GPU process, and ChainerIO’s FileCache caches the assigned data on local disk.
Pros ❖ No cache cap by RAM size
Cons ❖ First epoch is still slow ➢ mitigated by preloading ❖ Application porting is still required
[Diagram: a local cache on each node]
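A hypothetical wrapper (not ChainerIO’s actual FileCache API) sketching the idea: cold reads go to the remote store, warm reads hit local disk, so the cache is bounded by disk size rather than RAM and only the first epoch pays the remote latency.

import os
import pickle

class DiskCachedDataset:
    """Hypothetical wrapper: cache remote samples on local disk after first use."""

    def __init__(self, dataset, cache_dir="/tmp/dataset_cache"):
        self.dataset = dataset        # the slow (e.g. HDFS-backed) dataset
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        path = os.path.join(self.cache_dir, f"{i}.pkl")
        if os.path.exists(path):      # warm: local-disk hit
            with open(path, "rb") as f:
                return pickle.load(f)
        sample = self.dataset[i]      # cold: remote read (first epoch)
        with open(path, "wb") as f:
            pickle.dump(sample, f)
        return sample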
overhead • Application portability on non-POSIX implementations • Python–JVM data copying • multiprocessing fork vs. the JVM • Huge datasets that don’t fit local storage
Ideas • An original file format faster than ZIP • Ozone & a flash-based cluster • Distributed caching • And more
❖ The DL workload is random reads with a low-latency requirement ❖ Sustainability and performance (esp. latency) are in a trade-off ❖ HDFS strikes a better balance in that trade-off ❖ Workarounds and mitigations improve the trade-off balance ➢ abstraction library, caching, and ZIP