• Data is mostly images, audio & movies
• Small-files problem of HDFS is well addressed
• Deep learning fits well with non-JVM interfaces:
  • Python (& C++)
  • MPI
  • GPU
• Power of the community from Hadoop
• See also: A Year with Apache Ozone
model training
• New items are 3D-scanned & added to the database
• Generate millions of scene images within a few days
• Train models for another few days with those data
• Check the precision
[Workflow diagram: 3D models scanned → scene images generated → train the DL model → check the quality]
1 image file/scene
Cons
• Speed: Ozone Manager is the bottleneck on put & list
• Quantity: close to a billion files
• Portability: checksum maintenance made harder
Pros
• Performance: optimization in DataNodes (DNs)
put/sec
Possible causes:
• "Giant" write lock on buckets in Ozone Manager
• Slow data put in DataNodes (they have HDDs)
• Per-connection authentication (HDDS-4440)
Our workaround (sketch after this list):
• Reduce the number of files in apps (e.g. keep a wider image and crop/split it on use)
• Make one archive per >10k files to reduce overhead
Cost of the ZIP archive:
• No benefit from the small-files optimization in DataNodes
• Slow random reads during training
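A minimal sketch of the archiving workaround, assuming files are packed with Python's standard zipfile module and PUT through Ozone's S3 gateway with boto3; the endpoint, bucket name, and batching loop are placeholders, not the actual pipeline.

    import os
    import zipfile
    import boto3  # assumed client; Ozone exposes an S3-compatible gateway

    OZONE_S3_ENDPOINT = "http://ozone-s3g.example.com:9878"  # placeholder endpoint
    BUCKET = "scene-images"                                   # placeholder bucket
    BATCH = 10_000                                            # ~>10k files per archive, as above

    def pack_and_put(image_paths, archive_name):
        # Pack a batch of small image files into one ZIP (uncompressed; images
        # are already compressed) and upload it as a single key.
        with zipfile.ZipFile(archive_name, "w", compression=zipfile.ZIP_STORED) as zf:
            for p in image_paths:
                zf.write(p, arcname=os.path.basename(p))
        s3 = boto3.client("s3", endpoint_url=OZONE_S3_ENDPOINT)
        s3.upload_file(archive_name, BUCKET, os.path.basename(archive_name))

    # One PUT per ~10k images instead of 10k PUTs contending on the OzoneManager lock:
    # for i in range(0, len(all_images), BATCH):
    #     pack_and_put(all_images[i:i + BATCH], f"scenes-{i // BATCH:05d}.zip")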
Pros
• Checksum goes with the archive; easy & safe
• ZIP is the de-facto standard; Python supports it in the standard library
Cons
• Read latency is ~4x longer per file than flat files; very bad for DL training
• No small-files optimization
So you might think of prefetching & pipelining reads from a ZIP (sketch below) ….
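One way such prefetching could look, as an illustration only: a small window of reads from the archive is kept in flight with a thread pool so the trainer overlaps random reads with GPU work; the window size of 8 is arbitrary.

    import zipfile
    from concurrent.futures import ThreadPoolExecutor

    def iter_archive(zip_path, names, prefetch=8):
        # Yield (name, bytes) from a ZIP while keeping `prefetch` reads in flight,
        # so the consumer (the training loop) is not blocked on every random read.
        with zipfile.ZipFile(zip_path) as zf, ThreadPoolExecutor(max_workers=prefetch) as pool:
            window = [pool.submit(zf.read, n) for n in names[:prefetch]]
            for i, name in enumerate(names):
                data = window[i % prefetch].result()
                nxt = i + prefetch
                if nxt < len(names):
                    window[i % prefetch] = pool.submit(zf.read, names[nxt])
                yield name, data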
for (e = 0; e < 90; e++):
    # Shuffle
    images = shuffle(images)
    for (image, label) in images:
        # Random read
        img = image.open().read()
        # Train model in GPU
        loss = error(label, model(img))
        model.backprop(loss)
[Diagram: Shuffle → Split → Feed → Repeat]
Images sampled and composed from the ImageNet-1000 dataset
images/sec/GPU
• Our observations: ~7k images/sec random read for flat files
• ~9 GB/sec at disk IO
• Only able to feed 7 GPUs in the naive manner (^^;
Workaround (sketch below): cache data in node-local storage or in a shared NGINX server
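A rough read-through cache sketch under assumed names (CACHE_DIR, and a fetch callable that reads one key from Ozone); the slide's alternative is the shared NGINX cache server instead.

    import os
    import hashlib

    CACHE_DIR = "/mnt/nvme/cache"  # placeholder node-local path
    os.makedirs(CACHE_DIR, exist_ok=True)

    def cached_read(key, fetch):
        # Return bytes for `key`, reading from the node-local cache when possible
        # and falling back to `fetch(key)` (a read from Ozone) on a miss.
        path = os.path.join(CACHE_DIR, hashlib.sha1(key.encode()).hexdigest())
        try:
            with open(path, "rb") as f:     # hit: local NVMe read
                return f.read()
        except FileNotFoundError:
            data = fetch(key)               # miss: read once from Ozone
            tmp = path + ".tmp"
            with open(tmp, "wb") as f:
                f.write(data)
            os.replace(tmp, path)           # atomic publish; readers never see partial data
            return data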
dataset twice for…
• Archived ZIP file on local FS
• Flat image files on Ozone
• Archived ZIP file on Ozone
• Examples after 1.28M are cached in local storage, so reading them is fast
• Latency fluctuation is larger with ZIP
The graph indicates:
• Latency of the first ~5% is at the same level as the other Ozone cases
• After ~5% of the data has been read, the cache is warmed up faster than the local NVMe drive (+kernel cache) and the latency is at the same level
for AI training
• The small-files problem resurfaced in the form of:
  • Slow listkeys: worked around by rate limiting (+FSO)
  • Slow PUT: worked around by archiving >10k files as one ZIP
  • Slow random read within an archive file: will be worked around by a sparse file cache (sketch below)
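A sketch of the sparse-file cache idea, assuming the 16 MB page size mentioned in the Q&A; the class and parameter names (SparseFileCache, fetch_range) are illustrative, not the actual implementation.

    import os

    PAGE = 16 * 1024 * 1024  # 16 MB cache "page", the size mentioned in the Q&A

    class SparseFileCache:
        # Cache byte ranges of one remote object in a node-local sparse file,
        # fetching each 16 MB page from remote storage only on first access.
        def __init__(self, path, size, fetch_range):
            self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
            os.ftruncate(self.fd, size)      # sparse file: no blocks allocated yet
            self.size = size
            self.fetch_range = fetch_range   # callable(offset, length) -> bytes, e.g. a ranged GET
            self.cached = set()              # indices of pages already materialized

        def read(self, offset, length):
            first, last = offset // PAGE, (offset + length - 1) // PAGE
            for page in range(first, last + 1):
                if page not in self.cached:
                    start = page * PAGE
                    data = self.fetch_range(start, min(PAGE, self.size - start))
                    os.pwrite(self.fd, data, start)  # fill only this page of the sparse file
                    self.cached.add(page)
            return os.pread(self.fd, length, offset)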
Q. … have in s3-fuse.
A. We chose a Python-native method to avoid FUSE overhead.
Q. Image size distribution?
A. See the right graph.
Q. Any other workaround cache?
A. An NGINX cache server via HTTP.
Q. Why does the sparse file cache warm up faster than the kernel cache?
A. The cache "page" size is larger (16 MB vs the kernel's 4 KB).
Q. Other simulation use cases?
A. We published a paper in the chemistry area.