Slide 1

Apache Ozone behind Simulation and AI Industries
Kota Uenishi, Preferred Networks, Inc.
2022.07.29-31

Slide 2

Why Apache Ozone for AI systems?
• S3 API
• Data is mostly images, audio & movies
• Small files problem in HDFS well addressed
• Deep learning fits well with a non-JVM interface:
  • Python (& C++)
  • MPI
  • GPU
• Power of community from Hadoop
• See also: A Year with Apache Ozone

Slide 3

Storage-Centric View of Our Workload
[Diagram: a Compute Cluster writing to and reading from a Storage Cluster]
● Writes: data generation by simulation (scientific simulations, scene generation)
● Reads: model training with the generated data plus real data, producing smarter models

Slide 4

Writes: Data Generation by Simulation

Slide 5

High-res 3D model scan & scene generation
[Images: an actual photo, high-res models generated by 3D scan, a rotation of the models (low-res), and a mesh of the 3D models]
https://www.preferred.jp/en/news/pr20210310/

Slide 6

Training data generated by computer graphics
A huge amount of training data with wide variation, generated by placing the 3D models in rendered scenes

Slide 7

Workflow of Scene Generation
Typical cycle of scene generation & model training:
● New items are 3D-scanned & added to the database
● Generate millions of scene images within a few days
● Train models for another few days with those data
● Check the precision
[Cycle diagram: 3D models scanned → scene images generated → train the DL model → check the quality]

Slide 8

If we store 1 image file / scene…
Cons of 1 image file per scene:
• Speed: Ozone Manager becomes the bottleneck on putting & listing keys
• Quantity: close to a billion files
• Portability: checksum maintenance becomes harder
Pros:
• Performance: small files optimization in the DNs

Slide 9

Small Files Problem with a Billion Keys: ListKeys
Listing keys overwhelmed the Ozone Manager: listKeys consumed 78% of the OM's fully-loaded CPU cores.

Slide 10

Workaround for ListKeys: Rate Limiting
Guava's RateLimiter was easy to use: ~800 listKeys/sec => 256
(Commit a2654d2e in github.com/pfnet/ozone)

Slide 11

Small Files Problem, Next Level: Put Keys
Observed peak: ~1k puts/sec
Possible causes:
● "Giant" write lock on buckets in OzoneManager
● Slow data puts in the datanodes (they have HDDs)
● Per-connection authentication (HDDS-4440)
Our workarounds:
● Reduce the number of files in the apps (e.g. keep a wider image and crop/split it on use)
● Make one archive per >10k files to reduce the per-key overhead (see the sketch below)
Cost of a ZIP archive:
● No benefit from the small files optimization in the DN
● Slow random reads during training
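
A minimal sketch of the archiving workaround (the S3 gateway endpoint, bucket name, and paths below are placeholders, not production values): pack one batch of generated images into a single ZIP and upload it as one key, so OzoneManager sees one PUT instead of tens of thousands.

import os
import zipfile

import boto3  # talks to Ozone through the S3 Gateway

def archive_and_upload(image_dir, bucket, key,
                       endpoint="http://s3g.example.local:9878"):
    """Pack all files under image_dir into one ZIP and PUT it as a single key."""
    archive_path = key.replace("/", "_")
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for root, _, files in os.walk(image_dir):
            for name in files:
                path = os.path.join(root, name)
                # Store paths relative to image_dir so the archive is self-contained
                zf.write(path, arcname=os.path.relpath(path, image_dir))

    s3 = boto3.client("s3", endpoint_url=endpoint)
    # One PUT instead of >10k PUTs keeps the OzoneManager load low
    s3.upload_file(archive_path, bucket, key)

# Example: archive one generated scene batch as a single object
# archive_and_upload("scenes/batch-0001", "training-data", "batches/batch-0001.zip")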

Slide 12

What's Wrong with Archived ZIP Files?
Pros
• Maintaining a checksum with the archive is easy & safe
• ZIP is a de-facto standard; Python supports it in the standard library
Cons
• Read latency is 4x longer per file than for flat files; very bad for DL training
• No small file optimization
So you might think of prefetching & pipelining a ZIP…
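
To see where the extra latency comes from, one can open the ZIP directly over the S3 API through a seekable wrapper: every member read then turns into several small ranged GETs (central directory, local header, data). A rough sketch, with placeholder endpoint, bucket, and key names:

import io

import boto3

class S3RangedFile(io.RawIOBase):
    """Minimal seekable, read-only view of an S3/Ozone object.

    Every read() becomes a ranged GET, which is why opening one member of
    a remote ZIP costs several round trips."""

    def __init__(self, client, bucket, key):
        self._client, self._bucket, self._key = client, bucket, key
        self._size = client.head_object(Bucket=bucket, Key=key)["ContentLength"]
        self._pos = 0

    def seekable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self._pos = offset
        elif whence == io.SEEK_CUR:
            self._pos += offset
        else:  # io.SEEK_END
            self._pos = self._size + offset
        return self._pos

    def tell(self):
        return self._pos

    def read(self, size=-1):
        if size < 0:
            size = self._size - self._pos
        if size == 0 or self._pos >= self._size:
            return b""
        end = min(self._pos + size, self._size) - 1
        resp = self._client.get_object(Bucket=self._bucket, Key=self._key,
                                       Range=f"bytes={self._pos}-{end}")
        data = resp["Body"].read()
        self._pos += len(data)
        return data

# Reading one image out of a remote ZIP issues several small ranged GETs:
# import zipfile
# s3 = boto3.client("s3", endpoint_url="http://s3g.example.local:9878")
# with zipfile.ZipFile(S3RangedFile(s3, "training-data", "batches/batch-0001.zip")) as zf:
#     img_bytes = zf.read(zf.namelist()[0])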

Slide 13

Reads: Training AI Models with Generated Data

Slide 14

Deep Learning is Essentially Random Read

# Pseudocode of a training loop: every epoch reshuffles and rereads the whole dataset
for epoch in range(90):
    # Shuffle
    images = shuffle(images)
    for (image, label) in images:
        # Random read
        img = image.open().read()
        # Train model in GPU
        loss = error(label, model(img))
        model.backprop(loss)

Shuffle, split, feed, repeat.
Images sampled and composed from the ImageNet-1000 dataset.

Slide 15

Numbers in Training
• Data delivery to GPUs requires ~1000 images/sec/GPU
• Our observation: ~7k images/sec random read for flat files
• ~9GB/sec of disk IO
• Only able to feed 7 GPUs in a naive manner (^^;
Workaround: cache data in node-local storage or in a shared NGINX server (see the sketch below)
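
A minimal sketch of the node-local cache workaround (the endpoint, bucket, and cache directory are placeholders): the first read of a key goes to Ozone over S3, and later epochs are served from local NVMe.

import hashlib
import os

import boto3

CACHE_DIR = "/scratch/ozone-cache"  # node-local NVMe; path is an assumption

s3 = boto3.client("s3", endpoint_url="http://s3g.example.local:9878")

def cached_get(bucket, key):
    """Read-through cache: serve from node-local storage, fall back to Ozone."""
    cache_path = os.path.join(
        CACHE_DIR, hashlib.sha1(f"{bucket}/{key}".encode()).hexdigest())
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return f.read()
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    os.makedirs(CACHE_DIR, exist_ok=True)
    tmp = cache_path + ".tmp"
    with open(tmp, "wb") as f:   # write then rename so concurrent readers
        f.write(body)            # never see a partially written file
    os.replace(tmp, cache_path)
    return body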

Slide 16

Random Read Benchmark
• Repeated random read of a 1.28M-image dataset twice for…
  • an archived ZIP file on the local FS
  • flat image files on Ozone
  • an archived ZIP file on Ozone
• Examples after the first 1.28M reads are cached in local storage, so reading them is fast
• Latency fluctuation is larger for ZIP
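
For reference, a simplified sketch of this kind of benchmark for the flat-files-on-Ozone case (endpoint and bucket are placeholders; the actual runs also covered the two ZIP cases):

import random
import time

import boto3

def random_read_latency(bucket, keys, passes=2,
                        endpoint="http://s3g.example.local:9878"):
    """Read every key in random order `passes` times and record per-read latency."""
    s3 = boto3.client("s3", endpoint_url=endpoint)
    latencies = []
    for _ in range(passes):
        order = random.sample(keys, len(keys))  # a fresh shuffle per pass
        for key in order:
            t0 = time.perf_counter()
            s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            latencies.append(time.perf_counter() - t0)
    return latencies

# e.g. latencies = random_read_latency("training-data", list_of_image_keys)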

Slide 17

Storage Cluster Spec

Ozone cluster
• 12 DataNodes (& S3G)
• 14TB 7200rpm HDDs x36 / node
• 512GB RAM / node
• 20-core Intel CPU
• 100GbE x2
• Java 11
• Ubuntu Linux 18.04

Benchmark client
• Single process in a node
• 15GB Gen3 NVMe x4
• 384GB RAM / node
• 20-core Intel CPU
• Python 3.8
• Ubuntu Linux 18.04

Slide 18

Resolution: Sparse File Cache (work in progress)
The 16MB buffer works as an implicit prefetch for random reads, without moving the spindles in the drives. The latency penalty is 4x: ~0.1s vs ~0.4s.
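
A minimal sketch of the sparse-file-cache idea, not the actual work-in-progress implementation (client, bucket, key, and cache path are placeholders): each 16MB-aligned chunk is fetched once with a ranged GET and written into a local sparse file at the same offset, so later reads of nearby offsets never leave the node.

import os

import boto3

CHUNK = 16 * 1024 * 1024  # 16MB cache unit, as on the slide

class SparseFileCache:
    """Each 16MB-aligned chunk of a remote object is fetched once and
    pwritten into a local sparse file at the same offset; later reads of
    that range stay on local disk."""

    def __init__(self, s3, bucket, key, cache_path):
        self.s3, self.bucket, self.key = s3, bucket, key
        self.size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
        self.fd = os.open(cache_path, os.O_RDWR | os.O_CREAT, 0o644)
        os.ftruncate(self.fd, self.size)  # allocates no blocks: the file is sparse
        self.present = set()              # indices of chunks already fetched

    def _ensure(self, index):
        if index in self.present:
            return
        start = index * CHUNK
        end = min(start + CHUNK, self.size) - 1
        body = self.s3.get_object(Bucket=self.bucket, Key=self.key,
                                  Range=f"bytes={start}-{end}")["Body"].read()
        os.pwrite(self.fd, body, start)   # fill just this hole in the sparse file
        self.present.add(index)

    def read(self, offset, length):
        for index in range(offset // CHUNK, (offset + length - 1) // CHUNK + 1):
            self._ensure(index)
        return os.pread(self.fd, length, offset)

# e.g.
# s3 = boto3.client("s3", endpoint_url="http://s3g.example.local:9878")
# cache = SparseFileCache(s3, "training-data", "batches/batch-0001.zip",
#                         "/scratch/batch-0001.cache")
# data = cache.read(123_456_789, 65_536)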

Slide 19

Sparse files made it as fast as local
• You might not see the difference very well in this graph…

Slide 20

Sparse files made it as fast as local
The log-scale graph indicates:
• The latency of the first ~5% of reads is at the same level as the other Ozone cases
• After ~5% of the data is read, the cache warms up faster than the local NVMe drive (+ kernel cache) case, and the latency settles at the same level

Slide 21

Summary
• We generate & maintain billions of image files for AI training
• The small files problem resurfaced in the form of:
  • Slow listKeys: worked around by rate limiting (+FSO)
  • Slow PUT: worked around by archiving >10k files as one ZIP
  • Slow random reads within an archive file
    • Will be worked around by the sparse file cache

Slide 22

Thanks

Slide 23

FAQ
Q. The sparse file technique is not new; we have it in s3-fuse, for example.
A. We chose a Python-native method to avoid the FUSE overhead.
Q. What is the image size distribution?
A. See the graph on the right.
Q. Any other workaround caches?
A. An NGINX cache server via HTTP.
Q. Why does the sparse file cache warm up faster than the kernel cache?
A. The "page" size is larger (16MB vs the kernel's 4KB).
Q. Other simulation use cases?
A. We published a paper in the chemistry area.