
Apache Ozone behind Simulation and AI Industries

My talk slides at ApacheCon Asia 2022

UENISHI Kota

July 30, 2022

Transcript

  1. Why Apache Ozone for AI systems?
     • S3 API
     • Data is mostly images, audio & movies
     • Small files problem in HDFS well addressed
     • Deep learning fits well with non-JVM interface:
       • Python (& C++)
       • MPI
       • GPU
     • Power of community from Hadoop
     • See also: A Year with Apache Ozone
  2. Storage-Centric View of Our Workload
     • Writes: data generation by simulation (scientific simulations, scene generation)
     • Reads: model training with the generated data
     (Diagram labels: Storage Cluster, Compute Cluster, Real Data, Smarter Models)
  3. High-res 3D model scan & scene generation
     (Figure panels: actual photo, high-res models generated by 3D scan, model rotation (low-res), mesh of the 3D models)
     https://www.preferred.jp/en/news/pr20210310/
  4. Training data generated by computer graphics
     • A huge number of training images with huge variation, generated by placing the 3D models
  5. Workflow of Scene Generation
     Typical cycle of scene generation & model training:
     • New items are 3D-scanned & added to the database
     • Generate millions of scene images within a few days
     • Train models for another few days with those data
     • Check the precision
     (Diagram: 3D models scanned, scene images generated, train the DL model, check the quality)
  6. If we store 1 image file / scene…
     Cons of 1 image file per scene:
     • Speed: Ozone Manager is the bottleneck on putting & listing
     • Quantity: close to a billion files
     • Portability: checksum maintenance made harder
     Pros:
     • Performance: small-file optimization in DataNodes
  7. Small Files Problem with a billion keys: ListKeys
     • Listing keys consumed 78% of the OzoneManager's fully saturated CPU cores
     (Graph: OM CPU usage by listKeys)
  8. Workaround for ListKeys: rate limiting
     • Guava's RateLimiter was easy to use
     • ~800 listKeys/sec => capped at 256/sec (sketch of the idea below)
     • Commit: a2654d2e in github.com/pfnet/ozone
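     The actual change is Java-side in OzoneManager using Guava's RateLimiter (the commit above); purely as an illustration of the idea, a minimal Python rate limiter handing out 256 permits/sec could look like this:

         import threading
         import time

         class SimpleRateLimiter:
             # Illustration only, not the actual OzoneManager change:
             # hand out permits at a fixed rate by reserving the next
             # free time slot for each caller.
             def __init__(self, permits_per_sec):
                 self.interval = 1.0 / permits_per_sec
                 self.next_free = time.monotonic()
                 self.lock = threading.Lock()

             def acquire(self):
                 with self.lock:
                     now = time.monotonic()
                     wait = self.next_free - now
                     self.next_free = max(self.next_free, now) + self.interval
                 if wait > 0:
                     time.sleep(wait)  # sleep outside the lock so other callers can queue

         limiter = SimpleRateLimiter(256)  # cap listKeys at 256 requests/sec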
  9. Small Files Problem, next level: Put Keys
     Observed peak: ~1k puts/sec
     Possible causes:
     • "Giant" write lock on buckets in OzoneManager
     • Slow data puts in DataNodes (they have HDDs)
     • Per-connection authentication (HDDS-4440)
     Our workarounds:
     • Reduce the number of files in apps (e.g. keep a wider image and crop/split it on use)
     • Make one archive per >10k files to reduce overhead (see the sketch below)
     Cost of a ZIP archive:
     • No benefit from the small files optimization in DataNodes
     • Slow random reads during training
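     A minimal sketch of the archiving workaround, assuming the Ozone S3 Gateway endpoint, bucket and key names shown here (all placeholders) and a list of local image paths:

         import io
         import zipfile

         import boto3

         def upload_as_zip(image_paths, bucket, key, endpoint="http://ozone-s3g:9878"):
             buf = io.BytesIO()
             # ZIP_STORED: the images are already compressed, so skip deflate
             with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
                 for path in image_paths:
                     zf.write(path, arcname=path.rsplit("/", 1)[-1])
             buf.seek(0)
             s3 = boto3.client("s3", endpoint_url=endpoint)
             # One PUT against OzoneManager instead of one PUT per image
             s3.upload_fileobj(buf, bucket, key)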
  10. What's Wrong with Archived ZIP Files?
     Pros:
     • A checksum can be maintained with the archive; easy & safe
     • ZIP is a de-facto standard; Python supports it in the standard library
     Cons:
     • Read latency is 4x longer per file than for flat files; very bad for DL training (rough illustration below)
     • No small file optimization
     So you might think of prefetching & pipelining a ZIP…
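     As a rough illustration of the per-file cost (file names are placeholders), a random member read goes through the ZIP central directory and then reads that one member, while a flat file is a single direct read:

         import time
         import zipfile

         with zipfile.ZipFile("scenes.zip") as zf:
             t0 = time.perf_counter()
             data = zf.read("scene_0123456.png")      # seek into the archive + read one member
             print("zip member read:", time.perf_counter() - t0, "s")

         t0 = time.perf_counter()
         with open("scene_0123456.png", "rb") as f:   # flat file: one direct read
             data = f.read()
         print("flat file read:", time.perf_counter() - t0, "s")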
  11. Deep Learning is Essentially Random Read
     for epoch in range(90):
         # Shuffle
         images = shuffle(images)
         for image, label in images:
             # Random read
             img = image.open().read()
             # Train the model on the GPU
             loss = error(label, model(img))
             model.backprop(loss)
     (Figure: shuffle, split, feed, repeat; images sampled and composed from the ImageNet-1000 dataset)
  12. Numbers in Training
     • Data delivery to GPUs requires ~1000 images/sec/GPU
     • Our observations: ~7k images/sec of random reads for flat files, ~9GB/sec of disk I/O
     • Only able to feed 7 GPUs in the naive manner (^^;
     • Workaround: cache data in node-local storage or in a shared NGINX server
  13. Random Read Benchmark
     • Repeated random reads over a 1.28M-image dataset, twice, for:
       • an archived ZIP file on the local FS
       • flat image files on Ozone (see the sketch below)
       • an archived ZIP file on Ozone
     • Reads after the first 1.28M hit the local-storage cache and are fast
     • Latency fluctuation is larger for ZIP
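     A minimal sketch of the flat-files-on-Ozone case, assuming an Ozone S3 Gateway endpoint and the bucket/key layout below (all placeholders):

         import random
         import time

         import boto3

         s3 = boto3.client("s3", endpoint_url="http://ozone-s3g:9878")
         keys = [f"images/{i:08d}.jpg" for i in range(1_280_000)]

         latencies = []
         for key in random.sample(keys, 10_000):      # random reads over the dataset
             t0 = time.perf_counter()
             s3.get_object(Bucket="training-data", Key=key)["Body"].read()
             latencies.append(time.perf_counter() - t0)

         latencies.sort()
         print("p50:", latencies[len(latencies) // 2])
         print("p99:", latencies[int(len(latencies) * 0.99)])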
  14. Storage Cluster Spec
     Ozone cluster:
     • 12 DataNodes (& S3G)
     • 14TB 7200rpm HDDs x36/node
     • 512GB RAM/node
     • 20-core Intel CPU
     • 100GbE x2
     • Java 11
     • Ubuntu Linux 18.04
     Benchmark client:
     • Single process in one node
     • 15GB Gen3 NVMe x4
     • 384GB RAM/node
     • 20-core Intel CPU
     • Python 3.8
     • Ubuntu Linux 18.04
  15. Resolution: Sparse File Cache
     • The 16MB buffer works as an implicit prefetch for random reads, without moving the spindles in the drives
     • The latency penalty is 4x: ~0.1s vs ~0.4s
     • Work in progress (sketch of the idea below)
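     A minimal sketch of the idea, not the actual implementation: cache a remote object in a local sparse file and fetch 16 MiB aligned chunks on a miss; fetch_range(start, end) is an assumed callable doing a ranged GET against the Ozone S3 Gateway:

         import os

         CHUNK = 16 * 1024 * 1024  # 16 MiB cache granularity, as on the slide

         class SparseFileCache:
             def __init__(self, path, size, fetch_range):
                 self.size = size
                 self.fetch_range = fetch_range
                 self.cached = set()                       # chunk indices already local
                 self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
                 os.ftruncate(self.fd, size)               # a hole-only (sparse) file so far

             def read(self, offset, length):
                 end = min(offset + length, self.size)
                 for idx in range(offset // CHUNK, (end - 1) // CHUNK + 1):
                     if idx not in self.cached:            # miss: pull one aligned chunk
                         start = idx * CHUNK
                         data = self.fetch_range(start, min(start + CHUNK, self.size))
                         os.pwrite(self.fd, data, start)   # fill the hole at the same offset
                         self.cached.add(idx)
                 return os.pread(self.fd, end - offset, offset)  # now served locally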
  16. Sparse files made it as fast as local
     • You might not see the difference very well in this graph…
  17. Sparse files made it as fast as local
     The log-scale graph indicates:
     • Latency for the first ~5% is at the same level as the other Ozone cases
     • After ~5% of the data has been read, the cache warms up faster than the local NVMe drive (+ kernel cache) and the latency is at the same level
  18. Summary
     • We generate & maintain billions of image files for AI training
     • The small files problem resurfaced in the form of:
       • slow listKeys: worked around by rate limiting (+ FSO)
       • slow PUT: worked around by archiving >10k files as one ZIP
       • slow random reads from an archive file: will be worked around by the sparse file cache
  19. FAQ
     Q. The sparse file technique is not new; e.g. s3-fuse has it.
     A. We chose a Python-native method to avoid FUSE overhead.
     Q. Image size distribution?
     A. See the graph on the right.
     Q. Any other workaround cache?
     A. An NGINX cache server via HTTP.
     Q. Why does the sparse file cache warm up faster than the kernel cache?
     A. The "page" size is larger (4KB vs 16MB).
     Q. Other simulation use cases?
     A. We published a paper in the chemistry area.