Apache Ozone behind Simulation and AI Industries

My talk slides at ApacheCon Asia 2022


July 30, 2022

Slides by UENISHI Kota

  1. Why Apache Ozone for AI systems?

    • S3 API
    • Data is mostly images, audio & movies
    • Small files problem in HDFS well addressed
    • Deep learning fits well with non-JVM interfaces:
      • Python (& C++)
      • MPI
      • GPU
    • Power of community from Hadoop
    • See also: A Year with Apache Ozone
  2. Storage-Centric View of Our Workload

    Diagram labels: Storage Cluster; Compute Cluster; Real Data; Smarter Models
    • Writes: Data Generation by Simulation
      • Scientific Simulations
      • Scene Generation
    • Reads: Model Training with Generated Data
  3. High-res 3D model scan & scene generation

    High-res models generated by 3D scan
    Figure labels: Actual photo; Models rotation (low-res); Mesh of 3D models
    https://www.preferred.jp/en/news/pr20210310/
  4. Training data generated by computer graphics

    A huge amount of training data with huge variation, generated by placing the 3D models
  5. Workflow of Scene Generation

    Typical cycle of scene generation & model training
    • New items 3D-scanned & added to the database
    • Generate millions of scene images within a few days
    • Train models for another few days with those data
    • Check the precision
    Diagram labels: 3D models Scanned; Scene Images Generated; Train the DL model; Check the Quality
  6. If we store 1 image file / scene…

    Cons for 1 image file/scene
    • Speed: Ozone Manager is the bottleneck on putting & listing
    • Quantity: Close to a billion
    • Portability: Checksum maintenance made harder
    Pros
    • Performance: optimization in DataNodes (DNs)
  7. Small Files Problem w/billion keys: ListKeys

    Listing keys accounted for 78% of the CPU time on the Ozone Manager (OM), whose cores were all fully busy
    Figure label: listKeys
  8. Workaround for ListKeys: rate limiting

    • Guava's RateLimiter was easy to use
    • ~800 listKeys/sec => 256 (Commit: a2654d2e in github.com/pfnet/ozone)
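The rate-limiting idea on this slide can be illustrated with a small token bucket. This is only a sketch in Python of the general technique, not the code in the commit cited on the slide (which uses Guava's RateLimiter in Java). The class name `TokenBucket`, its parameters, and the reading of "256" as a per-second cap are assumptions made for illustration.

```python
import threading
import time


class TokenBucket:
    """Hypothetical token-bucket limiter; an illustration, not the Ozone patch."""

    def __init__(self, rate_per_sec, capacity=None, clock=time.monotonic):
        self.rate = float(rate_per_sec)
        # By default allow a burst of at most one second's worth of requests.
        self.capacity = float(rate_per_sec if capacity is None else capacity)
        self.tokens = self.capacity
        self.clock = clock
        self.last = clock()
        self.lock = threading.Lock()

    def try_acquire(self, permits=1):
        """Return True if the request may proceed now, False otherwise."""
        with self.lock:
            now = self.clock()
            # Refill in proportion to elapsed time, capped at the bucket size.
            refill = (now - self.last) * self.rate
            self.tokens = min(self.capacity, self.tokens + refill)
            self.last = now
            if self.tokens >= permits:
                self.tokens -= permits
                return True
            return False


# Usage: cap a hypothetical listKeys handler at 256 calls per second.
limiter = TokenBucket(256)
if limiter.try_acquire():
    pass  # serve the listKeys request here
```

A non-blocking check like this lets the server reject or delay excess list requests instead of letting them occupy every CPU core. Whether the actual patch blocks or rejects is not stated on the slide.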
  9. Small Files Problem Next level: Put Keys

    Observed peak: ~1k put/sec
    Possible causes:
    • "Giant" write lock on buckets in OzoneManager
    • Slow data put in datanodes (they have HDDs)
    • Per-connection authentication (HDDS-4440)
    Our workaround:
    • Reduce files in apps (e.g. keep a wider image and crop/split it on use)
    • Make an archive per >10k files and reduce overhead
    Cost of ZIP archive:
    • No benefit of small files optimization in DN
    • Slow random reads on training
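The archiving workaround on this slide can be sketched with Python's standard `zipfile` module. The helper names (`pack_archives`, `read_member`), the use of SHA-256 for the per-archive checksum, and the choice of `ZIP_STORED` (no recompression, on the assumption that images are already compressed) are all illustrative assumptions; the slides do not show the actual packing code. Uploading each archive to Ozone is left out.

```python
import hashlib
import io
import zipfile


def _build(batch):
    """Build one in-memory ZIP from (name, data) pairs; return (bytes, sha256)."""
    buf = io.BytesIO()
    # Assumption: store members uncompressed, since images are already compressed.
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        for name, data in batch:
            zf.writestr(name, data)
    blob = buf.getvalue()
    return blob, hashlib.sha256(blob).hexdigest()


def pack_archives(files, batch_size=10_000):
    """Yield one (archive_bytes, checksum) pair per batch of small files."""
    batch = []
    for item in files:
        batch.append(item)
        if len(batch) == batch_size:
            yield _build(batch)
            batch = []
    if batch:
        yield _build(batch)


def read_member(blob, name):
    """Random read of a single member out of an archive."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return zf.read(name)
```

One PUT per archive replaces thousands of PUTs, and a single checksum covers the whole batch. The price is that each read has to go through the archive rather than hit a flat key.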
  10. What's Wrong with Archived ZIP files?

    Pros
    • Maintain a checksum with the archive; easy & safe
    • ZIP is a de facto standard; Python supports it in the standard library
    Cons
    • Read latency per file is 4x longer than with flat files; very bad for DL training
    • No small file optimization
    So you might think of prefetching & pipelining a ZIP…
  11. Deep Learning is Essentially Random Read

        import random

        # images, model and error are placeholders, as on the slide
        for epoch in range(90):
            # Shuffle: a new random order every epoch
            random.shuffle(images)
            for image, label in images:
                # Random read
                img = image.open().read()
                # Train model on GPU
                loss = error(label, model(img))
                model.backprop(loss)

    Diagram labels: Shuffle; Split; Feed; Repeat
    Figure: Images sampled and composed from ImageNet-1000 dataset
  12. Numbers in Training

    • Data delivery to GPUs requires ~1000 images/sec/GPU
    • Our observations: ~7k images/sec random read for flat files
    • ~9GB/sec at disk IO
    • Only able to feed 7 GPUs in a naive manner (^^;
    Workaround: cache data in node-local storage or in a shared NGINX server
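The "7 GPUs" figure on this slide follows from dividing the observed read rate by the per-GPU demand. A minimal sketch of that arithmetic, using only the slide's round numbers (the function name is hypothetical):

```python
def gpus_fed(storage_images_per_sec, images_per_sec_per_gpu):
    """Number of GPUs that the given read throughput can keep fully fed."""
    return storage_images_per_sec // images_per_sec_per_gpu


print(gpus_fed(7_000, 1_000))  # prints 7
```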
  13. Random Read Benchmark

    • Repeated random read of a dataset of 1.28M examples twice for…
      • Archived ZIP file on local FS
      • Flat image files on Ozone
      • Archived ZIP file on Ozone
    • Examples after the first 1.28M are cached in local storage, and reading them is fast
    • Latency fluctuation is larger in ZIP
  14. Storage Cluster Spec

    Ozone cluster
    • 12 DataNodes (& S3G)
    • 14TB 7200rpm HDDs x36/node
    • 512GB RAM/node
    • 20core Intel CPU
    • 100GbE x2
    • Java 11
    • Ubuntu Linux 18.04
    Benchmark Client
    • Single proc in a node
    • 15GB Gen3 NVMe x4
    • 384GB RAM/node
    • 20core Intel CPU
    • Python 3.8
    • Ubuntu Linux 18.04
  15. Resolution: Sparse File Cache

    The 16MB buffer works as an implicit prefetch for random reads, without moving the spindle in the drives.
    The latency penalty is 4x: ~0.1s vs ~0.4s
    Work in progress
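The slide does not show how the sparse file cache is implemented, so the following is only one plausible reading of the idea: a local file of the same size as the remote object, filled in 16MB-aligned blocks on first access, so that later reads falling in an already-fetched block are served locally. The class `SparseFileCache` and the `fetch_range(offset, length)` callable (standing in for a ranged read against Ozone) are hypothetical names.

```python
import os
import tempfile

BLOCK = 16 * 1024 * 1024  # 16MB, the buffer size mentioned on the slide


class SparseFileCache:
    """Hypothetical block-aligned read-through cache backed by a sparse file."""

    def __init__(self, fetch_range, size, path, block_size=BLOCK):
        self.fetch_range = fetch_range  # assumed: callable(offset, length) -> bytes
        self.size = size
        self.block_size = block_size
        self.cached = set()  # indices of blocks already written locally
        self.f = open(path, "w+b")
        # Reserve the full length; stays sparse on filesystems that support holes.
        self.f.truncate(size)

    def _ensure(self, idx):
        """Fetch one whole block from the remote object if not cached yet."""
        if idx in self.cached:
            return
        off = idx * self.block_size
        length = min(self.block_size, self.size - off)
        self.f.seek(off)
        self.f.write(self.fetch_range(off, length))
        self.cached.add(idx)

    def read(self, offset, length):
        """Read bytes, pulling in every block the requested range touches."""
        if length <= 0 or offset >= self.size:
            return b""
        end = min(offset + length, self.size)
        first = offset // self.block_size
        last = (end - 1) // self.block_size
        for idx in range(first, last + 1):
            self._ensure(idx)
        self.f.seek(offset)
        return self.f.read(end - offset)

    def close(self):
        self.f.close()


# Usage with an in-memory stand-in for the remote object.
remote = bytes(range(256)) * 40
path = os.path.join(tempfile.mkdtemp(), "cache.bin")
cache = SparseFileCache(lambda off, n: remote[off:off + n], len(remote), path,
                        block_size=4096)
first_read = cache.read(5000, 100)
cache.close()
```

Because a whole block is fetched on a miss, neighbouring members of the archive come along for free, which is the implicit prefetch effect the slide describes.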
  16. Sparse files made it as fast as local

    • You might not see the difference very well…
  17. Sparse files made it as fast as local

    The log scale graph indicates:
    • Latency of the first ~5% is at the same level as the other Ozone cases
    • After ~5% of the data is read, the cache warms up faster than the local NVMe drive (+kernel cache) and the latency is at the same level
  18. Summary

    • We generate & maintain billions of image files for AI training
    • Small files problem resurrected in the form of:
      • Slow listKeys: workaround by rate limiting (+FSO)
      • Slow PUT: workaround by archiving >10k files as one ZIP
      • Slow random read for an archive file
        • Will be worked around by sparse file cache
  19. FAQ

    Q. Sparse file technique is not new; like we have in s3-fuse.
    A. We chose a Python-native method to avoid FUSE overhead.
    Q. Image size distribution
    A. See right graph
    Q. Any other workaround cache?
    A. Nginx cache server via HTTP
    Q. Why does the sparse file cache warm up faster than the kernel cache?
    A. The "page" size is larger (4KB vs 16MB)
    Q. Other simulation use cases?
    A. We published a paper in the chemistry area