
Apache Ozone behind Simulation and AI Industries

My talk slides at ApacheCon Asia 2022

UENISHI Kota

July 30, 2022

Transcript

  1. Apache Ozone behind Simulation and AI Industries
    Kota Uenishi
    Preferred Networks, Inc.
    2022.07.29-31


  2. Why Apache Ozone for AI systems?

    S3 API
    • Data is mostly images, audio & movies

    The small-files problem of HDFS is well addressed

    Deep learning fits well with a non-JVM interface:
    • Python (& C++)
    • MPI
    • GPU

    Power of the community inherited from Hadoop

    See also: A Year with Apache Ozone
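
    Since Ozone speaks the S3 API, Python training code can read objects
    with any stock S3 client and never touch a JVM. A minimal sketch
    with boto3; the endpoint URL, bucket, and key are hypothetical
    placeholders for an Ozone S3 Gateway:

        import boto3

        # Point a standard S3 client at the Ozone S3 Gateway.
        s3 = boto3.client("s3", endpoint_url="http://s3g.example.com:9878")

        # Read one training image as bytes, straight from Python.
        obj = s3.get_object(Bucket="training-data", Key="scenes/000001.png")
        img_bytes = obj["Body"].read()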


  3. Storage-Centric View of Our Workload
    [Diagram: the compute cluster writes data generated by simulation
     (scientific simulations, scene generation) into the storage
     cluster; model training reads the generated data, plus real data,
     back out to produce smarter models]


  4. Writes: Data Generation by Simulation


  5. High-res 3D model scan & scene generation
    [Images: an actual photo, a low-res rotation of the models, and the
     mesh of the 3D models; high-res models are generated by 3D scan]
    https://www.preferred.jp/en/news/pr20210310/


  6. Training data generated by computer graphics
    A huge number of training images with huge variation, generated by placing the 3D models


  7. Workflow of Scene Generation
    Typical cycle of scene generation & model training:
    ● New items 3D-scanned & added to the database
    ● Generate millions of scene images within a few days
    ● Train models for another few days with that data
    ● Check the precision
    [Cycle diagram: 3D models scanned → scene images generated →
     train the DL model → check the quality]


  8. If we store 1 image file / scene…
    Cons of 1 image file per scene:

    Speed: Ozone Manager is the bottleneck on putting & listing keys

    Quantity: close to a billion keys

    Portability: checksum maintenance becomes harder

    Pros:

    Performance: small-file optimization in DataNodes


  9. Small Files Problem w/ Billion Keys: ListKeys
    [Graph: OM CPU profile dominated by listKeys]
    Listing keys consumed 78% of the fully loaded CPU cores of the OM


  10. Workaround for ListKeys: rate limiting
    Guava's RateLimiter was easy to use:
    ~800 listKeys/sec => capped at 256/sec (commit a2654d2e in github.com/pfnet/ozone)
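
    The actual workaround is a Java patch inside the OzoneManager using
    Guava's RateLimiter. As a language-neutral illustration of the same
    token-bucket idea, here is a minimal sketch in Python (the class and
    handler names are hypothetical, not Ozone code):

        import threading
        import time

        class RateLimiter:
            """Minimal smooth rate limiter: acquire() blocks for a permit."""

            def __init__(self, permits_per_sec):
                self.interval = 1.0 / permits_per_sec
                self.next_free = time.monotonic()
                self.lock = threading.Lock()

            def acquire(self):
                with self.lock:
                    now = time.monotonic()
                    wait = max(0.0, self.next_free - now)
                    self.next_free = max(now, self.next_free) + self.interval
                if wait > 0:
                    time.sleep(wait)

        limiter = RateLimiter(256)   # cap listKeys at 256 requests/sec

        def handle_list_keys(request):
            limiter.acquire()        # throttle before the expensive listing
            ...                      # serve the listKeys request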


  11. Small Files Problem, Next Level: Put Keys
    Observed peak: ~1k puts/sec
    Possible causes:
    ● "Giant" write lock on buckets in OzoneManager
    ● Slow data puts in DataNodes (they have HDDs)
    ● Per-connection authentication (HDDS-4440)
    Our workarounds:
    ● Reduce file counts in apps (e.g. keep a wider image and crop/split it on use)
    ● Bundle >10k files into one archive to cut per-key overhead (see the sketch below)
    Cost of a ZIP archive:
    ● No benefit from the small-file optimization in DataNodes
    ● Slow random reads during training
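
    A minimal sketch of the bundling workaround: pack the generated
    images into one ZIP and issue a single put instead of >10k of them.
    The paths, endpoint, bucket, and key below are hypothetical:

        import zipfile
        from pathlib import Path

        import boto3

        # Pack a batch of small scene images into one archive.
        archive = "scenes-000.zip"
        with zipfile.ZipFile(archive, "w") as zf:
            for path in sorted(Path("generated/scenes-000").glob("*.png")):
                zf.write(path, arcname=path.name)

        # One put to the Ozone S3 Gateway instead of >10k small puts.
        s3 = boto3.client("s3", endpoint_url="http://s3g.example.com:9878")
        s3.upload_file(archive, "training-data", f"archives/{archive}")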


  12. What's Wrong with Archived ZIP Files?
    Pros

    A checksum can be maintained per archive; easy & safe

    ZIP is a de-facto standard; Python supports it in the standard library

    Cons

    Per-file read latency is ~4x that of flat files; very bad for DL training

    No small-file optimization in DataNodes

    So you might think of prefetching & pipelining a ZIP…
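
    One place such prefetching could start: zipfile only needs a
    seekable file object, so a thin wrapper translating seek/read into
    S3 ranged GETs can pull single members out of a remote archive. A
    minimal sketch, not our production reader; the class name, endpoint,
    bucket, and key are hypothetical:

        import zipfile

        import boto3

        class S3File:
            """Minimal seekable, read-only file object over S3 ranged GETs."""

            def __init__(self, s3, bucket, key):
                self.s3, self.bucket, self.key = s3, bucket, key
                self.size = s3.head_object(Bucket=bucket,
                                           Key=key)["ContentLength"]
                self.pos = 0

            def seekable(self):
                return True

            def tell(self):
                return self.pos

            def seek(self, offset, whence=0):
                base = (0, self.pos, self.size)[whence]
                self.pos = base + offset
                return self.pos

            def read(self, n=-1):
                if n < 0 or self.pos + n > self.size:
                    n = self.size - self.pos
                if n <= 0:
                    return b""
                rng = f"bytes={self.pos}-{self.pos + n - 1}"
                data = self.s3.get_object(Bucket=self.bucket, Key=self.key,
                                          Range=rng)["Body"].read()
                self.pos += len(data)
                return data

        s3 = boto3.client("s3", endpoint_url="http://s3g.example.com:9878")
        with zipfile.ZipFile(S3File(s3, "training-data",
                                    "archives/scenes-000.zip")) as zf:
            img = zf.read("000001.png")  # one member via a few ranged GETs

    Each member read still costs extra round trips (central directory,
    local header, then data), which is why prefetching & pipelining look
    attractive.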


  13. Reads: Training AI Models with Generated Data


  14. Deep Learning is Essentially Random Read
    for epoch in range(90):
        # Shuffle
        random.shuffle(images)
        for image, label in images:
            # Random read
            img = image.open().read()
            # Train the model on the GPU
            loss = error(label, model(img))
            model.backprop(loss)
    [Figure: shuffle → split → feed → repeat;
     images sampled and composed from the ImageNet-1000 dataset]


  15. Numbers in Training

    Data delivery to GPUs requires ~1000 images/sec/GPU
    • Our observation: ~7k images/sec random read for flat files
    • ~9GB/sec of disk IO

    Only able to feed 7 GPUs in a naive manner (7k / 1k ≈ 7) (^^;

    Workaround: cache data in node-local storage or in a shared NGINX server


  16. Random Read Benchmark
    • Repeated random reads over a 1.28M-example dataset, twice, for…
      • an archived ZIP file on a local FS
      • flat image files on Ozone
      • an archived ZIP file on Ozone
    • After the first 1.28M reads, examples are cached in local storage,
      so reading them is fast
    • Latency fluctuation is larger for ZIP
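
    A sketch of the measurement loop behind these curves (the real
    harness differs; dataset items are assumed to expose open() as in
    the pseudocode on slide 14):

        import random
        import time

        def benchmark(dataset, epochs=2):
            """Time every random read; a second epoch exposes cache warm-up."""
            latencies = []
            for _ in range(epochs):
                order = list(range(len(dataset)))
                random.shuffle(order)
                for i in order:
                    t0 = time.perf_counter()
                    dataset[i].open().read()   # one random read
                    latencies.append(time.perf_counter() - t0)
            return latencies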


  17. Storage Cluster Spec
    Ozone cluster:
    • 12 DataNodes (& S3G)
    • 14TB 7200rpm HDDs x36/node
    • 512GB RAM/node
    • 20-core Intel CPU
    • 100GbE x2
    • Java 11
    • Ubuntu Linux 18.04

    Benchmark client:
    • Single process on one node
    • 15GB Gen3 NVMe x4
    • 384GB RAM/node
    • 20-core Intel CPU
    • Python 3.8
    • Ubuntu Linux 18.04


  18. Resolution: Sparse File Cache
    The 16MB buffer works as an implicit prefetch for random reads,
    without moving the spindles in the drives.
    The latency penalty is x4: ~0.1s vs ~0.4s.
    Work in progress (see the sketch below).
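
    How such a cache could look: a local file truncated to the object
    size stays sparse until written, and each 16MB-aligned chunk is
    fetched at most once with a ranged GET. This is a minimal sketch of
    the idea, not the production code; the class name, endpoint, bucket,
    and key are hypothetical, and boto3 stands in for the real client.

        import boto3

        CHUNK = 16 * 1024 * 1024  # the 16MB cache "page" from the slide

        class SparseFileCache:
            """Cache a remote object in a local sparse file, chunk by chunk."""

            def __init__(self, s3, bucket, key, cache_path):
                self.s3, self.bucket, self.key = s3, bucket, key
                self.size = s3.head_object(Bucket=bucket,
                                           Key=key)["ContentLength"]
                self.cached = set()         # indices of chunks fetched so far
                self.f = open(cache_path, "w+b")
                self.f.truncate(self.size)  # sparse: no disk blocks allocated yet

            def read(self, offset, length):
                first = offset // CHUNK
                last = (offset + length - 1) // CHUNK
                for i in range(first, last + 1):
                    if i in self.cached:
                        continue
                    start = i * CHUNK
                    end = min(start + CHUNK, self.size) - 1
                    body = self.s3.get_object(
                        Bucket=self.bucket, Key=self.key,
                        Range=f"bytes={start}-{end}")["Body"]
                    self.f.seek(start)
                    self.f.write(body.read())   # one 16MB implicit prefetch
                    self.cached.add(i)
                self.f.seek(offset)
                return self.f.read(length)

    Because every miss pulls in a whole 16MB chunk, neighboring small
    reads hit the warm local file instead of the HDD spindles, which is
    the warm-up effect shown on the next two slides.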


  19. Sparse files made it as fast as local

    You might not see the difference very well…


  20. Sparse files made it as fast as local
    The log-scale graph indicates:
    • Latency for the first ~5% of reads is at the same level as the
      other Ozone cases
    • After ~5% of the data has been read, the cache has warmed up
      (faster than a local NVMe drive + kernel cache would) and latency
      stays at the same level as local


  21. Summary

    We generate & maintain billions of image files for AI training

    The small files problem resurfaced in the form of:
    • Slow listKeys: worked around by rate limiting (+FSO)
    • Slow PUTs: worked around by archiving >10k files as one ZIP

    Slow random reads on an archive file
    • Will be worked around by the sparse file cache


  22. Thanks


  23. FAQ
    Q. The sparse file technique is not new; we have it in e.g. s3-fuse.
    A. We chose a Python-native method to avoid FUSE overhead.
    Q. Image size distribution?
    A. See the graph on the right of the slide.
    Q. Any other workaround cache?
    A. An NGINX cache server via HTTP.
    Q. Why does the sparse file cache warm up faster than the kernel cache?
    A. Its "page" size is larger (16MB vs the kernel's 4KB).
    Q. Other simulation use cases?
    A. We published a paper in the chemistry area.
