
LAION-5B: An open large-scale dataset for training next generation image-text models

Mehdi
July 20, 2023

Transcript

  1. LAION-5B: An open large-scale dataset for training next-generation image-text models. Mehdi Cherti, Morocco AI Webinar, 16 Nov. 2022.
  2. Recent advances in multimodal image-text models. DALL-E 2: a text-to-image model with ~5.5B parameters, trained on 650M image-text pairs from a private, unreleased dataset.
  3. Recent advances in multimodal image-text models. Contrastive Language-Image Pre-training (CLIP): trained on 400M image-text pairs from a private, unreleased dataset.
  4. Recent advances in multimodal image-text models. Open-vocabulary models like CLIP have zero-shot capabilities: they can be applied to any classification task using only class descriptions, with no annotated labels needed. Zero-shot performance is roughly equivalent to a ResNet-50 trained in a supervised way on 1.28M examples!
  5. Recent advances in multimodal image-text models. CLIP shows better robustness to distribution shift compared to supervised models.
  6. Recent advances in multimodal image-text models. More recent works (e.g., ALIGN, BASIC, LiT, CoCa) further improved the results:
     - by scaling data/model size (ALIGN, BASIC)
     - by using frozen pre-trained encoders (LiT)
     - by using an additional captioning loss (CoCa)
     Dataset sizes: ALIGN, 1.8B image-text pairs; BASIC, 6.6B; LiT, 4B; CoCa, 3.8B.
  7. - None of the large datasets used for image-text models are publicly available.
     - These datasets are available only to a small number of industrial labs.
     - This makes it difficult to study the training of text-image models at large scale and to improve them.
     We propose LAION-5B, an open dataset of 5.85 billion image-text pairs filtered from Common Crawl.
  8. What is LAION-5B?
     - 5.85B total image-text pairs: 39% with English captions, 61% with other languages.
     - CLIP-filtered (ViT-B/32) from Common Crawl to ensure reasonable text-image alignment.
  9. What is LAION-5B? Use img2dataset (https://github.com/rom1504/img2dataset) to download the dataset or a subset of it.
     - ~220 TB of storage is needed for the full dataset (2.65 TB for the metadata alone).
     - For each pair, the metadata provides: the URL of the image, the caption, the CLIP cosine similarity between image and caption, an NSFW score, and a watermark score.
  10. Projects using LAION-5B: subset generation.
      - LAION-High-Resolution, a 70M subset for training super-resolution models.
      - LAION-Aesthetic, a 120M subset of aesthetic images, determined by a linear estimator on top of CLIP.
  11. Projects using LAION-5B: Stable Diffusion, a text-to-image generative model. A text-to-image Latent Diffusion Model (LDM) was trained at 512x512 resolution using:
      - LAION-2B-en
      - LAION-High-Resolution
      - LAION-Aesthetic
  12. Reproducing and evaluating CLIP. We train the models on large supercomputers:
      - JUWELS Booster, Jülich Supercomputing Centre (JSC): 3744 NVIDIA A100 GPUs
      - Stability AI AWS supercomputer: 5408 NVIDIA A100 GPUs
  13. Neural scaling laws (Kaplan et al. 2020, language modeling task).
      - Performance improves smoothly with scale, following a power-law form, when there is no bottleneck.
      - Performance at scale is remarkably predictable.
  14. Neural scaling laws. Implications:
      - a) extrapolate model performance to larger scales
      - b) compute the optimal model size for a given compute budget
      - c) compare scaling curves of different architectures, pre-training datasets, and losses
  15. Neural scaling laws (scaling vision transformers, ViT).
      - Scaling laws hold not only for test loss but also for downstream transfer.
      - They also hold for other domains and architectures, not only language modeling.
  16. Reproducing and evaluating CLIP. We use OpenCLIP to pre-train models of different sizes on LAION-400M/2B.
  17. Thank you for listening!
      - Paper: https://arxiv.org/abs/2210.08402
      - OpenReview: https://openreview.net/forum?id=M3Y74vmsMcY
      - Blog post: https://laion.ai/blog/laion-5b/
      - Download tool: https://github.com/rom1504/img2dataset
      - CLIP Retrieval: https://github.com/rom1504/clip-retrieval
      - Dataset exploration: https://knn5.laion.ai
      - OpenCLIP: https://github.com/mlfoundations/open_clip
      - Detailed CLIP evaluation and benchmark: https://github.com/LAION-AI/CLIP_benchmark
      Join our LAION Discord server: https://discord.gg/nGuc6rGdqP
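
The contrastive objective behind CLIP (slide 3) can be sketched in a few lines: a symmetric InfoNCE loss that pushes each matched image-text pair above all mismatched pairs in the batch. This is a minimal NumPy sketch assuming pre-computed embeddings; the function name, batch/embedding sizes, and temperature value are illustrative, not the actual training code.

```python
import numpy as np

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss: matched image/text pairs (the diagonal of the
    similarity matrix) should score higher than all mismatched pairs."""
    # L2-normalize so dot products are cosine similarities
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (batch, batch) similarity matrix

    def log_softmax(x, axis):
        x = x - x.max(axis=axis, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

    diag = np.arange(len(img))
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return (loss_i2t + loss_t2i) / 2

rng = np.random.default_rng(0)
aligned = rng.normal(size=(8, 16))
# Perfectly aligned pairs give a near-zero loss; random pairings score higher.
print(clip_contrastive_loss(aligned, aligned))
print(clip_contrastive_loss(aligned, rng.normal(size=(8, 16))))
```

Minimizing this loss over a large batch is what gives CLIP-style models their shared image-text embedding space.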
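
The zero-shot classification described in slide 4 reduces, at inference time, to a nearest-neighbor search between one image embedding and one text embedding per class description. A minimal NumPy sketch, with toy 4-dimensional embeddings standing in for real CLIP features (the vectors and class strings are made up for illustration):

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs):
    """Return the index of the class whose text embedding has the highest
    cosine similarity with the image embedding (no labeled data needed)."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    sims = txt @ img  # cosine similarity per class description
    return int(np.argmax(sims))

classes = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
image_emb = np.array([0.8, 0.2, 0.05, 0.0])  # closest to the "cat" description
print(classes[zero_shot_classify(image_emb, text_embs)])  # prints: a photo of a cat
```

Swapping in a different set of class descriptions retargets the same model to a new task, which is why no annotated labels are needed.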
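
The per-pair metadata fields listed in slide 9 (CLIP similarity, NSFW score, watermark score) make it straightforward to carve out custom subsets before downloading any images. A sketch of that filtering step on in-memory records; the field names and thresholds here are assumptions for illustration and may not match the actual metadata column names:

```python
# Hypothetical metadata records mirroring the fields named on the slide.
sample_metadata = [
    {"url": "https://example.com/a.jpg", "caption": "a cat",
     "similarity": 0.34, "nsfw_score": 0.01, "watermark_score": 0.05},
    {"url": "https://example.com/b.jpg", "caption": "lorem ipsum",
     "similarity": 0.12, "nsfw_score": 0.02, "watermark_score": 0.10},
    {"url": "https://example.com/c.jpg", "caption": "a beach at sunset",
     "similarity": 0.41, "nsfw_score": 0.90, "watermark_score": 0.02},
]

def select_pairs(records, min_similarity=0.28, max_nsfw=0.5, max_watermark=0.8):
    """Keep only well-aligned, safe, watermark-free image-text pairs."""
    return [r for r in records
            if r["similarity"] >= min_similarity
            and r["nsfw_score"] <= max_nsfw
            and r["watermark_score"] <= max_watermark]

kept = select_pairs(sample_metadata)
print([r["url"] for r in kept])  # only a.jpg passes all three thresholds
```

The surviving URL list can then be handed to img2dataset to fetch just that subset; the LAION-Aesthetic and LAION-High-Resolution subsets in slide 10 follow the same pattern with different selection criteria.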
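
The power-law behavior from slides 13-14 can be made concrete: fitting L = a * N^(-b) is a linear regression in log-log space, and the fitted curve supports the extrapolation use case listed as implication (a). A sketch on synthetic measurements (the scales and the exact exponent 0.3 are made-up values, not results from the paper):

```python
import numpy as np

def fit_power_law(scales, losses):
    """Fit L = a * N**(-b) by least squares in log-log space; return (a, b)."""
    log_n, log_l = np.log(scales), np.log(losses)
    slope, intercept = np.polyfit(log_n, log_l, 1)  # log L = intercept + slope * log N
    return np.exp(intercept), -slope

# Synthetic measurements that follow L = 10 * N**(-0.3) exactly.
scales = np.array([1e6, 1e7, 1e8, 1e9])
losses = 10 * scales ** -0.3
a, b = fit_power_law(scales, losses)
print(a, b)  # recovers ~10 and ~0.3

# Implication (a): extrapolate the predicted loss at a 10x larger scale.
print(a * 1e10 ** -b)
```

On real training runs the fit is noisy rather than exact, but the same log-log regression is how the "remarkably predictable" curves on the slide are obtained.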