LAION-5B: An open large-scale dataset for training next generation image-text models

LAION-5B: An open large-scale dataset for training next generation image-text models
Mehdi Cherti
Morocco AI Webinar, 16th Nov. 2022

Recent advances in multimodal image-text models
DALL-E 2: text-to-image model
- ~5.5B-parameter model trained on 650M image-text pairs from a private unreleased dataset

Recent advances in multimodal image-text models
Contrastive Language-Image Pre-training (CLIP)
- Trained on 400M image-text pairs from a private unreleased dataset

Recent advances in multimodal image-text models
- Open-vocabulary models like CLIP have zero-shot capabilities.
- They can be applied to any classification task using only class descriptions (no annotated labels needed).
- Zero-shot performance is roughly equivalent to a ResNet-50 trained on 1.28M examples in a supervised way!
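
As a rough illustration of classifying with class descriptions only, the sketch below picks the class whose text embedding is most similar to the image embedding. It is a minimal sketch, not the CLIP implementation: the three-dimensional embeddings are made-up stand-ins for real encoder outputs, and `zero_shot_classify` and the `scale` value of 100 are assumptions chosen for the example.

```python
import math

def normalize(v):
    """Scale a vector to unit length so that dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_classify(image_emb, class_text_embs, scale=100.0):
    """Return (index of the best class, softmax probabilities over all classes)."""
    img = normalize(image_emb)
    logits = [scale * sum(a * b for a, b in zip(img, normalize(t)))
              for t in class_text_embs]
    top = max(logits)  # subtract the max before exponentiating for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs.index(max(probs)), probs

# Hypothetical embeddings; a real pipeline would obtain them from the CLIP encoders.
class_prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
class_text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_emb = [0.1, 0.9, 0.2]

pred, probs = zero_shot_classify(image_emb, class_text_embs)
print(class_prompts[pred])
```

No labeled training images are involved: adding a new class only requires adding a new text prompt.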

Recent advances in multimodal image-text models
CLIP shows better robustness to distribution shift compared to supervised models

Recent advances in multimodal image-text models
More recent works (e.g., ALIGN, BASIC, LiT, CoCa) further improved the results:
- By scaling data/model size (ALIGN, BASIC)
- By using frozen pre-trained encoders (LiT)
- By using an additional captioning loss (CoCa)
Training data sizes:
- ALIGN: 1.8B image-text pairs
- BASIC: 6.6B image-text pairs
- LiT: 4B image-text pairs
- CoCa: 3.8B image-text pairs

- None of the large datasets used in image-text models are publicly available
- Datasets are only available to a small number of industrial labs
- Difficult to study the training of text-image models at large scale and improve them
We propose LAION-5B, an open dataset of 5.85 billion image-text pairs filtered from Common Crawl

What is LAION-5B?
- 5.85B total image-text pairs:
  - 39% with English captions
  - 61% with other languages
- CLIP-filtered (ViT-B/32) from Common Crawl to have reasonable text-image alignment
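
The CLIP filtering step can be pictured as keeping only pairs whose image and text embeddings are similar enough. The sketch below is an illustration under stated assumptions, not the actual LAION pipeline: the embeddings are toy values, `clip_filter` is a hypothetical helper, and the 0.28 threshold is an assumed cut-off that does not come from these slides.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def clip_filter(pairs, threshold):
    """Keep pairs whose image/text similarity is at least `threshold`."""
    kept = []
    for pair in pairs:
        sim = cosine_similarity(pair["image_emb"], pair["text_emb"])
        if sim >= threshold:
            # Store the similarity so it can be shipped with the metadata.
            kept.append({"url": pair["url"], "caption": pair["caption"],
                         "similarity": sim})
    return kept

# Toy embeddings (made up); real ones would come from the ViT-B/32 CLIP encoders.
pairs = [
    {"url": "https://example.com/cat.jpg", "caption": "a cat on a sofa",
     "image_emb": [0.9, 0.1, 0.0], "text_emb": [0.8, 0.2, 0.1]},
    {"url": "https://example.com/banner.jpg", "caption": "click here",
     "image_emb": [0.0, 1.0, 0.0], "text_emb": [1.0, 0.0, 0.2]},
]

kept = clip_filter(pairs, threshold=0.28)  # assumed threshold, for illustration only
print([p["caption"] for p in kept])
```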

What is LAION-5B?
- img2dataset (https://github.com/rom1504/img2dataset) can be used to download the dataset or a subset of it.
- ~220 TB of storage needed for the full dataset (2.65 TB for the metadata).
- In the metadata, we provide:
  - URL of the image
  - Caption
  - CLIP cosine similarity between image and caption
  - NSFW score
  - Watermark score
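
Because each metadata record carries a similarity, an NSFW score and a watermark score, a subset can be chosen before downloading any image. The following sketch shows one way this could look. The field names (`similarity`, `nsfw`, `watermark`), the helper `select_subset` and all threshold values are hypothetical and may not match the released metadata schema.

```python
def select_subset(rows, min_similarity=0.3, max_nsfw=0.5, max_watermark=0.8):
    """Return the rows that pass all three metadata thresholds."""
    return [
        row for row in rows
        if row["similarity"] >= min_similarity
        and row["nsfw"] <= max_nsfw
        and row["watermark"] <= max_watermark
    ]

# Made-up metadata records for illustration.
rows = [
    {"url": "https://example.com/a.jpg", "caption": "a mountain lake",
     "similarity": 0.34, "nsfw": 0.02, "watermark": 0.10},
    {"url": "https://example.com/b.jpg", "caption": "stock photo of an office",
     "similarity": 0.31, "nsfw": 0.01, "watermark": 0.95},
    {"url": "https://example.com/c.jpg", "caption": "untitled",
     "similarity": 0.29, "nsfw": 0.90, "watermark": 0.05},
]

subset = select_subset(rows)
urls_to_download = [row["url"] for row in subset]
print(urls_to_download)
```

The resulting URL list is the kind of input a download tool would then consume.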

What is LAION-5B?
CLIP retrieval (https://knn5.laion.ai): image/text search on LAION-5B using CLIP embeddings
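
Search over CLIP embeddings amounts to a nearest-neighbor lookup: embed the query, then return the items with the highest cosine similarity. The sketch below uses a brute-force scan over a tiny in-memory dictionary, which is an assumption made for brevity. A search service over billions of items would need an approximate nearest-neighbor index instead. The embeddings and the `top_k` helper are hypothetical.

```python
import heapq
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_k(query_emb, index, k=2):
    """Return the k (similarity, item_id) pairs most similar to the query."""
    q = normalize(query_emb)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(emb))), item_id)
        for item_id, emb in index.items()
    )
    return heapq.nlargest(k, scored)

# Toy image embeddings keyed by file name (made-up values).
index = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.1, 0.9, 0.0],
    "car.jpg": [0.0, 0.1, 0.9],
}
query = [1.0, 0.2, 0.0]  # stand-in for the embedding of a text or image query

results = top_k(query, index, k=2)
print([item_id for _, item_id in results])
```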

Projects using LAION-5B
Subset generation
- LAION-High-Resolution, 70M subset for training super-resolution models
- LAION-Aesthetic, 120M subset of aesthetic images, determined by a linear estimator on top of CLIP
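
A linear estimator on top of CLIP can be read as a weighted sum of the embedding coordinates plus a bias, followed by a threshold to select the subset. The sketch below illustrates that idea only: the weights, bias, embeddings and the `min_score` cut-off are invented for the example and are not the trained LAION-Aesthetic predictor.

```python
def aesthetic_score(clip_emb, weights, bias):
    """Linear estimator: dot product of weights and embedding, plus a bias."""
    return sum(w * x for w, x in zip(weights, clip_emb)) + bias

def select_aesthetic(items, weights, bias, min_score):
    """Return the names of items whose predicted score reaches min_score."""
    return [name for name, emb in items
            if aesthetic_score(emb, weights, bias) >= min_score]

# Hypothetical parameters; a real estimator would be fitted on human ratings.
weights = [2.0, -1.0, 0.5]
bias = 5.0

# Toy CLIP image embeddings (made-up values).
items = [
    ("sunset.jpg", [0.8, 0.1, 0.6]),
    ("receipt.jpg", [0.1, 0.9, 0.2]),
]

selected = select_aesthetic(items, weights, bias, min_score=6.0)
print(selected)
```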

Projects using LAION-5B
Stable Diffusion, text-to-image generative model
Trained a text-to-image Latent Diffusion Model (LDM) at 512x512 resolution using:
- LAION-2B-en
- LAION-High-Resolution
- LAION-Aesthetic

Reproducing and evaluating CLIP
We use OpenCLIP to pre-train models of different sizes

Reproducing and evaluating CLIP
We train the models on large supercomputers:
- JUWELS Booster, Juelich Supercomputing Center (JSC)
  - 3744 NVIDIA A100 GPUs
- Stability AI AWS supercomputer
  - 5408 NVIDIA A100 GPUs

Neural scaling laws
- Performance improves smoothly with scale, following a power-law form when there is no bottleneck
- Performance with scale is remarkably predictable
Kaplan et al. 2020, language modeling task

Neural scaling laws
Implications:
- a) Extrapolate model performance to larger scales
- b) Compute the optimal model size for a given compute budget
- c) Compare scaling curves of different architectures/pre-training datasets/losses
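
To make implication a) concrete, a power law of the form error = a * compute^(-b) becomes a straight line in log-log space, so it can be fitted with ordinary least squares and then evaluated at a larger scale. The sketch below assumes this single-term form with no irreducible-error offset. The data points are synthetic, generated from a known power law, and are not measurements from the talk.

```python
import math

def fit_power_law(compute, error):
    """Fit error = a * compute**(-b) by least squares on log-log values."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(e) for e in error]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)

def predict(a, b, compute):
    """Evaluate the fitted power law at a given compute budget."""
    return a * compute ** (-b)

# Synthetic points drawn from error = 2.0 * compute**(-0.1).
compute = [1e9, 1e10, 1e11, 1e12]
error = [2.0 * c ** (-0.1) for c in compute]

a, b = fit_power_law(compute, error)
extrapolated = predict(a, b, 1e13)  # one order of magnitude beyond the data
print(a, b, extrapolated)
```

Fitting one such curve per architecture or dataset and comparing the exponents is one simple way to approach implication c).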

Neural scaling laws
- Scaling laws hold not only for test loss but also for downstream transfer
- They also hold for different domains/architectures, not only language modeling
Scaling vision transformers (ViT)

Reproducing and evaluating CLIP
We use OpenCLIP to pre-train models of different sizes on LAION-400M/2B

Reproducing and evaluating CLIP
We evaluate the models on zero-shot classification on 35 tasks (VTAB+)

Reproducing and evaluating CLIP

Reproducing and evaluating CLIP
Effect of data scale

Reproducing and evaluating CLIP
Zero-shot retrieval results

Thank you for listening!
- Paper: https://arxiv.org/abs/2210.08402
- OpenReview: https://openreview.net/forum?id=M3Y74vmsMcY
- Blog post: https://laion.ai/blog/laion-5b/
- Download tool: https://github.com/rom1504/img2dataset
- CLIP Retrieval: https://github.com/rom1504/clip-retrieval
- Dataset exploration: https://knn5.laion.ai
- OpenCLIP: https://github.com/mlfoundations/open_clip
- Detailed CLIP evaluation and benchmark: https://github.com/LAION-AI/CLIP_benchmark
Join our LAION Discord server: https://discord.gg/nGuc6rGdqP
