
CLIP challenges and scaling

Mehdi
July 20, 2023

Transcript

  1. Can we generate new product designs, new scientific theories, new music styles, new painting styles, etc.?
  2. Why am I mentioning this?
     - Even SOTA diffusion models seem to struggle on this setup
     - A simple test for out-of-distribution capabilities and compositionality
     - If you fine-tune a Stable Diffusion model (e.g., LoRA or full fine-tuning) on MNIST with text describing the digit categories => the model struggles to generate new categories (a sketch of this probe follows)
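To make the probe concrete, here is a minimal, hypothetical sketch with the diffusers library: load Stable Diffusion, attach a LoRA adapter fine-tuned on a subset of MNIST digit classes, and prompt for a held-out class. The checkpoint path, prompts, and choice of held-out digit are illustrative, not the talk's actual setup.

```python
# Hypothetical sketch of the MNIST probe: load Stable Diffusion, attach a
# LoRA adapter fine-tuned on some digit classes, and prompt for a class
# that was held out during fine-tuning.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("path/to/mnist-lora")  # placeholder path to the fine-tuned adapter

# "nine" is assumed held out of the fine-tuning set; "three" is a seen class.
for i, prompt in enumerate(["a handwritten digit three, white on black",
                            "a handwritten digit nine, white on black"]):
    image = pipe(prompt, num_inference_steps=30).images[0]
    image.save(f"probe_{i}.png")
```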
  3. Recent advances in multimodal image-text models
     - Open-vocabulary models like CLIP have zero-shot capabilities: they can be applied to any classification task using only class descriptions (no annotated labels needed)
     - Zero-shot performance is roughly equivalent to a ResNet-50 trained on 1.28M examples in a supervised way! (sketch below)
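As an illustration of the zero-shot recipe, a minimal sketch with openCLIP (used again later in the deck); the model tag, prompt template, and class names are arbitrary choices for the example.

```python
# Minimal zero-shot classification sketch with openCLIP: class descriptions
# are embedded as text, the image is embedded, and the prediction is the
# class whose text embedding is most similar to the image embedding.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16", pretrained="laion2b_s34b_b88k")
tokenizer = open_clip.get_tokenizer("ViT-B-16")
model.eval()

class_names = ["dog", "cat", "airplane"]                     # illustrative classes
texts = tokenizer([f"a photo of a {c}" for c in class_names])
image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(texts)
    image_feat /= image_feat.norm(dim=-1, keepdim=True)
    text_feat /= text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))
```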
  4. Recent advances in multimodal image-text models
     More recent works (e.g., ALIGN, BASIC, LiT, CoCa) further improved the results:
     - By scaling data/model size (ALIGN, BASIC)
     - By using frozen pre-trained encoders (LiT)
     - By using an additional captioning loss (CoCa)
     Training data: ALIGN: 1.8B image-text pairs; BASIC: 6.6B; LiT: 4B; CoCa: 3.8B
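The LiT idea (a locked, pre-trained image tower with a trainable text tower) can be summarized schematically; the encoder objects below are stand-ins, not any specific implementation.

```python
# Schematic of LiT-style "locked-image tuning": keep a pre-trained image
# encoder frozen and train only the text tower (and projections) with a
# contrastive objective. `image_encoder` / `text_encoder` are stand-ins.
import torch.nn as nn

def lock_image_tower(image_encoder: nn.Module) -> nn.Module:
    image_encoder.eval()                   # freeze normalization/dropout behavior
    for p in image_encoder.parameters():
        p.requires_grad_(False)            # no gradients flow into the image tower
    return image_encoder

# Only the text tower's parameters would be handed to the optimizer, e.g.:
# optimizer = torch.optim.AdamW(text_encoder.parameters(), lr=1e-4)
```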
  5. Challenges in CLIP models
     - Issues with compositionality, e.g., problems with relations and attributes
     - Handling long and detailed prompts
     - Attribute-based prompts/descriptions, and more generally how to handle new categories/concepts (VL-Taboo from Vogel et al.)
  6. Scaling CLIP models
     - Using large pre-trained unimodal encoders/decoders, e.g. MT5-XXL (text encoder, ~7B), ViT-G/14 (image encoder, ~2.5B), MT5-XXL (text decoder, ~7B)
     - Partially unfreeze a few layers; find the best unfreezing schedule to optimize compute (see the sketch below)
     - Training at higher resolution
     - Using better-filtered datasets such as DataComp 1.4B
     - Challenge: too many moving parts/choices; we need small-scale experiments that can predict large-scale ones
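One possible reading of the partial-unfreezing point, as a sketch: freeze everything, then re-enable gradients only for the top few transformer blocks, with the number of unfrozen blocks following a schedule. The `blocks` argument is a placeholder whose name depends on the encoder implementation.

```python
# Sketch of partial unfreezing: freeze the whole encoder, then unfreeze only
# the top `n_unfrozen` transformer blocks. `blocks` is a placeholder for the
# encoder's list of transformer layers (its attribute name varies by model).
import torch.nn as nn

def partially_unfreeze(encoder: nn.Module, blocks: nn.ModuleList, n_unfrozen: int):
    for p in encoder.parameters():
        p.requires_grad_(False)                   # start fully frozen
    if n_unfrozen > 0:
        for block in list(blocks)[-n_unfrozen:]:  # top-most blocks only
            for p in block.parameters():
                p.requires_grad_(True)

# A simple unfreezing schedule could call this with a growing n_unfrozen,
# e.g. 0 blocks for the first epochs, then 2, then 4, trading compute for quality.
```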
  7. CLIP extensions
     - Contrastive loss (minimal PyTorch version below)
     - Generative losses: text to image, image to text
     - Self-consistency loss: im -> text -> im, text -> im -> text
     - Unimodal losses?
     - Generating hard negatives adversarially?
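For reference, the standard symmetric contrastive loss that these extensions build on; a minimal PyTorch version, assuming L2-normalized embeddings.

```python
# Standard CLIP-style contrastive (InfoNCE) loss: matched image/text pairs
# sit on the diagonal of the similarity matrix and are treated as the
# positive class in a symmetric cross-entropy. Embeddings assumed L2-normalized.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats: torch.Tensor,
                          text_feats: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    logits = image_feats @ text_feats.T / temperature     # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)           # image -> text direction
    loss_t2i = F.cross_entropy(logits.T, targets)         # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)
```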
  8. Evaluating CLIP models
     - Zero-shot classification, retrieval, etc.
     - Compositionality tasks, e.g. CREPE/SugarCREPE
     - How does it improve other tasks when plugged into another model, e.g. text-to-image generation, with a focus on out-of-distribution behavior?
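A SugarCREPE-style check boils down to asking whether the model scores the true caption above a minimally edited hard negative for the same image. The sketch below assumes precomputed, L2-normalized features and a simple (image, positive, negative) alignment; the actual benchmarks provide their own data format.

```python
# Sketch of a SugarCREPE-style compositionality check: for each image, the
# model should assign higher similarity to the true caption than to a
# hard-negative caption (e.g., with swapped attributes or relations).
import torch

def hard_negative_accuracy(image_feats: torch.Tensor,     # (N, d) images
                           pos_text_feats: torch.Tensor,  # (N, d) true captions
                           neg_text_feats: torch.Tensor   # (N, d) hard negatives
                           ) -> float:
    pos_sim = (image_feats * pos_text_feats).sum(dim=-1)
    neg_sim = (image_feats * neg_text_feats).sum(dim=-1)
    return (pos_sim > neg_sim).float().mean().item()
```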
  9. Multi-modal data and open models for science
     ML for science:
     - As a research aid, e.g., to deal with the huge number of papers
     - Help with understanding, writing, or summarizing papers
     - Make new connections between subjects
  10. Multi-modal data and open models for science
      - Millions of papers are available: arXiv, PubMed, Semantic Scholar, etc., as well as scientific books
      - High-quality structured documents
      - Figures and captions
      - Citation graph
  11. First tests with PubMed
      - ~5.2M papers
      - We extract 18.3M figure-caption pairs (from the XML metadata provided with PubMed), similar to "Large-Scale Domain-Specific Pretraining for Biomedical Vision-Language Processing" (extraction sketch below)
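A rough sketch of how such pairs can be pulled from PubMed Central's JATS XML (figure elements carrying captions and graphic references); the real extraction pipeline likely handles many more cases and formats.

```python
# Sketch: extract figure-caption pairs from a PubMed Central JATS XML file.
# Assumes the PMC Open Access layout (<fig> elements containing <caption> and
# <graphic xlink:href=...>); details of the actual pipeline may differ.
import xml.etree.ElementTree as ET

XLINK = "{http://www.w3.org/1999/xlink}"

def figure_caption_pairs(xml_path):
    tree = ET.parse(xml_path)
    pairs = []
    for fig in tree.iter("fig"):
        caption_el = fig.find("caption")
        graphic_el = fig.find("graphic")
        if caption_el is None or graphic_el is None:
            continue
        caption = " ".join("".join(caption_el.itertext()).split())
        image_ref = graphic_el.get(XLINK + "href")  # image file reference
        if caption and image_ref:
            pairs.append({"image": image_ref, "caption": caption})
    return pairs
```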
  12. PubMed dataset (example captions)
      - "Comparison of GshB structures from a bacterium and a eukaryote. (a) Human; (b) E. coli. Reproduced with permission from [43]"
      - "One of Two Bottlenose Dolphins That Passed the Mark Test, Thus Demonstrating Mirror Self-Recognition (Photo credit: Diana Reiss, Wildlife Conservation Society)"
      - "PFV IN active site in committed and drug-bound states. Views without drug (a) and with MK0518 (b) or GS9137 (c) bound. Protein and DNA in upper panels are cartoons, with A17, DNA bases and the side chains of indicated amino acids as sticks. Drug atoms are colored: yellow, C; blue, N; red, O; orange, P; gray, F; green, Cl. The complex is shown as a solvent-accessible surface in lower panels, colored by atoms (light gray, C; red, O; blue, N). Gray spheres are Mn2+ (a, labeled A and B) or Mg2+ (b, c) ions."
  13. Initial results with openCLIP

      Model                                               image_retrieval_recall@5 (biorxiv 2K sample)
      ViT-B/16 (CLIP)                                     0.76
      BioGPT (text, pre-trained) + ViT-B/16 (image)       0.80
        -> add CoCa loss                                  0.87
      PubMedBert (text, pre-trained) + ViT-B/16 (image)   0.78
        -> train 2x longer                                0.80
        -> res 336                                        0.83
        -> 256 context length                             0.74
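For reference, image_retrieval_recall@5 as reported above can be computed as follows, assuming precomputed, L2-normalized image and caption features where row i of each matrix corresponds to the same figure-caption pair.

```python
# Sketch: image-retrieval recall@k with CLIP-style embeddings.
# `image_feats` and `text_feats` are L2-normalized tensors of shape (N, d)
# with matching row indices (caption i describes image i).
import torch

def image_retrieval_recall_at_k(image_feats: torch.Tensor,
                                text_feats: torch.Tensor,
                                k: int = 5) -> float:
    sims = text_feats @ image_feats.T              # (N_text, N_image) similarities
    topk = sims.topk(k, dim=1).indices             # k best images for each caption
    targets = torch.arange(sims.size(0)).unsqueeze(1)
    hits = (topk == targets).any(dim=1)            # correct image among the top k?
    return hits.float().mean().item()
```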
  14. Next steps
      - Integrate more datasets: bioRxiv, medRxiv, arXiv, Semantic Scholar, etc.
      - Design a test suite for multimodal evaluation of science knowledge
      - Design an interleaved image-text dataset, to work on documents