Slide 1

Slide 1 text

CLIP challenges and scaling
ELLIS-LAION workshop on foundation models, 13 July 2023
Mehdi Cherti

Slide 2

Slide 2 text

Can we generate new product designs, new scientific theories, new music styles, new painting styles, etc.?

Slide 3

Slide 3 text

Even now, the focus is more on image quality

Slide 4

Slide 4 text

Back to 2015…

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

Why am I mentioning this?
- Even SOTA diffusion models seem to struggle on this setup
- A simple test for out-of-distribution capabilities and compositionality
- If you fine-tune (e.g., LoRA, or full fine-tuning) a Stable Diffusion model on MNIST with text describing the digit categories => the models struggle to generate new categories (see the sketch below)
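A minimal sketch of this probe, assuming a Stable Diffusion checkpoint that has already been LoRA fine-tuned on MNIST images of digits 0-8 with captions naming each digit; the base checkpoint, LoRA weights path, and prompt wording are illustrative assumptions, not the exact setup from the talk.

```python
# Hedged sketch: ask a Stable Diffusion model that was LoRA fine-tuned on
# captions for digits 0-8 (hypothetical weights) to generate the held-out digit 9.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # base checkpoint (illustrative choice)
    torch_dtype=torch.float16,
).to("cuda")

# Hypothetical LoRA weights trained only on MNIST digits 0-8
pipe.load_lora_weights("path/to/mnist-digits-0-8-lora")

# Prompt for a category never seen during fine-tuning
image = pipe("a handwritten digit nine, white on black, MNIST style").images[0]
image.save("digit_nine_attempt.png")
```

The interesting comparison is between held-in prompts (digits seen during fine-tuning) and the held-out one, which gives a simple read on out-of-distribution generation.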

Slide 14

Slide 14 text

Recent advances in multimodal image-text models
Contrastive Language-Image Pre-training (CLIP): 1. contrastive pre-training, 2. transfer

Slide 15

Slide 15 text

Recent advances in multimodal image-text models
- Open-vocabulary models like CLIP have zero-shot capabilities: they can be applied to any classification task using only class descriptions (no annotated labels needed)
- Zero-shot performance is roughly equivalent to a ResNet-50 trained in a supervised way on 1.28M examples!
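For reference, zero-shot classification with a CLIP model needs nothing beyond class descriptions; the sketch below follows the usual OpenCLIP usage pattern, with the checkpoint tag, image path, and prompts as illustrative assumptions.

```python
import torch
import open_clip
from PIL import Image

# Load a CLIP model and its preprocessing transforms (checkpoint tag is illustrative)
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# The zero-shot "classifier" is just a list of class descriptions
class_prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
image = preprocess(Image.open("example.jpg")).unsqueeze(0)
text = tokenizer(class_prompts)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Cosine similarity between the image and each class description
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(class_prompts, probs.squeeze(0).tolist())))
```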

Slide 16

Slide 16 text

Recent advances in multimodal image-text models
More recent works (e.g., ALIGN, BASIC, LiT, CoCa) improved the results further:
- By scaling data/model size (ALIGN, BASIC)
- By using frozen pre-trained encoders (LiT)
- By using an additional captioning loss (CoCa)
ALIGN: 1.8B image-text pairs
BASIC: 6.6B image-text pairs
LiT: 4B image-text pairs
CoCa: 3.8B image-text pairs

Slide 17

Slide 17 text

Challenges in CLIP models
- Issues with compositionality, e.g., problems with relations and attributes
- Handling long and detailed prompts
- Attribute-based prompts/descriptions, and more generally how to handle new categories/concepts (VL-Taboo from Vogel et al.)

Slide 18

Slide 18 text

Scaling CLIP models
- Using large pre-trained unimodal encoders/decoders, e.g.:
  - mT5-XXL (text encoder), ~7B
  - ViT-G/14 (image encoder), ~2.5B
  - mT5-XXL (text decoder), ~7B
- Partially unfreezing a few layers, and finding the best unfreezing schedule to optimize compute (see the sketch below)
- Training at higher resolution
- Using better-filtered datasets such as DataComp 1.4B
- Challenge: too many moving parts/choices; finding small-scale experiments that can predict large-scale ones
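A minimal sketch of the partial-unfreezing idea, using a pretrained mT5 text encoder from Hugging Face transformers as the text tower; the model size and the number of unfrozen blocks are illustrative assumptions (the talk targets mT5-XXL), not the settings actually used.

```python
from transformers import MT5EncoderModel

# Pretrained text tower; a small variant is used here so the sketch runs cheaply
text_encoder = MT5EncoderModel.from_pretrained("google/mt5-small")

# Freeze everything first...
for p in text_encoder.parameters():
    p.requires_grad = False

# ...then unfreeze only the last N transformer blocks (N is a tunable assumption)
N = 2
for block in text_encoder.encoder.block[-N:]:
    for p in block.parameters():
        p.requires_grad = True

trainable = sum(p.numel() for p in text_encoder.parameters() if p.requires_grad)
total = sum(p.numel() for p in text_encoder.parameters())
print(f"trainable parameters: {trainable:,} / {total:,}")
```

The unfreezing schedule (which blocks, and when during training they are unfrozen) then becomes a hyperparameter to search over under a fixed compute budget.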

Slide 19

Slide 19 text

CLIP extensions
- Contrastive loss (see the sketch below)
- Generative losses
  - Text to image
  - Image to text
- Self-consistency loss: im -> text -> im, text -> im -> text
- Unimodal losses?
- Generating hard negatives adversarially?
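For context, the contrastive loss in the first bullet is CLIP's symmetric InfoNCE objective; a minimal PyTorch sketch is below (the fixed temperature is a simplification, since in CLIP it is a learned parameter).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.
    Row i of both tensors is assumed to be a matching image-text pair."""
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    # Pairwise cosine similarities, scaled by the temperature
    logits = image_features @ text_features.t() / temperature
    # Matching pairs lie on the diagonal
    targets = torch.arange(logits.shape[0], device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```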

Slide 20

Slide 20 text

Evaluating CLIP models
- Zero-shot classification, retrieval, etc.
- Compositionality tasks, e.g., CREPE/SugarCREPE
- How it improves other tasks when stacked into another model, e.g., text-to-image, with a focus on out-of-distribution

Slide 21

Slide 21 text

Multi-modal data and open models for science
ML for science:
- As a research aid, e.g., to deal with the huge number of papers
- Help with understanding, writing, or summarizing papers
- Make new connections between subjects

Slide 22

Slide 22 text

Multi-modal data and open models for science
- Millions of papers are available: arXiv, PubMed, Semantic Scholar, etc., as well as scientific books
- High-quality structured documents
- Figures and captions
- Citation graph

Slide 23

Slide 23 text

First tests with PubMed
- ~5.2M papers
- We extract 18.3M figure-caption pairs (from the XML-based metadata provided in PubMed), similar to "Large-Scale Domain-Specific Pretraining for Biomedical Vision-Language Processing" (see the sketch below)
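A minimal sketch of how figure-caption pairs can be pulled out of PubMed Central JATS XML; the element names follow the JATS schema used by PMC, while the file path and the lack of error handling are illustrative simplifications, not the actual extraction pipeline.

```python
import xml.etree.ElementTree as ET

XLINK = "{http://www.w3.org/1999/xlink}"

def figure_caption_pairs(xml_path):
    """Return (image_href, caption_text) pairs from one PMC JATS XML article."""
    tree = ET.parse(xml_path)
    pairs = []
    for fig in tree.iter("fig"):
        caption = fig.find("caption")
        graphic = fig.find("graphic")
        if caption is None or graphic is None:
            continue
        # Flatten the caption markup into whitespace-normalized text
        text = " ".join(" ".join(caption.itertext()).split())
        href = graphic.get(XLINK + "href")  # image file referenced by the figure
        pairs.append((href, text))
    return pairs

# Example (hypothetical path):
# pairs = figure_caption_pairs("PMC1234567.nxml")
```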

Slide 24

Slide 24 text

PubMed dataset: example figure captions
- Comparison of GshB structures from a bacterium and a eukaryote. (a) Human; (b) E. coli. Reproduced with permission from [43]
- One of Two Bottlenose Dolphins That Passed the Mark Test, Thus Demonstrating Mirror Self-Recognition (Photo credit: Diana Reiss, Wildlife Conservation Society)
- PFV IN active site in committed and drug-bound states. Views without drug (a) and with MK0518 (b) or GS9137 (c) bound. Protein and DNA in upper panels are cartoons, with A17, DNA bases and the side chains of indicated amino acids as sticks. Drug atoms are colored: yellow, C; blue, N; red, O; orange, P; gray, F; green, Cl. The complex is shown as a solvent-accessible surface in lower panels, colored by atoms (light gray, C; red, O; blue, N). Gray spheres are Mn2+ (a, labeled A and B) or Mg2+ (b, c) ions.

Slide 25

Slide 25 text

Initial results with OpenCLIP: image_retrieval_recall@5 on a 2K bioRxiv sample (see the sketch below for the metric)
- ViT-B/16 (CLIP): 0.76
- BioGPT (text, pre-trained) + ViT-B/16 (image): 0.80
  -> add CoCa loss: 0.87
- PubMedBERT (text, pre-trained) + ViT-B/16 (image): 0.78
  -> train 2x longer: 0.80
  -> resolution 336: 0.83
  -> context length 256: 0.74
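For reference, image_retrieval_recall@5 can be computed as below; this is a self-contained sketch assuming row-aligned image and text embeddings for the evaluation pairs, not the exact evaluation code behind these numbers.

```python
import torch

def image_retrieval_recall_at_k(image_features, text_features, k=5):
    """Text-to-image retrieval recall@k: for each caption, check whether its
    matching image is among the k most similar images (row i of both tensors
    is assumed to form a matching pair)."""
    image_features = torch.nn.functional.normalize(image_features, dim=-1)
    text_features = torch.nn.functional.normalize(text_features, dim=-1)
    sims = text_features @ image_features.t()            # (n_captions, n_images)
    topk = sims.topk(k, dim=-1).indices                  # best k images per caption
    targets = torch.arange(sims.shape[0]).unsqueeze(1)   # ground-truth image index
    hits = (topk == targets.to(topk.device)).any(dim=-1)
    return hits.float().mean().item()
```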

Slide 26

Slide 26 text

Next steps
- Integrate more datasets: bioRxiv, medRxiv, arXiv, Semantic Scholar, etc.
- Design a test suite for multimodal evaluation of science knowledge
- Design an interleaved image-text dataset to work on documents

Slide 27

Slide 27 text

Thank you for your attention