
CLIP challenges and scaling

Mehdi
July 20, 2023

Transcript

  1. Can we generate new product designs, new scientific theories, new music styles, new painting styles, etc.?
  2. Why am I mentioning this?
    - Even SOTA diffusion models seem to struggle on this setup.
    - It is a simple test for out-of-distribution capabilities and compositionality: if you fine-tune a Stable Diffusion model (e.g., with LoRA or full fine-tuning) on MNIST with text describing the digit categories, the model struggles to generate new categories (a minimal sketch of this test appears after the transcript).
  3. Recent advances in multimodal image-text models:
    - Open-vocabulary models like CLIP have zero-shot capabilities.
    - They can be applied to any classification task using only class descriptions, with no annotated labels needed (see the zero-shot classification sketch after the transcript).
    - Zero-shot performance is roughly equivalent to a ResNet-50 trained on 1.28M examples in a supervised way!
  4. Recent advances in multimodal image-text models: more recent works (e.g., ALIGN, BASIC, LiT, CoCa) further improved the results:
    - by scaling data/model size (ALIGN, BASIC)
    - by using frozen pre-trained encoders (LiT)
    - by using an additional captioning loss (CoCa)
    Training data sizes: ALIGN: 1.8B image-text pairs; BASIC: 6.6B; LiT: 4B; CoCa: 3.8B.
  5. Challenges in CLIP models:
    - Issues with compositionality, e.g., problems with relations and attributes.
    - Handling long and detailed prompts.
    - Attribute-based prompts/descriptions, and more generally how to handle new categories/concepts (VL-Taboo from Vogel et al.).
  6. Scaling CLIP models:
    - Using large pre-trained unimodal encoders/decoders, e.g., mT5-XXL (text encoder, ~7B), ViT-G/14 (image encoder, ~2.5B), mT5-XXL (text decoder, ~7B).
    - Partially unfreezing a few layers and finding the best unfreezing schedule to optimize compute (see the partial-unfreezing sketch after the transcript).
    - Training at higher resolution.
    - Using better-filtered datasets such as DataComp 1.4B.
    - Challenge: too many moving parts/choices; we need small-scale experiments that can predict large-scale ones.
  7. CLIP extensions:
    - Contrastive loss (sketched after the transcript).
    - Generative losses: text-to-image, image-to-text.
    - Self-consistency losses: image -> text -> image, text -> image -> text.
    - Unimodal losses?
    - Generating hard negatives adversarially?
  8. Evaluating CLIP models:
    - Zero-shot classification, retrieval, etc.
    - Compositionality tasks, e.g., CREPE/SugarCREPE.
    - How does it improve other tasks when plugged into another model, e.g., text-to-image, with a focus on out-of-distribution behavior?
  9. Multi-modal data and open models for science. ML for science:
    - As a research aid, e.g., to deal with the huge number of papers.
    - Help with understanding, writing, or summarizing papers.
    - Make new connections between subjects.
  10. Multi-modal data and open models for science:
    - Millions of papers are available: arXiv, PubMed, Semantic Scholar, etc., as well as scientific books.
    - High-quality structured documents.
    - Figures and captions.
    - Citation graphs.
  11. First tests with PubMed:
    - ~5.2M papers.
    - We extract 18.3M figure-caption pairs from the XML-based metadata provided in PubMed (see the extraction sketch after the transcript), similar to "Large-Scale Domain-Specific Pretraining for Biomedical Vision-Language Processing".
  12. PubMed dataset: example figure captions from the extracted pairs.
    - "Comparison of GshB structures from a bacterium and a eukaryote. (a) Human; (b) E. coli. Reproduced with permission from [43]"
    - "One of Two Bottlenose Dolphins That Passed the Mark Test, Thus Demonstrating Mirror Self-Recognition (Photo credit: Diana Reiss, Wildlife Conservation Society)"
    - "PFV IN active site in committed and drug-bound states. Views without drug (a) and with MK0518 (b) or GS9137 (c) bound. Protein and DNA in upper panels are cartoons, with A17, DNA bases and the side chains of indicated amino acids as sticks. Drug atoms are colored: yellow, C; blue, N; red, O; orange, P; gray, F; green, Cl. The complex is shown as a solvent accessible surface in lower panels, colored by atoms (light gray, C; red, O; blue, N). Gray spheres are Mn2+ (a, labeled A and B) or Mg2+ (b, c) ions."
  13. Initial results with openCLIP (see the recall@5 sketch after the transcript):

    | Model | image_retrieval_recall@5 (bioRxiv 2K sample) |
    | --- | --- |
    | ViT-B/16 (CLIP) | 0.76 |
    | BioGPT (text, pre-trained) + ViT-B/16 (image) | 0.80 |
    | -> add CoCa loss | 0.87 |
    | PubMedBERT (text, pre-trained) + ViT-B/16 (image) | 0.78 |
    | -> train 2x longer | 0.80 |
    | -> resolution 336 | 0.83 |
    | -> 256 context length | 0.74 |
  14. Next steps:
    - Integrate more datasets: bioRxiv, medRxiv, arXiv, Semantic Scholar, etc.
    - Design a test suite for multimodal evaluation of science knowledge.
    - Design an interleaved image-text dataset, to work on documents.
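
A minimal sketch of the out-of-distribution test from slide 2. The LoRA adapter path, the base checkpoint, and the held-out digit are assumptions for illustration, not the exact setup used in the talk:

```python
# Sketch only: assumes a hypothetical LoRA adapter fine-tuned on MNIST digits
# 0-8 (digit nine held out) saved at ./mnist_lora.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("./mnist_lora")  # hypothetical fine-tuned adapter

# Prompt for a category the adapter never saw; in practice the model tends to
# fall back to the fine-tuning distribution rather than compose a new digit.
image = pipe("a grayscale handwritten digit nine, MNIST style").images[0]
image.save("digit_nine.png")
```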
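The zero-shot classification mentioned on slide 3, sketched with openCLIP. The checkpoint tag, image file, and class descriptions below are placeholders:

```python
# Zero-shot classification with CLIP: rank class descriptions by similarity to
# the image embedding, no annotated labels needed.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

class_descriptions = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]
image = preprocess(Image.open("example.jpg")).unsqueeze(0)
text = tokenizer(class_descriptions)

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat /= img_feat.norm(dim=-1, keepdim=True)
    txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

print(dict(zip(class_descriptions, probs[0].tolist())))
```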
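One way to implement the partial unfreezing from slide 6, assuming an openCLIP-style model whose image tower exposes its blocks under `visual.transformer.resblocks`; attribute names differ for other towers, and the unfreezing schedule itself is the part to tune:

```python
import torch

def unfreeze_last_blocks(model, k=2):
    """Freeze the whole image tower, then unfreeze only its last k blocks."""
    for p in model.visual.parameters():
        p.requires_grad = False
    for block in model.visual.transformer.resblocks[-k:]:
        for p in block.parameters():
            p.requires_grad = True

def make_optimizer(model, lr=1e-5):
    # Only trainable parameters go to the optimizer, which keeps compute low.
    params = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(params, lr=lr)
```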
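The contrastive loss listed on slide 7 is the standard symmetric InfoNCE objective used by CLIP-style models; a self-contained PyTorch version, assuming L2-normalized embeddings paired by index within the batch:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # (B, B) similarity matrix between all images and all texts in the batch.
    logits = img_emb @ txt_emb.T / temperature
    targets = torch.arange(len(img_emb), device=img_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)    # image -> text direction
    loss_t2i = F.cross_entropy(logits.T, targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```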
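A sketch of how figure-caption pairs can be pulled from PubMed Central's JATS XML, as described on slide 11. Element names follow the JATS schema, the file path is a placeholder, and the actual extraction pipeline may differ:

```python
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

def extract_figure_captions(xml_path):
    """Return (graphic_href, caption_text) pairs from one JATS article file."""
    pairs = []
    root = ET.parse(xml_path).getroot()
    for fig in root.iter("fig"):
        caption = fig.find("caption")
        graphic = fig.find("graphic")
        if caption is None or graphic is None:
            continue
        text = " ".join(caption.itertext()).split()
        pairs.append((graphic.get(XLINK_HREF), " ".join(text)))
    return pairs

print(extract_figure_captions("example_pmc_article.xml"))
```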
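The image_retrieval_recall@5 metric reported in the slide-13 table can be computed as below; embeddings are assumed L2-normalized and paired by index, and this is a generic sketch rather than the exact evaluation code:

```python
import torch

def image_retrieval_recall_at_k(img_emb, txt_emb, k=5):
    """For each caption, check whether its paired image is in the top-k retrieved."""
    sims = txt_emb @ img_emb.T                   # (N_text, N_image) similarities
    topk = sims.topk(k, dim=-1).indices          # top-k image indices per caption
    targets = torch.arange(len(txt_emb), device=txt_emb.device).unsqueeze(1)
    return (topk == targets).any(dim=-1).float().mean().item()
```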