Slide 11
How to Tokenize Image Features
[Figure: two ways to pass image features to a language model]
Using projector: GIT [Wang+], LLaVA [Liu+]...
  Image -> Encoder -> Feature vectors -> Projector (MLP) -> Image tokens
  Image tokens + Language tokens -> Transformer
Using cross attention: BLIP2 [Li+], Flamingo [Alayrac+]
  Image -> Encoder -> Adapter
  Language tokens + Special tokens -> Transformer
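The two designs on this slide can be contrasted in a small, dependency-free sketch. It is only an illustration under stated assumptions: the sizes, the random weights, and the helper names (`rand_matrix`, `matmul`, `project`, `cross_attend`) are hypothetical and do not come from any of the cited models. Which tokens act as queries also differs between the cited models; here the language tokens are used as queries to keep the example short.

```python
import math
import random

random.seed(0)

def rand_matrix(rows, cols):
    """Random weights standing in for learned parameters (hypothetical)."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    """Plain (n x k) @ (k x m) matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed toy sizes: 4 image patches, 3 text tokens.
N_PATCH, N_TEXT, D_VISION, D_HIDDEN, D_MODEL = 4, 3, 6, 8, 5

image_features = rand_matrix(N_PATCH, D_VISION)  # stand-in for image encoder output
language_tokens = rand_matrix(N_TEXT, D_MODEL)   # stand-in for embedded text tokens

def project(features, w1, w2):
    """Two-layer MLP projector: vision width -> language model width."""
    hidden = [[max(0.0, v) for v in row] for row in matmul(features, w1)]  # ReLU
    return matmul(hidden, w2)

def cross_attend(queries, features, wq, wk, wv):
    """Single-head cross attention: queries read from the image features."""
    q, k, v = matmul(queries, wq), matmul(features, wk), matmul(features, wv)
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, [list(col) for col in zip(*k)])  # q @ k^T
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)

# 1) Projector: image features become extra tokens in the input sequence.
image_tokens = project(
    image_features, rand_matrix(D_VISION, D_HIDDEN), rand_matrix(D_HIDDEN, D_MODEL)
)
projector_input = image_tokens + language_tokens  # list concatenation: 4 + 3 tokens

# 2) Cross attention: the sequence length is unchanged; each token receives
#    image information through an attention read plus a residual connection.
attended = cross_attend(
    language_tokens,
    image_features,
    rand_matrix(D_MODEL, D_MODEL),
    rand_matrix(D_VISION, D_MODEL),
    rand_matrix(D_VISION, D_MODEL),
)
cross_attn_output = [
    [t + a for t, a in zip(tok, att)] for tok, att in zip(language_tokens, attended)
]

print(len(projector_input), len(cross_attn_output))  # 7 3
```

The printed lengths show the practical difference: with a projector the transformer sees a longer sequence (image tokens plus language tokens), while with cross attention the sequence keeps its original length and the image enters through additional attention layers.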
Alayrac, Jean-Baptiste, et al. "Flamingo: a visual language model for few-shot learning." NeurIPS 2022.
Li, Junnan, et al. "BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models." ICML 2023.
Wang, Jianfeng, et al. "GIT: A generative image-to-text transformer for vision and language." arXiv preprint arXiv:2205.14100 (2022).
Liu, Haotian, et al. "Visual instruction tuning." NeurIPS 2023.