Slide 18
References
MiniGPT-4 Zhu et al., “MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models”, arXiv, 2023
LLaVA Liu et al., “Visual Instruction Tuning”, arXiv, 2023
InstructBLIP Dai et al., “InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning”, arXiv, 2023
X-LLM Chen et al., “X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages”, arXiv, 2023
VisionLLM Wang et al., “VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks”, arXiv, 2023
MultiModal-GPT Gong et al., “MultiModal-GPT: A Vision and Language Model for Dialogue with Humans”, arXiv, 2023
ChatBridge Zhao et al., “ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst”, arXiv, 2023
LLaVA-Med Li et al., “LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day”, arXiv, 2023
M3IT Li et al., “M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning”, arXiv, 2023