MiniGPT-4: Zhu et al., “MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models”, arXiv, 2023
LLaVA: Liu et al., “Visual Instruction Tuning”, arXiv, 2023
InstructBLIP: Dai et al., “InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning”, arXiv, 2023
X-LLM: Chen et al., “X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages”, arXiv, 2023
VisionLLM: Wang et al., “VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks”, arXiv, 2023
MultiModal-GPT: Gong et al., “MultiModal-GPT: A Vision and Language Model for Dialogue with Humans”, arXiv, 2023
ChatBridge: Zhao et al., “ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst”, arXiv, 2023
LLaVA-Med: Li et al., “LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day”, arXiv, 2023
M3IT: Li et al., “M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning”, arXiv, 2023