Slide 50
References
[Wu+,2023] Wu, C. et al. Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models. arXiv, 2023.
[You+,2023] You, H. et al. IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models. arXiv, 2023.
[Oord+,2017] van den Oord, A. et al. Neural Discrete Representation Learning. NIPS, 2017.
[Ramesh+,2021] Ramesh, A. et al. Zero-Shot Text-to-Image Generation. arXiv, 2021.
[Mizrahi+,2023] Mizrahi, D. et al. 4M: Massively Multimodal Masked Modeling. NeurIPS, 2023.
[Ramesh+,2022] Ramesh, A., Dhariwal, P., Nichol, A., Chu, C. & Chen, M. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv, 2022.
[Cha+,2023] Cha, J., Kang, W., Mun, J. & Roh, B. Honeybee: Locality-enhanced Projector for Multimodal LLM. arXiv, 2023.
[LI+,2023] Li, J., Li, D., Savarese, S. & Hoi, S. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv, 2023.
[Dai+,2023] Dai, W. et al. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. arXiv, 2023.
[Li+,2023] Li, K. et al. VideoChat: Chat-Centric Video Understanding. arXiv, 2023.
[Zhu+,2023] Zhu, D., Chen, J., Shen, X., Li, X. & Elhoseiny, M. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv, 2023.
[Liu+,2023a] Liu, H., Li, C., Wu, Q. & Lee, Y. J. Visual Instruction Tuning. arXiv, 2023.
[Liu+,2023b] Liu, H., Li, C., Li, Y. & Lee, Y. J. Improved Baselines with Visual Instruction Tuning. arXiv, 2023.
[Zhang+,2023] Zhang, H. et al. LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models. arXiv, 2023.
[Radford+,2021] Radford, A. et al. Learning Transferable Visual Models From Natural Language Supervision. ICML, Vol. 139, pp. 8748–8763, 2021.
[Maini+,2023] Maini, P., Goyal, S., Lipton, Z. C., Kolter, J. Z. & Raghunathan, A. T-MARS: Improving Visual Representations by Circumventing Text Feature Learning. arXiv, 2023.
[Shtedritski+,2023] Shtedritski, A., Rupprecht, C. & Vedaldi, A. What does CLIP know about a red circle? Visual prompt engineering for VLMs. ICCV, 2023.