(CVPR2024) Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Kazuya Nishimura

About today's paper: this is the great benchmark era
✓ LLM benchmarks ✓ Multimodal benchmarks
MMMU, ChartQA, MMLU, LiveCodeBench
Point 1. What should a benchmark take into account?

About today's paper: multimodal models combine a vision encoder with an LLM
[Figure: LLaVA and BLIP2, each pairing a vision encoder (CLIP, etc.) with an LLM]
Point 2. Is vision encoder training a solved problem?

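Paper overview: Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Question: Is the vision encoder good enough for language processing?
The authors build a benchmark from simple questions that MLLMs still get wrong.
[Figure: example with an incorrect response and a note explaining the response]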

Overview of how the benchmark is built

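Step 1. Finding CLIP-blind pairs: look at where CLIP and DINOv2 disagree
◦ CLIP: trained on text and images
◦ DINO: trained on images only
The two models are trained in somewhat different ways, so their embeddings differ → when they disagree, the images themselves should differ in some way?
Collect samples with (Sim DINO < 0.65) & (Sim CLIP > 0.95).
Looking at similarity gaps is reportedly effective for finding failures [Tong+, NeurIPS 2024].
Example: differences that a diffusion model fails to reflect were found through CLIP text encodings. Captions with different meanings but high similarity → a problem.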
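The selection rule above can be sketched in a few lines. This is a minimal illustration, assuming the image embeddings from both encoders have already been computed and stored as NumPy arrays; the function names and the brute-force all-pairs comparison are hypothetical choices, not taken from the slides.

```python
import numpy as np

def cosine_similarity_matrix(emb: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of emb (n_images x dim)."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return unit @ unit.T

def find_clip_blind_pairs(clip_emb, dino_emb, clip_min=0.95, dino_max=0.65):
    """Return index pairs (i, j), i < j, that CLIP sees as near-identical
    while DINO sees them as different (thresholds as quoted on the slide)."""
    sim_clip = cosine_similarity_matrix(np.asarray(clip_emb, dtype=float))
    sim_dino = cosine_similarity_matrix(np.asarray(dino_emb, dtype=float))
    mask = (sim_clip > clip_min) & (sim_dino < dino_max)
    # Keep only the strict upper triangle so each pair is reported once.
    i_idx, j_idx = np.where(np.triu(mask, k=1))
    return [(int(i), int(j)) for i, j in zip(i_idx, j_idx)]
```

Comparing all pairs costs O(n²) memory, so a real image collection would probably need a nearest-neighbour search instead; the sketch only shows the thresholding logic.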

Step 2. Spotting the difference between two images: manually annotate how the two images in each pair differ
Build the MMVP-VQA benchmark (Multimodal Visual Patterns Visual Question Answering)
◦ 300 questions on 150 pairs
◦ Focuses on visual differences that are likely to be overlooked

Step 3. Benchmarking MLLMs: check whether a model answers both images of a pair correctly
Compare SOTA MLLMs
◦ Open-source models ✓ LLaVA-1.5 ✓ InstructBLIP ✓ Mini-GPT4
◦ Closed-source models ✓ GPT-4V ✓ Gemini ✓ Bard
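The "both must be correct" scoring can be sketched as follows. This is an illustrative helper, not the authors' evaluation script; the input format (one tuple of two booleans per image pair) and the chance-level constant are assumptions.

```python
def pair_accuracy(results):
    """Fraction of pairs where BOTH questions were answered correctly.

    results: list of (first_correct, second_correct) booleans,
             one tuple per image pair (hypothetical input format).
    """
    if not results:
        raise ValueError("no pairs to score")
    solved = sum(1 for first, second in results if first and second)
    return solved / len(results)

# Assumption: each question has two answer options and guesses are
# independent, so a random guesser solves a pair with probability 0.5 * 0.5.
CHANCE_LEVEL = 0.5 * 0.5
```

Scoring per pair rather than per question is stricter: a model that always gives the same answer to both images of a pair gets half the questions right but solves no pairs.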

Benchmarking results: dataset examples

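Benchmarking results: overall trend
Trivially easy for humans. [Figure label: random guessing baseline]
Some models are supposedly trained on vision and language from scratch, yet their performance is low.
Many VLMs score below random...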

Benchmarking results: example predictions

A closer look at the trends: analyze the results split into 9 visual patterns

About the MMVP-VLM benchmark task: an answer counts as correct when the image-text correspondence is right
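One way to read this criterion is sketched below. It assumes each test item is a pair of images with one statement per image, and that a pair is solved when every image scores highest with its own statement; the exact matching rule is an assumption, since the slide only says the correspondence must be right.

```python
import numpy as np

def pair_matched(sim) -> bool:
    """sim[i, j] is the similarity between image i and statement j of one pair.

    Statement j is assumed to be the correct description of image j, so the
    pair is solved when each row's maximum lies on the diagonal.
    """
    sim = np.asarray(sim, dtype=float)
    if sim.shape != (2, 2):
        raise ValueError("expected a 2x2 similarity matrix")
    return bool(np.all(sim.argmax(axis=1) == np.arange(2)))
```

For example, `pair_matched([[0.9, 0.1], [0.2, 0.8]])` is true, while a model that prefers the first statement for both images fails the pair.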

MMVP-VLM benchmarking results: does training data scale -> performance? In most cases, data scale ≠ performance

Are the performance drops of CLIP and of MLLMs correlated? For CLIP-based MLLMs, the correlation is high
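A correlation like this could be measured with a Pearson coefficient over per-pattern scores. The sketch below assumes one score per visual pattern for CLIP and one for the MLLM; this setup is an assumption, since the slide does not state how the correlation was computed.

```python
import numpy as np

def pearson_r(x, y) -> float:
    """Pearson correlation between two equally long 1-D score vectors,
    e.g. per-pattern accuracy of CLIP vs. per-pattern accuracy of an MLLM."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or x.shape != y.shape or x.size < 2:
        raise ValueError("need two 1-D vectors of equal length >= 2")
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))
```

With only nine patterns the estimate is noisy, so a high coefficient suggests a shared weakness but does not prove one.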

Is the vision encoder to blame for the performance drop?
◦ Effect of introducing CLIP features + DINO features (trained on images only)
Features from a different encoder were also tried.

When the features are simply added together
◦ Performance when the vision features are summed
[Figure axes: performance on this benchmark; instruction-following performance]
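Summing the two feature sets can be sketched as a weighted sum of token features. The mixing weight `alpha` and the requirement that both encoders yield tensors of the same shape (for example after a projection layer) are assumptions made for illustration; the slide only says the features are added.

```python
import numpy as np

def additive_mix(clip_tokens, dino_tokens, alpha=0.5):
    """Weighted sum of two token-feature arrays of identical shape.

    alpha is the share given to the DINO features (hypothetical parameter):
    alpha=0 keeps only CLIP, alpha=1 keeps only DINO.
    """
    clip_tokens = np.asarray(clip_tokens, dtype=float)
    dino_tokens = np.asarray(dino_tokens, dtype=float)
    if clip_tokens.shape != dino_tokens.shape:
        raise ValueError("feature shapes must match before mixing")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * dino_tokens + (1.0 - alpha) * clip_tokens
```

Because both sources share one set of token slots, increasing the weight of one encoder necessarily reduces the weight of the other, which is consistent with the trade-off between the two axes on this slide.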

Integrating multiple encoders
◦ Performance when the features are combined more carefully
Performance improves while both capabilities are preserved.
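The slide does not spell out how the features are combined. One plausible scheme, assumed here purely for illustration, keeps both token sets intact and interleaves them so that tokens from the same position stay adjacent.

```python
import numpy as np

def interleave_tokens(clip_tokens, dino_tokens):
    """Interleave two (n_tokens, dim) arrays into one (2 * n_tokens, dim) array
    ordered as clip[0], dino[0], clip[1], dino[1], ...

    Hypothetical helper: assumes both encoders emit the same number of tokens
    with the same feature dimension.
    """
    clip_tokens = np.asarray(clip_tokens)
    dino_tokens = np.asarray(dino_tokens)
    if clip_tokens.ndim != 2 or clip_tokens.shape != dino_tokens.shape:
        raise ValueError("expected two 2-D arrays of identical shape")
    n_tokens, dim = clip_tokens.shape
    out = np.empty((2 * n_tokens, dim),
                   dtype=np.result_type(clip_tokens, dino_tokens))
    out[0::2] = clip_tokens  # even slots hold CLIP tokens
    out[1::2] = dino_tokens  # odd slots hold DINO tokens
    return out
```

Unlike a sum, this keeps each encoder's features unmodified at the cost of doubling the number of visual tokens passed to the LLM.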

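Summary
Goal: examine whether image representations are good enough for language processing.
Content: proposes the MMVP benchmark, which is hard to answer because of weaknesses in the image representation. Shows that the image representation is the cause and proposes combining multiple encoders.
Conclusion: problems remain. Vision-only training and vision-language training each have their strengths, and the weaknesses may be hard to fix by scaling alone?
Thoughts & discussion:
✓ When designing a benchmark -> an efficient way to find what models cannot do is important
✓ Only CLIP and DINOv2 are compared, but I would also like to see algorithmic differences: contrastive only, MAE, video... Could this be a chance to argue for an algorithm's advantage?
✓ For integrating multiple models, a mixture-of-experts-style integration seems promising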
