Trends in Large Language Models and Vision-Language Models in 2024
GPT-4o
• Multimodal LLM that handles language, images, audio, and video (2024/5/13)
• Scheduled to be integrated into Copilot on Windows (2024/5/20)
https://www.youtube.com/watch?v=DQacCB9tDaw
Sora (2024/2/15)
Prompt: "Reflections in the window of a train traveling through the Tokyo suburbs."
https://openai.com/sora
Examples that remain difficult even for the latest multimodal LLMs
Referring expression comprehension
The pillow on the couch closest to the plant in the living room.
Wall picture closest to the front door in the entryway.
Incorrect mask
Objects other than the target are also masked
Images were retrieved appropriately for complex referring expressions
Instruction: “Go to the bathroom with a picture of a wagon and bring me the towel directly across from the sink”
[Figure: retrieved images for each instruction, shown in rank order (Rank: 1 to Rank: 6, …)]
Instruction: “Go to the hallway on level 1 that is lined with wine bottles and pull out the high chair closest to the wine bottles at the second table from the door”
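
As a rough illustration of what a ranked retrieval result (Rank: 1, Rank: 2, ...) means, the minimal sketch below orders candidate images by cosine similarity to an instruction embedding. The slides do not state which retrieval method was used, so this is only a generic example: the function names `cosine` and `rank_candidates`, the image IDs, and the toy embedding vectors are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_candidates(query_vec, candidates, top_k=6):
    """Return up to top_k (rank, image_id, score) tuples, best match first."""
    scored = [(image_id, cosine(query_vec, vec)) for image_id, vec in candidates.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [
        (rank, image_id, score)
        for rank, (image_id, score) in enumerate(scored[:top_k], start=1)
    ]


# Made-up embeddings for illustration only (not from the slides).
query = [1.0, 0.0, 1.0]          # hypothetical embedding of the instruction
candidates = {                   # hypothetical embeddings of candidate images
    "img_a": [1.0, 0.0, 1.0],
    "img_b": [1.0, 1.0, 0.0],
    "img_c": [0.0, 1.0, 0.0],
}

ranking = rank_candidates(query, candidates)
for rank, image_id, score in ranking:
    print(f"Rank: {rank}  {image_id}  score={score:.3f}")
```

In a real system the embeddings would come from a vision-language model. The ranking step itself stays the same: score every candidate against the instruction, sort by score, and keep the top k.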