Slide 11
Slide 11 text
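■ In prior work [今井+, JSAI24], scene understanding was the bottleneck
→ Distilling an MLLM is expected to yield retrieval that reflects commonsense knowledge
■ Text in images (dense text) is useful for identifying objects and landmarks
(e.g., [Bu+, TMM23], [Sun+, ICRA24])
■ This work combines the two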
Proposed method: DSME (2/2)
Structured object representation based on a description of the whole image
Ground-Truth
Rank: 1
Prior work [今井+, JSAI24]
Failure case: "Please identify what looks like a schedule list attached to the wall at the back of the room."
Introducing dense text reduces hallucination and improves object understanding
w/o Dense text
The image shows a packaged product with an orange and white color scheme. ... There's also a graphic of a dog's face on the packaging, which could indicate that the product is related to dogs or pet care in some way.
w/ Dense text
The image shows a …. The text includes words such as "otafuku," ..., and "300g," suggesting that the contents of the package are 300 grams. Other words … imply that the product may be related to Japanese cuisine, possibly a type of sauce or condiment. The packaging design features an illustration of a face, ...
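☺ "Otafuku" (オタフク)
"Japanese food"
"face design"
"dog's face"
"pet supplies"
Misdetected text
Compensates for errors in the object representation
"オタコク" (Otakoku)
"お好み" (okonomi)
"ソース" (sauce)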
https://www.otafuku.co.jp/corporate/news/detail/?t_id=144
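
A minimal sketch of how detected text strings such as those above might be passed to an MLLM when generating the image description. This is an illustration under stated assumptions, not the method from the slide: the `DetectedText` container, the `build_description_prompt` function, the prompt wording, the confidence threshold, and the confidence values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DetectedText:
    """One OCR detection (hypothetical container, not from the slide)."""
    text: str
    confidence: float  # assumed to lie in [0, 1]


def build_description_prompt(detections, min_confidence=0.5):
    """Build an MLLM prompt that lists text strings detected in the image.

    The prompt wording and the threshold are assumptions for illustration.
    """
    # Keep only detections at or above the (assumed) confidence threshold.
    kept = [d.text for d in detections if d.confidence >= min_confidence]
    prompt = "Describe the object in the image."
    if kept:
        quoted = ", ".join(f'"{t}"' for t in kept)
        prompt += f" The following text was detected in the image: {quoted}."
        # Warn the model that OCR output can be wrong (e.g. a misread name).
        prompt += " The detected text may contain recognition errors."
    return prompt


# Strings taken from the slide; the confidence values are made up.
detections = [
    DetectedText("オタコク", 0.62),
    DetectedText("お好み", 0.91),
    DetectedText("ソース", 0.88),
]
prompt = build_description_prompt(detections)
print(prompt)
```

Telling the model that the detected text may contain errors leaves room for it to reconcile a misread string with the visual content instead of copying it verbatim.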