[RSJ24] Object Retrieval in Large-Scale Indoor Environments Using Dense Text with a Multi-Modal Large Language Model

Slide 1

Slide 1 text

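Object Retrieval in Large-Scale Indoor Environments Using Dense Text with a Multi-Modal Large Language Model
Keio University
Yuto Imai, Ryosuke Korekata, Komei Sugiura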

Slide 2

Slide 2 text

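Background: Being able to retrieve arbitrary objects in the real world would be useful
■ Being able to instruct mobile robots to transport supplies would be useful in warehouses, medical settings, and similar sites
Goal: retrieve objects in the environment from open-vocabulary language instructions
Use cases
■ Object manipulation by domestic service robots
■ Finding objects in an unfamiliar environment
o e.g., conference venues, shopping malls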

Slide 3

Slide 3 text

Background: Being able to retrieve arbitrary objects in the real world would be useful
Example instructions: "オタフク持ってきて" ("Bring me the Otafuku"), "お好み焼きのソース取ってきて" ("Get me the okonomiyaki sauce")
Dense Text = text information in the image (e.g., お好み "okonomi", ソース "sauce") + text prompt
☺ Useful for identifying objects
Image sources:
https://www.otafuku.co.jp/corporate/news/detail/?t_id=144
https://www.komi.co.jp/product/2833
https://www.bulldog.co.jp/products/home/item0103_500ml.html

Slide 4

Slide 4 text

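Problem setting: Learning-to-Rank Physical Objects (LTRPO) [Kaneda+, RA-L24]
Input
■ An instruction sentence containing a referring expression
■ A set of images
Output
■ Ranked bounding boxes of the target objects
Appropriate objects should appear near the top of the ranking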
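The input/output contract above can be sketched as a scoring-and-sorting step. This is a minimal illustration only: the cosine-similarity scorer, the two-dimensional toy vectors, and the names `cosine`, `rank_regions`, and the region ids are all assumptions made for the example, not the model described in these slides.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_regions(query_vec, regions):
    """Rank candidate regions for one instruction.

    regions: list of (region_id, feature_vector) pairs gathered from all images.
    Returns the region ids ordered from most to least similar to the query.
    """
    scored = [(cosine(query_vec, feat), rid) for rid, feat in regions]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [rid for _, rid in scored]

# Hypothetical toy features: three candidate boxes from two images.
regions = [
    ("img1_box0", [1.0, 0.0]),
    ("img1_box1", [0.6, 0.8]),
    ("img2_box0", [0.0, 1.0]),
]
ranking = rank_regions([0.0, 1.0], regions)
print(ranking)
```

In a learning-to-rank setting the quality of such a list is judged by where the ground-truth regions land, which is why rank-based metrics are used for evaluation.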

Slide 5

Slide 5 text

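Related work: Few existing retrieval methods take text in images into account
RREx-BoT [Sigurdsson+, IROS23]: handles VLN tasks that assume pre-exploration
:( Focuses only on the top-1 object
NLMap-SayCan [Chen+, ICRA23]: pre-exploration + execution of object manipulation tasks based on object retrieval
:( Assumes feature extraction based on point clouds
MultiRankIt [Kaneda+, RA-L24]: assumes a human-in-the-loop setting and retrieves the top 20
:( Targets low-publicness environments such as homes, so its applications are limited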

Slide 6

Slide 6 text

Proposed method: An object retrieval model for large-scale indoor spaces that leverages dense text

Slide 7

Slide 7 text

Proposed method: An object retrieval model for large-scale indoor spaces that leverages dense text
Dense Structural Multimodal Encoder (DSME): obtains structured image features by leveraging dense text

Slide 8

Slide 8 text

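Proposed method: An object retrieval model for large-scale indoor spaces that leverages dense text
Universal Query Encoder (UQE): robust to long queries thanks to a general-purpose sentence embedding model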

Slide 9

Slide 9 text

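Proposed method: An object retrieval model for large-scale indoor spaces that leverages dense text
Relaxing Contrastive Similarity (RCS): relaxes the contrast applied to negatives while balancing training efficiency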

Slide 10

Slide 10 text

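Proposed method: DSME (1/2): Continuous, structured visual feature extraction based on an MLLM + dense text
◼ Uses the final-layer latent representations that the MLLM produces when given dense text as image features
◼ Details are given in the Appendix
◼ Combines them with multi-granularity features based on CLIP [Radford+, ICML21] to obtain the image features
(Figure labels: pixel, position, image, object)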

Slide 11

Slide 11 text

Proposed method: DSME (2/2): Structured object representations based on a description of the whole image
■ In prior work [Imai+, JSAI24], scene understanding was the bottleneck → distilling an MLLM is expected to enable retrieval that reflects commonsense knowledge
■ Text information in images (dense text) is useful for identifying objects and landmarks (e.g., [Bu+, TMM23], [Sun+, ICRA24])
■ This work combines the two
Failure case of prior work [Imai+, JSAI24] (figure labels: Ground-Truth, Rank: 1): "Please identify what looks like a schedule list attached to the wall at the back of the room."
Introducing dense text reduces hallucination and improves object understanding
Detected text: "オタコク" (text misdetection), "お好み" ("okonomi"), "ソース" ("sauce")
w/o Dense text: The image shows a packaged product with an orange and white color scheme. ... There's also a graphic of a dog's face on the packaging, which could indicate that the product is related to dogs or pet care in some way.
:( "dog's face", "pet supplies"
w/ Dense text: The image shows a …. The text includes words such as "otafuku," ..., and "300g," suggesting that the contents of the package are 300 grams. Other words … imply that the product may be related to Japanese cuisine, possibly a type of sauce or condiment. The packaging design features an illustration of a face, ...
☺ "Otafuku", "Japanese cuisine", "face design"
Errors in the object representation are compensated for
Image source: https://www.otafuku.co.jp/corporate/news/detail/?t_id=144

Slide 12

Slide 12 text

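Experimental setup: Validated on three datasets, including newly constructed ones
① Extension of the RefText dataset [Bu+, TMM23] (new)
■ Consists of images containing dense text, taken from eight public benchmarks, plus sentences containing referring expressions
■ Collected by 131 annotators
② YAGAMI dataset [Imai+, JSAI24]
■ Images collected from an indoor space spanning 3,000 m²
③ Extension of the LTRRIE dataset [Kaneda+, RA-L24]
■ Retrieval across multiple environments, reproducing a large-scale space
Overview of the RefText-2.0 dataset
Vocabulary size           56,109
Average sentence length   15.92
Queries                   3,523
Object regions            3,523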

Slide 13

Slide 13 text

Quantitative results: The proposed method outperformed existing methods on all datasets
■ Metrics: Mean Reciprocal Rank (MRR), Recall@5 (R@5)
[%]                              RefText-2.0       YAGAMI            LTRRIE-2.0
Method                           MRR ↑   R@5 ↑     MRR ↑   R@5 ↑     MRR ↑   R@5 ↑
CLIP [Radford+, ICML21]          42.4    55.0      18.6    15.9      37.0    35.7
NLMap (rep.) [Chen+, ICRA23]     47.7    53.5      22.9    22.5      28.8    21.0
MultiRankIt [Kaneda+, RA-L24]    42.9    58.5      23.4    22.8      36.1    34.0
Proposed method                  61.5    77.0      28.3    24.2      39.3    37.5
Gain over MultiRankIt            +18.6   +18.5     +4.9    +1.4      +3.2    +3.5
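The gain annotations can be checked with a few lines of arithmetic. The sketch below copies the scores from the results table; the helper names `gains_over` and `best_baseline` are illustrative, and the comparison against the per-column best baseline is an extra check that is not shown on the slide.

```python
# Scores from the results table: (MRR, R@5) per dataset, in percent.
baselines = {
    "CLIP": {"RefText-2.0": (42.4, 55.0), "YAGAMI": (18.6, 15.9), "LTRRIE-2.0": (37.0, 35.7)},
    "NLMap (rep.)": {"RefText-2.0": (47.7, 53.5), "YAGAMI": (22.9, 22.5), "LTRRIE-2.0": (28.8, 21.0)},
    "MultiRankIt": {"RefText-2.0": (42.9, 58.5), "YAGAMI": (23.4, 22.8), "LTRRIE-2.0": (36.1, 34.0)},
}
proposed = {"RefText-2.0": (61.5, 77.0), "YAGAMI": (28.3, 24.2), "LTRRIE-2.0": (39.3, 37.5)}

def gains_over(reference):
    """Point difference (proposed - reference) per dataset, rounded to one decimal."""
    return {
        ds: tuple(round(p - r, 1) for p, r in zip(proposed[ds], reference[ds]))
        for ds in proposed
    }

def best_baseline():
    """Best baseline score per dataset and metric, taken column by column."""
    return {
        ds: tuple(max(scores[ds][i] for scores in baselines.values()) for i in range(2))
        for ds in proposed
    }

gain_vs_multirankit = gains_over(baselines["MultiRankIt"])
gain_vs_best = gains_over(best_baseline())
print(gain_vs_multirankit)
print(gain_vs_best)
```

The differences against MultiRankIt reproduce the six annotated values, and the differences against the strongest baseline in each column are also positive, which is consistent with the claim that the proposed method leads on every dataset and metric.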

Slide 14

Slide 14 text

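Qualitative results (success case): Objects matching the dense text are ranked at the top
Query: "Bring me a bottle of Turkish oregano with a white label that says Penzeys Spices."
(Figure panels: Proposed method and MultiRankIt [Kaneda+, RA-L24], each showing Rank: 1 and Rank: 2; Ground-Truth)
The bottle labeled 'Penzeys Spices' is ranked at the top
☺ Visual features based on dense text are effective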

Slide 15

Slide 15 text

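Qualitative results (failure case): Partially correct objects could not be distinguished
Query: "Pass me the forth bottle from right with the letters PAIN100%."
(Figure panels: Proposed method, Rank: 1 and Rank: 2; Ground-Truth)
The proposed method ranked 'bottle' objects at the top, but they do not satisfy the dense text or the referring expression
→ Prompts that include the positions of the dense text; optimization based on soft labels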

Slide 16

Slide 16 text

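Summary
An object retrieval engine for large-scale indoor environments is useful
Novelty
- DSME: continuous, structured visual feature extraction based on dense text
- UQE: a language feature extraction mechanism based on CLIP, PromCSE, and syntactic parsing
Results
- Outperformed existing methods on three datasets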

Slide 17

Slide 17 text

Appendix

Slide 18

Slide 18 text

Ablation study: All novel components contribute to the performance gain
■ Metrics: Mean Reciprocal Rank (MRR), Recall@5 (R@5)
[%]      Components enabled         RefText-2.0       YAGAMI            LTRRIE-2.0
Model    (Dense Text, PromCSE)      MRR ↑   R@5 ↑     MRR ↑   R@5 ↑     MRR ↑   R@5 ↑
(i)      ✓ (one of the two)         58.4    72.4      24.9    22.8      31.8    30.0
(ii)     ✓ (one of the two)         58.2    73.0      25.2    23.5      37.2    33.0
(iii)    ✓ ✓ (both)                 61.5    77.0      28.3    24.2      39.3    37.5

Slide 19

Slide 19 text

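Appendix: Qualitative results (success case): Complex modifier relationships about the object are captured accurately
Query: "Identify the black mechanical device that has been two white cables and two black cables plugged on the top shelf."
(Figure panels: Proposed method, Rank: 1; prior work [Imai+, JSAI24], Rank: 1; Ground-Truth)
The device with the white and black cables plugged in was retrieved
☺ The PromCSE-based features inside UQE are effective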

Slide 20

Slide 20 text

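Appendix: UQE: A feature extraction mechanism based on CLIP, PromCSE, and syntactic parsing
■ CLIP [Radford+, ICML21] has difficulty capturing relationships in long sentences [Zhang+, ECCV24]
→ Sentence embeddings based on PromCSE [Jiang+, EMNLP22] are applied in parallel
◼ Prior work (e.g., [Kaneda+, RA-L24]) relies on CLIP alone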

Slide 21

Slide 21 text

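Appendix: RCS (1/2): A loss design that relaxes the contrast applied to negatives, for large-scale spaces
■ The InfoNCE loss [Oord+, 18] pushes down the similarity of all negatives equally
:( Treats hard negatives (e.g., similar objects) and easy negatives alike
■ In a large-scale environment, the probability that objects similar to the positive are included becomes higher
■ The Relaxed Contrastive (ReCo) loss [Lin+, WACV23] mitigates this problem
:( Nothing is computed in the region where the similarity is negative
Example instruction: "Go to the yellow trash can in the center lined up near the windows."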

Slide 22

Slide 22 text

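Appendix: RCS (2/2): A loss design that relaxes the contrast applied to negatives, for large-scale spaces
■ The InfoNCE loss [Oord+, 18] pushes down the similarity of all negatives equally
o :( Treats hard negatives (e.g., similar objects) and easy negatives alike
■ In a large-scale environment, the probability that objects similar to the positive are included becomes higher
■ The ReCo (Relaxed Contrastive) loss [Lin+, WACV23] mitigates this problem
:( Nothing is computed in the region where the similarity is negative
(Figure from [Lin+, WACV23]) InfoNCE treats all negatives equally; ReCo ignores the negative-similarity region from the start
A loss that uses both together (Mixed Contrastive Loss) is introduced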
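The contrast between the two losses can be illustrated on a single query with one positive and several negatives. The sketch below is an assumption-laden illustration: the squared-hinge form of the relaxed term, the `weight` used to mix the two losses, and the temperature value are all chosen for the example, since the slides do not give the exact ReCo formulation or how the Mixed Contrastive Loss combines the terms.

```python
import math

def info_nce(sims, pos_index, temperature=0.07):
    """InfoNCE for one query: -log softmax of the positive over all candidates.

    sims: similarities between the query and each candidate.
    Every negative contributes to the denominator, whatever its similarity.
    """
    logits = [s / temperature for s in sims]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[pos_index]

def relaxed_contrastive(sims, pos_index):
    """Assumed relaxed term: pull the positive toward 1 and penalize only
    negatives whose similarity is positive; negative similarities are ignored."""
    pos_term = (1.0 - sims[pos_index]) ** 2
    neg_term = sum(max(0.0, s) ** 2 for i, s in enumerate(sims) if i != pos_index)
    return pos_term + neg_term

def mixed_contrastive(sims, pos_index, weight=0.5, temperature=0.07):
    """Hypothetical mix of the two losses; `weight` is an illustrative parameter."""
    return (weight * info_nce(sims, pos_index, temperature)
            + (1.0 - weight) * relaxed_contrastive(sims, pos_index))

# One positive (index 0), one similar-looking negative, one clearly different negative.
sims = [0.9, 0.5, -0.3]
print(info_nce(sims, 0), relaxed_contrastive(sims, 0), mixed_contrastive(sims, 0))
```

With these definitions, pushing an already dissimilar negative from -0.3 to -0.8 leaves the relaxed term unchanged but still lowers InfoNCE, which is the behavioral difference the two slides describe.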

Slide 23

Slide 23 text

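Appendix: GREP: Extracts image features useful for grounding at four granularities
• Obtains features useful for grounding from four perspectives: image, pixel, positional relationship, and object
• Obtains positional-relationship features from CLIP → uses intermediate-layer feature maps (2D)
• Extracts pixel-level features from CLIP, cf. SAN [Xu+, CVPR23]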

Slide 24

Slide 24 text

Appendix: Scene understanding through the expert knowledge added by dense text
■ In prior work [Imai+, JSAI24], scene understanding was the bottleneck → distilling an MLLM is expected to enable retrieval that reflects commonsense knowledge
■ Text information in images (dense text) is useful for identifying objects and landmarks (e.g., [Bu+, TMM23], [Sun+, ICRA24])
■ This work combines the two
Failure case of prior work [Imai+, JSAI24] (figure labels: Ground-Truth, Rank: 1): "Please identify what looks like a schedule list attached to the wall at the back of the room."
Adding dense text to the prompt reduces hallucination and adds expert knowledge
Detected text: "HIVIRAL", "DUVIRAL", "LAMIVUDINE"
w/o Dense text: The image shows a collection of medication bottles. Starting from the upper left and moving clockwise, the first bottle is … The seventh bottle is labeled 'Kumadex' and ...
w/ Dense text: The image shows a collection of medication bottles, ... The medications include 'Duviral Lamivudine Zidovudine,' 'Reviral,' and 'Hiviral Lamivudine.' ... The bottles appear to be pharmaceutical products, possibly used for treating HIV/AIDS or related conditions.

Slide 25

Slide 25 text

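Appendix: Evaluation metrics
■ Mean Reciprocal Rank
■ Recall@K
Quantities used in the definitions: the number of queries; the rank of the highest-ranked ground-truth object; the set of objects; the set of the top-K retrieved objects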
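Both metrics can be sketched from their standard definitions. The code below assumes the usual forms: MRR averages the reciprocal of the best ground-truth rank over the queries, and Recall@K is the fraction of ground-truth objects that appear among the top-K results. Reading "the set of objects" as the ground-truth set is an assumption, and the function names are illustrative.

```python
def mean_reciprocal_rank(best_ranks):
    """best_ranks: for each query, the 1-based rank of its highest-ranked ground-truth object."""
    return sum(1.0 / r for r in best_ranks) / len(best_ranks)

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of ground-truth objects found among the top-k retrieved objects for one query."""
    top_k = set(ranked_ids[:k])
    relevant = set(relevant_ids)
    return len(top_k & relevant) / len(relevant)

# Three queries whose best ground-truth objects were ranked 1st, 2nd, and 4th.
mrr = mean_reciprocal_rank([1, 2, 4])

# One query with two ground-truth objects, only one of which is in the top 2.
recall = recall_at_k(["a", "b", "c", "d"], ["b", "d"], k=2)
print(mrr, recall)
```

To report a dataset-level Recall@K, the per-query values would typically be averaged over all queries.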

Slide 26

Slide 26 text

Appendix: Details of the prompt used
■ The following prompt was given to LLaVA-NeXT [Liu+, 24]:
“Describe the objects in this image in detail, including its positional relationship to surrounding objects. Detected OCR results are following: {ocr_texts}”
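Filling the `{ocr_texts}` placeholder can be sketched as plain string formatting. The template text is taken from the slide; joining the detected strings with a comma and a space, and the names `PROMPT_TEMPLATE` and `build_prompt`, are assumptions, since the slide does not say how the OCR results are serialized.

```python
# Prompt text as given on the slide; {ocr_texts} is the only placeholder.
PROMPT_TEMPLATE = (
    "Describe the objects in this image in detail, including its positional "
    "relationship to surrounding objects. Detected OCR results are following: {ocr_texts}"
)

def build_prompt(ocr_texts):
    """Insert the detected strings into the template (comma-separated by assumption)."""
    return PROMPT_TEMPLATE.format(ocr_texts=", ".join(ocr_texts))

prompt = build_prompt(["otafuku", "300g"])
print(prompt)
```

The resulting string would be passed to the MLLM together with the image; that call is omitted here because it depends on the serving setup.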